Deploying Open-Source Models: Qwen2.5-7B-Instruct with vLLM for Accelerated Inference, Done Right - Gradio

1. Preface

Qwen has now been upgraded to version 2.5. Both the language models and the multimodal models are pretrained on large-scale multilingual and multimodal data, then post-trained on high-quality data to better align with human preferences.

Gradio is a powerful tool that greatly speeds up trying out and testing machine learning models. It provides a user-friendly interface that lets developers and users interact with a model intuitively. Whether for text generation, image recognition, or audio processing, users simply provide the relevant input and watch the model respond in real time; this immediate feedback makes model evaluation far more efficient. Gradio also supports many input and output types, so developers can test a model's behavior across a wide range of scenarios.

This article shows how to use Gradio to quickly try out Qwen2.5-7B-Instruct served with vLLM.


2. Terminology

2.1. vLLM

vLLM is an open-source inference-acceleration framework for large language models. By managing the cached attention tensors (the KV cache) efficiently with PagedAttention, it achieves 14-24x the throughput of HuggingFace Transformers.

2.2. Qwen2.5

The Qwen2.5 series models are pretrained on the latest large-scale dataset, comprising up to 18T tokens. Compared with Qwen2, Qwen2.5 has acquired significantly more knowledge (MMLU: 85+) and shows major improvements in coding ability (HumanEval 85+) and mathematics (MATH 80+).

In addition, the new models deliver notable gains in instruction following, long-text generation (over 8K tokens), understanding structured data (e.g., tables), and generating structured output, especially JSON. Qwen2.5 models are generally more robust to varied system prompts, improving role-play implementations and condition-setting for chatbots.

Like Qwen2, the Qwen2.5 language models support contexts of up to 128K tokens and can generate up to 8K tokens. They likewise retain support for 29+ languages, including Chinese, English, French, Spanish, Portuguese, German, Italian, Russian, Japanese, Korean, Vietnamese, Thai, and Arabic.

The domain-expert models, Qwen2.5-Coder for programming and Qwen2.5-Math for mathematics, are substantially improved over their predecessors, CodeQwen1.5 and Qwen2-Math. Specifically, Qwen2.5-Coder is trained on 5.5T tokens of code-related data, allowing even the smaller coding-focused models to be competitive with much larger language models on coding benchmarks. Meanwhile, Qwen2.5-Math supports both Chinese and English and integrates several reasoning methods, including CoT (Chain of Thought), PoT (Program of Thought), and TIR (Tool-Integrated Reasoning).

2.3. Qwen2.5-7B-Instruct

Qwen2.5-7B-Instruct is a language model from the Qwen (Tongyi Qianwen) team with 7 billion parameters, instruction-tuned to better understand and follow instructions. As part of the Qwen2.5 series, it is pretrained on 18T tokens and shows significantly improved performance, with strengths in language understanding, task adaptability, and multilingual support, plus solid long-text handling. It is suitable for a wide range of natural language processing tasks and delivers high-quality language services.

2.4. Gradio

Gradio is a Python library for building interactive interfaces. It makes it easy to prototype, build, and share machine learning models from Python.

Gradio's main job is to give a machine learning model an instant web interface, so users can interact with it, feed in data, and view results without writing any front-end code. It provides a simple API that binds inputs and outputs to a model's function or method and generates the user interface automatically.
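As a minimal sketch (the echo function and its wiring are hypothetical, not part of this article's demo), wrapping a plain Python function into a web UI looks like this:

# -*- coding: utf-8 -*-
import gradio as gr

def echo(text):
    # Any model call could go here; this demo simply echoes the input.
    return f"You said: {text}"

# Bind the function's input and output to auto-generated UI components.
demo = gr.Interface(fn=echo, inputs="text", outputs="text")
demo.launch()  # serves a local web UI, by default at http://127.0.0.1:7860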


3. Prerequisites

3.1. Environment and Prerequisites

1) Operating system: CentOS 7

2) GPU: Tesla V100-SXM2-32GB, CUDA Version: 12.2

3) Create a virtual environment and install the dependencies:

conda create --name test python=3.10
conda activate test
pip install gradio openai

Check the installed gradio and openai versions:
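A quick way to print them from Python (a minimal sketch; the exact version numbers will depend on when you run pip install):

import gradio
import openai

# Print the installed package versions.
print('gradio:', gradio.__version__)
print('openai:', openai.__version__)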

4) Deploy the Qwen2.5-7B-Instruct model with Docker

For the deployment steps, see the companion article 开源模型应用落地-Qwen2.5-7B-Instruct与vllm实现推理加速的正确姿势-Docker(二): https://charles.blog.csdn.net/article/details/142727087

Startup output:

(vllm) [root@gpu test]# docker run --runtime nvidia --gpus "device=0" \
>     -p 9000:9000 \
>     --ipc=host \
> -v /data/model/qwen2.5-7b-instruct:/qwen2.5-7b-instruct \
> -it --rm \
>     vllm/vllm-openai:latest \
>     --model /qwen2.5-7b-instruct --dtype float16 --max-parallel-loading-workers 1  --max-model-len 10240 --enforce-eager --host 0.0.0.0 --port 9000 --enable-auto-tool-choice --tool-call-parser hermes

INFO 10-17 01:17:57 api_server.py:526] vLLM API server version 0.6.1.dev238+ge2c6e0a82
INFO 10-17 01:17:57 api_server.py:527] args: Namespace(host='0.0.0.0', port=9000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=True, tool_call_parser='hermes', model='/qwen2.5-7b-instruct', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', config_format='auto', dtype='float16', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=10240, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=1, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=True, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=False, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, override_neuron_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False)
INFO 10-17 01:17:57 api_server.py:164] Multiprocessing frontend to use ipc:///tmp/f8b05bc2-c5c9-4dda-8856-470440465a3d for IPC Path.
INFO 10-17 01:17:57 api_server.py:177] Started engine process with PID 22
WARNING 10-17 01:17:57 config.py:1656] Casting torch.bfloat16 to torch.float16.
WARNING 10-17 01:17:57 config.py:389] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
WARNING 10-17 01:18:02 config.py:1656] Casting torch.bfloat16 to torch.float16.
WARNING 10-17 01:18:02 config.py:389] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
INFO 10-17 01:18:02 llm_engine.py:226] Initializing an LLM engine (v0.6.1.dev238+ge2c6e0a82) with config: model='/qwen2.5-7b-instruct', speculative_config=None, tokenizer='/qwen2.5-7b-instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=10240, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/qwen2.5-7b-instruct, use_v2_block_manager=False, num_scheduler_steps=1, multi_step_stream_outputs=False, enable_prefix_caching=False, use_async_output_proc=False, use_cached_outputs=True, mm_processor_kwargs=None)
INFO 10-17 01:18:03 selector.py:217] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 10-17 01:18:03 selector.py:116] Using XFormers backend.
/usr/local/lib/python3.12/dist-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_fwd")
/usr/local/lib/python3.12/dist-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_bwd")
INFO 10-17 01:18:04 model_runner.py:1014] Starting to load model /qwen2.5-7b-instruct...
INFO 10-17 01:18:04 selector.py:217] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 10-17 01:18:04 selector.py:116] Using XFormers backend.
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:01<00:04,  1.49s/it]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:03<00:03,  1.55s/it]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:04<00:01,  1.55s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:06<00:00,  1.49s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:06<00:00,  1.51s/it]

INFO 10-17 01:18:10 model_runner.py:1025] Loading model weights took 14.2487 GB
INFO 10-17 01:18:12 gpu_executor.py:122] # GPU blocks: 13708, # CPU blocks: 4681
INFO 10-17 01:18:17 api_server.py:230] vLLM to use /tmp/tmp2em3j59_ as PROMETHEUS_MULTIPROC_DIR
INFO 10-17 01:18:17 serving_chat.py:77] "auto" tool choice has been enabled please note that while the parallel_tool_calls client option is preset for compatibility reasons, it will be ignored.
WARNING 10-17 01:18:17 serving_embedding.py:189] embedding_mode is False. Embedding API will not work.
INFO 10-17 01:18:17 launcher.py:19] Available routes are:
INFO 10-17 01:18:17 launcher.py:27] Route: /openapi.json, Methods: HEAD, GET
INFO 10-17 01:18:17 launcher.py:27] Route: /docs, Methods: HEAD, GET
INFO 10-17 01:18:17 launcher.py:27] Route: /docs/oauth2-redirect, Methods: HEAD, GET
INFO 10-17 01:18:17 launcher.py:27] Route: /redoc, Methods: HEAD, GET
INFO 10-17 01:18:17 launcher.py:27] Route: /health, Methods: GET
INFO 10-17 01:18:17 launcher.py:27] Route: /tokenize, Methods: POST
INFO 10-17 01:18:17 launcher.py:27] Route: /detokenize, Methods: POST
INFO 10-17 01:18:17 launcher.py:27] Route: /v1/models, Methods: GET
INFO 10-17 01:18:17 launcher.py:27] Route: /version, Methods: GET
INFO 10-17 01:18:17 launcher.py:27] Route: /v1/chat/completions, Methods: POST
INFO 10-17 01:18:17 launcher.py:27] Route: /v1/completions, Methods: POST
INFO 10-17 01:18:17 launcher.py:27] Route: /v1/embeddings, Methods: POST
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:9000 (Press CTRL+C to quit)
INFO 10-17 01:18:27 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 10-17 01:18:37 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
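Once Uvicorn reports that it is running, it is worth verifying connectivity before wiring up Gradio. A minimal sketch (assuming the server is reachable at localhost:9000, as configured above):

# -*- coding: utf-8 -*-
from openai import OpenAI

# vLLM exposes an OpenAI-compatible API; the key is not validated by default.
client = OpenAI(api_key="EMPTY", base_url="http://localhost:9000/v1")

# List the models the server serves; /qwen2.5-7b-instruct should appear.
for model in client.models.list():
    print(model.id)

# Send one non-streaming request as a smoke test.
resp = client.chat.completions.create(
    model="/qwen2.5-7b-instruct",
    messages=[{"role": "user", "content": "你好"}],
    max_tokens=64,
)
print(resp.choices[0].message.content)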

4. Implementation

4.1. Code

# -*- coding: utf-8 -*-

import gradio as gr
from openai import OpenAI

# Address and port for the Gradio web UI.
host = '0.0.0.0'
port = 7860

# vLLM's OpenAI-compatible endpoint started in section 3.
api_url = 'http://localhost:9000/v1'
model_path = '/qwen2.5-7b-instruct'
temperature = 0.45
top_p = 0.9
max_tokens = 8192
# Optional comma-separated extra stop token ids, e.g. '151645'; empty means none.
stop_token_ids = ''

openai_api_key = "EMPTY"  # vLLM does not validate the key by default
openai_api_base = api_url


def predict(message, history):
    # Rebuild the conversation in OpenAI chat format, starting from a system prompt.
    history_openai_format = [{
        "role": "system",
        "content": "You are a great AI assistant."
    }]
    for human, assistant in history:
        history_openai_format.append({"role": "user", "content": human})
        history_openai_format.append({
            "role": "assistant",
            "content": assistant
        })
    history_openai_format.append({"role": "user", "content": message})

    # Stream tokens from the vLLM server as they are generated.
    stream = client.chat.completions.create(
        model=model_path,
        messages=history_openai_format,
        temperature=temperature,
        top_p=top_p,
        max_tokens=max_tokens,
        stream=True,
        extra_body={
            # vLLM-specific sampling parameters passed through the OpenAI client.
            'repetition_penalty': 1,
            'stop_token_ids': [
                int(id.strip()) for id in stop_token_ids.split(',')
                if id.strip()
            ] if stop_token_ids else []
        })

    # Yield the accumulated text so Gradio renders the reply incrementally.
    partial_message = ""
    for chunk in stream:
        partial_message += (chunk.choices[0].delta.content or "")
        yield partial_message


if __name__ == '__main__':
    client = OpenAI(
        api_key=openai_api_key,
        base_url=openai_api_base,
    )

    gr.ChatInterface(predict).queue().launch(server_name=host, server_port=port, share=False)
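A note on the design: extra_body is the openai client's passthrough for fields outside the standard OpenAI API, and vLLM reads vendor-specific sampling parameters such as repetition_penalty and stop_token_ids from it. Because predict is a generator that yields the accumulated text, gr.ChatInterface streams the reply into the chat window as it arrives.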

4.2. Functional Testing

Open the IP address and port configured in the code in a browser.

Inference test: ask the model a question in the chat window, e.g. "广州有什么好玩的景点?".

vLLM log output:

INFO 10-20 23:18:24 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 10-20 23:18:34 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 10-20 23:18:44 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 10-20 23:18:54 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 10-20 23:19:04 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 10-20 23:19:14 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 10-20 23:19:24 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 10-20 23:19:30 logger.py:36] Received request chat-8282e2823afa4d1c81bc44a56b299fa2: prompt: '<|im_start|>system\nYou are a great ai assistant.<|im_end|>\n<|im_start|>user\n广州有什么好玩的景点?<|im_end|>\n<|im_start|>assistant\n', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.45, top_p=0.9, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=8192, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [151644, 8948, 198, 2610, 525, 264, 2244, 16391, 17847, 13, 151645, 198, 151644, 872, 198, 101980, 104139, 108257, 9370, 105869, 11319, 151645, 198, 151644, 77091, 198], lora_request: None, prompt_adapter_request: None.
INFO:     172.17.0.1:40858 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 10-20 23:19:30 engine.py:288] Added request chat-8282e2823afa4d1c81bc44a56b299fa2.
INFO 10-20 23:19:30 metrics.py:351] Avg prompt throughput: 3.9 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 10-20 23:19:35 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 44.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO 10-20 23:19:46 logger.py:36] Received request chat-5528c3aa4fa54c53aeef76b266d2d476: prompt: '<|im_start|>system\nYou are a great ai assistant.<|im_end|>\n<|im_start|>user\n广州有什么好玩的景点?<|im_end|>\n<|im_start|>assistant\n广州是一座历史悠久、文化丰富的城市,拥有许多值得一游的景点。以下是一些广州著名的景点:\n\n1. 白云山:位于广州市区北部,是广州市民休闲娱乐的好去处,山顶可以俯瞰广州全景。\n\n2. 越秀公园:位于广州市中心,是广州市民休闲的好地方,公园内有五羊雕像、镇海楼等著名景点。\n\n3. 广州塔(小蛮腰):是广州的地标性建筑之一,塔身高454米,可以乘坐电梯到达观景台,欣赏广州的美景。\n\n4. 陈家祠:是一座具有岭南特色的传统建筑群,展示了广东地区传统文化和建筑艺术的魅力。\n\n5. 番禺长隆旅游度假区:拥有大型主题公园、海洋世界、野生动物园等,适合家庭游玩。\n\n6. 五仙观:是广州著名的道教寺庙之一,可以了解道教文化和历史。\n\n7. 广州博物馆:展示了广州的历史文化、民俗风情等,是了解广州历史文化的好地方。\n\n8. 海心沙:位于珠江新城,是一个集休闲、娱乐、文化于一体的大型公共空间,可以欣赏珠江美景。\n\n9. 广州动物园:拥有各种珍稀动物,是亲子游的好去处。\n\n10. 深洋古港遗址公园:展示了广州作为古代海上丝绸之路的重要港口的历史文化。\n\n以上只是广州众多景点中的一部分,广州还有许多其他值得游览的地方,可以根据个人兴趣进行选择。<|im_end|>\n<|im_start|>user\n白云山要门票吗?<|im_end|>\n<|im_start|>assistant\n', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.45, top_p=0.9, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=8192, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [151644, 8948, 198, 2610, 525, 264, 2244, 16391, 17847, 13, 151645, 198, 151644, 872, 198, 101980, 104139, 108257, 9370, 105869, 11319, 151645, 198, 151644, 77091, 198, 101980, 115164, 116498, 5373, 99348, 104653, 99490, 3837, 103926, 100694, 108137, 82894, 9370, 105869, 1773, 87752, 99639, 97084, 101980, 105891, 105869, 48443, 16, 13, 68294, 121, 99718, 57811, 5122, 103987, 106805, 23836, 106758, 3837, 20412, 106805, 69721, 104443, 100415, 102513, 85336, 44290, 3837, 113096, 73670, 110607, 121751, 101980, 110535, 3407, 17, 13, 8908, 114, 232, 100395, 102077, 5122, 103987, 106805, 99488, 3837, 20412, 106805, 69721, 104443, 102513, 100371, 3837, 102077, 31843, 18830, 75108, 101187, 118211, 5373, 99523, 55135, 99432, 49567, 102280, 105869, 3407, 18, 13, 74577, 123, 54039, 101105, 9909, 30709, 106156, 102113, 7552, 5122, 20412, 101980, 9370, 112765, 33071, 99893, 100653, 3837, 101105, 107958, 19, 20, 19, 72261, 3837, 73670, 106825, 105038, 104658, 99237, 85254, 53938, 3837, 105012, 101980, 9370, 108559, 3407, 19, 13, 220, 100348, 45629, 111082, 5122, 115164, 100629, 116488, 106498, 100169, 99893, 99430, 3837, 108869, 102053, 100361, 106361, 33108, 99893, 100377, 108847, 3407, 20, 13, 10236, 243, 103, 120106, 45861, 100767, 116616, 23836, 5122, 103926, 101951, 100220, 102077, 5373, 104419, 99489, 5373, 112776, 99354, 49567, 3837, 100231, 101064, 109280, 3407, 21, 13, 220, 75108, 100717, 99237, 5122, 20412, 101980, 105891, 115721, 114308, 100653, 3837, 73670, 99794, 115721, 108444, 100022, 3407, 22, 13, 74577, 123, 54039, 104646, 5122, 108869, 101980, 104754, 99348, 5373, 109493, 107259, 49567, 3837, 20412, 99794, 101980, 110142, 102513, 100371, 3407, 23, 13, 98313, 115, 63109, 99617, 5122, 103987, 115094, 105856, 3837, 101909, 42067, 104443, 5373, 100415, 5373, 99348, 110128, 101951, 100070, 101054, 3837, 73670, 105012, 115094, 108559, 3407, 24, 13, 74577, 123, 54039, 117431, 5122, 103926, 100646, 100861, 101474, 101239, 3837, 20412, 107963, 82894, 102513, 85336, 44290, 3407, 16, 15, 13, 6567, 115, 109, 99840, 99470, 99734, 107512, 102077, 5122, 108869, 101980, 100622, 102640, 106150, 111643, 101945, 106165, 104754, 99348, 3407, 70589, 100009, 101980, 104087, 105869, 15946, 106979, 3837, 
101980, 100626, 100694, 92894, 100760, 107871, 103958, 3837, 112184, 99605, 100565, 71817, 50404, 1773, 151645, 198, 151644, 872, 198, 107965, 57811, 30534, 107250, 101037, 11319, 151645, 198, 151644, 77091, 198], lora_request: None, prompt_adapter_request: None.
INFO:     172.17.0.1:40862 - "POST /v1/chat/completions HTTP/1.1" 200 OK

5. Additional Notes

5.1. The Gradio UI Cannot Be Opened

1. The service must not listen on 127.0.0.1; use 0.0.0.0 so the UI is reachable from other machines.

2. Check the server's security policy or firewall configuration.

On the server: run lsof -i:7860 to confirm the port is listening.

On the client: run telnet ip 7860 to confirm the port is reachable.
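If telnet is not available on the client, the same reachability check can be done from Python (a sketch; the address below is a hypothetical placeholder for your server's IP and port):

import socket

# Attempt a TCP connection to the Gradio server with a 3-second timeout.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    s.settimeout(3)
    result = s.connect_ex(('192.168.1.100', 7860))  # hypothetical server address
print('reachable' if result == 0 else 'unreachable')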

5.2. Adding Authentication to Gradio

Add auth=("zhangsan", "123456") to the launch() call in the example code above:

gr.ChatInterface(predict).queue().launch(server_name=host, server_port=port, auth=("zhangsan", "123456"), share=False)
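If a fixed username/password pair is not enough, launch() also accepts a callable for auth; it receives the username and password and returns True for a valid login. A sketch (check_login and its hard-coded credentials are hypothetical placeholders):

def check_login(username, password):
    # Replace with a real credential lookup, e.g. a database or config file.
    return username == "zhangsan" and password == "123456"

gr.ChatInterface(predict).queue().launch(server_name=host, server_port=port, auth=check_login, share=False)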
