Model
Qwen2.5-14B-Instruct-AWQ is the 4-bit AWQ-quantized version of the 14B (14-billion-parameter) instruction-tuned large language model from Alibaba Cloud's Qwen2.5 series. It is optimized for efficient inference deployment: while retaining strong Chinese and English understanding and generation, it greatly reduces VRAM usage and compute cost, making it practical for high-throughput inference on a single GPU (e.g. a 4090/5090). It is widely used for dialogue, code generation, information extraction, and agent applications.
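To see why the 4-bit quantization matters on a 24 GB card, here is a back-of-the-envelope estimate of weight memory alone (a sketch using the nominal 14e9 parameter count; KV cache, activations, and framework overhead are ignored):

```python
# Rough VRAM needed just to hold the model weights at a given precision.
def weight_vram_gb(n_params: float, bits_per_param: float) -> float:
    """Weight memory in GiB for n_params parameters at bits_per_param."""
    return n_params * bits_per_param / 8 / 1024**3

fp16 = weight_vram_gb(14e9, 16)  # ~26.1 GiB: does not fit a 24 GB 4090
awq4 = weight_vram_gb(14e9, 4)   # ~6.5 GiB: fits with room for KV cache
print(f"fp16: {fp16:.1f} GiB, AWQ 4-bit: {awq4:.1f} GiB")
```

This is why the unquantized 14B model cannot run on a 4090, while the AWQ version leaves most of the 24 GB free for the KV cache.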
Prerequisites
Rent a 4090 GPU (e.g. at https://ppio.com), CUDA 12.8.1.

Install conda
Download:
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
Install:
bash Miniconda3-latest-Linux-x86_64.sh
Notes during installation:
Press Enter through the prompts
When you see "Do you accept the license?", type yes
The default install path is fine (or /root/miniconda3)
After installation, run:
source ~/.bashrc
Set up the virtual environment
Create the conda environment file vllm064.yml:
name: vllm064
channels:
- defaults
dependencies:
- _libgcc_mutex=0.1=main
- _openmp_mutex=5.1=1_gnu
- bzip2=1.0.8=h5eee18b_6
- ca-certificates=2026.3.19=h06a4308_0
- ld_impl_linux-64=2.44=h9e0c5a2_3
- libffi=3.3=he6710b0_2
- libgcc=15.2.0=h69a1729_7
- libgcc-ng=15.2.0=h166f726_7
- libgomp=15.2.0=h4751f2c_7
- libstdcxx=15.2.0=h39759b7_7
- libstdcxx-ng=15.2.0=hc03a8fd_7
- libuuid=1.41.5=h5eee18b_0
- libxcb=1.17.0=h9b100fa_0
- libzlib=1.3.1=hb25bd0a_0
- ncurses=6.5=h7934f7d_0
- openssl=1.1.1w=h7f8727e_0
- packaging=26.0=py310h06a4308_0
- pip=26.0.1=pyhc872135_1
- pthread-stubs=0.3=h0ce48e5_1
- python=3.10.0=h12debd9_5
- readline=8.3=hc2a1206_0
- setuptools=82.0.1=py310h06a4308_0
- sqlite=3.51.2=h3e8d24a_0
- tk=8.6.15=h54e0aa7_0
- wheel=0.46.3=py310h06a4308_0
- xorg-libx11=1.8.12=h9b100fa_1
- xorg-libxau=1.0.12=h9b100fa_0
- xorg-libxdmcp=1.1.5=h9b100fa_0
- xorg-xorgproto=2024.1=h5eee18b_1
- xz=5.8.2=h448239c_0
- zlib=1.3.1=hb25bd0a_0
- pip:
- aiohappyeyeballs==2.6.1
- aiohttp==3.13.5
- aiosignal==1.4.0
- annotated-doc==0.0.4
- annotated-types==0.7.0
- anyio==4.13.0
- async-timeout==5.0.1
- attrs==26.1.0
- certifi==2026.2.25
- charset-normalizer==3.4.7
- click==8.3.2
- cloudpickle==3.1.2
- compressed-tensors==0.8.0
- datasets==4.8.4
- dill==0.4.1
- diskcache==5.6.3
- distro==1.9.0
- einops==0.8.2
- exceptiongroup==1.3.1
- fastapi==0.135.3
- filelock==3.25.2
- frozenlist==1.8.0
- fsspec==2026.2.0
- gguf==0.10.0
- h11==0.16.0
- hf-xet==1.4.3
- httpcore==1.0.9
- httptools==0.7.1
- httpx==0.28.1
- huggingface-hub==0.36.2
- idna==3.11
- importlib-metadata==9.0.0
- interegular==0.3.3
- jinja2==3.1.6
- jiter==0.14.0
- jsonschema==4.26.0
- jsonschema-specifications==2025.9.1
- lark==1.3.1
- llvmlite==0.47.0
- lm-format-enforcer==0.10.6
- markdown-it-py==4.0.0
- markupsafe==3.0.3
- mdurl==0.1.2
- mistral-common==1.11.0
- mpmath==1.3.0
- msgpack==1.1.2
- msgspec==0.21.1
- multidict==6.7.1
- multiprocess==0.70.19
- nest-asyncio==1.6.0
- networkx==3.4.2
- numba==0.65.0
- numpy==1.26.4
- nvidia-cublas-cu12==12.4.5.8
- nvidia-cuda-cupti-cu12==12.4.127
- nvidia-cuda-nvrtc-cu12==12.4.127
- nvidia-cuda-runtime-cu12==12.4.127
- nvidia-cudnn-cu12==9.1.0.70
- nvidia-cufft-cu12==11.2.1.3
- nvidia-curand-cu12==10.3.5.147
- nvidia-cusolver-cu12==11.6.1.9
- nvidia-cusparse-cu12==12.3.1.170
- nvidia-ml-py==13.595.45
- nvidia-nccl-cu12==2.21.5
- nvidia-nvjitlink-cu12==12.4.127
- nvidia-nvtx-cu12==12.4.127
- openai==2.31.0
- opencv-python-headless==4.11.0.86
- outlines==0.0.46
- pandas==2.3.3
- partial-json-parser==0.2.1.1.post7
- pillow==12.2.0
- prometheus-client==0.25.0
- prometheus-fastapi-instrumentator==7.1.0
- propcache==0.4.1
- protobuf==7.34.1
- psutil==7.2.2
- py-cpuinfo==9.0.0
- pyairports==0.0.1
- pyarrow==23.0.1
- pycountry==26.2.16
- pydantic==2.13.0
- pydantic-core==2.46.0
- pydantic-extra-types==2.11.1
- pygments==2.20.0
- python-dateutil==2.9.0.post0
- python-dotenv==1.2.2
- pytz==2026.1.post1
- pyyaml==6.0.3
- pyzmq==27.1.0
- ray==2.54.1
- referencing==0.37.0
- regex==2026.4.4
- requests==2.33.1
- rich==15.0.0
- rpds-py==0.30.0
- safetensors==0.7.0
- sentencepiece==0.2.1
- shellingham==1.5.4
- six==1.17.0
- sniffio==1.3.1
- starlette==0.52.1
- sympy==1.13.1
- tiktoken==0.12.0
- tokenizers==0.20.3
- torch==2.5.1
- torchvision==0.20.1
- tqdm==4.67.3
- transformers==4.46.3
- triton==3.1.0
- typer==0.24.1
- typing-extensions==4.15.0
- typing-inspection==0.4.2
- tzdata==2026.1
- urllib3==2.6.3
- uvicorn==0.44.0
- uvloop==0.22.1
- vllm==0.6.4
- watchfiles==1.1.1
- websockets==16.0
- xformers==0.0.28.post3
- xxhash==3.6.0
- yarl==1.23.0
- zipp==3.23.1
prefix: /root/miniconda3/envs/vllm064
Install the dependencies:
conda env create -f vllm064.yml
Activate the environment:
conda activate vllm064
Download the model
In the directory where you keep models, run:
git lfs install ; git clone https://www.modelscope.cn/qwen/Qwen2.5-14B-Instruct-AWQ.git
PS: If the git command fails, install git-lfs first:
apt update && apt install -y git-lfs
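A common failure mode with git-lfs clones is ending up with small pointer stubs instead of the actual weight files. The helper below is an illustrative sanity check (not part of any official tool) that flags a missing config.json, missing weight files, or suspiciously tiny weights:

```python
from pathlib import Path

def check_model_dir(model_dir: str, min_weight_mb: int = 100) -> list[str]:
    """Return a list of problems found in a downloaded model directory."""
    problems = []
    root = Path(model_dir)
    if not (root / "config.json").exists():
        problems.append("missing config.json")
    weights = list(root.glob("*.safetensors")) + list(root.glob("*.bin"))
    if not weights:
        problems.append("no weight files found")
    elif all(f.stat().st_size < min_weight_mb * 1024**2 for f in weights):
        # git-lfs pointer files are only ~130 bytes each
        problems.append("weight files are tiny - git lfs pull may be needed")
    return problems

# Example: check_model_dir("/data/models/Qwen2.5-14B-Instruct-AWQ")
# An empty list means the directory looks complete.
```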
Verify:

Run the model
python -m vllm.entrypoints.openai.api_server \
--host 0.0.0.0 \
--port 6001 \
--api-key sk-123456 \
--model /data/models/Qwen2.5-14B-Instruct-AWQ \
--quantization awq_marlin \
--dtype half \
--kv-cache-dtype fp8 \
--max-num-seqs 16
Parameter explanation:
--host 0.0.0.0: listen on all network interfaces
--port 6001: port the service listens on
--api-key sk-123456: API key that clients must send as a Bearer token
--model: local path to the model weights
--quantization awq_marlin: use the Marlin AWQ kernels for faster 4-bit inference
--dtype half: run activations in FP16
--kv-cache-dtype fp8: store the KV cache in FP8, roughly halving its memory footprint
--max-num-seqs 16: cap the number of sequences processed concurrently at 16
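The --kv-cache-dtype fp8 flag roughly halves per-token KV-cache memory. A quick estimate of the effect, where the architecture numbers are assumptions based on Qwen2.5-14B's published config (48 layers, 8 KV heads via GQA, head dim 128; verify against the model's config.json):

```python
# Per-token KV cache size = 2 (key + value) * layers * kv_heads * head_dim * bytes.
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_val: int) -> int:
    return 2 * layers * kv_heads * head_dim * bytes_per_val

fp16_kib = kv_bytes_per_token(48, 8, 128, 2) / 1024  # 192 KiB per token
fp8_kib = kv_bytes_per_token(48, 8, 128, 1) / 1024   # 96 KiB per token
print(f"fp16 KV: {fp16_kib:.0f} KiB/token, fp8 KV: {fp8_kib:.0f} KiB/token")
```

Halving the per-token cost lets the same VRAM hold roughly twice the total context across concurrent sequences.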

Verification
The service started successfully:

Inference test
(base) root@3a78681d220345c1:/data/models/Qwen2.5-14B-Instruct-AWQ# curl http://localhost:6001/v1/chat/completions \
-H "Authorization: Bearer sk-123456" \
-H "Content-Type: application/json" \
-d '{
"model": "/data/models/Qwen2.5-14B-Instruct-AWQ",
"messages": [{"role": "user", "content": "你好,你的名字是什么"}]
}' | jq
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 665 100 526 100 139 1566 414 --:--:-- --:--:-- --:--:-- 1979
{
"id": "chatcmpl-81b96595f7384d23be69ab508cd94aaf",
"object": "chat.completion",
"created": 1776155169,
"model": "/data/models/Qwen2.5-14B-Instruct-AWQ",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "你好,我是Qwen,是由阿里云开发的语言模型。你可以叫我Qwen。很高兴能为你提供帮助!",
"tool_calls": []
},
"logprobs": null,
"finish_reason": "stop",
"stop_reason": null
}
],
"usage": {
"prompt_tokens": 34,
"total_tokens": 59,
"completion_tokens": 25,
"prompt_tokens_details": null
},
"prompt_logprobs": null
}
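The curl test above can also be reproduced from Python with only the standard library. This is a sketch against the server started earlier (host, port, key, and model path all come from the command above); the server exposes the standard OpenAI-compatible chat completions API:

```python
import json
from urllib import request

API_URL = "http://localhost:6001/v1/chat/completions"  # matches --port 6001
API_KEY = "sk-123456"                                  # matches --api-key

def build_chat_request(model: str, user_content: str) -> request.Request:
    """Build an OpenAI-compatible chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_content}],
    }
    return request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
    )

# Sending the request (requires the vLLM server from the previous step):
# req = build_chat_request("/data/models/Qwen2.5-14B-Instruct-AWQ", "你好,你的名字是什么")
# with request.urlopen(req) as resp:
#     print(json.loads(resp.read())["choices"][0]["message"]["content"])
```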