Model
Qwen2.5-14B-Instruct-AWQ is the 4-bit AWQ-quantized version of the 14B (14-billion-parameter) instruction-tuned large language model from Alibaba Cloud's Qwen2.5 series. It is optimized for efficient inference deployment: while retaining strong Chinese and English understanding and generation, it greatly reduces VRAM usage and compute cost, making it practical for high-throughput inference on a single GPU (e.g. a 4090/5090). It is widely used for dialogue, code generation, information extraction, and agent applications.
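To see why the 4-bit quantization matters on a 24 GB card, here is a back-of-the-envelope estimate of weight memory alone (a sketch using the nominal 14e9 parameter count; KV cache, activations, and framework overhead are ignored):

```python
# Rough VRAM needed just to hold the model weights at a given precision.
def weight_vram_gb(n_params: float, bits_per_param: float) -> float:
    """Weight memory in GiB for n_params parameters at bits_per_param."""
    return n_params * bits_per_param / 8 / 1024**3

fp16 = weight_vram_gb(14e9, 16)  # ~26.1 GiB: does not fit a 24 GB 4090
awq4 = weight_vram_gb(14e9, 4)   # ~6.5 GiB: fits with room for KV cache
print(f"fp16: {fp16:.1f} GiB, AWQ 4-bit: {awq4:.1f} GiB")
```

This is why the unquantized 14B model cannot run on a 4090, while the AWQ version leaves most of the 24 GB free for the KV cache.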
Prerequisites
Rent a 4090 GPU (e.g. at https://ppio.com), CUDA 12.8.1.

Install conda
Download:
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
Install:
bash Miniconda3-latest-Linux-x86_64.sh
Notes during installation:
Press Enter through the prompts
When you see "Do you accept the license?", type yes
The default install path is fine (or /root/miniconda3)
After installation, run:
source ~/.bashrc
Set up the virtual environment
Create the conda environment file vllm064.yml:
name: vllm064
channels:
- defaults
dependencies:
- _libgcc_mutex=0.1=main
- _openmp_mutex=5.1=1_gnu
- bzip2=1.0.8=h5eee18b_6
- ca-certificates=2026.3.19=h06a4308_0
- ld_impl_linux-64=2.44=h9e0c5a2_3
- libffi=3.3=he6710b0_2
- libgcc=15.2.0=h69a1729_7
- libgcc-ng=15.2.0=h166f726_7
- libgomp=15.2.0=h4751f2c_7
- libstdcxx=15.2.0=h39759b7_7
- libstdcxx-ng=15.2.0=hc03a8fd_7
- libuuid=1.41.5=h5eee18b_0
- libxcb=1.17.0=h9b100fa_0
- libzlib=1.3.1=hb25bd0a_0
- ncurses=6.5=h7934f7d_0
- openssl=1.1.1w=h7f8727e_0
- packaging=26.0=py310h06a4308_0
- pip=26.0.1=pyhc872135_1
- pthread-stubs=0.3=h0ce48e5_1
- python=3.10.0=h12debd9_5
- readline=8.3=hc2a1206_0
- setuptools=82.0.1=py310h06a4308_0
- sqlite=3.51.2=h3e8d24a_0
- tk=8.6.15=h54e0aa7_0
- wheel=0.46.3=py310h06a4308_0
- xorg-libx11=1.8.12=h9b100fa_1
- xorg-libxau=1.0.12=h9b100fa_0
- xorg-libxdmcp=1.1.5=h9b100fa_0
- xorg-xorgproto=2024.1=h5eee18b_1
- xz=5.8.2=h448239c_0
- zlib=1.3.1=hb25bd0a_0
- pip:
- aiohappyeyeballs==2.6.1
- aiohttp==3.13.5
- aiosignal==1.4.0
- annotated-doc==0.0.4
- annotated-types==0.7.0
- anyio==4.13.0
- async-timeout==5.0.1
- attrs==26.1.0
- certifi==2026.2.25
- charset-normalizer==3.4.7
- click==8.3.2
- cloudpickle==3.1.2
- compressed-tensors==0.8.0
- datasets==4.8.4
- dill==0.4.1
- diskcache==5.6.3
- distro==1.9.0
- einops==0.8.2
- exceptiongroup==1.3.1
- fastapi==0.135.3
- filelock==3.25.2
- frozenlist==1.8.0
- fsspec==2026.2.0
- gguf==0.10.0
- h11==0.16.0
- hf-xet==1.4.3
- httpcore==1.0.9
- httptools==0.7.1
- httpx==0.28.1
- huggingface-hub==0.36.2
- idna==3.11
- importlib-metadata==9.0.0
- interegular==0.3.3
- jinja2==3.1.6
- jiter==0.14.0
- jsonschema==4.26.0
- jsonschema-specifications==2025.9.1
- lark==1.3.1
- llvmlite==0.47.0
- lm-format-enforcer==0.10.6
- markdown-it-py==4.0.0
- markupsafe==3.0.3
- mdurl==0.1.2
- mistral-common==1.11.0
- mpmath==1.3.0
- msgpack==1.1.2
- msgspec==0.21.1
- multidict==6.7.1
- multiprocess==0.70.19
- nest-asyncio==1.6.0
- networkx==3.4.2
- numba==0.65.0
- numpy==1.26.4
- nvidia-cublas-cu12==12.4.5.8
- nvidia-cuda-cupti-cu12==12.4.127
- nvidia-cuda-nvrtc-cu12==12.4.127
- nvidia-cuda-runtime-cu12==12.4.127
- nvidia-cudnn-cu12==9.1.0.70
- nvidia-cufft-cu12==11.2.1.3
- nvidia-curand-cu12==10.3.5.147
- nvidia-cusolver-cu12==11.6.1.9
- nvidia-cusparse-cu12==12.3.1.170
- nvidia-ml-py==13.595.45
- nvidia-nccl-cu12==2.21.5
- nvidia-nvjitlink-cu12==12.4.127
- nvidia-nvtx-cu12==12.4.127
- openai==2.31.0
- opencv-python-headless==4.11.0.86
- outlines==0.0.46
- pandas==2.3.3
- partial-json-parser==0.2.1.1.post7
- pillow==12.2.0
- prometheus-client==0.25.0
- prometheus-fastapi-instrumentator==7.1.0
- propcache==0.4.1
- protobuf==7.34.1
- psutil==7.2.2
- py-cpuinfo==9.0.0
- pyairports==0.0.1
- pyarrow==23.0.1
- pycountry==26.2.16
- pydantic==2.13.0
- pydantic-core==2.46.0
- pydantic-extra-types==2.11.1
- pygments==2.20.0
- python-dateutil==2.9.0.post0
- python-dotenv==1.2.2
- pytz==2026.1.post1
- pyyaml==6.0.3
- pyzmq==27.1.0
- ray==2.54.1
- referencing==0.37.0
- regex==2026.4.4
- requests==2.33.1
- rich==15.0.0
- rpds-py==0.30.0
- safetensors==0.7.0
- sentencepiece==0.2.1
- shellingham==1.5.4
- six==1.17.0
- sniffio==1.3.1
- starlette==0.52.1
- sympy==1.13.1
- tiktoken==0.12.0
- tokenizers==0.20.3
- torch==2.5.1
- torchvision==0.20.1
- tqdm==4.67.3
- transformers==4.46.3
- triton==3.1.0
- typer==0.24.1
- typing-extensions==4.15.0
- typing-inspection==0.4.2
- tzdata==2026.1
- urllib3==2.6.3
- uvicorn==0.44.0
- uvloop==0.22.1
- vllm==0.6.4
- watchfiles==1.1.1
- websockets==16.0
- xformers==0.0.28.post3
- xxhash==3.6.0
- yarl==1.23.0
- zipp==3.23.1
prefix: /root/miniconda3/envs/vllm064
Install the dependencies:
conda env create -f vllm064.yml
Activate the environment:
conda activate vllm064
Download the model
In the directory where you keep models, run:
git lfs install ; git clone https://www.modelscope.cn/qwen/Qwen2.5-14B-Instruct-AWQ.git
PS: If the git command fails, install git-lfs first:
apt update && apt install -y git-lfs
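A common failure mode with git-lfs clones is ending up with small pointer stubs instead of the actual weight files. The helper below is an illustrative sanity check (not part of any official tool) that flags a missing config.json, missing weight files, or suspiciously tiny weights:

```python
from pathlib import Path

def check_model_dir(model_dir: str, min_weight_mb: int = 100) -> list[str]:
    """Return a list of problems found in a downloaded model directory."""
    problems = []
    root = Path(model_dir)
    if not (root / "config.json").exists():
        problems.append("missing config.json")
    weights = list(root.glob("*.safetensors")) + list(root.glob("*.bin"))
    if not weights:
        problems.append("no weight files found")
    elif all(f.stat().st_size < min_weight_mb * 1024**2 for f in weights):
        # git-lfs pointer files are only ~130 bytes each
        problems.append("weight files are tiny - git lfs pull may be needed")
    return problems

# Example: check_model_dir("/data/models/Qwen2.5-14B-Instruct-AWQ")
# An empty list means the directory looks complete.
```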
Verify:

Run the model
python -m vllm.entrypoints.openai.api_server \
--host 0.0.0.0 \
--port 6001 \
--api-key sk-123456 \
--model /data/models/Qwen2.5-14B-Instruct-AWQ \
--quantization awq_marlin \
--dtype half \
--kv-cache-dtype fp8 \
--max-num-seqs 16
Parameter explanation:
--host 0.0.0.0: listen on all network interfaces
--port 6001: port the service listens on
--api-key sk-123456: API key that clients must send as a Bearer token
--model: local path to the model weights
--quantization awq_marlin: use the Marlin AWQ kernels for faster 4-bit inference
--dtype half: run activations in FP16
--kv-cache-dtype fp8: store the KV cache in FP8, roughly halving its memory footprint
--max-num-seqs 16: cap the number of sequences processed concurrently at 16
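The --kv-cache-dtype fp8 flag roughly halves per-token KV-cache memory. A quick estimate of the effect, where the architecture numbers are assumptions based on Qwen2.5-14B's published config (48 layers, 8 KV heads via GQA, head dim 128; verify against the model's config.json):

```python
# Per-token KV cache size = 2 (key + value) * layers * kv_heads * head_dim * bytes.
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_val: int) -> int:
    return 2 * layers * kv_heads * head_dim * bytes_per_val

fp16_kib = kv_bytes_per_token(48, 8, 128, 2) / 1024  # 192 KiB per token
fp8_kib = kv_bytes_per_token(48, 8, 128, 1) / 1024   # 96 KiB per token
print(f"fp16 KV: {fp16_kib:.0f} KiB/token, fp8 KV: {fp8_kib:.0f} KiB/token")
```

Halving the per-token cost lets the same VRAM hold roughly twice the total context across concurrent sequences.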

Verification
The service started successfully:

Inference test
(base) root@3a78681d220345c1:/data/models/Qwen2.5-14B-Instruct-AWQ# curl http://localhost:6001/v1/chat/completions \
-H "Authorization: Bearer sk-123456" \
-H "Content-Type: application/json" \
-d '{
"model": "/data/models/Qwen2.5-14B-Instruct-AWQ",
"messages": [{"role": "user", "content": "你好,你的名字是什么"}]
}' | jq
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 665 100 526 100 139 1566 414 --:--:-- --:--:-- --:--:-- 1979
{
"id": "chatcmpl-81b96595f7384d23be69ab508cd94aaf",
"object": "chat.completion",
"created": 1776155169,
"model": "/data/models/Qwen2.5-14B-Instruct-AWQ",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "你好,我是Qwen,是由阿里云开发的语言模型。你可以叫我Qwen。很高兴能为你提供帮助!",
"tool_calls": []
},
"logprobs": null,
"finish_reason": "stop",
"stop_reason": null
}
],
"usage": {
"prompt_tokens": 34,
"total_tokens": 59,
"completion_tokens": 25,
"prompt_tokens_details": null
},
"prompt_logprobs": null
}
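The curl test above can also be reproduced from Python with only the standard library. This is a sketch against the server started earlier (host, port, key, and model path all come from the command above); the server exposes the standard OpenAI-compatible chat completions API:

```python
import json
from urllib import request

API_URL = "http://localhost:6001/v1/chat/completions"  # matches --port 6001
API_KEY = "sk-123456"                                  # matches --api-key

def build_chat_request(model: str, user_content: str) -> request.Request:
    """Build an OpenAI-compatible chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_content}],
    }
    return request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
    )

# Sending the request (requires the vLLM server from the previous step):
# req = build_chat_request("/data/models/Qwen2.5-14B-Instruct-AWQ", "你好,你的名字是什么")
# with request.urlopen(req) as resp:
#     print(json.loads(resp.read())["choices"][0]["message"]["content"])
```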