Models & inference · Python

vLLM

vllm-project/vllm

84k

High-throughput serving engine for LLMs.

A fast, memory-efficient inference and serving engine built on PagedAttention, and the production server behind many open-model deployments.
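To give a rough feel for the paging idea behind PagedAttention, here is a toy sketch in plain Python. It is not vLLM's code or API: `BlockAllocator`, `Sequence`, and `BLOCK_SIZE` are hypothetical names invented for illustration. The assumption is only that the KV cache is split into fixed-size blocks and each request keeps a table mapping its tokens to whichever blocks it was given, so memory is claimed one block at a time instead of being reserved up front for the longest possible output.

```python
from dataclasses import dataclass, field
from typing import List

# Tokens per KV-cache block. Hypothetical value, kept small for illustration.
BLOCK_SIZE = 4


class BlockAllocator:
    """Hands out fixed-size physical block ids from a shared free pool."""

    def __init__(self, num_blocks: int) -> None:
        self.free: List[int] = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("no free KV-cache blocks")
        return self.free.pop()

    def release(self, block_ids: List[int]) -> None:
        # Return a finished request's blocks to the pool for reuse.
        self.free.extend(block_ids)


@dataclass
class Sequence:
    """Tracks one request's token count and its list of physical blocks."""

    num_tokens: int = 0
    block_table: List[int] = field(default_factory=list)

    def append_token(self, allocator: BlockAllocator) -> None:
        # A new block is needed only when every block held so far is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(allocator.allocate())
        self.num_tokens += 1


allocator = BlockAllocator(num_blocks=8)
seq_a, seq_b = Sequence(), Sequence()

for _ in range(6):  # 6 tokens span 2 blocks
    seq_a.append_token(allocator)
for _ in range(3):  # 3 tokens fit in 1 block
    seq_b.append_token(allocator)

print(len(seq_a.block_table), len(seq_b.block_table), len(allocator.free))
```

In this toy version the two requests together hold three blocks and the other five stay free for new requests. The blocks a request holds need not be contiguous, which is the property the paging approach relies on.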