
vLLM

vLLM is an open-source, high-performance inference engine for large language models. It provides fast, memory-efficient, and scalable inference, making it well suited to serving production workloads with modern transformer-based models such as GPT, LLaMA, and Mistral.

Getting started

First, install vLLM:

code.txt
pip install vllm
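
To confirm the install worked, you can print the package version from the command line (a quick sanity check; assumes a vLLM release that exposes __version__):

code.txt
python -c "import vllm; print(vllm.__version__)"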

Then, run the following command to start the API server:

code.txt
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-chat-hf \
    --port 8000

This launches vLLM with the OpenAI API-compatible interface on port 8000 using the LLaMA 2 7B model from Hugging Face. You can test that the server is running by sending a request to the API:

code.py
import openai

openai.api_key = "EMPTY"  # No key required locally
openai.api_base = "http://localhost:8000/v1"

# Note: this uses the pre-1.0 openai SDK; the ChatCompletion interface
# was removed in openai>=1.0.
response = openai.ChatCompletion.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing in simple terms."},
    ],
    temperature=0.7,
    max_tokens=256,
    stream=True,  # Optional: enable streaming response
)

for chunk in response:
    print(chunk["choices"][0]["delta"].get("content", ""), end="")

This works just like OpenAI's ChatGPT API, so you can point an existing integration at a local vLLM server (or back at the public endpoints) by changing only the API base URL.
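
You can also check the server from the shell. Assuming the default host and port, listing the available models should return the model you launched with:

code.txt
curl http://localhost:8000/v1/models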

Usage

code.py
from vllm import LLM, SamplingParams
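
Beyond the OpenAI-compatible server, vLLM can be used as a library for offline batch inference through its LLM class. A minimal sketch (model name and sampling settings are illustrative):

code.py
from vllm import LLM, SamplingParams

# Load the model once; vLLM handles weight loading and KV-cache allocation.
llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")

# Illustrative sampling settings; tune temperature/max_tokens as needed.
sampling_params = SamplingParams(temperature=0.7, max_tokens=256)

# generate() takes a list of prompts and returns one RequestOutput per prompt.
outputs = llm.generate(
    ["Explain quantum computing in simple terms."],
    sampling_params,
)

for output in outputs:
    print(output.outputs[0].text)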

Performance

vLLM is designed to be highly efficient, with low latency and high throughput. It uses a variety of techniques to optimize performance, including:

  • PagedAttention: the attention KV cache is stored in fixed-size blocks and managed like virtual-memory pages, which cuts fragmentation and lets far more concurrent requests share GPU memory.
  • Continuous batching: incoming requests join the running batch as soon as slots free up, rather than waiting for a whole batch to finish, which keeps the GPU saturated and raises throughput.
  • Quantization: vLLM can serve quantized checkpoints (e.g., AWQ or GPTQ) to shrink the memory footprint and speed up inference; see the sketch after this list.
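
As a concrete example of the quantization option, the LLM constructor accepts a quantization argument. The checkpoint name below is illustrative and assumes an AWQ-quantized model is available on Hugging Face:

code.py
from vllm import LLM

# Load an AWQ-quantized checkpoint (illustrative model name; any
# AWQ-quantized repo works here) and tell vLLM which scheme it uses.
llm = LLM(model="TheBloke/Llama-2-7B-Chat-AWQ", quantization="awq")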