vLLM
OpenAI-compatible LLM inference
on IBM POWER. No GPU required.
Pre-built ppc64le packages. Skip the source build, skip the GPU bill — POWER10 and POWER11 do the work with MMA acceleration. Same OpenAI API your apps already speak.
The hardware is already there.
Stop renting GPUs to talk to it.
Your POWER systems already run mission-critical workloads. Adding LLM inference shouldn't mean a separate GPU farm, a new vendor relationship, and a compliance review. vLLM on ppc64le runs the same OpenAI-compatible API every framework speaks — on the CPUs you already paid for.
CPU-Only Inference
No GPU, no CUDA, no driver headaches. Run on POWER9, POWER10 and POWER11 with bfloat16 weights. POWER10+ uses MMA (Matrix Math Assist) for substantial speedups on 7B+ models.
OpenAI-Compatible API
Drop-in replacement for the OpenAI endpoint. Every SDK, framework and tool that speaks OpenAI works unchanged — including lpai, LangChain, LlamaIndex.
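To make that compatibility concrete, the sketch below builds the standard chat-completions request using nothing but the Python standard library. The host, port, and model name are assumptions borrowed from the quick start further down, not fixed values.

```python
import json
import urllib.request

def chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    # The same JSON body every OpenAI SDK sends to /v1/chat/completions
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

# Endpoint and model are placeholders; point them at your own vLLM server.
req = chat_request("http://localhost:8000", "Qwen/Qwen2.5-7B-Instruct", "Hello")
# urllib.request.urlopen(req) would return the standard OpenAI JSON response
```

Any OpenAI SDK works the same way: set its base URL to your vLLM server's address and the rest of your code is unchanged.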
Pre-Built. No Compiling.
Native .deb and .rpm packages for ppc64le. apt install python3-vllm and you're running. No source builds, no missing wheels.
Pick a model.
Match it to your workload.
Every model below has been tested with CPU inference on real POWER systems. RAM figures assume bfloat16 weights and a single server instance.
Fast classification, filtering, real-time log monitoring. Perfect for lpai classify and lpai watch.
Improved quality while staying fast. Sweet spot for routing, simple summaries, and structured extraction tasks.
Diagnosis, masking, error decoding, multi-step reasoning. The all-rounder for serious sysadmin work — and where MMA acceleration starts to shine.
Specialized for code-related tasks. RPG IV analysis, COBOL translation, refactoring suggestions, test case generation.
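If you want to size a tier yourself, the arithmetic behind the RAM figures is simple: bfloat16 stores two bytes per parameter, so the weights alone need roughly 2 GB per billion parameters, with the KV cache and runtime overhead on top. A minimal sketch (the model sizes below are illustrative, not a supported-model list):

```python
def bf16_weight_gb(params_billion: float) -> float:
    """Approximate weight footprint: 2 bytes per parameter in bfloat16."""
    return params_billion * 2  # 1e9 params * 2 bytes = 2 GB per billion

# Example sizes only; budget extra headroom for KV cache and Python runtime.
for size in (0.5, 1.5, 7, 14):
    print(f"{size:>4}B params -> ~{bf16_weight_gb(size):.0f} GB weights + KV cache")
```

A 7B model therefore needs about 14 GB for weights before the KV cache, which is why that tier is where MMA-equipped POWER10 systems with generous memory pay off.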
One server.
Every client speaks it.
One package. Pick your distro.
PyTorch dependency
vLLM requires PyTorch. On ppc64le, install the CPU-only build from the official PyTorch index:
pip3 install torch --extra-index-url https://download.pytorch.org/whl/cpu
Quick start
# Start vLLM with a small model
python3 -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-0.5B-Instruct \
  --device cpu --dtype bfloat16 --port 8000

# Query it (OpenAI-compatible)
curl -s localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"Qwen/Qwen2.5-0.5B-Instruct",
       "messages":[{"role":"user","content":"Hello"}]}'
Pair it with lpai.
vLLM is the recommended local backend for lpai — the AI-powered sysadmin toolkit for POWER. Together they let you classify logs, diagnose incidents, decode error codes and audit security — entirely on your own hardware, with zero data leaving the machine.
22 commands, 40 code translation pairs, 5 compliance frameworks. All powered by the model you choose, hosted by vLLM, running on POWER.
# Install both
sudo apt install python3-vllm lpai

# Start vLLM in background
python3 -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-7B-Instruct \
  --device cpu --dtype bfloat16 &

# Use lpai — all data stays on your machine
journalctl --since today | lpai classify
lpai decode "CPF4131"
cat report.txt | lpai mask > safe.txt

✓ Zero network, zero cloud, full POWER.
Need help sizing vLLM for your POWER fleet?
SIXE can help you pick models, tune memory and integrate with your existing stack.
Run LLMs
on POWER.
Subscribe for releases, model recommendations and POWER community news.