
Implementing AI Agents without GPUs: High-Performance Inference on Intel Xeon 6

Planning / Implementation

Published: 24 Mar 2026
Form Number: LP2406
PDF size: 9 pages, 266 KB

Abstract

Enterprises deploying AI agents and Retrieval-Augmented Generation (RAG) systems have traditionally assumed GPU infrastructure is a prerequisite for production-viable inference. This paper challenges that assumption. Using the Lenovo ThinkSystem SR650 V4 powered by dual Intel® Xeon® 6 processors, we demonstrate a fully functional, production-ready AI agent with RAG capabilities running entirely on CPU.

Intel’s Advanced Matrix Extensions (AMX), built into every core of the Xeon 6 platform, deliver hardware-level acceleration for BF16 and INT8 matrix operations, closing a substantial portion of the performance gap with GPU-based inference for smaller and mid-sized language models. Paired with the vLLM inference server and LangChain’s agent framework, the SR650 V4 provides a compelling, cost-effective platform for organizations seeking to adopt generative AI without the capital expenditure, power overhead, and supply chain constraints associated with GPU infrastructure.

This paper presents a basic implementation walkthrough and benchmarking results across representative models. The results confirm that CPU-based AI inference on Xeon 6 is a viable path to production for a meaningful class of enterprise AI workloads, and that a single server can host a small to medium number of AI agents.

Introduction

AI agents have emerged as the next evolution of large language models (LLMs), enabling models to decide on and issue function calls independently of user input. By augmenting the models with external knowledge retrieval in the form of Retrieval-Augmented Generation (RAG), these agents gain flexibility and are less prone to hallucination. RAG systems are increasingly used in applications such as enterprise search, customer support automation, technical documentation assistants, compliance analysis, and internal knowledge management. Traditionally, deploying performant RAG systems at scale has relied heavily on GPU-based infrastructure, introducing challenges related to cost, power consumption, resource availability, and operational complexity.

This paper demonstrates how efficient, production-ready RAG inference can be achieved without GPUs, using Intel Xeon processors. By leveraging Advanced Matrix Extensions (AMX), Intel Xeon processors deliver high-throughput, low-latency inference for both embedding generation and LLM-based text generation.

Through a practical reference architecture built with vLLM and LangChain, this paper illustrates how organizations can deploy scalable, cost-efficient RAG systems on Intel Xeon processors while maintaining strong performance and enterprise-grade reliability. Performance insights, architectural considerations, and deployment guidance are provided to demonstrate how CPU-optimized RAG inference reduces total cost of ownership, simplifies infrastructure, and enables broader adoption of generative AI across on-premises and hybrid environments, proving that high-quality RAG inference is no longer reliant on GPUs.

Environment Setup

This section walks through the environment configuration required to deploy the AI agent.

Create Virtual Environment

We recommend installing the uv package manager to create a virtual Python environment, which avoids dependency conflicts and speeds up installation of the required Python libraries.

Run the following in your Linux terminal:

curl -LsSf https://astral.sh/uv/install.sh | sh
uv venv agent-env
source agent-env/bin/activate

Setup Docker

Follow these steps to install Docker and build a CPU-optimized vLLM image.

  1. Install Docker by following the instructions at https://docs.docker.com/engine/install/ and verify that it works by running the hello-world image.
  2. Clone the vLLM GitHub repo.
    git clone https://github.com/vllm-project/vllm.git
    
  3. Build the CPU-optimized vLLM image, setting flags to take advantage of Intel’s AMX and AVX-512 acceleration. vLLM automatically uses oneDNN on the backend to optimize performance.
    cd vllm
    sudo docker buildx build --platform=linux/amd64 --build-arg VLLM_CPU_AMXBF16=1 --build-arg VLLM_CPU_AVX512=1 --build-arg VLLM_CPU_AVX512BF16=1 --build-arg VLLM_CPU_AVX512VNNI=1 --target vllm-openai -t vllm-cpu-amx -f docker/Dockerfile.cpu --load .
    
  4. Start the vLLM Docker container, which downloads the LLM from Hugging Face on first run. Add a Hugging Face token if your desired model is gated, as is the case for the Llama 3 models.
    docker run -it --name vllm-amx --rm -p 8000:8000 \
      -v ~/.cache/huggingface:/root/.cache/huggingface \
      -e HF_TOKEN="your_hf_token_here" \
      vllm-cpu-amx meta-llama/Llama-3.1-8B-Instruct \
      --enable-auto-tool-choice \
      --tool-call-parser llama3_json
    
  5. After allowing the container time to start up, check that it is running and available for inference calls.
    curl http://localhost:8000/v1/models
    

Agentic AI & RAG Setup

With the vLLM server running, the following steps build the full RAG-enabled agent pipeline using LangChain.

  1. Install the dependencies

    Install the necessary Python libraries.

    uv pip install -q langchain langchain-openai langchain-huggingface langchain-core langchain-community langchain-text-splitters beautifulsoup4 
    
  2. Import the packages

    Inside a Python script, import the needed packages.

    import bs4
    from langchain.agents import create_agent
    from langchain.tools import tool
    from langchain_community.document_loaders import WebBaseLoader
    from langchain_core.vectorstores import InMemoryVectorStore
    from langchain_huggingface import HuggingFaceEmbeddings
    from langchain_openai import ChatOpenAI
    from langchain_text_splitters import RecursiveCharacterTextSplitter
    
  3. Initialize the Embedding Model and Vector Store

    The embedding model converts text into 768-dimensional vectors used for semantic similarity search. It runs entirely on CPU alongside the LLM server, with minimal resource contention between the two processes, since embedding generation is fast and infrequent relative to LLM decoding.

    embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
    vector_store = InMemoryVectorStore(embeddings)
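
    To make the similarity search concrete, the sketch below shows cosine-similarity ranking, the metric typically used over such embeddings, in pure Python. The four-dimensional vectors and document names are toy stand-ins of our own for the 768-dimensional mpnet embeddings; with real embeddings only the dimensionality changes.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional "embeddings" standing in for the 768-dimensional
# vectors produced by all-mpnet-base-v2; the names are illustrative only.
query_vec = [0.9, 0.1, 0.0, 0.1]
doc_vecs = {
    "doc_about_agents":  [0.8, 0.2, 0.1, 0.0],
    "doc_about_cooking": [0.0, 0.1, 0.9, 0.4],
}

# The vector store returns documents ranked by similarity to the query.
ranked = sorted(doc_vecs,
                key=lambda d: cosine_similarity(query_vec, doc_vecs[d]),
                reverse=True)
```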
    
  4. Load and Chunk the Source Documents

    Source documents are loaded, parsed, and split into overlapping chunks before being embedded and stored. The chunk overlap preserves sentence-level context across split boundaries, improving retrieval quality for documents with dense information. For this example, we use the Beautiful Soup library to extract the main text of a simple blog post concerning LLMs.

    # Load a web page, extracting only the main post content
    bs4_strainer = bs4.SoupStrainer(class_=("post-title", "post-header", "post-content"))
    loader = WebBaseLoader(
        web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
        bs_kwargs={"parse_only": bs4_strainer},
    )
    docs = loader.load() 
    
    # Split into 1000 character chunks with 200 character overlap
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
        add_start_index=True,
    )
    all_splits = text_splitter.split_documents(docs)
     
    # Embed all chunks and index them in a vector store
    document_ids = vector_store.add_documents(documents=all_splits)
    

    In an enterprise deployment, the WebBaseLoader would be replaced with loaders suited to the document corpus: PyPDFLoader for internal PDFs, DirectoryLoader for a file system, or a custom loader targeting a CMS, SharePoint, or database.
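
    The splitting behavior can be illustrated with a simplified sketch. The split_with_overlap function below is a naive fixed-size character splitter of our own, not the LangChain implementation (which recursively splits on separator characters), but it shows how the chunk_size and chunk_overlap parameters interact:

```python
def split_with_overlap(text, chunk_size=1000, chunk_overlap=200):
    """Naive fixed-size splitter: each chunk starts chunk_size - chunk_overlap
    characters after the previous one, so consecutive chunks share
    chunk_overlap characters of context across the split boundary."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - chunk_overlap, 1), step)]

document = "x" * 2500  # stand-in for the loaded blog post text
chunks = split_with_overlap(document)
# 2500 characters yield chunks starting at offsets 0, 800, and 1600;
# the last 200 characters of one chunk reappear at the start of the next.
```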

  5. Configure the LLM Client

    The LangChain LLM client connects to the local vLLM server. Because vLLM exposes a standard OpenAI REST API, no custom integration code is needed and the vLLM server can be easily updated and scaled independently of the agent code.

    VLLM_BASE_URL = "http://localhost:8000/v1"
    MODEL_NAME = "meta-llama/Llama-3.1-8B-Instruct"
    
    llm = ChatOpenAI(
        openai_api_base=VLLM_BASE_URL,
        openai_api_key="EMPTY",  # vLLM doesn't require a real key
        model_name=MODEL_NAME,
        temperature=0.1,
        max_tokens=512,
    )
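
    Because the endpoint is OpenAI-compatible, the request that ChatOpenAI sends can be reproduced with nothing but the standard library. The sketch below builds (but deliberately does not send) the chat-completions request, so it runs without a live server; the URL and model name mirror the configuration above.

```python
import json
from urllib import request

# The same chat-completions payload ChatOpenAI sends under the hood.
payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello"}],
    "temperature": 0.1,
    "max_tokens": 512,
}

req = request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json",
             "Authorization": "Bearer EMPTY"},  # vLLM ignores the key
)
# request.urlopen(req) would return the completion once the server is up;
# it is not called here so the sketch runs without a running vLLM instance.
```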
    
  6. Define the Retrieval Tool

    The retrieval tool is the bridge between the agent’s reasoning loop and the vector store. When the LLM determines that external context is needed, it emits a structured tool call that LangChain executes automatically, returning the top matching chunks as context for the next generation step.

    @tool(response_format="content_and_artifact")
    def retrieve_context(query: str):
        """Retrieve information to help answer a query."""
        retrieved_docs = vector_store.similarity_search(query, k=2)
        serialized = "\n\n".join(
            (f"Source: {doc.metadata}\nContent: {doc.page_content}")
            for doc in retrieved_docs
        )
        return serialized, retrieved_docs
    

    The k=2 parameter returns the two most semantically similar chunks. Increasing k broadens coverage at the cost of a longer prompt and higher per-token latency. More sophisticated tools such as a database query function, calculator, or API client can be added to the tools list to extend the agent’s capabilities without modifying any other part of the pipeline.
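
    As an illustration of extending the tools list, below is a sketch of a hypothetical calculator tool. The calculate function and its AST-based evaluator are our own illustration, not part of LangChain; in the real pipeline it would be decorated with @tool and appended to the tools list alongside retrieve_context.

```python
import ast
import operator

# Map supported AST operator nodes to their arithmetic implementations.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

def calculate(expression: str) -> float:
    """Evaluate a basic arithmetic expression without using eval()."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError("unsupported expression")
    return walk(ast.parse(expression, mode="eval"))
```

    Restricting evaluation to a whitelist of AST nodes keeps the tool safe even though the LLM controls its input, which matters for any tool exposed to model-generated arguments.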

  7. Construct the Agent and Run Inference

    With the LLM client and retrieval tool defined, the agent is assembled with a single call. LangChain’s create_agent factory builds a ReAct-style agent (Reasoning + Acting) that manages the think-act-observe loop automatically.

    tools = [retrieve_context]
    prompt = (
        "You have access to a tool that retrieves context from a blog post. "
        "Use the tool to help answer user queries."
    )
    agent = create_agent(llm, tools, system_prompt=prompt)
     
    query = (
        "What is the standard method for Task Decomposition?\n\n"
        "Once you get the answer, look up common extensions of that method."
    )
     
    # stream() emits each reasoning step as it is produced,
    # enabling real-time display of tool calls and intermediate responses.
    for event in agent.stream(
        {"messages": [{"role": "user", "content": query}]},
        stream_mode="values",
    ):
        event["messages"][-1].pretty_print() 
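
    Under the hood, the think-act-observe loop that create_agent manages can be caricatured in a few lines of plain Python. Everything here (fake_llm, fake_retrieve, and the message format) is a stub of our own for illustration, not LangChain code:

```python
def fake_llm(messages):
    """Stub model: requests the retrieval tool once, then answers."""
    if not any(m["role"] == "tool" for m in messages):
        return {"tool_call": {"name": "retrieve_context",
                              "args": {"query": "task decomposition"}}}
    return {"content": "Answer based on retrieved context."}

def fake_retrieve(query):
    """Stub retrieval tool standing in for the vector-store search."""
    return f"[chunk about {query}]"

def run_agent(user_query, max_turns=5):
    messages = [{"role": "user", "content": user_query}]
    for _ in range(max_turns):
        reply = fake_llm(messages)               # think
        if "tool_call" in reply:                 # act
            result = fake_retrieve(**reply["tool_call"]["args"])
            messages.append({"role": "tool", "content": result})  # observe
        else:                                    # final answer
            messages.append({"role": "assistant", "content": reply["content"]})
            return messages
    return messages

history = run_agent("What is the standard method for Task Decomposition?")
```

    The real agent differs mainly in that the "decision" comes from the LLM's structured tool-call output and the loop is managed by LangChain, but the control flow is the same.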
    

Xeon 6 vLLM Inferencing Performance

Benchmarking was performed across three 7-8B parameter models using the Lenovo ThinkSystem SR650 V4 powered by Intel’s Xeon 6 processors:

  • Llama-3.1-8B-Instruct
  • Qwen3-8B
  • Mistral-7B

Table 1. Output token generation rate, 512 input / 512 output tokens

| Users | Llama-8B Tok/s | Tok/s/user | TTFT (ms) | Qwen3-8B Tok/s | Tok/s/user | TTFT (ms) | Mistral-7B Tok/s | Tok/s/user | TTFT (ms) |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 10.16 | 10.16 | 197.02 | 9.57 | 9.57 | 195.27 | 11.63 | 11.63 | 187.35 |
| 2 | 20.08 | 10.04 | 276.59 | 18.78 | 9.39 | 282.47 | 22.93 | 11.46 | 277.62 |
| 4 | 39.31 | 9.82 | 484.56 | 36.85 | 9.21 | 475.27 | 44.94 | 11.23 | 467.69 |
| 8 | 72.90 | 9.11 | 772.23 | 70.41 | 8.80 | 815.79 | 82.74 | 10.34 | 769.82 |
| 16 | 138.25 | 8.64 | 1425.55 | 128.71 | 8.04 | 1482.89 | 156.41 | 9.77 | 1413.50 |
| 32 | 200.39 | 6.26 | 2130.28 | 191.87 | 5.99 | 2216.58 | 222.34 | 6.94 | 2147.51 |
| 64 | 295.69 | 4.62 | 3515.06 | 276.29 | 4.31 | 3664.53 | 338.21 | 5.28 | 3449.08 |
| 128 | 405.50 | 3.16 | 6238.52 | 385.30 | 3.01 | 6508.17 | 464.16 | 3.62 | 6206.06 |
Table 2. Output token generation rate, 512 input / 1024 output tokens

| Users | Llama-8B Tok/s | Tok/s/user | TTFT (ms) | Qwen3-8B Tok/s | Tok/s/user | TTFT (ms) | Mistral-7B Tok/s | Tok/s/user | TTFT (ms) |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 10.15 | 10.15 | 199.74 | 9.57 | 9.57 | 195.63 | 11.63 | 11.63 | 184.90 |
| 2 | 20.00 | 10.00 | 298.26 | 18.73 | 9.36 | 301.45 | 22.90 | 11.45 | 277.03 |
| 4 | 39.17 | 9.79 | 488.66 | 36.67 | 9.16 | 475.66 | 44.76 | 11.19 | 469.86 |
| 8 | 72.42 | 9.05 | 776.91 | 69.81 | 8.72 | 819.00 | 81.98 | 10.24 | 772.95 |
| 16 | 136.36 | 8.52 | 1431.88 | 126.65 | 7.91 | 1480.38 | 153.90 | 9.61 | 1415.63 |
| 32 | 195.61 | 6.11 | 2129.49 | 186.57 | 5.83 | 2240.58 | 216.23 | 6.75 | 2135.45 |
| 64 | 286.28 | 4.47 | 3481.94 | 266.47 | 4.16 | 3624.84 | 325.14 | 5.08 | 3446.06 |
| 128 | 386.00 | 3.01 | 6321.50 | 364.87 | 2.85 | 6512.12 | 439.44 | 3.43 | 6227.17 |

At a single concurrent user, all three models achieved sub-200 ms time-to-first-token (TTFT) regardless of input length, with Mistral-7B reaching 11.63 tok/s, Llama-3.1-8B at 10.16 tok/s, and Qwen3-8B at 9.57 tok/s on the 512-input/512-output benchmark. These latencies are well within the thresholds required for interactive agent use cases, and performance holds up to 16-32 concurrent users without excessive degradation.

Throughput scales efficiently under concurrent load, making the platform well suited to multiple concurrent users or a small agent swarm. In the 512/512 configuration, aggregate output token generation for Mistral-7B grew from 11.63 tok/s at one user to 464.16 tok/s at 128 concurrent users (a roughly 40x increase), demonstrating that the platform's Intel AMX-accelerated cores are effectively utilized as the request queue deepens. Llama and Qwen exhibited comparable scaling curves, reaching 405.50 tok/s and 385.30 tok/s respectively at 128 users.
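
The scaling figures can be sanity-checked with simple arithmetic; the sketch below computes the scaling factor and the implied parallel efficiency from the Table 1 Mistral-7B numbers:

```python
# Scaling check from Table 1 (512 input / 512 output), Mistral-7B.
single_user_tok_s = 11.63   # aggregate throughput at 1 concurrent user
tok_s_128_users = 464.16    # aggregate throughput at 128 concurrent users

# How much total throughput grew (about 40x).
scaling_factor = tok_s_128_users / single_user_tok_s

# Fraction of ideal linear scaling (128x would be 1.0); about 31% here,
# reflecting the per-user latency cost of deep batching.
efficiency = scaling_factor / 128
```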

Table 3. Xeon 6 vLLM Inferencing Performance

| Users | Llama-8B Tok/s | Tok/s/user | TTFT (ms) | Qwen3-8B Tok/s | Tok/s/user | TTFT (ms) | Mistral-7B Tok/s | Tok/s/user | TTFT (ms) |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 10.13 | 10.13 | 197.53 | 9.55 | 9.55 | 200.46 | 11.58 | 11.58 | 187.39 |
| 2 | 19.83 | 9.91 | 303.16 | 18.57 | 9.28 | 310.26 | 22.69 | 11.34 | 289.92 |
| 4 | 38.54 | 9.63 | 504.56 | 36.05 | 9.01 | 490.03 | 43.95 | 10.98 | 484.00 |
| 8 | 70.40 | 8.80 | 794.53 | 67.81 | 8.47 | 855.69 | 79.64 | 9.95 | 799.66 |
| 16 | 129.47 | 8.09 | 1498.18 | 120.05 | 7.50 | 1561.77 | 145.27 | 9.07 | 1494.71 |
| 32 | 181.22 | 5.66 | 2252.81 | 172.04 | 5.37 | 2364.43 | 198.39 | 6.19 | 2243.36 |
| 64 | 257.64 | 4.02 | 3644.35 | 240.05 | 3.75 | 3851.81 | 289.51 | 4.52 | 3622.54 |
| 128 | 335.25 | 2.61 | 6595.97 | 314.73 | 2.45 | 6881.80 | 373.42 | 2.91 | 6489.93 |

Conclusions

The results presented in this paper demonstrate that GPU infrastructure is no longer a prerequisite for deploying production-ready AI agents. By combining the Intel Xeon 6740P's Advanced Matrix Extensions with the vLLM inference server and LangChain's agent framework, the Lenovo ThinkSystem SR650 V4 delivers consistent, low-latency inference across a range of 7-8B parameter models, entirely on CPU. The implementation walkthrough further illustrates that the architecture is straightforward to deploy, requiring no custom integration code and remaining fully compatible with the broader vLLM, LangChain, and Hugging Face ecosystems.

For enterprises evaluating their AI infrastructure strategy, the SR650 V4 represents a compelling path to adoption. CPU-only inference eliminates the supply chain constraints and high capital expenditure associated with GPU procurement, while Intel AMX acceleration ensures that performance scales predictably with concurrent load. Whether deployed as a standalone agent host or scaled horizontally across nodes to serve larger user populations, the Lenovo ThinkSystem SR650 V4 powered by Intel Xeon 6 offers a cost-effective, operationally simple, and enterprise-grade foundation for the next generation of AI-powered applications.

Hardware Details

The following table lists the configuration of the server we used in our tests.

Table 4. Server configuration

| Feature | Description |
|---|---|
| Server model | Lenovo ThinkSystem SR650 V4 |
| Processor | 2x Intel Xeon 6740P 48C 270W 2.1GHz |
| Installed memory | 16x Samsung 64GB TruDDR5 6400MHz (2Rx4) 10x4 16Gbit RDIMM |
| Disk | 4x ThinkSystem 2.5" U.2 PM9D3a 1.92TB Read Intensive NVMe PCIe 5.0 x4 HS SSD |
| OS | Ubuntu 22.04.5 LTS (GNU/Linux 6.8.0-60-generic x86_64) |

Author

Eric Page is an AI Engineer at Lenovo. He has six years of practical experience developing machine learning solutions for applications ranging from weather forecasting to pose estimation. He enjoys solving practical problems using data and AI/ML.


Trademarks

Lenovo and the Lenovo logo are trademarks or registered trademarks of Lenovo in the United States, other countries, or both. A current list of Lenovo trademarks is available on the Web at https://www.lenovo.com/us/en/legal/copytrade/.

The following terms are trademarks of Lenovo in the United States, other countries, or both:
Lenovo®
ThinkSystem®

The following terms are trademarks of other companies:

Intel®, the Intel logo and Xeon® are trademarks of Intel Corporation or its subsidiaries.

Linux® is the trademark of Linus Torvalds in the U.S. and other countries.

SharePoint® is a trademark of Microsoft Corporation in the United States, other countries, or both.

Other company, product, or service names may be trademarks or service marks of others.