Ollama and Local LLMs: Running AI Agents Without the Cloud
By Diesel
tools · ollama · local-llm · edge
Every API call to OpenAI or Anthropic is data leaving your network, latency you can't control, and a bill that scales with usage. For most production applications, that trade-off is fine. The models are better, the infrastructure is someone else's problem, and the cost per token is reasonable.
But there are cases where local models win. And Ollama has made running them so trivially easy that the barrier to trying is basically zero.
## When Local Actually Makes Sense
Let me be honest about this upfront. Claude Opus, GPT-4o, and Gemini Ultra are better than any model you'll run locally. If your task requires the absolute best reasoning, writing, or complex multi-step planning, cloud APIs are the answer. Local models don't compete on raw capability.
Where local models win:
**Privacy.** Your data never leaves your machine. For healthcare, legal, financial, or classified contexts, this isn't a preference. It's a requirement. Running Llama locally means your patient records, legal briefs, and financial data never touch a third-party API.
**Latency.** No network round trip. For applications where you need sub-100ms responses (autocomplete, inline suggestions, real-time processing), a local model running on a good GPU beats any API call over the internet. The related post on [cost optimisation](/blog/cost-optimization-ai-agents) goes further on this point.
**Cost at scale.** If you're making millions of inference calls per day, the math changes. The upfront hardware cost amortizes over volume. At some scale, local inference is cheaper than API pricing.
**Offline operation.** Edge devices, air-gapped environments, field deployments. When there's no reliable internet, local is the only option.
**Development speed.** No rate limits. No API keys. No cost anxiety while prototyping. Run experiments all day without watching a billing dashboard.
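On the cost-at-scale point, the break-even is easy to estimate. A minimal sketch, where the volume, token counts, per-token price, and hardware cost are all illustrative assumptions, not current quotes:

```python
# Back-of-envelope break-even: cloud API vs. one-time local hardware.
# All numbers below are illustrative assumptions, not real price quotes.

def monthly_api_cost(calls_per_day, tokens_per_call, price_per_mtok):
    """Cloud spend per month at a given call volume."""
    tokens_per_month = calls_per_day * tokens_per_call * 30
    return tokens_per_month / 1_000_000 * price_per_mtok

def breakeven_months(hardware_cost, calls_per_day, tokens_per_call, price_per_mtok):
    """Months until a one-time GPU purchase beats API spend (ignores power/ops)."""
    return hardware_cost / monthly_api_cost(calls_per_day, tokens_per_call, price_per_mtok)

# Assumed: 1M calls/day, 1k tokens each, $0.50 per million tokens, $5k GPU box
api_monthly = monthly_api_cost(1_000_000, 1_000, 0.50)      # $15,000/month
payback = breakeven_months(5_000, 1_000_000, 1_000, 0.50)   # well under a month
```

At that (hypothetical) volume the hardware pays for itself almost immediately; at a thousand calls a day it never does. The math, not the principle, decides.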
## Ollama: The Docker of Local LLMs
```bash
# Install
curl -fsSL https://ollama.com/install.sh | sh
# Pull and run a model
ollama pull llama3.1:8b
ollama run llama3.1:8b "Explain binary search"
# Run as a server
ollama serve
# API now available at http://localhost:11434
```
That's it. Ollama manages model downloads, quantization variants, GPU allocation, and serving. It exposes an OpenAI-compatible API, which means every tool that speaks OpenAI can point at Ollama instead.
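Because the API is OpenAI-compatible, you can talk to it with nothing but the standard library. A minimal sketch (the `chat` helper and its defaults are mine, and the request only succeeds if `ollama serve` is running):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/v1/chat/completions"

def build_payload(prompt, model="llama3.1:8b"):
    # Same request shape the OpenAI chat completions API expects
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }

def chat(prompt, model="llama3.1:8b"):
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(prompt, model)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:  # requires `ollama serve` running
        body = json.loads(resp.read())
    return body["choices"][0]["message"]["content"]
```

Swap `OLLAMA_URL` for an OpenAI endpoint and the same payload works unchanged; that compatibility is the whole point.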
The model library includes Llama 3.1, Mistral, Mixtral, Phi-3, CodeLlama, Qwen, DeepSeek, and dozens more. Different sizes, different quantizations, different trade-offs between quality and speed.
## Choosing the Right Model
Model selection for local inference is about matching the model to your hardware and your task. Here's what I've found works:
**8B parameter models** (Llama 3.1 8B, Mistral 7B, Phi-3 Mini): Run on 8GB VRAM. Fast inference. Good for classification, extraction, simple Q&A, and code completion. Not great for complex reasoning or long-form generation.
**14B-34B models** (Qwen2 14B, CodeLlama 34B): Need 16-24GB VRAM. Meaningfully better at coding tasks and nuanced understanding. The sweet spot for development machines with decent GPUs.
**70B+ models** (Llama 3.1 70B, Mixtral 8x22B): Need 48GB+ VRAM or CPU offloading (slow). Approaching cloud model quality for many tasks. Require serious hardware.
For Apple Silicon Macs, Ollama uses the unified memory architecture well. An M2 Pro with 32GB can run 14B models at reasonable speed. An M3 Max with 64GB handles 70B models.
```bash
# Check what fits your hardware
ollama run llama3.1:8b # ~4.7GB, runs on anything modern
ollama run llama3.1:70b-q4_0 # ~40GB, needs serious RAM
ollama run mistral:7b # ~4.1GB, fast and capable
ollama run codellama:13b # ~7.3GB, strong at code
```
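The sizes above roughly track a simple rule: parameter count times bits per weight, divided by eight. A sketch of that estimate, assuming q4 variants average about 4.5 bits per weight once quantization scales are counted:

```python
def approx_model_size_gb(params_billions, bits_per_weight=4.5):
    """Rough in-memory size of a quantized model.

    q4 quantization stores ~4 bits per weight plus scale/zero-point
    overhead, so ~4.5 effective bits is a reasonable working number.
    """
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

approx_model_size_gb(8)       # ~4.5 GB, near llama3.1:8b's ~4.7GB
approx_model_size_gb(70)      # ~39 GB, near the 70b-q4_0's ~40GB
approx_model_size_gb(8, 16)   # ~16 GB for the same model unquantized at fp16
```

Add a couple of gigabytes on top for the KV cache and runtime overhead before deciding what fits your GPU.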
## Building Agents with Local Models
The OpenAI-compatible API means you can plug Ollama into any agent framework. LangChain, LangGraph, CrewAI, Mastra. Point them at localhost:11434 instead of api.openai.com.
```python
from langchain import hub
from langchain_community.llms import Ollama
from langchain.agents import AgentExecutor, create_react_agent
from langchain.tools import Tool

llm = Ollama(model="llama3.1:8b", base_url="http://localhost:11434")

# Your own implementations go here
def search_func(query: str) -> str:
    ...

def calc_func(expression: str) -> str:
    ...

tools = [
    Tool(name="search", func=search_func, description="Search documents"),
    Tool(name="calculate", func=calc_func, description="Do math"),
]

# create_react_agent needs a ReAct prompt; the community one from the hub works
prompt = hub.pull("hwchase17/react")
agent = create_react_agent(llm, tools, prompt)
executor = AgentExecutor(agent=agent, tools=tools, max_iterations=5)

result = executor.invoke({"input": "What was our Q4 revenue growth?"})
```
Or with the Vercel AI SDK:
```typescript
import { generateText } from 'ai';
import { ollama } from 'ollama-ai-provider';
const result = await generateText({
model: ollama('llama3.1:8b'),
prompt: 'Explain the SOLID principles',
});
```
Same code, different model. That's the power of provider abstraction.
## The Honest Limitations of Local Agents
Tool use with local models is hit and miss. Cloud models like Claude and GPT-4o have been specifically trained and fine-tuned for structured tool calling. They reliably generate valid JSON for tool arguments, understand when to call tools vs when to respond directly, and handle multi-tool workflows cleanly.
Smaller local models struggle with this. They'll generate malformed JSON, call the wrong tool, or forget to use tools entirely. The larger models (70B+) are much better, but still less reliable than the cloud models. This connects directly to [edge deployment patterns](/blog/agent-deployment-patterns).
Here's what I've found works:
**Use structured output formats.** Instead of relying on the model to figure out tool calling, use a system prompt that specifies a strict output format. Parse it deterministically.
**Keep tool sets small.** Two or three tools work well with 8B models. Ten tools and the model gets confused about which one to use. Cloud models handle large tool sets. Local models need constraints.
**Use bigger models for the agent brain.** Run an 8B model for the fast stuff (classification, extraction, embeddings) and a 70B model for the reasoning and planning. Two Ollama models running concurrently, different jobs.
**Fine-tune for your use case.** If your agent does one specific thing, fine-tune a smaller model for that exact task. A fine-tuned 8B model for your specific workflow will outperform a general 70B model. Tools like Unsloth make LoRA fine-tuning accessible.
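The first tip above, structured output with deterministic parsing, can be sketched like this. The line format and tool names are made up for illustration; the point is that a regex plus `json.loads` replaces trusting the model's native tool calling:

```python
import json
import re

# A strict, easy-to-parse contract for the model (format invented for this sketch)
SYSTEM_PROMPT = """Respond with EXACTLY one line in one of these forms:
TOOL: <tool_name> | ARGS: <json object>
FINAL: <answer text>"""

TOOL_RE = re.compile(r"^TOOL:\s*(\w+)\s*\|\s*ARGS:\s*(\{.*\})\s*$")
FINAL_RE = re.compile(r"^FINAL:\s*(.+)$", re.DOTALL)

def parse_action(text):
    """Deterministically parse the model's reply into (kind, name, args)."""
    line = text.strip()
    if m := TOOL_RE.match(line):
        return ("tool", m.group(1), json.loads(m.group(2)))
    if m := FINAL_RE.match(line):
        return ("final", m.group(1), None)
    return ("retry", None, None)  # malformed output -> re-prompt the model
```

The `retry` branch matters: with an 8B model you will hit it, and a single re-prompt with the error message fixes most failures.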
## The MLX Angle (Apple Silicon)
If you're on Apple Silicon, MLX deserves a mention. Apple's ML framework runs models natively on the Metal GPU with better performance than Ollama for some configurations. The `mlx-lm` package makes it straightforward. For a deeper look, see [DSPy to optimise local model prompts](/blog/dspy-programming-llms).
```bash
pip install mlx-lm
mlx_lm.generate --model mlx-community/Llama-3.1-8B-Instruct-4bit --prompt "Hello"
```
MLX models are optimized for Apple's unified memory. Quantized models run faster than you'd expect. For development on a Mac, it's worth benchmarking MLX against Ollama for your specific model and workload.
Ollama already uses the Metal GPU on Mac through its llama.cpp backend, but direct MLX usage gives you more control over quantization and memory allocation.
## Hybrid Architecture: The Smart Play
The approach I use most often: cloud models for the heavy reasoning, local models for everything else.
Your agent's planning and decision-making loop hits Claude or GPT-4o. That's where you need the best reasoning. Your embedding generation, classification, extraction, and simple Q&A hit a local Ollama model. That's where you need speed and privacy, not peak intelligence.
```python
# Planning (cloud - best reasoning)
plan = cloud_model.generate("Given these tools and this goal, what steps?")

# Execution (local - fast, private, cheap)
for step in plan.steps:
    if step.type == "classify":
        result = local_model.generate(f"Classify: {step.input}")
    elif step.type == "extract":
        result = local_model.generate(f"Extract from: {step.input}")
    elif step.type == "reason":
        result = cloud_model.generate(f"Analyze: {step.input}")
```
This gives you the best of both worlds. Cloud quality where it matters. Local speed and privacy where it doesn't. And your API bill drops significantly because the bulk of your calls are free.
## Where This Is All Going
Quantization keeps getting better. Models keep getting more efficient. Hardware keeps getting faster. The gap between local and cloud models is narrowing, not widening.
Today, an 8B model on a decent laptop handles 80% of what most applications need. In a year, that number will be higher. The cloud providers know this. That's why they're competing on price and adding features that local inference can't easily replicate (massive context windows, multi-modal, real-time search).
Run Ollama. Experiment with local models. Build the hybrid architecture. Even if you use cloud APIs for production, understanding local inference makes you a better AI engineer. And when the requirements say "data can't leave our network," you'll already know what to do.