Prompt engineering is the duct tape of AI development. You write a carefully worded instruction, test it, realize it fails on edge cases, add more instructions, test again, realize those new instructions broke the cases that worked before, add examples, restructure the whole thing, and eventually arrive at a fragile 2000-word prompt that works 90% of the time and nobody wants to touch.
DSPy says: stop doing that.
Instead of writing prompts, you write programs. You declare what your module should do, give it training examples, and let DSPy optimize the prompts for you. It's prompt engineering automated by actual engineering.
## The Core Idea
In DSPy, you define **signatures** (what goes in, what comes out) and **modules** (how to process them). The framework compiles your program into optimized prompts by testing against examples and iterating automatically.
```python
import dspy

# Configure the language model
lm = dspy.LM("anthropic/claude-sonnet-4-20250514")
dspy.configure(lm=lm)

# Define a signature
class SentimentAnalysis(dspy.Signature):
    """Classify the sentiment of a product review."""

    review: str = dspy.InputField()
    sentiment: str = dspy.OutputField(desc="positive, negative, or neutral")
    confidence: float = dspy.OutputField(desc="confidence score 0.0 to 1.0")

# Use it
classify = dspy.Predict(SentimentAnalysis)
result = classify(review="This laptop is incredible, best purchase I've made all year")

print(result.sentiment)   # "positive"
print(result.confidence)  # 0.95
```
No prompt writing. You declared the inputs and outputs. DSPy figured out how to ask the model. If you need better results, you don't rewrite the prompt. You optimize the module with examples.
## Why This Matters
Here's the problem with prompt engineering. Every time you change your model, your prompts might break. Every time you add a new capability, you risk degrading existing ones. Every time someone on your team edits the system prompt "just a little bit," they might tank accuracy on cases they didn't test.
Prompts are code without tests. DSPy treats them as compiled artifacts with proper evaluation and optimization.
```python
# Define a more complex module
class AnswerWithEvidence(dspy.Signature):
    """Answer a question using evidence from provided context."""

    context: str = dspy.InputField(desc="relevant documents")
    question: str = dspy.InputField()
    answer: str = dspy.OutputField(desc="concise factual answer")
    evidence: list[str] = dspy.OutputField(desc="quotes supporting the answer")

# Chain of Thought automatically adds reasoning
answer = dspy.ChainOfThought(AnswerWithEvidence)
result = answer(
    context="The company reported $4.2B in Q4 revenue, up 15% YoY...",
    question="What was the Q4 revenue?",
)
```
`dspy.ChainOfThought` wraps any signature with automatic step-by-step reasoning. You didn't write "think step by step" in a prompt. The framework handles it. And when a better prompting strategy comes along, you swap the module type without touching your signature definitions.
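That split can be sketched without DSPy at all. The toy mock below (hypothetical names, not DSPy's internals) shows the point: the signature is a plain declaration, and the module type alone decides how it becomes a prompt, so swapping Predict for ChainOfThought amounts to inserting a reasoning field in front of the outputs.

```python
# Toy mock of the signature/module split -- hypothetical, not DSPy internals.
# The signature declares fields; the "module type" decides the prompt shape.
signature = {
    "instructions": "Classify the sentiment of a product review.",
    "inputs": ["review"],
    "outputs": ["sentiment", "confidence"],
}

def predict_prompt(sig, **inputs):
    """Plain Predict: ask for the declared outputs directly."""
    lines = [sig["instructions"]]
    lines += [f"{k}: {v}" for k, v in inputs.items()]
    lines += [f"{field}:" for field in sig["outputs"]]
    return "\n".join(lines)

def chain_of_thought_prompt(sig, **inputs):
    """ChainOfThought: same signature, but a reasoning field comes first."""
    cot_sig = dict(sig, outputs=["reasoning"] + sig["outputs"])
    return predict_prompt(cot_sig, **inputs)

p1 = predict_prompt(signature, review="Great laptop")
p2 = chain_of_thought_prompt(signature, review="Great laptop")
# p2 asks for "reasoning:" before "sentiment:"; the signature never changed.
```

The signature dict is untouched by the swap; only the prompt-construction function changed. That is the decoupling the paragraph above describes.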
## The Optimization Loop
This is where DSPy gets genuinely powerful. You provide training examples, an evaluation metric, and an optimizer. DSPy tests different prompt strategies, selects the best examples for few-shot prompting, and produces an optimized version of your module. It is worth reading about [evaluation metrics DSPy optimizes against](/blog/agent-evaluation-metrics) alongside this.
```python
# Training examples; with_inputs() marks which fields the model receives
trainset = [
    dspy.Example(
        review="Absolute garbage, broke after one week",
        sentiment="negative",
        confidence=0.95,
    ).with_inputs("review"),
    dspy.Example(
        review="It's okay, nothing special but works fine",
        sentiment="neutral",
        confidence=0.7,
    ).with_inputs("review"),
    dspy.Example(
        review="Best purchase I've ever made, life-changing",
        sentiment="positive",
        confidence=0.98,
    ).with_inputs("review"),
    # ... more examples
]

# Evaluation metric
def accuracy_metric(example, pred, trace=None):
    return example.sentiment == pred.sentiment

# Optimize
from dspy.teleprompt import BootstrapFewShot

optimizer = BootstrapFewShot(metric=accuracy_metric, max_bootstrapped_demos=4)
optimized_classify = optimizer.compile(classify, trainset=trainset)

# Now optimized_classify uses the best-performing prompt strategy
result = optimized_classify(review="Terrible customer service")
```
`BootstrapFewShot` is one of several optimizers. It generates candidate few-shot examples from your training data by running the model, keeping the outputs that pass your metric, and selecting the most informative ones for the final prompt. Other optimizers include:
- **MIPRO**: Uses Bayesian optimization to search the prompt space systematically.
- **BootstrapFewShotWithRandomSearch**: Like BootstrapFewShot but explores more candidates.
- **KNNFewShot**: Selects examples dynamically based on similarity to the input.
The key insight is that the optimizer treats the prompt as a search problem. It has a space of possible prompts (different examples, different instructions, different ordering). It has an objective function (your metric). It searches for the prompt that maximizes your metric. That's optimization, not engineering.
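A plain-Python sketch makes that search framing concrete. Everything below is a mock (the "model" is a word-overlap heuristic, not an LLM, and the data is invented), but the shape matches: a space of candidate demo subsets, an objective function, and a search for the maximum.

```python
from itertools import combinations

# Candidate few-shot demos and a small labeled dev set (toy data)
demos = [
    ("Broke after one week", "negative"),
    ("Nothing special but works", "neutral"),
    ("Best purchase ever", "positive"),
    ("Do not buy this", "negative"),
]
devset = [
    ("Terrible customer service", "negative"),
    ("Works fine, no complaints", "neutral"),
]

def mock_predict(demo_set, review):
    """Stand-in for an LLM call: votes with the demo sharing the most words."""
    def overlap(a, b):
        return len(set(a.lower().split()) & set(b.lower().split()))
    best = max(demo_set, key=lambda d: overlap(d[0], review))
    return best[1]

def metric(demo_set):
    """Objective: how many dev-set reviews the mock model labels correctly."""
    return sum(mock_predict(demo_set, r) == label for r, label in devset)

# The "optimizer": exhaustive search over two-demo prompts for the best score
best_demos = max(combinations(demos, 2), key=metric)
```

A real optimizer cannot enumerate every subset of a large training set, which is why BootstrapFewShot samples and filters candidates instead; but the objective-driven loop is the same.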
## Building Pipelines
DSPy modules compose. You can build multi-step pipelines where each step is a DSPy module, and the optimizer optimizes the entire pipeline end-to-end.
```python
class ResearchPipeline(dspy.Module):
    def __init__(self):
        super().__init__()
        # dspy.Retrieve needs a retrieval model configured via dspy.configure(rm=...)
        self.retrieve = dspy.Retrieve(k=5)
        self.summarize = dspy.ChainOfThought("context -> summary")
        self.answer = dspy.ChainOfThought(AnswerWithEvidence)

    def forward(self, question):
        # Retrieve relevant documents
        context = self.retrieve(question).passages
        # Summarize the context
        summary = self.summarize(context="\n".join(context))
        # Answer with evidence
        result = self.answer(
            context=summary.summary,
            question=question,
        )
        return result
pipeline = ResearchPipeline()
result = pipeline("What caused the 2024 CrowdStrike outage?")
```
Each module in the pipeline has its own optimizable prompt. When you optimize the pipeline, DSPy optimizes all of them together, including the interactions between steps. A few-shot example that helps the summarizer might change what the answer module receives, which changes what examples work best for the answer module. DSPy handles this co-optimization.
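The co-optimization point can be illustrated with a toy numeric sketch (invented stage configs and scores, no DSPy): optimizing each stage in isolation can miss the combination that is best end to end.

```python
from itertools import product

# Toy two-stage pipeline: each stage has candidate prompt configs, and the
# end-to-end score depends on the *combination*, not on each stage alone.
scores = {
    # (summarizer_config, answerer_config): end-to-end accuracy (invented)
    ("short", "quote-heavy"): 0.60,     # short summaries starve the quoter
    ("short", "terse"): 0.80,
    ("detailed", "quote-heavy"): 0.90,  # detail plus quotes work well together
    ("detailed", "terse"): 0.70,
}

def greedy():
    """Optimize each stage separately, freezing the other at a default."""
    s = max(["short", "detailed"], key=lambda c: scores[(c, "terse")])
    a = max(["quote-heavy", "terse"], key=lambda c: scores[(s, c)])
    return s, a

def joint():
    """Search stage configs together, end to end."""
    return max(product(["short", "detailed"], ["quote-heavy", "terse"]),
               key=lambda pair: scores[pair])
```

Here `greedy()` settles on `("short", "terse")` at 0.80 because "short" wins when the answerer is frozen, while `joint()` finds `("detailed", "quote-heavy")` at 0.90. That gap is what end-to-end compilation of a pipeline is buying you.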
## Where DSPy Shines
**Evaluation-driven development.** DSPy forces you to define metrics upfront. What does "good" look like? How do you measure it? Once you have that, optimization is mechanical. Most teams skip this step and rely on vibes. "The output looks pretty good." That's not engineering. DSPy makes it engineering.
**Reproducibility.** An optimized DSPy module is deterministic in its prompt construction. Save it, load it, get the same prompts. No more "it worked on my machine" because someone had a different system prompt.
**Model portability.** Optimize for Claude, then swap to GPT-4o, then reoptimize. The signatures and modules don't change. Only the compiled prompts change. Your application logic is decoupled from the model.
**Automatic few-shot selection.** Instead of manually picking examples and hoping they cover the right edge cases, DSPy selects the examples that empirically work best for your metric on your data. It will often find example combinations you wouldn't have thought to try. The related post on [RAG retrieval quality](/blog/rag-evaluation-retrieval-quality) goes further on this point.
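The dynamic, KNNFewShot-style variant of this can be sketched in a few lines. The similarity function here is a crude word-overlap stand-in for the embedding similarity a real implementation would use, and the data is invented:

```python
def similarity(a, b):
    """Crude stand-in for embedding similarity: Jaccard overlap of words."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

def knn_few_shot(pool, query, k=2):
    """Pick the k training examples most similar to the incoming input."""
    return sorted(pool, key=lambda ex: similarity(ex["review"], query),
                  reverse=True)[:k]

pool = [
    {"review": "Battery died in a month", "sentiment": "negative"},
    {"review": "Battery life is amazing", "sentiment": "positive"},
    {"review": "Shipping was slow but the product is fine", "sentiment": "neutral"},
]

demos = knn_few_shot(pool, "Battery issues with this laptop", k=2)
# The two battery-related reviews are chosen as demos for this query.
```

Selecting demos per query, rather than baking one fixed set into the prompt, is what lets the few-shot context track the input distribution.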
## Where DSPy Struggles
**Learning curve.** The mental model is different from anything else in the AI ecosystem. You're not writing prompts. You're not even writing code that calls an LLM in the way you're used to. You're declaring modules and letting a compiler figure out the prompts. It takes time to trust the process.
**Optimization cost.** Running the optimizer means running your model dozens or hundreds of times on your training set. For expensive models, this adds up. The optimization itself can cost more than a month of inference if you're not careful with your evaluation budget.
**Debugging opaqueness.** When a DSPy module produces bad output, the prompt that caused it was auto-generated. Reading through a DSPy-compiled prompt to understand why it failed is like reading compiler output. Possible, but not fun.
**Not great for creative tasks.** DSPy excels at tasks with clear right/wrong answers (classification, extraction, factual Q&A). For creative writing, open-ended generation, or tasks where "good" is subjective, defining a useful metric is hard. And without a good metric, the optimizer can't optimize.
**Small community.** Compared to LangChain, DSPy's community is tiny. Fewer tutorials, fewer examples, fewer people to ask when you're stuck. The documentation is academic in tone (it came out of Stanford's NLP lab), which isn't everyone's cup of tea. This connects directly to [optimizing local models](/blog/ollama-local-llms-agents).
## When to Use DSPy
Use DSPy when you have a task with a measurable success metric and training data. Classification, extraction, summarization with ground truth, factual Q&A with answer keys. These are DSPy's sweet spot.
Don't use DSPy for quick prototypes, creative generation, or one-off tasks. The overhead of defining signatures, collecting training data, and running optimizers isn't worth it if you just need a prompt that works well enough for a demo.
The honest take: DSPy is what prompt engineering should evolve into. But the ecosystem is still maturing, the learning curve is real, and most teams will get more immediate value from a well-written prompt than from a DSPy optimization pipeline.
That said, if your LLM application has gone from prototype to production and you're spending significant time maintaining and tuning prompts, DSPy is the adult tool for that problem. It replaces hand-tuning with systematic optimization. And unlike your hand-tuned prompts, it gets better with more data instead of more fragile.