Building a Code Generation Agent with Claude and Tool Use

Here's what makes a code generation agent different from "ask Claude to write code." The agent writes it, runs it, reads the error, thinks about what went wrong, fixes it, and runs it again. That loop is everything. Without it, you're just a copy-paste intermediary between an LLM and a terminal.

We're building this with Claude's native tool use. No LangChain, no LangGraph, no framework. Just the Anthropic SDK and a couple of tools. You'll see how little code it actually takes.

The Tools

Our agent needs three capabilities: write files, read files, and execute code. That's it.

import anthropic
import subprocess
import os
import tempfile

client = anthropic.Anthropic()

tools = [
    {
        "name": "write_file",
        "description": "Write content to a file. Creates the file if it doesn't exist, overwrites if it does.",
        "input_schema": {
            "type": "object",
            "properties": {
                "path": {
                    "type": "string",
                    "description": "File path relative to the working directory"
                },
                "content": {
                    "type": "string",
                    "description": "The content to write to the file"
                }
            },
            "required": ["path", "content"]
        }
    },
    {
        "name": "read_file",
        "description": "Read the contents of a file.",
        "input_schema": {
            "type": "object",
            "properties": {
                "path": {
                    "type": "string",
                    "description": "File path relative to the working directory"
                }
            },
            "required": ["path"]
        }
    },
    {
        "name": "run_command",
        "description": "Execute a shell command and return stdout and stderr. Use for running code, installing packages, or checking results.",
        "input_schema": {
            "type": "object",
            "properties": {
                "command": {
                    "type": "string",
                    "description": "The shell command to execute"
                }
            },
            "required": ["command"]
        }
    }
]

Tool Execution

Each tool maps to a Python function. The run_command function includes a timeout because you don't want an infinite loop eating your machine.

WORK_DIR = tempfile.mkdtemp(prefix="agent_workspace_")

def execute_tool(name: str, input_data: dict) -> str:
    """Execute a tool and return the result as a string."""
    if name == "write_file":
        path = os.path.join(WORK_DIR, input_data["path"])
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w") as f:
            f.write(input_data["content"])
        return f"File written: {input_data['path']}" The related post on [automated code review](/blog/automated-code-review-ai-agents) goes further on this point.

    elif name == "read_file":
        path = os.path.join(WORK_DIR, input_data["path"])
        if not os.path.exists(path):
            return f"Error: File not found: {input_data['path']}"
        with open(path, "r") as f:
            return f.read()

    elif name == "run_command":
        try:
            result = subprocess.run(
                input_data["command"],
                shell=True,
                capture_output=True,
                text=True,
                timeout=30,
                cwd=WORK_DIR,
            )
            output = ""
            if result.stdout:
                output += f"STDOUT:\n{result.stdout}"
            if result.stderr:
                output += f"\nSTDERR:\n{result.stderr}"
            output += f"\nExit code: {result.returncode}"
            return output or "Command completed with no output"
        except subprocess.TimeoutExpired:
            return "Error: Command timed out after 30 seconds"

    return f"Unknown tool: {name}"

Everything runs in a temp directory. The agent can't touch your actual filesystem. That's a guardrail, not a suggestion. The related post on Claude's Agent SDK goes further on this point.

The Agent Loop

This is the core. Send a message, check if Claude wants to use tools, execute them, send results back, repeat until Claude gives a final text response.

def run_agent(task: str, max_iterations: int = 10) -> str:
    """Run the code generation agent."""
    messages = [{"role": "user", "content": task}]

    system_prompt = """You are a code generation agent. You write code, run it,
and fix any errors iteratively until it works correctly.

Your workflow:
1. Understand the requirements
2. Write the code using write_file
3. Run it using run_command
4. If there are errors, read the output, diagnose the issue, fix the code
5. Repeat until the code runs successfully
6. Verify the output is correct

Always test your code. Never claim it works without running it.
If you need packages, install them with pip before running."""

    for iteration in range(max_iterations):
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=4096,
            system=system_prompt,
            tools=tools,
            messages=messages,
        )

        # Check if we're done (no tool use, just text)
        if response.stop_reason == "end_turn":
            # Extract the final text response
            text_parts = [
                block.text for block in response.content
                if block.type == "text"
            ]
            return "\n".join(text_parts)

        # Process tool calls
        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                print(f"  [{iteration+1}] Calling: {block.name}({list(block.input.keys())})")
                result = execute_tool(block.name, block.input)
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": result,
                })

        # Add assistant response and tool results to conversation
        messages.append({"role": "assistant", "content": response.content})
        messages.append({"role": "user", "content": tool_results})

    return "Agent reached maximum iterations without completing the task."

Ten iterations is the cap. Most tasks complete in 2-4. If it's still failing after 10, there's a deeper problem that more iterations won't solve.

Using It

result = run_agent("""
Create a Python script that:
1. Reads a CSV file with columns: name, email, department
2. Groups employees by department
3. Generates a summary report showing:
   - Number of employees per department
   - Alphabetical list of names in each department
4. Outputs the report as formatted text

Create a sample CSV file with at least 10 entries to test with.
""")

print(result)

Watch what happens. Claude writes the CSV first, then the script, then runs it. If the script has a bug (missing import, wrong column name), it reads the error, fixes it, and runs again. That fix-and-retry loop is what makes it an agent.

Adding Verification

Don't trust the agent. Verify.

def run_agent_with_verification(task: str, verification: str) -> dict:
    """Run the agent, then verify the result independently."""
    # Phase 1: Generate
    result = run_agent(task)

    # Phase 2: Verify
    verify_result = run_agent(f"""
    The following code was generated for this task: {task}

    Your job is NOT to rewrite it. Your job is to verify it.
    1. Read the generated files
    2. Run the code
    3. Check: {verification}
    4. Report any issues found

    If everything is correct, say "VERIFIED". If not, explain the failures.
    """)

    return {
        "generation_result": result,
        "verification_result": verify_result,
        "verified": "VERIFIED" in verify_result.upper(),
    }

# Use it
output = run_agent_with_verification(
    task="Write a function that sorts a list using merge sort",
    verification="Test with edge cases: empty list, single element, already sorted, reverse sorted, duplicates"
)

Two separate agent runs. The generator creates. The verifier tests. If the verifier finds issues, you can loop back to the generator with the feedback. Independent verification catches bugs that self-testing misses.

Streaming the Agent's Work

Users want to see what's happening, not stare at a spinner.

def run_agent_streaming(task: str, max_iterations: int = 10):
    """Stream the agent's work in real-time."""
    messages = [{"role": "user", "content": task}]

    system_prompt = """You are a code generation agent. Write code, run it,
fix errors iteratively until it works. Always test your code.""" This connects directly to [tool use](/blog/tool-use-in-ai-agents).

    for iteration in range(max_iterations):
        print(f"\n--- Iteration {iteration + 1} ---")

        with client.messages.stream(
            model="claude-sonnet-4-20250514",
            max_tokens=4096,
            system=system_prompt,
            tools=tools,
            messages=messages,
        ) as stream:
            response = stream.get_final_message()

            # Print any text content as it arrives
            for block in response.content:
                if block.type == "text":
                    print(block.text)

        if response.stop_reason == "end_turn":
            return

        # Process tools
        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                print(f"\n> Tool: {block.name}")
                result = execute_tool(block.name, block.input)
                print(f"> Result: {result[:200]}...")
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": result,
                })

        messages.append({"role": "assistant", "content": response.content})
        messages.append({"role": "user", "content": tool_results})

Safety Boundaries

The temp directory is your first boundary. Here's the second: command allowlisting.

ALLOWED_COMMANDS = {"python", "pip", "node", "npm", "cat", "ls", "echo"}

def safe_execute_command(command: str) -> str:
    """Only allow whitelisted commands."""
    base_command = command.strip().split()[0]

    if base_command not in ALLOWED_COMMANDS:
        return f"Error: Command '{base_command}' is not allowed. Permitted: {ALLOWED_COMMANDS}"

    # Block dangerous patterns regardless of command
    dangerous = ["rm -rf", "sudo", "> /dev", "curl | sh", "wget | sh"]
    if any(d in command for d in dangerous):
        return "Error: Dangerous command pattern detected"

    return execute_tool("run_command", {"command": command})

What You've Built

A code generation agent that writes, tests, debugs, and verifies its own code. No framework. Just the Anthropic SDK, three tools, and a loop.

The pattern is universal. Swap the tools for database operations and you have a data analysis agent. Swap them for API calls and you have an integration agent. The agent loop stays the same. Observe the result, decide what to do next, act, repeat.

That's the whole trick. Everything else is just choosing the right tools.