CI Pipeline Agent Testing
Use Caged sandboxes in CI to run AI agents that fix failing tests, generate code, or validate changes, all isolated and cost-controlled.
Prerequisites
- A CAGED_API_KEY stored as a GitHub Actions secret
- Your agent's API key (e.g., ANTHROPIC_API_KEY), also stored as a GitHub Actions secret
GitHub Actions Workflow
# .github/workflows/agent-fix.yml
name: AI Agent Fix
on:
  workflow_dispatch:
    inputs:
      task:
        description: "What should the agent do?"
        required: true
        default: "Fix all failing tests"
jobs:
  agent-sandbox:
    runs-on: ubuntu-latest
    steps:
      - name: Install Caged CLI
        run: curl -fsSL https://get.caged.dev | sh
      - name: Login to Caged
        run: |
          mkdir -p ~/.config/caged
          echo '{"api_url":"https://api.caged.dev","api_key":"${{ secrets.CAGED_API_KEY }}"}' > ~/.config/caged/config.json
      - name: Create sandbox
        id: sandbox
        run: |
          # Create sandbox with the repo
          SANDBOX_ID=$(caged run \
            --template node-20 \
            --cpus 2 \
            --memory 2048 \
            --repo ${{ github.server_url }}/${{ github.repository }} \
            --budget 5 \
            --env "ANTHROPIC_API_KEY=${{ secrets.ANTHROPIC_API_KEY }}" \
            --json | jq -r '.id')
          echo "sandbox_id=$SANDBOX_ID" >> "$GITHUB_OUTPUT"
          echo "Created sandbox: $SANDBOX_ID"
      - name: Run agent task
        env:
          # Pass the user-supplied task through the environment so it is
          # not interpolated into the script text on the runner
          AGENT_TASK: ${{ github.event.inputs.task }}
        run: |
          caged exec ${{ steps.sandbox.outputs.sandbox_id }} \
            "claude '$AGENT_TASK'"
      - name: Extract results
        run: |
          # Get the git diff from the sandbox
          caged exec ${{ steps.sandbox.outputs.sandbox_id }} \
            "git diff" > agent-changes.patch
          # Check if there are changes
          if [ -s agent-changes.patch ]; then
            echo "Agent made changes; patch file saved"
          else
            echo "No changes made"
          fi
      - name: Cleanup sandbox
        # Run even if earlier steps failed, but only if a sandbox was created
        if: always() && steps.sandbox.outputs.sandbox_id != ''
        run: caged destroy ${{ steps.sandbox.outputs.sandbox_id }} --force
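
The extract step only records whether the patch file is non-empty. A minimal sketch of one way to go further before using the patch, assuming the job also checks out the repository (the workflow above does not) and that git is available on the runner. classify_patch is a hypothetical helper name, not part of the Caged CLI:

```shell
# Hypothetical helper: report what can be done with a patch file.
# Prints "empty" if the agent changed nothing, "applies" if the patch
# applies cleanly to the current working tree, and "conflict" otherwise.
classify_patch() {
  local patch_file="$1"
  if [ ! -s "$patch_file" ]; then
    echo "empty"
  elif git apply --check "$patch_file" 2>/dev/null; then
    # --check validates the patch without touching any files
    echo "applies"
  else
    echo "conflict"
  fi
}
```

In a workflow step this could be called as classify_patch agent-changes.patch, with the result deciding whether to apply the patch, upload it as an artifact, or stop.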
Auto-Fix on Test Failure
Run an agent automatically when tests fail:
# .github/workflows/auto-fix.yml
name: Auto-Fix Tests
on:
  push:
    branches: [main]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npm test
        id: tests
        continue-on-error: true
      - name: Agent Fix (on failure)
        if: steps.tests.outcome == 'failure'
        run: |
          curl -fsSL https://get.caged.dev | sh
          mkdir -p ~/.config/caged
          echo '{"api_url":"https://api.caged.dev","api_key":"${{ secrets.CAGED_API_KEY }}"}' > ~/.config/caged/config.json
          SANDBOX_ID=$(caged run \
            --template node-20 \
            --cpus 2 \
            --memory 2048 \
            --repo ${{ github.server_url }}/${{ github.repository }} \
            --budget 3 \
            --env "ANTHROPIC_API_KEY=${{ secrets.ANTHROPIC_API_KEY }}" \
            --json | jq -r '.id')
          caged exec "$SANDBOX_ID" "npm install && claude 'Run the tests, read the failures, and fix them. Commit your fixes.'"
          # Push the fixes to a new branch; open a PR from it for review
          caged exec "$SANDBOX_ID" "git push origin HEAD:fix/auto-agent-$(date +%s)"
          caged destroy "$SANDBOX_ID" --force
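
Because GitHub Actions runs bash steps with -e, a failing caged exec in a single-step script like this one stops the script before caged destroy is reached. A minimal sketch of a guard against that, using a shell trap. destroy_sandbox is a hypothetical helper name; the only Caged command it uses is the caged destroy call already shown above:

```shell
# Hypothetical helper: destroy the sandbox only if an ID was captured.
destroy_sandbox() {
  local id="${1:-}"
  if [ -z "$id" ]; then
    # Sandbox creation failed before an ID existed; nothing to clean up
    echo "no sandbox to destroy"
    return 0
  fi
  caged destroy "$id" --force
}

# Usage inside the run: block, registered before the sandbox is created:
#   SANDBOX_ID=""
#   trap 'destroy_sandbox "$SANDBOX_ID"' EXIT
#   SANDBOX_ID=$(caged run ... --json | jq -r '.id')
```

The trap fires on every exit path of the step, so the explicit caged destroy at the end of the script becomes unnecessary.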
PR Review Agent
Run an agent to review every pull request:
# .github/workflows/pr-review.yml
name: AI PR Review
on:
  pull_request:
    types: [opened, synchronize]
jobs:
  review:
    runs-on: ubuntu-latest
    # The default token needs write access to post PR comments
    permissions:
      pull-requests: write
    steps:
      - name: Install Caged CLI
        run: curl -fsSL https://get.caged.dev | sh
      - name: Setup credentials
        run: |
          mkdir -p ~/.config/caged
          echo '{"api_url":"https://api.caged.dev","api_key":"${{ secrets.CAGED_API_KEY }}"}' > ~/.config/caged/config.json
      - name: Run review agent
        run: |
          SANDBOX_ID=$(caged run \
            --template node-20 \
            --cpus 2 \
            --memory 1024 \
            --repo ${{ github.server_url }}/${{ github.repository }} \
            --budget 2 \
            --env "ANTHROPIC_API_KEY=${{ secrets.ANTHROPIC_API_KEY }}" \
            --json | jq -r '.id')
          # Checkout the PR branch
          caged exec "$SANDBOX_ID" \
            "git fetch origin pull/${{ github.event.number }}/head:pr && git checkout pr"
          # Run review
          REVIEW=$(caged exec "$SANDBOX_ID" \
            "claude 'Review this branch vs main. List issues, suggestions, and a summary. Output as markdown.'")
          # Post as PR comment; --repo is required because the runner has no checkout
          gh pr comment ${{ github.event.number }} --repo ${{ github.repository }} --body "$REVIEW"
          caged destroy "$SANDBOX_ID" --force
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
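
The review text is posted exactly as the agent produced it, so an empty or very long result can make the comment step fail. A minimal sketch of a guard, assuming GitHub's 65,536-character limit on comment bodies. prepare_comment and its 65000 default are hypothetical choices that leave room for the truncation notice:

```shell
# Hypothetical helper: make agent output safe to post as a PR comment.
# $1 is the review text, $2 an optional maximum body size.
prepare_comment() {
  local body="$1"
  local max="${2:-65000}"
  if [ -z "$body" ]; then
    # gh pr comment rejects an empty body, so post a placeholder instead
    printf '%s\n' "The review agent produced no output."
  elif [ "${#body}" -gt "$max" ]; then
    # head -c counts bytes, so this cuts at or below the character limit
    # (it may split a multibyte character at the cut point)
    printf '%s' "$body" | head -c "$max"
    printf '\n\n%s\n' "(review truncated)"
  else
    printf '%s\n' "$body"
  fi
}
```

In the workflow, the comment step could then pass --body "$(prepare_comment "$REVIEW")" instead of the raw variable.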
Tips
- Set tight budgets in CI: Agent tasks in CI should be predictable. Use a --budget between 2 and 5 to catch runaway loops early.
- Always clean up: Use if: always() on the destroy step so sandboxes don't leak on workflow failure.
- Use the --json flag: Parse sandbox IDs programmatically with jq for reliable scripting.
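
One caveat when parsing IDs with jq: jq -r '.id' prints the literal string null when the field is missing, so a failed sandbox creation can go unnoticed until a later command breaks. A minimal sketch of a check, where require_sandbox_id is a hypothetical helper name and the error text is illustrative:

```shell
# Hypothetical helper: fail fast when no usable sandbox ID was parsed.
# Prints the ID on success; prints an error to stderr and returns 1 otherwise.
require_sandbox_id() {
  local id="${1:-}"
  if [ -z "$id" ] || [ "$id" = "null" ]; then
    echo "sandbox creation failed: no ID in caged output" >&2
    return 1
  fi
  printf '%s\n' "$id"
}

# Usage:
#   RAW_ID=$(caged run ... --json | jq -r '.id')
#   SANDBOX_ID=$(require_sandbox_id "$RAW_ID")
```

With bash's -e in effect, the failed assignment stops the step immediately instead of passing an invalid ID to caged exec.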