
Trust Scoring

Every agent session receives a trust score (0–100) that reflects how safe the agent’s actions were. Trust scores help you identify risky behavior before it causes damage.

How It Works

Caged monitors all agent actions inside the sandbox:
  1. File operations: which files are created, modified, or deleted
  2. Terminal commands: every command executed
  3. Network activity: outbound connections and data transfer
  4. System access: attempts to read system files or escalate privileges
Each action is evaluated against a set of behavioral rules. A session starts at 100, and each risky action deducts points from the trust score.
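As a rough illustration of this model, the sketch below starts a session at 100 and subtracts a penalty for each flagged action. The `Deduction` type, the `trust_score` function, and the clamp at 0 are assumptions made for illustration; they are not Caged's actual implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deduction:
    """One flagged action and the points it costs (hypothetical type)."""
    points: int
    reason: str


def trust_score(deductions: list[Deduction], start: int = 100) -> int:
    """Subtract every deduction from the starting score, never going below 0."""
    total = sum(d.points for d in deductions)
    return max(0, start - total)


# Hypothetical flagged actions for a single session.
flagged = [
    Deduction(10, "Outbound connection to unknown host"),
    Deduction(7, "Deleted 15 files"),
    Deduction(5, "Installed package from git URL"),
]
print(trust_score(flagged))  # 100 - 10 - 7 - 5 = 78
```

A session with no flagged actions keeps its starting score of 100 under this sketch.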

Scoring Rules

No Penalty (Score: 100)

  • Editing source code files
  • Running test suites
  • Installing packages from known registries
  • Git operations (commit, push, pull)
  • Reading documentation files

Minor Penalty (-5 to -10)

  • Deleting more than 10 files at once
  • Installing packages from unknown registries
  • Large file downloads (>100MB)
  • Modifying configuration files outside the project

Moderate Penalty (-10 to -20)

  • Outbound network calls to unknown hosts
  • Running processes as root
  • Accessing environment variables containing sensitive names
  • Creating SSH keys or certificates

Severe Penalty (-20 to -30)

  • Reading /etc/passwd, /etc/shadow, or other system credential files
  • Running curl | sh or similar remote execution patterns
  • Modifying system binaries or libraries
  • Attempting to access the host network namespace
  • Exfiltrating data (large outbound transfers to unknown hosts)

Trust Levels

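| Score  | Level     | Action                                |
|--------|-----------|---------------------------------------|
| 90–100 | Excellent | No action needed                      |
| 70–89  | Good      | Review flagged actions                |
| 50–69  | Caution   | Alert sent, manual review recommended |
| 30–49  | Warning   | Alert sent, sandbox may be paused     |
| 0–29   | Critical  | Sandbox is automatically paused       |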
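If you process scores in your own tooling, the bands above can be expressed as a simple lookup. This is a minimal sketch: the `LEVELS` table and `trust_level` function are hypothetical names, and only the score ranges, level names, and actions come from the table.

```python
# (lower bound, level, action), ordered from highest band to lowest.
LEVELS = [
    (90, "Excellent", "No action needed"),
    (70, "Good", "Review flagged actions"),
    (50, "Caution", "Alert sent, manual review recommended"),
    (30, "Warning", "Alert sent, sandbox may be paused"),
    (0, "Critical", "Sandbox is automatically paused"),
]


def trust_level(score: int) -> tuple[str, str]:
    """Return the (level, action) pair for a trust score in the range 0-100."""
    if not 0 <= score <= 100:
        raise ValueError(f"trust score out of range: {score}")
    for lower_bound, level, action in LEVELS:
        if score >= lower_bound:
            return level, action
    raise AssertionError("unreachable: the last band starts at 0")


print(trust_level(78))  # ('Good', 'Review flagged actions')
```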

Alerts

Configure trust-based alerts in your alert rules:
# Get alerted when trust drops below 70 (sent here as threshold 0.7)
curl -X PUT https://api.caged.dev/v1/alerts/rules/rule-trust-warn \
  -H "Authorization: Bearer caged_sk_..." \
  -H "Content-Type: application/json" \
  -d '{"threshold": 0.7, "channels": ["email", "slack"]}'
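The same request can be prepared from Python using only the standard library. The sketch below builds the request without sending it, since this page does not describe the response format. The function name `build_alert_rule_request` is hypothetical; the URL, rule ID, and payload fields are taken from the curl example, and the JSON content type is an assumption.

```python
import json
import urllib.request

API_BASE = "https://api.caged.dev/v1"  # base URL from the curl example


def build_alert_rule_request(
    api_key: str, rule_id: str, threshold: float, channels: list[str]
) -> urllib.request.Request:
    """Build (but do not send) a PUT request that updates an alert rule."""
    payload = {"threshold": threshold, "channels": channels}
    return urllib.request.Request(
        url=f"{API_BASE}/alerts/rules/{rule_id}",
        data=json.dumps(payload).encode("utf-8"),
        method="PUT",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",  # assumed: the API expects JSON
        },
    )


# "caged_sk_..." is the placeholder from the curl example; substitute a real key.
req = build_alert_rule_request("caged_sk_...", "rule-trust-warn", 0.7, ["email", "slack"])
print(req.get_method(), req.full_url)
```

To send it, pass `req` to `urllib.request.urlopen` and handle the response according to the API reference.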

Viewing Trust Details

CLI

caged trust cage-a1b2c3d4
Trust Score: 78/100 (Good)

Deductions:
  -10  Outbound connection to unknown host (185.199.108.133)
  -7   Deleted 15 files in /tmp/
  -5   Installed package from git URL

Dashboard

The session detail page shows:
  • Overall trust score with trend
  • Timeline of trust-impacting events
  • Detailed breakdown of each deduction

Customizing Rules

You can adjust trust scoring rules and thresholds per sandbox:
# .caged.yaml
trust:
  min_score: 50          # Pause sandbox if trust drops below this
  allow_root: true       # Don't penalize root access
  allowed_hosts:
    - api.openai.com     # Don't penalize connections to these
    - registry.npmjs.org
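One way to picture how these settings could interact with scoring is sketched below. The event shape (`kind`, `host`) and the functions `is_exempt` and `should_pause` are hypothetical, and the dictionary simply mirrors the YAML above; how Caged actually applies the configuration may differ.

```python
# Mirrors the trust section of .caged.yaml shown above.
TRUST_CONFIG = {
    "min_score": 50,
    "allow_root": True,
    "allowed_hosts": ["api.openai.com", "registry.npmjs.org"],
}


def is_exempt(event: dict, trust_cfg: dict) -> bool:
    """Return True if the config says this event should not be penalized."""
    if event["kind"] == "root_process":
        return bool(trust_cfg.get("allow_root", False))
    if event["kind"] == "outbound_connection":
        return event["host"] in trust_cfg.get("allowed_hosts", [])
    return False


def should_pause(score: int, trust_cfg: dict) -> bool:
    """Return True if the score has dropped below the configured minimum."""
    return score < trust_cfg["min_score"]


events = [
    {"kind": "root_process"},
    {"kind": "outbound_connection", "host": "api.openai.com"},
    {"kind": "outbound_connection", "host": "185.199.108.133"},
]
# Only the connection to the host outside allowed_hosts remains penalizable.
penalized = [e for e in events if not is_exempt(e, TRUST_CONFIG)]
print(penalized)
print(should_pause(49, TRUST_CONFIG))  # True: 49 is below min_score 50
```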

Best Practices

  • Set budget guards alongside trust scoring; the two complement each other
  • Review sessions with scores below 80 to understand agent behavior
  • Use allowlist network mode to make trust scores more predictable
  • Custom agents should avoid patterns that trigger deductions