# Production trust loop

How to run FreeAgentMaker agents **reliably** in production: **evaluate → baseline → monitor → fix → repeat**.

## 1. Golden tests (eval)

1. Add **test cases** under **Model & settings → Test suite & evaluation** (`blueprint.testCases[]`).
2. Optional: tag cases with **`suite`** (e.g. `jordan-support`, `scout-website-qa`) and filter when running eval.
3. Click **Run evaluation** and fix failures before publishing.

## 2. Save a baseline

After you have a **green** run:

1. Check **Save as baseline** and run evaluation again (or run once with it checked).
2. The server stores per-agent, per-user results (Postgres `agent_test_baseline` or file DB `agentTestBaseline`).

**Check status in the app:** the dashboard loads **`GET /api/agents/:id/test-baseline`** and shows baseline age and case count under Model & settings.

## 3. Compare on every change

Before shipping prompt or tool changes:

1. Check **Compare to baseline** and **Run evaluation**.
2. Review **regressions** vs **improvements** in the status line; the results panel lists which cases regressed or improved (input preview). Export JSON/Markdown if you need a paper trail.

Lean teams: see **`docs/FOUNDER-PATH.md`** for a short activation order (pack → test → go live → baseline → publish).

Enable **Run evaluation after saving blueprint** to get a quick **`evalOnSave`** summary on each save (uses server `OPENAI_API_KEY`).

## 4. CI / cron (optional)

Call the same endpoint your UI uses:

```http
POST /api/agents/YOUR_AGENT_ID/run-tests
Authorization: Bearer YOUR_API_KEY
Content-Type: application/json

{"suite":"scout-website-qa","compareBaseline":true}
```

- Create an API key in the app (**run** scope).
- Store results in your pipeline; fail the job if `passed < total` or `baselineCompare.regressions.length > 0`.

### GitHub Actions example

```yaml
name: Agent eval
on:
  schedule:
    - cron: "0 6 * * *" # 06:00 UTC daily
  workflow_dispatch:

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - name: Run golden tests
        env:
          TOKEN: ${{ secrets.FAM_AGENT_API_TOKEN }}
          AGENT_ID: ${{ secrets.FAM_AGENT_ID }}
          API_BASE: https://your-domain.com
        run: |
          curl -sS -X POST "$API_BASE/api/agents/$AGENT_ID/run-tests" \
            -H "Authorization: Bearer $TOKEN" \
            -H "Content-Type: application/json" \
            -d '{"suite":"scout-website-qa","compareBaseline":true}' \
            -o eval-result.json
          node -e "
            const j=require('./eval-result.json');
            if (j.total && j.passed < j.total) process.exit(1);
            if (j.baselineCompare?.regressions?.length) process.exit(1);
          "
```

## 5. When runs fail in production

1. **Dashboard → Recent failed runs** — each row links to **runbooks** and this doc.
2. **Reliability & incidents → Copy support bundle** — redacted JSON for support (`GET /api/agents/:id/support-bundle`).
3. **Operator runbooks:** [`/runbooks.html`](/runbooks.html) (Fathom, email-inbound, integration health, outbound email, eval).

## 6. Reference templates

| Template        | Suite (examples)      | Notes                                      |
|----------------|------------------------|--------------------------------------------|
| **Jordan** (Support) | `jordan-support`       | Full playbook + KB; see `JORDAN-PLAYBOOK.md` |
| **Scout** (Website QA) | `scout-website-qa`     | Trust-loop starter tasks + smoke cases     |

---

**Summary:** *Eval → baseline → compare on change → optional CI → runbooks + support bundle when things break.*