Interpreting results

Report retrieval quality separately from autonomous coding success; a result on one is not evidence about the other.

What to measure

Tie every evaluation claim to its artifacts, suite version, commit, model/provider, and configuration. Where possible, pair no-memory and memory-enabled variants of the same run so that any difference can be attributed to memory and not to some other change.
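One way to keep claims tied to their provenance is to store a small record next to each run's artifacts and check that two runs differ only in the memory condition before comparing them. The following is a minimal sketch under that assumption: `RunProvenance`, its field names, and `is_valid_pair` are hypothetical and are not part of the tool's schema.

```python
from dataclasses import dataclass, asdict
import hashlib
import json


@dataclass
class RunProvenance:
    """Hypothetical provenance record; field names are illustrative only."""
    suite: str
    suite_version: str
    commit: str
    provider: str
    model: str
    condition: str        # e.g. "no-memory" or "full-memory"
    config: dict
    artifacts_dir: str

    def pairing_key(self) -> str:
        """Hash every field except the condition and the artifact location.

        Two runs share a key only if suite, version, commit, model/provider,
        and configuration are all identical.
        """
        fields = asdict(self)
        fields.pop("condition")
        fields.pop("artifacts_dir")
        blob = json.dumps(fields, sort_keys=True)
        return hashlib.sha256(blob.encode("utf-8")).hexdigest()[:12]


def is_valid_pair(a: RunProvenance, b: RunProvenance) -> bool:
    """True when the two runs differ in memory condition and nothing else."""
    return a.condition != b.condition and a.pairing_key() == b.pairing_key()
```

Comparing a no-memory run against a memory-enabled run from a different commit or configuration would fail this check, which is the point: the difference could no longer be attributed to memory alone.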

How to use it

Start with a dry run, and run a real suite only after you have reviewed its scripts and fixtures. Keep the raw outputs and compare results item by item, not only in aggregate.
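Item-level comparison matters because two runs can have the same aggregate pass rate while passing different items. The sketch below classifies each item by how its outcome changed between a baseline run and a memory-enabled run. It assumes each run has already been reduced to a mapping from item ID to pass/fail; that shape and the function name `compare_items` are assumptions for illustration, not the tool's output format.

```python
from collections import Counter


def compare_items(baseline: dict[str, bool], treatment: dict[str, bool]) -> Counter:
    """Classify each item present in both runs by how its outcome changed."""
    outcomes = Counter()
    for item_id in sorted(baseline.keys() & treatment.keys()):
        before, after = baseline[item_id], treatment[item_id]
        if before and after:
            outcomes["both_pass"] += 1
        elif not before and not after:
            outcomes["both_fail"] += 1
        elif after:
            outcomes["fixed"] += 1
        else:
            outcomes["regressed"] += 1
    # Items missing from one side are counted, not silently dropped.
    outcomes["unpaired"] = len(baseline.keys() ^ treatment.keys())
    return outcomes


if __name__ == "__main__":
    # Made-up outcomes for illustration only; these are not real results.
    no_memory = {"a": True, "b": False, "c": False, "d": True}
    full_memory = {"a": True, "b": True, "c": False, "d": False, "e": True}
    print(dict(compare_items(no_memory, full_memory)))
```

In the made-up data, both runs pass two of the four shared items, yet one item was fixed and one regressed. A nonzero `unpaired` count is a signal that the two runs did not cover the same items and should not be compared directly.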

Verify

memory eval run --suite evals/examples/memory-smoke --condition full-memory --profile offline --dry-run

Next

Read Run evaluations, Metrics, and Limitations.
