Evaluation
Evaluation starts with deterministic local gates. Memory benchmarks are useful only after the adapter contract and V1 data paths are stable.
Full Gate
Run the full local gate before finalizing broad implementation work:
    pnpm run ci
pnpm run ci runs the documentation freshness gate, linting, typechecking,
tests, builds, adapter-boundary checks, deterministic memory evals, and
report-only benchmark timings.
Python checks are adapter-boundary checks only. TypeScript owns memory ranking, schema, persistence, routing, verification policy, and product behavior.
Narrow Checks
| Change area | First check |
|---|---|
| Core/router behavior | pnpm --filter @agent-memory-os/core test |
| SQLite/schema behavior | pnpm --filter @agent-memory-os/sqlite test |
| CLI contracts | pnpm --filter @agent-memory-os/cli test |
| Eval or retrieval policy | pnpm eval and pnpm bench:ci |
| Hermes adapter | pnpm adapter:check |
| Documentation only | pnpm lint:repo; add pnpm docs:build for Starlight content, config, navigation, or rendered Mermaid diagram changes. |
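The table can also be read as a lookup from changed paths to first checks. The following is a minimal sketch of that idea, not a script the repository is known to ship. The commands are copied from the table; the path prefixes (in particular packages/core/ and packages/sqlite/) and the helper name firstChecksFor are assumptions made for illustration.

```typescript
type ChangeArea = "core" | "sqlite" | "cli" | "eval" | "hermes" | "docs";

// Commands copied from the Narrow Checks table.
const firstChecks: Record<ChangeArea, string[]> = {
  core: ["pnpm --filter @agent-memory-os/core test"],
  sqlite: ["pnpm --filter @agent-memory-os/sqlite test"],
  cli: ["pnpm --filter @agent-memory-os/cli test"],
  eval: ["pnpm eval", "pnpm bench:ci"],
  hermes: ["pnpm adapter:check"],
  docs: ["pnpm lint:repo"],
};

// ASSUMPTION: illustrative path prefixes; the real repository layout may differ.
const areaPrefixes: Array<[string, ChangeArea]> = [
  ["packages/core/", "core"],
  ["packages/sqlite/", "sqlite"],
  ["packages/cli/", "cli"],
  ["packages/evals/", "eval"],
  ["adapters/hermes/", "hermes"],
  ["apps/docs/", "docs"],
];

// Hypothetical helper: returns the de-duplicated first checks for a set of changed paths.
function firstChecksFor(changedPaths: string[]): string[] {
  const commands = new Set<string>();
  for (const path of changedPaths) {
    const match = areaPrefixes.find(([prefix]) => path.startsWith(prefix));
    if (!match) continue; // unknown area: fall back to the full gate manually
    for (const command of firstChecks[match[1]]) commands.add(command);
    // Starlight content, config, or navigation changes also need a docs build.
    if (match[1] === "docs") commands.add("pnpm docs:build");
  }
  return Array.from(commands);
}
```

Paths that match no prefix return no commands in this sketch; in that case running pnpm run ci is the safer default.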
The standalone commands remain available when you need to isolate failures:
    pnpm lint
    pnpm lint:packages
    pnpm lint:repo
    pnpm typecheck
    pnpm test
    pnpm eval
    pnpm bench:ci
    pnpm build
    pnpm adapter:check
    pnpm docs:check
    pnpm docs:build
    pnpm release:check
Documentation Freshness
Run the docs guard before finalizing implementation changes:
    pnpm docs:check
The guard fails when code, adapter, tooling, workflow, or policy changes land
without the matching documentation surface in the same diff. It checks changed
paths, validates .devin/wiki.json, and rejects stale canonical references to
deleted root docs/*.md files or Starlight links to source-file URLs.
Expected behavior:
- core, SQLite, eval, router, context-pack, verification, schema, and memory behavior changes require Starlight docs under apps/docs/src/content/docs/;
- CLI surface changes require the CLI reference and a human-facing or agent-facing summary when user-visible behavior changes;
- Hermes adapter, root plugin shim, and setup-skill changes require Hermes adapter or install docs;
- tooling and docs-guard policy changes require AGENTS.md, environment docs, or this evaluation page;
- Starlight content links must use published route paths, not .md source-file URLs or related typo variants;
- .devin/wiki.json must stay valid, use unique page titles, and reference existing priority files.
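The rules above pair a class of changed paths with a documentation surface that must appear in the same diff. The sketch below illustrates that pairing and the source-file link check. It is a simplified assumption of how such a guard could work, not the actual pnpm docs:check implementation: the trigger patterns, the hermes/install filename heuristic, and the names DocsRule, missingDocs, and sourceFileLinks are all hypothetical.

```typescript
interface DocsRule {
  name: string;
  triggers: (path: string) => boolean; // changed paths that activate the rule
  satisfiedBy: (path: string) => boolean; // changed paths that satisfy it
}

const DOCS_ROOT = "apps/docs/src/content/docs/";

// ASSUMPTION: trigger patterns are illustrative guesses at the repo layout.
const rules: DocsRule[] = [
  {
    name: "core/SQLite/eval changes need Starlight docs",
    triggers: (p) => /^packages\/(core|sqlite|evals)\//.test(p),
    satisfiedBy: (p) => p.startsWith(DOCS_ROOT),
  },
  {
    name: "Hermes adapter changes need adapter or install docs",
    triggers: (p) => p.startsWith("adapters/hermes/"),
    satisfiedBy: (p) => p.startsWith(DOCS_ROOT) && /hermes|install/.test(p),
  },
];

// Returns the names of rules that were triggered but not satisfied by the same diff.
function missingDocs(changedPaths: string[]): string[] {
  return rules
    .filter((rule) => changedPaths.some(rule.triggers))
    .filter((rule) => !changedPaths.some(rule.satisfiedBy))
    .map((rule) => rule.name);
}

// Returns Markdown link targets that end in .md (optionally with a #fragment),
// which this sketch treats as source-file URLs rather than published routes.
function sourceFileLinks(markdown: string): string[] {
  const targets: string[] = [];
  const linkPattern = /\]\(([^)\s]+)\)/g;
  let match: RegExpExecArray | null;
  while ((match = linkPattern.exec(markdown)) !== null) {
    const target = match[1] ?? "";
    if (/\.md(#.*)?$/.test(target)) targets.push(target);
  }
  return targets;
}
```

A real guard would also need to exempt external URLs that legitimately end in .md and to cover the typo variants mentioned above; this sketch does neither.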
The guard does not update DeepWiki. DeepWiki is a generated external index: keep
.devin/wiki.json current, then audit the refreshed wiki after merge before
claiming the public DeepWiki page is current.
Release Automation
Releases are Changesets-gated repository releases. Feature, fix, docs, adapter,
or tooling improvements that should appear in release notes need a changeset for
the root agent-memory-os package:
    pnpm changeset
Use patch for fixes and docs/tooling polish, minor for new capabilities, and
major only for breaking public behavior. PRs that intentionally should not
produce a release can use [no release] in the PR title or body.
The Release Metadata workflow runs pnpm release:check on pull requests. It
emits warnings for release-relevant changes without a changeset, but it does not
fail CI.
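The warn-but-do-not-fail behavior can be sketched as follows. This is an illustration under stated assumptions, not the code behind pnpm release:check: it assumes changesets are Markdown files under .changeset/ (the Changesets default), and the isReleaseRelevant pattern, the PullRequest shape, and the function names are hypothetical.

```typescript
interface PullRequest {
  title: string;
  body: string;
  changedPaths: string[];
}

interface ReleaseCheckResult {
  warnings: string[];
  exitCode: 0; // report-only: the check never fails CI
}

// Changesets stores pending changesets as Markdown files in .changeset/.
function hasChangeset(paths: string[]): boolean {
  return paths.some(
    (p) => p.startsWith(".changeset/") && p.endsWith(".md") && p !== ".changeset/README.md",
  );
}

// ASSUMPTION: which paths count as release-relevant is a guess for illustration.
function isReleaseRelevant(path: string): boolean {
  return /^(packages|adapters|apps\/docs)\//.test(path);
}

function releaseCheck(pr: PullRequest): ReleaseCheckResult {
  // The [no release] marker may appear in the PR title or body.
  const optedOut = `${pr.title}\n${pr.body}`.includes("[no release]");
  const relevant = pr.changedPaths.filter(isReleaseRelevant);
  const warnings: string[] = [];
  if (!optedOut && relevant.length > 0 && !hasChangeset(pr.changedPaths)) {
    warnings.push(
      `${relevant.length} release-relevant file(s) changed without a changeset; ` +
        "run pnpm changeset or add [no release].",
    );
  }
  return { warnings, exitCode: 0 };
}
```

Keeping the exit code fixed at 0 makes the check advisory: reviewers see the warning, but a missing changeset cannot block a merge.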
After the CI workflow succeeds on main, the Release workflow runs
changesets/action. Pending changesets create or update the Version Agent Memory OS PR. When that version PR merges, pnpm release:github reads the root
CHANGELOG.md, uses the current root package.json version, and creates a
single GitHub Release titled Agent Memory OS vX.Y.Z. The repo does not publish
packages to npm.
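As an illustration of the last step, the sketch below builds the release title and pulls the notes for one version out of a changelog. It assumes the Changesets convention of one "## X.Y.Z" heading per version; how pnpm release:github actually parses CHANGELOG.md is not stated here, and changelogSection and releaseTitle are hypothetical helper names.

```typescript
// Returns the body under the "## <version>" heading, or null if the version is absent.
function changelogSection(changelog: string, version: string): string | null {
  const lines = changelog.split("\n");
  const start = lines.findIndex((line) => line.trim() === `## ${version}`);
  if (start === -1) return null;
  // The section ends at the next version heading ("## ..."), or at end of file.
  let end = lines.findIndex((line, i) => i > start && line.startsWith("## "));
  if (end === -1) end = lines.length;
  return lines.slice(start + 1, end).join("\n").trim();
}

// Title format taken from the text above.
function releaseTitle(version: string): string {
  return `Agent Memory OS v${version}`;
}
```

Returning null for a missing version lets a caller stop before creating a release with empty notes.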
CLI Smoke Test
    pnpm --filter @agent-memory-os/cli build
    tmp_dir="$(mktemp -d)"
    echo "{\"dbPath\":\"$tmp_dir/memory.sqlite\"}" \
      | node packages/cli/dist/index.js seed-sample
    echo "{\"dbPath\":\"$tmp_dir/memory.sqlite\",\"query\":\"Biome formatter\",\"budgetTokens\":400}" \
      | node packages/cli/dist/index.js context-pack
Expected result:
- the seed-sample command returns sample core, event, fact, session-state, and workspace-resource records;
- the context-pack command returns contextPack;
- the context pack includes the seeded Biome memory and V1.1 projections;
- citations are present on included items.
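The expected results above can be turned into a small programmatic check on the context-pack output. The sketch below is hypothetical: only the contextPack key comes from the text, while the items, text, and citations field names and the smokeProblems helper are assumptions that would need to be matched to the real CLI response schema.

```typescript
// ASSUMPTION: hypothetical response shape; only `contextPack` is named in the docs.
interface PackItem {
  text?: unknown;
  citations?: unknown[];
}
interface PackResponse {
  contextPack?: { items?: PackItem[] };
}

// Returns a list of problems; an empty list means the smoke expectations hold.
function smokeProblems(rawStdout: string, expectedTerm: string): string[] {
  let parsed: PackResponse | null;
  try {
    parsed = JSON.parse(rawStdout) as PackResponse | null;
  } catch {
    return ["stdout is not valid JSON"];
  }
  const items = parsed?.contextPack?.items;
  if (!Array.isArray(items)) return ["contextPack.items is missing"];

  const problems: string[] = [];
  const hasSeed = items.some(
    (item) => typeof item.text === "string" && item.text.includes(expectedTerm),
  );
  if (!hasSeed) problems.push(`no item mentions "${expectedTerm}"`);
  items.forEach((item, index) => {
    if (!Array.isArray(item.citations) || item.citations.length === 0) {
      problems.push(`item ${index} has no citations`);
    }
  });
  return problems;
}
```

For example, smokeProblems(stdout, "Biome") would be called with the captured stdout of the context-pack command.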
Adapter Smoke Test
    python3 -m py_compile __init__.py
    python3 -m py_compile adapters/hermes/plugins/memory/meta_memory/__init__.py
    python3 -m unittest discover adapters/hermes/plugins/memory/meta_memory/test
    pnpm adapter:check
These commands are local adapter-boundary smoke proof. They cover root and
nested plugin.yaml manifest fixtures, absence of stale JSON metadata, root
shim delegation, initialize(...) bridge setup, register(ctx) wiring,
setup-skill registration when available, local pre_tool_call,
post_tool_call, and on_session_end hook fixtures, Hermes parameters tool
schemas, JSON-string tool results, progressive configuration status, and CLI
delegation.
Before real Hermes integration testing, verify the target Hermes plugin contract against a live checkout. Local fixture coverage does not prove Hermes runtime discovery, enablement, or hook execution.
Evals And Benchmarks
Run the quality eval:
    pnpm eval
Expected result:
- Vitest runs the benchmark fixture assertions.
- packages/evals/reports/eval.json is written.
- All deterministic quality metrics stay at their expected values.
Run timing benchmarks:
    pnpm bench
    pnpm bench:ci
Expected result:
- Vitest runs timing benchmarks for fixture seeding, SQLite FTS search, and context-pack generation.
- packages/evals/reports/benchmarks.json is written.
- pnpm bench:ci is report-only and should fail only when the benchmark harness itself fails.
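"Report-only" means that timings are recorded but never compared against a threshold, so only a harness error produces a failing exit code. The sketch below illustrates that contract. It does not reflect the actual Vitest-based harness; BenchCase, BenchResult, and runReportOnly are hypothetical names.

```typescript
interface BenchCase {
  name: string;
  run: () => void;
}

interface BenchResult {
  timings: Array<{ name: string; ms: number }>;
  exitCode: 0 | 1;
  error?: string;
}

// Runs each case, records elapsed time, and fails only if a case throws.
// The clock is injectable so the behavior can be tested deterministically.
function runReportOnly(cases: BenchCase[], now: () => number = Date.now): BenchResult {
  const timings: BenchResult["timings"] = [];
  for (const benchCase of cases) {
    const start = now();
    try {
      benchCase.run();
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      return { timings, exitCode: 1, error: `${benchCase.name}: ${message}` };
    }
    // No threshold check: a slow case is reported, not failed.
    timings.push({ name: benchCase.name, ms: now() - start });
  }
  return { timings, exitCode: 0 };
}
```

Because nothing compares ms against a limit, machine-dependent noise cannot fail the gate; a regression policy, if wanted, would sit in a separate step that reads the report.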
Major memory additions must add or update at least one benchmark case when they change retrieval, context packing, write preservation, verification policy, or local storage behavior that affects memory quality. V2 Core additions should include a deterministic recall-safety case for citation, temporal relation, or drift warnings.
Current Coverage
Current local tests cover:
- FTS search returns evidence and facts.
- FTS search returns session state and workspace resources.
- Context packing respects token budgets and auto/task/workspace policy boundaries.
- Context packs preserve risky items while adding V2 Core recall warnings.
- SQLite stores and probes scoped temporal relations.
- SQLite emits missing-source citation warnings.
- CLI add-relation, probe-relations, and workspace drift warnings.
- CLI seed and context-pack smoke behavior.
- Hermes adapter metadata uses plugin.yaml.
- Adapter initialize(...) and register(ctx) wire the provider, V1/V1.1/V2 Core tools, setup skill, and lifecycle hooks in a local fixture.
- Adapter pre_tool_call and post_tool_call record tool-call and tool-result evidence without blocking Hermes tool execution when hook writes cannot run.
- Adapter status reports missing CLI, env CLI, repo-local CLI fallback, and selected database path without calling the CLI.
- Adapter tool schemas use the Hermes parameters shape.
- Adapter tool handlers return JSON strings.
- Adapter tool-call arguments cannot override configured META_MEMORY_DB.
- Adapter passes workspacePath through to context_pack.
- Adapter CLI failure handling.
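To make the first two coverage items concrete, the sketch below shows one way a packer could respect a token budget while keeping risky items and attaching a warning instead of dropping them. It is an illustrative assumption, not the project's ranking or packing logic: the 4-characters-per-token estimate, the Candidate shape, the warning text, and packContext itself are all hypothetical.

```typescript
interface Candidate {
  id: string;
  text: string;
  score: number;
  risky?: boolean; // e.g. possibly stale or missing a source
}

interface Packed {
  items: Candidate[];
  warnings: string[];
  usedTokens: number;
}

// ASSUMPTION: crude estimate of roughly 4 characters per token.
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

// Greedy packing: take candidates by descending score, skipping any that would exceed the budget.
function packContext(candidates: Candidate[], budgetTokens: number): Packed {
  const ranked = [...candidates].sort((a, b) => b.score - a.score);
  const packed: Packed = { items: [], warnings: [], usedTokens: 0 };
  for (const candidate of ranked) {
    const cost = estimateTokens(candidate.text);
    if (packed.usedTokens + cost > budgetTokens) continue;
    packed.items.push(candidate);
    packed.usedTokens += cost;
    // Risky items are preserved, but flagged so the consumer can verify them.
    if (candidate.risky) packed.warnings.push(`recall warning: verify ${candidate.id} before use`);
  }
  return packed;
}
```

A test for the budget property then only needs to assert that usedTokens never exceeds budgetTokens and that every risky item in items has a matching warning.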
Future Gaps
Future local feature work should add tests for:
- broader write-path coverage for derived facts beyond current projection writes;
- clear failures when adapter subprocesses time out;
- real Hermes plugin discovery, hook execution, and tool schema registration for a target Hermes version.
Future benchmark tracks should cover conversational recall, temporal reasoning, validity-window conflict handling, reflection, handoff recovery, and richer coding-agent environment experience.
The key acceptance criterion is not raw recall alone. The system must avoid stale, irrelevant, unsupported, or branch-invalid injection, and it must separate write-side preservation failures from retrieval failures.
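One way to keep these failure modes apart in an eval is to compare each case against both the store and the packed context. The sketch below is a hypothetical illustration of that separation, not the project's eval code; the EvalCase shape, the failure labels, and classifyFailures are assumptions.

```typescript
type Failure = "write-preservation" | "retrieval" | "unsafe-injection";

interface EvalCase {
  expectedIds: string[]; // memories that should be recalled
  forbiddenIds: string[]; // stale, irrelevant, unsupported, or branch-invalid memories
}

// storedIds: what the write path preserved; packedIds: what retrieval injected.
function classifyFailures(evalCase: EvalCase, storedIds: string[], packedIds: string[]): Failure[] {
  const stored = new Set(storedIds);
  const packed = new Set(packedIds);
  const failures = new Set<Failure>();
  for (const id of evalCase.expectedIds) {
    if (!stored.has(id)) {
      failures.add("write-preservation"); // never stored, so retrieval cannot be blamed
    } else if (!packed.has(id)) {
      failures.add("retrieval"); // stored but not recalled
    }
  }
  if (evalCase.forbiddenIds.some((id) => packed.has(id))) {
    failures.add("unsafe-injection");
  }
  return Array.from(failures);
}
```

A case with perfect recall can still fail here through unsafe-injection, which is the point of not scoring on raw recall alone.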