
GenAI Workflow Architecture

Status: active meta workflow
Role: Runbook / operational
Last updated: 2026-03-21
Last verified: 2026-03-21 workflow-doc review + package/CI inventory check
Purpose: meta workflow contract for how GenAI-driven implementation work should interact with repo safeties, harnesses, and verification evidence
Source-of-truth: meta workflow contract; defer implementation truth to code, AGENTS.md, and ai_workflow.md.

Purpose

This document defines how GenAI-driven implementation work should operate in this repository.

It exists to keep three things explicit:

This document extends the existing rulegen/POS quality loop. It does not replace AGENTS.md or ai_workflow.md.

Existing Infrastructure To Preserve

These files are already part of the working system and remain canonical:

  1. ../../AGENTS.md
    • Required rulegen/POS benchmark -> gate -> triage loop.
  2. ai_workflow.md
    • Rulegen/POS-specific iteration workflow.
  3. feature_state_matrix.md
    • Dated feature-state ledger and mismatch tracker.
  4. ../../scripts/testing/rulegen_pair_audit_cycle.py
    • Existing focused wrapper around the canonical rulegen loop.
  5. ../../scripts/testing/rulegen_auto_audit.py
    • Change-aware wrapper that infers touched pairs and manages dated plus *_latest artifacts.

If a future meta workflow conflicts with those files, resolve the conflict explicitly in the affected documents rather than silently assuming the newer text wins.

Operating Principles

  1. Source code remains the primary truth.
  2. Docs must distinguish planned, scaffolded, implemented, default-on, and verified.
  3. Quality-affecting changes require explicit artifacts, not just narrative claims.
  4. Wrapper scripts may orchestrate the canonical loop, but they must not hide or replace the underlying commands.
  5. Known doc/code mismatches stay visible until they are resolved and re-verified.
  6. GenAI should propose and implement changes; harnesses and independent review should certify them.

Agent Roles

Use small, bounded roles rather than one broad autonomous agent.

1. Researcher

2. Planner

3. Implementer

4. Reviewer

5. Evaluator

6. Curator

When To Use Separate Model Instances

Use a fresh model instance when independence matters more than shared context.

Separate instance required

  1. Review after non-trivial implementation.
  2. Benchmark triage and benchmark-label suggestions.
  3. Benchmark policy, baseline, or grader changes.
  4. Prompt or workflow-contract changes that affect many future runs.
  5. High-risk ranking or SRS logic changes.
  6. Any model-assisted judging/grading flow.

Same instance is acceptable

  1. Small local refactors with deterministic tests.
  2. One-file doc updates.
  3. Narrow tooling changes with obvious CLI verification.
  4. Cases where strict automated checks fully determine success.
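
The two lists above can be read as a simple lookup. The following is a minimal sketch, not part of the repository: the `Work` enum, its member names, and `needs_fresh_instance` are hypothetical names introduced here to mirror the list items one-to-one.

```python
from enum import Enum, auto


class Work(Enum):
    """Hypothetical labels mirroring the two lists above."""
    # Separate instance required
    REVIEW_AFTER_NONTRIVIAL_IMPLEMENTATION = auto()
    BENCHMARK_TRIAGE_OR_LABEL_SUGGESTIONS = auto()
    BENCHMARK_POLICY_BASELINE_OR_GRADER_CHANGE = auto()
    PROMPT_OR_WORKFLOW_CONTRACT_CHANGE = auto()
    HIGH_RISK_RANKING_OR_SRS_CHANGE = auto()
    MODEL_ASSISTED_JUDGING = auto()
    # Same instance is acceptable
    SMALL_LOCAL_REFACTOR = auto()
    ONE_FILE_DOC_UPDATE = auto()
    NARROW_TOOLING_CHANGE = auto()
    FULLY_DETERMINED_BY_AUTOMATED_CHECKS = auto()


SAME_INSTANCE_OK = frozenset({
    Work.SMALL_LOCAL_REFACTOR,
    Work.ONE_FILE_DOC_UPDATE,
    Work.NARROW_TOOLING_CHANGE,
    Work.FULLY_DETERMINED_BY_AUTOMATED_CHECKS,
})


def needs_fresh_instance(work: Work) -> bool:
    """Return True unless the work is explicitly listed as same-instance safe."""
    return work not in SAME_INSTANCE_OK
```

Listing the acceptable cases and treating everything else as requiring a fresh instance is a design assumption of this sketch: it keeps the default on the side of independence.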

Harness Architecture

Treat the harness as product infrastructure.

Every meaningful quality run should answer:

  1. What changed?
  2. What was expected to improve?
  3. Did quality move?
  4. Did regressions appear?
  5. Can the result be trusted?
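
One way to make those five questions hard to skip is to record them as required fields of a run summary. The sketch below is illustrative only: `QualityRunSummary` and its field names are assumptions, not an existing schema in this repository.

```python
from dataclasses import dataclass, fields


@dataclass
class QualityRunSummary:
    """Hypothetical record with one free-text field per question above."""
    what_changed: str = ""
    expected_improvement: str = ""
    quality_delta: str = ""
    regressions: str = ""
    trust_basis: str = ""

    def unanswered(self) -> list[str]:
        """Names of questions that are still blank, in declaration order."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]
```

A wrapper could refuse to publish a run whose `unanswered()` list is non-empty, but that enforcement point is a suggestion rather than current behavior.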

Layer 1: Fast deterministic checks

Layer 2: Capability evals

Layer 3: Regression evals

Layer 4: Harness health checks

Layer 5: Artifact and transcript review

Rulegen Workflow

Canonical commands

The canonical rulegen loop remains the benchmark -> quality gate -> triage sequence in ../../AGENTS.md.

Preferred wrappers

Use these wrappers when they fit the change:

python3 scripts/testing/rulegen_pair_audit_cycle.py --pairs en-es
python3 scripts/testing/rulegen_auto_audit.py --base-ref origin/main
python3 scripts/testing/rulegen_auto_audit.py --pairs en-es --reverse-check-profile experiment --strict-gate

Wrapper policy:

Use the canonical commands directly when:

Artifact Policy

For generated quality artifacts:

  1. Write immutable dated outputs for every meaningful run.
  2. Update *_latest aliases only from the same run.
  3. Store run provenance with the artifacts when possible.
  4. Treat runs as non-comparable when policy, baseline, benchmark labels, or grading semantics changed.
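
The four rules above could be sketched roughly as follows. This is not the behavior of the existing wrappers: the file naming (`<stem>_<run_id>.json`, `<stem>_latest.json`), the provenance keys, and both function names are assumptions made for illustration.

```python
import json
import shutil
from pathlib import Path

# Hypothetical provenance keys; the real artifact schema may differ.
COMPARABILITY_KEYS = ("policy", "baseline", "benchmark_labels", "grading_semantics")


def write_run_artifact(out_dir: Path, stem: str, run_id: str,
                       report: dict, provenance: dict) -> Path:
    """Write an immutable dated artifact, then refresh the *_latest alias from it."""
    out_dir.mkdir(parents=True, exist_ok=True)
    dated = out_dir / f"{stem}_{run_id}.json"
    payload = {"provenance": provenance, "report": report}
    # Mode "x" raises FileExistsError if the file exists, so a dated
    # output is never silently overwritten.
    with dated.open("x", encoding="utf-8") as fh:
        json.dump(payload, fh, indent=2, sort_keys=True)
    # The alias is a copy of the file this run just wrote, never of another run.
    shutil.copyfile(dated, out_dir / f"{stem}_latest.json")
    return dated


def comparable(prov_a: dict, prov_b: dict) -> bool:
    """Two runs are comparable only if every comparability key matches."""
    return all(prov_a.get(k) == prov_b.get(k) for k in COMPARABILITY_KEYS)
```

Storing provenance inside the same file as the report is one option; a sidecar file would satisfy rule 3 equally well.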

Recommended artifact pattern:

Repository Safety Commands

Run these general repo safety checks before feature work expands:

npm --prefix scripts run check
npm --prefix scripts run build

Current intent:

When a workflow stage is consumed by another agent step, prefer the JSON-report variants over scraping terminal output. When branch-scope change reports are dominated by earlier branch work, use check:changed:local or check:changed:staged for the current edit loop and keep check:changed as the broader integration signal.

Feature-State Discipline

Use feature_state_matrix.md as the repo ledger for workflow state.

Update it when:

  1. a feature moves between planned, scaffolded, implemented, default-on, or verified,
  2. default behavior changes,
  3. an artifact becomes the new verification point,
  4. a doc/code mismatch is discovered or resolved,
  5. a strategy becomes executable rather than planning-only.

Do not collapse these states into a single “done” label.
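
To show what keeping the states distinct might look like in tooling, here is a minimal sketch. `FeatureState`, `LedgerEntry`, and `describe` are hypothetical names, the evidence string is a placeholder, and the actual layout of feature_state_matrix.md is not assumed.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum


class FeatureState(Enum):
    """The five states named above, kept as separate values."""
    PLANNED = "planned"
    SCAFFOLDED = "scaffolded"
    IMPLEMENTED = "implemented"
    DEFAULT_ON = "default-on"
    VERIFIED = "verified"


@dataclass(frozen=True)
class LedgerEntry:
    feature: str
    state: FeatureState
    as_of: date      # exact date the state was recorded
    evidence: str    # artifact reference or note (placeholder in this sketch)


def describe(entry: LedgerEntry) -> str:
    """Render one ledger line with the state and its exact date."""
    return (f"{entry.feature}: {entry.state.value} "
            f"(as of {entry.as_of.isoformat()}); evidence: {entry.evidence}")
```

For example, `describe(LedgerEntry("reverse-check", FeatureState.IMPLEMENTED, date(2026, 3, 21), "placeholder"))` yields a line that says "implemented" without implying "default-on" or "verified".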

Human Escalation Points

Require explicit human review or sign-off for:

  1. benchmark baseline changes,
  2. quality policy threshold changes,
  3. benchmark label updates that redefine expected quality,
  4. release-default toggles,
  5. harness grader changes,
  6. destructive migrations or data resets.
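
An agent step could check a proposed change against this list before proceeding. The sketch below assumes changes are described by free-form string tags; the tag names and `escalation_reasons` are hypothetical and exist only to mirror the six items above.

```python
from typing import Iterable

# Hypothetical tags, one per escalation point listed above.
ESCALATION_TRIGGERS = frozenset({
    "benchmark_baseline_change",
    "quality_policy_threshold_change",
    "benchmark_label_redefinition",
    "release_default_toggle",
    "harness_grader_change",
    "destructive_migration_or_data_reset",
})


def escalation_reasons(change_tags: Iterable[str]) -> list[str]:
    """Return the tags that require human sign-off, sorted for stable output."""
    return sorted(ESCALATION_TRIGGERS.intersection(change_tags))
```

An empty result means none of the listed triggers applies; it does not by itself certify the change.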

Current Repository Mismatches To Preserve Explicitly

These are active mismatches, not wording accidents:

  1. Reverse-check is implemented and tunable, but not yet default-on.
  2. SRS docs define due-aware serving, but current helper publication and runtime gating are not yet verified as due-aware end to end.
  3. Rulegen docs describe runtime confidence filtering, but extension-side helper-rule confidence gating is not yet verified in code.
  4. SRS planner docs describe multiple strategies, but executable behavior is still dominated by frequency_bootstrap.

These items should remain visible in feature_state_matrix.md until code, docs, and artifacts converge.

Near-Term Meta Priorities

  1. Keep reverse-check experiments in the standard audit loop while the feature is being tuned.
  2. Keep the SRS synthetic quality harness in the loop for scheduler/admission/publication changes and expand it as due-aware serving becomes executable.
  3. Preserve a fresh-reviewer step for ranking, SRS scheduler, and harness changes.
  4. Keep feature-state dates exact so future agents can recover current behavior quickly.