GenAI Workflow Architecture
Status: active meta workflow
Role: Runbook / operational
Last updated: 2026-03-21
Last verified: 2026-03-21 workflow-doc review + package/CI inventory check
Purpose: meta workflow contract for how GenAI-driven implementation work should interact with repo safeties, harnesses, and verification evidence
Source-of-truth: meta workflow contract; defer implementation truth to code, AGENTS.md, and ai_workflow.md.
Purpose
This document defines how GenAI-driven implementation work should operate in this repository.
It exists to keep three things explicit:
- which workflow is canonical,
- when independent model instances are required,
- which evidence is needed before a feature can be treated as shipped.
This document extends the existing rulegen/POS quality loop. It does not replace AGENTS.md or ai_workflow.md.
Existing Infrastructure To Preserve
These files are already part of the working system and remain canonical:
- ../../AGENTS.md: Required rulegen/POS benchmark -> gate -> triage loop.
- ai_workflow.md: Rulegen/POS-specific iteration workflow.
- feature_state_matrix.md: Dated feature-state ledger and mismatch tracker.
- ../../scripts/testing/rulegen_pair_audit_cycle.py: Existing focused wrapper around the canonical rulegen loop.
- ../../scripts/testing/rulegen_auto_audit.py: Change-aware wrapper that infers touched pairs and manages dated plus *_latest artifacts.
If a future meta workflow conflicts with those files, update the conflicting text explicitly rather than silently assuming the newer text wins.
Operating Principles
- Source code remains the primary truth.
- Docs must distinguish planned, scaffolded, implemented, default-on, and verified.
- Quality-affecting changes require explicit artifacts, not just narrative claims.
- Wrapper scripts may orchestrate the canonical loop, but they must not hide or replace the underlying commands.
- Known doc/code mismatches stay visible until they are resolved and re-verified.
- GenAI should propose and implement changes; harnesses and independent review should certify them.
Agent Roles
Use small, bounded roles rather than one broad autonomous agent.
1. Researcher
- Reads only the necessary docs, code, and current artifacts.
- Produces:
  - current behavior summary,
  - relevant source-of-truth files,
  - last verified artifact paths,
  - open contradictions.
2. Planner
- Converts the state summary into a bounded change plan.
- Must define:
  - acceptance criteria,
  - touched language pairs,
  - required harness runs,
  - rollback conditions,
  - whether a fresh reviewer instance is required.
3. Implementer
- Makes the code or doc changes only.
- Must not silently update:
  - benchmark baselines,
  - benchmark labels,
  - grading policy,
  - release defaults.
4. Reviewer
- Uses a fresh model instance for non-trivial work.
- Focuses on:
  - bugs,
  - regressions,
  - missing tests,
  - unsafe assumptions,
  - harness blind spots.
5. Evaluator
- Runs the harness and reports what changed.
- Must separate:
  - capability movement,
  - regression risk,
  - harness ambiguity,
  - non-comparable runs.
6. Curator
- Converts failures and newly learned facts into durable repo state.
- Typical outputs:
  - benchmark case updates,
  - triage notes,
  - feature_state_matrix.md updates,
  - follow-up workstream docs.
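The Planner's required outputs can be pictured as a small record. The sketch below is illustrative only: ChangePlan and missing_fields are hypothetical names, not part of this repository, and treating an empty list as "not defined" is an assumption.

```python
from dataclasses import dataclass


@dataclass
class ChangePlan:
    """Hypothetical record of what the Planner role must define."""

    acceptance_criteria: list[str]
    touched_pairs: list[str]  # may be empty when no language pair is touched (assumption)
    required_harness_runs: list[str]
    rollback_conditions: list[str]
    fresh_reviewer_required: bool

    def missing_fields(self) -> list[str]:
        """Return the names of required list fields that were left empty."""
        required = {
            "acceptance_criteria": self.acceptance_criteria,
            "required_harness_runs": self.required_harness_runs,
            "rollback_conditions": self.rollback_conditions,
        }
        return [name for name, value in required.items() if not value]


plan = ChangePlan(
    acceptance_criteria=["en-es benchmark does not regress"],
    touched_pairs=["en-es"],
    required_harness_runs=["rulegen_pair_audit_cycle.py --pairs en-es"],
    rollback_conditions=[],
    fresh_reviewer_required=True,
)
print(plan.missing_fields())  # ['rollback_conditions']
```

A plan with a non-empty result would go back to the Planner before the Implementer starts.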
When To Use Separate Model Instances
Use a fresh model instance when independence matters more than shared context.
Separate instance required
- Review after non-trivial implementation.
- Benchmark triage and benchmark-label suggestions.
- Benchmark policy, baseline, or grader changes.
- Prompt or workflow-contract changes that affect many future runs.
- High-risk ranking or SRS logic changes.
- Any model-assisted judging/grading flow.
Same instance is acceptable
- Small local refactors with deterministic tests.
- One-file doc updates.
- Narrow tooling changes with obvious CLI verification.
- Cases where strict automated checks fully determine success.
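The two lists above amount to a membership check over change categories. A minimal sketch, assuming a change can be tagged with one or more categories; the ChangeKind names and needs_fresh_instance are hypothetical and only paraphrase the bullets.

```python
from enum import Enum


class ChangeKind(Enum):
    # Categories from the "Separate instance required" list.
    NON_TRIVIAL_IMPLEMENTATION = "non_trivial_implementation"
    BENCHMARK_TRIAGE_OR_LABELS = "benchmark_triage_or_labels"
    BENCHMARK_POLICY_BASELINE_GRADER = "benchmark_policy_baseline_grader"
    PROMPT_OR_WORKFLOW_CONTRACT = "prompt_or_workflow_contract"
    HIGH_RISK_RANKING_OR_SRS = "high_risk_ranking_or_srs"
    MODEL_ASSISTED_GRADING = "model_assisted_grading"
    # Categories from the "Same instance is acceptable" list.
    SMALL_LOCAL_REFACTOR = "small_local_refactor"
    ONE_FILE_DOC_UPDATE = "one_file_doc_update"
    NARROW_TOOLING_CHANGE = "narrow_tooling_change"
    FULLY_AUTOMATED_CHECK = "fully_automated_check"


FRESH_INSTANCE_REQUIRED = frozenset(
    {
        ChangeKind.NON_TRIVIAL_IMPLEMENTATION,
        ChangeKind.BENCHMARK_TRIAGE_OR_LABELS,
        ChangeKind.BENCHMARK_POLICY_BASELINE_GRADER,
        ChangeKind.PROMPT_OR_WORKFLOW_CONTRACT,
        ChangeKind.HIGH_RISK_RANKING_OR_SRS,
        ChangeKind.MODEL_ASSISTED_GRADING,
    }
)


def needs_fresh_instance(kinds: list[ChangeKind]) -> bool:
    """A single 'required' category is enough to demand a fresh instance."""
    return any(kind in FRESH_INSTANCE_REQUIRED for kind in kinds)


print(needs_fresh_instance([ChangeKind.ONE_FILE_DOC_UPDATE]))  # False
print(
    needs_fresh_instance(
        [ChangeKind.ONE_FILE_DOC_UPDATE, ChangeKind.PROMPT_OR_WORKFLOW_CONTRACT]
    )
)  # True
```

Letting the stricter category win for mixed changes is a design choice of this sketch, not a rule stated elsewhere in the repo.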
Harness Architecture
Treat the harness as product infrastructure.
Every meaningful quality run should answer:
- What changed?
- What was expected to improve?
- Did quality move?
- Did regressions appear?
- Can the result be trusted?
Layer 1: Fast deterministic checks
- Targeted unit tests for changed modules.
- Syntax checks.
- Serialization and schema checks.
Layer 2: Capability evals
- Benchmark sweeps for touched pairs.
- Reverse-check off/on experiments when ranking behavior changes.
- Pair-specific summary metrics and top-run comparisons.
Layer 3: Regression evals
- Fixed must-pass cases for previously resolved failures.
- Stable case sets kept separate from exploratory capability sweeps.
Layer 4: Harness health checks
- Saturation warnings.
- Ambiguous or low-sensitivity sweeps.
- Changed grader/policy/baseline detection.
Layer 5: Artifact and transcript review
- Read failure cases and suspicious wins directly.
- Promote durable failures into benchmark or workstream artifacts.
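One way to keep the five questions attached to each run is a small report record. This is a sketch under stated assumptions: QualityRunReport, its field names, and the verdict labels are hypothetical, and the rule that any Layer 4 warning makes a result untrusted is an illustration rather than existing harness behavior.

```python
from dataclasses import dataclass, field


@dataclass
class QualityRunReport:
    """Hypothetical record answering the five questions for one quality run."""

    what_changed: str
    expected_improvement: str
    quality_improved: bool
    regressions: list[str] = field(default_factory=list)
    # Layer 4 findings: saturation, ambiguous sweeps, changed grader/policy/baseline.
    health_warnings: list[str] = field(default_factory=list)

    def can_be_trusted(self) -> bool:
        """Assumption: any harness health warning makes the result untrusted."""
        return not self.health_warnings

    def verdict(self) -> str:
        """Trust is checked first so an untrusted win is never reported as a win."""
        if not self.can_be_trusted():
            return "untrusted"
        if self.regressions:
            return "regressed"
        return "improved" if self.quality_improved else "no-change"


report = QualityRunReport(
    what_changed="ranking tweak",
    expected_improvement="en-es top-run quality",
    quality_improved=True,
    health_warnings=["saturated sweep"],
)
print(report.verdict())  # untrusted
```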
Rulegen Workflow
Canonical commands
The canonical rulegen loop remains the benchmark -> quality gate -> triage sequence in ../../AGENTS.md.
Preferred wrappers
Use these wrappers when they fit the change:
python3 scripts/testing/rulegen_pair_audit_cycle.py --pairs en-es
python3 scripts/testing/rulegen_auto_audit.py --base-ref origin/main
python3 scripts/testing/rulegen_auto_audit.py --pairs en-es --reverse-check-profile experiment --strict-gate
Wrapper policy:
- rulegen_pair_audit_cycle.py remains the focused orchestration layer.
- rulegen_auto_audit.py adds:
  - changed-file pair inference,
  - dated artifact paths,
  - *_latest alias updates,
  - manifest output for run provenance.
- rulegen_benchmark_summary.py, rulegen_quality_gate_summary.py, and rulegen_benchmark_triage_summary.py are the preferred human-facing renderers for the latest rulegen artifacts in CI and handoff docs.
- srs_quality_harness.py is the current synthetic SRS quality surface for bootstrap/publication/runtime behavior.
- srs_quality_summary.py is the preferred human-facing renderer for SRS quality JSON in CI and handoff docs.
Use the canonical commands directly when:
- wrapper defaults are not appropriate,
- you need full control over artifacts or sweep inputs,
- pair inference would be ambiguous.
Artifact Policy
For generated quality artifacts:
- Write immutable dated outputs for every meaningful run.
- Update *_latest aliases only from the same run.
- Store run provenance with the artifacts when possible.
- Treat runs as non-comparable when policy, baseline, benchmark labels, or grading semantics changed.
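The non-comparability rule can be checked mechanically if each run records an identifier for the four inputs it names. A minimal sketch, assuming such identifiers exist; RunContext and its fields are hypothetical and do not describe the current manifest format.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RunContext:
    """Hypothetical identifiers for the inputs that define comparability."""

    policy: str
    baseline: str
    benchmark_labels: str
    grading_semantics: str


def non_comparable_reasons(previous: RunContext, current: RunContext) -> list[str]:
    """List every input that differs; an empty list means the runs are comparable."""
    return [
        f.name
        for f in fields(RunContext)
        if getattr(previous, f.name) != getattr(current, f.name)
    ]


before = RunContext("policy-a", "baseline-1", "labels-1", "grader-1")
after = RunContext("policy-a", "baseline-2", "labels-1", "grader-1")
print(non_comparable_reasons(before, after))  # ['baseline']
```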
Recommended artifact pattern:
- dated files: ..._2026-03-11.json
- latest aliases: ..._latest.json
- manifest: rulegen_auto_audit_<pair_suffix>_2026-03-11.json
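The dated-plus-latest pattern can be sketched as follows. This is not how rulegen_auto_audit.py is implemented: write_run_artifact and the example_summary stem are hypothetical, the alias is modeled as a file copy rather than a symlink, and day-level dates mean a second run on the same day is rejected instead of overwriting.

```python
import json
import shutil
import tempfile
from datetime import date
from pathlib import Path


def write_run_artifact(
    out_dir: Path, stem: str, payload: dict, run_date: date
) -> tuple[Path, Path]:
    """Write an immutable dated JSON file, then refresh the latest alias from it."""
    out_dir.mkdir(parents=True, exist_ok=True)
    dated = out_dir / f"{stem}_{run_date.isoformat()}.json"
    latest = out_dir / f"{stem}_latest.json"
    # Mode "x" raises FileExistsError instead of overwriting an earlier run.
    with dated.open("x", encoding="utf-8") as handle:
        json.dump(payload, handle, indent=2, sort_keys=True)
    # The alias is copied from the file just written, so it always matches this run.
    shutil.copyfile(dated, latest)
    return dated, latest


with tempfile.TemporaryDirectory() as tmp:
    dated, latest = write_run_artifact(
        Path(tmp), "example_summary", {"pairs": ["en-es"]}, date(2026, 3, 11)
    )
    print(dated.name, latest.name)
```

Because the alias is only written after the dated file succeeds, a failed or rejected run leaves the previous *_latest file untouched.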
Repository Safety Commands
Use these commands for general repo safeties before feature work expands:
npm --prefix scripts run check
npm --prefix scripts run build
Current intent:
- check is stable and non-mutating.
- check:changed is the preferred branch-scope workflow command before heavier quality work.
- check:docs / check:docs:report are the dedicated canonical-doc integrity surfaces; check:changed reruns that audit when canonical docs or referenced source files change materially.
- build is a local build smoke for maintained build surfaces.
- build:report is the full build contract; hosted macOS CI should use it directly instead of a custom job-local command.
- build:ci / build:ci:report are the explicit non-macOS hosted-runner build surfaces; they use the same script with explicit unsupported-surface skips instead of a separate CI-only build policy.
- check now includes the strict Windows parity audit; check:windows:parity remains the dedicated inventory/report command, check:changed runs it automatically when parity-related files change, and Windows CI uses the strict variant to fail regressions.
- check now enforces repo-wide Ruff style directly. check:style, check:style:report, and check:style:summary remain useful when you want a style-only loop or dedicated style artifacts.
- check:changed now tracks both total changed files and substantive changed files; heavy follow-on loops should key off the substantive set so Python AST-equivalent churn, JSON pretty-print churn, and Markdown/text reflow do not force unnecessary audits.
- check:changed also carries changed-only project-health baseline gating, including new/regressed warning debt, instead of treating maintainability drift as a separate review surface.
- check:report, check:changed:report, and build:report are the machine-readable workflow surfaces for automation and agent hand-offs.
- check:state audits the feature ledger so status claims stay backed by dated evidence paths, and compares against HEAD to catch status/default-behavior transitions without matching verification updates.
- check:summary renders a stable Markdown handoff from the JSON workflow reports and is the preferred human-facing summary layer.
- Local pre-push should mirror check, not a separate ad hoc command set.
- Keep repo-wide style clean enough that check can continue to enforce it without turning the default gate noisy again.
- The GitHub Pages docs site should use a repo-owned workflow (.github/workflows/pages.yml) when deployment behavior needs explicit PR validation and repo-controlled build semantics; local parity for that workflow is cd docs && bundle exec jekyll build --trace.
When a workflow stage is consumed by another agent step, prefer the JSON-report variants over scraping terminal output.
When branch-scope change reports are dominated by earlier branch work, use check:changed:local or check:changed:staged for the current edit loop and keep check:changed as the broader integration signal.
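For an agent step that has to pick between these change-scoped commands, a small helper keeps the choice explicit. This is a sketch only: the scope labels "branch", "local", and "staged" and both function names are hypothetical; the script names and the npm --prefix scripts run form are the ones listed in this section.

```python
def changed_check_script(scope: str) -> str:
    """Map a hypothetical scope label to the npm script named in this section."""
    scripts = {
        "branch": "check:changed",         # broader integration signal
        "local": "check:changed:local",    # current edit loop
        "staged": "check:changed:staged",  # current edit loop, staged files
    }
    return scripts[scope]


def npm_script_argv(script: str) -> list[str]:
    """Build the argv for `npm --prefix scripts run <script>` without running it."""
    return ["npm", "--prefix", "scripts", "run", script]


print(npm_script_argv(changed_check_script("staged")))
# ['npm', '--prefix', 'scripts', 'run', 'check:changed:staged']
```

The returned list can be passed to subprocess.run(argv, check=True) from the repository root; the sketch stops short of executing anything.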
Feature-State Discipline
Use feature_state_matrix.md as the repo ledger for workflow state.
Update it when:
- a feature moves between planned, scaffolded, implemented, default-on, or verified,
- default behavior changes,
- an artifact becomes the new verification point,
- a doc/code mismatch is discovered or resolved,
- a strategy becomes executable rather than planning-only.
Do not collapse these states into a single “done” label.
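Keeping the five states as distinct values, rather than a boolean "done", can be sketched like this. FeatureState mirrors the state names above; LedgerEntry, its fields, and the two checks in ledger_problems are hypothetical and are not the actual check:state implementation.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class FeatureState(Enum):
    PLANNED = "planned"
    SCAFFOLDED = "scaffolded"
    IMPLEMENTED = "implemented"
    DEFAULT_ON = "default-on"
    VERIFIED = "verified"


@dataclass
class LedgerEntry:
    """Hypothetical shape of one row in the feature ledger."""

    feature: str
    state: FeatureState
    updated: str  # exact ISO date, e.g. "2026-03-21"
    evidence_path: Optional[str] = None


def ledger_problems(entry: LedgerEntry) -> list[str]:
    """Assumed checks: dates must be exact, and verified needs an evidence path."""
    problems = []
    try:
        date.fromisoformat(entry.updated)
    except ValueError:
        problems.append("updated is not an exact ISO date")
    if entry.state is FeatureState.VERIFIED and not entry.evidence_path:
        problems.append("verified without an evidence path")
    return problems


entry = LedgerEntry("example-feature", FeatureState.VERIFIED, "2026-03-21")
print(ledger_problems(entry))  # ['verified without an evidence path']
```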
Human Escalation Points
Require explicit human review or sign-off for:
- benchmark baseline changes,
- quality policy threshold changes,
- benchmark label updates that redefine expected quality,
- release-default toggles,
- harness grader changes,
- destructive migrations or data resets.
Current Repository Mismatches To Preserve Explicitly
These are active mismatches, not wording accidents:
- Reverse-check is implemented and tunable, but not yet default-on.
- SRS docs define due-aware serving, but current helper publication and runtime gating are not yet verified as due-aware end to end.
- Rulegen docs describe runtime confidence filtering, but extension-side helper-rule confidence gating is not yet verified in code.
- SRS planner docs describe multiple strategies, but executable behavior is still dominated by frequency_bootstrap.
These items should remain visible in feature_state_matrix.md until code, docs, and artifacts converge.
Near-Term Meta Priorities
- Keep reverse-check experiments in the standard audit loop while the feature is being tuned.
- Keep the SRS synthetic quality harness in the loop for scheduler/admission/publication changes and expand it as due-aware serving becomes executable.
- Preserve a fresh-reviewer step for ranking, SRS scheduler, and harness changes.
- Keep feature-state dates exact so future agents can recover current behavior quickly.