GenAI Workflow Architecture

Status: active meta workflow
Role: Runbook / operational
Last updated: 2026-05-15
Last verified: 2026-05-15 Lane 4 validation-gate command routing review against scripts/package.json, CI workflow, and build/release docs
Purpose: meta workflow contract for how GenAI-driven implementation work should interact with repo safeties, harnesses, and verification evidence
Source-of-truth: meta workflow contract; defer implementation truth to code, AGENTS.md, and ai_workflow.md.

Purpose

This document defines how GenAI-driven implementation work should operate in this repository.

It exists to keep three things explicit:

which workflow is canonical,
when independent model instances are required,
which evidence is needed before a feature can be treated as shipped.

This document extends the existing rulegen/POS quality loop. It does not replace AGENTS.md or ai_workflow.md.

Existing Infrastructure To Preserve

These files are already part of the working system and remain canonical:

../../AGENTS.md
- Required rulegen/POS benchmark -> gate -> triage loop.
ai_workflow.md
- Rulegen/POS-specific iteration workflow.
feature_state_matrix.md
- Dated feature-state ledger and mismatch tracker.
../../scripts/testing/rulegen_pair_audit_cycle.py
- Existing focused wrapper around the canonical rulegen loop.
../../scripts/testing/rulegen_auto_audit.py
- Change-aware wrapper that infers touched pairs and manages dated plus *_latest artifacts.

If a future meta workflow conflicts with those files, resolve the conflict explicitly in both places rather than silently assuming the newer text wins.

Operating Principles

Source code remains the primary truth.
Docs must distinguish planned, scaffolded, implemented, default-on, and verified.
Quality-affecting changes require explicit artifacts, not just narrative claims.
Wrapper scripts may orchestrate the canonical loop, but they must not hide or replace the underlying commands.
Known doc/code mismatches stay visible until they are resolved and re-verified.
GenAI should propose and implement changes; harnesses and independent review should certify them.

Agent Roles

Use small, bounded roles rather than one broad autonomous agent.

1. Researcher

Reads only the necessary docs, code, and current artifacts.
Produces:
- current behavior summary,
- relevant source-of-truth files,
- last verified artifact paths,
- open contradictions.

2. Planner

Converts the state summary into a bounded change plan.
Must define:
- acceptance criteria,
- touched language pairs,
- required harness runs,
- rollback conditions,
- whether a fresh reviewer instance is required.

3. Implementer

Makes the code or doc changes only.
Must not silently update:
- benchmark baselines,
- benchmark labels,
- grading policy,
- release defaults.

4. Reviewer

Uses a fresh model instance for non-trivial work.
Focuses on:
- bugs,
- regressions,
- missing tests,
- unsafe assumptions,
- harness blind spots.

5. Evaluator

Runs the harness and reports what changed.
Must separate:
- capability movement,
- regression risk,
- harness ambiguity,
- non-comparable runs.

6. Curator

Converts failures and newly learned facts into durable repo state.
Typical outputs:
- benchmark case updates,
- triage notes,
- feature_state_matrix.md updates,
- follow-up workstream docs.
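
The Planner's required outputs listed above can be captured as a small record so that an incomplete plan is caught before implementation starts. This is a minimal sketch, not an existing repo type: the `ChangePlan` class, its field names, and the `missing_fields` helper are all hypothetical, and the choice to allow an empty pair list (for doc-only changes) is an assumption.

```python
from dataclasses import dataclass


@dataclass
class ChangePlan:
    """Hypothetical record of the Planner's required outputs."""

    acceptance_criteria: list[str]
    touched_pairs: list[str]          # e.g. ["en-es"]; may be empty for doc-only work
    required_harness_runs: list[str]  # commands or harness names to run
    rollback_conditions: list[str]
    fresh_reviewer_required: bool

    def missing_fields(self) -> list[str]:
        """Return the names of required list fields that are still empty."""
        required = {
            "acceptance_criteria": self.acceptance_criteria,
            "required_harness_runs": self.required_harness_runs,
            "rollback_conditions": self.rollback_conditions,
        }
        return [name for name, value in required.items() if not value]
```

A plan whose `missing_fields()` result is non-empty would go back to the Planner rather than on to the Implementer.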

When To Use Separate Model Instances

Use a fresh model instance when independence matters more than shared context.

Separate instance required

Review after non-trivial implementation.
Benchmark triage and benchmark-label suggestions.
Benchmark policy, baseline, or grader changes.
Prompt or workflow-contract changes that affect many future runs.
High-risk ranking or SRS logic changes.
Any model-assisted judging/grading flow.

Same instance is acceptable

Small local refactors with deterministic tests.
One-file doc updates.
Narrow tooling changes with obvious CLI verification.
Cases where strict automated checks fully determine success.
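
The two lists above amount to a simple membership rule, sketched below. The category identifiers are hypothetical labels invented for this illustration; only the categories themselves come from the "Separate instance required" list.

```python
# Hypothetical labels for the "Separate instance required" categories above.
FRESH_INSTANCE_CATEGORIES = frozenset({
    "nontrivial_implementation_review",
    "benchmark_triage_or_labels",
    "benchmark_policy_baseline_or_grader",
    "prompt_or_workflow_contract",
    "high_risk_ranking_or_srs",
    "model_assisted_grading",
})


def needs_fresh_instance(categories: set[str]) -> bool:
    """Return True if any category of the change demands an independent instance."""
    return bool(FRESH_INSTANCE_CATEGORIES & categories)
```

A change tagged with several categories needs a fresh instance as soon as one of them is in the required set; unknown labels default to "same instance is acceptable", so a stricter variant might reject unrecognized labels instead.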

Harness Architecture

Treat the harness as product infrastructure.

Every meaningful quality run should answer:

What changed?
What was expected to improve?
Did quality move?
Did regressions appear?
Can the result be trusted?

Layer 1: Fast deterministic checks

Targeted unit tests for changed modules.
Syntax checks.
Serialization and schema checks.

Layer 2: Capability evals

Benchmark sweeps for touched pairs.
Reverse-check off/on experiments when ranking behavior changes.
Pair-specific summary metrics and top-run comparisons.

Layer 3: Regression evals

Fixed must-pass cases for previously resolved failures.
Stable case sets kept separate from exploratory capability sweeps.

Layer 4: Harness health checks

Saturation warnings.
Flags for ambiguous or low-sensitivity sweeps.
Changed grader/policy/baseline detection.

Layer 5: Artifact and transcript review

Read failure cases and suspicious wins directly.
Promote durable failures into benchmark or workstream artifacts.

Rulegen Workflow

Canonical commands

The canonical rulegen loop remains the benchmark -> quality gate -> triage sequence in ../../AGENTS.md.

Preferred wrappers

Use these wrappers when they fit the change:

python3 scripts/testing/rulegen_pair_audit_cycle.py --pairs en-es
python3 scripts/testing/rulegen_auto_audit.py --base-ref origin/main
python3 scripts/testing/rulegen_auto_audit.py --pairs en-es --reverse-check-profile experiment --strict-gate

Wrapper policy:

rulegen_pair_audit_cycle.py remains the focused orchestration layer.
rulegen_auto_audit.py adds:
- changed-file pair inference,
- dated artifact paths,
- *_latest alias updates,
- manifest output for run provenance.
rulegen_benchmark_summary.py, rulegen_quality_gate_summary.py, and rulegen_benchmark_triage_summary.py are the preferred human-facing renderers for the latest rulegen artifacts in CI and handoff docs.
srs_quality_harness.py is the current synthetic SRS quality surface for bootstrap/publication/runtime behavior.
srs_quality_summary.py is the preferred human-facing renderer for SRS quality JSON in CI and handoff docs.

Use the canonical commands directly when:

wrapper defaults are not appropriate,
you need full control over artifacts or sweep inputs,
pair inference would be ambiguous.

Artifact Policy

For generated quality artifacts:

Write immutable dated outputs for every meaningful run.
Update *_latest aliases only from the same run.
Store run provenance with the artifacts when possible.
Treat runs as non-comparable when policy, baseline, benchmark labels, or grading semantics changed.
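
One way to apply the non-comparability rule above is to record a fingerprint for each of the four inputs with every run and compare them before diffing results. The sketch below assumes such fingerprints exist; the `RunProvenance` type and its field names are hypothetical and do not describe the manifest that the existing wrappers write.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunProvenance:
    """Hypothetical fingerprints stored alongside a run's artifacts."""

    policy_hash: str
    baseline_hash: str
    labels_hash: str
    grader_hash: str


def non_comparable_reasons(old: RunProvenance, new: RunProvenance) -> list[str]:
    """List the provenance fields that differ; an empty list means comparable."""
    fields = ("policy_hash", "baseline_hash", "labels_hash", "grader_hash")
    return [name for name in fields if getattr(old, name) != getattr(new, name)]
```

Returning the differing fields, rather than a bare boolean, lets the Evaluator report why two runs should not be compared.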

Recommended artifact pattern:

dated files: ..._2026-03-11.json
latest aliases: ..._latest.json
manifest: rulegen_auto_audit_<pair_suffix>_2026-03-11.json
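
The dated-file and latest-alias pattern above can be sketched as follows. This is an illustration under assumptions, not the behavior of rulegen_auto_audit.py: the `write_run_artifact` function, its `stem` parameter, and the choice to copy rather than symlink the alias are all hypothetical.

```python
import json
import shutil
from datetime import date
from pathlib import Path


def write_run_artifact(
    out_dir: Path, stem: str, payload: dict, run_date: date
) -> tuple[Path, Path]:
    """Write an immutable dated JSON artifact, then refresh its *_latest alias."""
    out_dir.mkdir(parents=True, exist_ok=True)
    dated = out_dir / f"{stem}_{run_date.isoformat()}.json"
    latest = out_dir / f"{stem}_latest.json"
    if dated.exists():
        # Dated outputs are treated as immutable: never overwrite an earlier run.
        raise FileExistsError(f"refusing to overwrite dated artifact: {dated}")
    dated.write_text(json.dumps(payload, indent=2, sort_keys=True), encoding="utf-8")
    # The alias is produced from the file just written, so both come from one run.
    shutil.copyfile(dated, latest)
    return dated, latest
```

Because the alias is copied from the dated file in the same call, it cannot point at output from a different run. A same-day rerun raises instead of overwriting; a real implementation might add a time or run suffix to the dated name.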

Repository Safety Commands

Run these commands as the general repo safety checks before feature work expands:

npm --prefix scripts run check
npm --prefix scripts run build

Current intent:

check is stable and non-mutating.
check:changed is the preferred branch-scope workflow command before heavier quality work.
check:docs / check:docs:report are the dedicated canonical-doc integrity surfaces; check:changed reruns that audit when canonical docs or referenced source files change materially.
build is a local build smoke for maintained build surfaces.
build:report is the full build contract; hosted macOS and Windows CI should use it directly instead of a custom job-local command.
build:ci / build:ci:report are the explicit unsupported-host build surfaces; they use the same script with explicit unsupported-surface skips instead of a separate CI-only build policy.
check now includes the strict Windows parity audit; check:windows:parity remains the dedicated inventory/report command, check:changed runs it automatically when parity-related files change, and Windows CI uses the strict variant to fail regressions.
check now enforces repo-wide Ruff style directly. check:style, check:style:report, and check:style:summary remain useful when you want a style-only loop or dedicated style artifacts.
check:changed now tracks both total changed files and substantive changed files; heavy follow-on loops should key off the substantive set so Python AST-equivalent churn, JSON pretty-print churn, and Markdown/text reflow do not force unnecessary audits.
check:changed also carries changed-only project-health baseline gating, including new/regressed warning debt, instead of treating maintainability drift as a separate review surface.
check:report, check:changed:report, and build:report are the machine-readable workflow surfaces for automation and agent hand-offs.
check:state audits the feature ledger so status claims stay backed by dated evidence paths, and compares against HEAD to catch status/default-behavior transitions without matching verification updates.
check:summary renders a stable Markdown handoff from the JSON workflow reports and is the preferred human-facing summary layer.
Local pre-push hooks should mirror check rather than run a separate ad hoc command set.
Keep repo-wide style clean enough that check can continue to enforce it without turning the default gate noisy again.
The GitHub Pages docs site should use a repo-owned workflow (.github/workflows/pages.yml) when deployment behavior needs explicit PR validation and repo-controlled build semantics; local parity for that workflow is cd docs && bundle exec jekyll build --trace.
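
The distinction between total and substantive changed files described above can be illustrated with a small classifier. This is a sketch of the idea only and does not reproduce how check:changed is implemented; the `is_substantive_change` function and its extension-based dispatch are assumptions.

```python
import ast
import json


def is_substantive_change(path: str, before: str, after: str) -> bool:
    """Return False when an edit is only formatting churn for its file type."""
    if before == after:
        return False
    try:
        if path.endswith(".py"):
            # ast.dump omits line/column attributes by default, so whitespace
            # and comment edits yield identical dumps.
            return ast.dump(ast.parse(before)) != ast.dump(ast.parse(after))
        if path.endswith(".json"):
            # Parsed values compare equal regardless of indentation or key order.
            return json.loads(before) != json.loads(after)
    except (SyntaxError, ValueError):
        # Unparseable content cannot be proven equivalent; treat as substantive.
        return True
    if path.endswith((".md", ".txt")):
        # Reflow only moves whitespace; the token sequence stays the same.
        return before.split() != after.split()
    return True
```

Note that the Markdown/text rule is deliberately coarse: it would also ignore whitespace changes inside code blocks, which a production check may want to treat as substantive.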

When a workflow stage is consumed by another agent step, prefer the JSON-report variants over scraping terminal output. When branch-scope change reports are dominated by earlier branch work, use check:changed:local or check:changed:staged for the current edit loop and keep check:changed as the broader integration signal.

Feature-State Discipline

Use feature_state_matrix.md as the repo ledger for workflow state.

Update it when:

a feature moves between planned, scaffolded, implemented, default-on, or verified,
default behavior changes,
an artifact becomes the new verification point,
a doc/code mismatch is discovered or resolved,
a strategy becomes executable rather than planning-only.

Do not collapse these states into a single “done” label.
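
Tooling that reads the ledger can enforce the same rule by accepting only the five named states. The sketch below is hypothetical: neither `FeatureState` nor `parse_state` is an existing repo API, and it does not assume the states form a strict progression.

```python
from enum import Enum


class FeatureState(Enum):
    """The five ledger states named in this document."""

    PLANNED = "planned"
    SCAFFOLDED = "scaffolded"
    IMPLEMENTED = "implemented"
    DEFAULT_ON = "default-on"
    VERIFIED = "verified"


def parse_state(label: str) -> FeatureState:
    """Map a ledger label to a state; reject catch-all labels such as 'done'."""
    try:
        return FeatureState(label.strip().lower())
    except ValueError:
        allowed = ", ".join(state.value for state in FeatureState)
        raise ValueError(
            f"unknown feature state {label!r}; use one of: {allowed}"
        ) from None
```

Rejecting unknown labels at parse time keeps a generic "done" from entering the ledger in the first place.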

Human Escalation Points

Require explicit human review or sign-off for:

benchmark baseline changes,
quality policy threshold changes,
benchmark label updates that redefine expected quality,
release-default toggles,
harness grader changes,
destructive migrations or data resets.

Current Repository Mismatches To Preserve Explicitly

These are active mismatches, not wording accidents:

Reverse-check is implemented and tunable, but not yet default-on.
SRS serving is due-aware at the runtime gate when helper due metadata is present, but helper publication still uses the broader active/admitted inventory rather than a dedicated due-only artifact.
Rulegen docs describe runtime confidence filtering, but the live helper-rule runtime still has no post-emission confidence gate.
SRS planner docs describe multiple strategies, but executable behavior is still dominated by frequency_bootstrap.

These items should remain visible in feature_state_matrix.md until code, docs, and artifacts converge.

Near-Term Meta Priorities

Keep reverse-check experiments in the standard audit loop while the feature is being tuned.
Keep the SRS synthetic quality harness in the loop for scheduler/admission/publication/runtime-serving changes and expand its pair/user coverage beyond the current synthetic scenarios.
Preserve a fresh-reviewer step for ranking, SRS scheduler, and harness changes.
Keep feature-state dates exact so future agents can recover current behavior quickly.