AI-Assisted Rulegen Iteration Workflow

Status: active workflow
Role: Runbook / operational
Last updated: 2026-05-15
Last verified: 2026-05-15 (Lane 4 validation-gate routing review against AGENTS.md, scripts/package.json, and quality-loop docs)
Purpose: current rulegen/POS and SRS quality-loop runbook for quality-affecting changes
Source-of-truth: rulegen/SRS quality-loop policy; canonical commands remain the scripts listed here plus AGENTS.md; broader change-type validation routing lives in productization_lane4_validation_gate_inventory.md.

Purpose:

Keep rulegen tuning fast without sacrificing stability.
Make each tuning change measurable, reviewable, and reversible.
Convert benchmark failures into durable labeled cases.

This workflow is focused on rulegen/POS quality loops, not general coding. For non-rulegen/SRS change types, use productization_lane4_validation_gate_inventory.md to choose the validation bundle.

Why this exists

Rulegen quality changes can look successful while still being brittle if:

benchmark coverage is too small,
many parameter sets collapse to identical scores,
POS drift silently changes candidate quality,
case failures are not promoted back into benchmark labels.

The scripts and policy below enforce a tighter loop.
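The second failure mode above, many parameter sets collapsing to identical scores, can be made concrete with a small sensitivity check. This is a minimal sketch and not the logic in rulegen_quality_gate.py: the function names, the flat list of per-run scores, and the 0.5 threshold are all assumptions made for illustration.

```python
def distinct_score_ratio(scores, ndigits=4):
    """Distinct rounded scores divided by the number of runs (1.0 = every run differs)."""
    if not scores:
        raise ValueError("sweep produced no scores")
    return len({round(s, ndigits) for s in scores}) / len(scores)


def is_saturated(scores, min_ratio=0.5):
    # Hypothetical threshold: flag the sweep when fewer than half of the
    # runs produced a distinct score.
    return distinct_score_ratio(scores) < min_ratio


# Made-up sweep: six parameter sets, only two distinct scores.
sweep = [0.81, 0.81, 0.81, 0.81, 0.79, 0.81]
print(distinct_score_ratio(sweep))
print(is_saturated(sweep))
```

A sweep that scores this way says little about which parameter set is better, which is why a passing score alone should not be read as a successful tuning change.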

Source files

Benchmark dataset source-of-truth: docs/test_inputs/rulegen_benchmark_cases/
Quality policy: docs/test_inputs/rulegen_quality_policy.json
Baseline metrics: docs/test_outputs/baselines/rulegen_quality_baseline.json
Benchmark runner: scripts/testing/rulegen_benchmark.py
Quality gate: scripts/testing/rulegen_quality_gate.py
Benchmark summary renderer: scripts/testing/rulegen_benchmark_summary.py
Quality gate summary renderer: scripts/testing/rulegen_quality_gate_summary.py
Triage extractor: scripts/testing/rulegen_benchmark_triage.py
Triage summary renderer: scripts/testing/rulegen_benchmark_triage_summary.py
SRS quality harness: scripts/testing/srs_quality_harness.py
SRS quality summary renderer: scripts/testing/srs_quality_summary.py
Focused audit wrapper: scripts/testing/rulegen_pair_audit_cycle.py
Change-aware audit wrapper: scripts/testing/rulegen_auto_audit.py
Feature state ledger: docs/developer/feature_state_matrix.md
GenAI workflow contract: docs/developer/genai_workflow_architecture.md
Validation gate inventory: docs/developer/productization_lane4_validation_gate_inventory.md

Standard loop

Run a benchmark sweep for the pair(s) you touched.

python3 scripts/testing/rulegen_benchmark.py \
  --pairs en-es \
  --json-output docs/test_outputs/rulegen_benchmark_en_es_latest.json \
  --markdown-output docs/test_outputs/rulegen_benchmark_en_es_latest.md \
  --html-output docs/test_outputs/rulegen_benchmark_en_es_latest.html

Run the quality gate.

python3 scripts/testing/rulegen_quality_gate.py \
  --benchmark-json docs/test_outputs/rulegen_benchmark_en_es_latest.json \
  --policy-json docs/test_inputs/rulegen_quality_policy.json \
  --baseline-json docs/test_outputs/baselines/rulegen_quality_baseline.json \
  --pos-probe-json docs/test_outputs/phase6_pos_inventory/phase6_pos_probe_2026-02-23_final.json \
  --pos-inventory-json docs/test_outputs/phase6_pos_inventory/phase6_pos_inventory_2026-02-23_final.json

Generate triage artifacts from best-run failures/review cases.

python3 scripts/testing/rulegen_benchmark_triage.py \
  --benchmark-json docs/test_outputs/rulegen_benchmark_en_es_latest.json \
  --json-out docs/test_outputs/rulegen_benchmark_triage_latest.json \
  --markdown-out docs/test_outputs/rulegen_benchmark_triage_latest.md

Promote triage items into benchmark labels.
- Update the LP-specific file under docs/test_inputs/rulegen_benchmark_cases/ (for example en_de.json).
- Add/adjust expected_top1_any, forbidden_top1, forbidden_any, and tier.
Re-run the benchmark, quality gate, and triage steps until the gate passes and triage is empty (or every remaining item is clearly justified).
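The label-promotion step can be sketched as a small upsert over the cases in an LP-specific file. This is an illustration under assumptions: the file is treated as a flat list of case objects, and promote_case is a hypothetical helper, not an existing script. The field names and tier values are the ones this runbook lists.

```python
import json

LABEL_FIELDS = ("expected_top1_any", "forbidden_top1", "forbidden_any")


def promote_case(cases, case_id, target, tier, **labels):
    """Insert or update one labeled case and return the case dict that was written."""
    unknown = set(labels) - set(LABEL_FIELDS)
    if unknown:
        raise ValueError(f"unsupported label fields: {sorted(unknown)}")
    if tier not in ("smoke", "hard"):
        raise ValueError(f"unknown tier: {tier!r}")
    case = next((c for c in cases if c.get("case_id") == case_id), None)
    if case is None:
        case = {"case_id": case_id, "target": target}
        cases.append(case)
    case["tier"] = tier
    for field in LABEL_FIELDS:
        if field in labels:
            # Sort and de-duplicate so label diffs stay small in review.
            case[field] = sorted(set(labels[field]))
        else:
            # Leave labels that already exist untouched; default to empty.
            case.setdefault(field, [])
    return case


cases = []  # assumed shape: a flat list of case objects
promote_case(cases, "en-es:cargo", "cargo", "hard",
             expected_top1_any=["position", "post", "job"])
print(json.dumps(cases, indent=2))
```

Because the helper updates an existing case_id in place instead of appending a duplicate, repeated promotions of the same triage item stay idempotent. Human review of the resulting label diff remains mandatory.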

Benchmark slice metadata scaffolding

The LP-local benchmark case files can now carry optional slice metadata without changing the core benchmark contract.

Supported optional fields per case:

slice_tags: flat string tags for reusable families or hazards
slice_dimensions: flat object of dimension name -> string list

The current semantic-shadow veto proxy tooling projects these case-level fields onto the reviewed target/trigger rows, automatically adds tier as a slice dimension, and emits per-slice summaries alongside the global report.

Use this when expanding benchmark coverage so new cases can be grouped immediately by:

semantic family
ambiguity topology
POS
pipeline route
decision type
LP-local hazards

Recommended shape:

{
  "case_id": "en-es:cargo",
  "target": "cargo",
  "tier": "hard",
  "expected_any": ["post", "position", "job", "office"],
  "expected_top1_any": ["position", "post", "job"],
  "forbidden_top1": [],
  "forbidden_any": [],
  "slice_tags": [
    "family:job_role",
    "topology:shared_english_trigger"
  ],
  "slice_dimensions": {
    "semantic_family": ["job_role"],
    "ambiguity_topology": ["shared_english_trigger"],
    "pos": ["noun"],
    "decision": ["ambiguous"]
  }
}
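To show how cases of this shape can be grouped into slices, the sketch below builds an index from (dimension, value) to case IDs and adds tier as a dimension when a case does not declare one. It is not the proxy tooling's implementation. slice_index and the second case ID are hypothetical.

```python
from collections import defaultdict


def slice_index(cases):
    """Map (dimension, value) -> list of case_ids, with tier added as a dimension."""
    index = defaultdict(list)
    for case in cases:
        dims = {
            name: list(values)
            for name, values in case.get("slice_dimensions", {}).items()
        }
        if "tier" in case:
            # Tier is the default coarse slice, so every case lands in one.
            dims.setdefault("tier", [case["tier"]])
        for name, values in dims.items():
            for value in values:
                index[(name, value)].append(case["case_id"])
    return dict(index)


cases = [
    {"case_id": "en-es:cargo", "tier": "hard",
     "slice_dimensions": {"semantic_family": ["job_role"], "pos": ["noun"]}},
    # Hypothetical older case with no slice metadata at all.
    {"case_id": "en-es:older-case", "tier": "smoke"},
]
for key, ids in sorted(slice_index(cases).items()):
    print(key, ids)
```

Cases without slice metadata still appear under their tier, so older files stay usable while new cases gain finer groupings.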

Authoring rules:

Keep the fields optional so older cases remain valid.
Keep tags and dimension values stable and reusable; do not encode one-off prose there.
Prefer a small number of dimensions with many cases over many bespoke dimensions.
Treat tier as the default coarse slice; only add finer dimensions when they help benchmark review or research isolation.
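The first authoring rule, together with the flat shapes described above, can be checked mechanically. The following is a minimal sketch of such a check. slice_metadata_errors is a hypothetical helper and is not part of the dataset field/tier contract enforced by the quality gate.

```python
def slice_metadata_errors(case):
    """Return a list of problems with a case's optional slice fields (empty if valid)."""
    errors = []
    # Both fields are optional, so a missing field is never an error.
    tags = case.get("slice_tags", [])
    if not isinstance(tags, list) or not all(isinstance(t, str) for t in tags):
        errors.append("slice_tags must be a flat list of strings")
    dims = case.get("slice_dimensions", {})
    if not isinstance(dims, dict):
        errors.append("slice_dimensions must be an object")
    else:
        for name, values in dims.items():
            if not isinstance(values, list) or not all(
                isinstance(v, str) for v in values
            ):
                errors.append(f"slice_dimensions[{name!r}] must be a list of strings")
    return errors


good = {"case_id": "en-es:cargo", "slice_tags": ["family:job_role"],
        "slice_dimensions": {"pos": ["noun"]}}
bad = {"case_id": "en-es:cargo", "slice_dimensions": {"pos": "noun"}}
print(slice_metadata_errors(good))
print(slice_metadata_errors(bad))
```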

Preferred wrappers

The commands above remain canonical. Use these wrappers when they fit the change:

python3 scripts/testing/rulegen_pair_audit_cycle.py --pairs en-es
python3 scripts/testing/rulegen_auto_audit.py --base-ref origin/main
python3 scripts/testing/rulegen_auto_audit.py --pairs en-es --reverse-check-profile experiment --strict-gate
npm --prefix scripts run quality:rulegen:en-de

Wrapper responsibilities:

rulegen_pair_audit_cycle.py
- runs benchmark -> quality gate -> triage with one command,
- forwards tuning knobs and emits summarized output.
rulegen_auto_audit.py
- infers touched rulegen pairs from git changes,
- writes dated artifacts,
- refreshes *_latest aliases,
- records a manifest for run provenance.

Use direct commands instead of the wrappers when pair inference is ambiguous or artifact paths need manual control.
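As an illustration of pair inference, the sketch below derives pair names from changed file paths, such as the output of git diff --name-only. It makes a narrow assumption: only LP-specific case files named like en_de.json under the benchmark cases directory are considered. The real rulegen_auto_audit.py may use different or broader signals. pairs_from_changed_paths is a hypothetical name.

```python
import re
from pathlib import PurePosixPath

CASES_DIR = "docs/test_inputs/rulegen_benchmark_cases"
# Assumed naming convention: <source>_<target>.json, for example en_de.json.
PAIR_FILE = re.compile(r"^([a-z]{2})_([a-z]{2})\.json$")


def pairs_from_changed_paths(paths):
    """Return sorted pair names (e.g. 'en-de') for changed benchmark case files."""
    pairs = set()
    for raw in paths:
        path = PurePosixPath(raw)
        if str(path.parent) != CASES_DIR:
            continue
        match = PAIR_FILE.match(path.name)
        if match:
            pairs.add(f"{match.group(1)}-{match.group(2)}")
    return sorted(pairs)


changed = [
    "docs/test_inputs/rulegen_benchmark_cases/en_de.json",
    "docs/developer/feature_state_matrix.md",
]
pairs = pairs_from_changed_paths(changed)
if pairs:
    print("inferred pairs:", ", ".join(pairs))
else:
    print("no pair could be inferred; use the direct commands with --pairs")
```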

For named advisory latest lanes that should stay separate from the strict en-es artifact gate, use pair-specific wrappers and summaries:

npm --prefix scripts run quality:rulegen:en-de
npm --prefix scripts run quality:rulegen:en-de:summary

Those advisory lanes can scope the quality gate to the selected pair, so the actionable output stays focused on that pair's own floor, delta, and saturation results. The current en-de lane uses pair-scoped gate mode. Until an en-de baseline is accepted, its only remaining gate noise is the expected code:

DELTA_SCOPE_BASELINE_MISSING

Policy mechanics

rulegen_quality_policy.json currently enforces:

required benchmark pair coverage (en-es hard-gated now, others recommended),
dataset field/tier contract (smoke / hard),
per-pair quality floors,
delta budgets versus baseline,
saturation warnings for low-sensitivity sweeps,
POS mismatch and unknown-tag growth guardrails.
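Two of these checks, per-pair floors and delta budgets versus baseline, reduce to simple comparisons. The sketch below shows the idea under stated assumptions: the metric name top1_accuracy, the numbers, and the flat dictionary layout are invented for illustration and do not reflect the actual schema of rulegen_quality_policy.json or the baseline file.

```python
def gate_findings(pair, metrics, floors, baseline, max_drop):
    """Compare one pair's metrics to absolute floors and to a baseline drop budget."""
    findings = []
    for name, value in metrics.items():
        floor = floors.get(name)
        if floor is not None and value < floor:
            findings.append(f"{pair} {name}: {value:.3f} is below floor {floor:.3f}")
        base = baseline.get(name)
        budget = max_drop.get(name)
        # A metric with no baseline or no budget is skipped here, not failed.
        if base is not None and budget is not None and base - value > budget:
            findings.append(
                f"{pair} {name}: dropped {base - value:.3f} vs baseline "
                f"(budget {budget:.3f})"
            )
    return findings


# Hypothetical values: above the floor, but a larger drop than the budget allows.
findings = gate_findings(
    "en-es",
    metrics={"top1_accuracy": 0.78},
    floors={"top1_accuracy": 0.75},
    baseline={"top1_accuracy": 0.82},
    max_drop={"top1_accuracy": 0.02},
)
for line in findings:
    print(line)
```

The example passes the floor check yet still produces a finding, which is the case delta budgets exist for: a regression that stays above the absolute minimum.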

Baseline update policy

Do not update the baseline in a routine tuning PR.

Update docs/test_outputs/baselines/rulegen_quality_baseline.json only when:

quality policy intentionally changes,
benchmark dataset meaningfully expands,
or a reviewed quality shift is accepted as the new target.

When the baseline is updated, include in the PR notes:

old vs new metrics,
why the shift is intentional,
the rollback strategy.
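The old-vs-new comparison for PR notes can be produced mechanically. This is a minimal sketch assuming each baseline has been reduced to a flat mapping of metric name to number. The real layout of rulegen_quality_baseline.json may be nested, and baseline_diff_lines and the metric names are hypothetical.

```python
def baseline_diff_lines(old, new):
    """Render 'metric: old -> new (delta)' lines for every metric in either baseline."""
    lines = []
    for name in sorted(set(old) | set(new)):
        before, after = old.get(name), new.get(name)
        if before is None or after is None:
            # Metric was added or removed, so no numeric delta applies.
            lines.append(f"{name}: {before} -> {after}")
        else:
            lines.append(f"{name}: {before:.3f} -> {after:.3f} ({after - before:+.3f})")
    return lines


# Made-up values for illustration only.
old = {"top1_accuracy": 0.80}
new = {"top1_accuracy": 0.85, "forbidden_rate": 0.5}
print("\n".join(baseline_diff_lines(old, new)))
```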

Artifact policy

For meaningful runs:

Keep immutable dated artifacts.
Update *_latest aliases from the same run.
Treat runs as non-comparable if policy, baseline, grader, or benchmark labels changed.
Preserve enough run metadata to recover the exact sweep later.

The auto-audit wrapper handles this automatically for standard rulegen runs.
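For runs that bypass the wrapper, the policy above can be sketched as follows: write a dated artifact that is never overwritten, refresh the *_latest alias from that same file, and record a manifest with fingerprints of the inputs that determine comparability. This illustrates the policy and is not the wrapper's implementation. publish_run, comparable, the file naming, and the manifest fields are all assumptions.

```python
import hashlib
import json
import shutil
import tempfile
from datetime import date
from pathlib import Path


def sha256_of(path):
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def publish_run(result, out_dir, stem, inputs, run_date=None):
    """Write a dated artifact, refresh the *_latest alias, and record a manifest."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    stamp = (run_date or date.today()).isoformat()
    dated = out_dir / f"{stem}_{stamp}.json"
    if dated.exists():
        raise FileExistsError(f"{dated} exists; dated artifacts are immutable")
    dated.write_text(json.dumps(result, indent=2, sort_keys=True))
    # The alias is copied from the dated file, so both come from the same run.
    shutil.copyfile(dated, out_dir / f"{stem}_latest.json")
    manifest = {
        "artifact": dated.name,
        "artifact_sha256": sha256_of(dated),
        # Fingerprints of policy, baseline, labels, and similar inputs.
        "inputs": {name: sha256_of(path) for name, path in inputs.items()},
    }
    (out_dir / f"{stem}_{stamp}.manifest.json").write_text(
        json.dumps(manifest, indent=2)
    )
    return manifest


def comparable(manifest_a, manifest_b):
    # Runs are comparable only when every recorded input fingerprint matches.
    return manifest_a["inputs"] == manifest_b["inputs"]


with tempfile.TemporaryDirectory() as tmp:
    policy = Path(tmp) / "policy.json"
    policy.write_text('{"floor": 0.75}')
    first = publish_run({"score": 0.8}, tmp, "demo", {"policy": policy}, date(2026, 5, 14))
    policy.write_text('{"floor": 0.80}')  # policy changed between runs
    second = publish_run({"score": 0.8}, tmp, "demo", {"policy": policy}, date(2026, 5, 15))
    print(comparable(first, second))  # False: the policy fingerprint differs
```

Hashing the inputs keeps the comparability decision mechanical: two runs with identical scores are still flagged as non-comparable when the policy, baseline, or labels behind them differ.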

When human-facing summaries are needed from the latest rulegen artifacts, use:

npm --prefix scripts run quality:rulegen:benchmark:summary
npm --prefix scripts run quality:rulegen:gate:summary
npm --prefix scripts run quality:rulegen:triage:summary

SRS quality harness

For SRS scheduler/admission/publication work, use the synthetic harness before relying on prose or code inspection alone:

python3 scripts/testing/srs_quality_harness.py \
  --json-out docs/test_outputs/srs_quality_latest.json

Human-facing summary:

python3 scripts/testing/srs_quality_summary.py \
  --quality-json docs/test_outputs/srs_quality_latest.json \
  --markdown-out docs/test_outputs/srs_quality_summary_latest.md

Latest-artifact wrapper:

npm --prefix scripts run quality:srs:summary

Current intent:

keep bootstrap/publication/runtime diagnostics measurable for en-ja and en-de,
keep feedback-driven pause/resume behavior measurable for en-ja,
verify due-aware runtime serving through helper SRS metadata plus extension gating, while keeping it explicit that no dedicated due-only publication artifact exists.

State tracking

Update feature_state_matrix.md when:

default behavior changes,
a feature becomes executable or default-on,
the latest verification artifact changes materially,
a doc/code mismatch is discovered or resolved.

Examples that should stay explicit:

reverse-check implemented but not default-on,
due-aware SRS serving implemented at runtime when helper due metadata is present, but no due-only publication artifact exists,
extension-side helper-rule confidence gating documented, but the live helper-rule runtime still has no post-emission confidence gate.

Future extension path

The current artifact gate is strict for en-es and advisory for en-ja, en-de, and es-en until those pair artifacts are produced regularly.

en-de now has a dedicated advisory latest lane with its own preset, wrapper commands, and *_latest artifacts, but it is still not part of required_benchmark_pairs.

As pair coverage matures:

Add those pairs to required_benchmark_pairs.
Tighten per-pair floors and delta budgets.
Enable stricter saturation mode (--strict-saturation) in CI/local gates.

AI usage guidance

Use AI to accelerate:

proposing new benchmark cases from triage output,
summarizing sweep deltas,
proposing candidate POS mappings for unknown tags.

Keep human review mandatory for:

benchmark label updates,
baseline changes,
quality policy threshold changes.

Use a fresh reviewer/evaluator instance for:

ranking logic changes,
reverse-check tuning rollouts,
benchmark policy or harness changes,
any change that would redefine default quality claims.