
AI-Assisted Rulegen Iteration Workflow

Status: active workflow
Role: Runbook / operational
Last updated: 2026-03-21
Last verified: 2026-03-21 (AGENTS command review + package-script inventory check)
Purpose: current rulegen/POS and SRS quality-loop runbook for quality-affecting changes
Source-of-truth: rulegen/SRS quality-loop policy; canonical commands remain the scripts listed here plus AGENTS.md.

Purpose:

This workflow is focused on rulegen/POS quality loops, not general coding.

Why this exists

Rulegen quality changes can look successful while still being brittle if:

The scripts and policy below enforce a tighter loop.

Source files

Standard loop

  1. Run a benchmark sweep for the pair(s) you touched.
python3 scripts/testing/rulegen_benchmark.py \
  --pairs en-es \
  --json-output docs/test_outputs/rulegen_benchmark_en_es_latest.json \
  --markdown-output docs/test_outputs/rulegen_benchmark_en_es_latest.md \
  --html-output docs/test_outputs/rulegen_benchmark_en_es_latest.html
  2. Run the quality gate.
python3 scripts/testing/rulegen_quality_gate.py \
  --benchmark-json docs/test_outputs/rulegen_benchmark_en_es_latest.json \
  --policy-json docs/test_inputs/rulegen_quality_policy.json \
  --baseline-json docs/test_outputs/baselines/rulegen_quality_baseline.json \
  --pos-probe-json docs/test_outputs/phase6_pos_inventory/phase6_pos_probe_2026-02-23_final.json \
  --pos-inventory-json docs/test_outputs/phase6_pos_inventory/phase6_pos_inventory_2026-02-23_final.json
  3. Generate triage artifacts from best-run failures/review cases.
python3 scripts/testing/rulegen_benchmark_triage.py \
  --benchmark-json docs/test_outputs/rulegen_benchmark_en_es_latest.json \
  --json-out docs/test_outputs/rulegen_benchmark_triage_latest.json \
  --markdown-out docs/test_outputs/rulegen_benchmark_triage_latest.md
  4. Promote triage items into benchmark labels.
    • Update docs/test_inputs/rulegen_benchmark_cases.json.
    • Add/adjust expected_top1_any, forbidden_top1, forbidden_any, and tier.
  5. Re-run steps 1-3 until the gate passes and triage is empty (or every remaining item is clearly justified).
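
When promoting triage items, a quick consistency check on each edited case can catch contradictory labels before the next sweep. The sketch below is illustrative only. It assumes each case is a JSON object, that expected_top1_any, forbidden_top1, and forbidden_any hold lists of strings, and that tier is a non-empty value; the real schema of rulegen_benchmark_cases.json may differ. check_case is a hypothetical helper, not one of the repo scripts.

```python
from typing import Any

# Label fields named in the runbook; treating them as lists of strings is an assumption.
LABEL_LIST_FIELDS = ("expected_top1_any", "forbidden_top1", "forbidden_any")


def check_case(case: dict[str, Any]) -> list[str]:
    """Return human-readable problems found in one benchmark case (hypothetical schema)."""
    problems: list[str] = []
    lists_ok = True
    for field in LABEL_LIST_FIELDS:
        value = case.get(field, [])
        if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
            problems.append(f"{field} should be a list of strings")
            lists_ok = False
    if lists_ok:
        expected = set(case.get("expected_top1_any", []))
        banned = set(case.get("forbidden_top1", [])) | set(case.get("forbidden_any", []))
        # A label cannot be both an accepted top-1 answer and a forbidden one.
        for label in sorted(expected & banned):
            problems.append(f"{label!r} is both expected and forbidden")
    if not case.get("tier"):
        problems.append("tier is missing")
    return problems
```

Running a check like this over every edited case before step 1 keeps a labeling mistake from showing up later as a confusing gate failure.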

Preferred wrappers

The commands above remain canonical. Use these wrappers when they fit the change:

python3 scripts/testing/rulegen_pair_audit_cycle.py --pairs en-es
python3 scripts/testing/rulegen_auto_audit.py --base-ref origin/main
python3 scripts/testing/rulegen_auto_audit.py --pairs en-es --reverse-check-profile experiment --strict-gate

Wrapper responsibilities:

Use direct commands instead of the wrappers when pair inference is ambiguous or artifact paths need manual control.

Policy mechanics

rulegen_quality_policy.json currently enforces:

Baseline update policy

Do not update the baseline for a routine tuning PR.

Update docs/test_outputs/baselines/rulegen_quality_baseline.json only when:

When the baseline is updated, include in the PR notes:

Artifact policy

For meaningful runs:

  1. Keep immutable dated artifacts.
  2. Update *_latest aliases from the same run.
  3. Treat runs as non-comparable if policy, baseline, grader, or benchmark labels changed.
  4. Preserve enough run metadata to recover the exact sweep later.
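
The four rules above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not how the auto-audit wrapper is implemented: the `{stem}_{date}.json` naming, the run_metadata layout, and the helpers fingerprint, write_run, and comparable are all hypothetical. Comparability is approximated by hashing whichever input files the caller passes in (for example policy, baseline, grader, and benchmark labels).

```python
import hashlib
import json
import shutil
from datetime import date
from pathlib import Path


def fingerprint(path: Path) -> str:
    """SHA-256 of a file's bytes, used to detect changed inputs."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def write_run(out_dir: Path, stem: str, payload: dict,
              inputs: dict[str, Path], run_date: date) -> Path:
    """Write an immutable dated artifact, then refresh the *_latest alias from it."""
    out_dir.mkdir(parents=True, exist_ok=True)
    dated = out_dir / f"{stem}_{run_date.isoformat()}.json"
    if dated.exists():
        # Dated artifacts are immutable: never overwrite an earlier run.
        raise FileExistsError(f"refusing to overwrite dated artifact: {dated}")
    record = {
        "run_metadata": {
            "date": run_date.isoformat(),
            "input_sha256": {name: fingerprint(p) for name, p in inputs.items()},
        },
        "result": payload,
    }
    dated.write_text(json.dumps(record, indent=2, sort_keys=True), encoding="utf-8")
    # The alias is a copy of the same run, so the two can never drift apart.
    shutil.copyfile(dated, out_dir / f"{stem}_latest.json")
    return dated


def comparable(run_a: Path, run_b: Path) -> bool:
    """Two runs are comparable only if every recorded input fingerprint matches."""
    meta_a = json.loads(run_a.read_text(encoding="utf-8"))["run_metadata"]["input_sha256"]
    meta_b = json.loads(run_b.read_text(encoding="utf-8"))["run_metadata"]["input_sha256"]
    return meta_a == meta_b
```

Storing the fingerprints inside the artifact itself means a later reader can decide whether two dated runs are comparable without needing the original working tree.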

The auto-audit wrapper handles this automatically for standard rulegen runs.

When human-facing summaries are needed from the latest rulegen artifacts, use:

npm --prefix scripts run quality:rulegen:benchmark:summary
npm --prefix scripts run quality:rulegen:gate:summary
npm --prefix scripts run quality:rulegen:triage:summary

SRS quality harness

For SRS scheduler/admission/publication work, use the synthetic harness before relying on prose or code inspection alone:

python3 scripts/testing/srs_quality_harness.py \
  --json-out docs/test_outputs/srs_quality_latest.json

Human-facing summary:

python3 scripts/testing/srs_quality_summary.py \
  --quality-json docs/test_outputs/srs_quality_latest.json \
  --markdown-out docs/test_outputs/srs_quality_summary_latest.md

Latest-artifact wrapper:

npm --prefix scripts run quality:srs:summary

Current intent:

State tracking

Update feature_state_matrix.md when:

Examples that should stay explicit:

Future extension path

The current artifact gate is strict for en-es and advisory for en-ja, en-de, and es-en until artifacts for those pairs are produced regularly.

As pair coverage matures:

  1. Add those pairs to required_benchmark_pairs.
  2. Tighten per-pair floors and delta budgets.
  3. Enable stricter saturation mode (--strict-saturation) in CI/local gates.
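
The strict/advisory split, and what moving a pair into required_benchmark_pairs changes, can be sketched as follows. This is a simplified model rather than the logic of rulegen_quality_gate.py: evaluate_gate is a hypothetical helper, each pair's result is reduced to a single pass/fail boolean, and treating a missing artifact for a required pair as a failure is an assumption.

```python
def evaluate_gate(pair_passed: dict[str, bool],
                  required_pairs: set[str]) -> tuple[bool, list[str], list[str]]:
    """Return (gate_ok, failures, warnings) for a strict/advisory pair gate."""
    failures: list[str] = []
    warnings: list[str] = []
    # Required pairs are strict: a missing or failing result fails the gate.
    for pair in sorted(required_pairs):
        if pair not in pair_passed:
            failures.append(f"{pair}: required artifact missing")
        elif not pair_passed[pair]:
            failures.append(f"{pair}: below floor")
    # Every other pair is advisory: a failure is reported but does not block.
    for pair in sorted(pair_passed):
        if pair not in required_pairs and not pair_passed[pair]:
            warnings.append(f"{pair}: below floor (advisory)")
    return (not failures, failures, warnings)
```

With only en-es required, a failing en-ja result yields a warning; once en-ja is added to the required set, the same result fails the gate.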

AI usage guidance

Use AI to accelerate:

Keep human review mandatory for:

Use a fresh reviewer/evaluator instance for: