Feature State Matrix
Status: active ledger
Role: Canonical current
Last updated: 2026-03-26
Source-of-truth: cross-cutting state ledger; runtime truth still lives in code, tests, and dated evidence artifacts.
Purpose:
- Keep feature state explicit for GenAI-driven development.
- Separate `implemented`, `default-on`, `verified`, and `planned` so current behavior is easy to recover.
- Give each feature a dated checkpoint plus evidence paths.
Use this file when:
- default behavior changes,
- benchmark policy or baseline assumptions change,
- a workstream moves from scaffolded to executable,
- code inspection finds a doc/code mismatch that should be tracked.
Status Vocabulary
- `planned`: documented idea only.
- `scaffolded`: code/docs shape exists, but behavior is not yet fully executable.
- `implemented`: code path exists and is usable.
- `default-on`: implemented and enabled in normal/default behavior.
- `verified`: implementation has recent evidence (artifact, test, or direct code inspection).
Date Fields
- `Last documented checkpoint`: most recent dated doc milestone or spec update.
- `Last verified`: most recent artifact date, test evidence, or dated code inspection.
Rulegen Benchmark / Gate / Triage Loop
- Status: `implemented`, `default-on`, `verified`
- Last documented checkpoint: `2026-02-24`
- Last verified: `2026-03-26` local benchmark/gate/triage refresh + portable bundle replay
- Default behavior:
  - Required for rulegen scoring, candidate filtering, POS normalization, and LP tuning changes.
  - Canonical loop remains benchmark -> quality gate -> triage.
  - Latest rulegen artifacts now have human-facing Markdown summaries for benchmark, gate, and triage surfaces.
- Evidence:
  - `AGENTS.md`
  - `docs/developer/ai_workflow.md`
  - `scripts/testing/rulegen_benchmark.py`
  - `scripts/testing/rulegen_benchmark_bundle.py`
  - `scripts/testing/rulegen_benchmark_presets.py`
  - `scripts/testing/rulegen_benchmark_summary.py`
  - `scripts/testing/rulegen_quality_gate.py`
  - `scripts/testing/rulegen_quality_gate_summary.py`
  - `scripts/testing/rulegen_benchmark_triage.py`
  - `scripts/testing/rulegen_benchmark_triage_summary.py`
  - `docs/test_outputs/rulegen_benchmark_en_es_latest.md`
  - `docs/test_outputs/rulegen_benchmark_summary_latest.md`
  - `docs/test_outputs/rulegen_quality_gate_latest.json`
  - `docs/test_outputs/rulegen_quality_gate_summary_latest.md`
  - `docs/test_outputs/rulegen_benchmark_triage_summary_latest.md`
- Known gaps:
  - Current `docs/test_outputs/rulegen_quality_gate_latest.json` has FAIL findings for `en-es` quality floor and delta budget.
  - Recommended pairs (`en-ja`, `en-de`, `es-en`) are still advisory rather than hard-gated.
  - Current quality-gate output also shows saturation warnings for `en-es`.
  - Artifact history and pair inference still depend on wrapper usage rather than a mandatory repo-wide gate.
  - Benchmark artifacts now mirror resolved resources under each pair as well as in the top-level `resources` block, carry SHA-256 resource checksums, and record the effective per-target `word_package` snapshot used by the run. The benchmark CLI now supports named preset methodologies from `docs/test_inputs/rulegen_benchmark_presets.json`, and portable bundle export/replay now packages the exact dataset/resources/snapshots for cross-machine reruns. The remaining ergonomic gap is optional single-file archive/import support.
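The SHA-256 resource checksums mentioned in the gaps above can be illustrated with a minimal sketch. The helper names (`sha256_of_file`, `checksum_resources`) and the shape of the returned mapping are assumptions for illustration only; the real artifact layout produced by `scripts/testing/rulegen_benchmark.py` may differ.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_resources(resources: dict[str, Path]) -> dict[str, dict[str, str]]:
    """Build a hypothetical resources block: name -> {path, sha256}."""
    return {
        name: {"path": str(path), "sha256": sha256_of_file(path)}
        for name, path in sorted(resources.items())
    }
```

Comparing the recorded digest with a freshly computed one before a bundle replay is one way to confirm that a cross-machine rerun uses byte-identical resources.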
Rulegen Auto Audit Wrapper
- Status: `implemented`, `verified`, `default-on=no`
- Last documented checkpoint: `2026-03-11`
- Last verified: `2026-03-11` CLI inspection
- Default behavior:
  - Optional wrapper for touched-pair rulegen audits.
  - Preserves the canonical benchmark -> quality gate -> triage sequence by calling `rulegen_pair_audit_cycle.py`.
  - Adds dated artifacts, `*_latest` alias updates, and run manifests.
- Evidence:
  - `docs/developer/ai_workflow.md`
  - `docs/developer/genai_workflow_architecture.md`
  - `scripts/testing/rulegen_auto_audit.py`
  - `scripts/testing/rulegen_pair_audit_cycle.py`
- Known gaps:
  - Pair inference is heuristic and should not replace explicit `--pairs` when the touched scope is ambiguous.
  - Wrapper coverage is currently specific to the rulegen quality loop and not yet mirrored for SRS quality work.
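The dated-artifact plus `*_latest` alias pattern can be sketched as follows. This is an illustration, not the wrapper's actual code: the function name `write_dated_artifact` is hypothetical, and the `<stem>_<YYYY-MM-DD>.json` naming is an assumption inferred from artifact names that appear elsewhere in this ledger.

```python
import shutil
from datetime import date
from pathlib import Path


def write_dated_artifact(
    out_dir: Path, stem: str, payload: str, run_date: date
) -> tuple[Path, Path]:
    """Write `<stem>_<YYYY-MM-DD>.json`, then refresh `<stem>_latest.json`."""
    out_dir.mkdir(parents=True, exist_ok=True)
    dated = out_dir / f"{stem}_{run_date.isoformat()}.json"
    latest = out_dir / f"{stem}_latest.json"
    dated.write_text(payload, encoding="utf-8")
    # Copy instead of symlinking so the alias also works on hosts
    # where symlinks need extra privileges.
    shutil.copyfile(dated, latest)
    return dated, latest
```

Keeping the dated file immutable and overwriting only the alias preserves artifact history while giving docs and gates one stable path to read.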
SRS Quality Harness
- Status: `implemented`, `verified`, `default-on=yes` for SRS scheduler/admission/publication workflow
- Last documented checkpoint: `2026-03-21` FSRS scheduler migration and journey artifact refresh
- Last verified: `2026-03-21` synthetic harness run + summary artifact
- Default behavior:
  - Use the synthetic harness for SRS scheduler, admission refresh, helper publication, set execution, and runtime-serving workflow changes.
  - Review scheduling is now FSRS-based.
  - Current harness covers bootstrap/publication/runtime diagnostics for `en-ja` and `en-de`, plus an `en-ja` feedback-cycle pause/resume scenario.
  - Human-facing summary is available from the JSON artifact.
- Evidence:
  - `AGENTS.md`
  - `docs/developer/ai_workflow.md`
  - `scripts/testing/srs_quality_harness.py`
  - `scripts/testing/srs_quality_summary.py`
  - `docs/test_outputs/srs_quality_latest.json`
  - `docs/test_outputs/srs_quality_summary_latest.md`
- Known gaps:
  - Coverage is synthetic and pair-limited; it does not yet grade pedagogical quality or real user data.
  - Current harness intentionally surfaces the due-aware publication mismatch as a warning, not a hard failure.
  - `es-en`/`en-es` SRS quality scenarios are not yet represented in the synthetic harness.
Kaikki en-es Compatibility Dictionary Pipeline
- Status: `implemented`, `verified`; `default-on=yes` for forward `wiktionary-es-en.sqlite` when present and for the `en-es` reverse-check path when `wiktionary-en-es.sqlite` is present
- Last documented checkpoint: `2026-03-23` reverse-source evaluation + dedicated EN->ES converter/catalog path
- Last verified: `2026-03-23` targeted converter/helper/adapter tests plus rebuilt Kaikki forward artifact benchmark and Kaikki/Kaikki reverse-enabled `en-es` comparison lane
- Default behavior:
  - App language-pack catalog now includes a pair-specific `wiktionary-es-en` pack sourced from the English-edition Kaikki raw dump.
  - App language-pack catalog also includes a dedicated `wiktionary-en-es` Kaikki pack for EN->ES reverse-check evaluation.
  - Download flow now supports `download + convert + auto-link` for this pack, producing a compatibility SQLite artifact rather than exposing raw JSONL to runtime.
  - `en-es` pair resource resolution now prefers `wiktionary-es-en.sqlite` when present in the language-packs dir.
  - The normalized runtime contract stays aligned with the existing dictionary loader surface: `entries(headword, headword_lc, translation, translation_lc, rank, pos, entry_ord, gloss_ord)`.
  - Converter preserves richer Kaikki metadata in auxiliary SQLite tables for later ranking/synonym work, and the reverse converter additionally preserves translation-box metadata in `translation_meta`.
- Evidence:
  - `docs/language_pairs/kaikki_en_es_integration_plan.md`
  - `docs/language_pairs/language_pack_urls.txt`
  - `docs/language_pairs/lp_resource_requirements.md`
  - `docs/language_pairs/data_source_licensing_and_distribution.md`
  - `apps/gui/src/language_packs_catalog.py`
  - `apps/gui/src/language_packs.py`
  - `apps/gui/src/settings_language_packs.py`
  - `apps/gui/src/settings_language_packs_path_mixin.py`
  - `core/lexishift_core/resources/kaikki_sqlite.py`
  - `scripts/data/convert_kaikki_glosses_to_sqlite.py`
  - `scripts/data/convert_kaikki_es_en_to_sqlite.py`
  - `scripts/data/convert_kaikki_translations_to_sqlite.py`
  - `scripts/data/convert_kaikki_en_es_to_sqlite.py`
  - `core/lexishift_core/helper/lp_capabilities.py`
  - `core/lexishift_core/pos/normalization.py`
  - `core/lexishift_core/rulegen/adapters.py`
  - `core/lexishift_core/rulegen/pairs/en_es.py`
  - `core/tests/resources/test_kaikki_sqlite_conversion.py`
  - `core/tests/helper/test_lp_capabilities.py`
  - `core/tests/pos/test_pos_normalization.py`
  - `core/tests/rulegen/test_rulegen_adapters.py`
  - `docs/test_outputs/rulegen_benchmark_en_es_kaikki_latest.json`
  - `docs/test_outputs/rulegen_benchmark_triage_en_es_kaikki_latest.json`
  - `docs/test_outputs/rulegen_benchmark_en_es_kaikki_bidir_reverse_latest.json`
  - `docs/test_outputs/rulegen_benchmark_triage_en_es_kaikki_bidir_reverse_latest.json`
- Known gaps:
  - `en-es` quality gate remains red in the current workspace even after the Kaikki forward ordering fix; further sense-policy and reverse-check work is still required.
  - The reverse Kaikki source decision is documented, the EN->ES converter exists, and the first reverse-enabled Kaikki/Kaikki lane improved `en-es` `top1` to `81.25%`, but the remaining failure classes still need review before promoting the same artifact to the general `es-en` forward path.
  - Synonym extraction from Kaikki metadata is still deferred.
  - Bulk-rules GUI selection is not yet wired to use the new Kaikki pack id.
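The normalized `entries(...)` contract listed under default behavior can be exercised with a small in-memory sketch. Column names come from this ledger; the column types, the index, the `lookup` helper, and the sample rows are assumptions made for illustration and are not taken from `core/lexishift_core/resources/kaikki_sqlite.py`.

```python
import sqlite3

# Column names follow the documented contract; types and index are assumed.
SCHEMA = """
CREATE TABLE entries (
    headword TEXT NOT NULL,
    headword_lc TEXT NOT NULL,
    translation TEXT NOT NULL,
    translation_lc TEXT NOT NULL,
    rank INTEGER,
    pos TEXT,
    entry_ord INTEGER,
    gloss_ord INTEGER
);
CREATE INDEX idx_entries_headword_lc ON entries(headword_lc);
"""


def lookup(conn: sqlite3.Connection, word: str) -> list[tuple[str, str]]:
    """Return (translation, pos) rows for a headword in entry/gloss order."""
    rows = conn.execute(
        "SELECT translation, pos FROM entries "
        "WHERE headword_lc = ? ORDER BY entry_ord, gloss_ord",
        (word.lower(),),
    )
    return rows.fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
# Made-up sample rows, inserted out of order on purpose.
conn.executemany(
    "INSERT INTO entries VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [
        ("Cuadro", "cuadro", "square", "square", 2, "noun", 0, 1),
        ("Cuadro", "cuadro", "painting", "painting", 1, "noun", 0, 0),
    ],
)
```

Querying through `headword_lc` keeps lookups case-insensitive without relying on SQLite collation settings, which is presumably why the contract carries lowercase twins of both text columns.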
SRS Journey E2E Harness
- Status: `implemented`, `verified`; `default-on=no`
- Last documented checkpoint: `2026-03-21` FSRS-backed journey artifacts for deterministic, synthetic-real, and installed-resource `en-ja` + `en-es` lanes
- Last verified: `2026-03-21` deterministic `en-ja` + `en-es` core and edge journey harness runs, synthetic-resource real-publication lanes, installed-resource `en-ja` + `en-es` runs, Markdown summaries, and interactive HTML review artifacts
- Default behavior:
  - Deterministic `en-ja` and `en-es` core and edge journey lanes plus matching real-publication lanes are available as analysis-first SRS E2E harnesses, but they are not yet part of the required default SRS workflow loop in `AGENTS.md`.
  - The core lane captures item-level admitted `S`, due `D`, and published `P` sets across bootstrap, refresh, and fade/stick phases.
  - Journey JSON now includes bootstrap candidate audits, refresh candidate ranking audits, and richer per-item state fields such as confidence, due rank, and lexical previews for retroactive pedagogical review.
  - The edge lane captures duplicate-feedback and exposure-only behavior with the same item-level reporting contract.
  - The real-publication lane keeps deterministic clocks/resources, uses the actual seed-builder plus helper/rulegen publication path, and now holds complete due publication for the current `en-ja` and `en-es` scenarios.
  - Separate installed-resource review lanes now stage the user’s local frequency/dictionary packs into an isolated temp helper root, assign cohorts from actual admitted lemmas, and surface real-data pedagogical flow without mutating the live helper state.
  - Interactive HTML playback artifacts now provide step-by-step review with phase controls, admission rationale tables, and a sticky profile-state panel.
  - Current contract mode defaults to observation: publication broader than the due subset is surfaced as a warning rather than a hard failure.
- Evidence:
  - `docs/srs/srs_journey_harness_workstream.md`
  - `scripts/testing/srs_journey_harness.py`
  - `scripts/testing/srs_journey_summary.py`
  - `scripts/testing/srs_journey_html.py`
  - `docs/test_outputs/srs_journey/srs_journey_en_ja_latest.json`
  - `docs/test_outputs/srs_journey/srs_journey_en_ja_latest.md`
  - `docs/test_outputs/srs_journey/srs_journey_en_ja_latest.html`
  - `docs/test_outputs/srs_journey/srs_journey_en_ja_edge_latest.json`
  - `docs/test_outputs/srs_journey/srs_journey_en_ja_edge_latest.md`
  - `docs/test_outputs/srs_journey/srs_journey_en_ja_edge_latest.html`
  - `docs/test_outputs/srs_journey/srs_journey_en_ja_real_latest.json`
  - `docs/test_outputs/srs_journey/srs_journey_en_ja_real_latest.md`
  - `docs/test_outputs/srs_journey/srs_journey_en_ja_real_latest.html`
  - `docs/test_outputs/srs_journey/srs_journey_en_es_latest.json`
  - `docs/test_outputs/srs_journey/srs_journey_en_es_latest.md`
  - `docs/test_outputs/srs_journey/srs_journey_en_es_latest.html`
  - `docs/test_outputs/srs_journey/srs_journey_en_es_edge_latest.json`
  - `docs/test_outputs/srs_journey/srs_journey_en_es_edge_latest.md`
  - `docs/test_outputs/srs_journey/srs_journey_en_es_edge_latest.html`
  - `docs/test_outputs/srs_journey/srs_journey_en_es_real_latest.json`
  - `docs/test_outputs/srs_journey/srs_journey_en_es_real_latest.md`
  - `docs/test_outputs/srs_journey/srs_journey_en_es_real_latest.html`
  - `docs/test_outputs/srs_journey/srs_journey_en_ja_installed_latest.json`
  - `docs/test_outputs/srs_journey/srs_journey_en_ja_installed_latest.md`
  - `docs/test_outputs/srs_journey/srs_journey_en_ja_installed_latest.html`
  - `docs/test_outputs/srs_journey/srs_journey_en_es_installed_latest.json`
  - `docs/test_outputs/srs_journey/srs_journey_en_es_installed_latest.md`
  - `docs/test_outputs/srs_journey/srs_journey_en_es_installed_latest.html`
- Known gaps:
  - `en-de` extension is still pending.
  - The deterministic and synthetic-resource real-publication lanes are still useful regression surfaces, but installed-resource review currently depends on local data-pack availability and is not yet part of the default required workflow loop.
  - The due-aware publication contract remains unresolved; the harness currently records the mismatch instead of enforcing it.
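The observation-mode contract (published `P` broader than due `D` is a warning, not a failure) can be sketched as below. The names `ContractResult` and `check_due_publication`, and the existence of a `"strict"` mode, are assumptions used only to show the warning-versus-error split; they do not describe the harness API.

```python
from dataclasses import dataclass, field


@dataclass
class ContractResult:
    ok: bool = True
    warnings: list[str] = field(default_factory=list)
    errors: list[str] = field(default_factory=list)


def check_due_publication(
    due: set[str], published: set[str], mode: str = "observation"
) -> ContractResult:
    """Flag items published beyond the due subset.

    In "observation" mode the mismatch is recorded as a warning.
    The "strict" mode is hypothetical and turns it into an error.
    """
    result = ContractResult()
    extra = sorted(published - due)
    if extra:
        message = f"published beyond due subset: {extra}"
        if mode == "observation":
            result.warnings.append(message)
        else:
            result.errors.append(message)
    result.ok = not result.errors
    return result
```

Under these assumptions, promoting the contract from observation to enforcement would be a mode change rather than a harness rewrite, since the same set difference feeds both outcomes.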
Development Workflow Safeties
- Status: `implemented`, `default-on`, `verified`
- Last documented checkpoint: `2026-03-17` canonical-doc metadata enforcement + changed-scope doc-reference expansion + health warning-delta gating
- Last verified: `2026-03-21` local `check:state`, `check:report`, `check:summary`, `check:style:report`, `check:style:summary`, and `health:project:report`
- Default behavior:
  - `npm --prefix scripts run check` is the stable non-mutating repo safety command.
  - `npm --prefix scripts run check` now includes the strict Windows parity audit, so parity regressions fail the default local safety gate and pre-push hook.
  - `npm --prefix scripts run check` now includes strict repo-wide Ruff lint/format checks because the repo-wide style baseline is clean.
  - `npm --prefix scripts run check:changed` is the preferred branch-scope workflow command.
  - `npm --prefix scripts run check:changed` now records both total changed files and substantive changed files, and uses the substantive set when inferring heavier quality loops such as rulegen audit; Python uses AST comparison, JSON uses parsed equality, and Markdown/text uses whitespace-normalized comparison.
  - `npm --prefix scripts run check:docs` now validates top metadata (`Status`, `Role`, `Last updated`) plus referenced repo paths for canonical routing/policy docs.
  - `npm --prefix scripts run check:changed` now reruns the canonical doc integrity audit when canonical docs change or when referenced source files under `apps/`, `core/`, `scripts/`, `.github/`, or canonical root files change materially.
  - `npm --prefix scripts run health:project:changed` now blocks new/regressed warning debt alongside new/regressed violation debt.
  - `npm --prefix scripts run build` is the local build smoke for maintained build surfaces.
  - `npm --prefix scripts run build:report` is the full build contract and now verifies expected BetterDiscord / GUI artifacts in the report payload.
  - Hosted macOS `build:report` keeps the full GUI bundle validation path; hosted Windows `build:report` now uses the full GUI build plus artifact verification, while the strict Windows parity audit remains the dedicated Windows-specific validation gate.
  - Hosted CI now runs both the full macOS `build:report` path and the explicit Ubuntu `build:ci:report` partial path.
  - Python-backed npm workflow commands now resolve their interpreter through `scripts/dev/run_python.js` so `check`/`build`/audit entrypoints remain usable on Windows hosts.
  - `npm --prefix scripts run build:ci` / `build:ci:report` keep the same build workflow on unsupported hosts while recording explicit GUI-validation skips.
  - `npm --prefix scripts run check:style` is the standalone repo-wide style loop.
  - `npm --prefix scripts run check:style:report` and `check:style:summary` publish the current repo-wide Ruff style state as JSON and Markdown artifacts.
  - `npm --prefix scripts run check:state` audits the feature-state ledger for required fields, dated checkpoints, evidence paths, and transition-aware updates relative to `HEAD`.
  - `npm --prefix scripts run check:report`, `check:changed:report`, and `build:report` emit machine-readable JSON artifacts for automation.
  - Failed `check`/`build` commands now record stdout/stderr tail lines and missing-artifact details in the JSON reports so hosted CI failures remain inspectable from artifacts and summaries.
  - `npm --prefix scripts run check:summary` renders a Markdown summary from the latest workflow reports and now surfaces first-failure detail tails when present.
  - Hosted CI now lets report-producing steps continue long enough to upload summaries/artifacts, then fails the job via explicit JSON-based gate steps.
  - Hosted Ubuntu repo-safety now uses `npm --prefix scripts run check:report:ci`, which skips the redundant Windows parity audit; dedicated Windows parity/build jobs remain responsible for that surface.
  - Hosted repo-safety still renders the latest rulegen benchmark/gate/triage summaries, but the known-red rulegen artifact no longer blocks the generic repo-safety job.
  - `npm --prefix scripts run hooks:install` installs both `pre-commit` and `pre-push`; the pre-push hook mirrors `npm --prefix scripts run check`.
  - `pre-commit` now runs repo-wide Ruff lint and Ruff format before commit, while `pre-push` keeps the full repo-safety gate.
- Evidence:
  - `scripts/dev/feature_state_audit.py`
  - `scripts/dev/dev_workflow_check.py`
  - `scripts/dev/dev_workflow_changed_check.py`
  - `scripts/dev/dev_workflow_build.py`
  - `scripts/dev/dev_workflow_style_check.py`
  - `scripts/dev/dev_workflow_style_summary.py`
  - `scripts/dev/check_doc_references.py`
  - `scripts/dev/check_project_health.js`
  - `scripts/dev/project_health_rules.js`
  - `scripts/dev/ci_report_gate.py`
  - `scripts/dev/run_python.js`
  - `apps/betterdiscord-plugin/build_plugin.js`
  - `.pre-commit-config.yaml`
  - `.github/workflows/ci.yml`
  - `requirements-build.txt`
  - `scripts/package.json`
  - `docs/test_outputs/dev_workflow/feature_state_audit_latest.json`
  - `docs/test_outputs/dev_workflow/doc_references_latest.json`
  - `docs/test_outputs/dev_workflow/check_latest.json`
  - `docs/test_outputs/dev_workflow/check_changed_latest.json`
  - `docs/test_outputs/dev_workflow/build_latest.json`
  - `docs/test_outputs/dev_workflow/build_ci_latest.json`
  - `docs/test_outputs/dev_workflow/summary_latest.md`
  - `docs/test_outputs/dev_workflow/style_latest.json`
  - `docs/test_outputs/dev_workflow/style_summary_latest.md`
  - `docs/test_outputs/project_health/project_health_latest.json`
  - `docs/developer/documentation_governance.md`
  - `docs/developer/project_health_gate_structure.md`
  - `docs/developer/local_setup.md`
  - `docs/developer/build_and_release.md`
- Known gaps:
  - GUI packaging makes `build` materially slower than `check`.
  - Hosted build coverage is now macOS full, Windows full-build plus artifact verification with a separate strict parity gate, and Ubuntu CI-safe partial; Ubuntu remains the explicit non-GUI proof lane rather than full desktop packaging.
  - Canonical-doc metadata enforcement is currently limited to the canonical routing/policy layer, not every maintained doc in the repo.
  - Pre-commit and pre-push coverage are optional until contributors run `npm --prefix scripts run hooks:install`.
  - Branch-scope changed reports intentionally surface the whole branch delta, so long-running branches can report unrelated debt unless contributors use `check:changed:local` or `check:changed:staged`.
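The substantive-change rule described for `check:changed` (AST comparison for Python, parsed equality for JSON, whitespace-normalized comparison for Markdown/text) can be sketched as follows. The function name `is_substantive_change`, the suffix sets, and the fallback to raw comparison on parse errors are assumptions; `scripts/dev/dev_workflow_changed_check.py` may classify files differently.

```python
import ast
import json
from pathlib import PurePosixPath


def is_substantive_change(path: str, before: str, after: str) -> bool:
    """Return True when two file versions differ in more than formatting."""
    suffix = PurePosixPath(path).suffix.lower()
    try:
        if suffix == ".py":
            # ast.dump omits line/column attributes by default, so comments
            # and layout changes do not affect the comparison.
            return ast.dump(ast.parse(before)) != ast.dump(ast.parse(after))
        if suffix == ".json":
            # Parsed equality ignores key order and whitespace.
            return json.loads(before) != json.loads(after)
    except (SyntaxError, ValueError):
        # Unparseable content: assume raw comparison is the safest fallback.
        return before != after
    if suffix in {".md", ".txt"}:
        # Collapse all whitespace runs before comparing.
        return before.split() != after.split()
    return before != after
```

For example, `is_substantive_change("a.py", "x = 1  # note\n", "x=1\n")` returns `False`, while changing the literal to `2` returns `True`. Note that `str.split()` also collapses paragraph breaks, so this sketch is more lenient for Markdown than a line-aware normalizer would be.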
GitHub Pages Docs Deployment
- Status: `implemented`, `default-on`, `verified`
- Last documented checkpoint: `2026-03-13`
- Last verified: `2026-03-13` local `bundle exec jekyll build --trace` + hosted `pages/pages-build-deployment` success on `302bba5`
- Default behavior:
  - Repo-owned Pages workflow now lives in `.github/workflows/pages.yml`.
  - Pull requests touching `docs/**` run a build-only Pages validation job.
  - Pushes to `main` touching `docs/**` build and deploy the site through GitHub Actions.
  - Local parity command is `cd docs && bundle exec jekyll build --trace`.
- Evidence:
  - `.github/workflows/pages.yml`
  - `docs/runbooks/github_pages_setup.md`
  - `docs/Gemfile`
  - `docs/Gemfile.lock`
  - `docs/_config.yml`
  - `docs/developer/local_setup.md`
  - `docs/test_outputs/dev_workflow/github_pages_workflow_verification_latest.md`
- Known gaps:
  - Current workflow validates Jekyll build/deploy only; it does not yet run link checking or browser-level UI smoke tests for docs JavaScript.
Windows GUI Parity Audit
- Status: `implemented`, `verified`, `default-on`
- Last documented checkpoint: `2026-03-12`
- Last verified: `2026-03-12` parity audit rerun + repo-safety integration + changed-scope/CI workflow wiring review
- Default behavior:
  - `npm --prefix scripts run check` now runs the strict Windows parity audit as part of repo safety and pre-push.
  - `npm --prefix scripts run check:windows:parity` writes a machine-readable parity audit of Windows GUI/helper/build parity.
  - `npm --prefix scripts run check:windows:parity:summary` renders the current parity state into Markdown for human handoff.
  - Hosted CI now has a Windows full-build lane plus parity audit artifacts.
  - `npm --prefix scripts run check:changed` now runs the Windows parity audit automatically when parity-related files change.
  - Windows CI now uses the strict parity audit command so parity regressions fail the hosted workflow.
- Evidence:
  - `docs/developer/windows_gui_parity_workstream.md`
  - `scripts/dev/windows_parity_audit.py`
  - `scripts/dev/windows_parity_summary.py`
  - `apps/gui/src/frozen_layout.py`
  - `apps/gui/src/helper_installer.py`
  - `apps/gui/src/helper_ui.py`
  - `apps/gui/src/helper_tray.py`
  - `docs/architecture/native_messaging_design.md`
  - `docs/test_outputs/dev_workflow/windows_parity_latest.json`
  - `docs/test_outputs/dev_workflow/windows_parity_summary_latest.md`
  - `.github/workflows/ci.yml`
- Known gaps:
  - The parity audit is now a required workflow gate, but it is still not a complete release certification on its own.
  - Current browser coverage is limited to the supported GUI helper environments (`chrome`, `chromium`, `brave`).
Feature-State Evidence Audit
- Status: `implemented`, `default-on`, `verified`
- Last documented checkpoint: `2026-03-12`
- Last verified: `2026-03-12` local audit run + repo safety/base-ref integration
- Default behavior:
  - `scripts/dev/feature_state_audit.py` validates that feature entries include status, dated checkpoints, default behavior bullets, evidence bullets, and known gaps.
  - Evidence paths in `docs/developer/feature_state_matrix.md` must resolve on disk.
  - Repo safety now runs this audit directly against `HEAD`, pre-commit runs it when the feature ledger changes, and changed-scope workflow checks run it against the branch base when the ledger is touched.
- Evidence:
  - `scripts/dev/feature_state_audit.py`
  - `core/tests/dev/test_feature_state_audit.py`
  - `scripts/dev/dev_workflow_check.py`
  - `.pre-commit-config.yaml`
  - `docs/test_outputs/dev_workflow/feature_state_audit_latest.json`
- Known gaps:
  - The audit enforces structure and evidence existence, not semantic correctness of every status claim.
  - It does not yet require every status transition to update its verification date in the same commit.
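The two structural checks named above (required fields per entry, evidence paths resolving on disk) can be sketched in a few lines. `REQUIRED_FIELDS`, `missing_fields`, and `missing_evidence_paths` are hypothetical names, and the `- Field:` bullet matching is an assumption about the ledger layout rather than a description of `scripts/dev/feature_state_audit.py`.

```python
from pathlib import Path

# Field labels as they appear in this ledger's feature entries.
REQUIRED_FIELDS = (
    "Status",
    "Last documented checkpoint",
    "Last verified",
    "Default behavior",
    "Evidence",
    "Known gaps",
)


def missing_fields(entry_text: str) -> list[str]:
    """Return required field labels absent from one feature entry."""
    return [name for name in REQUIRED_FIELDS if f"- {name}:" not in entry_text]


def missing_evidence_paths(repo_root: Path, evidence: list[str]) -> list[str]:
    """Return evidence entries that do not exist under the repo root."""
    return [entry for entry in evidence if not (repo_root / entry).exists()]
```

Both helpers return lists rather than booleans so a report can name exactly what is missing, which matches the gap stated above: structure and existence are checkable, semantic correctness is not.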
Generic Gloss Demotion
- Status: `implemented`, `default-on`, `verified`
- Last documented checkpoint: `2026-02-27`
- Last verified: `2026-02-28` benchmark artifact review; `2026-03-21` code inspection after `en_ja` adapter/module rename
- Default behavior:
  - Active for current rulegen pairs through pair-specific demotion lists.
  - Tuned via `semantic_demotion_scale`.
- Evidence:
  - `docs/rulegen/rule_generation_technical.md`
  - `docs/rulegen/rulegen_congruity_implementation_plan.md`
  - `core/lexishift_core/rulegen/semantic_demotion.py`
  - `core/lexishift_core/rulegen/pairs/en_es.py`
  - `core/lexishift_core/rulegen/pairs/es_en.py`
  - `core/lexishift_core/rulegen/pairs/en_de.py`
  - `core/lexishift_core/rulegen/pairs/en_ja.py`
- Known gaps:
  - Heuristic demotion is conservative and does not replace sense-level disambiguation.
  - Current `en-es:madre` failure shows generic demotion alone is not sufficient.
Reverse-Check Scoring
- Status: `implemented`, `verified`, `default-on=no`
- Last documented checkpoint: `2026-03-26` exact-hit ambiguity + exact-hit specificity reverse signals with benchmark/probe harness exposure
- Last verified: `2026-03-26` targeted ranking/tuning/helper/harness tests, canonical `en-es` benchmark/gate/triage rerun over the expanded 48-config reverse sweep, reverse run-matrix refresh, and focused `cuadro` probe
- Default behavior:
  - Configurable and pair-aware for `en-es` and `es-en`.
  - Not yet promoted to default production tuning.
  - Reverse-check-specific evaluation now has a named `en-es` lane via `npm --prefix scripts run quality:rulegen:reverse:en-es`.
  - Parameter-set comparison is now tracked in `docs/test_outputs/rulegen_reverse_en_es_run_matrix_latest.md`.
  - Reverse scoring now also supports:
    - an exact-hit ambiguity penalty keyed off `reverse_check_total`
    - an additive exact-hit specificity bonus keyed off `reverse_check_total`
  - both signals are harness-exposed, but both are still off in the current canonical best run.
- Evidence:
  - `docs/rulegen/reverse_check_scoring_phase1.md`
  - `docs/rulegen/reverse_check_rollout_matrix.md`
  - `docs/rulegen/reverse_check_en_es_case_review_2026-03-13.md`
  - `docs/rulegen/reverse_check_en_es_aggressive_expansion_2026-03-13.md`
  - `docs/rulegen/reverse_check_en_es_failure_traits_2026-03-13.md`
  - `core/lexishift_core/rulegen/ranking.py`
  - `core/lexishift_core/rulegen/pairs/en_es.py`
  - `core/lexishift_core/rulegen/pairs/es_en.py`
  - `core/lexishift_core/rulegen/tuning.py`
  - `scripts/testing/rulegen_benchmark.py`
  - `docs/test_outputs/rulegen_benchmark_en_es_latest.md`
  - `docs/test_outputs/rulegen_benchmark_triage_latest.md`
  - `docs/test_outputs/rulegen_benchmark_en_es_reverse_far_hit_experiment_2026-03-13.json`
  - `docs/test_outputs/rulegen_benchmark_en_es_reverse_far_hit_experiment_2026-03-13.md`
  - `docs/test_outputs/rulegen_benchmark_triage_en_es_reverse_far_hit_experiment_2026-03-13.md`
  - `docs/test_outputs/rulegen_benchmark_en_es_reverse_latest.json`
  - `docs/test_outputs/rulegen_benchmark_en_es_reverse_latest.md`
  - `docs/test_outputs/rulegen_quality_gate_en_es_reverse_latest.json`
  - `docs/test_outputs/rulegen_benchmark_triage_en_es_reverse_latest.md`
  - `docs/test_outputs/rulegen_reverse_en_es_run_matrix_latest.md`
  - `docs/test_outputs/rulegen_benchmark_en_es_reverse_ambiguity_experiment_latest.json`
  - `docs/test_outputs/rulegen_benchmark_en_es_reverse_ambiguity_experiment_latest.md`
  - `docs/test_outputs/rulegen_probe_en_es_reverse_off_latest.json`
  - `docs/test_outputs/rulegen_probe_en_es_reverse_on_latest.json`
  - `docs/test_outputs/rulegen_probe_en_es_reverse_far_hit_experiment_2026-03-13.json`
  - `docs/test_outputs/rulegen_benchmark_en_es_latest.json`
  - `docs/test_outputs/rulegen_reverse_en_es_run_matrix_latest.md`
- Known gaps:
  - Only `en-es` and `es-en` are wired; `en-de` and `en-ja` have no reverse-check implementation.
  - No committed `es-en` benchmark/gate/triage artifact yet proves rollout maturity.
  - The canonical benchmark loop now sweeps both `rev=off` and `rev=on`, but `en-es` still remains red on top-1 accuracy and average-rule volume even after the repaired verb reverse normalization restored the best `rev=on` lane.
  - The current `en-es` reverse-enabled best run lifts `top3` to `98.25%`, but `top1` is still capped at `91.23%`; remaining work is now more about lexical choice than reverse plumbing.
  - The new exact-hit ambiguity penalty and exact-hit specificity bonus are both implemented and harness-exposed, but neither has beaten the existing best lane yet; current `cuadro` behavior is still more sensitive to miss/far-penalty tradeoffs and score clamping than to these exact-hit refinements alone.
  - `cuadro` still exposes a failure class that reverse evidence alone cannot separate, and `sacar` still needs phrase-policy work when the benchmark is judged on top-1 quality rather than only top-3 recall.
  - Current rollout is scoring-only, not strict candidate blocking.
Kaikki Provenance / Competition Scoring
- Status: `implemented`, `verified`, `default-on=no`
- Last documented checkpoint: `2026-03-27` provenance scoring with second benchmark-expansion pass and live Kaikki demotion now winning
- Last verified: `2026-03-27` targeted `en-es` provenance/adapter/benchmark tests, canonical `en-es` benchmark/gate/triage rerun over the expanded 57-case / 144-config sweep, and probe-path verification
- Default behavior:
  - `en-es` Kaikki candidates now support a sweepable additive provenance penalty:
    - `late_sense_clean_earlier_competition_penalty`
  - the signal is off unless the selected config sets a nonzero penalty
  - the current canonical best run now selects:
    - `kprov=0.10`
  - the signal is powered only by existing metadata already carried on candidates:
    - `target_provenance`
    - `gloss_provenance`
    - `sense_provenance`
    - `kaikki_policy_shadow`
  - benchmark and probe seams both expose it:
    - benchmark label: `kprov`
    - probe flag: `--kaikki-policy-late-sense-penalty`
- Evidence:
  - `docs/language_pairs/kaikki_en_es_integration_plan.md`
  - `docs/test_outputs/rulegen_benchmark_en_es_latest.json`
  - `docs/test_outputs/rulegen_benchmark_en_es_latest.md`
  - `docs/test_outputs/rulegen_benchmark_triage_latest.json`
  - `core/lexishift_core/rulegen/pairs/en_es.py`
  - `core/lexishift_core/rulegen/pairs/en_es_support.py`
  - `core/lexishift_core/rulegen/adapters.py`
  - `scripts/testing/rulegen_benchmark.py`
  - `scripts/testing/rulegen_probe_words.py`
  - `core/tests/rulegen/test_rulegen_en_es_kaikki_provenance.py`
  - `core/tests/rulegen/test_rulegen_adapters.py`
  - `core/tests/dev/test_rulegen_benchmark.py`
- Known gaps:
  - only the smallest provenance signal is live so far; richer provenance/competition features are still pending
  - the current signal is now selected together with live Kaikki demotion, but it still does not solve `cuadro` or the new slang-side failures
  - `en-de`, `en-ja`, and `es-en` do not yet have analogous provenance-scoring work
  - per-family Kaikki demotion strengths, gloss-decay shape exposure, and lexical short-phrase policy are still the next nearby sweep candidates
Trait-Conditioned Rulegen Profiles
- Status: `planned`; runtime routing not implemented or verified
- Last documented checkpoint: `2026-03-26`
- Last verified: `2026-03-26` planning review against current benchmark and Kaikki architecture
- Default behavior:
  - No runtime profile routing exists yet.
  - Current rulegen still uses one selected configuration per run rather than choosing profiles from runtime-computable target traits.
  - The intended future direction is to route among a small bank of named profiles using a shared feature extractor and benchmark-backed trait analysis.
- Evidence:
  - `docs/rulegen/trait_conditioned_rulegen_profiles.md`
  - `docs/rulegen/rule_generation_technical.md`
  - `docs/language_pairs/kaikki_en_es_integration_plan.md`
  - `scripts/testing/rulegen_benchmark.py`
  - `scripts/testing/rulegen_benchmark_presets.py`
  - `scripts/testing/rulegen_benchmark_bundle.py`
  - `core/lexishift_core/rulegen/pairs/en_es.py`
  - `core/lexishift_core/rulegen/kaikki_views.py`
  - `core/lexishift_core/rulegen/ranking.py`
- Known gaps:
  - There is no shared runtime trait extractor yet.
  - Benchmark artifacts do not yet emit per-case feature vectors.
  - No profile bank or interpretable router is implemented.
  - Current dataset size is still better suited to coarse directional experiments than fine-grained routed-policy learning.
  - Learner-stage-aware routing is only conceptual at this point and must stay separate from lexical trait inference.
POS Normalization
- Status: `implemented`, `default-on`, `verified`
- Last documented checkpoint: `2026-02-23`
- Last verified: `2026-02-23` phase-6 artifacts; `2026-03-11` code inspection
- Default behavior:
  - Seed extraction and word-package metadata carry raw and canonical POS.
  - Rulegen pair modules can consume normalized POS metadata.
- Evidence:
  - `docs/rulegen/pos_normalization_workstream.md`
  - `core/lexishift_core/pos/normalization.py`
  - `core/lexishift_core/srs/seed.py`
  - `core/lexishift_core/rulegen/pairs/pos_utils.py`
  - `docs/test_outputs/phase6_pos_inventory/phase6_pos_probe_2026-02-23_final.json`
  - `docs/test_outputs/phase6_pos_inventory/phase6_pos_inventory_2026-02-23_final.json`
- Known gaps:
  - Unknown POS inventory remains for `freq-de-default.sqlite` and `freq-ja-bccwj.sqlite`.
  - POS metadata is stronger than current downstream decision usage for both rulegen ranking and SRS growth.
SRS Set Planner Strategies
- Status:
  - `frequency_bootstrap`: `implemented`, `default-on`, `verified`
  - `profile_bootstrap`: `scaffolded`
  - `profile_growth`: `scaffolded`
  - `adaptive_refresh`: `scaffolded`
- Last documented checkpoint: `2026-02-23`
- Last verified: `2026-03-11` code inspection
- Default behavior:
  - Executable behavior remains frequency bootstrap.
  - Profile-aware strategies still fall back to planning-only or frequency-bootstrap execution.
- Evidence:
  - `docs/srs/srs_set_planning_technical.md`
  - `core/lexishift_core/srs/set_planner.py`
  - `core/lexishift_core/helper/use_cases/initialize_set.py`
- Known gaps:
  - Planner diagnostics are ahead of executable strategy diversity.
  - Pair policy defaults are currently near-identical across active pairs.
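The fallback behavior recorded in this entry can be illustrated with a small resolver. The status table mirrors the Status bullets above; `resolve_executable_strategy` is a hypothetical helper, and the sketch models only the frequency-bootstrap fallback, not the planning-only path, so it should not be read as the logic in `core/lexishift_core/srs/set_planner.py`.

```python
# Status values copied from this ledger entry.
STRATEGY_STATUS = {
    "frequency_bootstrap": "implemented",
    "profile_bootstrap": "scaffolded",
    "profile_growth": "scaffolded",
    "adaptive_refresh": "scaffolded",
}
FALLBACK = "frequency_bootstrap"


def resolve_executable_strategy(requested: str) -> tuple[str, bool]:
    """Return (strategy_to_execute, fell_back) for a requested strategy."""
    status = STRATEGY_STATUS.get(requested)
    if status is None:
        raise ValueError(f"unknown strategy: {requested}")
    if status == "implemented":
        return requested, False
    # Scaffolded strategies are not executable yet, so use the fallback.
    return FALLBACK, True
```

Returning the `fell_back` flag alongside the strategy lets diagnostics report when a requested profile-aware strategy was not the one actually executed.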
Due-Aware SRS Serving
- Status: `planned`; end-to-end implementation not verified
- Last documented checkpoint: `2026-02-23`
- Last verified: `2026-03-11` code inspection
- Default behavior:
  - Docs define due-set-driven serving.
  - Current helper publication and extension gate behavior appear to operate on admitted `S` items rather than a separately published due subset.
- Evidence:
  - `docs/srs/srs_hybrid_model_technical.md`
  - `core/lexishift_core/srs/scheduler.py`
  - `core/lexishift_core/helper/rulegen.py`
  - `apps/chrome-extension/shared/srs/srs_gate.js`
- Known gaps:
  - No explicit due-state artifact or due-aware helper ruleset publish path is currently tracked here.
  - This item should remain `planned` until helper publication and runtime gating are verified against due-state behavior.
Extension-Side Confidence Gating For Helper Rules
- Status: `planned`/unverified
- Last documented checkpoint: `2026-02-27` rulegen docs review
- Last verified: `2026-03-11` code inspection
- Default behavior:
  - Docs describe confidence-based runtime filtering.
  - Extension runtime path inspected on `2026-03-11` did not confirm a live helper-rule confidence filter.
- Evidence:
  - `docs/rulegen/rule_generation_technical.md`
  - `docs/reference/glossary.md`
  - `apps/chrome-extension/content/runtime/rules/active_rules_runtime.js`
  - `apps/chrome-extension/shared/srs/srs_gate.js`
- Known gaps:
  - Treat this as unresolved until a code path is identified and tested.
  - Do not mark confidence gating as shipped based on docs alone.
GenAI Workflow Architecture
- Status: `implemented`, `default-on`, `verified`
- Last documented checkpoint: `2026-03-11`
- Last verified: `2026-03-12`
- Default behavior:
  - Use the rulegen quality loop already defined in `AGENTS.md` and `docs/developer/ai_workflow.md`.
  - Use `docs/developer/genai_workflow_architecture.md` for agent roles, instance splitting, and harness policy.
  - Use `scripts/testing/rulegen_auto_audit.py` for dated plus latest rulegen audit runs when a change-aware wrapper is helpful.
- Evidence:
  - `docs/developer/genai_workflow_architecture.md`
  - `scripts/testing/rulegen_auto_audit.py`
  - `scripts/testing/rulegen_pair_audit_cycle.py`
- Known gaps:
  - Feature-state discipline is stronger now, but status transitions are not yet enforced against commit-scoped artifact diffs.
  - Hosted CI still uses an explicit CI-safe build mode rather than full macOS GUI validation.
Current State Mismatches To Preserve Explicitly
These are not accidental wording issues. Keep them explicit until code and docs converge.
- Reverse-check is implemented but not yet default-on.
- SRS docs define due-aware serving, but current end-to-end publish/gate behavior is not yet verified as due-aware.
- Docs mention runtime confidence filtering, but extension-side helper-rule confidence gating is not yet verified in code.
- Planner docs describe multiple strategies, but executable behavior is still dominated by frequency bootstrap.