Gemini multimodal judge
Uploads the audio file to Gemini via the Files API, then sends a versioned prompt together with a strict JSON response schema. Returns per-axis scores (overall, loudness, tonal_balance, intent_fidelity, hiss, loop_seam, pops_clicks), a list of findings, and a final verdict.
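To make the result shape concrete, here is a minimal validation sketch. The axis names, verdict values, 0-10 range, and finding fields are taken from this page; the top-level keys (`scores`, `findings`, `verdict`) and the function name are assumptions for illustration, not the actual schema.

```python
import json

AXES = ("overall", "loudness", "tonal_balance", "intent_fidelity",
        "hiss", "loop_seam", "pops_clicks")
VERDICTS = ("ship", "regenerate", "re_curate", "needs_review")
FINDING_KEYS = {"kind", "label", "time_range", "severity"}


def validate_judge_response(raw: str) -> dict:
    """Parse a judge response and check it against the documented shape.

    Assumed (hypothetical) layout:
    {"scores": {...}, "findings": [...], "verdict": "..."}.
    Raises ValueError on any mismatch.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("top level must be a JSON object")

    scores = data.get("scores")
    if not isinstance(scores, dict):
        raise ValueError("missing 'scores' object")
    for axis in AXES:
        value = scores.get(axis)
        # bool is a subclass of int, so reject it explicitly.
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or not 0 <= value <= 10:
            raise ValueError(f"score for {axis!r} must be a number in 0-10")

    findings = data.get("findings")
    if not isinstance(findings, list):
        raise ValueError("missing 'findings' list")
    for index, finding in enumerate(findings):
        if not isinstance(finding, dict) or not FINDING_KEYS.issubset(finding):
            raise ValueError(f"finding {index} lacks one of {sorted(FINDING_KEYS)}")

    if data.get("verdict") not in VERDICTS:
        raise ValueError(f"verdict must be one of {VERDICTS}")
    return data
```

Checking the parsed output again on the client side is a cheap safeguard even when the API enforces the schema, since range checks such as 0-10 may not be expressible in every schema dialect.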
How scoring works
- Probe actual audio duration via ffprobe (ignores catalog drift)
- Load + format the prompt (substitute track metadata + duration)
- Upload audio to Gemini Files API (auto-expires after 2 days)
- POST with response_schema enforcing structured JSON output
- Parse + validate
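The first two steps can be sketched with the standard library alone. This is an illustration, not the actual implementation: the function names, the `$placeholder` template syntax, and the `duration_s` placeholder name are assumptions. The upload and POST steps are left out because they depend on the client library in use.

```python
import subprocess
from string import Template


def parse_ffprobe_duration(stdout: str) -> float:
    """Convert ffprobe's bare duration output (e.g. "183.204000\n") to seconds."""
    duration = float(stdout.strip())  # raises ValueError on "N/A" or empty output
    if duration <= 0:
        raise ValueError(f"implausible duration: {duration}")
    return duration


def probe_duration(path: str) -> float:
    """Ask ffprobe for the container duration of the file, in seconds."""
    result = subprocess.run(
        ["ffprobe", "-v", "error",
         "-show_entries", "format=duration",
         "-of", "default=noprint_wrappers=1:nokey=1",
         path],
        capture_output=True, text=True, check=True,
    )
    return parse_ffprobe_duration(result.stdout)


def format_prompt(template_text: str, metadata: dict, duration_s: float) -> str:
    """Substitute track metadata and the probed duration into the prompt.

    Assumes $name placeholders; a missing key raises KeyError rather than
    silently sending an incomplete prompt.
    """
    values = dict(metadata, duration_s=f"{duration_s:.1f}")
    return Template(template_text).substitute(values)
```

Probing the file itself, rather than reading the duration from the catalog, means the prompt always describes the audio that is actually uploaded.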
Models
- gemini-2.5-pro (default): best quality; ~$0.01-0.02/call
- gemini-2.5-flash (fallback): ~10x cheaper; usable for re-judging
HTTP 503 responses are retried, falling back from Pro to Flash.
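One way such a retry-then-fallback loop could look is sketched below. The retry count, backoff values, the `ServiceUnavailable` exception, and the `call` callback are all hypothetical; the real routing logic may differ.

```python
import time
from typing import Callable, Sequence


class ServiceUnavailable(Exception):
    """Stand-in for an HTTP 503 from the API (hypothetical exception type)."""


def call_with_fallback(
    call: Callable[[str], dict],
    models: Sequence[str] = ("gemini-2.5-pro", "gemini-2.5-flash"),
    retries_per_model: int = 2,      # assumed value
    backoff_s: float = 2.0,          # assumed value
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[str, dict]:
    """Try each model in order; retry on 503, then move on to the next model.

    Returns (model_used, result). Raises RuntimeError if every attempt fails.
    """
    last_error = None
    for model in models:
        for attempt in range(retries_per_model):
            try:
                return model, call(model)
            except ServiceUnavailable as exc:
                last_error = exc
                sleep(backoff_s * (2 ** attempt))  # exponential backoff
    raise RuntimeError("all models returned 503") from last_error
```

Returning the model name alongside the result lets the caller record which model produced a verdict, which matters when Pro and Flash results are later compared or combined.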
API auth
Two paths, in order of preference:
- AI Studio API: GEMINI_API_KEY env var (1000 req/day free tier). Default mode (GEMINI_MODE=aistudio).
- Vertex AI: GCP_SERVICE_ACCOUNT_JSON env var or ~/aviana-server/secrets/gcp-vertex.json. Higher quota but paid per request. Use GEMINI_MODE=vertex.
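The selection between the two paths might be resolved roughly as follows. The environment variable names and the key file path come from the list above; the function name and return shape are assumptions, and the sketch treats GCP_SERVICE_ACCOUNT_JSON as an opaque value because this page does not say whether it holds a path or inline JSON.

```python
import os
from pathlib import Path
from typing import Mapping

DEFAULT_VERTEX_KEY = "~/aviana-server/secrets/gcp-vertex.json"


def resolve_auth(env: Mapping[str, str] = os.environ) -> dict:
    """Pick credentials based on GEMINI_MODE (defaults to AI Studio)."""
    mode = env.get("GEMINI_MODE", "aistudio")

    if mode == "aistudio":
        key = env.get("GEMINI_API_KEY")
        if not key:
            raise RuntimeError("GEMINI_API_KEY is not set")
        return {"mode": "aistudio", "api_key": key}

    if mode == "vertex":
        service_account = env.get("GCP_SERVICE_ACCOUNT_JSON")
        if service_account:
            # Opaque here: could be a path or inline JSON (not specified).
            return {"mode": "vertex", "service_account": service_account}
        key_file = Path(DEFAULT_VERTEX_KEY).expanduser()
        if not key_file.is_file():
            raise RuntimeError(f"no Vertex credentials found at {key_file}")
        return {"mode": "vertex", "service_account_file": str(key_file)}

    raise ValueError(f"unknown GEMINI_MODE: {mode!r}")
```

Failing loudly on a missing key or an unknown mode avoids silently switching from the free tier to a paid path.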
The prompt
Versioned (2026-05-14b current). Tells Gemini:
- It’s evaluating a sleep-audio track for a specific catalog entry
- Score 0-10 on 7 specific axes
- Emit findings as {kind, label, time_range, severity}
- Return a verdict: ship | regenerate | re_curate | needs_review
- Detailed rubric for each axis (e.g. loudness: -19 to -21 LUFS = 10)
Lives in tools/curate/validators/llm_judge_prompt.md.
Consensus
Never trust a single Gemini call for ship decisions. We’ve seen the same 639 Hz sine wave score 10/10, 2/10, and 2/10 across three runs. The empirical reliability floor is 3-of-5 consensus for synth content and 2-of-3 for organic content.
The regenerate_v2 + ab_studio pipelines both bake in 3-call consensus. For one-off validation in the UI, always run 3-5 times before acting on the result.
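The k-of-n rule above can be expressed as a small helper. This is a sketch under assumptions: the function name and the choice to return None when no verdict reaches the floor are illustrative, and the actual pipelines may aggregate differently (for example, on scores rather than verdicts).

```python
from collections import Counter
from typing import Optional, Sequence

# (agreeing calls required, total calls) per content type, from the text above.
THRESHOLDS = {"synth": (3, 5), "organic": (2, 3)}


def consensus_verdict(verdicts: Sequence[str], content_type: str) -> Optional[str]:
    """Return the verdict that reaches the k-of-n floor, or None if none does."""
    needed, total = THRESHOLDS[content_type]
    if len(verdicts) != total:
        raise ValueError(f"{content_type} consensus needs exactly {total} calls")
    verdict, count = Counter(verdicts).most_common(1)[0]
    return verdict if count >= needed else None
```

Because both thresholds are strict majorities, at most one verdict can qualify. When the helper returns None, one reasonable option is to treat the track as needs_review rather than acting on any single call.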
Common verdict patterns
| Verdict | What to do |
|---|---|
| ship | Promote (after consensus confirms) |
| regenerate | Re-run the underlying pipeline with adjusted params |
| re_curate | Replace from a different source (e.g. different Freesound ID) |
| needs_review | Severe finding; human listen required |
When the judge is wrong
The judge has biases:
- It rewards “richness” (harmonics, modulation) — pure sines lose
- It rewards “production polish” — DSP often reads as flat
- It penalises “obvious loop seam” — but some loops are intentionally perceptible (heartbeat)
If you see the judge consistently down-rate a category that listeners love, consider:
- Adding intent_fidelity_override for synth-pure
- Bumping prefer_legacy_delta for that category
- Using consensus thresholds instead of single-call
Runtime + cost
- 30-60 s per call (Gemini latency)
- ~$0.01-0.02 per call (Pro), $0.001-0.002 (Flash)
- Free tier: 1000 calls/day on AI Studio (gemini-2.5-pro)
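Since ship decisions need several calls per track, the per-call figures above multiply. The sketch below works through that arithmetic. It assumes calls run sequentially and that 503 retries are not counted; the function name is hypothetical.

```python
PRO_COST_USD = (0.01, 0.02)       # per call, from the figures above
FLASH_COST_USD = (0.001, 0.002)   # per call, from the figures above
LATENCY_S = (30, 60)              # per call
FREE_TIER_CALLS_PER_DAY = 1000


def consensus_budget(calls_per_track: int, cost_per_call=PRO_COST_USD) -> dict:
    """Estimate cost, runtime, and free-tier capacity for n-call consensus."""
    low, high = cost_per_call
    return {
        "cost_usd": (round(low * calls_per_track, 4),
                     round(high * calls_per_track, 4)),
        "sequential_runtime_s": (LATENCY_S[0] * calls_per_track,
                                 LATENCY_S[1] * calls_per_track),
        "free_tier_tracks_per_day": FREE_TIER_CALLS_PER_DAY // calls_per_track,
    }
```

For 3-call consensus on Pro this gives roughly $0.03-0.06 and 90-180 s per track, with about 333 tracks per day on the free tier. With 5 calls per track, the free tier covers 200 tracks per day.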