Gemini multimodal judge

Uploads the audio file to Gemini via the Files API, then sends a versioned prompt together with a strict JSON response schema. Returns per-axis scores (overall, loudness, tonal_balance, intent_fidelity, hiss, loop_seam, pops_clicks), a list of findings, and a final verdict.

How scoring works

Probe actual audio duration via ffprobe (ignores catalog drift)
Load + format the prompt (substitute track metadata + duration)
Upload audio to Gemini Files API (auto-expires after 2 days)
POST with response_schema enforcing structured JSON output
Parse + validate the returned JSON
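The first two steps can be sketched as follows. This is a minimal illustration, not the judge's actual code: the function names and the `$title` / `$duration_s` placeholder syntax are assumptions. Only the `ffprobe` flags are standard.

```python
import subprocess
from string import Template


def ffprobe_cmd(path: str) -> list[str]:
    """Build an ffprobe command that prints only the container duration in seconds."""
    return [
        "ffprobe", "-v", "error",
        "-show_entries", "format=duration",
        "-of", "default=noprint_wrappers=1:nokey=1",
        path,
    ]


def parse_duration(stdout: str) -> float:
    """Parse ffprobe's single-line output, e.g. '183.427000'."""
    return float(stdout.strip())


def probe_duration_seconds(path: str) -> float:
    """Measure the real duration instead of trusting the catalog value."""
    result = subprocess.run(
        ffprobe_cmd(path), capture_output=True, text=True, check=True
    )
    return parse_duration(result.stdout)


def format_prompt(template: str, metadata: dict[str, str], duration_s: float) -> str:
    """Substitute track metadata and the probed duration into the prompt.

    The $name placeholder syntax is an assumption; the real prompt file may differ.
    """
    return Template(template).substitute(metadata, duration_s=f"{duration_s:.1f}")
```

Keeping the parsing separate from the subprocess call makes the duration logic testable on machines without ffprobe installed.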

Models

gemini-2.5-pro (default) — best quality; ~$0.01-0.02/call
gemini-2.5-flash (fallback) — ~10x cheaper; usable for re-judging

On 503 responses the judge retries, falling back from Pro to Flash.
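One plausible reading of this behaviour is: retry the preferred model on a 503, then move to the next model in the list. The sketch below assumes that reading. `ServiceUnavailable`, `judge_with_fallback`, the retry count, and the backoff are all hypothetical and not taken from the judge's code.

```python
import time
from typing import Callable, Sequence

# Preference order from the model list above.
MODELS = ("gemini-2.5-pro", "gemini-2.5-flash")


class ServiceUnavailable(Exception):
    """Hypothetical stand-in for an HTTP 503 from the API."""


def judge_with_fallback(
    call: Callable[[str], dict],
    models: Sequence[str] = MODELS,
    retries_per_model: int = 2,  # assumed value
    backoff_s: float = 0.0,      # assumed; a real client would likely wait longer
) -> dict:
    """Try each model in order, retrying 503s before falling back to the next."""
    last_error = None
    for model in models:
        for attempt in range(retries_per_model):
            try:
                return call(model)
            except ServiceUnavailable as exc:
                last_error = exc
                time.sleep(backoff_s * (attempt + 1))
    raise RuntimeError("all models returned 503") from last_error
```

Here `call` is whatever function performs the actual request for a given model name, so the fallback logic stays independent of the HTTP client.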

API auth

Two paths, in order of preference:

AI Studio API — GEMINI_API_KEY env var (1000 req/day free tier). Default mode (GEMINI_MODE=aistudio).
Vertex AI — GCP_SERVICE_ACCOUNT_JSON env var or ~/aviana-server/secrets/gcp-vertex.json. Higher quota but paid per request. Use GEMINI_MODE=vertex.
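The selection between the two paths might look like the sketch below. The environment variable names and the default key path come from the list above. The function name, the returned dict shape, and the error handling are assumptions, and the source does not say whether `GCP_SERVICE_ACCOUNT_JSON` holds inline JSON or a file path, so the value is passed through untouched.

```python
import os
from pathlib import Path
from typing import Mapping

DEFAULT_VERTEX_KEY = Path("~/aviana-server/secrets/gcp-vertex.json")


def resolve_auth(env: Mapping[str, str] = os.environ) -> dict:
    """Pick credentials based on GEMINI_MODE (defaults to 'aistudio')."""
    mode = env.get("GEMINI_MODE", "aistudio")

    if mode == "aistudio":
        key = env.get("GEMINI_API_KEY")
        if not key:
            raise RuntimeError("GEMINI_API_KEY is not set")
        return {"mode": mode, "api_key": key}

    if mode == "vertex":
        # Assumption: the env var takes priority over the on-disk key file.
        inline = env.get("GCP_SERVICE_ACCOUNT_JSON")
        if inline:
            return {"mode": mode, "service_account": inline}
        return {
            "mode": mode,
            "service_account_file": str(DEFAULT_VERTEX_KEY.expanduser()),
        }

    raise ValueError(f"unknown GEMINI_MODE: {mode!r}")
```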

The prompt

Versioned; the current version is 2026-05-14b. Tells Gemini:

It’s evaluating a sleep-audio track for a specific catalog entry
Score 0-10 on 7 specific axes
Emit findings as {kind, label, time_range, severity}
Return a verdict: ship | regenerate | re_curate | needs_review
Apply the detailed rubric for each axis (e.g. loudness: -19 to -21 LUFS scores 10)

Lives in tools/curate/validators/llm_judge_prompt.md.
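A validation pass over the parsed response could check the constraints the prompt states. The axis names, finding keys, and verdict values below come from this page. The top-level keys (`scores`, `findings`, `verdict`) and the function name are assumptions about the response shape, not the real schema.

```python
AXES = (
    "overall", "loudness", "tonal_balance", "intent_fidelity",
    "hiss", "loop_seam", "pops_clicks",
)
VERDICTS = {"ship", "regenerate", "re_curate", "needs_review"}
FINDING_KEYS = {"kind", "label", "time_range", "severity"}


def validate_judgement(payload: dict) -> list[str]:
    """Return a list of problems; an empty list means the payload looks valid."""
    problems = []

    scores = payload.get("scores", {})
    for axis in AXES:
        value = scores.get(axis)
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or not 0 <= value <= 10:
            problems.append(f"score for {axis!r} missing or outside 0-10")

    for i, finding in enumerate(payload.get("findings", [])):
        missing = FINDING_KEYS - set(finding)
        if missing:
            problems.append(f"finding {i} missing keys: {sorted(missing)}")

    if payload.get("verdict") not in VERDICTS:
        problems.append(f"unknown verdict: {payload.get('verdict')!r}")

    return problems
```

Returning a list of problems rather than raising on the first one makes it easier to log everything that is wrong with a malformed response in one go.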

Consensus

Never trust a single Gemini call for ship decisions. We’ve seen the same 639Hz sine wave score 10/10, 2/10, and 2/10 across three runs. The empirical reliability floor is 3-of-5 consensus for synth content and 2-of-3 for organic content.

The regenerate_v2 + ab_studio pipelines both bake in 3-call consensus. For one-off validation in the UI, always run 3-5 times before acting on the result.
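A k-of-n consensus check over the verdicts from repeated calls can be sketched as below. The function name and the choice to return `None` when no verdict reaches the threshold are assumptions. How the real pipelines handle a split vote is not stated here.

```python
from collections import Counter
from typing import Optional


def consensus_verdict(verdicts: list[str], required: int) -> Optional[str]:
    """Return the verdict that at least `required` calls agree on, else None.

    `required` should be more than half the number of calls (3 of 5, 2 of 3)
    so that at most one verdict can reach the threshold.
    """
    if not verdicts:
        return None
    verdict, count = Counter(verdicts).most_common(1)[0]
    return verdict if count >= required else None
```

For example, `consensus_verdict(["ship", "regenerate", "ship"], required=2)` yields `"ship"`, while five calls split 2/2/1 yield `None` under a 3-of-5 threshold. Treating `None` as a case for human review would be one reasonable policy.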

Common verdict patterns

Verdict	What to do
`ship`	Promote (after consensus confirms)
`regenerate`	Re-run the underlying pipeline with adjusted params
`re_curate`	Replace from a different source (e.g. different Freesound ID)
`needs_review`	Severe finding — human listen required

When the judge is wrong

The judge has biases:

It rewards “richness” (harmonics, modulation) — pure sines lose
It rewards “production polish” — DSP often reads as flat
It penalises “obvious loop seam” — but some loops are intentionally perceptible (heartbeat)

If you see the judge consistently down-rate a category that listeners love, consider:

Adding intent_fidelity_override for synth-pure
Bumping prefer_legacy_delta for that category
Using consensus thresholds instead of single-call

Runtime + cost

30-60 s per call (Gemini latency)
~$0.01-0.02 per call (Pro), $0.001-0.002 (Flash)
Free tier: 1000 calls/day on AI Studio (gemini-2.5-pro)
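Since consensus multiplies the number of calls, it helps to work out the per-track budget. The sketch below uses only the figures above. It assumes the calls run sequentially (parallel calls would cut the latency) and that every call counts against the free-tier quota.

```python
PRO_COST = (0.01, 0.02)      # USD per call (low, high)
FLASH_COST = (0.001, 0.002)  # USD per call (low, high)
LATENCY_S = (30, 60)         # seconds per call (low, high)
FREE_CALLS_PER_DAY = 1000


def consensus_budget(calls: int, cost: tuple = PRO_COST) -> dict:
    """Estimate cost, sequential latency, and free-tier capacity per track."""
    return {
        "cost_usd": (round(calls * cost[0], 4), round(calls * cost[1], 4)),
        "latency_s": (calls * LATENCY_S[0], calls * LATENCY_S[1]),
        "tracks_per_free_day": FREE_CALLS_PER_DAY // calls,
    }
```

With 3-call consensus on Pro this gives $0.03-0.06 and 90-180 s per track, and about 333 tracks per day on the free tier. With 5 calls it is $0.05-0.10, 150-300 s, and 200 tracks per day.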
