
Gemini multimodal judge

Uploads the audio file to Gemini via the Files API, then sends a versioned prompt together with a strict JSON response schema. Returns per-axis scores (overall, loudness, tonal_balance, intent_fidelity, hiss, loop_seam, pops_clicks), a list of findings, and a final verdict.

How scoring works

  1. Probe actual audio duration via ffprobe (ignores catalog drift)
  2. Load + format the prompt (substitute track metadata + duration)
  3. Upload audio to Gemini Files API (auto-expires after 2 days)
  4. POST with response_schema enforcing structured JSON output
  5. Parse + validate
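
Step 1 can be sketched as follows. This is a minimal illustration, assuming ffprobe is on the PATH; the function names (build_ffprobe_cmd, parse_duration, probe_duration) are hypothetical and not taken from the pipeline. The ffprobe flags print only the container duration in seconds.

```python
import subprocess


def build_ffprobe_cmd(path: str) -> list[str]:
    """Build an ffprobe command that prints only the duration, in seconds."""
    return [
        "ffprobe",
        "-v", "error",
        "-show_entries", "format=duration",
        "-of", "default=noprint_wrappers=1:nokey=1",
        path,
    ]


def parse_duration(stdout: str) -> float:
    """Parse ffprobe's single-line output; reject missing or non-positive values."""
    text = stdout.strip()
    if not text or text == "N/A":
        raise ValueError("ffprobe reported no duration")
    seconds = float(text)
    if seconds <= 0:
        raise ValueError(f"implausible duration: {seconds}")
    return seconds


def probe_duration(path: str) -> float:
    """Run ffprobe on the file and return its measured duration in seconds."""
    result = subprocess.run(
        build_ffprobe_cmd(path), capture_output=True, text=True, check=True
    )
    return parse_duration(result.stdout)
```

Measuring the file directly means the duration substituted into the prompt reflects the audio itself, even if the catalog entry is stale.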

Models

  • gemini-2.5-pro (default) — best quality; ~$0.01-0.02/call
  • gemini-2.5-flash (fallback) — ~10x cheaper; usable for re-judging

503 responses are retried, falling back from Pro to Flash.
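
One way to express the retry-then-fallback behaviour is sketched below. The retry count, the backoff schedule, the ServiceUnavailable exception, and the call_model callable are all assumptions made for illustration; the real pipeline may differ on each point.

```python
import time


class ServiceUnavailable(Exception):
    """Hypothetical stand-in for an HTTP 503 from the API."""


MODELS = ("gemini-2.5-pro", "gemini-2.5-flash")  # default first, fallback second


def judge_with_fallback(call_model, retries_per_model=2, backoff_s=1.0,
                        sleep=time.sleep):
    """Try each model in order; retry on 503, then move to the next model.

    call_model(model_name) is a caller-supplied function that performs the
    request. Returns (model_used, result). Raises ServiceUnavailable if every
    attempt on every model fails.
    """
    last_error = None
    for model in MODELS:
        for attempt in range(retries_per_model):
            try:
                return model, call_model(model)
            except ServiceUnavailable as err:
                last_error = err
                sleep(backoff_s * (2 ** attempt))  # exponential backoff
    raise ServiceUnavailable("all models unavailable") from last_error
```

Returning the model name alongside the result lets the caller record which model produced a given score, which matters when comparing runs.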

API auth

Two paths, in order of preference:

  1. AI Studio API: GEMINI_API_KEY env var (1000 req/day free tier). Default mode (GEMINI_MODE=aistudio).
  2. Vertex AI: GCP_SERVICE_ACCOUNT_JSON env var or ~/aviana-server/secrets/gcp-vertex.json. Higher quota but paid per request. Use GEMINI_MODE=vertex.
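
The selection between the two paths might look like the sketch below. The environment variable names and the key-file path come from the list above; the resolve_auth function, its return shape, and the choice to prefer the env var over the file are assumptions. Whether GCP_SERVICE_ACCOUNT_JSON holds the JSON itself or a path is not specified here, so the sketch only records that it is set.

```python
import os
from pathlib import Path

DEFAULT_VERTEX_KEY = Path("~/aviana-server/secrets/gcp-vertex.json")


def resolve_auth(env=None):
    """Decide which auth path to use from GEMINI_MODE and related variables."""
    env = os.environ if env is None else env
    mode = env.get("GEMINI_MODE", "aistudio")  # aistudio is the default mode

    if mode == "aistudio":
        key = env.get("GEMINI_API_KEY")
        if not key:
            raise RuntimeError("GEMINI_MODE=aistudio requires GEMINI_API_KEY")
        return {"mode": "aistudio", "api_key": key}

    if mode == "vertex":
        if env.get("GCP_SERVICE_ACCOUNT_JSON"):
            # Assumption: the env var takes precedence over the key file.
            return {"mode": "vertex", "credentials_source": "env"}
        path = DEFAULT_VERTEX_KEY.expanduser()
        if path.is_file():
            return {"mode": "vertex", "credentials_source": str(path)}
        raise RuntimeError("GEMINI_MODE=vertex requires service account credentials")

    raise ValueError(f"unknown GEMINI_MODE: {mode!r}")
```

Failing early with a clear message keeps a missing key from surfacing later as an opaque API error.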

The prompt

Versioned (current version: 2026-05-14b). It tells Gemini:

  • That it is evaluating a sleep-audio track for a specific catalog entry
  • To score 0-10 on 7 specific axes
  • To emit findings as {kind, label, time_range, severity}
  • To return a verdict: ship | regenerate | re_curate | needs_review
  • How to apply a detailed rubric for each axis (e.g. loudness: -19 to -21 LUFS = 10)

Lives in tools/curate/validators/llm_judge_prompt.md.
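
A response matching the shape described above could be checked with something like the following sketch. The axis names, finding keys, and verdict values are taken from this page; the top-level keys "scores", "findings", and "verdict", and the function names, are assumptions about the schema rather than its actual definition.

```python
import json

AXES = ("overall", "loudness", "tonal_balance", "intent_fidelity",
        "hiss", "loop_seam", "pops_clicks")
VERDICTS = {"ship", "regenerate", "re_curate", "needs_review"}
FINDING_KEYS = {"kind", "label", "time_range", "severity"}


def validate_judgement(data: dict) -> dict:
    """Check scores, findings, and verdict; raise ValueError on any violation."""
    scores = data.get("scores", {})
    for axis in AXES:
        value = scores.get(axis)
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or not 0 <= value <= 10:
            raise ValueError(f"axis {axis!r} must be a number in 0-10, got {value!r}")

    for i, finding in enumerate(data.get("findings", [])):
        missing = FINDING_KEYS - finding.keys()
        if missing:
            raise ValueError(f"finding {i} is missing keys: {sorted(missing)}")

    if data.get("verdict") not in VERDICTS:
        raise ValueError(f"unknown verdict: {data.get('verdict')!r}")
    return data


def parse_judgement(raw: str) -> dict:
    """Parse the model's JSON text and validate it."""
    return validate_judgement(json.loads(raw))
```

Validating on the client side is still worthwhile when a response schema is enforced, since it catches out-of-range values that a structural schema may let through.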

Consensus

Never trust a single Gemini call for ship decisions. We’ve seen the same 639 Hz sine wave score 10/10, 2/10, and 2/10 across three runs. The empirical reliability floor is 3-of-5 consensus for synth content and 2-of-3 for organic content.

The regenerate_v2 + ab_studio pipelines both bake in 3-call consensus. For one-off validation in the UI, always run 3-5 times before acting on the result.
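
A k-of-n consensus check over repeated calls can be sketched as below. The function name and the choice to return None when no verdict reaches the threshold are assumptions; the pipelines named above may resolve disagreement differently.

```python
from collections import Counter


def consensus_verdict(verdicts, required):
    """Return the most common verdict if at least `required` calls agree, else None."""
    if not verdicts:
        return None
    verdict, count = Counter(verdicts).most_common(1)[0]
    return verdict if count >= required else None


# 2-of-3 for organic content: two calls agree, so the verdict stands.
organic = consensus_verdict(["ship", "regenerate", "regenerate"], required=2)

# 3-of-5 for synth content: no verdict reaches three votes, so there is no consensus.
synth = consensus_verdict(
    ["ship", "regenerate", "re_curate", "ship", "needs_review"], required=3
)
```

Here organic is "regenerate" and synth is None. Treating None as a trigger for a human listen is one reasonable policy, though that choice is not specified on this page.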

Common verdict patterns

Verdict       What to do
ship          Promote (after consensus confirms)
regenerate    Re-run the underlying pipeline with adjusted params
re_curate     Replace from a different source (e.g. different Freesound ID)
needs_review  Severe finding: human listen required

When the judge is wrong

The judge has biases:

  • It rewards “richness” (harmonics, modulation) — pure sines lose
  • It rewards “production polish” — DSP often reads as flat
  • It penalises “obvious loop seam” — but some loops are intentionally perceptible (heartbeat)

If you see the judge consistently down-rate a category that listeners love, consider:

  • Adding intent_fidelity_override for synth-pure
  • Bumping prefer_legacy_delta for that category
  • Using consensus thresholds instead of single-call

Runtime + cost

  • 30-60 s per call (Gemini latency)
  • ~$0.01-0.02 per call (Pro), $0.001-0.002 (Flash)
  • Free tier: 1000 calls/day on AI Studio (gemini-2.5-pro)
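
To see what these figures imply for a batch, here is a rough estimate as a sketch. The per-call cost, latency, and free-tier numbers are the ones listed above; the batch size of 500 tracks, the function name, and the assumption that calls run sequentially are illustrative only.

```python
import math


def batch_estimate(tracks, calls_per_track=3, cost_per_call=(0.01, 0.02),
                   seconds_per_call=(30, 60), free_calls_per_day=1000):
    """Estimate call count, cost range, sequential runtime, and free-tier days."""
    calls = tracks * calls_per_track
    return {
        "calls": calls,
        "cost_usd": (calls * cost_per_call[0], calls * cost_per_call[1]),
        "sequential_hours": (calls * seconds_per_call[0] / 3600,
                             calls * seconds_per_call[1] / 3600),
        "free_tier_days": math.ceil(calls / free_calls_per_day),
    }


estimate = batch_estimate(tracks=500)
```

With 3-call consensus on Pro, 500 tracks need 1500 calls: roughly $15-30 on the paid path, 12.5 to 25 hours if run one call at a time, or 2 days of the AI Studio free-tier quota.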