Intent drift validator (CLAP)

Uses LAION-CLAP (Contrastive Language-Audio Pretraining) to flag audio windows that drift far from the track’s catalog description. The check is zero-shot: it embeds the text "{name}. {description}" once, embeds each 2-second audio window, and computes the cosine similarity between each window and the text. Windows whose similarity is an outlier on the low side are flagged.

How scoring works

  1. Load CLAP HTSAT-base (~600 MB, cached at ~/.cache/clap)
  2. Embed text: f"{name}. {description}"
  3. Decode audio to mono 48 kHz
  4. Slice into 2-sec windows with 1-sec hop (50% overlap)
  5. Per window: embed audio, compute cosine similarity to text
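  6. Robust baseline: median + MAD of the per-window similarities, excluding the last baseline_trim_tail windows (skips the fade-out)
  7. Flag windows whose robust z-score against that baseline is ≤ -z_thresh (drift below baseline)
  8. Merge adjacent flagged windows (gap ≤ 1 s) into a single region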
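
Steps 4 to 8 can be sketched in a few lines of NumPy. This is a minimal illustration, not the validator's actual code: the function names are hypothetical, the CLAP embeddings are assumed to be computed elsewhere, and three details are assumptions because the source does not state them: the 1.4826 factor that scales MAD to a standard-deviation-like unit, the small floor that guards against a zero MAD, and the definition of "gap" as the time between the end of one flagged window and the start of the next.

```python
import numpy as np

# Assumption: scale MAD by 1.4826 so z is comparable to a standard z-score.
MAD_TO_SIGMA = 1.4826


def window_starts(duration_sec, window_sec=2.0, hop_sec=1.0):
    """Step 4: start times (seconds) of every full window in the track."""
    if duration_sec < window_sec:
        return np.array([], dtype=float)
    n = int((duration_sec - window_sec) // hop_sec) + 1
    return np.arange(n) * hop_sec


def cosine_similarity(a, b):
    """Step 5: cosine similarity between one audio and one text embedding."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def robust_z(sims, baseline_trim_tail=2):
    """Step 6: z-score of each window against a median/MAD baseline."""
    sims = np.asarray(sims, dtype=float)
    base = sims[:-baseline_trim_tail] if baseline_trim_tail > 0 else sims
    if base.size == 0:  # too few windows to trim; fall back to all of them
        base = sims
    med = np.median(base)
    mad = np.median(np.abs(base - med))
    scale = max(MAD_TO_SIGMA * mad, 1e-6)  # assumption: guard against zero MAD
    return (sims - med) / scale


def flag_drift(sims, z_thresh=4.0, baseline_trim_tail=2):
    """Step 7: indices of windows with z <= -z_thresh, plus all z-scores."""
    z = robust_z(sims, baseline_trim_tail)
    return np.flatnonzero(z <= -z_thresh), z


def merge_adjacent(starts, window_sec=2.0, max_gap_sec=1.0):
    """Step 8: merge flagged windows into (start, end) spans."""
    spans = []
    for s in sorted(float(x) for x in starts):
        e = s + window_sec
        if spans and s - spans[-1][1] <= max_gap_sec:
            spans[-1][1] = max(spans[-1][1], e)  # extend the open span
        else:
            spans.append([s, e])
    return [tuple(span) for span in spans]


# Usage with made-up similarities (one value per window, hop of 1 s).
sims = [0.50, 0.51, 0.49, 0.50, 0.52, 0.48, 0.50, 0.10, 0.12, 0.50, 0.50, 0.30]
flagged, z = flag_drift(sims)
regions = merge_adjacent(flagged * 1.0)
```

Only the sign-aware test z ≤ -z_thresh is applied, so windows that match the description unusually well are never flagged. Trimmed tail windows are excluded from the baseline but are still scored.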

Verdict mapping

z-score     Severity
-4 to -5    minor
-5 to -7    moderate
< -7        severe
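
The mapping can be written as a small function. This is a sketch under assumptions: the function name is hypothetical, and the table does not say which band owns the exact boundaries, so here -5 is treated as moderate and -7 as moderate (only values strictly below -7 are severe, as the table's "< -7" suggests).

```python
def severity(z, z_thresh=4.0):
    """Map a window's robust z-score to a severity label, or None if not flagged."""
    if z > -z_thresh:
        return None          # above the drift threshold: not flagged
    if z < -7.0:
        return "severe"
    if z <= -5.0:            # assumption: the -5 boundary counts as moderate
        return "moderate"
    return "minor"
```

With the default threshold, severity(-4.5) gives "minor" and severity(-3.9) gives None.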

Parameters

  • z_thresh (default 4.0, nullable) — drift threshold
  • z_thresh_per_category — per-category override
  • z_thresh_per_sound_class — per-sound-class override
  • window_sec (2.0) — embedding window length
  • hop_sec (1.0) — window hop
  • baseline_trim_tail (2) — exclude last N windows from baseline (skip fade-out)
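
One way the three threshold parameters could combine is sketched below. The source does not state the precedence, so the order used here (sound-class override, then category override, then the global z_thresh) is an assumption, as are the function name, the dict-shaped config, and the example category "nature". Which of the two override maps a key like synth-pure belongs in is also not stated; it is placed under z_thresh_per_sound_class purely for illustration. A resolved value of None is taken to mean the validator is disabled for that track.

```python
from typing import Optional


def resolve_z_thresh(config: dict,
                     category: Optional[str] = None,
                     sound_class: Optional[str] = None) -> Optional[float]:
    """Return the effective drift threshold, or None if the check is disabled."""
    per_class = config.get("z_thresh_per_sound_class") or {}
    per_category = config.get("z_thresh_per_category") or {}
    # Membership tests (not .get) so an explicit null override is honoured.
    if sound_class in per_class:
        return per_class[sound_class]
    if category in per_category:
        return per_category[category]
    return config.get("z_thresh", 4.0)


# Hypothetical config: a stricter category and a disabled sound class.
config = {
    "z_thresh": 4.0,
    "z_thresh_per_category": {"nature": 5.0},
    "z_thresh_per_sound_class": {"synth-pure": None},
}
```

Using a membership test rather than a truthiness check matters here: an override that is explicitly null must win over the default instead of falling through to it.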

Critical: when to disable

CLAP works WELL for organic content (purrs, bird calls, rain). It works POORLY for:

  • synth-pure — pure sines / square waves don’t match any text description well. CLAP often scores them low overall, which means every window is “drift”. Always set synth-pure: null.
  • synth-ambient — same, lower magnitude. Usually null.
  • cultural-instrument — singing bowls / gongs sometimes drift because their tail content is very different from their attack. Pattern-by-pattern.
  • affirmations — TTS voice doesn’t match a typical sleep-pad description. Always null.

Common gotchas

  • Track description in sounds.json matters! Vague descriptions (“ambient texture”) produce vague embeddings → more drift flagged. Concrete descriptions (“warm pad in C minor with subtle vibrato”) embed more precisely → fewer false positives.
  • Long tracks (>60 s) contain more windows, so they have more chances to produce a flagged window by sheer count.
  • CLAP is biased toward common AudioSet classes; obscure content (alien synth pads) may always score “drifty”.

Runtime

About 10 s per 90 s track. Loading the CLAP model takes 5-10 s the first time in each process; the loaded model is reused after that.