Intent drift validator (CLAP)
Uses LAION-CLAP (contrastive language-audio pre-training) to flag
audio windows that drift far from the track’s catalog description.
Zero-shot: embeds the text "{name}. {description}" and each 2-sec audio
window, then computes the cosine similarity between the two. Windows whose
similarity falls far below the track’s own baseline are flagged.
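As a rough illustration of the comparison step, the sketch below builds the text prompt and computes cosine similarity with NumPy. The function names `build_prompt` and `cosine_similarity` are hypothetical, and plain vectors stand in for real CLAP embeddings; this is not the validator's actual code.

```python
import numpy as np


def build_prompt(name: str, description: str) -> str:
    """Join catalog name and description into the text that gets embedded."""
    return f"{name}. {description}"


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two embedding vectors (1.0 = same direction)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy vectors standing in for a text embedding and one window embedding.
prompt = build_prompt("Rain", "Soft rain on leaves")
sim = cosine_similarity([1.0, 0.0], [2.0, 0.0])
```

Because cosine similarity ignores vector length, only the direction of the two embeddings matters, which is why a single text embedding can be compared against every window.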
How scoring works
- Load CLAP HTSAT-base (~600 MB, cached at ~/.cache/clap)
- Embed text: f"{name}. {description}"
- Decode audio to mono 48 kHz
- Slice into 2-sec windows with 1-sec hop (50% overlap)
- Per window: embed audio, compute cosine similarity to text
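- Robust baseline: median + MAD of the per-window similarities (the last baseline_trim_tail windows are excluded to skip fade-out)
- Flag windows whose robust z-score (distance from the median, scaled by MAD) is ≤ -z_thresh (drift below baseline)
- Merge adjacent flagged windows (gap ≤ 1 s)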
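The windowing, baseline, flagging, and merging steps above can be sketched with NumPy alone. This is a minimal sketch under stated assumptions, not the validator's implementation: the function names are hypothetical, the 1.4826 factor that makes MAD comparable to a standard deviation is an assumption (the source only says "median + MAD"), a trailing partial window is dropped, and the merge gap is measured from the end of one flagged window to the start of the next. The per-window similarities are toy values in place of real CLAP output.

```python
import numpy as np


def slice_windows(samples, sr=48_000, window_sec=2.0, hop_sec=1.0):
    """Split mono samples into overlapping windows; a trailing partial window is dropped."""
    win, hop = int(round(window_sec * sr)), int(round(hop_sec * sr))
    starts = range(0, len(samples) - win + 1, hop)
    return [samples[s:s + win] for s in starts]


def robust_z(sims, baseline_trim_tail=2):
    """Z-score each similarity against the median/MAD of the non-tail windows."""
    sims = np.asarray(sims, dtype=float)
    trim = baseline_trim_tail if 0 < baseline_trim_tail < len(sims) else 0
    base = sims[:-trim] if trim else sims
    med = np.median(base)
    mad = np.median(np.abs(base - med))
    scale = 1.4826 * mad  # assumption: normal-consistency scaling
    if scale == 0.0:
        scale = 1e-9  # assumption: guard against a perfectly flat baseline
    return (sims - med) / scale


def flag_drift(z, z_thresh=4.0, window_sec=2.0, hop_sec=1.0, max_gap_sec=1.0):
    """Return merged (start_sec, end_sec, min_z) regions where z <= -z_thresh."""
    regions = []
    for i in np.flatnonzero(np.asarray(z) <= -z_thresh):
        start = float(i * hop_sec)
        end = start + window_sec
        if regions and start - regions[-1][1] <= max_gap_sec:
            prev = regions[-1]
            regions[-1] = (prev[0], max(prev[1], end), min(prev[2], float(z[i])))
        else:
            regions.append((start, end, float(z[i])))
    return regions


# Toy similarities: windows 4 and 5 sit far below the rest of the track.
sims = [0.50, 0.51, 0.49, 0.50, 0.10, 0.12, 0.50, 0.51, 0.49, 0.50, 0.50, 0.48]
regions = flag_drift(robust_z(sims))
```

With these toy values the two low windows overlap in time, so they merge into a single region spanning 4 s to 7 s.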
Verdict mapping
| z-score | Severity |
|---|---|
| -4 to -5 | minor |
| -5 to -7 | moderate |
| < -7 | severe |
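The table can be read as a simple threshold ladder. The sketch below is one possible reading, with a hypothetical function name `severity`. The table does not say which side the shared boundary at -5 belongs to, so assigning exactly -5 to "moderate" is an assumption. Exactly -7 is "moderate" because the table lists "severe" as strictly below -7. The -4 floor corresponds to the default z_thresh.

```python
from typing import Optional


def severity(z: float, z_thresh: float = 4.0) -> Optional[str]:
    """Map a window's z-score to a severity label, or None if it is not flagged."""
    if z > -z_thresh:
        return None        # above the drift threshold: not flagged
    if z > -5.0:
        return "minor"     # from -z_thresh down to (but excluding) -5
    if z >= -7.0:
        return "moderate"  # assumption: exactly -5 counts as moderate
    return "severe"        # strictly below -7


labels = [severity(z) for z in (-3.9, -4.5, -6.0, -8.0)]
```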
Parameters
- z_thresh (default 4.0, nullable) — drift threshold
- z_thresh_per_category — per-category override
- z_thresh_per_sound_class — per-sound-class override
- window_sec (2.0) — embedding window length
- hop_sec (1.0) — window hop
- baseline_trim_tail (2) — exclude last N windows from baseline (skip fade-out)
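One way the threshold parameters could combine is sketched below. The lookup order (sound class first, then category, then the global z_thresh) is an assumption, as are the function name `resolve_z_thresh`, the params dictionary layout, and the example keys "gong" and "nature". A value of None is treated as "validator disabled", matching the nullable z_thresh.

```python
from typing import Optional

DEFAULT_Z_THRESH = 4.0


def resolve_z_thresh(params: dict,
                     category: Optional[str] = None,
                     sound_class: Optional[str] = None) -> Optional[float]:
    """Pick the effective threshold; None means drift checking is disabled."""
    per_class = params.get("z_thresh_per_sound_class") or {}
    per_category = params.get("z_thresh_per_category") or {}
    # Use `in` rather than .get() so an explicit None override is honored.
    if sound_class in per_class:
        return per_class[sound_class]
    if category in per_category:
        return per_category[category]
    return params.get("z_thresh", DEFAULT_Z_THRESH)


# Hypothetical configuration for illustration only.
params = {
    "z_thresh": 4.0,
    "z_thresh_per_category": {"synth-pure": None, "cultural-instrument": 5.0},
    "z_thresh_per_sound_class": {"gong": 6.0},
}
disabled = resolve_z_thresh(params, category="synth-pure")
fallback = resolve_z_thresh(params, category="nature")
```

Checking membership instead of truthiness matters here: an override set to None must win over the global default rather than fall through to it.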
Critical: when to disable
CLAP works WELL for organic content (purrs, bird calls, rain). It works POORLY for:
- synth-pure — pure sines / square waves don’t match any text description well. CLAP often scores them low overall, which means every window is “drift”. Always set synth-pure: null.
- synth-ambient — same, lower magnitude. Usually null.
- cultural-instrument — singing bowls / gongs sometimes drift because their tail content is very different from their attack. Decide pattern by pattern.
- affirmations — TTS voice doesn’t match a typical sleep-pad description. Always null.
Common gotchas
- Track description in sounds.json matters! Vague descriptions (“ambient texture”) produce vague embeddings → more drift flagged. Concrete descriptions (“warm pad in C minor with subtle vibrato”) embed more precisely → fewer false positives.
- Long tracks (>60 s) contain more windows, so they have more chances to produce a flagged outlier by sheer count.
- CLAP is biased toward common AudioSet classes; obscure content (alien synth pads) may always score “drifty”.
Runtime
~10 s per 90 s track. Loading CLAP takes 5-10 s the first time in each process; it is cached afterwards.