Intent drift validator (CLAP)

Uses LAION-CLAP (contrastive language-audio pre-training) to flag audio windows that drift far from the track’s catalog description. Zero-shot: embeds the text "{name}. {description}", embeds each 2-sec audio window, then computes the cosine similarity between the text embedding and each window embedding. Windows whose similarity falls far below the track’s own baseline are flagged as outliers.

How scoring works

Load CLAP HTSAT-base (~600 MB, cached at ~/.cache/clap)
Embed text: f"{name}. {description}"
Decode audio to mono 48 kHz
Slice into 2-sec windows with 1-sec hop (50% overlap)
Per window: embed audio, compute cosine similarity to text
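Robust baseline: median and MAD (median absolute deviation) of the per-window similarities, excluding the last baseline_trim_tail windows so the fade-out does not skew it
Flag windows whose robust z-score against that baseline satisfies z ≤ -z_thresh (drift below baseline)
Merge adjacent flagged windows (gap ≤ 1 s) into a single region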
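
The scoring steps after the embedding stage can be sketched with NumPy alone. This is a minimal illustration, not the validator's actual code: the function names are hypothetical, the 1.4826 factor that scales MAD to a standard-deviation equivalent is an assumption (the validator may divide by the raw MAD), and "gap" is assumed to mean the time between the end of one flagged region and the start of the next flagged window. The CLAP embeddings themselves are taken as given.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two 1-D embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def window_bounds(n_samples, sr=48_000, window_sec=2.0, hop_sec=1.0):
    """(start, end) sample ranges of every full window; a short tail is dropped."""
    win = int(round(window_sec * sr))
    hop = int(round(hop_sec * sr))
    return [(s, s + win) for s in range(0, n_samples - win + 1, hop)]


def robust_z(sims, baseline_trim_tail=2, mad_scale=1.4826):
    """Robust z-score of each window similarity against median/MAD.

    mad_scale is an ASSUMPTION (normal-consistency constant), not taken
    from the validator.
    """
    sims = np.asarray(sims, dtype=float)
    # Exclude the last N windows from the baseline (fade-out).
    base = sims[:-baseline_trim_tail] if baseline_trim_tail > 0 else sims
    if base.size == 0:  # track too short to trim
        base = sims
    med = np.median(base)
    scale = mad_scale * np.median(np.abs(base - med))
    if scale == 0:  # all baseline windows identical: no measurable drift
        return np.zeros_like(sims)
    return (sims - med) / scale


def flag_drift(z, z_thresh=4.0, window_sec=2.0, hop_sec=1.0, max_gap_sec=1.0):
    """Flag windows with z <= -z_thresh and merge near-adjacent ones.

    Returns (start_sec, end_sec, worst_z) tuples. z_thresh=None disables.
    """
    if z_thresh is None:
        return []
    z = np.asarray(z, dtype=float)
    regions = []
    for i in np.flatnonzero(z <= -z_thresh):
        start = float(i * hop_sec)
        end = start + window_sec
        if regions and start - regions[-1][1] <= max_gap_sec:
            regions[-1][1] = max(regions[-1][1], end)
            regions[-1][2] = min(regions[-1][2], float(z[i]))
        else:
            regions.append([start, end, float(z[i])])
    return [tuple(r) for r in regions]
```

In use, one would compute `cosine_similarity(text_emb, audio_emb)` for each window returned by `window_bounds`, pass the resulting list to `robust_z`, and hand the z-scores to `flag_drift`. Keeping the worst z-score per merged region makes it straightforward to assign a severity afterwards.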

Verdict mapping

z-score	Severity
-4 to -5	minor
-5 to -7	moderate
< -7	severe
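
The table above can be read as a small lookup function. The sketch below is illustrative only: the name `severity` is hypothetical, and because the table lists -5 in both the minor and moderate rows, assigning exactly -5 to "moderate" is an assumption. A z-score of exactly -7 is treated as moderate, since the severe row is strictly below -7.

```python
def severity(z, z_thresh=4.0):
    """Map a drift z-score to a severity label, or None if not flagged."""
    if z_thresh is None or z > -z_thresh:
        return None  # validator disabled, or window is not drift
    if z < -7:
        return "severe"
    if z <= -5:  # ASSUMPTION: the shared boundary -5 counts as moderate
        return "moderate"
    return "minor"
```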

Parameters

z_thresh (default 4.0, nullable) — drift threshold
z_thresh_per_category — per-category override
z_thresh_per_sound_class — per-sound-class override
window_sec (2.0) — embedding window length
hop_sec (1.0) — window hop
baseline_trim_tail (2) — exclude last N windows from baseline (skip fade-out)
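
How the three threshold parameters might combine is sketched below. Everything about the lookup is an assumption made for illustration: the function name `resolve_z_thresh`, the dict-shaped config, and the precedence order (sound class first, then category, then the global z_thresh) are not confirmed by this document. The point it shows is that an explicit null override has to be distinguished from a missing key, because null means "disabled" rather than "use the default".

```python
_MISSING = object()  # sentinel: key absent, as opposed to explicitly null


def resolve_z_thresh(cfg, category=None, sound_class=None):
    """Return the effective threshold, or None if the validator is disabled.

    ASSUMED precedence: per-sound-class > per-category > global z_thresh.
    """
    lookups = (
        ("z_thresh_per_sound_class", sound_class),
        ("z_thresh_per_category", category),
    )
    for table_name, key in lookups:
        table = cfg.get(table_name) or {}
        value = table.get(key, _MISSING)
        if value is not _MISSING:
            return value  # may be None, meaning "disabled"
    return cfg.get("z_thresh", 4.0)


# Hypothetical config; the keys under z_thresh_per_category are assumed
# to be category names.
example_cfg = {
    "z_thresh": 4.0,
    "z_thresh_per_category": {"synth-pure": None, "cultural-instrument": 5.0},
}
```

With this config, `resolve_z_thresh(example_cfg, category="synth-pure")` yields None (validator skipped), while a category with no override falls back to 4.0.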

Critical: when to disable

CLAP works WELL for organic content (purrs, bird calls, rain). It works POORLY for:

synth-pure — pure sines / square waves don’t match any text description well. CLAP often scores them low overall, which means every window is “drift”. Always set synth-pure: null.
synth-ambient — same, lower magnitude. Usually null.
cultural-instrument — singing bowls / gongs sometimes drift because their tail content is very different from their attack. Pattern-by-pattern.
affirmations — TTS voice doesn’t match a typical sleep-pad description. Always null.

Common gotchas

Track description in sounds.json matters! Vague descriptions (“ambient texture”) produce vague embeddings → more drift flagged. Concrete descriptions (“warm pad in C minor with subtle vibrato”) embed more precisely → fewer false positives.
Long tracks (>60 s) contain more windows, so they have more chances of a flagged window by sheer count.
CLAP is biased toward common AudioSet classes; obscure content (alien synth pads) may always score “drifty”.

Runtime

~10 s per 90 s track. Loading the CLAP model takes 5-10 s the first time in each process; it is cached after that.