[FFmpeg-devel] [PATCH v2 FFmpeg 0/20] Zero-Shot Classification Support for FFMPEG (CLIP and CLAP)

From: <m.kaindl0208@gmail.com>
To: <ffmpeg-devel@ffmpeg.org>
Subject: [FFmpeg-devel] [PATCH v2 FFmpeg 0/20] Zero-Shot Classification Support for FFMPEG (CLIP and CLAP)
Date: Mon, 10 Mar 2025 20:48:47 +0100
Message-ID: <003301db91f5$70868240$519386c0$@gmail.com>

Hi,

I'm excited to propose a series of patches adding support for modern zero-shot classification models to FFmpeg. These patches enable FFmpeg to leverage CLIP (Contrastive Language-Image Pre-training) and CLAP (Contrastive Language-Audio Pre-training) models for media classification.

Key Features:
- Zero-shot classification support: Use text prompts to classify media without training specific models
- Audio classification with CLAP: Extend FFmpeg's DNN capabilities to audio content
- Hierarchical classification: Group classification categories with a new category file format
- Stream classification averaging: New avgclass filter for averaging classification results
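
As a rough illustration of the zero-shot and averaging ideas listed above (a sketch only, not the code from the patches): a CLIP/CLAP-style model embeds the media and each text prompt into the same vector space, the prompts are scored by scaled cosine similarity followed by a softmax, and the per-frame results can then be averaged per label. The function names, the toy embeddings, and the scale of 100 are assumptions made for this example.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def zero_shot_probs(media_emb, prompt_embs, scale=100.0):
    """Softmax over scaled cosine similarities, one probability per prompt.

    `scale` is an assumed logit scale, not a value taken from the patches.
    """
    logits = [scale * cosine(media_emb, p) for p in prompt_embs.values()]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return {label: e / total for label, e in zip(prompt_embs, exps)}

def average_probs(per_frame):
    """Mean probability per label across frames."""
    labels = per_frame[0].keys()
    return {l: sum(f[l] for f in per_frame) / len(per_frame) for l in labels}

# Toy 3-dimensional embeddings; real models produce far larger vectors.
prompts = {
    "a photo of a dog": [1.0, 0.0, 0.0],
    "a photo of a cat": [0.0, 1.0, 0.0],
}
frames = [[0.9, 0.1, 0.0], [0.8, 0.3, 0.1]]

per_frame = [zero_shot_probs(f, prompts) for f in frames]
averaged = average_probs(per_frame)
best = max(averaged, key=averaged.get)
```

No training on the target labels is involved: changing the classes only means changing the text prompts.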

Implementation Details:
The implementation adds tokenizer support to the LibTorch backend using the tokenizers-cpp library. The existing dnn_classify filter has been transformed from a video-only filter into a multimedia filter that supports both video and audio inputs, selected by a configuration flag.
For video, the implementation supports both standard/original classification (OpenVINO backend) and CLIP (Torch backend). For audio, it adds CLAP support via the Torch backend.

For further details, please refer to the documentation.

For model conversion/scripting or step-by-step installation, see my GitHub project: https://github.com/MaximilianKaindl/DeepFFMPEGVideoClassification

CLAP models unfortunately need to be traced (rather than scripted) because of NumPy weak references, and tracing seems to lock in the device that was used for tracing.

For audio preprocessing, I've implemented two functions, handle_long_audio and handle_short_audio, which imitate the original CLAP preprocessor. They aren't used by default, since classify automatically buffers frames to the desired length, but they might improve performance, especially handle_short_audio, which repeats parts of the audio. That's why I've kept them in place.
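
One plausible reading of "repeats parts of the audio" is repeat-padding: a clip shorter than the model's input window is tiled until it fills the window and then cut to length. The sketch below shows that idea only; `repeat_to_length` is a hypothetical helper, and the actual handle_short_audio in the patches may behave differently.

```python
def repeat_to_length(samples, target_len):
    """Tile a short clip until it fills target_len samples, then truncate.

    Hypothetical helper for illustration; not the patch's handle_short_audio.
    """
    if not samples:
        raise ValueError("cannot extend an empty clip")
    if len(samples) >= target_len:
        # Already long enough: keep only the first target_len samples.
        return list(samples[:target_len])
    repeats = -(-target_len // len(samples))  # ceiling division
    return (list(samples) * repeats)[:target_len]

clip = [0.1, 0.2, 0.3]
padded = repeat_to_length(clip, 8)
```

Compared with zero-padding, this keeps the whole window filled with real signal, which is presumably why it could help on very short inputs.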

I could use help ensuring my implementation doesn't interfere with the original dnn_classification or dnn_processing functionality. Thanks!

Furthermore, should I upload tests for this functionality? The models are large (more than 500 MB).

This time the patches should apply cleanly; I was able to apply them on my machine.

Signed-off-by: MaximilianKaindl <m.kaindl0208@gmail.com>
