Date: Sat, 8 Mar 2025 16:01:56 +0100
Message-ID: <007e01db903b$09633b50$1c29b1f0$@gmail.com>
Subject: [FFmpeg-devel] [PATCH FFmpeg 12/15] doc: move classify Filter doc to Multimedia Filters chapter
List-Id: FFmpeg development discussions and patches

Try the new filters using my Github Repo
https://github.com/MaximilianKaindl/DeepFFMPEGVideoClassification.
Any Feedback is appreciated!

Signed-off-by: MaximilianKaindl
---
 doc/filters.texi | 170 +++++++++++++++++++++++------------------------
 1 file changed, 85 insertions(+), 85 deletions(-)

diff --git a/doc/filters.texi b/doc/filters.texi
index bd75982d7d..915e0244cd 100644
--- a/doc/filters.texi
+++ b/doc/filters.texi
@@ -11970,91 +11970,6 @@ ffmpeg -i INPUT -f lavfi -i nullsrc=hd720,geq='r=128+80*(sin(sqrt((X-W/2)*(X-W/2
 @end example
 @end itemize
 
-@section dnn_classify
-Analyze media (video frames or audio) using deep neural networks to apply classifications based on the content.
-This filter supports three classification modes:
-
-@itemize @bullet
-@item Standard image classification (OpenVINO backend)
-@item CLIP (Contrastive Language-Image Pre-training) classification (Torch backend)
-@item CLAP (Contrastive Language-Audio Pre-training) classification (Torch backend)
-@end itemize
-
-The filter accepts the following options:
-@table @option
-@item dnn_backend
-Specify which DNN backend to use for model loading and execution. Currently supports:
-@table @samp
-@item openvino
-Use OpenVINO backend (standard image classification only).
-@item torch
-Use LibTorch backend (supports CLIP for images and CLAP for audio).
-@end table
-@item confidence
-Set the confidence threshold (default: 0.5). Classifications with confidence below this value will be filtered out.
-@item labels
-Set path to a label file specifying classification labels. This is required for standard classification and can be used for CLIP/CLAP classification.
-Each label is written on a separate line in the file. Trailing spaces and empty lines are skipped.
-@item categories
-Path to a categories file for hierarchical classification (CLIP/CLAP only). This allows classification to be organized into multiple category units with individual categories containing related labels.
-@item tokenizer
-Path to the text tokenizer.json file (CLIP/CLAP only). Required for text embedding generation.
-@item target
-Specify which objects to classify. When omitted, the entire frame is classified. When specified, only bounding boxes with detection labels matching this value are classified.
-@item is_audio
-Enable audio processing mode for CLAP models (default: 0). Set to 1 to process audio input instead of video frames.
-@item logit_scale
-Logit scale for similarity calculation in CLIP/CLAP (default: 4.6052 for CLIP, 33.37 for CLAP). Values below 0 use the default.
-@item temperature
-Softmax temperature for CLIP/CLAP models (default: 1.0). Lower values make the output more peaked, higher values make it smoother.
-@item forward_order
-Order of forward output for CLIP/CLAP: 0 for media-text order, 1 for text-media order (default depends on model type).
-@item normalize
-Whether to normalize the input tensor for CLIP/CLAP (default depends on model type). Some scripted models already do this in the forward, so this is not necessary in some cases.
-@item input_res
-Expected input resolution for video processing models (default: automatically detected).
-@item sample_rate
-Expected sample rate for audio processing models (default: 44100).
-@item sample_duration
-Expected sample duration in seconds for audio processing models (default: 7).
-@item token_dimension
-Dimension of token vector for text embeddings (default: 77).
-@item optimize
-Enable graph executor optimization (0: disabled, 1: enabled).
-@end table
-@subsection Category Files Format
-For CLIP/CLAP models, a hierarchical categories file can be provided with the following format:
-@example
-[RecordingSystem]
-(Professional)
-a photo with high level of detail
-a professionally recorded sound
-(HomeRecording)
-a photo with low level of detail
-an amateur recording
-[ContentType]
-(Nature)
-trees
-mountains
-birds singing
-(Urban)
-buildings
-street noise
-traffic sounds
-@end example
-Each unit enclosed in square brackets [] creates a classification group. Within each group, categories are defined with parentheses () and the labels under each category are used to classify the input.
-@subsection Examples
-@example
-Classify video using OpenVINO
-ffmpeg -i input.mp4 -vf "dnn_classify=dnn_backend=openvino:model=model.xml:labels=labels.txt" output.mp4
-Classify video using CLIP
-ffmpeg -i input.mp4 -vf "dnn_classify=dnn_backend=torch:model=clip_model.pt:categories=categories.txt:tokenizer=tokenizer.json" output.mp4
-Classify only person objects in a video
-ffmpeg -i input.mp4 -vf "dnn_detect=model=detection.xml:input=data:output=detection_out:confidence=0.5,dnn_classify=model=clip_model.pt:dnn_backend=torch:tokenizer=tokenizer.json:labels=labels.txt:target=person" output.mp4
-Classify audio using CLAP
-ffmpeg -i input.mp3 -af "dnn_classify=dnn_backend=torch:model=clap_model.pt:categories=audio_categories.txt:tokenizer=tokenizer.json:is_audio=1:sample_rate=44100:sample_duration=7" output.mp3
-@end example
-
 @section dnn_detect
 Do object detection with deep neural networks.
 
@@ -30925,6 +30840,91 @@ bench=start,selectivecolor=reds=-.2 .12 -.49,bench=stop
 @end example
 @end itemize
 
+@section dnn_classify
+Analyze media (video frames or audio) using deep neural networks to apply classifications based on the content.
+This filter supports three classification modes:
+
+@itemize @bullet
+@item Standard image classification (OpenVINO backend)
+@item CLIP (Contrastive Language-Image Pre-training) classification (Torch backend)
+@item CLAP (Contrastive Language-Audio Pre-training) classification (Torch backend)
+@end itemize
+
+The filter accepts the following options:
+@table @option
+@item dnn_backend
+Specify which DNN backend to use for model loading and execution. Currently supports:
+@table @samp
+@item openvino
+Use OpenVINO backend (standard image classification only).
+@item torch
+Use LibTorch backend (supports CLIP for images and CLAP for audio).
+@end table
+@item confidence
+Set the confidence threshold (default: 0.5). Classifications with confidence below this value will be filtered out.
+@item labels
+Set path to a label file specifying classification labels. This is required for standard classification and can be used for CLIP/CLAP classification.
+Each label is written on a separate line in the file. Trailing spaces and empty lines are skipped.
+@item categories
+Path to a categories file for hierarchical classification (CLIP/CLAP only). This allows classification to be organized into multiple category units with individual categories containing related labels.
+@item tokenizer
+Path to the text tokenizer.json file (CLIP/CLAP only). Required for text embedding generation.
+@item target
+Specify which objects to classify. When omitted, the entire frame is classified. When specified, only bounding boxes with detection labels matching this value are classified.
+@item is_audio
+Enable audio processing mode for CLAP models (default: 0). Set to 1 to process audio input instead of video frames.
+@item logit_scale
+Logit scale for similarity calculation in CLIP/CLAP (default: 4.6052 for CLIP, 33.37 for CLAP). Values below 0 use the default.
+@item temperature
+Softmax temperature for CLIP/CLAP models (default: 1.0). Lower values make the output more peaked, higher values make it smoother.
+@item forward_order
+Order of forward output for CLIP/CLAP: 0 for media-text order, 1 for text-media order (default depends on model type).
+@item normalize
+Whether to normalize the input tensor for CLIP/CLAP (default depends on model type). Some scripted models already do this in the forward, so this is not necessary in some cases.
+@item input_res
+Expected input resolution for video processing models (default: automatically detected).
+@item sample_rate
+Expected sample rate for audio processing models (default: 44100).
+@item sample_duration
+Expected sample duration in seconds for audio processing models (default: 7).
+@item token_dimension
+Dimension of token vector for text embeddings (default: 77).
+@item optimize
+Enable graph executor optimization (0: disabled, 1: enabled).
+@end table
+@subsection Category Files Format
+For CLIP/CLAP models, a hierarchical categories file can be provided with the following format:
+@example
+[RecordingSystem]
+(Professional)
+a photo with high level of detail
+a professionally recorded sound
+(HomeRecording)
+a photo with low level of detail
+an amateur recording
+[ContentType]
+(Nature)
+trees
+mountains
+birds singing
+(Urban)
+buildings
+street noise
+traffic sounds
+@end example
+Each unit enclosed in square brackets [] creates a classification group. Within each group, categories are defined with parentheses () and the labels under each category are used to classify the input.
+@subsection Examples
+@example
+Classify video using OpenVINO
+ffmpeg -i input.mp4 -vf "dnn_classify=dnn_backend=openvino:model=model.xml:labels=labels.txt" output.mp4
+Classify video using CLIP
+ffmpeg -i input.mp4 -vf "dnn_classify=dnn_backend=torch:model=clip_model.pt:categories=categories.txt:tokenizer=tokenizer.json" output.mp4
+Classify only person objects in a video
+ffmpeg -i input.mp4 -vf "dnn_detect=model=detection.xml:input=data:output=detection_out:confidence=0.5,dnn_classify=model=clip_model.pt:dnn_backend=torch:tokenizer=tokenizer.json:labels=labels.txt:target=person" output.mp4
+Classify audio using CLAP
+ffmpeg -i input.mp3 -af "dnn_classify=dnn_backend=torch:model=clap_model.pt:categories=audio_categories.txt:tokenizer=tokenizer.json:is_audio=1:sample_rate=44100:sample_duration=7" output.mp3
+@end example
+
 @section concat
 
 Concatenate audio and video streams, joining them together one after the
-- 
2.34.1
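
For anyone preparing a categories file as described under "Category Files Format" in the patch above, here is a minimal Python sketch of how that layout could be read into nested groups. It is only an illustration of the documented format, not the filter's actual parser: the names parse_categories and SAMPLE are hypothetical, and rejecting a label that appears before any (category) line is an assumption, since the documentation does not say how such input is handled.

```python
def parse_categories(text):
    """Read the [unit] / (category) / label layout into
    {unit: {category: [labels]}}. Illustrative only."""
    units = {}
    unit = None
    category = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip empty lines
        if line.startswith("[") and line.endswith("]"):
            # A new classification group (unit)
            unit = line[1:-1].strip()
            units[unit] = {}
            category = None
        elif line.startswith("(") and line.endswith(")"):
            # A new category inside the current unit
            if unit is None:
                raise ValueError(f"category {line!r} appears before any [unit]")
            category = line[1:-1].strip()
            units[unit][category] = []
        else:
            # A plain label belonging to the current category
            if category is None:
                raise ValueError(f"label {line!r} appears before any (category)")
            units[unit][category].append(line)
    return units


# Sample taken from the documentation text in the patch
SAMPLE = """\
[RecordingSystem]
(Professional)
a photo with high level of detail
a professionally recorded sound
(HomeRecording)
a photo with low level of detail
an amateur recording
[ContentType]
(Nature)
trees
mountains
birds singing
(Urban)
buildings
street noise
traffic sounds
"""

if __name__ == "__main__":
    for unit_name, categories in parse_categories(SAMPLE).items():
        for category_name, labels in categories.items():
            print(unit_name, category_name, labels)
```

Running this on a categories file before handing it to the filter is a quick way to check that every label sits under a unit and a category.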