From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <ffmpeg-devel-bounces@ffmpeg.org>
Received: from ffbox0-bg.mplayerhq.hu (ffbox0-bg.ffmpeg.org [79.124.17.100])
	by master.gitmailbox.com (Postfix) with ESMTPS id 80EA14C2A0
	for <ffmpegdev@gitmailbox.com>; Sat,  8 Mar 2025 15:02:15 +0000 (UTC)
Received: from [127.0.1.1] (localhost [127.0.0.1])
	by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id C1C6868F48E;
	Sat,  8 Mar 2025 17:01:56 +0200 (EET)
Received: from mail-wm1-f51.google.com (mail-wm1-f51.google.com
 [209.85.128.51])
 by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTPS id 7551A68DBE1
 for <ffmpeg-devel@ffmpeg.org>; Sat,  8 Mar 2025 17:01:55 +0200 (EET)
Received: by mail-wm1-f51.google.com with SMTP id
 5b1f17b1804b1-43bcfa6c57fso15793165e9.0
 for <ffmpeg-devel@ffmpeg.org>; Sat, 08 Mar 2025 07:01:55 -0800 (PST)
Received: from MK2 (80-108-16-220.cable.dynamic.surfer.at. [80.108.16.220])
 by smtp.gmail.com with ESMTPSA id
 5b1f17b1804b1-43cee67ae5esm6433445e9.33.2025.03.08.07.01.53
 for <ffmpeg-devel@ffmpeg.org>
 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128);
 Sat, 08 Mar 2025 07:01:53 -0800 (PST)
From: <m.kaindl0208@gmail.com>
To: <ffmpeg-devel@ffmpeg.org>
Date: Sat, 8 Mar 2025 16:01:56 +0100
Message-ID: <007e01db903b$09633b50$1c29b1f0$@gmail.com>
MIME-Version: 1.0
X-Mailer: Microsoft Outlook 16.0
Thread-Index: AduQOIrwV6q3QQFlQj2ibJmSEUB8Yw==
Content-Language: en-at
Subject: [FFmpeg-devel] [PATCH FFmpeg 12/15] doc: move classify Filter doc
 to Multimedia Filters chapter
X-BeenThere: ffmpeg-devel@ffmpeg.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: FFmpeg development discussions and patches <ffmpeg-devel.ffmpeg.org>
List-Unsubscribe: <https://ffmpeg.org/mailman/options/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=unsubscribe>
List-Archive: <https://ffmpeg.org/pipermail/ffmpeg-devel>
List-Post: <mailto:ffmpeg-devel@ffmpeg.org>
List-Help: <mailto:ffmpeg-devel-request@ffmpeg.org?subject=help>
List-Subscribe: <https://ffmpeg.org/mailman/listinfo/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=subscribe>
Reply-To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Errors-To: ffmpeg-devel-bounces@ffmpeg.org
Sender: "ffmpeg-devel" <ffmpeg-devel-bounces@ffmpeg.org>
Archived-At: <https://master.gitmailbox.com/ffmpegdev/007e01db903b$09633b50$1c29b1f0$@gmail.com/>
List-Archive: <https://master.gitmailbox.com/ffmpegdev/>
List-Post: <mailto:ffmpegdev@gitmailbox.com>

You can try the new filters with my GitHub repository: https://github.com/MaximilianKaindl/DeepFFMPEGVideoClassification

Any feedback is appreciated!

Signed-off-by: MaximilianKaindl <m.kaindl0208@gmail.com>
---
 doc/filters.texi | 170 +++++++++++++++++++++++------------------------
 1 file changed, 85 insertions(+), 85 deletions(-)

diff --git a/doc/filters.texi b/doc/filters.texi
index bd75982d7d..915e0244cd 100644
--- a/doc/filters.texi
+++ b/doc/filters.texi
@@ -11970,91 +11970,6 @@ ffmpeg -i INPUT -f lavfi -i nullsrc=hd720,geq='r=128+80*(sin(sqrt((X-W/2)*(X-W/2
 @end example
 @end itemize
 
-@section dnn_classify
-Analyze media (video frames or audio) using deep neural networks to apply classifications based on the content.
-This filter supports three classification modes:
-
-@itemize @bullet
-@item Standard image classification (OpenVINO backend)
-@item CLIP (Contrastive Language-Image Pre-training) classification (Torch backend)
-@item CLAP (Contrastive Language-Audio Pre-training) classification (Torch backend)
-@end itemize
-
-The filter accepts the following options:
-@table @option
-@item dnn_backend
-Specify which DNN backend to use for model loading and execution. Currently supports:
-@table @samp
-@item openvino
-Use OpenVINO backend (standard image classification only).
-@item torch
-Use LibTorch backend (supports CLIP for images and CLAP for audio).
-@end table
-@item confidence
-Set the confidence threshold (default: 0.5). Classifications with confidence below this value will be filtered out.
-@item labels
-Set path to a label file specifying classification labels. This is required for standard classification and can be used for CLIP/CLAP classification.
-Each label is written on a separate line in the file. Trailing spaces and empty lines are skipped.
-@item categories
-Path to a categories file for hierarchical classification (CLIP/CLAP only). This allows classification to be organized into multiple category units with individual categories containing related labels.
-@item tokenizer
-Path to the text tokenizer.json file (CLIP/CLAP only). Required for text embedding generation.
-@item target
-Specify which objects to classify. When omitted, the entire frame is classified. When specified, only bounding boxes with detection labels matching this value are classified.
-@item is_audio
-Enable audio processing mode for CLAP models (default: 0). Set to 1 to process audio input instead of video frames.
-@item logit_scale
-Logit scale for similarity calculation in CLIP/CLAP (default: 4.6052 for CLIP, 33.37 for CLAP). Values below 0 use the default.
-@item temperature
-Softmax temperature for CLIP/CLAP models (default: 1.0). Lower values make the output more peaked, higher values make it smoother.
-@item forward_order
-Order of forward output for CLIP/CLAP: 0 for media-text order, 1 for text-media order (default depends on model type).
-@item normalize
-Whether to normalize the input tensor for CLIP/CLAP (default depends on model type). Some scripted models already do this in the forward, so this is not necessary in some cases.
-@item input_res
-Expected input resolution for video processing models (default: automatically detected).
-@item sample_rate
-Expected sample rate for audio processing models (default: 44100).
-@item sample_duration
-Expected sample duration in seconds for audio processing models (default: 7).
-@item token_dimension
-Dimension of token vector for text embeddings (default: 77).
-@item optimize
-Enable graph executor optimization (0: disabled, 1: enabled).
-@end table
-@subsection Category Files Format
-For CLIP/CLAP models, a hierarchical categories file can be provided with the following format:
-@example
-[RecordingSystem]
-(Professional)
-a photo with high level of detail
-a professionally recorded sound
-(HomeRecording)
-a photo with low level of detail
-an amateur recording
-[ContentType]
-(Nature)
-trees
-mountains
-birds singing
-(Urban)
-buildings
-street noise
-traffic sounds
-@end example
-Each unit enclosed in square brackets [] creates a classification group. Within each group, categories are defined with parentheses () and the labels under each category are used to classify the input.
-@subsection Examples
-@example
-Classify video using OpenVINO
-ffmpeg -i input.mp4 -vf "dnn_classify=dnn_backend=openvino:model=model.xml:labels=labels.txt" output.mp4
-Classify video using CLIP
-ffmpeg -i input.mp4 -vf "dnn_classify=dnn_backend=torch:model=clip_model.pt:categories=categories.txt:tokenizer=tokenizer.json" output.mp4
-Classify only person objects in a video
-ffmpeg -i input.mp4 -vf "dnn_detect=model=detection.xml:input=data:output=detection_out:confidence=0.5,dnn_classify=model=clip_model.pt:dnn_backend=torch:tokenizer=tokenizer.json:labels=labels.txt:target=person" output.mp4
-Classify audio using CLAP
-ffmpeg -i input.mp3 -af "dnn_classify=dnn_backend=torch:model=clap_model.pt:categories=audio_categories.txt:tokenizer=tokenizer.json:is_audio=1:sample_rate=44100:sample_duration=7" output.mp3
-@end example
-
 @section dnn_detect
 
 Do object detection with deep neural networks.
@@ -30925,6 +30840,91 @@ bench=start,selectivecolor=reds=-.2 .12 -.49,bench=stop
 @end example
 @end itemize
 
+@section dnn_classify
+Analyze media (video frames or audio) using deep neural networks to apply classifications based on the content.
+This filter supports three classification modes:
+
+@itemize @bullet
+@item Standard image classification (OpenVINO backend)
+@item CLIP (Contrastive Language-Image Pre-training) classification (Torch backend)
+@item CLAP (Contrastive Language-Audio Pre-training) classification (Torch backend)
+@end itemize
+
+The filter accepts the following options:
+@table @option
+@item dnn_backend
+Specify the DNN backend used to load and run the model. The following backends are currently supported:
+@table @samp
+@item openvino
+Use OpenVINO backend (standard image classification only).
+@item torch
+Use LibTorch backend (supports CLIP for images and CLAP for audio).
+@end table
+@item confidence
+Set the confidence threshold (default: 0.5). Classifications with confidence below this value will be filtered out.
+@item labels
+Set path to a label file specifying classification labels. This is required for standard classification and can be used for CLIP/CLAP classification.
+Each label is written on a separate line in the file. Trailing spaces and empty lines are skipped.
+@item categories
+Set path to a categories file for hierarchical classification (CLIP/CLAP only). The file groups related labels into categories, and categories into classification units, as described in the Category Files Format subsection.
+@item tokenizer
+Set path to the text tokenizer file (tokenizer.json, CLIP/CLAP only). Required for generating text embeddings.
+@item target
+Specify which objects to classify. When omitted, the entire frame is classified. When specified, only bounding boxes with detection labels matching this value are classified.
+@item is_audio
+Enable audio processing mode for CLAP models (default: 0). Set to 1 to process audio input instead of video frames.
+@item logit_scale
+Logit scale for similarity calculation in CLIP/CLAP (default: 4.6052 for CLIP, 33.37 for CLAP). Values below 0 use the default.
+@item temperature
+Softmax temperature for CLIP/CLAP models (default: 1.0). Lower values make the output more peaked, higher values make it smoother.
+@item forward_order
+Set the order of the model's forward outputs for CLIP/CLAP: 0 for media-text order, 1 for text-media order (default depends on model type).
+@item normalize
+Whether to normalize the input tensor for CLIP/CLAP (default depends on model type). Some scripted models already normalize their input in the forward pass, in which case this option is not needed.
+@item input_res
+Expected input resolution for video processing models (default: automatically detected).
+@item sample_rate
+Expected sample rate for audio processing models (default: 44100).
+@item sample_duration
+Expected sample duration in seconds for audio processing models (default: 7).
+@item token_dimension
+Dimension of token vector for text embeddings (default: 77).
+@item optimize
+Enable graph executor optimization (0: disabled, 1: enabled).
+@end table
+@subsection Category Files Format
+For CLIP/CLAP models, a hierarchical categories file can be provided with the following format:
+@example
+[RecordingSystem]
+(Professional)
+a photo with high level of detail
+a professionally recorded sound
+(HomeRecording)
+a photo with low level of detail
+an amateur recording
+[ContentType]
+(Nature)
+trees
+mountains
+birds singing
+(Urban)
+buildings
+street noise
+traffic sounds
+@end example
+Each name enclosed in square brackets [] starts a classification unit (group). Within a unit, each name enclosed in parentheses () starts a category, and the lines that follow it, up to the next category or unit, are that category's labels used to classify the input.
+@subsection Examples
+@example
+# Classify video using OpenVINO
+ffmpeg -i input.mp4 -vf "dnn_classify=dnn_backend=openvino:model=model.xml:labels=labels.txt" output.mp4
+# Classify video using CLIP
+ffmpeg -i input.mp4 -vf "dnn_classify=dnn_backend=torch:model=clip_model.pt:categories=categories.txt:tokenizer=tokenizer.json" output.mp4
+# Classify only person objects in a video
+ffmpeg -i input.mp4 -vf "dnn_detect=model=detection.xml:input=data:output=detection_out:confidence=0.5,dnn_classify=model=clip_model.pt:dnn_backend=torch:tokenizer=tokenizer.json:labels=labels.txt:target=person" output.mp4
+# Classify audio using CLAP
+ffmpeg -i input.mp3 -af "dnn_classify=dnn_backend=torch:model=clap_model.pt:categories=audio_categories.txt:tokenizer=tokenizer.json:is_audio=1:sample_rate=44100:sample_duration=7" output.mp3
+@end example
+
 @section concat
 
 Concatenate audio and video streams, joining them together one after the
-- 
2.34.1
