From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <ffmpeg-devel-bounces@ffmpeg.org>
Received: from ffbox0-bg.mplayerhq.hu (ffbox0-bg.ffmpeg.org [79.124.17.100])
	by master.gitmailbox.com (Postfix) with ESMTPS id 7FA944E322
	for <ffmpegdev@gitmailbox.com>; Mon, 10 Mar 2025 19:55:29 +0000 (UTC)
Received: from [127.0.1.1] (localhost [127.0.0.1])
	by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id CBF2F68DF30;
	Mon, 10 Mar 2025 21:55:21 +0200 (EET)
Received: from mail-wm1-f49.google.com (mail-wm1-f49.google.com
 [209.85.128.49])
 by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTPS id 9474668E0D6
 for <ffmpeg-devel@ffmpeg.org>; Mon, 10 Mar 2025 21:55:15 +0200 (EET)
Received: by mail-wm1-f49.google.com with SMTP id
 5b1f17b1804b1-43948f77f1aso27908765e9.0
 for <ffmpeg-devel@ffmpeg.org>; Mon, 10 Mar 2025 12:55:15 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=gmail.com; s=20230601; t=1741636515; x=1742241315; darn=ffmpeg.org;
 h=thread-index:content-language:content-transfer-encoding
 :mime-version:message-id:date:subject:to:from:from:to:cc:subject
 :date:message-id:reply-to;
 bh=sbe3Q5IynQRSr3ln5zhWMRbjZk7rTphDf6W5xKaKDiI=;
 b=gAMjAxroCg6686gtH6KktdRtyuUmK3tiZHqMRYN4VBvKXgZDH1OPvquhpCdRFoI20C
 VoT/yZ/6+6uiftvHStHH3lUlZYZtYWdYyBTsMETlf0Xi7uIALbbFDGzvK4yo/WT+j5w9
 +l2npbDjX1vGxRu/bgs058kNArG5T4VxhtOlm586QecceV6O1EDKFVqyjE1Ih45vt8iT
 M860LcUt+1NfNtdBvlbk1JAIvM1zekakI0xB05Nkb5LTta5r3D/4e9REHI/51VS+rrmX
 PDc6MeG3C/ktERI0EI2d6emYLtszRldvPf7yKA3s3odPyP6HyKiwJCM9rYRazKpvRwZg
 ESAg==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20230601; t=1741636515; x=1742241315;
 h=thread-index:content-language:content-transfer-encoding
 :mime-version:message-id:date:subject:to:from:x-gm-message-state
 :from:to:cc:subject:date:message-id:reply-to;
 bh=sbe3Q5IynQRSr3ln5zhWMRbjZk7rTphDf6W5xKaKDiI=;
 b=TYRKunpWFoKNLsvSDO79Glvm1Jt2nGgJObgmmaqQB2d+VQbBil03DBF/VYQAgZuVcX
 N1l23uM/MGkPIguHtZnzLvGa3GNwe7o90VbEPB/fc+N3RCuZVtrArMsFmMhuzn8Bgyds
 xe6YSjWJsZ3egdf61T9Ro8IFZKTul2XSq9nUud6PceiM5StGBt98Wcvt/R2o7ZbCFpMf
 VNgKZmnvPcAFtnLNZKJUUPeiy6scaS0NhtMF49UL/ua34KsH6gMUmftRdrevFmtGASn3
 6vZRd3+ebYXzpoGnakd7zlZt1M0LblanE4qbcdCTrWJdmltyu0XKpm6CGQ1Qk0yfJ1Yr
 Sszg==
X-Gm-Message-State: AOJu0YywQ9rlkGX6Nnj9dSt/Bw1Mam3EtEgvvJ0xlNd9nY6dXte/Ppmc
 c82KGq0X4dNJd2QRh3NgJzo+RZhURD22Y3jxQzXUsf3SUuZz3eUl6EuB3A==
X-Gm-Gg: ASbGnctay8d94FOJggVeIrpoxzcqpRblC67P0hP3uRQq02Kh7nxgA6tLKrti7Nz4RtV
 BU9UVsQdfvb/3CABw1pGhLUC1wMruHadUynn3jmP2PKSiaWmU6Q22Di8eg27ik1e69tkmdvqrb3
 ziiVOc7+4K5pkM+Kp3NCBDITdUYamWf2pcJt1PG0pbIgjwQRMHIho4ALooLdfSmIquv1OmhBpi9
 S2oVdkxbzndMtTPzebY9V5QxWJx5FkiJoYNr1ArRCxqclvvSk3kxdWJX7d+v/D8kvi4lveb/oOa
 EuhtbsfWd4C8vyWInYJ3tZQV9QD48bq4YuHs/1E446u9NJT8NVKfzN8r+PTPXQn/dzu+5US8KBP
 vVEO39/jkr9QX0uuv
X-Google-Smtp-Source: AGHT+IG+IbddAW29PPbwU1Io5oSuRwiKh1pjUvwAwJBboHqxlpgBY6Q0UAgGxJHUCvmPv3O/6n1/Gg==
X-Received: by 2002:a7b:cc88:0:b0:43c:f75a:eb38 with SMTP id
 5b1f17b1804b1-43cf75aed80mr46625765e9.2.1741636514381; 
 Mon, 10 Mar 2025 12:55:14 -0700 (PDT)
Received: from MK2 (80-108-16-220.cable.dynamic.surfer.at. [80.108.16.220])
 by smtp.gmail.com with ESMTPSA id
 5b1f17b1804b1-43cf0c42eb6sm66020285e9.16.2025.03.10.12.55.13
 for <ffmpeg-devel@ffmpeg.org>
 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128);
 Mon, 10 Mar 2025 12:55:14 -0700 (PDT)
From: <m.kaindl0208@gmail.com>
To: <ffmpeg-devel@ffmpeg.org>
Date: Mon, 10 Mar 2025 20:55:14 +0100
Message-ID: <004301db91f6$56fa2ce0$04ee86a0$@gmail.com>
MIME-Version: 1.0
X-Mailer: Microsoft Outlook 16.0
Content-Language: en-at
Thread-Index: AduR9lMCLaE4tRwgSiirPTBUgXGs2g==
Subject: [FFmpeg-devel] [PATCH v2 FFmpeg 15/20]
 libavfilter/dnn/dnn_backend_torch: Audio and Video preprocessing for
 CLIP/CLAP models
X-BeenThere: ffmpeg-devel@ffmpeg.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: FFmpeg development discussions and patches <ffmpeg-devel.ffmpeg.org>
List-Unsubscribe: <https://ffmpeg.org/mailman/options/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=unsubscribe>
List-Archive: <https://ffmpeg.org/pipermail/ffmpeg-devel>
List-Post: <mailto:ffmpeg-devel@ffmpeg.org>
List-Help: <mailto:ffmpeg-devel-request@ffmpeg.org?subject=help>
List-Subscribe: <https://ffmpeg.org/mailman/listinfo/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=subscribe>
Reply-To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Errors-To: ffmpeg-devel-bounces@ffmpeg.org
Sender: "ffmpeg-devel" <ffmpeg-devel-bounces@ffmpeg.org>
Archived-At: <https://master.gitmailbox.com/ffmpegdev/004301db91f6$56fa2ce0$04ee86a0$@gmail.com/>
List-Archive: <https://master.gitmailbox.com/ffmpegdev/>
List-Post: <mailto:ffmpegdev@gitmailbox.com>

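Add helpers that prepare model input for the Torch backend:
prepare_audio_tensor() validates the frame (sample rate, float sample
format) and fits it to sample_rate * sample_duration samples, tiling
short input and cropping long input; preprocess_image_tensor() resizes
the input to input_resolution x input_resolution with bicubic
interpolation.

Signed-off-by: MaximilianKaindl <m.kaindl0208@gmail.com>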
---
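Note on the short-input path: handle_short_audio_tensor() repeats the
clip until it covers target_samples and then truncates. The same logic
can be sketched without libtorch as follows. fit_short_audio and the
std::vector buffers are illustrative stand-ins and not part of this
patch, and the sketch assumes a non-empty input.

```cpp
#include <cstddef>
#include <vector>

// Illustrative stand-in for handle_short_audio_tensor(); not part of the patch.
// Assumes audio is non-empty (the ceiling division below divides by its size).
static std::vector<float> fit_short_audio(const std::vector<float> &audio,
                                          std::size_t target_samples)
{
    const std::size_t nb_samples = audio.size();
    // Ceiling division: smallest number of copies that covers target_samples
    const std::size_t repeat_factor = (target_samples + nb_samples - 1) / nb_samples;

    std::vector<float> repeated;
    repeated.reserve(repeat_factor * nb_samples);
    for (std::size_t i = 0; i < repeat_factor; i++)
        repeated.insert(repeated.end(), audio.begin(), audio.end());

    // Keep only the first target_samples samples
    repeated.resize(target_samples);
    return repeated;
}
```

For a 3-sample input and a target of 7, repeat_factor is 3, so 9 samples
are produced and the last 2 are dropped.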
 libavfilter/dnn/dnn_backend_torch.cpp | 128 ++++++++++++++++++++++++++
 1 file changed, 128 insertions(+)
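
Note on the long-input path: handle_long_audio_tensor() crops a window
of target_samples samples, either centered or at an offset derived from
the seed. A minimal sketch of that offset choice is shown here.
pick_start_idx is a hypothetical helper and not part of this patch; it
takes the seed as a parameter and assumes nb_samples > target_samples.

```cpp
// Illustrative stand-in for the offset choice in handle_long_audio_tensor();
// not part of the patch. Assumes nb_samples > target_samples.
static int pick_start_idx(unsigned int seed, int nb_samples, int target_samples)
{
    const int max_start = nb_samples - target_samples;

    // Seeds divisible by 3 select the centered window
    if (seed % 3 == 0)
        return max_start / 2;

    // All other seeds map to an offset in [0, max_start]
    return (int)(seed % (unsigned int)(max_start + 1));
}
```

Every returned offset lies in [0, max_start], so the window
[start_idx, start_idx + target_samples) stays inside the input. Since
seeds divisible by 3 never reach the seeded branch, the offsets are not
uniformly distributed.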

diff --git a/libavfilter/dnn/dnn_backend_torch.cpp b/libavfilter/dnn/dnn_backend_torch.cpp
index 12ba2674b3..1d2bfb191a 100644
--- a/libavfilter/dnn/dnn_backend_torch.cpp
+++ b/libavfilter/dnn/dnn_backend_torch.cpp
@@ -458,6 +458,134 @@ static torch::Tensor apply_softmax(torch::Tensor input_tensor, float temperature
     }
 }

+static torch::Tensor handle_short_audio_tensor(torch::Tensor audio_tensor, int target_samples)
+{
+    int nb_samples = audio_tensor.size(0);
+    int repeat_factor = (target_samples + nb_samples - 1) / nb_samples;
+
+    // Repeat tensor along dimension 0 to fill required length
+    torch::Tensor repeated = audio_tensor.repeat({repeat_factor});
+
+    // Take only the needed samples
+    return repeated.slice(0, 0, target_samples);
+}
+
+static torch::Tensor handle_long_audio_tensor(torch::Tensor audio_tensor, int target_samples, TaskItem *task)
+{
+    int nb_samples = audio_tensor.size(0);
+    int max_start = nb_samples - target_samples;
+
+    // Seed from the task address and the frame PTS (nb_samples if PTS is 0); varies between runs
+    unsigned int seed =
+        (unsigned int)((uintptr_t)task ^ (uintptr_t)(task->in_frame->pts ? task->in_frame->pts : nb_samples));
+
+    // Start position of the extracted window
+    int start_idx;
+
+    // Favor the centered window; otherwise fall back to a seeded offset
+    if (seed % 3 == 0) { // roughly one in three seeds selects the centered window
+        start_idx = (nb_samples - target_samples) / 2;
+    } else {
+        // Otherwise use seeded position
+        start_idx = seed % (max_start + 1);
+    }
+
+    // Extract the segment using slice operation
+    return audio_tensor.slice(0, start_idx, start_idx + target_samples);
+}
+
+static int prepare_audio_tensor(const THModel *th_model, const THRequestItem *request)
+{
+    THInferRequest *infer_request = request->infer_request;
+    LastLevelTaskItem *lltask = request->lltasks[0];
+    TaskItem *task = lltask->task;
+    DnnContext *ctx = th_model->ctx;
+    int ret = 0;
+
+    const int target_samples = ctx->torch_option.sample_rate * ctx->torch_option.sample_duration;
+
+    // Validate input frame
+    if (!task->in_frame->data[0]) {
+        av_log(ctx, AV_LOG_ERROR, "Invalid frame input data\n");
+        return AVERROR(EINVAL);
+    }
+
+    // Get audio data from the frame (packed float, read as a single channel)
+    float *audio_data = (float *)task->in_frame->data[0];
+    int nb_samples = task->in_frame->nb_samples;
+
+    // Validate audio parameters
+    if (task->in_frame->sample_rate != th_model->ctx->torch_option.sample_rate) {
+        av_log(ctx, AV_LOG_ERROR, "Sample rate mismatch. Expected %ld Hz, got %d Hz\n",
+                th_model->ctx->torch_option.sample_rate, task->in_frame->sample_rate);
+        return AVERROR(EINVAL);
+    }
+
+    if (task->in_frame->format != AV_SAMPLE_FMT_FLT) {
+        av_log(ctx, AV_LOG_ERROR, "Unsupported sample format. Expected float\n");
+        return AVERROR(EINVAL);
+    }
+
+    try {
+        torch::Tensor audio_tensor = torch::from_blob(audio_data, {nb_samples}, torch::kFloat32).clone();
+
+        c10::Device device = (*th_model->jit_model->parameters().begin()).device();
+        if (audio_tensor.device() != device) {
+            audio_tensor = audio_tensor.to(device);
+        }
+
+        // Create target tensor based on the audio length
+        torch::Tensor processed_tensor;
+
+        if (nb_samples < target_samples) {
+            // Handle short audio using tensor repeat operation
+            processed_tensor = handle_short_audio_tensor(audio_tensor, target_samples);
+        } else if (nb_samples > target_samples) {
+            // Handle long audio using tensor slice operation
+            processed_tensor = handle_long_audio_tensor(audio_tensor, target_samples, task);
+        } else {
+            // Exact length, just use the tensor as is
+            processed_tensor = audio_tensor;
+        }
+
+        processed_tensor = processed_tensor.reshape({1, -1});
+
+        // Assign to output
+        *infer_request->input_tensor = processed_tensor;
+    } catch (const c10::Error &e) {
+        av_log(ctx, AV_LOG_ERROR, "Audio tensor processing failed: %s\n", e.what());
+        return AVERROR(EINVAL);
+    } catch (const std::exception &e) {
+        av_log(ctx, AV_LOG_ERROR, "Audio tensor processing failed: %s\n", e.what());
+        return AVERROR(EINVAL);
+    }
+
+    return ret;
+}
+
+static int preprocess_image_tensor(const THModel *th_model, torch::Tensor *input_tensor, const c10::Device &device)
+{
+    DnnContext *ctx = th_model->ctx;
+    try {
+        if (input_tensor->device() != device) {
+            *input_tensor = input_tensor->to(device);
+        }
+        *input_tensor = torch::nn::functional::interpolate(
+            *input_tensor,
+            torch::nn::functional::InterpolateFuncOptions()
+                .size(std::vector<int64_t>{ctx->torch_option.input_resolution, ctx->torch_option.input_resolution})
+                .mode(torch::kBicubic)
+                .align_corners(false));
+        return 0;
+    } catch (const c10::Error &e) {
+        av_log(ctx, AV_LOG_ERROR, "Image preprocessing failed: %s\n", e.what());
+        return AVERROR(EINVAL);
+    } catch (const std::exception &e) {
+        av_log(ctx, AV_LOG_ERROR, "Image preprocessing failed: %s\n", e.what());
+        return AVERROR(EINVAL);
+    }
+}
+
 static int fill_model_input_th(THModel *th_model, THRequestItem *request)
 {
     LastLevelTaskItem *lltask = NULL;
--
2.34.1

