From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <ffmpeg-devel-bounces@ffmpeg.org>
Received: from ffbox0-bg.ffmpeg.org (ffbox0-bg.ffmpeg.org [79.124.17.100])
	by master.gitmailbox.com (Postfix) with ESMTPS id D2F094AF2E
	for <ffmpegdev@gitmailbox.com>; Sun, 20 Jul 2025 01:22:22 +0000 (UTC)
Received: from [127.0.1.1] (localhost [127.0.0.1])
	by ffbox0-bg.ffmpeg.org (Postfix) with ESMTP id E50F668D25E;
	Sun, 20 Jul 2025 04:22:18 +0300 (EEST)
Received: from relay15.mail.gandi.net (relay15.mail.gandi.net [217.70.178.235])
 by ffbox0-bg.ffmpeg.org (Postfix) with ESMTPS id 03D6768C698
 for <ffmpeg-devel@ffmpeg.org>; Sun, 20 Jul 2025 04:22:11 +0300 (EEST)
Received: by mail.gandi.net (Postfix) with ESMTPSA id C1BC743142
 for <ffmpeg-devel@ffmpeg.org>; Sun, 20 Jul 2025 01:22:10 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=niedermayer.cc;
 s=gm1; t=1752974531;
 h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
 to:to:cc:mime-version:mime-version:content-type:content-type:
 in-reply-to:in-reply-to:references:references;
 bh=Hh51CnCZ0r4THCaZFhP2anP/0QCdnQ1Mn8R2YMAIRXk=;
 b=X3gKG6lHULdmFq5wrIeuhm659NcpuSVKCUXUch6o7A9b4lvfUrq3MyGu+ayP0Xjcf20YZz
 cRYTBZQu8mcT9UOs2fQttSxFK/qBDTa+TTyzlWCO23qhVgp4iQ6RQJbehDngIKAWt5D70L
 T3pz4ICe4RGcOPA+QykLvpBlKy2FLlRkGQ6PQulL1ppWQaPaYDjjXEYYCV4MnboQdQtqIp
 wFilsq9qJtP4HZLeeRVXvpbLcfQY0wKJiIoZ63ZDY7AfgtKM7TTBxMZBqNgMT8YIJP7OwZ
 HAvJyTdaxBS5b15sSppzGoGHGrUpXtHvVkWBryyTZLHmqvOzsU4DjCXpvsHTzg==
Date: Sun, 20 Jul 2025 03:22:09 +0200
From: Michael Niedermayer <michael@niedermayer.cc>
To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Message-ID: <20250720012209.GW29660@pb2>
References: <20250719001553.GP29660@pb2>
 <20250719125526.389239-1-vpalmisano@gmail.com>
MIME-Version: 1.0
In-Reply-To: <20250719125526.389239-1-vpalmisano@gmail.com>
X-GND-State: clean
X-GND-Score: -85
X-GND-Cause: gggruggvucftvghtrhhoucdtuddrgeeffedrtdefgdeijeekkecutefuodetggdotefrodftvfcurfhrohhfihhlvgemucfitefpfffkpdcuggftfghnshhusghstghrihgsvgenuceurghilhhouhhtmecufedtudenucesvcftvggtihhpihgvnhhtshculddquddttddmnegfrhhlucfvnfffucdludehmdenucfjughrpeffhffvuffkfhggtggujgesghdtreertddtvdenucfhrhhomhepofhitghhrggvlhcupfhivgguvghrmhgrhigvrhcuoehmihgthhgrvghlsehnihgvuggvrhhmrgihvghrrdgttgeqnecuggftrfgrthhtvghrnhepieegkedtjeduffejhfetgeejtdegteetgfegtdfhjefgvefhteegkeejtddvhfevnecukfhppeeguddrieeirdeihedrudejieenucevlhhushhtvghrufhiiigvpedtnecurfgrrhgrmhepihhnvghtpeeguddrieeirdeihedrudejiedphhgvlhhopehlohgtrghlhhhoshhtpdhmrghilhhfrhhomhepmhhitghhrggvlhesnhhivgguvghrmhgrhigvrhdrtggtpdhnsggprhgtphhtthhopedupdhrtghpthhtohepfhhfmhhpvghgqdguvghvvghlsehffhhmphgvghdrohhrgh
Subject: Re: [FFmpeg-devel] [PATCH] libavfilter: Whisper audio filter
X-BeenThere: ffmpeg-devel@ffmpeg.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: FFmpeg development discussions and patches <ffmpeg-devel.ffmpeg.org>
List-Unsubscribe: <https://ffmpeg.org/mailman/options/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=unsubscribe>
List-Archive: <https://ffmpeg.org/pipermail/ffmpeg-devel>
List-Post: <mailto:ffmpeg-devel@ffmpeg.org>
List-Help: <mailto:ffmpeg-devel-request@ffmpeg.org?subject=help>
List-Subscribe: <https://ffmpeg.org/mailman/listinfo/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=subscribe>
Reply-To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Content-Type: multipart/mixed; boundary="===============7719798195986656207=="
Errors-To: ffmpeg-devel-bounces@ffmpeg.org
Sender: "ffmpeg-devel" <ffmpeg-devel-bounces@ffmpeg.org>
Archived-At: <https://master.gitmailbox.com/ffmpegdev/20250720012209.GW29660@pb2/>
List-Archive: <https://master.gitmailbox.com/ffmpegdev/>
List-Post: <mailto:ffmpegdev@gitmailbox.com>


--===============7719798195986656207==
Content-Type: multipart/signed; micalg=pgp-sha512;
	protocol="application/pgp-signature"; boundary="FbOUTUqlEC+IYKY/"
Content-Disposition: inline


--FbOUTUqlEC+IYKY/
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Sat, Jul 19, 2025 at 02:55:26PM +0200, Vittorio Palmisano wrote:
> It adds a new audio filter for running audio transcriptions with the whis=
per model.
> Documentation and examples are included into the patch.
>=20
> Signed-off-by: Vittorio Palmisano <vpalmisano@gmail.com>
> ---
>  configure                |   5 +
>  doc/filters.texi         | 107 +++++++++
>  libavfilter/Makefile     |   2 +
>  libavfilter/af_whisper.c | 456 +++++++++++++++++++++++++++++++++++++++
>  libavfilter/allfilters.c |   2 +
>  5 files changed, 572 insertions(+)
>  create mode 100644 libavfilter/af_whisper.c
>=20
[...]

> +static int init(AVFilterContext *ctx)
> +{
> +    WhisperContext *wctx =3D ctx->priv;
> +
> +    static AVOnce init_static_once =3D AV_ONCE_INIT;
> +    ff_thread_once(&init_static_once, ggml_backend_load_all);
> +
> +    whisper_log_set(cb_log, ctx);
> +
> +    // Init whisper context
> +    if (!wctx->model_path) {
> +        av_log(ctx, AV_LOG_ERROR, "No whisper model path specified. Use =
the 'model' option.\n");
> +        return AVERROR(EINVAL);
> +    }
> +
> +    struct whisper_context_params params =3D whisper_context_default_par=
ams();
> +    params.use_gpu =3D wctx->use_gpu;
> +    params.gpu_device =3D wctx->gpu_device;
> +
> +    wctx->ctx_wsp =3D whisper_init_from_file_with_params(wctx->model_pat=
h, params);
> +    if (wctx->ctx_wsp =3D=3D NULL) {
> +        av_log(ctx, AV_LOG_ERROR, "Failed to initialize whisper context =
=66rom model: %s\n", wctx->model_path);
> +        return AVERROR(EIO);
> +    }
> +
> +    // Init buffer
> +    wctx->audio_buffer_queue_size =3D WHISPER_SAMPLE_RATE * wctx->queue =
/ 1000000;
> +    wctx->audio_buffer =3D av_malloc_array(wctx->audio_buffer_queue_size=
, sizeof(*wctx->audio_buffer));
> +    if (!wctx->audio_buffer)
> +        return AVERROR(ENOMEM);
> +
> +    // Init VAD model context
> +    if (wctx->vad_model_path) {
> +        struct whisper_vad_context_params ctx_params =3D whisper_vad_def=
ault_context_params();
> +        ctx_params.n_threads =3D ff_filter_get_nb_threads(ctx);
> +        // ctx_params.use_gpu =3D wctx->use_gpu; TODO (see: whisper_vad_=
init_context)
> +        ctx_params.gpu_device =3D wctx->gpu_device;
> +        wctx->ctx_vad =3D whisper_vad_init_from_file_with_params(wctx->v=
ad_model_path, ctx_params);
> +
> +        wctx->vad_params =3D whisper_vad_default_params();
> +        wctx->vad_params.threshold =3D wctx->vad_threshold;
> +        wctx->vad_params.min_speech_duration_ms =3D wctx->vad_min_speech=
_duration / 1000;
> +        wctx->vad_params.min_silence_duration_ms =3D wctx->vad_min_silen=
ce_duration / 1000;
> +        wctx->vad_params.max_speech_duration_s =3D wctx->queue / 1000000=
=2E0;
> +        wctx->vad_params.speech_pad_ms =3D 0;
> +        wctx->vad_params.samples_overlap =3D 0;
> +    }
> +
> +    wctx->next_pts =3D AV_NOPTS_VALUE;
> +
> +    if (wctx->destination && strcmp("", wctx->destination)) {
> +        const char *dst =3D wctx->destination;
> +        if (!strcmp("-", dst))
> +            dst =3D "pipe:1";
> +        int ret =3D avio_open(&wctx->avio_context, dst, AVIO_FLAG_WRITE);
> +
> +        if (ret < 0) {
> +            av_log(ctx, AV_LOG_ERROR, "Could not open %s: %s\n", wctx->d=
estination, av_err2str(ret));
> +            return ret;
> +        }
> +
> +        wctx->avio_context->direct =3D AVIO_FLAG_DIRECT;
> +    }
> +
> +    av_log(ctx, AV_LOG_INFO,
> +           "Whisper filter initialized: model: %s lang: %s queue: %ld ms=
\n",
> +           wctx->model_path, wctx->language, wctx->queue / 1000);
> +
> +    return 0;
> +}
> +
> +static void uninit(AVFilterContext *ctx)
> +{
> +    WhisperContext *wctx =3D ctx->priv;
> +
> +    if (wctx->audio_buffer_fill_size > 0) {
> +        av_log(ctx, AV_LOG_WARNING,
> +               "Remaining audio buffer %d samples (%d seconds) after sto=
pping\n",
> +               wctx->audio_buffer_fill_size, wctx->audio_buffer_fill_siz=
e / WHISPER_SAMPLE_RATE);
> +    }
> +
> +    if (wctx->ctx_vad) {
> +        whisper_vad_free(wctx->ctx_vad);
> +        wctx->ctx_vad =3D NULL;
> +    }
> +
> +    if (wctx->ctx_wsp) {
> +        whisper_free(wctx->ctx_wsp);
> +        wctx->ctx_wsp =3D NULL;
> +    }
> +
> +    av_freep(&wctx->audio_buffer);
> +
> +    if (wctx->avio_context)
> +        avio_closep(&wctx->avio_context);
> +}
> +

> +static void run_transcription(AVFilterContext *ctx, AVDictionary **metad=
ata, int frames)
> +{
> +    WhisperContext *wctx =3D ctx->priv;

> +    frames =3D FFMAX(0, FFMIN(frames, wctx->audio_buffer_fill_size));

I would call it samples, sample_count or nb_samples.

Why are you clipping the number of samples?

I assume run_transcription() would be called with the correct number,
or am I missing something?


> +
> +    if (!wctx->ctx_wsp || frames =3D=3D 0)
> +        return;
> +
> +    float duration =3D (float) frames / WHISPER_SAMPLE_RATE;
> +
> +    av_log(ctx, AV_LOG_INFO,
> +           "run transcription %d/%d samples (%.2f seconds)...\n", frames=
, wctx->audio_buffer_fill_size, duration);
> +
> +    struct whisper_full_params params =3D whisper_full_default_params(WH=
ISPER_SAMPLING_GREEDY);
> +    params.language =3D wctx->language;
> +    params.n_threads =3D ff_filter_get_nb_threads(ctx);
> +    params.print_special =3D 0;
> +    params.print_progress =3D 0;
> +    params.print_realtime =3D 0;
> +    params.print_timestamps =3D 0;
> +
> +    if (whisper_full(wctx->ctx_wsp, params, wctx->audio_buffer, frames) =
!=3D 0) {
> +        av_log(ctx, AV_LOG_ERROR, "Failed to process audio with whisper.=
cpp\n");
> +        return;
> +    }
> +

> +    const int64_t timestamp =3D wctx->frames * 1000 / WHISPER_SAMPLE_RAT=
E;

To make this a bit easier to understand, I suggest calling it
timestamp_ms, as it is a timestamp in milliseconds and we also have
timestamps in the timebase.
But that's not really important, just an idea.

A bigger problem is that the input frame->pts are not passed through
to the output srt/json timestamps.

To understand why this is a problem, consider some audio input device
which samples at 16 kHz. This hardware contains, let's say for
simplicity, a 16 kHz crystal and samples based on that. But depending
on the temperature of this crystal, it will really sample at somewhere
between, let's say, 15990 and 16010 Hz. So simply counting samples
alone is not enough; the frame->pts need to be used too if the
subtitles should be perfectly in sync with the video.
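
To put rough numbers on the crystal example, here is a small sketch in
plain C. It is not code from the patch: NOMINAL_RATE stands in for
WHISPER_SAMPLE_RATE, sample_count_to_ms() and drift_ms() are
hypothetical helper names, and it assumes the device runs at one
constant, slightly wrong rate for the whole capture.

```c
#include <assert.h>
#include <stdint.h>

/* Rate the filter assumes; stand-in for WHISPER_SAMPLE_RATE. */
#define NOMINAL_RATE 16000

/* Timestamp in ms derived purely from a running sample count. */
static int64_t sample_count_to_ms(int64_t samples)
{
    assert(samples >= 0);
    return samples * 1000 / NOMINAL_RATE;
}

/*
 * Difference in ms between the sample-count timestamp and the real
 * elapsed time, for a device that really samples at actual_rate Hz.
 * Positive means the computed subtitle time runs ahead of real time.
 */
static int64_t drift_ms(int actual_rate, int64_t elapsed_s)
{
    int64_t samples = (int64_t)actual_rate * elapsed_s;
    return sample_count_to_ms(samples) - elapsed_s * 1000;
}
```

Under these assumptions drift_ms(16010, 3600) is 2250 and
drift_ms(15990, 3600) is -2250, so after one hour the subtitles would
be off by more than two seconds in either direction.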

It's probably best to give the user the option to produce srt/json
times based either purely on sample numbers or on pts.
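
One possible shape for such an option, as a standalone sketch and not
a proposal for the actual filter code. Everything named here is
hypothetical: struct ts_rational mimics AVRational, TS_NOPTS mimics
AV_NOPTS_VALUE, and use_pts stands for whatever option name the filter
would end up with. In the filter itself av_rescale_q() would be the
natural way to do the conversion; the plain multiplication below
ignores rounding and overflow.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for AV_NOPTS_VALUE (assumed here to equal INT64_MIN). */
#define TS_NOPTS INT64_MIN
/* Stand-in for WHISPER_SAMPLE_RATE. */
#define NOMINAL_RATE 16000

/* Minimal rational type mimicking AVRational (hypothetical). */
struct ts_rational { int num; int den; };

/* Convert a pts in the given timebase to milliseconds. */
static int64_t pts_to_ms(int64_t pts, struct ts_rational tb)
{
    assert(tb.den > 0);
    return pts * 1000 * tb.num / tb.den;
}

/* Convert a running sample count to milliseconds. */
static int64_t samples_to_ms(int64_t samples)
{
    return samples * 1000 / NOMINAL_RATE;
}

/*
 * Pick the start time of a transcribed segment. With use_pts set and
 * a valid pts, the input timestamp wins; otherwise fall back to the
 * sample count.
 */
static int64_t segment_start_ms(int use_pts, int64_t pts,
                                struct ts_rational tb,
                                int64_t samples_seen)
{
    if (use_pts && pts != TS_NOPTS)
        return pts_to_ms(pts, tb);
    return samples_to_ms(samples_seen);
}
```

Falling back to the sample count when no pts is available keeps the
current behaviour for inputs without timestamps.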

Sorry I am only bringing this up now, I just realized it as I was
reviewing the timestamp code.

thx

[...]

--=20
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

No great genius has ever existed without some touch of madness. -- Aristotle

--FbOUTUqlEC+IYKY/
Content-Type: application/pgp-signature; name="signature.asc"

-----BEGIN PGP SIGNATURE-----

iF0EABEKAB0WIQSf8hKLFH72cwut8TNhHseHBAsPqwUCaHxEvQAKCRBhHseHBAsP
qw4XAJ4+I96zPmqTCGPPWvWROz8NYkiXwQCeMqzOFixp018o+UwxAIy54n5X2ms=
=Va5f
-----END PGP SIGNATURE-----

--FbOUTUqlEC+IYKY/--

--===============7719798195986656207==
Content-Type: text/plain; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Disposition: inline

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".

--===============7719798195986656207==--