From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <ffmpeg-devel-bounces@ffmpeg.org>
Received: from ffbox0-bg.ffmpeg.org (ffbox0-bg.ffmpeg.org [79.124.17.100])
	by master.gitmailbox.com (Postfix) with ESMTPS id 4CCC444B43
	for <ffmpegdev@gitmailbox.com>; Sat, 19 Jul 2025 00:16:07 +0000 (UTC)
Received: from [127.0.1.1] (localhost [127.0.0.1])
	by ffbox0-bg.ffmpeg.org (Postfix) with ESMTP id 0D6CC68DECB;
	Sat, 19 Jul 2025 03:16:03 +0300 (EEST)
Received: from relay4-d.mail.gandi.net (relay4-d.mail.gandi.net
 [217.70.183.196])
 by ffbox0-bg.ffmpeg.org (Postfix) with ESMTPS id 749B368BECB
 for <ffmpeg-devel@ffmpeg.org>; Sat, 19 Jul 2025 03:15:55 +0300 (EEST)
Received: by mail.gandi.net (Postfix) with ESMTPSA id 5B33C43375
 for <ffmpeg-devel@ffmpeg.org>; Sat, 19 Jul 2025 00:15:54 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=niedermayer.cc;
 s=gm1; t=1752884154;
 h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
 to:to:cc:mime-version:mime-version:content-type:content-type:
 in-reply-to:in-reply-to:references:references;
 bh=oC3ZyUjSZnlbWaPNgKSHyEmV952NupGzeHg31yAH5xM=;
 b=jfuZK3/R/1mZP99IXMz6MAGRM40G/LhVRl3x0xB6Ll5nbNPmKaOAfjbhmVch7yzVjm+4Dt
 c3qQcv3Y/5tJ10f92NncalGrasT+gBrriB5tNH74uhHOCIazUlIFa6I/4Jn/77P9m9v42e
 PDJARNuvuYp5LZDAe+QFuN9bgFCCbFTOKpo4KtxdnRlVnGon5kppL2aDmVhBT0kvD3zmaX
 oaVC0+ttfKsXkCr548za/w3akO9FSOd8k7ywcelwCkC3NWo2ZsAHT35vSgo3lviSAaZdXr
 aoIe/rVTsgVWnWEj0SkFfGzuHtUkbPDWiie2D+1iwW1LXRXHL+DYLcfD+z10+Q==
Date: Sat, 19 Jul 2025 02:15:53 +0200
From: Michael Niedermayer <michael@niedermayer.cc>
To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Message-ID: <20250719001553.GP29660@pb2>
References: <CADv15W_jJCZS3kLL72WLhHZXdNXXor-AHNUDShkhbrH+25aPkQ@mail.gmail.com>
 <20250717085157.88889-1-vpalmisano@gmail.com>
MIME-Version: 1.0
In-Reply-To: <20250717085157.88889-1-vpalmisano@gmail.com>
X-GND-State: clean
X-GND-Score: -85
X-GND-Cause: gggruggvucftvghtrhhoucdtuddrgeeffedrtdefgdeigeekjecutefuodetggdotefrodftvfcurfhrohhfihhlvgemucfitefpfffkpdcuggftfghnshhusghstghrihgsvgenuceurghilhhouhhtmecufedtudenucesvcftvggtihhpihgvnhhtshculddquddttddmnegfrhhlucfvnfffucdludehmdenucfjughrpeffhffvuffkfhggtggujgesghdtreertddtvdenucfhrhhomhepofhitghhrggvlhcupfhivgguvghrmhgrhigvrhcuoehmihgthhgrvghlsehnihgvuggvrhhmrgihvghrrdgttgeqnecuggftrfgrthhtvghrnhepieegkedtjeduffejhfetgeejtdegteetgfegtdfhjefgvefhteegkeejtddvhfevnecukfhppeeguddrieeirdeihedrudejieenucevlhhushhtvghrufhiiigvpedtnecurfgrrhgrmhepihhnvghtpeeguddrieeirdeihedrudejiedphhgvlhhopehlohgtrghlhhhoshhtpdhmrghilhhfrhhomhepmhhitghhrggvlhesnhhivgguvghrmhgrhigvrhdrtggtpdhnsggprhgtphhtthhopedupdhrtghpthhtohepfhhfmhhpvghgqdguvghvvghlsehffhhmphgvghdrohhrgh
X-GND-Sasl: michael@niedermayer.cc
Subject: Re: [FFmpeg-devel] [PATCH] Whisper audio filter
X-BeenThere: ffmpeg-devel@ffmpeg.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: FFmpeg development discussions and patches <ffmpeg-devel.ffmpeg.org>
List-Unsubscribe: <https://ffmpeg.org/mailman/options/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=unsubscribe>
List-Archive: <https://ffmpeg.org/pipermail/ffmpeg-devel>
List-Post: <mailto:ffmpeg-devel@ffmpeg.org>
List-Help: <mailto:ffmpeg-devel-request@ffmpeg.org?subject=help>
List-Subscribe: <https://ffmpeg.org/mailman/listinfo/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=subscribe>
Reply-To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Content-Type: multipart/mixed; boundary="===============8102752109795311672=="
Errors-To: ffmpeg-devel-bounces@ffmpeg.org
Sender: "ffmpeg-devel" <ffmpeg-devel-bounces@ffmpeg.org>
Archived-At: <https://master.gitmailbox.com/ffmpegdev/20250719001553.GP29660@pb2/>
List-Archive: <https://master.gitmailbox.com/ffmpegdev/>
List-Post: <mailto:ffmpegdev@gitmailbox.com>


--===============8102752109795311672==
Content-Type: multipart/signed; micalg=pgp-sha512;
	protocol="application/pgp-signature"; boundary="EOAYCpukU7U6EEqY"
Content-Disposition: inline


--EOAYCpukU7U6EEqY
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

Hi Vittorio

On Thu, Jul 17, 2025 at 10:51:57AM +0200, Vittorio Palmisano wrote:
> It adds a new audio filter for running audio transcriptions with the whis=
per model.
> Documentation and examples are included into the patch.
>=20
> Signed-off-by: Vittorio Palmisano <vpalmisano@gmail.com>
> ---
>  configure                |   5 +
>  doc/filters.texi         | 107 +++++++++
>  libavfilter/Makefile     |   2 +
>  libavfilter/af_whisper.c | 452 +++++++++++++++++++++++++++++++++++++++
>  libavfilter/allfilters.c |   2 +
>  5 files changed, 568 insertions(+)
>  create mode 100644 libavfilter/af_whisper.c
[...]

> +static void cb_log(enum ggml_log_level level, const char *text, void *us=
er_data)
> +{
> +    AVFilterContext *ctx =3D (AVFilterContext *) user_data;
> +    switch (level) {
> +    case GGML_LOG_LEVEL_ERROR:
> +        av_log(ctx, AV_LOG_ERROR, "%s", text);
> +        break;
> +    case GGML_LOG_LEVEL_WARN:
> +        av_log(ctx, AV_LOG_WARNING, "%s", text);
> +        break;
> +    case GGML_LOG_LEVEL_INFO:
> +    case GGML_LOG_LEVEL_DEBUG:
> +        av_log(ctx, AV_LOG_DEBUG, "%s", text);
> +        break;
> +    }
> +}

You can factor the av_log() calls out of the switch/case: let the switch
only pick the log level, then call av_log() once after it.
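
Roughly what I have in mind, as an untested sketch. The enums below are
stand-ins with made-up names so that it compiles without the ggml or
libavutil headers; in the filter the switch would map GGML_LOG_LEVEL_* to
AV_LOG_* and the single call would be av_log(ctx, av_level, "%s", text).

```c
#include <stdio.h>

/* Stand-ins for the ggml and libavutil definitions (hypothetical names,
 * only here so the sketch compiles without either library). */
enum sketch_ggml_level { SK_GGML_DEBUG, SK_GGML_INFO, SK_GGML_WARN, SK_GGML_ERROR };
enum sketch_av_level   { SK_AV_ERROR = 16, SK_AV_WARNING = 24, SK_AV_DEBUG = 48 };

/* The switch only selects the level ... */
static int map_log_level(enum sketch_ggml_level level)
{
    switch (level) {
    case SK_GGML_ERROR: return SK_AV_ERROR;
    case SK_GGML_WARN:  return SK_AV_WARNING;
    default:            return SK_AV_DEBUG; /* INFO and DEBUG, as in the patch */
    }
}

/* ... and the logging call appears exactly once. */
static void cb_log_sketch(enum sketch_ggml_level level, const char *text)
{
    int av_level = map_log_level(level);
    /* av_log(ctx, av_level, "%s", text) in the real filter */
    fprintf(stderr, "[%d] %s", av_level, text);
}
```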


> +
> +static int init(AVFilterContext *ctx)
> +{
> +    WhisperContext *wctx =3D ctx->priv;
> +
> +    static AVOnce init_static_once =3D AV_ONCE_INIT;
> +    ff_thread_once(&init_static_once, ggml_backend_load_all);
> +
> +    whisper_log_set(cb_log, ctx);
> +
> +    // Init whisper context
> +    if (!wctx->model_path) {
> +        av_log(ctx, AV_LOG_ERROR, "No whisper model path specified. Use =
the 'model' option.\n");
> +        return AVERROR(EINVAL);
> +    }
> +
> +    struct whisper_context_params params =3D whisper_context_default_par=
ams();
> +    params.use_gpu =3D wctx->use_gpu;
> +    params.gpu_device =3D wctx->gpu_device;
> +
> +    wctx->ctx_wsp =3D whisper_init_from_file_with_params(wctx->model_pat=
h, params);
> +    if (wctx->ctx_wsp =3D=3D NULL) {
> +        av_log(ctx, AV_LOG_ERROR, "Failed to initialize whisper context =
=66rom model: %s\n", wctx->model_path);
> +        return AVERROR(EIO);
> +    }
> +
> +    // Init buffer

> +    wctx->audio_buffer_queue_size =3D WHISPER_SAMPLE_RATE * wctx->queue =
/ 1000000;

The multiplication can overflow, and the 32-bit result can overflow as well.
It is probably best to limit queue to a more reasonable maximum than
INT64_MAX.
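
To illustrate the point, a small stand-alone sketch. The one hour cap and
the names are assumptions of mine, not something from the patch; in the
filter the same effect would presumably come from lowering the maximum of
the queue option rather than from a runtime check.

```c
#include <stdint.h>

#define SKETCH_SAMPLE_RATE  16000                /* value of WHISPER_SAMPLE_RATE in whisper.h */
#define SKETCH_MAX_QUEUE_US INT64_C(3600000000)  /* hypothetical cap: one hour in microseconds */

/* Returns the queue size in samples, or -1 if queue_us is outside the
 * allowed range. */
static int queue_size_in_samples(int64_t queue_us)
{
    if (queue_us <= 0 || queue_us > SKETCH_MAX_QUEUE_US)
        return -1;
    /* With the cap, 16000 * 3.6e9 = 5.76e13 fits in int64_t and the
     * quotient (at most 57600000) fits in a 32-bit int. */
    return (int)(SKETCH_SAMPLE_RATE * queue_us / 1000000);
}
```

Without such a limit, a queue value near INT64_MAX overflows the 64-bit
product, and even moderately large values no longer fit the int field.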


> +    wctx->audio_buffer =3D av_malloc(wctx->audio_buffer_queue_size * siz=
eof(*wctx->audio_buffer));

Use av_calloc() or av_malloc_array() here; both check the
count * element size multiplication for overflow.
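
For reference, this is the kind of check an array allocator adds over an
open-coded multiplication. It is a plain C stand-in with a hypothetical
name, not the libavutil implementation.

```c
#include <stdint.h>
#include <stdlib.h>

/* Refuse the request if nmemb * size would wrap around size_t,
 * instead of silently allocating a too-small buffer. */
static void *alloc_array_checked(size_t nmemb, size_t size)
{
    if (size && nmemb > SIZE_MAX / size)
        return NULL;
    return malloc(nmemb * size);
}
```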


[...]
> +static void run_transcription(AVFilterContext *ctx, AVDictionary **metad=
ata, int end_pos)
> +{
> +    WhisperContext *wctx =3D ctx->priv;
> +    end_pos =3D FFMAX(0, FFMIN(end_pos, wctx->audio_buffer_fill_size));
> +
> +    if (!wctx->ctx_wsp || end_pos =3D=3D 0)
> +        return;
> +
> +    float duration =3D (float) end_pos / WHISPER_SAMPLE_RATE;
[...]

> +    wctx->timestamp +=3D duration * 1000;

Floats are not exact. The rounding errors accumulate and will eventually
cause synchronization issues between the subtitles and the audio or video
over a long enough timespan.

Also, for reproducibility this should use integers.

What you could do is use:
wctx->timestamp +=3D end_pos;

and then replace every use of wctx->timestamp by
wctx->timestamp / WHISPER_SAMPLE_RATE

or by wctx->timestamp / (double)WHISPER_SAMPLE_RATE if the context demands
a double, for example.

That way the code is exact and no errors accumulate.
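
A minimal stand-alone sketch of the idea. The struct and function names are
hypothetical, and the conversion to milliseconds is my assumption about the
unit the callers want.

```c
#include <stdint.h>

#define SKETCH_SAMPLE_RATE 16000 /* value of WHISPER_SAMPLE_RATE in whisper.h */

/* Hypothetical state: the timestamp is kept as a sample count. */
typedef struct SketchClock {
    int64_t timestamp; /* in samples, not milliseconds */
} SketchClock;

/* Advancing the clock is an exact integer addition. */
static void clock_advance(SketchClock *c, int end_pos)
{
    c->timestamp += end_pos;
}

/* Convert only at the point of use; any truncation happens here once
 * and is never fed back into the accumulated value. */
static int64_t clock_ms(const SketchClock *c)
{
    return c->timestamp * 1000 / SKETCH_SAMPLE_RATE;
}
```

The accumulated value stays exact no matter how many chunks are processed,
so the same input always yields the same timestamps.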


> +
> +    if (metadata && segments_text) {
> +        av_dict_set(metadata, "lavfi.whisper.text", segments_text, 0);
> +        char *duration_text =3D av_asprintf("%f", duration);
> +        av_dict_set(metadata, "lavfi.whisper.duration", duration_text, A=
V_DICT_DONT_STRDUP_VAL);
> +    }
> +    av_freep(&segments_text);
> +
> +    memcpy(wctx->audio_buffer, wctx->audio_buffer + end_pos, end_pos * s=
izeof(*wctx->audio_buffer));
> +    wctx->audio_buffer_fill_size -=3D end_pos;
> +    wctx->audio_buffer_vad_size =3D wctx->audio_buffer_fill_size;
> +}
> +

> +static int filter_frame(AVFilterLink *inlink, AVFrame *frame)
> +{
> +    AVFilterContext *ctx =3D inlink->dst;
> +    WhisperContext *wctx =3D ctx->priv;
> +    AVFilterLink *outlink =3D ctx->outputs[0];
> +    AVDictionary **metadata =3D &frame->metadata;
> +
> +    const int samples =3D frame->nb_samples;
> +    const float *input_data =3D (const float *) frame->data[0];
> +
> +    if (wctx->audio_buffer_fill_size + samples > wctx->audio_buffer_queu=
e_size) {
> +        run_transcription(ctx, metadata, wctx->audio_buffer_fill_size);
> +    }
> +
> +    memcpy(wctx->audio_buffer + wctx->audio_buffer_fill_size, input_data=
, samples * sizeof(*wctx->audio_buffer));
> +    wctx->audio_buffer_fill_size +=3D samples;
> +
> +    if (wctx->ctx_vad
> +        && (wctx->audio_buffer_fill_size - wctx->audio_buffer_vad_size) =
>=3D
> +        WHISPER_SAMPLE_RATE * (wctx->vad_min_speech_duration + wctx->vad=
_min_silence_duration) / 1000000) {
> +        struct whisper_vad_segments *segments =3D whisper_vad_segments_f=
rom_samples(wctx->ctx_vad,
> +                                                                        =
          wctx->vad_params,
> +                                                                        =
          wctx->audio_buffer,
> +                                                                        =
          wctx->audio_buffer_fill_size);
> +        wctx->audio_buffer_vad_size =3D wctx->audio_buffer_fill_size;
> +
> +        if (!segments) {
> +            av_log(ctx, AV_LOG_ERROR, "failed to detect VAD\n");
> +        } else {
> +            int n_segments =3D whisper_vad_segments_n_segments(segments);
> +
> +            if (n_segments > 0) {
> +                const float start_ms =3D whisper_vad_segments_get_segmen=
t_t0(segments, 0) * 10.0;
> +                const float end_ms =3D whisper_vad_segments_get_segment_=
t1(segments, n_segments - 1) * 10.0;
> +                int end_pos =3D (int) (end_ms * WHISPER_SAMPLE_RATE / 10=
00);
> +
> +                if (end_pos <=3D wctx->audio_buffer_fill_size - WHISPER_=
SAMPLE_RATE * wctx->vad_min_silence_duration / 1000000) {
> +                    av_log(ctx, AV_LOG_INFO,
> +                            "VAD detected %d segments, start: %.0f ms, e=
nd: %.0f ms (buffer: %d ms)\n",
> +                            n_segments, start_ms, end_ms, 1000 * wctx->a=
udio_buffer_fill_size / WHISPER_SAMPLE_RATE);
> +                    run_transcription(ctx, metadata, end_pos);
> +                }
> +            }
> +
> +            whisper_vad_free_segments(segments);
> +        }
> +    } else if (wctx->audio_buffer_fill_size >=3D wctx->audio_buffer_queu=
e_size)
> +        run_transcription(ctx, metadata, wctx->audio_buffer_fill_size);
> +
> +    wctx->next_pts =3D frame->pts + av_rescale_q(frame->nb_samples, (AVR=
ational) {
> +                                               1, inlink->sample_rate}
> +                                               , inlink->time_base);

I think you should consistently use either samples or frame->nb_samples;
they hold the same value.

thx

[...]
--=20
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

Never trust a computer, one day, it may think you are the virus. -- Compn

--EOAYCpukU7U6EEqY
Content-Type: application/pgp-signature; name="signature.asc"

-----BEGIN PGP SIGNATURE-----

iF0EABEKAB0WIQSf8hKLFH72cwut8TNhHseHBAsPqwUCaHrjtgAKCRBhHseHBAsP
q9cAAJ94FTkl7f4mVc9bjf1ZsGHTC4FdJQCfWyOn7W1LXqgZOehx5DqQ1hs8Ckg=
=UnID
-----END PGP SIGNATURE-----

--EOAYCpukU7U6EEqY--

--===============8102752109795311672==--