Date: Sat, 19 Jul 2025 02:15:53 +0200
From: Michael Niedermayer
To: FFmpeg development discussions and patches
Message-ID: <20250719001553.GP29660@pb2>
References: <20250717085157.88889-1-vpalmisano@gmail.com>
In-Reply-To: <20250717085157.88889-1-vpalmisano@gmail.com>
Subject: Re: [FFmpeg-devel] [PATCH] Whisper audio filter

Hi Vittorio

On Thu, Jul 17, 2025 at 10:51:57AM +0200, Vittorio Palmisano wrote:
> It adds a new audio filter for running audio transcriptions with the whisper model.
> Documentation and examples are included into the patch.
>
> Signed-off-by: Vittorio Palmisano
> ---
>  configure                |   5 +
>  doc/filters.texi         | 107 +++++++++
>  libavfilter/Makefile     |   2 +
>  libavfilter/af_whisper.c | 452 +++++++++++++++++++++++++++++++++++++++
>  libavfilter/allfilters.c |   2 +
>  5 files changed, 568 insertions(+)
>  create mode 100644 libavfilter/af_whisper.c

[...]

> +static void cb_log(enum ggml_log_level level, const char *text, void *user_data)
> +{
> +    AVFilterContext *ctx = (AVFilterContext *) user_data;
> +    switch (level) {
> +    case GGML_LOG_LEVEL_ERROR:
> +        av_log(ctx, AV_LOG_ERROR, "%s", text);
> +        break;
> +    case GGML_LOG_LEVEL_WARN:
> +        av_log(ctx, AV_LOG_WARNING, "%s", text);
> +        break;
> +    case GGML_LOG_LEVEL_INFO:
> +    case GGML_LOG_LEVEL_DEBUG:
> +        av_log(ctx, AV_LOG_DEBUG, "%s", text);
> +        break;
> +    }
> +}

You can factor the av_log() calls out of the switch/case, so that the switch
only selects the log level.

> +
> +static int init(AVFilterContext *ctx)
> +{
> +    WhisperContext *wctx = ctx->priv;
> +
> +    static AVOnce init_static_once = AV_ONCE_INIT;
> +    ff_thread_once(&init_static_once, ggml_backend_load_all);
> +
> +    whisper_log_set(cb_log, ctx);
> +
> +    // Init whisper context
> +    if (!wctx->model_path) {
> +        av_log(ctx, AV_LOG_ERROR, "No whisper model path specified. Use the 'model' option.\n");
> +        return AVERROR(EINVAL);
> +    }
> +
> +    struct whisper_context_params params = whisper_context_default_params();
> +    params.use_gpu = wctx->use_gpu;
> +    params.gpu_device = wctx->gpu_device;
> +
> +    wctx->ctx_wsp = whisper_init_from_file_with_params(wctx->model_path, params);
> +    if (wctx->ctx_wsp == NULL) {
> +        av_log(ctx, AV_LOG_ERROR, "Failed to initialize whisper context from model: %s\n", wctx->model_path);
> +        return AVERROR(EIO);
> +    }
> +
> +    // Init buffer
> +    wctx->audio_buffer_queue_size = WHISPER_SAMPLE_RATE * wctx->queue / 1000000;

The multiplication can overflow, and the 32-bit result can overflow too.
It is probably best to limit queue to a more reasonable value than INT64_MAX.

> +    wctx->audio_buffer = av_malloc(wctx->audio_buffer_queue_size * sizeof(*wctx->audio_buffer));

Use av_calloc() or av_malloc_array() here.

[...]

> +static void run_transcription(AVFilterContext *ctx, AVDictionary **metadata, int end_pos)
> +{
> +    WhisperContext *wctx = ctx->priv;
> +    end_pos = FFMAX(0, FFMIN(end_pos, wctx->audio_buffer_fill_size));
> +
> +    if (!wctx->ctx_wsp || end_pos == 0)
> +        return;
> +
> +    float duration = (float) end_pos / WHISPER_SAMPLE_RATE;

[...]

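To illustrate the two remarks on init() above (the bound on queue and the
allocation), here is a minimal standalone sketch. It is not the patch's code:
SAMPLE_RATE stands in for WHISPER_SAMPLE_RATE, the QUEUE_MIN_US and
QUEUE_MAX_US bounds and both function names are hypothetical, and calloc()
stands in for av_calloc() so that the sketch builds without libavutil.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define SAMPLE_RATE  16000                 /* stands in for WHISPER_SAMPLE_RATE */
#define QUEUE_MIN_US INT64_C(20000)        /* hypothetical lower bound: 20 ms */
#define QUEUE_MAX_US INT64_C(3600000000)   /* hypothetical upper bound: 1 hour */

/* Convert a queue duration in microseconds to a sample count.
 * Returns -1 when the duration is outside the accepted range. */
static int queue_us_to_samples(int64_t queue_us)
{
    if (queue_us < QUEUE_MIN_US || queue_us > QUEUE_MAX_US)
        return -1;
    /* With the bound above the product is at most 16000 * 3.6e9 = 5.76e13,
     * far below INT64_MAX, so the 64-bit multiplication cannot overflow. */
    int64_t samples = (int64_t)SAMPLE_RATE * queue_us / 1000000;
    /* At most 57,600,000 samples, which fits in an int. */
    return (int)samples;
}

/* Allocate a zeroed sample buffer. The element count and element size are
 * passed separately instead of being multiplied by hand, which is the
 * calling convention av_calloc() and av_malloc_array() also use. */
static float *alloc_samples(int nb_samples)
{
    assert(nb_samples > 0);
    return calloc((size_t)nb_samples, sizeof(float));
}
```

In the filter itself the same bound could presumably be expressed through the
min/max fields of the queue AVOption, so that out-of-range values are already
rejected when the option is parsed.
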
> +    wctx->timestamp += duration * 1000;

Floats are not precise, and the accumulated rounding errors will add up and
lead to synchronization issues between the subtitles and the audio or video
over a long enough timespan.

Also, for reproducibility this should use integers.

What you could do is use:

    wctx->timestamp += end_pos;

and then replace every use of wctx->timestamp with
wctx->timestamp / WHISPER_SAMPLE_RATE, or with
wctx->timestamp / (double)WHISPER_SAMPLE_RATE if the context demands a double,
for example. That way the code is exact and no errors accumulate.

> +
> +    if (metadata && segments_text) {
> +        av_dict_set(metadata, "lavfi.whisper.text", segments_text, 0);
> +        char *duration_text = av_asprintf("%f", duration);
> +        av_dict_set(metadata, "lavfi.whisper.duration", duration_text, AV_DICT_DONT_STRDUP_VAL);
> +    }
> +    av_freep(&segments_text);
> +
> +    memcpy(wctx->audio_buffer, wctx->audio_buffer + end_pos, end_pos * sizeof(*wctx->audio_buffer));
> +    wctx->audio_buffer_fill_size -= end_pos;
> +    wctx->audio_buffer_vad_size = wctx->audio_buffer_fill_size;
> +}
> +
> +static int filter_frame(AVFilterLink *inlink, AVFrame *frame)
> +{
> +    AVFilterContext *ctx = inlink->dst;
> +    WhisperContext *wctx = ctx->priv;
> +    AVFilterLink *outlink = ctx->outputs[0];
> +    AVDictionary **metadata = &frame->metadata;
> +
> +    const int samples = frame->nb_samples;
> +    const float *input_data = (const float *) frame->data[0];
> +
> +    if (wctx->audio_buffer_fill_size + samples > wctx->audio_buffer_queue_size) {
> +        run_transcription(ctx, metadata, wctx->audio_buffer_fill_size);
> +    }
> +
> +    memcpy(wctx->audio_buffer + wctx->audio_buffer_fill_size, input_data, samples * sizeof(*wctx->audio_buffer));
> +    wctx->audio_buffer_fill_size += samples;
> +
> +    if (wctx->ctx_vad
> +        && (wctx->audio_buffer_fill_size - wctx->audio_buffer_vad_size) >=
> +        WHISPER_SAMPLE_RATE * (wctx->vad_min_speech_duration + wctx->vad_min_silence_duration) / 1000000) {
> +        struct whisper_vad_segments *segments = whisper_vad_segments_from_samples(wctx->ctx_vad,
> +                                                                                  wctx->vad_params,
> +                                                                                  wctx->audio_buffer,
> +                                                                                  wctx->audio_buffer_fill_size);
> +        wctx->audio_buffer_vad_size = wctx->audio_buffer_fill_size;
> +
> +        if (!segments) {
> +            av_log(ctx, AV_LOG_ERROR, "failed to detect VAD\n");
> +        } else {
> +            int n_segments = whisper_vad_segments_n_segments(segments);
> +
> +            if (n_segments > 0) {
> +                const float start_ms = whisper_vad_segments_get_segment_t0(segments, 0) * 10.0;
> +                const float end_ms = whisper_vad_segments_get_segment_t1(segments, n_segments - 1) * 10.0;
> +                int end_pos = (int) (end_ms * WHISPER_SAMPLE_RATE / 1000);
> +
> +                if (end_pos <= wctx->audio_buffer_fill_size - WHISPER_SAMPLE_RATE * wctx->vad_min_silence_duration / 1000000) {
> +                    av_log(ctx, AV_LOG_INFO,
> +                           "VAD detected %d segments, start: %.0f ms, end: %.0f ms (buffer: %d ms)\n",
> +                           n_segments, start_ms, end_ms, 1000 * wctx->audio_buffer_fill_size / WHISPER_SAMPLE_RATE);
> +                    run_transcription(ctx, metadata, end_pos);
> +                }
> +            }
> +
> +            whisper_vad_free_segments(segments);
> +        }
> +    } else if (wctx->audio_buffer_fill_size >= wctx->audio_buffer_queue_size)
> +        run_transcription(ctx, metadata, wctx->audio_buffer_fill_size);
> +
> +    wctx->next_pts = frame->pts + av_rescale_q(frame->nb_samples, (AVRational) {
> +                                               1, inlink->sample_rate}
> +                                               , inlink->time_base);

I think you should consistently use either samples or frame->nb_samples;
they are the same value, I think.

thx

[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

Never trust a computer, one day, it may think you are the virus.
-- Compn