From: Michael Niedermayer <michael@niedermayer.cc>
To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Subject: Re: [FFmpeg-devel] [PATCH] Whisper audio filter
Date: Sat, 12 Jul 2025 02:03:30 +0200
Message-ID: <20250712000330.GX29660@pb2>
In-Reply-To: <CADv15W-W3=VkGcJfnnbD7mw5JdxhB7Vn+Dr_O4+4Pt47YSnHqg@mail.gmail.com>
Hi Vittorio
On Fri, Jul 11, 2025 at 10:41:04AM +0200, Vittorio Palmisano wrote:
> > > +
> > > + memcpy(wctx->audio_buffer, wctx->audio_buffer + end_pos,
> > > + end_pos * sizeof(float));
> >
> > sizeof(*wctx->audio_buffer) is more robust than float
>
> But end_pos is not necessarily equal to the audio_buffer size, it
> could be lower.
you misunderstood
sizeof(*wctx->audio_buffer) == sizeof(float)
I was just suggesting to use the "type of the array" rather than repeating
the type in the source
>
> >
> > not sure how others think of this, but i would ignore the 80 char limit and format this like:
> >
> > static const AVOption whisper_options[] = {
> > { "model" , "Path to the whisper.cpp model file" , OFFSET(model_path), AV_OPT_TYPE_STRING,.flags = FLAGS },
> > { "language", "Language for transcription ('auto' for auto-detect)", OFFSET(language) , AV_OPT_TYPE_STRING, {.str = "auto"}, .flags = FLAGS },
>
> I've used `indent -i4 -kr -nut` to format the code.
human formatted code looks better than what indent generates.
We are not literally using indent to format code.
the docs also say "The presentation is one inspired by 'indent -i4 -kr -nut'."
A human will add a space here or an empty line there or align things to make
everything neatly formatted and readable.
indent is not a human and not AI.
AI produces this: (I didn't verify this is still correct, but it should
show that it's more readable)
static const AVOption whisper_options[] = {
    { "model",                    "Path to the whisper.cpp model file",                  OFFSET(model_path),               AV_OPT_TYPE_STRING, {.str = NULL},   0,   0,       FLAGS },
    { "language",                 "Language for transcription ('auto' for auto-detect)", OFFSET(language),                 AV_OPT_TYPE_STRING, {.str = "auto"}, 0,   0,       FLAGS },
    { "queue",                    "Audio queue size in milliseconds",                    OFFSET(queue),                    AV_OPT_TYPE_INT,    {.i64 = 3000},   20,  INT_MAX, FLAGS },
    { "use_gpu",                  "Use GPU for processing",                              OFFSET(use_gpu),                  AV_OPT_TYPE_BOOL,   {.i64 = 1},      0,   1,       FLAGS },
    { "gpu_device",               "GPU device to use",                                   OFFSET(gpu_device),               AV_OPT_TYPE_INT,    {.i64 = 0},      0,   INT_MAX, FLAGS },
    { "threads",                  "Number of threads to use",                            OFFSET(threads),                  AV_OPT_TYPE_INT,    {.i64 = 4},      0,   INT_MAX, FLAGS },
    { "destination",              "Output destination",                                  OFFSET(destination),              AV_OPT_TYPE_STRING, {.str = ""},     0,   0,       FLAGS },
    { "format",                   "Output format (text|srt|json)",                       OFFSET(format),                   AV_OPT_TYPE_STRING, {.str = "text"}, 0,   0,       FLAGS },
    { "vad_model",                "Path to the VAD model file",                          OFFSET(vad_model_path),           AV_OPT_TYPE_STRING, {.str = NULL},   0,   0,       FLAGS },
    { "vad_threshold",            "VAD threshold",                                       OFFSET(vad_threshold),            AV_OPT_TYPE_FLOAT,  {.dbl = 0.5},    0.0, 1.0,     FLAGS },
    { "vad_min_speech_duration",  "Minimum speech duration in milliseconds for VAD",     OFFSET(vad_min_speech_duration),  AV_OPT_TYPE_INT,    {.i64 = 50},     20,  INT_MAX, FLAGS },
    { "vad_min_silence_duration", "Minimum silence duration in milliseconds for VAD",    OFFSET(vad_min_silence_duration), AV_OPT_TYPE_INT,    {.i64 = 500},    0,   INT_MAX, FLAGS },
    { NULL }
};
>
> >
> > Also it seems, this is alot slower than whisper-cli
> >
> > time whisper-cli matrix.wav -m ~/whisper.cpp/models/ggml-base.en.bin --output-srt
> > real 0m16,283s
> > user 1m3,644s
> > sys 0m0,581s
> >
> >
> > time ./ffmpeg -v 99 -i matrix.wav -af "aformat=sample_rates=16000:channel_layouts=mono,whisper=model=/home/michael/whisper.cpp/models/ggml-base.en.bin:language=en:queue=3000:destination=output.srt:format=srt" -f null - 2> /tmp/log
> > real 1m30,827s
> > user 6m0,590s
> > sys 0m0,756s
> >
>
> Tested with: https://github.com/vpalmisano/webrtcperf/releases/download/videos-1.0/kt.mp4
> (and you need to increase the queue param to obtain a fair
> comparison):
This should be explained better in the documentation
it just says:
@item queue
The maximum size in milliseconds that will be queued into the filter before
processing the audio with whisper
Default value: @code{"3000"}
From reading that I have no idea that its value affects speed.
I might guess it affects latency.
Please make this a bit more elaborate so the user has enough information
so she can select a queue value.
ATM she just has an example value which seemed slow
thx
[...]
--
Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB
Complexity theory is the science of finding the exact solution to an
approximation. Benchmarking OTOH is finding an approximation of the exact