From: Marvin Scholz <epirat07-at-gmail.com@ffmpeg.org> To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org> Subject: Re: [FFmpeg-devel] [PATCH] Whisper audio filter Date: Wed, 09 Jul 2025 15:36:15 +0200 Message-ID: <7BDD446F-9ECB-4BBF-88EF-21BCFF80D5EF@gmail.com> In-Reply-To: <20250709072350.578693-1-vpalmisano@gmail.com> On 9 Jul 2025, at 9:23, Vittorio Palmisano wrote: > It adds a new audio filter for running audio transcriptions with the whisper model. > Documentation and examples are included in the patch. > > Signed-off-by: Vittorio Palmisano <vpalmisano@gmail.com> > --- > configure | 5 + > doc/filters.texi | 101 ++++++++ > libavfilter/Makefile | 2 + > libavfilter/af_whisper.c | 494 +++++++++++++++++++++++++++++++++++++++ > libavfilter/allfilters.c | 2 + > 5 files changed, 604 insertions(+) > create mode 100644 libavfilter/af_whisper.c > > diff --git a/configure b/configure > index 2ccafe7c20..573dfc67dc 100755 > --- a/configure > +++ b/configure > @@ -337,6 +337,7 @@ External library support: > --enable-vapoursynth enable VapourSynth demuxer [no] > --disable-xlib disable xlib [autodetect] > --disable-zlib disable zlib [autodetect] > + --enable-whisper enable whisper filter [no] > > The following libraries provide various hardware acceleration features: > --disable-amf disable AMF video encoding code [autodetect] > @@ -2003,6 +2004,7 @@ EXTERNAL_LIBRARY_LIST=" > pocketsphinx > vapoursynth > vulkan_static > + whisper > " > > HWACCEL_AUTODETECT_LIBRARY_LIST=" > @@ -4059,6 +4061,7 @@ xstack_qsv_filter_deps="libmfx" > xstack_qsv_filter_select="qsvvpp" > pad_vaapi_filter_deps="vaapi_1" > drawbox_vaapi_filter_deps="vaapi_1" > +whisper_filter_deps="whisper" > > # examples > avio_http_serve_files_deps="avformat avutil fork" > @@ -7108,6 +7111,8 @@ enabled libvo_amrwbenc && require libvo_amrwbenc vo-amrwbenc/enc_if.h E_IF_in > enabled libvorbis && require_pkg_config libvorbis vorbis vorbis/codec.h vorbis_info_init && > require_pkg_config libvorbisenc vorbisenc vorbis/vorbisenc.h vorbis_encode_init > > +enabled whisper && require_pkg_config whisper "whisper >= 1.7.5" whisper.h whisper_init_from_file_with_params > + > enabled libvpx && { > enabled libvpx_vp8_decoder && { > check_pkg_config libvpx_vp8_decoder "vpx >= 1.4.0" "vpx/vpx_decoder.h vpx/vp8dx.h" vpx_codec_vp8_dx || > diff --git a/doc/filters.texi b/doc/filters.texi > index ed2956fe75..c00e73478f 100644 > --- a/doc/filters.texi > +++ b/doc/filters.texi > @@ -7682,6 +7682,107 @@ There are 6 samples at -4 dB, 62 at -5 dB, 286 at -6 dB, etc. > In other words, raising the volume by +4 dB does not cause any clipping, > raising it by +5 dB causes clipping for 6 samples, etc. > > +@anchor{whisper} > +@section whisper > + > +It runs automatic speech recognition using OpenAI’s Whisper model. > + > +It requires the whisper.cpp library (https://github.com/ggml-org/whisper.cpp) > +as a prerequisite. After installing the library, it can be enabled using: > +@code{./configure --enable-whisper}. > + > +The filter has the following options: > + > +@table @option > +@item model > +The file path of the downloaded whisper.cpp model (mandatory). > + > +@item language > +The language to use for transcription ('auto' for auto-detect). > +Default value: @code{"auto"} > + > +@item queue > +The maximum size in milliseconds that will be queued into the filter before > +processing the audio with whisper. > +Default value: @code{"3000"} > + > +@item use_gpu > +Whether GPU support should be enabled.
> +Default value: @code{"true"} > + > +@item gpu_device > +The GPU device to use. > +Default value: @code{"0"} > + > +@item destination > +If set, the transcription output will be sent to the specified file or URL > +(use one of the FFmpeg AVIO protocols); otherwise, the output will be logged as > +info messages. > +The output will also be set in the "lavfi.whisper.text" frame metadata. > + > +@item format > +The destination format string; it can be "text" (only the transcribed text > +will be sent to the destination), "srt" (subtitle format) or "json". > +Default value: @code{"text"} > + > +@item vad_model > +Path to the VAD model file. If set, the filter will load an additional voice > +activity detection module (https://github.com/snakers4/silero-vad) that will be > +used to fragment the audio queue; set this option to a valid model path obtained > +from the whisper.cpp repository (e.g. "../whisper.cpp/models/ggml-silero-v5.1.2.bin") > +and increase the queue parameter to a higher value (e.g. 10000). > + > +@item vad_threshold > +The VAD threshold to use. > +Default value: @code{"0.5"} > + > +@item vad_min_speech_duration > +The minimum VAD speech duration in milliseconds. > +Default value: @code{"50"} > + > +@item vad_min_silence_duration > +The minimum VAD silence duration in milliseconds. > +Default value: @code{"500"} > + > +@end table > + > +@subsection Examples > +@itemize > + > +@item > +Run a transcription with srt file generation: > +@example > +ffmpeg -i input.mp4 -vn -af "aformat=sample_rates=16000:channel_layouts=mono,whisper=model=../whisper.cpp/models/ggml-base.en.bin\ > +:language=en\ > +:queue=3000\ > +:destination=output.srt\ > +:format=srt" -f null - > +@end example > + > +@item > +Run a transcription and send the output in JSON format to an HTTP service: > +@example > +ffmpeg -i input.mp4 -vn -af "aformat=sample_rates=16000:channel_layouts=mono,whisper=model=../whisper.cpp/models/ggml-base.en.bin\ > +:language=en\ > +:queue=3000\ > +:destination=http\\://localhost\\:3000\ > +:format=json" -f null - > +@end example > + > +@item > +Transcribe the microphone input using the VAD option: > +@example > +ffmpeg -loglevel warning -f pulse -i default \ > +-af 'highpass=f=200,lowpass=f=3000,aformat=sample_rates=16000:channel_layouts=mono,whisper=model=../whisper.cpp/models/ggml-medium.bin\ > +:language=en\ > +:queue=10000\ > +:destination=-\ > +:format=json\ > +:vad_model=../whisper.cpp/models/ggml-silero-v5.1.2.bin' -f null - > +@end example > + > +@end itemize > + > @c man end AUDIO FILTERS > > @chapter Audio Sources > diff --git a/libavfilter/Makefile b/libavfilter/Makefile > index 9e9153f5b0..e133422ca4 100644 > --- a/libavfilter/Makefile > +++ b/libavfilter/Makefile > @@ -188,6 +188,8 @@ OBJS-$(CONFIG_HILBERT_FILTER) += asrc_hilbert.o > OBJS-$(CONFIG_SINC_FILTER) += asrc_sinc.o > OBJS-$(CONFIG_SINE_FILTER) += asrc_sine.o > > +OBJS-$(CONFIG_WHISPER_FILTER) += af_whisper.o > + > OBJS-$(CONFIG_ANULLSINK_FILTER) += asink_anullsink.o > > # video filters > diff --git a/libavfilter/af_whisper.c b/libavfilter/af_whisper.c > new file mode 100644 > index 0000000000..7bfdd3a9dc > --- /dev/null > +++ b/libavfilter/af_whisper.c > @@ -0,0 +1,494 @@ > +/* > + * Copyright (c) 2025 Vittorio Palmisano > + * > + * This file is part of FFmpeg.
> + * > + * FFmpeg is free software; you can redistribute it and/or > + * modify it under the terms of the GNU Lesser General Public License > + * as published by the Free Software Foundation; either > + * version 2.1 of the License, or (at your option) any later version. > + * > + * FFmpeg is distributed in the hope that it will be useful, > + * but WITHOUT ANY WARRANTY; without even the implied warranty of > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the > + * GNU Lesser General Public License for more details. > + * > + * You should have received a copy of the GNU Lesser General Public License > + * along with FFmpeg; if not, write to the Free Software Foundation, Inc., > + * 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA > + */ > + > +#include <stdio.h> > +#include <stdint.h> > +#include <stdlib.h> > + > +#include "libavutil/avutil.h" > +#include "libavutil/opt.h" > +#include "libavutil/channel_layout.h" > +#include "libavutil/samplefmt.h" > +#include "libavfilter/avfilter.h" > +#include "libavfilter/audio.h" > +#include "libavutil/mem.h" > +#include "libavutil/avstring.h" > +#include "libavutil/internal.h" > +#include "libavformat/avio.h" > +#include "libavutil/thread.h" > + > +#include "formats.h" > + > +#include "whisper.h" > + > +typedef struct WhisperContext > +{ > + const AVClass *class; > + char *model_path; > + char *language; > + bool use_gpu; > + int gpu_device; > + char *vad_model_path; > + float vad_threshold; > + int vad_min_speech_duration; > + int vad_min_silence_duration; > + > + int queue; > + char *destination; > + char *format; > + > + struct whisper_context *ctx_wsp; > + struct whisper_state *whisper_state; > + struct whisper_vad_context *ctx_vad; > + struct whisper_vad_params vad_params; > + > + float *audio_buffer; > + int audio_buffer_queue_size; > + int audio_buffer_fill_size; > + int audio_buffer_vad_size; > + > + int eof; > + int64_t next_pts; > + > + AVIOContext *avio_context; > + int index; > + int64_t timestamp; > +} WhisperContext; > + > +static void cb_log_disable(enum ggml_log_level, const char *, void *) {} > + > +static int init(AVFilterContext *ctx) > +{ > + WhisperContext *wctx = ctx->priv; > + > + ggml_backend_load_all(); > + whisper_log_set(cb_log_disable, NULL); > + > + // Init whisper context > + if (!wctx->model_path) > + { > + av_log(ctx, AV_LOG_ERROR, "No whisper model path specified. 
Use the 'model' option.\n"); > + return AVERROR(EINVAL); > + } > + > + struct whisper_context_params params = whisper_context_default_params(); > + params.use_gpu = wctx->use_gpu; > + params.gpu_device = wctx->gpu_device; > + > + wctx->ctx_wsp = whisper_init_from_file_with_params(wctx->model_path, params); > + if (wctx->ctx_wsp == NULL) > + { > + av_log(ctx, AV_LOG_ERROR, "Failed to initialize whisper context from model: %s\n", wctx->model_path); > + return AVERROR(EIO); > + } > + > + wctx->whisper_state = whisper_init_state(wctx->ctx_wsp); > + if (wctx->whisper_state == NULL) > + { > + av_log(ctx, AV_LOG_ERROR, "Failed to get whisper state from context\n"); > + whisper_free(wctx->ctx_wsp); > + wctx->ctx_wsp = NULL; > + return AVERROR(EIO); > + } > + > + // Init VAD model context > + if (wctx->vad_model_path) > + { > + struct whisper_vad_context_params ctx_params = whisper_vad_default_context_params(); > + ctx_params.n_threads = 4; > + // ctx_params.use_gpu = wctx->use_gpu; TODO (see: whisper_vad_init_context) > + ctx_params.gpu_device = wctx->gpu_device; > + wctx->ctx_vad = whisper_vad_init_from_file_with_params( > + wctx->vad_model_path, > + ctx_params); > + > + wctx->vad_params = whisper_vad_default_params(); > + wctx->vad_params.threshold = wctx->vad_threshold; > + wctx->vad_params.min_speech_duration_ms = wctx->vad_min_speech_duration; > + wctx->vad_params.min_silence_duration_ms = wctx->vad_min_silence_duration; > + wctx->vad_params.max_speech_duration_s = (float)(wctx->audio_buffer_queue_size / 1000.0f); > + wctx->vad_params.speech_pad_ms = 0; > + wctx->vad_params.samples_overlap = 0; > + } > + > + // Init buffer > + wctx->audio_buffer_queue_size = WHISPER_SAMPLE_RATE * wctx->queue / 1000; > + wctx->audio_buffer = av_malloc(wctx->audio_buffer_queue_size * sizeof(float)); > + if (!wctx->audio_buffer) > + { > + return AVERROR(ENOMEM); > + } > + > + wctx->audio_buffer_fill_size = 0; > + > + wctx->next_pts = AV_NOPTS_VALUE; > + > + wctx->avio_context = NULL; > + if (wctx->destination && strcmp("", wctx->destination)) > + { > + int ret = 0; > + > + if (!strcmp("-", wctx->destination)) > + { > + ret = avio_open(&wctx->avio_context, "pipe:1", AVIO_FLAG_WRITE); > + } > + else > + { > + ret = avio_open(&wctx->avio_context, wctx->destination, AVIO_FLAG_WRITE); > + } > + > + if (ret < 0) > + { > + av_log(ctx, AV_LOG_ERROR, "Could not open %s: %s\n", > + wctx->destination, av_err2str(ret)); > + return ret; > + } > + > + wctx->avio_context->direct = AVIO_FLAG_DIRECT; > + } > + > + av_log(ctx, AV_LOG_INFO, "Whisper filter initialized: model: %s lang: %s queue: %d ms\n", > + wctx->model_path, wctx->language, wctx->queue); > + > + return 0; > +} > + > +static void uninit(AVFilterContext *ctx) > +{ > + WhisperContext *wctx = ctx->priv; > + > + if (wctx->audio_buffer_fill_size > 0) > + { > + av_log(ctx, AV_LOG_WARNING, "Remaining audio buffer %d samples (%d seconds) after stopping\n", > + wctx->audio_buffer_fill_size, > + wctx->audio_buffer_fill_size / WHISPER_SAMPLE_RATE); > + } > + > + if (wctx->ctx_vad) > + { > + whisper_vad_free(wctx->ctx_vad); > + wctx->ctx_vad = NULL; > + } > + > + if (wctx->whisper_state) > + { > + whisper_free_state(wctx->whisper_state); > + wctx->whisper_state = NULL; > + } > + > + if (wctx->ctx_wsp) > + { > + whisper_free(wctx->ctx_wsp); > + wctx->ctx_wsp = NULL; > + } > + > + av_freep(&wctx->audio_buffer); > + > + if (wctx->avio_context) > + { > + avio_closep(&wctx->avio_context); > + } > +} > + > +static void run_transcription(AVFilterContext *ctx, AVDictionary 
**metadata, int end_pos) > +{ > + WhisperContext *wctx = ctx->priv; > + end_pos = FFMIN(end_pos, wctx->audio_buffer_fill_size); > + > + if (!wctx->ctx_wsp || end_pos == 0) > + { > + return; > + } > + > + if (!wctx->ctx_wsp) > + { > + return; > + } > + > + float duration = (float)end_pos / WHISPER_SAMPLE_RATE; > + > + av_log(ctx, AV_LOG_INFO, "run transcription %d/%d samples (%.2f seconds)...\n", > + end_pos, wctx->audio_buffer_fill_size, > + duration); > + > + struct whisper_full_params params = whisper_full_default_params(WHISPER_SAMPLING_GREEDY); > + params.language = wctx->language; > + params.print_special = 0; > + params.print_progress = 0; > + params.print_realtime = 0; > + params.print_timestamps = 0; > + > + if (whisper_full(wctx->ctx_wsp, params, wctx->audio_buffer, end_pos) != 0) > + { > + av_log(ctx, AV_LOG_ERROR, "Failed to process audio with whisper.cpp\n"); > + return; > + } > + > + const int n_segments = whisper_full_n_segments_from_state(wctx->whisper_state); > + char *segments_text = NULL; > + > + for (int i = 0; i < n_segments; ++i) > + { > + const bool turn = whisper_full_get_segment_speaker_turn_next(wctx->ctx_wsp, i); > + const int64_t t0 = whisper_full_get_segment_t0(wctx->ctx_wsp, i) * 10; > + const int64_t t1 = whisper_full_get_segment_t1(wctx->ctx_wsp, i) * 10; > + const char *text = whisper_full_get_segment_text(wctx->ctx_wsp, i); > + char *text_cleaned = av_strireplace(text + 1, "[BLANK_AUDIO]", ""); > + > + if (av_strnlen(text_cleaned, 1) == 0) > + { > + av_free(text_cleaned); > + continue; > + } > + av_log(ctx, AV_LOG_INFO, " [%ld-%ld%s]: \"%s\"\n", wctx->timestamp + t0, wctx->timestamp + t1, turn ? " (turn)" : "", text_cleaned); > + > + if (segments_text) > + { > + char *new_text = av_asprintf("%s%s", segments_text, text_cleaned); > + av_free(segments_text); > + segments_text = new_text; > + } > + else > + { > + segments_text = av_strdup(text_cleaned); > + } > + > + if (wctx->avio_context) > + { > + const int64_t start_t = wctx->timestamp + t0; > + const int64_t end_t = wctx->timestamp + t1; > + char *buf = NULL; > + > + if (!av_strcasecmp(wctx->format, "srt")) > + { > + buf = av_asprintf("%d\n%02ld:%02ld:%02ld.%03ld --> %02ld:%02ld:%02ld.%03ld\n%s\n\n", > + wctx->index, > + start_t / 3600000, (start_t / 60000) % 60, (start_t / 1000) % 60, start_t % 1000, > + end_t / 3600000, (end_t / 60000) % 60, (end_t / 1000) % 60, end_t % 1000, > + text_cleaned); > + } > + else if (!av_strcasecmp(wctx->format, "json")) > + { > + buf = av_asprintf("{\"start\":%ld,\"end\":%ld,\"text\":\"%s\",\"turn\":%s}\n", > + start_t, end_t, text_cleaned, turn ? 
"true" : "false"); > + } > + else > + { > + buf = av_strdup(text_cleaned); > + } > + > + if (buf) > + { > + avio_write(wctx->avio_context, buf, strlen(buf)); > + av_free(buf); > + } > + } > + > + av_free(text_cleaned); > + } > + > + wctx->index++; > + wctx->timestamp += (int64_t)(duration * 1000); > + > + if (metadata && segments_text) > + { > + av_dict_set(metadata, "lavfi.whisper.text", segments_text, 0); > + char *duration_text = av_asprintf("%f", duration); > + av_dict_set(metadata, "lavfi.whisper.duration", duration_text, 0); > + av_free(duration_text); > + } > + if (segments_text) > + { > + av_free(segments_text); > + } > + > + memcpy(wctx->audio_buffer, wctx->audio_buffer + end_pos, end_pos * sizeof(float)); > + wctx->audio_buffer_fill_size -= end_pos; > + wctx->audio_buffer_vad_size = wctx->audio_buffer_fill_size; > +} > + > +static int filter_frame(AVFilterLink *inlink, AVFrame *frame) > +{ > + AVFilterContext *ctx = inlink->dst; > + WhisperContext *wctx = ctx->priv; > + AVFilterLink *outlink = ctx->outputs[0]; > + AVDictionary **metadata = &frame->metadata; > + > + const int samples = frame->nb_samples; > + const float *input_data = (const float *)frame->data[0]; > + > + if (wctx->audio_buffer_fill_size + samples > wctx->audio_buffer_queue_size) > + { > + run_transcription(ctx, metadata, wctx->audio_buffer_fill_size); > + } > + > + memcpy(wctx->audio_buffer + wctx->audio_buffer_fill_size, input_data, samples * sizeof(float)); > + wctx->audio_buffer_fill_size += samples; > + > + if (wctx->ctx_vad && (wctx->audio_buffer_fill_size - wctx->audio_buffer_vad_size) >= > + WHISPER_SAMPLE_RATE * (wctx->vad_min_speech_duration + wctx->vad_min_silence_duration) / 1000) > + { > + struct whisper_vad_segments *segments = whisper_vad_segments_from_samples( > + wctx->ctx_vad, wctx->vad_params, wctx->audio_buffer, wctx->audio_buffer_fill_size); > + wctx->audio_buffer_vad_size = wctx->audio_buffer_fill_size; > + > + if (!segments) > + { > + av_log(ctx, AV_LOG_ERROR, "failed to detect VAD\n"); > + } > + else > + { > + int n_segments = whisper_vad_segments_n_segments(segments); > + > + if (n_segments > 0) > + { > + const int64_t start_ms = whisper_vad_segments_get_segment_t0(segments, n_segments - 1) * 10; > + const int64_t end_ms = whisper_vad_segments_get_segment_t1(segments, n_segments - 1) * 10; > + int end_pos = (int)(end_ms * WHISPER_SAMPLE_RATE / 1000); > + > + if (end_pos < wctx->audio_buffer_fill_size) > + { > + av_log(ctx, AV_LOG_INFO, "VAD detected %d segments, start: %ld ms, end: %ld ms (buffer: %d ms)\n", > + n_segments, start_ms, end_ms, 1000 * wctx->audio_buffer_fill_size / WHISPER_SAMPLE_RATE); > + run_transcription(ctx, metadata, end_pos); > + } > + } > + > + whisper_vad_free_segments(segments); > + } > + } > + else if (wctx->audio_buffer_fill_size >= wctx->audio_buffer_queue_size) > + { > + run_transcription(ctx, metadata, wctx->audio_buffer_fill_size); > + } > + > + wctx->next_pts = frame->pts + av_rescale_q(frame->nb_samples, (AVRational){1, inlink->sample_rate}, inlink->time_base); > + return ff_filter_frame(outlink, frame); > +} > + > +static int push_last_frame(AVFilterLink *outlink) > +{ > + AVFilterContext *ctx = outlink->src; > + WhisperContext *wctx = ctx->priv; > + AVFrame *frame; > + int n_out = 1; > + > + if (ctx->is_disabled || wctx->audio_buffer_fill_size == 0) > + return 0; > + frame = ff_get_audio_buffer(outlink, n_out); > + if (!frame) > + return AVERROR(ENOMEM); > + > + av_samples_set_silence(frame->extended_data, 0, > + n_out, > + frame->ch_layout.nb_channels, > 
+ frame->format); > + > + frame->pts = wctx->next_pts; > + if (wctx->next_pts != AV_NOPTS_VALUE) > + wctx->next_pts += av_rescale_q(n_out, (AVRational){1, outlink->sample_rate}, outlink->time_base); > + > + run_transcription(ctx, &frame->metadata, wctx->audio_buffer_fill_size); > + > + return ff_filter_frame(outlink, frame); > +} > + > +static int activate(AVFilterContext *ctx) > +{ > + AVFilterLink *inlink = ctx->inputs[0]; > + AVFilterLink *outlink = ctx->outputs[0]; > + WhisperContext *wctx = ctx->priv; > + int64_t pts; > + int status; > + > + FF_FILTER_FORWARD_STATUS_BACK(outlink, inlink); > + > + if (!wctx->eof && ff_inlink_queued_frames(inlink)) > + { > + AVFrame *frame = NULL; > + int ret; > + > + ret = ff_inlink_consume_frame(inlink, &frame); > + if (ret < 0) > + return ret; > + if (ret > 0) > + return filter_frame(inlink, frame); > + } > + > + if (!wctx->eof && ff_inlink_acknowledge_status(inlink, &status, &pts)) > + wctx->eof = status == AVERROR_EOF; > + > + if (wctx->eof) > + { > + push_last_frame(outlink); > + > + ff_outlink_set_status(outlink, AVERROR_EOF, wctx->next_pts); > + return 0; > + } > + > + FF_FILTER_FORWARD_WANTED(outlink, inlink); > + > + return FFERROR_NOT_READY; > +} > + > +#define OFFSET(x) offsetof(WhisperContext, x) > +#define FLAGS AV_OPT_FLAG_AUDIO_PARAM | AV_OPT_FLAG_FILTERING_PARAM > + > +static const AVOption whisper_options[] = { > + {"model", "Path to the whisper.cpp model file", OFFSET(model_path), AV_OPT_TYPE_STRING, .flags = FLAGS}, > + {"language", "Language for transcription ('auto' for auto-detect)", OFFSET(language), AV_OPT_TYPE_STRING, {.str = "auto"}, .flags = FLAGS}, > + {"queue", "Audio queue size in milliseconds", OFFSET(queue), AV_OPT_TYPE_INT, {.i64 = 3000}, 20, INT_MAX, .flags = FLAGS}, This should probably be AV_OPT_TYPE_DURATION. > + {"use_gpu", "Use GPU for processing", OFFSET(use_gpu), AV_OPT_TYPE_BOOL, {.i64 = 1}, 0, 1, .flags = FLAGS}, > + {"gpu_device", "GPU device to use", OFFSET(gpu_device), AV_OPT_TYPE_INT, {.i64 = 0}, 0, INT_MAX, .flags = FLAGS}, > + {"destination", "Output destination", OFFSET(destination), AV_OPT_TYPE_STRING, {.str = ""}, .flags = FLAGS}, > + {"format", "Output format (text|srt|json)", OFFSET(format), AV_OPT_TYPE_STRING, {.str = "text"}, .flags = FLAGS}, > + {"vad_model", "Path to the VAD model file", OFFSET(vad_model_path), AV_OPT_TYPE_STRING, .flags = FLAGS}, > + {"vad_threshold", "VAD threshold", OFFSET(vad_threshold), AV_OPT_TYPE_FLOAT, {.dbl = 0.5}, 0.0, 1.0, .flags = FLAGS}, > + {"vad_min_speech_duration", "Minimum speech duration in milliseconds for VAD", OFFSET(vad_min_speech_duration), AV_OPT_TYPE_INT, {.i64 = 50}, 20, INT_MAX, .flags = FLAGS}, > + {"vad_min_silence_duration", "Minimum silence duration in milliseconds for VAD", OFFSET(vad_min_silence_duration), AV_OPT_TYPE_INT, {.i64 = 500}, 0, INT_MAX, .flags = FLAGS}, These should be AV_OPT_TYPE_DURATION too. 
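To illustrate what I mean (untested sketch, not meant as the exact code to use; option and field names are taken from your patch, the numbers are just examples): AV_OPT_TYPE_DURATION options are stored as an int64_t in microseconds and parsed with av_parse_time(), so users could then write e.g. queue=3s or queue=2500ms on the command line. The queue field would become int64_t and the millisecond arithmetic would need a conversion, roughly:

    /* WhisperContext: int64_t queue;  (microseconds instead of int milliseconds) */
    {"queue", "audio queue size", OFFSET(queue), AV_OPT_TYPE_DURATION, {.i64 = 3000000}, 20000, INT64_MAX, .flags = FLAGS},

    /* init(): size the buffer from a microsecond value */
    wctx->audio_buffer_queue_size = av_rescale(wctx->queue, WHISPER_SAMPLE_RATE, AV_TIME_BASE);

The same pattern would apply to the two vad_min_* duration options.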
> + {NULL}}; > + > +static const AVClass whisper_class = { > + .class_name = "whisper", > + .item_name = av_default_item_name, > + .option = whisper_options, > + .version = LIBAVUTIL_VERSION_INT, > +}; > + > +static const AVFilterPad whisper_outputs[] = { > + { > + .name = "default", > + .type = AVMEDIA_TYPE_AUDIO, > + }, > +}; > + > +const FFFilter ff_af_whisper = { > + .p.name = "whisper", > + .p.description = NULL_IF_CONFIG_SMALL("Transcribe audio using whisper.cpp."), > + .p.priv_class = &whisper_class, > + .p.flags = AVFILTER_FLAG_METADATA_ONLY, > + .init = init, > + .uninit = uninit, > + .activate = activate, > + .priv_size = sizeof(WhisperContext), > + FILTER_INPUTS(ff_audio_default_filterpad), > + FILTER_OUTPUTS(whisper_outputs), > + FILTER_SAMPLEFMTS(AV_SAMPLE_FMT_FLT, AV_SAMPLE_FMT_FLTP), > +}; > diff --git a/libavfilter/allfilters.c b/libavfilter/allfilters.c > index 409099bf1f..eaf0c8fe6f 100644 > --- a/libavfilter/allfilters.c > +++ b/libavfilter/allfilters.c > @@ -163,6 +163,8 @@ extern const FFFilter ff_af_virtualbass; > extern const FFFilter ff_af_volume; > extern const FFFilter ff_af_volumedetect; > > +extern const FFFilter ff_af_whisper; > + > extern const FFFilter ff_asrc_aevalsrc; > extern const FFFilter ff_asrc_afdelaysrc; > extern const FFFilter ff_asrc_afireqsrc; > -- > 2.43.0