Date: Thu, 17 Jul 2025 08:37:04 +0000 (UTC)
From: Marcos Del Sol
To: ffmpeg-devel@ffmpeg.org
Cc: Marcos Del Sol Vives
Message-ID: <1176813226.32375243.1752741424508.JavaMail.zimbra@orca.pet>
In-Reply-To: <1210481741.184664378.1748342436436.JavaMail.zimbra@orca.pet>
References: <20250527102811.369474-1-marcos@orca.pet> <1210481741.184664378.1748342436436.JavaMail.zimbra@orca.pet>
Subject: Re: [FFmpeg-devel] [PATCH] avformat/webvttdec: improve WebVTT parsing
List-Id: FFmpeg development discussions and patches
Reply-To: FFmpeg development discussions and patches

Hello.

Can someone with merging permission have a look at this please? This bug is still impacting me.

Thanks,
Marcos

-----Original message-----
From: Marcos
To: ffmpeg-devel
CC: Marcos
Date: Tuesday, 27 May 2025 12:40 CEST
Subject: Re: [PATCH] avformat/webvttdec: improve WebVTT parsing

A note on this change: while using yt-dlp I found some .vtt files that follow a draft version of the WebVTT standard (probably https://www.w3.org/2013/07/webvtt.html) and use "Region:" instead of "REGION" for regions, which caused the conversion to .srt to fail completely.

An example of such a file is available at https://statics.3cat.cat/multimedia/vtt/2/4/1745967620742.vtt, which belongs to a state-owned free online streaming service in Spain.

While the diff might seem large, I basically moved the existing cue parsing into a function because that made error handling much easier. The logic is exactly the same.

-----Original message-----
From: Marcos
To: ffmpeg-devel
CC: Marcos
Date: Tuesday, 27 May 2025 12:29 CEST
Subject: [PATCH] avformat/webvttdec: improve WebVTT parsing

The parser will now strictly check that WebVTT files start with the correct "WEBVTT" marker. Previously, this was not verified.

It will also now ignore all non-cue blocks, instead of only a hardcoded list. This is closer to the specification, which calls for no action when unknown blocks are encountered.

Signed-off-by: Marcos Del Sol Vives
---
 libavformat/webvttdec.c | 178 ++++++++++++++++++++++------------------
 1 file changed, 98 insertions(+), 80 deletions(-)

diff --git a/libavformat/webvttdec.c b/libavformat/webvttdec.c
index 6feda1585e..b454b2c1cf 100644
--- a/libavformat/webvttdec.c
+++ b/libavformat/webvttdec.c
@@ -58,6 +58,79 @@ static int64_t read_ts(const char *s)
     return AV_NOPTS_VALUE;
 }
 
+static int webvtt_parse_cue(WebVTTContext *webvtt, AVBPrint *cue, int64_t pos)
+{
+    int i;
+    AVPacket *sub;
+    const char *p, *identifier, *settings;
+    size_t identifier_len, settings_len;
+    int64_t ts_start, ts_end;
+
+    p = identifier = cue->str;
+
+    /* optional cue identifier (can be a number like in SRT or some kind of
+     * chaptering id) */
+    for (i = 0; p[i] && p[i] != '\n' && p[i] != '\r'; i++) {
+        if (!strncmp(p + i, "-->", 3)) {
+            identifier = NULL;
+            break;
+        }
+    }
+    if (!identifier)
+        identifier_len = 0;
+    else {
+        identifier_len = strcspn(p, "\r\n");
+        p += identifier_len;
+        if (*p == '\r')
+            p++;
+        if (*p == '\n')
+            p++;
+    }
+
+    /* cue timestamps */
+    if ((ts_start = read_ts(p)) == AV_NOPTS_VALUE)
+        return AVERROR_INVALIDDATA;
+    if (!(p = strstr(p, "-->")))
+        return AVERROR_INVALIDDATA;
+    p += 2;
+    do p++; while (*p == ' ' || *p == '\t');
+    if ((ts_end = read_ts(p)) == AV_NOPTS_VALUE)
+        return AVERROR_INVALIDDATA;
+
+    /* optional cue settings */
+    p += strcspn(p, "\n\r\t ");
+    while (*p == '\t' || *p == ' ')
+        p++;
+    settings = p;
+    settings_len = strcspn(p, "\r\n");
+    p += settings_len;
+    if (*p == '\r')
+        p++;
+    if (*p == '\n')
+        p++;
+
+    /* create packet */
+    sub = ff_subtitles_queue_insert(&webvtt->q, p, strlen(p), 0);
+    if (!sub)
+        return AVERROR(ENOMEM);
+    sub->pos = pos;
+    sub->pts = ts_start;
+    sub->duration = ts_end - ts_start;
+
+#define SET_SIDE_DATA(name, type) do {                                  \
+    if (name##_len) {                                                   \
+        uint8_t *buf = av_packet_new_side_data(sub, type, name##_len);  \
+        if (!buf)                                                       \
+            return AVERROR(ENOMEM);                                     \
+        memcpy(buf, name, name##_len);                                  \
+    }                                                                   \
+} while (0)
+
+    SET_SIDE_DATA(identifier, AV_PKT_DATA_WEBVTT_IDENTIFIER);
+    SET_SIDE_DATA(settings, AV_PKT_DATA_WEBVTT_SETTINGS);
+    return 0;
+}
+
 static int webvtt_read_header(AVFormatContext *s)
 {
     WebVTTContext *webvtt = s->priv_data;
@@ -74,13 +147,27 @@ static int webvtt_read_header(AVFormatContext *s)
 
     av_bprint_init(&cue, 0, AV_BPRINT_SIZE_UNLIMITED);
 
+    res = ff_subtitles_read_chunk(s->pb, &cue);
+    if (res < 0) {
+        av_log(s, AV_LOG_ERROR, "Unable to read file header\n");
+        goto end;
+    }
+
+    if (!cue.len) {
+        av_log(s, AV_LOG_ERROR, "Unable to read file header\n");
+        res = AVERROR_EOF;
+        goto end;
+    }
+
+    if (strncmp(cue.str, "\xEF\xBB\xBFWEBVTT", 9) &&
+        strncmp(cue.str, "WEBVTT", 6)) {
+        av_log(s, AV_LOG_ERROR, "Invalid file header\n");
+        res = AVERROR_INVALIDDATA;
+        goto end;
+    }
+
     for (;;) {
-        int i;
-        int64_t pos;
-        AVPacket *sub;
-        const char *p, *identifier, *settings;
-        size_t identifier_len, settings_len;
-        int64_t ts_start, ts_end;
+        int64_t pos = avio_tell(s->pb);
 
         res = ff_subtitles_read_chunk(s->pb, &cue);
         if (res < 0)
@@ -89,81 +176,12 @@ static int webvtt_read_header(AVFormatContext *s)
         if (!cue.len)
             break;
 
-        p = identifier = cue.str;
-        pos = avio_tell(s->pb);
-
-        /* ignore header chunk */
-        if (!strncmp(p, "\xEF\xBB\xBFWEBVTT", 9) ||
-            !strncmp(p, "WEBVTT", 6) ||
-            !strncmp(p, "STYLE", 5) ||
-            !strncmp(p, "REGION", 6) ||
-            !strncmp(p, "NOTE", 4))
-            continue;
-
-        /* optional cue identifier (can be a number like in SRT or some kind of
-         * chaptering id) */
-        for (i = 0; p[i] && p[i] != '\n' && p[i] != '\r'; i++) {
-            if (!strncmp(p + i, "-->", 3)) {
-                identifier = NULL;
-                break;
-            }
-        }
-        if (!identifier)
-            identifier_len = 0;
-        else {
-            identifier_len = strcspn(p, "\r\n");
-            p += identifier_len;
-            if (*p == '\r')
-                p++;
-            if (*p == '\n')
-                p++;
+        res = webvtt_parse_cue(webvtt, &cue, pos);
+        if (res < 0) {
+            if (res != AVERROR_INVALIDDATA)
+                goto end;
+            av_log(s, AV_LOG_DEBUG, "Ignoring non-cue block at 0x%"PRIx64"\n", pos);
         }
-
-        /* cue timestamps */
-        if ((ts_start = read_ts(p)) == AV_NOPTS_VALUE)
-            break;
-        if (!(p = strstr(p, "-->")))
-            break;
-        p += 2;
-        do p++; while (*p == ' ' || *p == '\t');
-        if ((ts_end = read_ts(p)) == AV_NOPTS_VALUE)
-            break;
-
-        /* optional cue settings */
-        p += strcspn(p, "\n\r\t ");
-        while (*p == '\t' || *p == ' ')
-            p++;
-        settings = p;
-        settings_len = strcspn(p, "\r\n");
-        p += settings_len;
-        if (*p == '\r')
-            p++;
-        if (*p == '\n')
-            p++;
-
-        /* create packet */
-        sub = ff_subtitles_queue_insert(&webvtt->q, p, strlen(p), 0);
-        if (!sub) {
-            res = AVERROR(ENOMEM);
-            goto end;
-        }
-        sub->pos = pos;
-        sub->pts = ts_start;
-        sub->duration = ts_end - ts_start;
-
-#define SET_SIDE_DATA(name, type) do {                                  \
-    if (name##_len) {                                                   \
-        uint8_t *buf = av_packet_new_side_data(sub, type, name##_len);  \
-        if (!buf) {                                                     \
-            res = AVERROR(ENOMEM);                                      \
-            goto end;                                                   \
-        }                                                               \
-        memcpy(buf, name, name##_len);                                  \
-    }                                                                   \
-} while (0)
-
-        SET_SIDE_DATA(identifier, AV_PKT_DATA_WEBVTT_IDENTIFIER);
-        SET_SIDE_DATA(settings, AV_PKT_DATA_WEBVTT_SETTINGS);
     }
 
     ff_subtitles_queue_finalize(s, &webvtt->q);
-- 
2.34.1
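
For readers following the header check described in the commit message, here is a minimal standalone sketch in C, written outside of libavformat. The helper name has_webvtt_signature is hypothetical and is not part of the patch or of FFmpeg. It only illustrates accepting the "WEBVTT" marker with or without a leading UTF-8 byte order mark, and rejecting everything else.

```c
#include <string.h>

/* Hypothetical helper for illustration only (not an FFmpeg function).
 * Returns 1 if buf starts with "WEBVTT", optionally preceded by a UTF-8
 * byte order mark, and 0 otherwise. buf must be NUL-terminated. */
static int has_webvtt_signature(const char *buf)
{
    /* Skip the optional UTF-8 BOM (EF BB BF). strncmp stops at the
     * terminating NUL, so short inputs are safe. */
    if (!strncmp(buf, "\xEF\xBB\xBF", 3))
        buf += 3;

    /* strncmp returns 0 on a match, so negate it to get a boolean. */
    return !strncmp(buf, "WEBVTT", 6);
}
```

A caller would reject the file when has_webvtt_signature() returns 0. Expressed with two separate prefix comparisons instead, the rejection condition is that both strncmp calls return non-zero, meaning neither prefix matched.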