From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from ffbox0-bg.ffmpeg.org (ffbox0-bg.ffmpeg.org [79.124.17.100]) by master.gitmailbox.com (Postfix) with ESMTPS id 32FBC4E107 for ; Tue, 13 Jan 2026 02:05:17 +0000 (UTC) Authentication-Results: ffbox; dkim=fail (body hash mismatch (got b'aw0z3b1vS0APpUgKxCO3j2lzGWN4HBkJ0bi61J5yBf4=', expected b'kN7i5Q8dBx1qWux7AF0h/QhwpceSfx+ut8ZExkECqKc=')) header.d=foxmail.com header.a=rsa-sha256 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=ffmpeg.org; i=@ffmpeg.org; q=dns/txt; s=mail; t=1768269900; h=message-id : to : date : mime-version : reply-to : subject : list-id : list-archive : list-archive : list-help : list-owner : list-post : list-subscribe : list-unsubscribe : from : cc : content-type : content-transfer-encoding : from; bh=IZn6msO7AVxbTmD/Z0uV41uoUHd978Geo1jnC4Z0t+Y=; b=3JmIQtL7Chth7PK/0dR8fgrGVcIpkE+63rRCfx5SubjKRexmpFVmnhvi9LVomn/CknkfV XEXySltsBOwqFlQF20PQtreb+C1XOcbhalj18pC1MDn/R6MC5ph2cLtiRAcjyL5FXHKWvbG FwtAfEeei9ecqpTKfrxcqeKR4Y8SJraRRpYbm/TeM0m57709L/3oVzk69WV+rProEF3pML8 0XcvVZhLa6SSUpzfve03tywehcDTHXW6LucBxKRBntHIfRoK6W5G6Sxn0GJ8MXfTdpwbL7C kuR+FGEqDXhUuh2SewspBe5G9Yk+GFlsuk8s7zFCEnsxdH3TMYqBQKO+v10Q== Received: from [172.20.0.4] (unknown [172.20.0.4]) by ffbox0-bg.ffmpeg.org (Postfix) with ESMTP id 45FF2690E13; Tue, 13 Jan 2026 04:05:00 +0200 (EET) ARC-Seal: i=1; cv=none; a=rsa-sha256; d=ffmpeg.org; s=arc; t=1768269880; b=aAqrtqHcqweM0oudsgw6karEzNclyMRf4xDYpx93w6uAQuAuLq4aT/g92RmhX7XME17zt ty6O9WwupKQ93ReK9jyWjxzqtJMx2+lkipmDh17wQQ7c+lJN6kD50QYMYIXCaE7fCzWS2wo W+Xm3aPOZDasLbqsx82UWXwlRf0SwCmgwDOkQEKrwGvkuQdWa0WQQG86LTgCCpzW/nZ0azB Q0/NyPC99UmwU9UDO7XPdG/nAe2z7jFobtl4iW9FKtGJ8/4yEMnwROQHzdtMtNO2rTNTUU5 oYRZ5iiJ4WXQR4N4Zni2Iy07bKkHcdwCMTlPAZCR9/1yBKnXZJtoGOF45e4w== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=ffmpeg.org; s=arc; t=1768269880; h=from : sender : reply-to : subject : date : message-id : to : cc : mime-version : content-type : content-transfer-encoding : content-id : content-description : resent-date : resent-from : resent-sender : resent-to : resent-cc : resent-message-id : in-reply-to : references : list-id : list-help : list-unsubscribe : list-subscribe : list-post : list-owner : list-archive; bh=aw0z3b1vS0APpUgKxCO3j2lzGWN4HBkJ0bi61J5yBf4=; b=Bm1G+EKoYNdgkr6nnUmVwTIBnT0Wqgz+9/1ECYX/2KM6PAYszRvju6yf8R6bvd8e/ISRi GAaOZ1mXwtOA4B0sD7hzsAyKKooWjXC+EL/zYgqMB/c6p/fzz6Ul5fyc418L2AfxMYfzGF6 CcXzL9BvK7WIGcEPtfJye1XZ8L+NGfWznyjiGyjAQRkprUOiAKcBBWE6o5TtQ6qrZP6bXG5 Fb5PHI2JCDcrod/8itegxkC3xoxiEHLbUUFES+RvJe5DtFPg2ajVVA9NbRKbzSLmEMksmje 26pUoWsKkPTSHGsWXszeLyJA0XL4piiuGXUxBtbh5BxlyTZSbPj6aZia84ow== ARC-Authentication-Results: i=1; ffmpeg.org; dkim=pass header.d=foxmail.com; arc=none; dmarc=pass header.from=foxmail.com policy.dmarc=none Authentication-Results: ffmpeg.org; dkim=pass header.d=foxmail.com; arc=none (Message is not ARC signed); dmarc=pass (Used From Domain Record) header.from=foxmail.com policy.dmarc=none Received: from out203-205-221-149.mail.qq.com (out203-205-221-149.mail.qq.com [203.205.221.149]) by ffbox0-bg.ffmpeg.org (Postfix) with UTF8SMTPS id 31A5368F4C2 for ; Tue, 13 Jan 2026 04:03:53 +0200 (EET) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=foxmail.com; s=s201512; t=1768269831; bh=kN7i5Q8dBx1qWux7AF0h/QhwpceSfx+ut8ZExkECqKc=; h=From:To:Cc:Subject:Date; b=z4WhWC1T/6XJXWwwpgdB+XLNIsg7/dFCVeyKkZg/wwOB2w4uj8kaqbOfuYo5Y+ClN mrcXLKaM3zgBohoEtJk2YLrtNwTXBhPjt9N/mc2xS+/PMhnGqXdoSJa3GnRuNiSGoN dWX9r2FPRaha3ByViyJko6mYyFpXnZrm1K+CB604= Received: from localhost.localdomain ([120.227.18.65]) by newxmesmtplogicsvrszb51-1.qq.com (NewEsmtp) with SMTP id EAB9221; Tue, 13 Jan 2026 10:03:42 +0800 X-QQ-mid: xmsmtpt1768269822tfv1rsl9g Message-ID: X-QQ-XMAILINFO: Mna2+20dcVhjimF7+nOEOcwG56b7T+PAHZonIxLa7A3Ktxed6eGp23ZzRcaLMm CkopDApPGLNVbODgxoVFVHg7Tl1907SrIMB2a+cQFRNW41DuoNdNC4UjgZ4oz48UrqnyywPB3dKZ fBRtROoIA6UUtkOi60l6JgNO5yQhjkjLXg0YvVOesyiMPh0X8//S85SLwhBb8KUQDFjkWFwbhoHW dPzP8HA+KEI1CbnRB7HgSSg8T/mSqVerMRtOHwHUotOzbXgDCG/JfPOT9fYdTD7+mxrimNb2w9vi Z/aYAfR4g7F10C1uMtJp+wjyvGhg5n/jxWsGVQj+/bSThMDgMdjiyrVAx4IrD87NjYn/N1QsKUM2 xW6FohpbkfZ0UT9RxHTyVyaWd+gPWx6p8WyLSLQA7JVHEYYx2cA5gYC3Rd/PxArYs05vJRSGEZwq Iztvy46STqH+TwH6AewkWHXIwKbtT1pJRo88Dd8Y9eqhd+c3CFXEKQRK2RYYkyApjx7ROzegHLkN vXaqgvd/UfvGQX2p5qSi44oGzwKFwA00DDzcyCkoduqUlTZ/H5rO7X/Ukg2nnZxC5JFKBwoMiVYq iEY3lNPrHgesFRzCEvIQ2nijf0iBVvO4urbPD8GHV8qc0NV5dxreQ1xXGQvSiKoIbGRnv6Yah/JJ 3EeAOzE7Ty4DDh8qV9GdmpHxAwD1lh1BbEkL8cF2B4hMZ0IdBZ19ykdbHYpGlSauIRercRcnTk8K +cvx9nnmdEd3cjn3sOSvWp2B3nIdOtwh294AQKvSKY8SoRk+tJR5i3nVskCoZiJwfjoNBOdAPskl m5wfEIiW8uCTwQ/3k9aEPjcgGVXc7/VTa+7j95TASF6bsH2vUWAjCBLUJ8cqBbTTHhsg9egfmyCq C0fsSoLflk1UxrVZbpS/44dczDkJSjsUZodm3clXz4mzuaOgvE6yHYAMoZ/nWZxxWimltgLStmDq prARfDOFPlEoZ8e70u7trKgnLV2YEBlHt2IiZ2I0aKXWHJAi2NhhEMucDYbHqvzoTkAw8W9lv8I9 XY36N41wS1jTwcAiHI1pE1vhtF5Up5vy0UlpNIBGW4rcvQofek6ClcLDf5wZctr/uYkhVAPcNnWU 4RHcn5 X-QQ-XMRINFO: NS+P29fieYNwqS3WCnRCOn9D1NpZuCnCRA== To: ffmpeg-devel@ffmpeg.org Date: Tue, 13 Jan 2026 10:03:40 +0800 X-OQ-MSGID: <20260113020340.232631-1-hezuoqiang@foxmail.com> X-Mailer: git-send-email 2.47.3 MIME-Version: 1.0 Message-ID-Hash: BFIRMIATZKE6FNOXP2F4ERJFCJMJUU4S X-Message-ID-Hash: BFIRMIATZKE6FNOXP2F4ERJFCJMJUU4S X-MailFrom: SRS0=kute=7S=foxmail.com=hezuoqiang@ffmpeg.org X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; loop; banned-address; header-match-ffmpeg-devel.ffmpeg.org-0; header-match-ffmpeg-devel.ffmpeg.org-1; header-match-ffmpeg-devel.ffmpeg.org-2; header-match-ffmpeg-devel.ffmpeg.org-3; emergency; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header X-Mailman-Version: 3.3.10 Precedence: list Reply-To: FFmpeg development discussions and patches Subject: [FFmpeg-devel] [PATCH] libavformat/nal: add ARM NEON optimization for ff_nal_find_startcode List-Id: FFmpeg development discussions and patches Archived-At: Archived-At: List-Archive: List-Archive: List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: From: hezuoqiang--- via ffmpeg-devel Cc: Zuoqiang He Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Archived-At: List-Archive: List-Post: From: Zuoqiang He This adds an ARM NEON optimized implementation of the NAL startcode search function. Performance testing shows approximately 3.7-4x speedup on ARMv8-A platforms with NEON support. The optimization uses 64-byte NEON vector blocks to quickly scan for the 00 00 01 startcode pattern, falling back to the existing C code for smaller buffers or when NEON is not available. Performance improvement on ARMv8-A (Cortex-A76): ~3.7-4x faster Tested with FATE suite and custom H.264 streams. Signed-off-by: Zuoqiang He --- libavformat/aarch64/Makefile | 2 + libavformat/aarch64/nal.S | 172 +++++++++++++++++++++++++++++++++ libavformat/aarch64/nal_init.c | 42 ++++++++ libavformat/nal.c | 19 +++- 4 files changed, 233 insertions(+), 2 deletions(-) create mode 100644 libavformat/aarch64/Makefile create mode 100644 libavformat/aarch64/nal.S create mode 100644 libavformat/aarch64/nal_init.c diff --git a/libavformat/aarch64/Makefile b/libavformat/aarch64/Makefile new file mode 100644 index 0000000000..f1dc99de09 --- /dev/null +++ b/libavformat/aarch64/Makefile @@ -0,0 +1,2 @@ +OBJS += aarch64/nal_init.o +NEON-OBJS += aarch64/nal.o diff --git a/libavformat/aarch64/nal.S b/libavformat/aarch64/nal.S new file mode 100644 index 0000000000..6dc1570d39 --- /dev/null +++ b/libavformat/aarch64/nal.S @@ -0,0 +1,172 @@ +/* + * ARM NEON-optimized NAL startcode search + * Copyright (c) 2024 + * + * This file is part of FFmpeg. + * + * FFmpeg is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public + * License as published by the Free Software Foundation; either + * version 2.1 of the License, or (at your option) any later version. + * + * FFmpeg is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with FFmpeg; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + */ + +#include "libavutil/aarch64/asm.S" + + .arch armv8-a + .text + +function ff_nal_find_startcode_neon, export=1 + and x2, x0, #-4 // align to 4-byte boundary + sub x7, x1, #3 // end -= 3 + add x2, x2, #4 // align4 = aligned_p + 4 + mov x3, x0 // p = orig_p + cmp x0, x2 + ccmp x7, x0, #0, cc + bls 2f // skip alignment phase + + // Phase 1: align to 4-byte boundary +1: ldrb w0, [x3] + cbnz w0, 3f + ldrb w0, [x3, #1] + cbnz w0, 3f + ldrb w0, [x3, #2] + cmp w0, #1 + beq 22f // found 00 00 01 +3: add x3, x3, #1 + cmp x2, x3 + ccmp x7, x3, #0, hi + bhi 1b + +2: sub x0, x7, x3 // remaining = end - p + cmp x0, #63 + bgt 43f // enter NEON phase if >= 64 bytes + + // Phase 3: byte-by-byte check for remaining data +4: cmp x7, x3 + bls 8f +5: ldrb w0, [x3] + cbnz w0, 6f + ldrb w0, [x3, #1] + cbnz w0, 6f + ldrb w0, [x3, #2] + cmp w0, #1 + beq 22f +6: add x3, x3, #1 + cmp x7, x3 + bne 5b +8: add x0, x1, #3 // return orig_end + 3 + ret + + // Phase 2: NEON acceleration (64-byte blocks) +43: sub x8, x1, #66 // end64 = end - 66 + cmp x8, x3 + bls 4b + mov w6, #65279 // 0xFEFF + add x5, x3, #64 // chunk_end = p + 64 + movk w6, #0xfefe, lsl #16 // 0xFEFEFEFF + b 10f + +9: add x3, x3, #64 // p += 64 + add x5, x5, #64 // chunk_end += 64 + cmp x8, x3 + bls 4b + +10: // Load 64 bytes (4x16-byte vectors) + ldp q31, q30, [x3] // load first 32 bytes + ldp q29, q28, [x3, #32] // load next 32 bytes + prfm PLDL1KEEP, [x3, #192] // prefetch + + // Check for zero bytes (data == 0) + cmeq v31.16b, v31.16b, #0 // z0 + cmeq v30.16b, v30.16b, #0 // z1 + cmeq v29.16b, v29.16b, #0 // z2 + cmeq v28.16b, v28.16b, #0 // z3 + + // Check for 00 pattern (current byte is 0 AND next byte is 0) + ext v24.16b, v31.16b, v31.16b, #1 // zs0 + ext v27.16b, v30.16b, v30.16b, #1 // zs1 + ext v26.16b, v29.16b, v29.16b, #1 // zs2 + ext v25.16b, v28.16b, v28.16b, #1 // zs3 + + // pattern00 = zero & zero_shift + and v24.16b, v24.16b, v31.16b // p0 + and v27.16b, v27.16b, v30.16b // p1 + and v26.16b, v26.16b, v29.16b // p2 + and v25.16b, v25.16b, v28.16b // p3 + + // Check if any 00 pattern exists (fast ORR test) + orr v27.16b, v24.16b, v27.16b + orr v25.16b, v26.16b, v25.16b + orr v25.16b, v25.16b, v27.16b + dup d31, v25.d[1] + orr v31.8b, v31.8b, v25.8b + fmov x0, d31 + cbz x0, 9b // no 00 pattern, skip to next chunk + + // Detailed check of this 64-byte chunk + mov x0, x3 +11: ldr w2, [x0] + add w4, w2, w6 // x - 0x01010101 + bic w2, w4, w2 // (~x) & (x - 0x01010101) + tst w2, #-2139062144 // & 0x80808080 + beq 12f + + ldrb w2, [x0, #1] + cbnz w2, 13f + ldrb w4, [x0] + ldrb w2, [x0, #2] + cbnz w4, 14f + cmp w2, #1 + beq 18f // found 00 00 01 +14: ldrb w4, [x0, #3] + cbnz w2, 15f + cmp w4, #1 + beq 44f // found 00 00 01 (offset +1) + cbnz w4, 12f +16: ldrb w2, [x0, #4] + cmp w2, #1 + beq 45f // found 00 00 01 (offset +2) +17: cbnz w2, 12f + ldrb w2, [x0, #5] + cmp w2, #1 + beq 46f // found 00 00 01 (offset +3) + +12: add x0, x0, #4 + cmp x0, x5 + bne 11b + b 9b + +13: ldrb w2, [x0, #3] + cbnz w2, 12b + ldrb w2, [x0, #2] + cbz w2, 16b + ldrb w2, [x0, #4] + b 17b + +15: cbnz w4, 12b + ldrb w2, [x0, #4] + b 17b + +22: mov x0, x3 + ret + +45: add x0, x0, #2 + ret + +44: add x0, x0, #1 + ret + +46: add x0, x0, #3 + ret + +18: ret +endfunc diff --git a/libavformat/aarch64/nal_init.c b/libavformat/aarch64/nal_init.c new file mode 100644 index 0000000000..90160b882c --- /dev/null +++ b/libavformat/aarch64/nal_init.c @@ -0,0 +1,42 @@ +/* + * ARM NEON-optimized NAL functions + * Copyright (c) 2024 + * + * This file is part of FFmpeg. + * + * FFmpeg is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public + * License as published by the Free Software Foundation; either + * version 2.1 of the License, or (at your option) any later version. + * + * FFmpeg is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with FFmpeg; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + */ + +#include + +#include "config.h" +#include "libavutil/attributes.h" +#include "libavutil/arm/cpu.h" +#include "libavutil/cpu.h" + +const uint8_t *ff_nal_find_startcode_neon(const uint8_t *p, const uint8_t *end); + +/* External function pointer from nal.c */ +extern const uint8_t *(*ff_nal_find_startcode_internal)(const uint8_t *p, const uint8_t *end); + +void ff_nal_init_arm(void); + +void ff_nal_init_arm(void) +{ + int cpu_flags = av_get_cpu_flags(); + + if (have_neon(cpu_flags)) + ff_nal_find_startcode_internal = ff_nal_find_startcode_neon; +} diff --git a/libavformat/nal.c b/libavformat/nal.c index 26dc5fe688..2e293c0225 100644 --- a/libavformat/nal.c +++ b/libavformat/nal.c @@ -21,14 +21,20 @@ #include #include +#include "libavutil/attributes.h" #include "libavutil/mem.h" #include "libavutil/error.h" #include "libavcodec/defs.h" #include "avio.h" #include "avio_internal.h" +#include "config.h" #include "nal.h" -static const uint8_t *nal_find_startcode_internal(const uint8_t *p, const uint8_t *end) +/* Pointer to the active implementation */ +const uint8_t *(*ff_nal_find_startcode_internal)(const uint8_t *p, const uint8_t *end); + +/* C implementation */ +static const uint8_t *ff_nal_find_startcode_c(const uint8_t *p, const uint8_t *end) { const uint8_t *a = p + 4 - ((intptr_t)p & 3); @@ -66,7 +72,16 @@ static const uint8_t *nal_find_startcode_internal(const uint8_t *p, const uint8_ } const uint8_t *ff_nal_find_startcode(const uint8_t *p, const uint8_t *end){ - const uint8_t *out = nal_find_startcode_internal(p, end); + static int initialized = 0; + if (!initialized) { + ff_nal_find_startcode_internal = ff_nal_find_startcode_c; +#if ARCH_AARCH64 + extern void ff_nal_init_arm(void); + ff_nal_init_arm(); +#endif + initialized = 1; + } + const uint8_t *out = ff_nal_find_startcode_internal(p, end); if(p