From mboxrd@z Thu Jan 1 00:00:00 1970
To: ffmpeg-devel@ffmpeg.org
Date: Mon, 12 Jan 2026 00:32:22 +0800
X-OQ-MSGID: <20260111163222.227943-1-hezuoqiang@foxmail.com>
X-Mailer: git-send-email 2.47.3
MIME-Version: 1.0
X-Mailman-Approved-At: Tue, 13 Jan 2026 22:58:50 +0000
X-Mailman-Version: 3.3.10
Precedence: list
Reply-To: FFmpeg development discussions and patches
Subject: [FFmpeg-devel] [PATCH] libavformat/nal: add ARM NEON optimization for ff_nal_find_startcode
List-Id: FFmpeg development discussions and patches
From: hezuoqiang--- via ffmpeg-devel
Cc: Zuoqiang He
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit

From: Zuoqiang He

This adds an ARM NEON optimized implementation of the NAL startcode
search function. Performance testing shows approximately 3.7-4x
speedup on ARMv8-A platforms with NEON support.

The optimization uses 64-byte NEON vector blocks to quickly scan for
the 00 00 01 startcode pattern, falling back to the existing C code
for smaller buffers or when NEON is not available.

Performance improvement on ARMv8-A (Cortex-A76): ~3.7-4x faster

Tested with FATE suite and custom H.264 streams.

Signed-off-by: Zuoqiang He
---
 libavformat/aarch64/Makefile   |   2 +
 libavformat/aarch64/nal.S      | 172 +++++++++++++++++++++++++++++++++
 libavformat/aarch64/nal_init.c |  42 ++++++++
 libavformat/nal.c              |  19 +++-
 4 files changed, 233 insertions(+), 2 deletions(-)
 create mode 100644 libavformat/aarch64/Makefile
 create mode 100644 libavformat/aarch64/nal.S
 create mode 100644 libavformat/aarch64/nal_init.c

diff --git a/libavformat/aarch64/Makefile b/libavformat/aarch64/Makefile
new file mode 100644
index 0000000000..f1dc99de09
--- /dev/null
+++ b/libavformat/aarch64/Makefile
@@ -0,0 +1,2 @@
+OBJS += aarch64/nal_init.o
+NEON-OBJS += aarch64/nal.o
diff --git a/libavformat/aarch64/nal.S b/libavformat/aarch64/nal.S
new file mode 100644
index 0000000000..6dc1570d39
--- /dev/null
+++ b/libavformat/aarch64/nal.S
@@ -0,0 +1,172 @@
+/*
+ * ARM NEON-optimized NAL startcode search
+ * Copyright (c) 2024
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "libavutil/aarch64/asm.S"
+
+        .arch armv8-a
+        .text
+
+function ff_nal_find_startcode_neon, export=1
+        and             x2, x0, #-4             // align to 4-byte boundary
+        sub             x7, x1, #3              // end -= 3
+        add             x2, x2, #4              // align4 = aligned_p + 4
+        mov             x3, x0                  // p = orig_p
+        cmp             x0, x2
+        ccmp            x7, x0, #0, cc
+        bls             2f                      // skip alignment phase
+
+        // Phase 1: align to 4-byte boundary
+1:      ldrb            w0, [x3]
+        cbnz            w0, 3f
+        ldrb            w0, [x3, #1]
+        cbnz            w0, 3f
+        ldrb            w0, [x3, #2]
+        cmp             w0, #1
+        beq             22f                     // found 00 00 01
+3:      add             x3, x3, #1
+        cmp             x2, x3
+        ccmp            x7, x3, #0, hi
+        bhi             1b
+
+2:      sub             x0, x7, x3              // remaining = end - p
+        cmp             x0, #63
+        bgt             43f                     // enter NEON phase if >= 64 bytes
+
+        // Phase 3: byte-by-byte check for remaining data
+4:      cmp             x7, x3
+        bls             8f
+5:      ldrb            w0, [x3]
+        cbnz            w0, 6f
+        ldrb            w0, [x3, #1]
+        cbnz            w0, 6f
+        ldrb            w0, [x3, #2]
+        cmp             w0, #1
+        beq             22f
+6:      add             x3, x3, #1
+        cmp             x7, x3
+        bne             5b
+8:      add             x0, x1, #3              // return orig_end + 3
+        ret
+
+        // Phase 2: NEON acceleration (64-byte blocks)
+43:     sub             x8, x1, #66             // end64 = end - 66
+        cmp             x8, x3
+        bls             4b
+        mov             w6, #65279              // 0xFEFF
+        add             x5, x3, #64             // chunk_end = p + 64
+        movk            w6, #0xfefe, lsl #16    // 0xFEFEFEFF
+        b               10f
+
+9:      add             x3, x3, #64             // p += 64
+        add             x5, x5, #64             // chunk_end += 64
+        cmp             x8, x3
+        bls             4b
+
+10:     // Load 64 bytes (4x16-byte vectors)
+        ldp             q31, q30, [x3]          // load first 32 bytes
+        ldp             q29, q28, [x3, #32]     // load next 32 bytes
+        prfm            PLDL1KEEP, [x3, #192]   // prefetch
+
+        // Check for zero bytes (data == 0)
+        cmeq            v31.16b, v31.16b, #0    // z0
+        cmeq            v30.16b, v30.16b, #0    // z1
+        cmeq            v29.16b, v29.16b, #0    // z2
+        cmeq            v28.16b, v28.16b, #0    // z3
+
+        // Check for 00 pattern (current byte is 0 AND next byte is 0)
+        ext             v24.16b, v31.16b, v31.16b, #1   // zs0
+        ext             v27.16b, v30.16b, v30.16b, #1   // zs1
+        ext             v26.16b, v29.16b, v29.16b, #1   // zs2
+        ext             v25.16b, v28.16b, v28.16b, #1   // zs3
+
+        // pattern00 = zero & zero_shift
+        and             v24.16b, v24.16b, v31.16b       // p0
+        and             v27.16b, v27.16b, v30.16b       // p1
+        and             v26.16b, v26.16b, v29.16b       // p2
+        and             v25.16b, v25.16b, v28.16b       // p3
+
+        // Check if any 00 pattern exists (fast ORR test)
+        orr             v27.16b, v24.16b, v27.16b
+        orr             v25.16b, v26.16b, v25.16b
+        orr             v25.16b, v25.16b, v27.16b
+        dup             d31, v25.d[1]
+        orr             v31.8b, v31.8b, v25.8b
+        fmov            x0, d31
+        cbz             x0, 9b                  // no 00 pattern, skip to next chunk
+
+        // Detailed check of this 64-byte chunk
+        mov             x0, x3
+11:     ldr             w2, [x0]
+        add             w4, w2, w6              // x - 0x01010101
+        bic             w2, w4, w2              // (~x) & (x - 0x01010101)
+        tst             w2, #-2139062144        // & 0x80808080
+        beq             12f
+
+        ldrb            w2, [x0, #1]
+        cbnz            w2, 13f
+        ldrb            w4, [x0]
+        ldrb            w2, [x0, #2]
+        cbnz            w4, 14f
+        cmp             w2, #1
+        beq             18f                     // found 00 00 01
+14:     ldrb            w4, [x0, #3]
+        cbnz            w2, 15f
+        cmp             w4, #1
+        beq             44f                     // found 00 00 01 (offset +1)
+        cbnz            w4, 12f
+16:     ldrb            w2, [x0, #4]
+        cmp             w2, #1
+        beq             45f                     // found 00 00 01 (offset +2)
+17:     cbnz            w2, 12f
+        ldrb            w2, [x0, #5]
+        cmp             w2, #1
+        beq             46f                     // found 00 00 01 (offset +3)
+
+12:     add             x0, x0, #4
+        cmp             x0, x5
+        bne             11b
+        b               9b
+
+13:     ldrb            w2, [x0, #3]
+        cbnz            w2, 12b
+        ldrb            w2, [x0, #2]
+        cbz             w2, 16b
+        ldrb            w2, [x0, #4]
+        b               17b
+
+15:     cbnz            w4, 12b
+        ldrb            w2, [x0, #4]
+        b               17b
+
+22:     mov             x0, x3
+        ret
+
+45:     add             x0, x0, #2
+        ret
+
+44:     add             x0, x0, #1
+        ret
+
+46:     add             x0, x0, #3
+        ret
+
+18:     ret
+endfunc
diff --git a/libavformat/aarch64/nal_init.c b/libavformat/aarch64/nal_init.c
new file mode 100644
index 0000000000..90160b882c
--- /dev/null
+++ b/libavformat/aarch64/nal_init.c
@@ -0,0 +1,42 @@
+/*
+ * ARM NEON-optimized NAL functions
+ * Copyright (c) 2024
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include
+
+#include "config.h"
+#include "libavutil/attributes.h"
+#include "libavutil/arm/cpu.h"
+#include "libavutil/cpu.h"
+
+const uint8_t *ff_nal_find_startcode_neon(const uint8_t *p, const uint8_t *end);
+
+/* External function pointer from nal.c */
+extern const uint8_t *(*ff_nal_find_startcode_internal)(const uint8_t *p, const uint8_t *end);
+
+void ff_nal_init_arm(void);
+
+void ff_nal_init_arm(void)
+{
+    int cpu_flags = av_get_cpu_flags();
+
+    if (have_neon(cpu_flags))
+        ff_nal_find_startcode_internal = ff_nal_find_startcode_neon;
+}
diff --git a/libavformat/nal.c b/libavformat/nal.c
index 26dc5fe688..2e293c0225 100644
--- a/libavformat/nal.c
+++ b/libavformat/nal.c
@@ -21,14 +21,20 @@
 #include
 #include
 
+#include "libavutil/attributes.h"
 #include "libavutil/mem.h"
 #include "libavutil/error.h"
 #include "libavcodec/defs.h"
 #include "avio.h"
 #include "avio_internal.h"
+#include "config.h"
 #include "nal.h"
 
-static const uint8_t *nal_find_startcode_internal(const uint8_t *p, const uint8_t *end)
+/* Pointer to the active implementation */
+const uint8_t *(*ff_nal_find_startcode_internal)(const uint8_t *p, const uint8_t *end);
+
+/* C implementation */
+static const uint8_t *ff_nal_find_startcode_c(const uint8_t *p, const uint8_t *end)
 {
     const uint8_t *a = p + 4 - ((intptr_t)p & 3);
 
@@ -66,7 +72,16 @@ static const uint8_t *nal_find_startcode_internal(const uint8_t *p, const uint8_
 }
 
 const uint8_t *ff_nal_find_startcode(const uint8_t *p, const uint8_t *end){
-    const uint8_t *out = nal_find_startcode_internal(p, end);
+    static int initialized = 0;
+    if (!initialized) {
+        ff_nal_find_startcode_internal = ff_nal_find_startcode_c;
+#if ARCH_AARCH64
+        extern void ff_nal_init_arm(void);
+        ff_nal_init_arm();
+#endif
+        initialized = 1;
+    }
+    const uint8_t *out = ff_nal_find_startcode_internal(p, end);
     if(p