From mboxrd@z Thu Jan 1 00:00:00 1970
To: ffmpeg-devel@ffmpeg.org
Date: Sun, 11 Jan 2026 23:37:14 +0800
X-OQ-MSGID: <20260111153714.221761-1-hezuoqiang@foxmail.com>
X-Mailer: git-send-email 2.47.3
MIME-Version: 1.0
X-Mailman-Approved-At: Tue, 13 Jan 2026 22:58:49 +0000
X-Mailman-Version: 3.3.10
Precedence: list
Reply-To: FFmpeg development discussions and patches
Subject: [FFmpeg-devel] [PATCH] libavformat/nal: add ARM NEON optimization for ff_nal_find_startcode
List-Id: FFmpeg development discussions and
 patches
From: hezuoqiang--- via ffmpeg-devel
Cc: Zuoqiang He
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit

From: Zuoqiang He

This adds an ARM NEON optimized implementation of the NAL startcode
search function. Performance testing shows approximately 3.7-4x
speedup on ARMv8-A platforms with NEON support.

The optimization uses 64-byte NEON vector blocks to quickly scan for
the 00 00 01 startcode pattern, falling back to the existing C code
for smaller buffers or when NEON is not available.

Performance improvement on ARMv8-A (Cortex-A76): ~3.7-4x faster
Tested with FATE suite and custom H.264 streams.

Signed-off-by: Zuoqiang He
---
 libavformat/aarch64/Makefile   |   2 +
 libavformat/aarch64/nal.S      | 170 +++++++++++++++++++++++++++++++++
 libavformat/aarch64/nal_init.c |  42 ++++++++
 libavformat/nal.c              |  19 +++-
 4 files changed, 231 insertions(+), 2 deletions(-)
 create mode 100644 libavformat/aarch64/Makefile
 create mode 100644 libavformat/aarch64/nal.S
 create mode 100644 libavformat/aarch64/nal_init.c

diff --git a/libavformat/aarch64/Makefile b/libavformat/aarch64/Makefile
new file mode 100644
index 0000000000..f1dc99de09
--- /dev/null
+++ b/libavformat/aarch64/Makefile
@@ -0,0 +1,2 @@
+OBJS += aarch64/nal_init.o
+NEON-OBJS += aarch64/nal.o
diff --git a/libavformat/aarch64/nal.S b/libavformat/aarch64/nal.S
new file mode 100644
index 0000000000..2558894743
--- /dev/null
+++ b/libavformat/aarch64/nal.S
@@ -0,0 +1,170 @@
+/*
+ * ARM NEON-optimized NAL startcode search
+ * Copyright (c) 2024
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "libavutil/aarch64/asm.S"
+
+        .arch armv8-a
+        .text
+
+function ff_nal_find_startcode_neon, export=1
+        and             x2, x0, #-4                 // align to 4-byte boundary
+        sub             x7, x1, #3                  // end -= 3
+        add             x2, x2, #4                  // align4 = aligned_p + 4
+        mov             x3, x0                      // p = orig_p
+        cmp             x0, x2
+        ccmp            x7, x0, #0, cc
+        bls             2f                          // skip alignment phase
+
+        // Phase 1: align to 4-byte boundary
+1:      ldrb            w0, [x3]
+        cbnz            w0, 3f
+        ldrb            w0, [x3, #1]
+        cbnz            w0, 3f
+        ldrb            w0, [x3, #2]
+        cmp             w0, #1
+        beq             9f                          // found 00 00 01
+3:      add             x3, x3, #1
+        cmp             x2, x3
+        ccmp            x7, x3, #0, hi
+        bhi             1b
+
+2:      sub             x0, x7, x3                  // remaining = end - p
+        cmp             x0, #63
+        bgt             4f                          // enter NEON phase if >= 64 bytes
+
+        // Phase 3: byte-by-byte check for remaining data
+5:      cmp             x7, x3
+        bls             8f
+6:      ldrb            w0, [x3]
+        cbnz            w0, 7f
+        ldrb            w0, [x3, #1]
+        cbnz            w0, 7f
+        ldrb            w0, [x3, #2]
+        cmp             w0, #1
+        beq             9f
+7:      add             x3, x3, #1
+        cmp             x7, x3
+        bne             6b
+8:      add             x0, x1, #3                  // return orig_end + 3
+        ret
+
+        // Phase 2: NEON acceleration (64-byte blocks)
+4:      sub             x8, x1, #66                 // end64 = end - 66
+        cmp             x8, x3
+        bls             5b
+        mov             w6, #65279                  // 0xFEFF
+        add             x5, x3, #64                 // chunk_end = p + 64
+        movk            w6, #0xfefe, lsl #16        // 0xFEFEFEFF
+        b               1f
+
+10:     add             x3, x3, #64                 // p += 64
+        add             x5, x5, #64                 // chunk_end += 64
+        cmp             x8, x3
+        bls             5b
+
+1:      // Load 64 bytes (4x16-byte vectors)
+        ldp             q31, q30, [x3]              // load first 32 bytes
+        ldp             q29, q28, [x3, #32]         // load next 32 bytes
+        prfm            PLDL1KEEP, [x3, #192]       // prefetch
+
+        // Check for zero bytes (data == 0)
+        cmeq            v31.16b, v31.16b, #0        // z0
+        cmeq            v30.16b, v30.16b, #0        // z1
+        cmeq            v29.16b, v29.16b, #0        // z2
+        cmeq            v28.16b, v28.16b, #0        // z3
+
+        // Check for 00 pattern (current byte is 0 AND next byte is 0)
+        ext             v24.16b, v31.16b, v31.16b, #1   // zs0
+        ext             v27.16b, v30.16b, v30.16b, #1   // zs1
+        ext             v26.16b, v29.16b, v29.16b, #1   // zs2
+        ext             v25.16b, v28.16b, v28.16b, #1   // zs3
+
+        // pattern00 = zero & zero_shift
+        and             v24.16b, v24.16b, v31.16b   // p0
+        and             v27.16b, v27.16b, v30.16b   // p1
+        and             v26.16b, v26.16b, v29.16b   // p2
+        and             v25.16b, v25.16b, v28.16b   // p3
+
+        // Check if any 00 pattern exists (fast ORR test)
+        orr             v27.16b, v24.16b, v27.16b
+        orr             v25.16b, v26.16b, v25.16b
+        orr             v25.16b, v25.16b, v27.16b
+        dup             d31, v25.d[1]
+        orr             v31.8b, v31.8b, v25.8b
+        fmov            x0, d31
+        cbz             x0, 10b                     // no 00 pattern, skip
+
+        // Detailed check of this 64-byte chunk
+        mov             x0, x3
+2:      ldr             w2, [x0]
+        add             w4, w2, w6                  // x - 0x01010101
+        bic             w2, w4, w2                  // (~x) & (x - 0x01010101)
+        tst             w2, #-2139062144            // & 0x80808080
+        beq             3f
+
+        ldrb            w2, [x0, #1]
+        cbnz            w2, 4f
+        ldrb            w4, [x0]
+        ldrb            w2, [x0, #2]
+        cbnz            w4, 5f
+        cmp             w2, #1
+        beq             9f                          // found 00 00 01
+5:      ldrb            w4, [x0, #3]
+        cbnz            w2, 6f
+        cmp             w4, #1
+        beq             11f                         // found 00 00 01 (offset +1)
+        cbnz            w4, 3f
+7:      ldrb            w2, [x0, #4]
+        cmp             w2, #1
+        beq             12f                         // found 00 00 01 (offset +2)
+8:      cbnz            w2, 3f
+        ldrb            w2, [x0, #5]
+        cmp             w2, #1
+        beq             13f                         // found 00 00 01 (offset +3)
+
+3:      add             x0, x0, #4
+        cmp             x0, x5
+        bne             2b
+        b               10b
+
+4:      ldrb            w2, [x0, #3]
+        cbnz            w2, 3b
+        ldrb            w2, [x0, #2]
+        cbz             w2, 7b
+        ldrb            w2, [x0, #4]
+        b               8b
+
+6:      cbnz            w4, 3b
+        ldrb            w2, [x0, #4]
+        b               8b
+
+9:      mov             x0, x3
+        ret
+
+11:     add             x0, x0, #1
+        ret
+
+12:     add             x0, x0, #2
+        ret
+
+13:     add             x0, x0, #3
+        ret
+endfunc
diff --git a/libavformat/aarch64/nal_init.c b/libavformat/aarch64/nal_init.c
new file mode 100644
index 0000000000..90160b882c
--- /dev/null
+++ b/libavformat/aarch64/nal_init.c
@@ -0,0 +1,42 @@
+/*
+ * ARM NEON-optimized NAL functions
+ * Copyright (c) 2024
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include
+
+#include "config.h"
+#include "libavutil/attributes.h"
+#include "libavutil/arm/cpu.h"
+#include "libavutil/cpu.h"
+
+const uint8_t *ff_nal_find_startcode_neon(const uint8_t *p, const uint8_t *end);
+
+/* External function pointer from nal.c */
+extern const uint8_t *(*ff_nal_find_startcode_internal)(const uint8_t *p, const uint8_t *end);
+
+void ff_nal_init_arm(void);
+
+void ff_nal_init_arm(void)
+{
+    int cpu_flags = av_get_cpu_flags();
+
+    if (have_neon(cpu_flags))
+        ff_nal_find_startcode_internal = ff_nal_find_startcode_neon;
+}
diff --git a/libavformat/nal.c b/libavformat/nal.c
index 26dc5fe688..2e293c0225 100644
--- a/libavformat/nal.c
+++ b/libavformat/nal.c
@@ -21,14 +21,20 @@

 #include
 #include

+#include "libavutil/attributes.h"
 #include "libavutil/mem.h"
 #include "libavutil/error.h"
 #include "libavcodec/defs.h"
 #include "avio.h"
 #include "avio_internal.h"
+#include "config.h"
 #include "nal.h"

-static const uint8_t *nal_find_startcode_internal(const uint8_t *p, const uint8_t *end)
+/* Pointer to the active implementation */
+const uint8_t *(*ff_nal_find_startcode_internal)(const uint8_t *p, const uint8_t *end);
+
+/* C implementation */
+static const uint8_t *ff_nal_find_startcode_c(const uint8_t *p, const uint8_t *end)
 {
     const uint8_t *a = p + 4 - ((intptr_t)p & 3);

@@ -66,7 +72,16 @@ static const uint8_t *nal_find_startcode_internal(const uint8_t *p, const uint8_
 }

 const uint8_t *ff_nal_find_startcode(const uint8_t *p, const uint8_t *end){
-    const uint8_t *out = nal_find_startcode_internal(p, end);
+    static int initialized = 0;
+    if (!initialized) {
+        ff_nal_find_startcode_internal = ff_nal_find_startcode_c;
+#if ARCH_AARCH64
+        extern void ff_nal_init_arm(void);
+        ff_nal_init_arm();
+#endif
+        initialized = 1;
+    }
+    const uint8_t *out = ff_nal_find_startcode_internal(p, end);
     if(p