[FFmpeg-devel] [PR] libavutil/nal: optimize NAL startcode search with ARM NEON (PR #21475)

Git Inbox Mirror of the ffmpeg-devel mailing list - see https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
 help / color / mirror / Atom feed

* [FFmpeg-devel] [PR] libavutil/nal: optimize NAL startcode search with ARM NEON (PR #21475)
@ 2026-01-15 17:19 hezuoqiang via ffmpeg-devel
  0 siblings, 0 replies; only message in thread
From: hezuoqiang via ffmpeg-devel @ 2026-01-15 17:19 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: hezuoqiang

PR #21475 opened by hezuoqiang
URL: https://code.ffmpeg.org/FFmpeg/FFmpeg/pulls/21475
Patch URL: https://code.ffmpeg.org/FFmpeg/FFmpeg/pulls/21475.patch

This optimization targets ff_nal_find_startcode in libavutil/nal, which
directly returns pointers to complete startcode sequences (00 00 01) used
by the H.264 demuxer, distinct from ff_startcode_find_candidate which only
locates zero bytes and requires upper-layer validation.

Technical Approach:
- Uses NEON SIMD to detect "00" byte-pair patterns instead of single zeros
- Processes data in 64-byte vector blocks for maximum throughput
- Falls back to existing C implementation for small buffers or when NEON is unavailable

Performance Analysis (22.88 MB 1080p H.264 video):
- Single zero bytes in stream: 95,673 (98.1% false positive rate)
- Valid NALU startcodes: 1,224
- Using "00" pattern reduces false positives from 98.1% to 22.8%
- Only 22.8% of 64-byte blocks require detailed checking
- 77.2% of blocks can be skipped entirely after NEON fast-path

Benchmark Results (1000 iterations, Raspberry Pi 5 - Cortex-A76):
- Baseline (ff_startcode_find_candidate + C validation): 5,454,680 μs
- NEON optimized (ff_nal_find_startcode_neon): 1,741,280 μs
- Speedup: 3.13x


From c0ba98d30a1428ef12ef7dd9a4935ad6d3195703 Mon Sep 17 00:00:00 2001
From: Zuoqiang He <hezuoqiang@foxmail.com>
Date: Fri, 16 Jan 2026 00:31:58 +0800
Subject: [PATCH] libavutil/nal: optimize NAL startcode search with ARM NEON
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

This optimization targets ff_nal_find_startcode in libavutil/nal, which
directly returns pointers to complete startcode sequences (00 00 01) used
by the H.264 demuxer, distinct from ff_startcode_find_candidate which only
locates zero bytes and requires upper-layer validation.

Technical Approach:
- Uses NEON SIMD to detect "00" byte-pair patterns instead of single zeros
- Processes data in 64-byte vector blocks for maximum throughput
- Falls back to existing C implementation for small buffers or when NEON is unavailable

Performance Analysis (22.88 MB 1080p H.264 video):
- Single zero bytes in stream: 95,673 (98.1% false positive rate)
- Valid NALU startcodes: 1,224
- Using "00" pattern reduces false positives from 98.1% to 22.8%
- Only 22.8% of 64-byte blocks require detailed checking
- 77.2% of blocks can be skipped entirely after NEON fast-path

Benchmark Results (1000 iterations, Raspberry Pi 5 - Cortex-A76):
- Baseline (ff_startcode_find_candidate + C validation): 5,454,680 μs
- NEON optimized (ff_nal_find_startcode_neon): 1,741,280 μs
- Speedup: 3.13x

Signed-off-by: Zuoqiang He <hezuoqiang@foxmail.com>
---
 libavformat/nal.c          |  45 +--------
 libavutil/Makefile         |   1 +
 libavutil/aarch64/Makefile |   1 +
 libavutil/aarch64/nal.S    | 169 +++++++++++++++++++++++++++++++
 libavutil/nal.c            | 100 +++++++++++++++++++
 libavutil/nal.h            |  60 +++++++++++
 tests/checkasm/Makefile    |   1 +
 tests/checkasm/checkasm.c  |   1 +
 tests/checkasm/checkasm.h  |   1 +
 tests/checkasm/nal.c       | 197 +++++++++++++++++++++++++++++++++++++
 tests/fate/checkasm.mak    |   1 +
 11 files changed, 536 insertions(+), 41 deletions(-)
 create mode 100644 libavutil/aarch64/nal.S
 create mode 100644 libavutil/nal.c
 create mode 100644 libavutil/nal.h
 create mode 100644 tests/checkasm/nal.c

diff --git a/libavformat/nal.c b/libavformat/nal.c
index 26dc5fe688..cabee37e0d 100644
--- a/libavformat/nal.c
+++ b/libavformat/nal.c
@@ -18,57 +18,20 @@
  * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
  */
 
-#include <stdint.h>
+#include "config.h"
 #include <string.h>
 
 #include "libavutil/mem.h"
 #include "libavutil/error.h"
+#include "libavutil/nal.h"
 #include "libavcodec/defs.h"
 #include "avio.h"
 #include "avio_internal.h"
 #include "nal.h"
 
-static const uint8_t *nal_find_startcode_internal(const uint8_t *p, const uint8_t *end)
+const uint8_t *ff_nal_find_startcode(const uint8_t *p, const uint8_t *end)
 {
-    const uint8_t *a = p + 4 - ((intptr_t)p & 3);
-
-    for (end -= 3; p < a && p < end; p++) {
-        if (p[0] == 0 && p[1] == 0 && p[2] == 1)
-            return p;
-    }
-
-    for (end -= 3; p < end; p += 4) {
-        uint32_t x = *(const uint32_t*)p;
-//      if ((x - 0x01000100) & (~x) & 0x80008000) // little endian
-//      if ((x - 0x00010001) & (~x) & 0x00800080) // big endian
-        if ((x - 0x01010101) & (~x) & 0x80808080) { // generic
-            if (p[1] == 0) {
-                if (p[0] == 0 && p[2] == 1)
-                    return p;
-                if (p[2] == 0 && p[3] == 1)
-                    return p+1;
-            }
-            if (p[3] == 0) {
-                if (p[2] == 0 && p[4] == 1)
-                    return p+2;
-                if (p[4] == 0 && p[5] == 1)
-                    return p+3;
-            }
-        }
-    }
-
-    for (end += 3; p < end; p++) {
-        if (p[0] == 0 && p[1] == 0 && p[2] == 1)
-            return p;
-    }
-
-    return end + 3;
-}
-
-const uint8_t *ff_nal_find_startcode(const uint8_t *p, const uint8_t *end){
-    const uint8_t *out = nal_find_startcode_internal(p, end);
-    if(p<out && out<end && !out[-1]) out--;
-    return out;
+    return av_nal_find_startcode(p, end);
 }
 
 static int nal_parse_units(AVIOContext *pb, NALUList *list,
diff --git a/libavutil/Makefile b/libavutil/Makefile
index c5241895ff..9d7c9d8820 100644
--- a/libavutil/Makefile
+++ b/libavutil/Makefile
@@ -164,6 +164,7 @@ OBJS = adler32.o                                                        \
        md5.o                                                            \
        mem.o                                                            \
        murmur3.o                                                        \
+       nal.o                                                            \
        opt.o                                                            \
        parseutils.o                                                     \
        pixdesc.o                                                        \
diff --git a/libavutil/aarch64/Makefile b/libavutil/aarch64/Makefile
index b70702902f..27d4b8bc65 100644
--- a/libavutil/aarch64/Makefile
+++ b/libavutil/aarch64/Makefile
@@ -6,6 +6,7 @@ ARMV8-OBJS += aarch64/crc.o
 
 NEON-OBJS += aarch64/float_dsp_neon.o                                 \
              aarch64/tx_float_neon.o                                  \
+             aarch64/nal.o                                            \
 
 SVE-OBJS += aarch64/cpu_sve.o                                         \
 
diff --git a/libavutil/aarch64/nal.S b/libavutil/aarch64/nal.S
new file mode 100644
index 0000000000..f6b3e4afcd
--- /dev/null
+++ b/libavutil/aarch64/nal.S
@@ -0,0 +1,169 @@
+/*
+ * ARM NEON-optimized NAL startcode search
+ * Copyright (c) 2024
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "libavutil/aarch64/asm.S"
+
+function ff_nal_find_startcode_neon, export=1
+        and             x2, x0, #-4              // align to 4-byte boundary
+        sub             x7, x1, #3               // end -= 3
+        add             x2, x2, #4               // align4 = aligned_p + 4
+        mov             x3, x0                   // p = orig_p
+        cmp             x0, x2
+        ccmp            x7, x0, #0, cc
+        bls             2f                       // skip alignment phase
+
+        // Phase 1: align to 4-byte boundary
+1:      ldrb            w0, [x3]
+        cbnz            w0, 3f
+        ldrb            w0, [x3, #1]
+        cbnz            w0, 3f
+        ldrb            w0, [x3, #2]
+        cmp             w0, #1
+        beq             22f                      // found 00 00 01
+3:      add             x3, x3, #1
+        cmp             x2, x3
+        ccmp            x7, x3, #0, hi
+        bhi             1b
+
+2:      sub             x0, x7, x3               // remaining = end - p
+        cmp             x0, #63
+        bgt             43f                      // enter NEON phase if >= 64 bytes
+
+        // Phase 3: byte-by-byte check for remaining data
+4:      cmp             x7, x3
+        bls             8f
+5:      ldrb            w0, [x3]
+        cbnz            w0, 6f
+        ldrb            w0, [x3, #1]
+        cbnz            w0, 6f
+        ldrb            w0, [x3, #2]
+        cmp             w0, #1
+        beq             22f
+6:      add             x3, x3, #1
+        cmp             x7, x3
+        bne             5b
+8:      mov             x0, x1                   // return orig_end
+        ret
+
+        // Phase 2: NEON acceleration (64-byte blocks)
+43:     sub             x8, x1, #66              // end64 = end - 66
+        cmp             x8, x3
+        bls             4b
+        mov             w6, #65279               // 0xFEFF
+        add             x5, x3, #64              // chunk_end = p + 64
+        movk            w6, #0xfefe, lsl #16     // 0xFEFEFEFF
+        b               10f
+
+9:      add             x3, x3, #64              // p += 64
+        add             x5, x5, #64              // chunk_end += 64
+        cmp             x8, x3
+        bls             4b
+
+10:     // Load 64 bytes (4x16-byte vectors)
+        ldp             q31, q30, [x3]           // load first 32 bytes
+        ldp             q29, q28, [x3, #32]      // load next 32 bytes
+        prfm            PLDL1KEEP, [x3, #192]    // prefetch
+
+        // Check for zero bytes (data == 0)
+        cmeq            v31.16b, v31.16b, #0     // z0
+        cmeq            v30.16b, v30.16b, #0     // z1
+        cmeq            v29.16b, v29.16b, #0     // z2
+        cmeq            v28.16b, v28.16b, #0     // z3
+
+        // Check for 00 pattern (current byte is 0 AND next byte is 0)
+        ext             v24.16b, v31.16b, v31.16b, #1    // zs0
+        ext             v27.16b, v30.16b, v30.16b, #1    // zs1
+        ext             v26.16b, v29.16b, v29.16b, #1    // zs2
+        ext             v25.16b, v28.16b, v28.16b, #1    // zs3
+
+        // pattern00 = zero & zero_shift
+        and             v24.16b, v24.16b, v31.16b        // p0
+        and             v27.16b, v27.16b, v30.16b        // p1
+        and             v26.16b, v26.16b, v29.16b        // p2
+        and             v25.16b, v25.16b, v28.16b        // p3
+
+        // Check if any 00 pattern exists (fast ORR test)
+        orr             v27.16b, v24.16b, v27.16b
+        orr             v25.16b, v26.16b, v25.16b
+        orr             v25.16b, v25.16b, v27.16b
+        dup             d31, v25.d[1]
+        orr             v31.8b, v31.8b, v25.8b
+        fmov            x0, d31
+        cbz             x0, 9b                   // no 00 pattern, skip to next chunk
+
+        // Detailed check of this 64-byte chunk
+        mov             x0, x3
+11:     ldr             w2, [x0]
+        add             w4, w2, w6               // x - 0x01010101
+        bic             w2, w4, w2               // (~x) & (x - 0x01010101)
+        tst             w2, #-2139062144         // & 0x80808080
+        beq             12f
+
+        ldrb            w2, [x0, #1]
+        cbnz            w2, 13f
+        ldrb            w4, [x0]
+        ldrb            w2, [x0, #2]
+        cbnz            w4, 14f
+        cmp             w2, #1
+        beq             18f                      // found 00 00 01
+14:     ldrb            w4, [x0, #3]
+        cbnz            w2, 15f
+        cmp             w4, #1
+        beq             44f                      // found 00 00 01 (offset +1)
+        cbnz            w4, 12f
+16:     ldrb            w2, [x0, #4]
+        cmp             w2, #1
+        beq             45f                      // found 00 00 01 (offset +2)
+17:     cbnz            w2, 12f
+        ldrb            w2, [x0, #5]
+        cmp             w2, #1
+        beq             46f                      // found 00 00 01 (offset +3)
+
+12:     add             x0, x0, #4
+        cmp             x0, x5
+        bne             11b
+        b               9b
+
+13:     ldrb            w2, [x0, #3]
+        cbnz            w2, 12b
+        ldrb            w2, [x0, #2]
+        cbz             w2, 16b
+        ldrb            w2, [x0, #4]
+        b               17b
+
+15:     cbnz            w4, 12b
+        ldrb            w2, [x0, #4]
+        b               17b
+
+22:     mov             x0, x3
+        ret
+
+45:     add             x0, x0, #2
+        ret
+
+44:     add             x0, x0, #1
+        ret
+
+46:     add             x0, x0, #3
+        ret
+
+18:     ret
+endfunc
diff --git a/libavutil/nal.c b/libavutil/nal.c
new file mode 100644
index 0000000000..ec2d8dc46f
--- /dev/null
+++ b/libavutil/nal.c
@@ -0,0 +1,100 @@
+/*
+ * NAL utility functions
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public License along
+ * with FFmpeg; if not, write to the Free Software Foundation, Inc.,
+ * 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include <stdint.h>
+#include <string.h>
+
+#include "config.h"
+#include "attributes.h"
+#include "cpu.h"
+#include "thread.h"
+#if ARCH_AARCH64
+#include "aarch64/cpu.h"
+#endif
+#include "nal.h"
+
+const uint8_t *ff_nal_find_startcode_c(const uint8_t *p, const uint8_t *end)
+{
+    const uint8_t *a = p + 4 - ((intptr_t)p & 3);
+
+    for (end -= 3; p < a && p < end; p++) {
+        if (p[0] == 0 && p[1] == 0 && p[2] == 1)
+            return p;
+    }
+
+    for (end -= 3; p < end; p += 4) {
+        uint32_t x = *(const uint32_t*)p;
+//      if ((x - 0x01000100) & (~x) & 0x80008000) // little endian
+//      if ((x - 0x00010001) & (~x) & 0x00800080) // big endian
+        if ((x - 0x01010101) & (~x) & 0x80808080) { // generic
+            if (p[1] == 0) {
+                if (p[0] == 0 && p[2] == 1)
+                    return p;
+                if (p[2] == 0 && p[3] == 1)
+                    return p+1;
+            }
+            if (p[3] == 0) {
+                if (p[2] == 0 && p[4] == 1)
+                    return p+2;
+                if (p[4] == 0 && p[5] == 1)
+                    return p+3;
+            }
+        }
+    }
+
+    for (end += 3; p < end; p++) {
+        if (p[0] == 0 && p[1] == 0 && p[2] == 1)
+            return p;
+    }
+
+    return end + 3;
+}
+
+// Function pointer to the active implementation
+static const uint8_t *(*nal_find_startcode_func)(const uint8_t *p, const uint8_t *end);
+
+// Thread-safe initialization using ff_thread_once
+static AVOnce nal_func_init_once = AV_ONCE_INIT;
+
+static void nal_find_startcode_init(void)
+{
+#if ARCH_AARCH64
+    int cpu_flags = av_get_cpu_flags();
+    if (have_neon(cpu_flags))
+        nal_find_startcode_func = ff_nal_find_startcode_neon;
+    else
+        nal_find_startcode_func = ff_nal_find_startcode_c;
+#else
+    nal_find_startcode_func = ff_nal_find_startcode_c;
+#endif
+}
+
+const uint8_t *av_nal_find_startcode(const uint8_t *p, const uint8_t *end)
+{
+    // Initialize function pointer on first call (thread-safe)
+    ff_thread_once(&nal_func_init_once, nal_find_startcode_init);
+
+    // Call the optimized implementation
+    p = nal_find_startcode_func(p, end);
+
+    if (p < end && !p[-1])
+        p--;
+    return p;
+}
diff --git a/libavutil/nal.h b/libavutil/nal.h
new file mode 100644
index 0000000000..ccc3f7b70d
--- /dev/null
+++ b/libavutil/nal.h
@@ -0,0 +1,60 @@
+/*
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public License along
+ * with FFmpeg; if not, write to the Free Software Foundation, Inc.,
+ * 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#ifndef AVUTIL_NAL_H
+#define AVUTIL_NAL_H
+
+#include <stdint.h>
+
+/**
+ * @file
+ * @ingroup lavu_misc
+ * NAL (Network Abstraction Layer) utility functions
+ */
+
+/**
+ * @addtogroup lavu_misc
+ * @{
+ */
+
+/**
+ * Find a H.264/H.265 NAL startcode (00 00 01 or 00 00 00 01) in a buffer.
+ *
+ * @param p    Pointer to start searching from
+ * @param end  Pointer to end of buffer (must have at least 3 bytes padding)
+ * @return     Pointer to startcode, or end+3 if not found
+ *
+ * @note This function searches for the pattern 00 00 01 (three-byte startcode)
+ *       or 00 00 00 01 (four-byte startcode). When found, it returns a pointer
+ *       to the startcode. If the byte before the startcode is also 0, it returns
+ *       that position instead (to handle the four-byte case).
+ *       If no startcode is found, returns end + 3.
+ */
+const uint8_t *av_nal_find_startcode(const uint8_t *p, const uint8_t *end);
+
+/* Internal implementations exposed for testing */
+const uint8_t *ff_nal_find_startcode_c(const uint8_t *p, const uint8_t *end);
+#if ARCH_AARCH64
+const uint8_t *ff_nal_find_startcode_neon(const uint8_t *p, const uint8_t *end);
+#endif
+
+/**
+ * @}
+ */
+
+#endif /* AVUTIL_NAL_H */
diff --git a/tests/checkasm/Makefile b/tests/checkasm/Makefile
index 48f358d40d..54f034caae 100644
--- a/tests/checkasm/Makefile
+++ b/tests/checkasm/Makefile
@@ -95,6 +95,7 @@ AVUTILOBJS                              += crc.o
 AVUTILOBJS                              += fixed_dsp.o
 AVUTILOBJS                              += float_dsp.o
 AVUTILOBJS                              += lls.o
+AVUTILOBJS                              += nal.o
 
 CHECKASMOBJS-$(CONFIG_AVUTIL)  += $(AVUTILOBJS)
 
diff --git a/tests/checkasm/checkasm.c b/tests/checkasm/checkasm.c
index 7dcdaeb2a4..835ae89905 100644
--- a/tests/checkasm/checkasm.c
+++ b/tests/checkasm/checkasm.c
@@ -348,6 +348,7 @@ static const struct {
         { "fixed_dsp", checkasm_check_fixed_dsp },
         { "float_dsp", checkasm_check_float_dsp },
         { "lls",       checkasm_check_lls },
+        { "nal",       checkasm_check_nal },
         { "av_tx",     checkasm_check_av_tx },
 #endif
     { NULL }
diff --git a/tests/checkasm/checkasm.h b/tests/checkasm/checkasm.h
index e3addec21e..711036873b 100644
--- a/tests/checkasm/checkasm.h
+++ b/tests/checkasm/checkasm.h
@@ -120,6 +120,7 @@ void checkasm_check_idet(void);
 void checkasm_check_jpeg2000dsp(void);
 void checkasm_check_llauddsp(void);
 void checkasm_check_lls(void);
+void checkasm_check_nal(void);
 void checkasm_check_llviddsp(void);
 void checkasm_check_llvidencdsp(void);
 void checkasm_check_lpc(void);
diff --git a/tests/checkasm/nal.c b/tests/checkasm/nal.c
new file mode 100644
index 0000000000..97ef2909cb
--- /dev/null
+++ b/tests/checkasm/nal.c
@@ -0,0 +1,197 @@
+/*
+ * Copyright (c) 2024
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License along
+ * with FFmpeg; if not, write to the Free Software Foundation, Inc.,
+ * 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include <string.h>
+
+#include "libavutil/common.h"
+#include "libavcodec/defs.h"
+#include "libavutil/intreadwrite.h"
+#include "libavutil/mem_internal.h"
+#include "libavutil/cpu.h"
+#if ARCH_AARCH64
+#include "libavutil/aarch64/cpu.h"
+#endif
+
+#include "checkasm.h"
+
+#include "libavutil/nal.h"
+
+#define BUF_SIZE (8192 + AV_INPUT_BUFFER_PADDING_SIZE)
+
+void checkasm_check_nal(void)
+{
+    LOCAL_ALIGNED_8(uint8_t, buf, [BUF_SIZE]);
+    const uint8_t *ref_res, *new_res;
+
+#if ARCH_AARCH64
+    int cpu_flags = av_get_cpu_flags();
+    int have_neon_impl = have_neon(cpu_flags);
+
+    if (have_neon_impl) {
+        declare_func(const uint8_t *, const uint8_t *p, const uint8_t *end);
+
+        // Set C version as reference implementation
+        func_ref = ff_nal_find_startcode_c;
+
+        // Test 1: Startcode at beginning
+        memset(buf, 0xFF, BUF_SIZE);
+        AV_WN32A(buf, 0x01000000);
+        if (check_func(ff_nal_find_startcode_neon, "startcode_at_beginning")) {
+            ref_res = call_ref(buf, buf + 8192);
+            new_res = call_new(buf, buf + 8192);
+            if (ref_res != new_res)
+                fail();
+            bench_new(buf, buf + 8192);
+        }
+
+        // Test 2: Startcode at offset 4 (three-byte)
+        memset(buf, 0xFF, BUF_SIZE);
+        AV_WN32A(buf + 4, 0x010000);
+        if (check_func(ff_nal_find_startcode_neon, "startcode_at_offset_4")) {
+            ref_res = call_ref(buf, buf + 8192);
+            new_res = call_new(buf, buf + 8192);
+            if (ref_res != new_res)
+                fail();
+            bench_new(buf, buf + 8192);
+        }
+
+        // Test 3: Multiple startcodes, find first one
+        memset(buf, 0, BUF_SIZE);
+        AV_WN32A(buf + 100, 0x01000000);
+        AV_WN32A(buf + 500, 0x01000000);
+        AV_WN32A(buf + 1000, 0x01000000);
+        if (check_func(ff_nal_find_startcode_neon, "multiple_startcodes")) {
+            ref_res = call_ref(buf, buf + 8192);
+            new_res = call_new(buf, buf + 8192);
+            if (ref_res != new_res)
+                fail();
+            bench_new(buf, buf + 8192);
+        }
+
+        // Test 4: No startcode (all 0xFF) - CRITICAL TEST
+        memset(buf, 0xFF, 256);
+        memset(buf + 256, 0, AV_INPUT_BUFFER_PADDING_SIZE);
+        if (check_func(ff_nal_find_startcode_neon, "no_startcode_0xFF")) {
+            ref_res = call_ref(buf, buf + 256);
+            new_res = call_new(buf, buf + 256);
+            if (ref_res != new_res)
+                fail();
+            bench_new(buf, buf + 256);
+        }
+
+        // Test 5: No startcode (all zeros)
+        memset(buf, 0, 256);
+        memset(buf + 256, 0, AV_INPUT_BUFFER_PADDING_SIZE);
+        if (check_func(ff_nal_find_startcode_neon, "no_startcode_zeros")) {
+            ref_res = call_ref(buf, buf + 256);
+            new_res = call_new(buf, buf + 256);
+            if (ref_res != new_res)
+                fail();
+            bench_new(buf, buf + 256);
+        }
+
+        // Test 6: Startcode near end
+        memset(buf, 0xFF, BUF_SIZE);
+        AV_WN32A(buf + 8188, 0x01000000);
+        if (check_func(ff_nal_find_startcode_neon, "startcode_near_end")) {
+            ref_res = call_ref(buf, buf + 8192);
+            new_res = call_new(buf, buf + 8192);
+            if (ref_res != new_res)
+                fail();
+            bench_new(buf, buf + 8192);
+        }
+
+        // Test 7: Search from middle
+        memset(buf, 0, BUF_SIZE);
+        AV_WN32A(buf + 100, 0x01000000);
+        AV_WN32A(buf + 500, 0x01000000);
+        if (check_func(ff_nal_find_startcode_neon, "search_from_middle")) {
+            ref_res = call_ref(buf + 200, buf + 8192);
+            new_res = call_new(buf + 200, buf + 8192);
+            if (ref_res != new_res)
+                fail();
+            bench_new(buf + 200, buf + 8192);
+        }
+
+        // Test 8: Small buffer (16 bytes)
+        memset(buf, 0xFF, 16);
+        memset(buf + 16, 0, AV_INPUT_BUFFER_PADDING_SIZE);
+        if (check_func(ff_nal_find_startcode_neon, "small_buffer_16")) {
+            ref_res = call_ref(buf, buf + 16);
+            new_res = call_new(buf, buf + 16);
+            if (ref_res != new_res)
+                fail();
+            bench_new(buf, buf + 16);
+        }
+
+        // Test 9: Very small buffer (4 bytes)
+        memset(buf, 0xFF, 4);
+        memset(buf + 4, 0, AV_INPUT_BUFFER_PADDING_SIZE);
+        if (check_func(ff_nal_find_startcode_neon, "tiny_buffer_4")) {
+            ref_res = call_ref(buf, buf + 4);
+            new_res = call_new(buf, buf + 4);
+            if (ref_res != new_res)
+                fail();
+            bench_new(buf, buf + 4);
+        }
+
+        // Test 10: Three-byte startcode pattern
+        memset(buf, 0xFF, BUF_SIZE);
+        buf[50] = 0x00;
+        buf[51] = 0x00;
+        buf[52] = 0x01;
+        if (check_func(ff_nal_find_startcode_neon, "three_byte_startcode")) {
+            ref_res = call_ref(buf, buf + 8192);
+            new_res = call_new(buf, buf + 8192);
+            if (ref_res != new_res)
+                fail();
+            bench_new(buf, buf + 8192);
+        }
+
+        // Test 11: Random data with startcode
+        for (int i = 0; i < 8192; i++) {
+            buf[i] = rnd() & 0xFF;
+        }
+        memset(buf + 8192, 0, AV_INPUT_BUFFER_PADDING_SIZE);
+        int pos = (rnd() & 0x1FFF) + 100;
+        AV_WN32A(buf + pos, 0x01000000);
+        if (check_func(ff_nal_find_startcode_neon, "random_with_startcode")) {
+            ref_res = call_ref(buf, buf + 8192);
+            new_res = call_new(buf, buf + 8192);
+            if (ref_res != new_res)
+                fail();
+            bench_new(buf, buf + 8192);
+        }
+
+        // Test 12: Large buffer with no startcode
+        memset(buf, 0xAA, 8192);
+        memset(buf + 8192, 0, AV_INPUT_BUFFER_PADDING_SIZE);
+        if (check_func(ff_nal_find_startcode_neon, "large_no_startcode")) {
+            ref_res = call_ref(buf, buf + 8192);
+            new_res = call_new(buf, buf + 8192);
+            if (ref_res != new_res)
+                fail();
+            bench_new(buf, buf + 8192);
+        }
+    }
+#endif
+
+    report("nal");
+}
diff --git a/tests/fate/checkasm.mak b/tests/fate/checkasm.mak
index b08c1947cd..aac7e62ffd 100644
--- a/tests/fate/checkasm.mak
+++ b/tests/fate/checkasm.mak
@@ -36,6 +36,7 @@ FATE_CHECKASM = fate-checkasm-aacencdsp                                 \
                 fate-checkasm-jpeg2000dsp                               \
                 fate-checkasm-llauddsp                                  \
                 fate-checkasm-lls                                       \
+                fate-checkasm-nal                                       \
                 fate-checkasm-llviddsp                                  \
                 fate-checkasm-llvidencdsp                               \
                 fate-checkasm-lpc                                       \
-- 
2.52.0

_______________________________________________
ffmpeg-devel mailing list -- ffmpeg-devel@ffmpeg.org
To unsubscribe send an email to ffmpeg-devel-leave@ffmpeg.org

^ permalink raw reply	[flat|nested] only message in thread

only message in thread, other threads:[~2026-01-15 17:20 UTC | newest]

Thread overview: (only message) (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-01-15 17:19 [FFmpeg-devel] [PR] libavutil/nal: optimize NAL startcode search with ARM NEON (PR #21475) hezuoqiang via ffmpeg-devel

Git Inbox Mirror of the ffmpeg-devel mailing list - see https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

This inbox may be cloned and mirrored by anyone:

	git clone --mirror https://master.gitmailbox.com/ffmpegdev/0 ffmpegdev/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 ffmpegdev ffmpegdev/ https://master.gitmailbox.com/ffmpegdev \
		ffmpegdev@gitmailbox.com
	public-inbox-index ffmpegdev

Example config snippet for mirrors.


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git