[FFmpeg-devel] Re: Re: [PATCH] libavformat/nal: add ARM NEON optimization for ff_nal_find_startcode

Git Inbox Mirror of the ffmpeg-devel mailing list - see https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

From: hezuoqiang via ffmpeg-devel <ffmpeg-devel@ffmpeg.org>
To: "FFmpeg development discussions and patches" <ffmpeg-devel@ffmpeg.org>
Cc: "Rémi Denis-Courmont" <remi@remlab.net>,
	hezuoqiang <hezuoqiang@foxmail.com>
Subject: [FFmpeg-devel] Re: Re: [PATCH] libavformat/nal: add ARM NEON optimization for ff_nal_find_startcode
Date: Wed, 14 Jan 2026 01:20:13 +0800
Message-ID: <tencent_B233DB36441FC505F2576352F6D4C220AE09@qq.com> (raw)
In-Reply-To: <2154BAC0-A450-438A-AE4B-67DCD5024002@remlab.net>

Hi Rémi,

Thank you for your review. I'd like to clarify the difference between the two approaches:

**Clarification:**

My patch optimizes `ff_nal_find_startcode` in libavformat/nal.c, which is different from the `ff_startcode_find_candidate` hook you mentioned under libavcodec/h264dsp.c.

- `ff_startcode_find_candidate`: Returns offset to first zero byte, requires upper layer validation
- `ff_nal_find_startcode`: Returns pointer to complete startcode (00 00 01), used by H.264 demuxer
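To make the difference in contracts concrete, here is a minimal scalar sketch. Both functions are hypothetical stand-ins written only for illustration (`find_candidate_scalar` and `find_startcode_scalar` are not FFmpeg symbols), and the real FFmpeg functions may treat edge cases, such as a leading zero byte before a 4-byte startcode or the not-found return value, differently.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the "candidate" contract: returns the offset of
// the first zero byte in buf, or size if there is none. The caller still has
// to check whether that zero byte actually begins a startcode.
static int find_candidate_scalar(const uint8_t* buf, int size) {
    int i = 0;
    while (i < size && buf[i] != 0) i++;
    return i;
}

// Hypothetical stand-in for the "complete startcode" contract: returns a
// pointer to the first 00 00 01 sequence in [p, end), or end if there is none.
// No further validation is needed by the caller.
static const uint8_t* find_startcode_scalar(const uint8_t* p, const uint8_t* end) {
    for (; end - p >= 3; p++) {
        if (p[0] == 0 && p[1] == 0 && p[2] == 1) return p;
    }
    return end;
}
```

On a buffer such as `05 00 07 00 00 01 09`, the first function stops at the lone zero at offset 1, while the second goes straight to the startcode at offset 3.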

**Test Environment:**
- Platform: Raspberry Pi 5 (ARM Cortex-A76, AArch64)
- Compiler: GCC 14.2.0 with -O3 -march=armv8-a
- Test file: 1080p H.264 video, 22.88 MB
- Total NALU startcodes found: 1,224

**Test Methodology:**

I compared two approaches:

**Method 1 (baseline):** Use `ff_startcode_find_candidate` + C validation (current FFmpeg approach)

```cpp
// Simplified pseudo-code
#include <cstddef>
#include <cstdint>
#include <vector>

std::vector<size_t> find_all_startcode_positions(const uint8_t* data, size_t size) {
    std::vector<size_t> positions;
    size_t i = 0;

    while (i < size) {
        // Step 1: Fast search for zero byte
        const size_t remaining = size - i;
        const int offset = ff_startcode_find_candidate(data + i, static_cast<int>(remaining));
        // Compare as size_t to avoid a signed/unsigned mismatch
        if (offset < 0 || static_cast<size_t>(offset) >= remaining) break;
        i += offset;

        // Step 2: Validate if it's a complete startcode (00 00 01)
        if (i + 2 < size && data[i] == 0 && data[i + 1] == 0) {
            if (data[i + 2] == 1) {
                // 3-byte startcode: 00 00 01
                positions.push_back(i);
                i += 3;
                continue;
            } else if (i + 3 < size && data[i + 2] == 0 && data[i + 3] == 1) {
                // 4-byte startcode: 00 00 00 01
                positions.push_back(i);
                i += 4;
                continue;
            }
        }
        i++;
    }
    return positions;
}
```

**Method 2 (NEON optimized):** Use `ff_nal_find_startcode_neon` directly

```cpp
std::vector<size_t> find_all_startcode_positions_neon(const uint8_t* data, size_t size) {
    std::vector<size_t> positions;
    const uint8_t* p = data;
    const uint8_t* end = data + size;

    while (p < end) {
        // Directly find complete startcode
        const uint8_t* start = ff_nal_find_startcode_neon(p, end);

        // Skip zero bytes before NALU header
        while (start < end && *start == 0) start++;
        if (start >= end) break;

        positions.push_back(start - data);
        p = start;
    }
    return positions;
}
```

**Performance Results (1000 iterations):**
- Method 1 (find zero + validate): 5,454,680 μs
- Method 2 (NEON direct search): 1,741,280 μs
- Speedup: 3.13x
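
For reference, totals like these can be collected with a small timing helper. The sketch below is only an assumption about how such a measurement might be set up; `time_total_us` and `speedup` are hypothetical names, and this is not the actual benchmark harness used for the numbers above.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical helper: runs fn() `iterations` times and returns the total
// elapsed wall-clock time in microseconds, using a monotonic clock.
template <typename Fn>
static int64_t time_total_us(Fn&& fn, int iterations) {
    const auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; i++) fn();
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
}

// Speedup is the baseline time divided by the optimized time.
static double speedup(int64_t baseline_us, int64_t optimized_us) {
    return static_cast<double>(baseline_us) / static_cast<double>(optimized_us);
}
```

With the totals reported above, `speedup(5454680, 1741280)` evaluates to roughly 3.13.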

**Why this optimization is effective:**

The NEON version detects the "00" pattern (two consecutive zero bytes) instead of single zero bytes, so far fewer positions need a detailed check.

**Test file analysis (22.88 MB 1080p H.264):**
- Single zero bytes: 95,673 (98.1% false positive rate)
- Valid startcodes: 1,224
- With the "00" pattern: only 22.8% of 64-byte blocks need detailed checking
- 77.2% of blocks can be skipped entirely

This optimization specifically improves H.264 demuxing performance on ARM platforms.

Should I modify the commit message to better clarify this distinction?

Best regards,
He Zuoqiang





Original message

From: Rémi Denis-Courmont via ffmpeg-devel <ffmpeg-devel@ffmpeg.org>
Sent: January 13, 2026, 18:26
To: hezuoqiang--- via ffmpeg-devel <ffmpeg-devel@ffmpeg.org>
Cc: Zuoqiang He <hezuoqiang@foxmail.com>, Rémi Denis-Courmont <remi@remlab.net>
Subject: [FFmpeg-devel] Re: [PATCH] libavformat/nal: add ARM NEON optimization for ff_nal_find_startcode



Nihao,

There already is a hook for this purpose under h264dsp, and it's already used on some other ISAs. So there should be no need to add a new one.

It's also probably faster to just look for a nul byte in assembler and let the C code manually check for the full 32-bit start code. This is basically just `strnlen()`.

Br,
_______________________________________________
ffmpeg-devel mailing list -- ffmpeg-devel@ffmpeg.org
To unsubscribe send an email to ffmpeg-devel-leave@ffmpeg.org


Thread overview: 4+ messages
2026-01-13  2:03 [FFmpeg-devel] [PATCH] libavformat/nal: add ARM NEON optimization for ff_nal_find_startcode hezuoqiang--- via ffmpeg-devel
2026-01-13  2:48 ` [FFmpeg-devel] " Zhao Zhili via ffmpeg-devel
2026-01-13 10:26 ` Rémi Denis-Courmont via ffmpeg-devel
2026-01-13 17:20   ` hezuoqiang via ffmpeg-devel [this message]
