Git Inbox Mirror of the ffmpeg-devel mailing list - see https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
* [FFmpeg-devel] [PR] hevc intra pred neon optimizations (PR #21757)
@ 2026-02-14 12:18 Jun Zhao via ffmpeg-devel
  0 siblings, 0 replies; only message in thread
From: Jun Zhao via ffmpeg-devel @ 2026-02-14 12:18 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Jun Zhao

PR #21757 opened by Jun Zhao (mypopydev)
URL: https://code.ffmpeg.org/FFmpeg/FFmpeg/pulls/21757
Patch URL: https://code.ffmpeg.org/FFmpeg/FFmpeg/pulls/21757.patch

This series adds AArch64 NEON optimizations for all HEVC 8-bit intra
prediction modes (DC, Planar, and 33 angular modes) across 4x4 to 32x32
block sizes, with checkasm tests.


From ffd88a9f7799e6f9e4f4f41c57a38c78a1d66e66 Mon Sep 17 00:00:00 2001
From: Jun Zhao <barryjzhao@tencent.com>
Date: Tue, 27 Jan 2026 18:24:08 +0800
Subject: [PATCH 1/8] lavc/hevc: add aarch64 NEON for DC and Planar prediction

Add NEON-optimized implementations for HEVC intra prediction DC and
Planar modes at 8-bit depth, supporting all block sizes (4x4 to 32x32).

DC prediction:
- Computes average of top and left reference samples using uaddlv
- Vectorized edge smoothing for luma blocks (ushll/add/ushr/xtn)
- Separate luma/chroma code paths to skip smoothing for chroma
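
As a scalar reference, the DC rules above can be sketched in C (the
function name and signature here are illustrative, not FFmpeg's internal
API; this is the behavior the NEON kernels are checked against):

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar sketch of 8-bit HEVC DC intra prediction, mirroring the rules
 * the NEON code implements. Name and signature are illustrative only. */
static void pred_dc_ref(uint8_t *dst, const uint8_t *top, const uint8_t *left,
                        ptrdiff_t stride, int log2_size, int c_idx)
{
    int size = 1 << log2_size;
    int dc   = size;                      /* rounding term */

    for (int i = 0; i < size; i++)        /* sum of top + left samples */
        dc += top[i] + left[i];
    dc >>= log2_size + 1;                 /* average */

    for (int y = 0; y < size; y++)        /* flat DC fill */
        for (int x = 0; x < size; x++)
            dst[y * stride + x] = dc;

    if (c_idx == 0 && size < 32) {        /* edge smoothing: luma, size < 32 */
        dst[0] = (left[0] + 2 * dc + top[0] + 2) >> 2;
        for (int x = 1; x < size; x++)
            dst[x] = (top[x] + 3 * dc + 2) >> 2;
        for (int y = 1; y < size; y++)
            dst[y * stride] = (left[y] + 3 * dc + 2) >> 2;
    }
}
```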

Planar prediction:
- Implements bilinear interpolation using weighted reference samples
- Uses precomputed weight tables for each block size
- 32x32: fully unrolled with incremental base update and NEON-domain
  left[y] broadcasts, eliminating GP-to-NEON transfers
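
The per-pixel planar rule that all four kernels vectorize can likewise be
sketched in scalar C (pred_planar_ref is an illustrative name):

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar sketch of 8-bit HEVC planar intra prediction. Note that it reads
 * one sample past the block edge: top[size] and left[size]. */
static void pred_planar_ref(uint8_t *dst, const uint8_t *top,
                            const uint8_t *left, ptrdiff_t stride,
                            int log2_size)
{
    int size = 1 << log2_size;

    for (int y = 0; y < size; y++)
        for (int x = 0; x < size; x++)
            dst[y * stride + x] =
                ((size - 1 - x) * left[y] + (x + 1) * top[size] +
                 (size - 1 - y) * top[x] + (y + 1) * left[size] + size)
                >> (log2_size + 1);
}
```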

Also adds the initialization framework and checkasm test infrastructure.

Speedup over C on Apple M4 (checkasm --bench, 10-run average):

  DC prediction:
    4x4: 2.16x    8x8: 3.22x    16x16: 3.50x    32x32: 2.90x

  Planar prediction:
    4x4: 1.42x    8x8: 3.39x    16x16: 3.75x    32x32: 2.84x

Signed-off-by: Jun Zhao <barryjzhao@tencent.com>
---
 libavcodec/aarch64/Makefile                |   2 +
 libavcodec/aarch64/hevcpred_init_aarch64.c |  88 +++
 libavcodec/aarch64/hevcpred_neon.S         | 757 +++++++++++++++++++++
 libavcodec/hevc/pred.c                     |   3 +
 libavcodec/hevc/pred.h                     |   1 +
 tests/checkasm/Makefile                    |   3 +-
 tests/checkasm/checkasm.c                  |   1 +
 tests/checkasm/checkasm.h                  |   1 +
 tests/checkasm/hevc_pred.c                 | 227 ++++++
 tests/fate/checkasm.mak                    |   1 +
 10 files changed, 1083 insertions(+), 1 deletion(-)
 create mode 100644 libavcodec/aarch64/hevcpred_init_aarch64.c
 create mode 100644 libavcodec/aarch64/hevcpred_neon.S
 create mode 100644 tests/checkasm/hevc_pred.c

diff --git a/libavcodec/aarch64/Makefile b/libavcodec/aarch64/Makefile
index e3abdbfd72..000bab4e1e 100644
--- a/libavcodec/aarch64/Makefile
+++ b/libavcodec/aarch64/Makefile
@@ -76,6 +76,8 @@ NEON-OBJS-$(CONFIG_HEVC_DECODER)        += aarch64/hevcdsp_deblock_neon.o      \
                                            aarch64/hevcdsp_dequant_neon.o      \
                                            aarch64/hevcdsp_idct_neon.o         \
                                            aarch64/hevcdsp_init_aarch64.o      \
+                                           aarch64/hevcpred_neon.o             \
+                                           aarch64/hevcpred_init_aarch64.o     \
                                            aarch64/h26x/epel_neon.o            \
                                            aarch64/h26x/qpel_neon.o            \
                                            aarch64/h26x/sao_neon.o
diff --git a/libavcodec/aarch64/hevcpred_init_aarch64.c b/libavcodec/aarch64/hevcpred_init_aarch64.c
new file mode 100644
index 0000000000..0d5517aa9b
--- /dev/null
+++ b/libavcodec/aarch64/hevcpred_init_aarch64.c
@@ -0,0 +1,88 @@
+/*
+ * HEVC Intra Prediction NEON initialization
+ *
+ * Copyright (c) 2026 FFmpeg Project
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "libavutil/attributes.h"
+#include "libavutil/avassert.h"
+#include "libavutil/aarch64/cpu.h"
+#include "libavcodec/hevc/pred.h"
+
+// DC prediction
+void ff_hevc_pred_dc_4x4_8_neon(uint8_t *src, const uint8_t *top,
+                               const uint8_t *left, ptrdiff_t stride,
+                               int c_idx);
+void ff_hevc_pred_dc_8x8_8_neon(uint8_t *src, const uint8_t *top,
+                               const uint8_t *left, ptrdiff_t stride,
+                               int c_idx);
+void ff_hevc_pred_dc_16x16_8_neon(uint8_t *src, const uint8_t *top,
+                                const uint8_t *left, ptrdiff_t stride,
+                                int c_idx);
+void ff_hevc_pred_dc_32x32_8_neon(uint8_t *src, const uint8_t *top,
+                                const uint8_t *left, ptrdiff_t stride,
+                                int c_idx);
+
+// Planar prediction
+void ff_hevc_pred_planar_4x4_8_neon(uint8_t *src, const uint8_t *top,
+                                   const uint8_t *left, ptrdiff_t stride);
+void ff_hevc_pred_planar_8x8_8_neon(uint8_t *src, const uint8_t *top,
+                                   const uint8_t *left, ptrdiff_t stride);
+void ff_hevc_pred_planar_16x16_8_neon(uint8_t *src, const uint8_t *top,
+                                    const uint8_t *left, ptrdiff_t stride);
+void ff_hevc_pred_planar_32x32_8_neon(uint8_t *src, const uint8_t *top,
+                                    const uint8_t *left, ptrdiff_t stride);
+
+static void pred_dc_neon(uint8_t *src, const uint8_t *top,
+                         const uint8_t *left, ptrdiff_t stride,
+                         int log2_size, int c_idx)
+{
+    switch (log2_size) {
+    case 2:
+        ff_hevc_pred_dc_4x4_8_neon(src, top, left, stride, c_idx);
+        break;
+    case 3:
+        ff_hevc_pred_dc_8x8_8_neon(src, top, left, stride, c_idx);
+        break;
+    case 4:
+        ff_hevc_pred_dc_16x16_8_neon(src, top, left, stride, c_idx);
+        break;
+    case 5:
+        ff_hevc_pred_dc_32x32_8_neon(src, top, left, stride, c_idx);
+        break;
+    default:
+        av_unreachable("log2_size must be 2, 3, 4 or 5");
+    }
+}
+
+av_cold void ff_hevc_pred_init_aarch64(HEVCPredContext *hpc, int bit_depth)
+{
+    int cpu_flags = av_get_cpu_flags();
+
+    if (!have_neon(cpu_flags))
+        return;
+
+    if (bit_depth == 8) {
+        hpc->pred_dc        = pred_dc_neon;
+        hpc->pred_planar[0] = ff_hevc_pred_planar_4x4_8_neon;
+        hpc->pred_planar[1] = ff_hevc_pred_planar_8x8_8_neon;
+        hpc->pred_planar[2] = ff_hevc_pred_planar_16x16_8_neon;
+        hpc->pred_planar[3] = ff_hevc_pred_planar_32x32_8_neon;
+    }
+}
diff --git a/libavcodec/aarch64/hevcpred_neon.S b/libavcodec/aarch64/hevcpred_neon.S
new file mode 100644
index 0000000000..77ddd69dbc
--- /dev/null
+++ b/libavcodec/aarch64/hevcpred_neon.S
@@ -0,0 +1,757 @@
+/*
+ * HEVC Intra Prediction NEON optimizations
+ *
+ * Copyright (c) 2026 FFmpeg Project
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "libavutil/aarch64/asm.S"
+
+/* HEVC Intra Prediction functions
+ *
+ * Function signatures (different from H264):
+ * pred_dc:      void (uint8_t *src, const uint8_t *top, const uint8_t *left,
+ *                     ptrdiff_t stride, int log2_size, int c_idx)
+ * pred_planar:  void (uint8_t *src, const uint8_t *top, const uint8_t *left,
+ *                     ptrdiff_t stride)
+ * pred_angular: void (uint8_t *src, const uint8_t *top, const uint8_t *left,
+ *                     ptrdiff_t stride, int c_idx, int mode)
+ */
+
+// =============================================================================
+// DC Prediction
+// =============================================================================
+
+/*
+ * DC prediction algorithm:
+ * 1. dc = sum(top[0..size-1]) + sum(left[0..size-1]) + size
+ * 2. dc >>= (log2_size + 1)
+ * 3. Fill block with dc value
+ * 4. If c_idx == 0 && size < 32: smooth edges
+ *    - POS(0,0) = (left[0] + 2*dc + top[0] + 2) >> 2
+ *    - First row: (top[x] + 3*dc + 2) >> 2
+ *    - First col: (left[y] + 3*dc + 2) >> 2
+ */
+
+// -----------------------------------------------------------------------------
+// pred_dc_4x4_8: DC prediction
+// Arguments:
+// x0: src
+// x1: top
+// x2: left
+// x3: stride
+// w4: c_idx
+// -----------------------------------------------------------------------------
+function ff_hevc_pred_dc_4x4_8_neon, export=1
+        // Load top[0..3] and left[0..3]
+        ldr             s0, [x1]                // top[0..3]
+        ldr             s1, [x2]                // left[0..3]
+
+        // Sum using NEON
+        uaddlv          h2, v0.8b               // sum top (upper 4 bytes zeroed by ldr s0)
+        uaddlv          h3, v1.8b               // sum left (upper 4 bytes zeroed by ldr s1)
+        add             v2.4h, v2.4h, v3.4h     // total sum
+
+        // Add rounding (4) and shift by 3
+        movi            v3.4h, #4
+        add             v2.4h, v2.4h, v3.4h
+        ushr            v2.4h, v2.4h, #3        // >> 3
+        dup             v2.8b, v2.b[0]          // broadcast dc
+
+        // Store 4 rows
+        str             s2, [x0]
+        str             s2, [x0, x3]
+        add             x5, x0, x3, lsl #1
+        str             s2, [x5]
+        str             s2, [x5, x3]
+
+        // Edge smoothing for luma only
+        cbnz            w4, 9f
+
+        umov            w6, v2.b[0]             // dc
+
+        // Vectorized edge smoothing
+        add             w9, w6, w6, lsl #1      // 3*dc
+        add             w9, w9, #2              // 3*dc + 2
+        dup             v3.8h, w9               // broadcast to 16-bit
+
+        // First row: (top[x] + 3*dc + 2) >> 2
+        ushll           v4.8h, v0.8b, #0        // widen top
+        add             v4.8h, v4.8h, v3.8h
+        ushr            v4.8h, v4.8h, #2
+        xtn             v4.8b, v4.8h
+        str             s4, [x0]                // store smoothed row, overwrite corner below
+
+        // Corner: POS(0,0) = (left[0] + 2*dc + top[0] + 2) >> 2
+        ldrb            w7, [x2]                // left[0]
+        ldrb            w8, [x1]                // top[0]
+        add             w10, w6, w6             // 2*dc
+        add             w10, w10, w7
+        add             w10, w10, w8
+        add             w10, w10, #2
+        lsr             w10, w10, #2
+        strb            w10, [x0]
+
+        // First column: (left[y] + 3*dc + 2) >> 2 for y=1..3
+        ushll           v5.8h, v1.8b, #0        // widen left
+        add             v5.8h, v5.8h, v3.8h
+        ushr            v5.8h, v5.8h, #2
+        xtn             v5.8b, v5.8h
+        add             x5, x0, x3
+        st1             {v5.b}[1], [x5]
+        add             x5, x5, x3
+        st1             {v5.b}[2], [x5]
+        add             x5, x5, x3
+        st1             {v5.b}[3], [x5]
+
+9:      ret
+endfunc
+
+// -----------------------------------------------------------------------------
+// pred_dc_8x8_8: DC prediction
+// Arguments:
+// x0: src
+// x1: top
+// x2: left
+// x3: stride
+// w4: c_idx
+// -----------------------------------------------------------------------------
+function ff_hevc_pred_dc_8x8_8_neon, export=1
+        // Load top[0..7] and left[0..7]
+        ldr             d0, [x1]                // top[0..7]
+        ldr             d1, [x2]                // left[0..7]
+
+        // Sum all pixels
+        uaddlv          h2, v0.8b               // sum top
+        uaddlv          h3, v1.8b               // sum left
+        add             v2.4h, v2.4h, v3.4h     // total sum
+
+        // Add rounding (8) and shift by 4
+        movi            v3.4h, #8
+        add             v2.4h, v2.4h, v3.4h     // + 8
+        ushr            v2.4h, v2.4h, #4        // >> 4
+        umov            w6, v2.h[0]             // dc as scalar
+        dup             v2.8b, w6               // broadcast dc
+
+        // Check if edge smoothing needed (luma only)
+        cbnz            w4, 2f
+
+        // === Luma path: fill + edge smoothing combined ===
+
+        // Precompute smoothed values
+        add             w9, w6, w6, lsl #1      // 3*dc
+        add             w9, w9, #2              // 3*dc + 2
+        dup             v3.8h, w9               // broadcast to 16-bit
+
+        // Smoothed first row
+        ushll           v4.8h, v0.8b, #0
+        add             v4.8h, v4.8h, v3.8h
+        ushr            v4.8h, v4.8h, #2
+        xtn             v4.8b, v4.8h            // smoothed row
+
+        // Corner: POS(0,0) = (left[0] + 2*dc + top[0] + 2) >> 2
+        ldrb            w7, [x2]
+        ldrb            w8, [x1]
+        add             w10, w6, w6
+        add             w10, w10, w7
+        add             w10, w10, w8
+        add             w10, w10, #2
+        lsr             w10, w10, #2
+        ins             v4.b[0], w10
+
+        // Smoothed column values
+        ushll           v5.8h, v1.8b, #0
+        add             v5.8h, v5.8h, v3.8h
+        ushr            v5.8h, v5.8h, #2
+        xtn             v5.8b, v5.8h            // smoothed col values in v5.b[0..7]
+
+        // Store row 0 (smoothed)
+        str             d4, [x0]
+
+        // Store DC fill for rows 1-7 first
+        add             x5, x0, x3
+.rept 7
+        str             d2, [x5]
+        add             x5, x5, x3
+.endr
+
+        // Scatter-store column bytes
+        add             x5, x0, x3
+.irp n, 1, 2, 3, 4, 5, 6, 7
+        st1             {v5.b}[\n], [x5]
+        add             x5, x5, x3
+.endr
+        ret
+
+2:      // === Chroma path: plain DC fill ===
+        str             d2, [x0]
+.rept 7
+        add             x0, x0, x3
+        str             d2, [x0]
+.endr
+        ret
+endfunc
+
+// -----------------------------------------------------------------------------
+// pred_dc_16x16_8: DC prediction
+// Arguments:
+// x0: src
+// x1: top
+// x2: left
+// x3: stride
+// w4: c_idx
+// -----------------------------------------------------------------------------
+function ff_hevc_pred_dc_16x16_8_neon, export=1
+        // Load top[0..15] and left[0..15]
+        ldr             q0, [x1]                // top[0..15]
+        ldr             q1, [x2]                // left[0..15]
+
+        // Sum all pixels
+        uaddlv          h2, v0.16b              // sum top
+        uaddlv          h3, v1.16b              // sum left
+        add             v2.4h, v2.4h, v3.4h
+
+        // Add rounding (16) and shift by 5
+        movi            v3.4h, #16
+        add             v2.4h, v2.4h, v3.4h
+        ushr            v2.4h, v2.4h, #5
+        umov            w6, v2.h[0]             // dc as scalar
+        dup             v2.16b, w6              // broadcast dc
+
+        // Check if edge smoothing needed (luma only)
+        cbnz            w4, 2f
+
+        // === Luma path: fill + edge smoothing combined ===
+
+        // Precompute smoothed first row
+        add             w9, w6, w6, lsl #1      // 3*dc
+        add             w9, w9, #2              // 3*dc + 2
+        dup             v3.8h, w9
+
+        ushll           v4.8h, v0.8b, #0
+        ushll2          v5.8h, v0.16b, #0
+        add             v4.8h, v4.8h, v3.8h
+        add             v5.8h, v5.8h, v3.8h
+        ushr            v4.8h, v4.8h, #2
+        ushr            v5.8h, v5.8h, #2
+        xtn             v4.8b, v4.8h
+        xtn2            v4.16b, v5.8h           // smoothed first row
+
+        // Corner
+        ldrb            w7, [x2]
+        ldrb            w8, [x1]
+        add             w10, w6, w6
+        add             w10, w10, w7
+        add             w10, w10, w8
+        add             w10, w10, #2
+        lsr             w10, w10, #2
+        ins             v4.b[0], w10
+
+        // Smoothed column
+        ushll           v5.8h, v1.8b, #0
+        ushll2          v6.8h, v1.16b, #0
+        add             v5.8h, v5.8h, v3.8h
+        add             v6.8h, v6.8h, v3.8h
+        ushr            v5.8h, v5.8h, #2
+        ushr            v6.8h, v6.8h, #2
+        xtn             v5.8b, v5.8h
+        xtn2            v5.16b, v6.8h           // smoothed column values
+
+        // Store row 0 (smoothed)
+        str             q4, [x0]
+
+        // Store DC fill for all 15 remaining rows first
+        add             x5, x0, x3
+.rept 15
+        str             q2, [x5]
+        add             x5, x5, x3
+.endr
+
+        // Now scatter-store column bytes over the DC fill
+        add             x5, x0, x3
+.irp n, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
+        st1             {v5.b}[\n], [x5]
+        add             x5, x5, x3
+.endr
+        ret
+
+2:      // === Chroma path: plain DC fill ===
+        str             q2, [x0]
+.rept 15
+        add             x0, x0, x3
+        str             q2, [x0]
+.endr
+        ret
+endfunc
+
+// -----------------------------------------------------------------------------
+// pred_dc_32x32_8: DC prediction (no edge smoothing)
+// Arguments:
+// x0: src
+// x1: top
+// x2: left
+// x3: stride
+// w4: c_idx
+// -----------------------------------------------------------------------------
+function ff_hevc_pred_dc_32x32_8_neon, export=1
+        // Load top[0..31] and left[0..31]
+        ldp             q0, q1, [x1]            // top[0..31]
+        ldp             q2, q3, [x2]            // left[0..31]
+
+        // Sum all pixels
+        uaddlv          h0, v0.16b
+        uaddlv          h1, v1.16b
+        uaddlv          h2, v2.16b
+        uaddlv          h3, v3.16b
+        add             v0.4h, v0.4h, v1.4h
+        add             v2.4h, v2.4h, v3.4h
+        add             v0.4h, v0.4h, v2.4h
+
+        // Add rounding (32) and shift by 6
+        movi            v2.4h, #32
+        add             v0.4h, v0.4h, v2.4h
+        ushr            v0.4h, v0.4h, #6
+        dup             v0.16b, v0.b[0]
+        mov             v1.16b, v0.16b
+
+        // Store 32 rows
+        mov             w6, #32
+2:
+        stp             q0, q1, [x0]
+        add             x0, x0, x3
+        subs            w6, w6, #1
+        b.ne            2b
+
+        // No edge smoothing for 32x32 (size >= 32)
+        ret
+endfunc
+
+// =============================================================================
+// Planar Prediction
+// =============================================================================
+
+/*
+ * Planar prediction algorithm:
+ * For each pixel (x, y):
+ * POS(x,y) = ((size-1-x)*left[y] + (x+1)*top[size] +
+ *             (size-1-y)*top[x] + (y+1)*left[size] + size) >> (log2_size+1)
+ */
+// -----------------------------------------------------------------------------
+// pred_planar_4x4_8: Planar prediction
+// Arguments:
+// x0: src
+// x1: top
+// x2: left
+// x3: stride
+// -----------------------------------------------------------------------------
+function ff_hevc_pred_planar_4x4_8_neon, export=1
+        // Load reference samples
+        ldr             s0, [x1]                // top[0..3]
+        ldr             s1, [x2]                // left[0..3]
+        ldrb            w4, [x1, #4]            // top[4]
+        ldrb            w5, [x2, #4]            // left[4]
+
+        // Set up weight vectors for the x direction: [3,2,1,0] and [1,2,3,4]
+        movrel          x6, planar_weights_4
+        ldr             d4, [x6]                // weights_dec = [3,2,1,0, ...]
+        ldr             d5, [x6, #8]            // weights_inc = [1,2,3,4, ...]
+
+        // Broadcast top[4] and left[4]
+        dup             v6.8b, w4               // top[size]
+        dup             v7.8b, w5               // left[size]
+
+        // Rounding constant (hoisted out of loop)
+        movi            v20.8h, #4
+
+        // Process row by row
+        mov             w8, #0                  // y = 0
+
+1:
+        // For row y:
+        // weight_y_dec = size - 1 - y = 3 - y
+        // weight_y_inc = y + 1
+
+        mov             w9, #3
+        sub             w9, w9, w8              // 3 - y
+        add             w10, w8, #1             // y + 1
+
+        // Load left[y]
+        ldrb            w11, [x2, w8, uxtw]
+        dup             v2.8b, w11              // broadcast left[y]
+
+        // (size-1-x) * left[y] : use weights_dec * left[y]
+        umull           v16.8h, v4.8b, v2.8b    // v16 = weights_dec * left[y]
+
+        // (x+1) * top[size] : use weights_inc * top[4]
+        umull           v17.8h, v5.8b, v6.8b    // v17 = weights_inc * top[size]
+
+        // (size-1-y) * top[x]
+        dup             v3.8b, w9               // broadcast (3 - y)
+        umull           v18.8h, v3.8b, v0.8b    // v18 = (3-y) * top[x]
+
+        // (y+1) * left[size]
+        dup             v3.8b, w10              // broadcast (y + 1)
+        umull           v19.8h, v3.8b, v7.8b    // v19 = (y+1) * left[size]
+
+        // Sum all terms + rounding
+        add             v16.8h, v16.8h, v17.8h
+        add             v18.8h, v18.8h, v19.8h
+        add             v16.8h, v16.8h, v18.8h
+        add             v16.8h, v16.8h, v20.8h  // + 4 (rounding)
+
+        // Shift right by 3 (log2_size + 1 = 3)
+        shrn            v16.8b, v16.8h, #3
+
+        // Store 4 pixels
+        str             s16, [x0]
+        add             x0, x0, x3
+
+        add             w8, w8, #1
+        cmp             w8, #4
+        b.lt            1b
+
+        ret
+endfunc
+
+// -----------------------------------------------------------------------------
+// pred_planar_8x8_8: Planar prediction
+// Arguments:
+// x0: src
+// x1: top
+// x2: left
+// x3: stride
+// -----------------------------------------------------------------------------
+function ff_hevc_pred_planar_8x8_8_neon, export=1
+        // Load reference samples
+        ldr             d0, [x1]                // top[0..7]
+        ldr             d1, [x2]                // left[0..7]
+        ldrb            w4, [x1, #8]            // top[8]
+        ldrb            w5, [x2, #8]            // left[8]
+
+        // Set up weight vectors
+        movrel          x6, planar_weights_8
+        ldr             d4, [x6]                // weights_dec = [7,6,5,4,3,2,1,0]
+        ldr             d5, [x6, #8]            // weights_inc = [1,2,3,4,5,6,7,8]
+
+        dup             v6.8b, w4               // top[size]
+        dup             v7.8b, w5               // left[size]
+
+        // Rounding constant (hoisted out of loop)
+        movi            v20.8h, #8
+
+        mov             w8, #0
+
+1:
+        mov             w9, #7
+        sub             w9, w9, w8
+        add             w10, w8, #1
+
+        ldrb            w11, [x2, w8, uxtw]
+        dup             v2.8b, w11
+
+        umull           v16.8h, v4.8b, v2.8b
+        umull           v17.8h, v5.8b, v6.8b
+        dup             v3.8b, w9
+        umull           v18.8h, v3.8b, v0.8b
+        dup             v3.8b, w10
+        umull           v19.8h, v3.8b, v7.8b
+
+        add             v16.8h, v16.8h, v17.8h
+        add             v18.8h, v18.8h, v19.8h
+        add             v16.8h, v16.8h, v18.8h
+        add             v16.8h, v16.8h, v20.8h
+
+        shrn            v16.8b, v16.8h, #4
+
+        str             d16, [x0]
+        add             x0, x0, x3
+
+        add             w8, w8, #1
+        cmp             w8, #8
+        b.lt            1b
+
+        ret
+endfunc
+
+// -----------------------------------------------------------------------------
+// pred_planar_16x16_8: Planar prediction
+// Arguments:
+// x0: src
+// x1: top
+// x2: left
+// x3: stride
+// -----------------------------------------------------------------------------
+function ff_hevc_pred_planar_16x16_8_neon, export=1
+        // Load reference samples
+        ldr             q0, [x1]                // top[0..15]
+        ldr             q1, [x2]                // left[0..15]
+        ldrb            w4, [x1, #16]           // top[16]
+        ldrb            w5, [x2, #16]           // left[16]
+
+        // Set up weight vectors for 16 elements
+        movrel          x6, planar_weights_16
+        ldr             q4, [x6]                // weights_dec [15..0]
+        ldr             q5, [x6, #16]           // weights_inc [1..16]
+
+        dup             v6.16b, w4
+        dup             v7.16b, w5
+
+        // Rounding constant (hoisted out of loop)
+        movi            v20.8h, #16
+
+        mov             w8, #0
+
+1:
+        mov             w9, #15
+        sub             w9, w9, w8
+        add             w10, w8, #1
+
+        ldrb            w11, [x2, w8, uxtw]
+        dup             v2.16b, w11
+
+        // Need to process in two halves due to 16-bit intermediate results
+        // Low 8 elements
+        umull           v16.8h, v4.8b, v2.8b
+        umull           v17.8h, v5.8b, v6.8b
+        dup             v3.8b, w9
+        umull           v18.8h, v3.8b, v0.8b
+        dup             v3.8b, w10
+        umull           v19.8h, v3.8b, v7.8b
+
+        add             v16.8h, v16.8h, v17.8h
+        add             v18.8h, v18.8h, v19.8h
+        add             v16.8h, v16.8h, v18.8h
+        add             v16.8h, v16.8h, v20.8h
+        shrn            v16.8b, v16.8h, #5
+
+        // High 8 elements
+        umull2          v21.8h, v4.16b, v2.16b
+        umull2          v22.8h, v5.16b, v6.16b
+        dup             v3.16b, w9               // broadcast (size-1-y) to both low and high halves
+        umull2          v23.8h, v3.16b, v0.16b
+        dup             v3.16b, w10              // broadcast (y+1) to both low and high halves
+        umull2          v24.8h, v3.16b, v7.16b
+
+        add             v21.8h, v21.8h, v22.8h
+        add             v23.8h, v23.8h, v24.8h
+        add             v21.8h, v21.8h, v23.8h
+        add             v21.8h, v21.8h, v20.8h
+        shrn2           v16.16b, v21.8h, #5
+
+        str             q16, [x0]
+        add             x0, x0, x3
+
+        add             w8, w8, #1
+        cmp             w8, #16
+        b.lt            1b
+
+        ret
+endfunc
+
+// -----------------------------------------------------------------------------
+// pred_planar_32x32_8: Planar prediction
+//
+// Formula: POS(x,y) = ((31-x)*left[y] + (x+1)*top[32]
+//                     + (31-y)*top[x]  + (y+1)*left[32] + 32) >> 6
+//
+// Decomposed as:  base[x] = weight_inc[x]*top[32] + 31*top[x] + 32
+//                 Per row:  base[x] += left[32]    (incremental for (y+1)*left[32])
+//                           base[x] -= top[x]      (incremental for (31-y)*top[x])
+//                           result   = base[x] + weight_dec[x]*left[y]
+//
+// Both row_add and the (31-y)*top[x] term are folded into the base,
+// eliminating all GP→NEON scalar broadcasts except for left[y].
+// The loop is fully unrolled (32 rows via macro) to avoid branch overhead
+// and enable NEON-domain left[y] broadcasts from preloaded registers.
+//
+// Arguments:
+// x0: src
+// x1: top
+// x2: left
+// x3: stride
+// -----------------------------------------------------------------------------
+function ff_hevc_pred_planar_32x32_8_neon, export=1
+        stp             d8, d9, [sp, #-64]!
+        stp             d10, d11, [sp, #16]
+        stp             d12, d13, [sp, #32]
+        stp             d14, d15, [sp, #48]
+
+        // Load top[0..31]
+        ldp             q0, q1, [x1]            // top[0..15], top[16..31]
+        ldrb            w4, [x1, #32]           // top[32]
+        ldrb            w5, [x2, #32]           // left[32]
+
+        // Load weight vectors
+        movrel          x6, planar_weights_32
+        ldp             q4, q5, [x6]            // weight_dec = {31,30,...,0}
+        ldp             q6, q7, [x6, #32]       // weight_inc = {1,2,...,32}
+
+        // Precompute term_A = weight_inc * top[32]  (16-bit)
+        dup             v2.16b, w4
+        umull           v8.8h, v6.8b, v2.8b
+        umull2          v9.8h, v6.16b, v2.16b
+        umull           v10.8h, v7.8b, v2.8b
+        umull2          v11.8h, v7.16b, v2.16b
+
+        // Widen top[x] for incremental subtraction
+        ushll           v12.8h, v0.8b, #0
+        ushll2          v13.8h, v0.16b, #0
+
+        // 31*top[x] = top[x]<<5 - top[x]
+        ushll           v6.8h, v0.8b, #5
+        ushll2          v7.8h, v0.16b, #5
+        sub             v6.8h, v6.8h, v12.8h    // 31*top[0..7]
+        sub             v7.8h, v7.8h, v13.8h    // 31*top[8..15]
+
+        // base[0..15] = term_A + 31*top[0..15] + 32
+        movi            v3.8h, #32
+        add             v8.8h, v8.8h, v6.8h
+        add             v8.8h, v8.8h, v3.8h
+        add             v9.8h, v9.8h, v7.8h
+        add             v9.8h, v9.8h, v3.8h
+
+        // Same for top[16..31]
+        ushll           v6.8h, v1.8b, #0
+        ushll2          v7.8h, v1.16b, #0
+        ushll           v14.8h, v1.8b, #5
+        ushll2          v15.8h, v1.16b, #5
+        sub             v14.8h, v14.8h, v6.8h   // 31*top[16..23]
+        sub             v15.8h, v15.8h, v7.8h   // 31*top[24..31]
+        add             v10.8h, v10.8h, v14.8h
+        add             v10.8h, v10.8h, v3.8h
+        add             v11.8h, v11.8h, v15.8h
+        add             v11.8h, v11.8h, v3.8h
+
+        // Compute combined decrement: top[x] - left[32]
+        // Each row: base += left[32] and base -= top[x]
+        // Combined: base -= (top[x] - left[32])
+        dup             v3.8h, w5               // left[32] as 16-bit
+        sub             v12.8h, v12.8h, v3.8h   // top[0..7] - left[32]
+        sub             v13.8h, v13.8h, v3.8h   // top[8..15] - left[32]
+        sub             v6.8h, v6.8h, v3.8h     // top[16..23] - left[32]
+        sub             v7.8h, v7.8h, v3.8h     // top[24..31] - left[32]
+
+        // Now base needs initial +=left[32] for y=0 (row_add = 1*left[32])
+        add             v8.8h, v8.8h, v3.8h
+        add             v9.8h, v9.8h, v3.8h
+        add             v10.8h, v10.8h, v3.8h
+        add             v11.8h, v11.8h, v3.8h
+
+        // Persistent registers:
+        //   v8-v11  = base[0..31] (includes running row_add, decremented by combined each row)
+        //   v12,v13 = top[0..15] - left[32] (combined decrement)
+        //   v6,v7   = top[16..31] - left[32] (combined decrement)
+        //   v4,v5   = weight_dec[0..31] (8-bit)
+        //   v1,v3   = left[0..31] preloaded (8-bit)
+
+        // Load left[0..31] into v1,v3
+        ldp             q1, q3, [x2]
+
+.macro planar32_row lane, leftreg
+        dup             v2.16b, \leftreg\().b[\lane]
+        umull           v16.8h, v4.8b, v2.8b
+        umull2          v17.8h, v4.16b, v2.16b
+        umull           v18.8h, v5.8b, v2.8b
+        umull2          v19.8h, v5.16b, v2.16b
+        add             v16.8h, v16.8h, v8.8h
+        add             v17.8h, v17.8h, v9.8h
+        add             v18.8h, v18.8h, v10.8h
+        add             v19.8h, v19.8h, v11.8h
+        shrn            v14.8b, v16.8h, #6
+        shrn2           v14.16b, v17.8h, #6
+        shrn            v15.8b, v18.8h, #6
+        shrn2           v15.16b, v19.8h, #6
+        stp             q14, q15, [x0]
+        add             x0, x0, x3
+        sub             v8.8h, v8.8h, v12.8h
+        sub             v9.8h, v9.8h, v13.8h
+        sub             v10.8h, v10.8h, v6.8h
+        sub             v11.8h, v11.8h, v7.8h
+.endm
+
+        // Rows 0-15 from v1
+        planar32_row 0, v1
+        planar32_row 1, v1
+        planar32_row 2, v1
+        planar32_row 3, v1
+        planar32_row 4, v1
+        planar32_row 5, v1
+        planar32_row 6, v1
+        planar32_row 7, v1
+        planar32_row 8, v1
+        planar32_row 9, v1
+        planar32_row 10, v1
+        planar32_row 11, v1
+        planar32_row 12, v1
+        planar32_row 13, v1
+        planar32_row 14, v1
+        planar32_row 15, v1
+
+        // Rows 16-31 from v3
+        planar32_row 0, v3
+        planar32_row 1, v3
+        planar32_row 2, v3
+        planar32_row 3, v3
+        planar32_row 4, v3
+        planar32_row 5, v3
+        planar32_row 6, v3
+        planar32_row 7, v3
+        planar32_row 8, v3
+        planar32_row 9, v3
+        planar32_row 10, v3
+        planar32_row 11, v3
+        planar32_row 12, v3
+        planar32_row 13, v3
+        planar32_row 14, v3
+        planar32_row 15, v3
+
+.purgem planar32_row
+
+        ldp             d14, d15, [sp, #48]
+        ldp             d10, d11, [sp, #16]
+        ldp             d12, d13, [sp, #32]
+        ldp             d8, d9, [sp], #64
+        ret
+endfunc
+
+
+// =============================================================================
+// Weight tables for planar prediction
+// =============================================================================
+
+const planar_weights_4, align=4
+        .byte   3, 2, 1, 0, 0, 0, 0, 0          // weights_dec for 4x4
+        .byte   1, 2, 3, 4, 0, 0, 0, 0          // weights_inc for 4x4
+endconst
+
+const planar_weights_8, align=4
+        .byte   7, 6, 5, 4, 3, 2, 1, 0          // weights_dec
+        .byte   1, 2, 3, 4, 5, 6, 7, 8          // weights_inc
+endconst
+
+const planar_weights_16, align=4
+        .byte   15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0
+        .byte   1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
+endconst
+
+const planar_weights_32, align=4
+        .byte   31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16
+        .byte   15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0
+        .byte   1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
+        .byte   17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32
+endconst
diff --git a/libavcodec/hevc/pred.c b/libavcodec/hevc/pred.c
index 8d588382fa..2fd18a9db6 100644
--- a/libavcodec/hevc/pred.c
+++ b/libavcodec/hevc/pred.c
@@ -78,4 +78,7 @@ void ff_hevc_pred_init(HEVCPredContext *hpc, int bit_depth)
 #if ARCH_MIPS
     ff_hevc_pred_init_mips(hpc, bit_depth);
 #endif
+#if ARCH_AARCH64
+    ff_hevc_pred_init_aarch64(hpc, bit_depth);
+#endif
 }
diff --git a/libavcodec/hevc/pred.h b/libavcodec/hevc/pred.h
index 1ac8f9666b..c4bd72b1a3 100644
--- a/libavcodec/hevc/pred.h
+++ b/libavcodec/hevc/pred.h
@@ -44,5 +44,6 @@ typedef struct HEVCPredContext {
 
 void ff_hevc_pred_init(HEVCPredContext *hpc, int bit_depth);
 void ff_hevc_pred_init_mips(HEVCPredContext *hpc, int bit_depth);
+void ff_hevc_pred_init_aarch64(HEVCPredContext *hpc, int bit_depth);
 
 #endif /* AVCODEC_HEVC_PRED_H */
diff --git a/tests/checkasm/Makefile b/tests/checkasm/Makefile
index 48de4d22a0..883255bfe1 100644
--- a/tests/checkasm/Makefile
+++ b/tests/checkasm/Makefile
@@ -42,7 +42,8 @@ AVCODECOBJS-$(CONFIG_HUFFYUV_DECODER)   += huffyuvdsp.o
 AVCODECOBJS-$(CONFIG_JPEG2000_DECODER)  += jpeg2000dsp.o
 AVCODECOBJS-$(CONFIG_OPUS_DECODER)      += opusdsp.o
 AVCODECOBJS-$(CONFIG_PIXBLOCKDSP)       += pixblockdsp.o
-AVCODECOBJS-$(CONFIG_HEVC_DECODER)      += hevc_add_res.o hevc_deblock.o hevc_dequant.o hevc_idct.o hevc_sao.o hevc_pel.o
+AVCODECOBJS-$(CONFIG_HEVC_DECODER)      += hevc_add_res.o hevc_deblock.o hevc_dequant.o \
+					   hevc_idct.o hevc_pel.o hevc_pred.o hevc_sao.o
 AVCODECOBJS-$(CONFIG_PNG_DECODER)       += png.o
 AVCODECOBJS-$(CONFIG_RV34DSP)           += rv34dsp.o
 AVCODECOBJS-$(CONFIG_RV40_DECODER)      += rv40dsp.o
diff --git a/tests/checkasm/checkasm.c b/tests/checkasm/checkasm.c
index bdaaa8695d..f9be3142b6 100644
--- a/tests/checkasm/checkasm.c
+++ b/tests/checkasm/checkasm.c
@@ -190,6 +190,7 @@ static const struct {
         { "hevc_dequant", checkasm_check_hevc_dequant },
         { "hevc_idct", checkasm_check_hevc_idct },
         { "hevc_pel", checkasm_check_hevc_pel },
+        { "hevc_pred", checkasm_check_hevc_pred },
         { "hevc_sao", checkasm_check_hevc_sao },
     #endif
     #if CONFIG_HPELDSP
diff --git a/tests/checkasm/checkasm.h b/tests/checkasm/checkasm.h
index 2a6c7e8ea6..ed9fb23327 100644
--- a/tests/checkasm/checkasm.h
+++ b/tests/checkasm/checkasm.h
@@ -113,6 +113,7 @@ void checkasm_check_hevc_deblock(void);
 void checkasm_check_hevc_dequant(void);
 void checkasm_check_hevc_idct(void);
 void checkasm_check_hevc_pel(void);
+void checkasm_check_hevc_pred(void);
 void checkasm_check_hevc_sao(void);
 void checkasm_check_hpeldsp(void);
 void checkasm_check_huffyuvdsp(void);
diff --git a/tests/checkasm/hevc_pred.c b/tests/checkasm/hevc_pred.c
new file mode 100644
index 0000000000..178dc8cdee
--- /dev/null
+++ b/tests/checkasm/hevc_pred.c
@@ -0,0 +1,227 @@
+/*
+ * Copyright (c) 2026 FFmpeg Project
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License along
+ * with FFmpeg; if not, write to the Free Software Foundation, Inc.,
+ * 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
+ */
+
+#include <string.h>
+#include "checkasm.h"
+#include "libavcodec/hevc/pred.h"
+#include "libavutil/intreadwrite.h"
+#include "libavutil/mem_internal.h"
+
+static const uint32_t pixel_mask[3] = { 0xffffffff, 0x01ff01ff, 0x03ff03ff };
+
+#define SIZEOF_PIXEL ((bit_depth + 7) / 8)
+#define BUF_SIZE (2 * 64 * 64)  /* Enough for 32x32 with stride=64 */
+#define PRED_SIZE 128           /* 4 * MAX_TB_SIZE, large enough for the C reference's reads */
+
+#define randomize_buffers()                        \
+    do {                                           \
+        uint32_t mask = pixel_mask[bit_depth - 8]; \
+        for (int i = 0; i < BUF_SIZE; i += 4) {    \
+            uint32_t r = rnd() & mask;             \
+            AV_WN32A(buf0 + i, r);                 \
+            AV_WN32A(buf1 + i, r);                 \
+        }                                          \
+        /* Start from -4 so that AV_WN32A writes  \
+         * top[-4..-1] and left[-4..-1], ensuring  \
+         * top[-1] and left[-1] contain known data \
+         * since angular pred references them      \
+         * (e.g. mode 10/26 edge filtering,        \
+         * mode 18 diagonal, V/H neg extension). */\
+        for (int i = -4; i < PRED_SIZE; i += 4) {  \
+            uint32_t r = rnd() & mask;             \
+            AV_WN32A(top + i, r);                  \
+            AV_WN32A(left + i, r);                 \
+        }                                          \
+    } while (0)
+
+static void check_pred_dc(HEVCPredContext *h,
+                          uint8_t *buf0, uint8_t *buf1,
+                          uint8_t *top, uint8_t *left, int bit_depth)
+{
+    const char *const block_name[] = { "4x4", "8x8", "16x16", "32x32" };
+    const int block_size[] = { 4, 8, 16, 32 };
+    int log2_size;
+
+    declare_func(void, uint8_t *src, const uint8_t *top,
+                 const uint8_t *left, ptrdiff_t stride,
+                 int log2_size, int c_idx);
+
+    /* Test all 4 sizes: 4x4, 8x8, 16x16, 32x32 */
+    for (log2_size = 2; log2_size <= 5; log2_size++) {
+        int size = block_size[log2_size - 2];
+        ptrdiff_t stride = 64 * SIZEOF_PIXEL;
+
+        if (check_func(h->pred_dc, "hevc_pred_dc_%s_%d",
+                       block_name[log2_size - 2], bit_depth)) {
+            /* Test with c_idx=0 (luma, with edge smoothing for size < 32) */
+            randomize_buffers();
+            call_ref(buf0, top, left, stride, log2_size, 0);
+            call_new(buf1, top, left, stride, log2_size, 0);
+            if (memcmp(buf0, buf1, size * stride))
+                fail();
+
+            /* Test with c_idx=1 (chroma, no edge smoothing) */
+            randomize_buffers();
+            call_ref(buf0, top, left, stride, log2_size, 1);
+            call_new(buf1, top, left, stride, log2_size, 1);
+            if (memcmp(buf0, buf1, size * stride))
+                fail();
+
+            bench_new(buf1, top, left, stride, log2_size, 0);
+        }
+    }
+}
+
+static void check_pred_planar(HEVCPredContext *h,
+                              uint8_t *buf0, uint8_t *buf1,
+                              uint8_t *top, uint8_t *left, int bit_depth)
+{
+    const char *const block_name[] = { "4x4", "8x8", "16x16", "32x32" };
+    const int block_size[] = { 4, 8, 16, 32 };
+    int i;
+
+    declare_func(void, uint8_t *src, const uint8_t *top,
+                 const uint8_t *left, ptrdiff_t stride);
+
+    /* Test all 4 sizes: 4x4, 8x8, 16x16, 32x32 */
+    for (i = 0; i < 4; i++) {
+        int size = block_size[i];
+        ptrdiff_t stride = 64 * SIZEOF_PIXEL;
+
+        if (check_func(h->pred_planar[i], "hevc_pred_planar_%s_%d",
+                       block_name[i], bit_depth)) {
+            randomize_buffers();
+            call_ref(buf0, top, left, stride);
+            call_new(buf1, top, left, stride);
+            if (memcmp(buf0, buf1, size * stride))
+                fail();
+
+            bench_new(buf1, top, left, stride);
+        }
+    }
+}
+
+/*
+ * Angular prediction modes are divided into categories:
+ *
+ * Mode 10: Horizontal pure copy (H pure)
+ * Mode 26: Vertical pure copy (V pure)
+ * Modes 2-9: Horizontal positive angle (H pos) - uses left reference
+ * Modes 11-17: Horizontal negative angle (H neg) - needs reference extension
+ * Modes 18-25: Vertical negative angle (V neg) - needs reference extension
+ * Modes 27-34: Vertical positive angle (V pos) - uses top reference
+ *
+ * Each category has 4 NEON functions for 4x4, 8x8, 16x16, 32x32 sizes.
+ */
+static void check_pred_angular(HEVCPredContext *h,
+                               uint8_t *buf0, uint8_t *buf1,
+                               uint8_t *top, uint8_t *left, int bit_depth)
+{
+    const char *const block_name[] = { "4x4", "8x8", "16x16", "32x32" };
+    const int block_size[] = { 4, 8, 16, 32 };
+    int i, mode;
+
+    declare_func(void, uint8_t *src, const uint8_t *top,
+                 const uint8_t *left, ptrdiff_t stride, int c_idx, int mode);
+
+    /* Test all 4 sizes */
+    for (i = 0; i < 4; i++) {
+        int size = block_size[i];
+        ptrdiff_t stride = 64 * SIZEOF_PIXEL;
+
+        /* Test all 33 angular modes (2-34) */
+        for (mode = 2; mode <= 34; mode++) {
+            const char *mode_category;
+
+            /* Determine mode category for descriptive test name */
+            if (mode == 10)
+                mode_category = "Hpure";
+            else if (mode == 26)
+                mode_category = "Vpure";
+            else if (mode >= 2 && mode <= 9)
+                mode_category = "Hpos";
+            else if (mode >= 11 && mode <= 17)
+                mode_category = "Hneg";
+            else if (mode >= 18 && mode <= 25)
+                mode_category = "Vneg";
+            else /* mode >= 27 && mode <= 34 */
+                mode_category = "Vpos";
+
+            if (check_func(h->pred_angular[i],
+                           "hevc_pred_angular_%s_%s_mode%d_%d",
+                           block_name[i], mode_category, mode, bit_depth)) {
+                /* Test with c_idx=0 (luma) */
+                randomize_buffers();
+                call_ref(buf0, top, left, stride, 0, mode);
+                call_new(buf1, top, left, stride, 0, mode);
+                if (memcmp(buf0, buf1, size * stride))
+                    fail();
+
+                /* Test with c_idx=1 (chroma) for modes 10/26 to cover
+                 * the edge filtering skip path */
+                if (mode == 10 || mode == 26) {
+                    randomize_buffers();
+                    call_ref(buf0, top, left, stride, 1, mode);
+                    call_new(buf1, top, left, stride, 1, mode);
+                    if (memcmp(buf0, buf1, size * stride))
+                        fail();
+                }
+
+                bench_new(buf1, top, left, stride, 0, mode);
+            }
+        }
+    }
+}
+
+void checkasm_check_hevc_pred(void)
+{
+    LOCAL_ALIGNED_32(uint8_t, buf0, [BUF_SIZE]);
+    LOCAL_ALIGNED_32(uint8_t, buf1, [BUF_SIZE]);
+    LOCAL_ALIGNED_32(uint8_t, top_buf, [PRED_SIZE + 16]);
+    LOCAL_ALIGNED_32(uint8_t, left_buf, [PRED_SIZE + 16]);
+    /* Add offset of 8 bytes to allow negative indexing (top[-1], left[-1]) */
+    uint8_t *top = top_buf + 8;
+    uint8_t *left = left_buf + 8;
+    int bit_depth;
+
+    for (bit_depth = 8; bit_depth <= 10; bit_depth += 2) {
+        HEVCPredContext h;
+
+        ff_hevc_pred_init(&h, bit_depth);
+        check_pred_dc(&h, buf0, buf1, top, left, bit_depth);
+    }
+    report("pred_dc");
+
+    for (bit_depth = 8; bit_depth <= 10; bit_depth += 2) {
+        HEVCPredContext h;
+
+        ff_hevc_pred_init(&h, bit_depth);
+        check_pred_planar(&h, buf0, buf1, top, left, bit_depth);
+    }
+    report("pred_planar");
+
+    for (bit_depth = 8; bit_depth <= 10; bit_depth += 2) {
+        HEVCPredContext h;
+
+        ff_hevc_pred_init(&h, bit_depth);
+        check_pred_angular(&h, buf0, buf1, top, left, bit_depth);
+    }
+    report("pred_angular");
+}
diff --git a/tests/fate/checkasm.mak b/tests/fate/checkasm.mak
index 16c6f1f775..515274e9fa 100644
--- a/tests/fate/checkasm.mak
+++ b/tests/fate/checkasm.mak
@@ -30,6 +30,7 @@ FATE_CHECKASM = fate-checkasm-aacencdsp                                 \
                 fate-checkasm-hevc_dequant                              \
                 fate-checkasm-hevc_idct                                 \
                 fate-checkasm-hevc_pel                                  \
+                fate-checkasm-hevc_pred                                 \
                 fate-checkasm-hevc_sao                                  \
                 fate-checkasm-hpeldsp                                   \
                 fate-checkasm-huffyuvdsp                                \
-- 
2.52.0


From 1556baa22abfef1a9c3618d80f189fc7c8042760 Mon Sep 17 00:00:00 2001
From: Jun Zhao <barryjzhao@tencent.com>
Date: Tue, 27 Jan 2026 18:27:45 +0800
Subject: [PATCH 2/8] lavc/hevc: add aarch64 NEON for angular modes 10 and 26

Add NEON-optimized implementations for HEVC angular intra prediction
modes 10 (pure horizontal) and 26 (pure vertical) at 8-bit depth.

Mode 10 (Horizontal):
- Broadcasts left[y] to fill each row
- Applies edge smoothing for luma blocks smaller than 32x32

Mode 26 (Vertical):
- Copies top reference row to all output rows
- Applies edge smoothing for luma blocks smaller than 32x32

Both modes use size-specific load/store operations for efficiency.

Speedup over C on Apple M4 (checkasm --bench, 10-run average):

  Mode 10 (Horizontal):
    4x4: 4.81x    8x8: 4.97x    16x16: 6.50x    32x32: 16.47x

  Mode 26 (Vertical):
    4x4: 1.39x    8x8: 1.59x    16x16: 2.03x    32x32: 3.36x

Signed-off-by: Jun Zhao <barryjzhao@tencent.com>
---
 libavcodec/aarch64/hevcpred_init_aarch64.c |  75 ++++++
 libavcodec/aarch64/hevcpred_neon.S         | 266 +++++++++++++++++++++
 2 files changed, 341 insertions(+)
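
Reviewer note (placed after the `---` separator, so git-am drops it): a scalar C sketch of what the mode 26 path computes, including the luma first-column smoothing that the NEON tail reproduces with sshr/uaddw/sqxtun. Mode 10 is the transpose of this (rows filled from left[y], first row smoothed against the top reference). Function and buffer names here are illustrative, not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint8_t clip_u8(int v) { return v < 0 ? 0 : v > 255 ? 255 : v; }

/* Scalar model of mode 26 (pure vertical): every row copies the top
 * reference; for luma (c_idx == 0) with size < 32 the first column is
 * additionally smoothed towards the left reference, saturating to
 * [0, 255] like sqxtun does. left[-1] must be readable. */
static void pred_vertical_model(uint8_t *src, const uint8_t *top,
                                const uint8_t *left, ptrdiff_t stride,
                                int size, int c_idx)
{
    for (int y = 0; y < size; y++)
        for (int x = 0; x < size; x++)
            src[y * stride + x] = top[x];
    if (c_idx == 0 && size < 32)
        for (int y = 0; y < size; y++)
            src[y * stride] = clip_u8(top[0] + ((left[y] - left[-1]) >> 1));
}
```

This is why the checkasm test exercises both c_idx values for modes 10/26: only the c_idx == 0 path takes the smoothing tail.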

diff --git a/libavcodec/aarch64/hevcpred_init_aarch64.c b/libavcodec/aarch64/hevcpred_init_aarch64.c
index 0d5517aa9b..4186917a77 100644
--- a/libavcodec/aarch64/hevcpred_init_aarch64.c
+++ b/libavcodec/aarch64/hevcpred_init_aarch64.c
@@ -49,6 +49,14 @@ void ff_hevc_pred_planar_16x16_8_neon(uint8_t *src, const uint8_t *top,
 void ff_hevc_pred_planar_32x32_8_neon(uint8_t *src, const uint8_t *top,
                                     const uint8_t *left, ptrdiff_t stride);
 
+// Modes 10 and 26 (pure horizontal / pure vertical)
+void ff_hevc_pred_angular_mode_10_8_neon(uint8_t *src, const uint8_t *top,
+                                        const uint8_t *left, ptrdiff_t stride,
+                                        int c_idx, int log2_size);
+void ff_hevc_pred_angular_mode_26_8_neon(uint8_t *src, const uint8_t *top,
+                                        const uint8_t *left, ptrdiff_t stride,
+                                        int c_idx, int log2_size);
+
 static void pred_dc_neon(uint8_t *src, const uint8_t *top,
                          const uint8_t *left, ptrdiff_t stride,
                          int log2_size, int c_idx)
@@ -71,6 +79,63 @@ static void pred_dc_neon(uint8_t *src, const uint8_t *top,
     }
 }
 
+typedef void (*pred_angular_func)(uint8_t *src, const uint8_t *top,
+                                  const uint8_t *left, ptrdiff_t stride,
+                                  int c_idx, int mode);
+static pred_angular_func pred_angular_c[4];
+
+static void pred_angular_0_neon(uint8_t *src, const uint8_t *top,
+                                const uint8_t *left, ptrdiff_t stride,
+                                int c_idx, int mode)
+{
+    if (mode == 10) {
+        ff_hevc_pred_angular_mode_10_8_neon(src, top, left, stride, c_idx, 2);
+    } else if (mode == 26) {
+        ff_hevc_pred_angular_mode_26_8_neon(src, top, left, stride, c_idx, 2);
+    } else {
+        pred_angular_c[0](src, top, left, stride, c_idx, mode);
+    }
+}
+
+static void pred_angular_1_neon(uint8_t *src, const uint8_t *top,
+                                const uint8_t *left, ptrdiff_t stride,
+                                int c_idx, int mode)
+{
+    if (mode == 10) {
+        ff_hevc_pred_angular_mode_10_8_neon(src, top, left, stride, c_idx, 3);
+    } else if (mode == 26) {
+        ff_hevc_pred_angular_mode_26_8_neon(src, top, left, stride, c_idx, 3);
+    } else {
+        pred_angular_c[1](src, top, left, stride, c_idx, mode);
+    }
+}
+
+static void pred_angular_2_neon(uint8_t *src, const uint8_t *top,
+                                const uint8_t *left, ptrdiff_t stride,
+                                int c_idx, int mode)
+{
+    if (mode == 10) {
+        ff_hevc_pred_angular_mode_10_8_neon(src, top, left, stride, c_idx, 4);
+    } else if (mode == 26) {
+        ff_hevc_pred_angular_mode_26_8_neon(src, top, left, stride, c_idx, 4);
+    } else {
+        pred_angular_c[2](src, top, left, stride, c_idx, mode);
+    }
+}
+
+static void pred_angular_3_neon(uint8_t *src, const uint8_t *top,
+                                const uint8_t *left, ptrdiff_t stride,
+                                int c_idx, int mode)
+{
+    if (mode == 10) {
+        ff_hevc_pred_angular_mode_10_8_neon(src, top, left, stride, c_idx, 5);
+    } else if (mode == 26) {
+        ff_hevc_pred_angular_mode_26_8_neon(src, top, left, stride, c_idx, 5);
+    } else {
+        pred_angular_c[3](src, top, left, stride, c_idx, mode);
+    }
+}
+
 av_cold void ff_hevc_pred_init_aarch64(HEVCPredContext *hpc, int bit_depth)
 {
     int cpu_flags = av_get_cpu_flags();
@@ -84,5 +149,15 @@ av_cold void ff_hevc_pred_init_aarch64(HEVCPredContext *hpc, int bit_depth)
         hpc->pred_planar[1] = ff_hevc_pred_planar_8x8_8_neon;
         hpc->pred_planar[2] = ff_hevc_pred_planar_16x16_8_neon;
         hpc->pred_planar[3] = ff_hevc_pred_planar_32x32_8_neon;
+
+        pred_angular_c[0] = hpc->pred_angular[0];
+        pred_angular_c[1] = hpc->pred_angular[1];
+        pred_angular_c[2] = hpc->pred_angular[2];
+        pred_angular_c[3] = hpc->pred_angular[3];
+
+        hpc->pred_angular[0] = pred_angular_0_neon;
+        hpc->pred_angular[1] = pred_angular_1_neon;
+        hpc->pred_angular[2] = pred_angular_2_neon;
+        hpc->pred_angular[3] = pred_angular_3_neon;
     }
 }
diff --git a/libavcodec/aarch64/hevcpred_neon.S b/libavcodec/aarch64/hevcpred_neon.S
index 77ddd69dbc..a7aecb1076 100644
--- a/libavcodec/aarch64/hevcpred_neon.S
+++ b/libavcodec/aarch64/hevcpred_neon.S
@@ -755,3 +755,269 @@ const planar_weights_32, align=4
         .byte   1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
         .byte   17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32
 endconst
+
+// =============================================================================
+// Angular Prediction
+// =============================================================================
+
+// -----------------------------------------------------------------------------
+// pred_angular_mode_10_8: Horizontal prediction (mode 10)
+// Caller must ensure top[-1] and left[-1] are valid (used for edge smoothing
+// when c_idx == 0 and size < 32).
+// Arguments:
+// x0: src
+// x1: top (unused for H reference modes)
+// x2: left
+// x3: stride
+// w4: c_idx
+// w5: log2_size
+// -----------------------------------------------------------------------------
+function ff_hevc_pred_angular_mode_10_8_neon, export=1
+        cmp             w5, #2
+        b.eq            .Lmode10_4x4
+        cmp             w5, #3
+        b.eq            .Lmode10_8x8
+        cmp             w5, #4
+        b.eq            .Lmode10_16x16
+
+        // --- size 32 ---
+        mov             w7, #0
+.Lmode10_32x32_row:
+        ldrb            w8, [x2, w7, uxtw]     // left[y]
+        dup             v0.16b, w8
+        st1             {v0.16b}, [x0]
+        str             q0, [x0, #16]
+        add             x0, x0, x3
+        add             w7, w7, #1
+        cmp             w7, #32
+        b.lt            .Lmode10_32x32_row
+        // size 32 never does edge smoothing
+        ret
+
+        // --- size 16 ---
+.Lmode10_16x16:
+        mov             x6, x0                  // save src base
+        mov             w7, #0
+.Lmode10_16x16_row:
+        ldrb            w8, [x2, w7, uxtw]
+        dup             v0.16b, w8
+        st1             {v0.16b}, [x0], x3
+        add             w7, w7, #1
+        cmp             w7, #16
+        b.lt            .Lmode10_16x16_row
+        b               .Lmode10_edge_smooth
+
+        // --- size 8, fully unrolled ---
+.Lmode10_8x8:
+        mov             x6, x0                  // save src base
+.irp idx, 0, 1, 2, 3, 4, 5, 6, 7
+        ldrb            w8, [x2, #\idx]
+        dup             v0.8b, w8
+        st1             {v0.8b}, [x0], x3
+.endr
+        b               .Lmode10_edge_smooth
+
+        // --- size 4, fully unrolled ---
+.Lmode10_4x4:
+        mov             x6, x0                  // save src base
+.irp idx, 0, 1, 2, 3
+        ldrb            w8, [x2, #\idx]
+        dup             v0.8b, w8
+        str             s0, [x0]
+        add             x0, x0, x3
+.endr
+
+.Lmode10_edge_smooth:
+        cbnz            w4, 9f
+
+        mov             x0, x6                  // restore src base
+
+        ldrb            w8, [x2]                // left[0]
+        dup             v5.16b, w8
+
+        ldrb            w9, [x1, #-1]           // top[-1]
+        dup             v1.16b, w9
+        uxtl            v1.8h, v1.8b
+
+        cmp             w5, #2
+        b.eq            .Lmode10_smooth_4
+        cmp             w5, #3
+        b.eq            .Lmode10_smooth_8
+
+        // size 16 edge smoothing
+        ldr             q2, [x1]                // top[0..15]
+        uxtl            v3.8h, v2.8b
+        uxtl2           v4.8h, v2.16b
+        sub             v3.8h, v3.8h, v1.8h
+        sub             v4.8h, v4.8h, v1.8h
+        sshr            v3.8h, v3.8h, #1
+        sshr            v4.8h, v4.8h, #1
+        uaddw           v3.8h, v3.8h, v5.8b
+        uaddw2          v4.8h, v4.8h, v5.16b
+        sqxtun          v2.8b, v3.8h
+        sqxtun2         v2.16b, v4.8h
+        st1             {v2.16b}, [x0]
+        ret
+
+.Lmode10_smooth_4:
+        ldr             s2, [x1]
+        uxtl            v3.8h, v2.8b
+        sub             v3.8h, v3.8h, v1.8h
+        sshr            v3.8h, v3.8h, #1
+        uaddw           v3.8h, v3.8h, v5.8b
+        sqxtun          v2.8b, v3.8h
+        st1             {v2.s}[0], [x0]
+        ret
+
+.Lmode10_smooth_8:
+        ldr             d2, [x1]
+        uxtl            v3.8h, v2.8b
+        sub             v3.8h, v3.8h, v1.8h
+        sshr            v3.8h, v3.8h, #1
+        uaddw           v3.8h, v3.8h, v5.8b
+        sqxtun          v2.8b, v3.8h
+        st1             {v2.8b}, [x0]
+
+9:      ret
+endfunc
+
+// -----------------------------------------------------------------------------
+// pred_angular_mode_26_8: Vertical prediction (mode 26)
+// Caller must ensure top[-1] and left[-1] are valid (used for edge smoothing
+// when c_idx == 0 and size < 32).
+// Arguments:
+// x0: src
+// x1: top
+// x2: left (unused for V reference modes)
+// x3: stride
+// w4: c_idx
+// w5: log2_size
+// -----------------------------------------------------------------------------
+function ff_hevc_pred_angular_mode_26_8_neon, export=1
+        mov             w6, #1
+        lsl             w6, w6, w5      // size
+        mov             x7, x0          // x7 = write pointer (preserve x0)
+
+        cmp             w5, #2
+        b.ne            3f
+        // size 4: copy top[0..3], 4 rows
+        ldr             s0, [x1]        // Load top[0..3] once
+        mov             w9, w6          // Loop counter = 4
+
+104:    st1             {v0.s}[0], [x7], x3 // Store 4 bytes, increment stride
+        subs            w9, w9, #1
+        b.gt            104b
+        b               .Lmode26_edge_smooth
+
+3:      cmp             w5, #3
+        b.ne            4f
+        // size 8: copy top[0..7], 8 rows
+        ldr             d0, [x1]        // Load top[0..7] once
+        mov             w9, w6          // Loop counter = 8
+
+108:    st1             {v0.8b}, [x7], x3
+        subs            w9, w9, #1
+        b.gt            108b
+        b               .Lmode26_edge_smooth
+
+4:      cmp             w5, #4
+        b.ne            0f
+        // size 16: copy top[0..15], 16 rows
+        ldr             q0, [x1]        // Load top[0..15] once
+        mov             w9, w6          // Loop counter = 16
+
+116:    st1             {v0.16b}, [x7], x3
+        subs            w9, w9, #1
+        b.gt            116b
+        b               .Lmode26_edge_smooth
+
+0:      // size >= 32: load and copy
+        ldp             q0, q1, [x1]    // Load top[0..31] once
+        mov             x7, x0
+        mov             w9, w6          // Loop counter = size
+
+132:    stp             q0, q1, [x7]    // Store 32 bytes
+        add             x7, x7, x3      // Advance output pointer
+        subs            w9, w9, #1
+        b.gt            132b
+
+.Lmode26_edge_smooth:
+        cbnz            w4, 9f
+        cmp             w5, #5
+        b.ge            9f
+
+        // Restore x0 to original src (x7 has moved to src + size*stride)
+        mul             x9, x3, x6
+        sub             x0, x7, x9              // x0 = x7 - size*stride = original src
+
+        ldrb            w8, [x1]
+        ldrb            w9, [x2, #-1]
+        dup             v5.16b, w8              // v5 = top[0] (keep bytes for uaddw)
+        dup             v1.16b, w9
+        uxtl            v1.8h, v1.8b            // Unsigned extend left[-1] to halfwords in v1
+
+        cmp             w5, #2
+        b.eq            224f
+        cmp             w5, #3
+        b.eq            228f
+
+        ldr             q2, [x2]
+        uxtl            v3.8h, v2.8b           // Unsigned extend left[0..7] to halfwords
+        uxtl2           v4.8h, v2.16b          // Unsigned extend left[8..15] to halfwords
+        sub             v3.8h, v3.8h, v1.8h    // Subtract left[-1] (result can be negative)
+        sub             v4.8h, v4.8h, v1.8h
+        sshr            v3.8h, v3.8h, #1       // Arithmetic shift right by 1
+        sshr            v4.8h, v4.8h, #1
+        uaddw           v3.8h, v3.8h, v5.8b    // Add top[0] from v5 (unsigned extend, top[0] is unsigned)
+        uaddw2          v4.8h, v4.8h, v5.16b
+        sqxtun          v3.8b, v3.8h           // Saturate back to bytes
+        sqxtun2         v3.16b, v4.8h
+
+        st1             {v3.b}[0], [x0], x3
+        st1             {v3.b}[1], [x0], x3
+        st1             {v3.b}[2], [x0], x3
+        st1             {v3.b}[3], [x0], x3
+        st1             {v3.b}[4], [x0], x3
+        st1             {v3.b}[5], [x0], x3
+        st1             {v3.b}[6], [x0], x3
+        st1             {v3.b}[7], [x0], x3
+        st1             {v3.b}[8], [x0], x3
+        st1             {v3.b}[9], [x0], x3
+        st1             {v3.b}[10], [x0], x3
+        st1             {v3.b}[11], [x0], x3
+        st1             {v3.b}[12], [x0], x3
+        st1             {v3.b}[13], [x0], x3
+        st1             {v3.b}[14], [x0], x3
+        st1             {v3.b}[15], [x0], x3
+        b               9f
+
+224:    ldr             s2, [x2]
+        uxtl            v3.8h, v2.8b           // Unsigned extend left[0..3] to halfwords
+        sub             v3.8h, v3.8h, v1.8h    // Subtract left[-1] (result can be negative)
+        sshr            v3.8h, v3.8h, #1       // Arithmetic shift right by 1
+        uaddw           v3.8h, v3.8h, v5.8b    // Add top[0] from v5 (unsigned extend, top[0] is unsigned)
+        sqxtun          v3.8b, v3.8h           // Saturate back to bytes
+        st1             {v3.b}[0], [x0], x3
+        st1             {v3.b}[1], [x0], x3
+        st1             {v3.b}[2], [x0], x3
+        st1             {v3.b}[3], [x0], x3
+        b               9f
+
+228:    ldr             d2, [x2]
+        uxtl            v3.8h, v2.8b           // Unsigned extend left[0..7] to halfwords
+        sub             v3.8h, v3.8h, v1.8h    // Subtract left[-1] (result can be negative)
+        sshr            v3.8h, v3.8h, #1       // Arithmetic shift right by 1
+        uaddw           v3.8h, v3.8h, v5.8b    // Add top[0] from v5 (unsigned extend, top[0] is unsigned)
+        sqxtun          v3.8b, v3.8h           // Saturate back to bytes
+        st1             {v3.b}[0], [x0], x3
+        st1             {v3.b}[1], [x0], x3
+        st1             {v3.b}[2], [x0], x3
+        st1             {v3.b}[3], [x0], x3
+        st1             {v3.b}[4], [x0], x3
+        st1             {v3.b}[5], [x0], x3
+        st1             {v3.b}[6], [x0], x3
+        st1             {v3.b}[7], [x0], x3
+        b               9f
+
+9:      ret
+endfunc
-- 
2.52.0


From bfc914bce0c593a2981c646a2197f6e5016b9e04 Mon Sep 17 00:00:00 2001
From: Jun Zhao <barryjzhao@tencent.com>
Date: Tue, 27 Jan 2026 18:33:17 +0800
Subject: [PATCH 3/8] lavc/hevc: add aarch64 NEON for angular mode 18

Add NEON-optimized implementation for HEVC angular intra prediction
mode 18 (diagonal mode, angle=-32) at 8-bit depth.

Mode 18 is a special case where:
- angle = -32, so idx = -(y+1), fact = 0 (no interpolation needed)
- Row y copies from ref[-y..size-1-y], where ref is built from
  reversed left samples and top samples

Supports all block sizes (4x4, 8x8, 16x16, 32x32):
- 4x4/8x8/16x16: slides a register-based ref window with ext instructions
- 32x32: builds the ref array on the stack to cover the larger reference range
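
The per-row copy described above can be sketched as a scalar C model
(pred_mode18_model is a hypothetical helper for illustration, not part of
this patch):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of mode 18 (angle = -32): build ref[] from the reversed
 * left samples plus top[-1..], then row y is a plain copy of
 * ref[-y..size-1-y]. */
static void pred_mode18_model(uint8_t *dst, ptrdiff_t stride,
                              const uint8_t *top, const uint8_t *left,
                              int size)
{
    uint8_t buf[2 * 32];                     /* enough for size up to 32 */
    uint8_t *ref = buf + size;               /* ref[0] == top[-1] */

    for (int i = 0; i < size; i++)
        ref[-1 - i] = left[i];               /* ref[-1-i] = left[i] (reversed) */
    memcpy(ref, top - 1, size);              /* ref[0..size-1] = top[-1..size-2] */

    for (int y = 0; y < size; y++)
        memcpy(dst + y * stride, ref - y, size);  /* row y = ref[-y..size-1-y] */
}
```

For a 4x4 block this reproduces exactly the row layout listed in the asm
comments (row 0: top[-1..2], row 1: left[0], top[-1..1], and so on).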

Speedup over C on Apple M4 (checkasm --bench, 10-run average):

    4x4: 2.94x    8x8: 4.93x    16x16: 2.21x    32x32: 3.24x

Signed-off-by: Jun Zhao <barryjzhao@tencent.com>
---
 libavcodec/aarch64/hevcpred_init_aarch64.c |  22 ++
 libavcodec/aarch64/hevcpred_neon.S         | 225 +++++++++++++++++++++
 2 files changed, 247 insertions(+)

diff --git a/libavcodec/aarch64/hevcpred_init_aarch64.c b/libavcodec/aarch64/hevcpred_init_aarch64.c
index 4186917a77..42e57314f9 100644
--- a/libavcodec/aarch64/hevcpred_init_aarch64.c
+++ b/libavcodec/aarch64/hevcpred_init_aarch64.c
@@ -57,6 +57,20 @@ void ff_hevc_pred_angular_mode_26_8_neon(uint8_t *src, const uint8_t *top,
                                         const uint8_t *left, ptrdiff_t stride,
                                         int c_idx, int log2_size);
 
+// Mode 18 (diagonal, angle=-32)
+void ff_hevc_pred_angular_mode_18_4x4_8_neon(uint8_t *src, const uint8_t *top,
+                                            const uint8_t *left, ptrdiff_t stride,
+                                            int c_idx, int log2_size);
+void ff_hevc_pred_angular_mode_18_8x8_8_neon(uint8_t *src, const uint8_t *top,
+                                            const uint8_t *left, ptrdiff_t stride,
+                                            int c_idx, int log2_size);
+void ff_hevc_pred_angular_mode_18_16x16_8_neon(uint8_t *src, const uint8_t *top,
+                                              const uint8_t *left, ptrdiff_t stride,
+                                              int c_idx, int log2_size);
+void ff_hevc_pred_angular_mode_18_32x32_8_neon(uint8_t *src, const uint8_t *top,
+                                              const uint8_t *left, ptrdiff_t stride,
+                                              int c_idx, int log2_size);
+
 static void pred_dc_neon(uint8_t *src, const uint8_t *top,
                          const uint8_t *left, ptrdiff_t stride,
                          int log2_size, int c_idx)
@@ -92,6 +106,8 @@ static void pred_angular_0_neon(uint8_t *src, const uint8_t *top,
         ff_hevc_pred_angular_mode_10_8_neon(src, top, left, stride, c_idx, 2);
     } else if (mode == 26) {
         ff_hevc_pred_angular_mode_26_8_neon(src, top, left, stride, c_idx, 2);
+    } else if (mode == 18) {
+        ff_hevc_pred_angular_mode_18_4x4_8_neon(src, top, left, stride, c_idx, 2);
     } else {
         pred_angular_c[0](src, top, left, stride, c_idx, mode);
     }
@@ -105,6 +121,8 @@ static void pred_angular_1_neon(uint8_t *src, const uint8_t *top,
         ff_hevc_pred_angular_mode_10_8_neon(src, top, left, stride, c_idx, 3);
     } else if (mode == 26) {
         ff_hevc_pred_angular_mode_26_8_neon(src, top, left, stride, c_idx, 3);
+    } else if (mode == 18) {
+        ff_hevc_pred_angular_mode_18_8x8_8_neon(src, top, left, stride, c_idx, 3);
     } else {
         pred_angular_c[1](src, top, left, stride, c_idx, mode);
     }
@@ -118,6 +136,8 @@ static void pred_angular_2_neon(uint8_t *src, const uint8_t *top,
         ff_hevc_pred_angular_mode_10_8_neon(src, top, left, stride, c_idx, 4);
     } else if (mode == 26) {
         ff_hevc_pred_angular_mode_26_8_neon(src, top, left, stride, c_idx, 4);
+    } else if (mode == 18) {
+        ff_hevc_pred_angular_mode_18_16x16_8_neon(src, top, left, stride, c_idx, 4);
     } else {
         pred_angular_c[2](src, top, left, stride, c_idx, mode);
     }
@@ -131,6 +151,8 @@ static void pred_angular_3_neon(uint8_t *src, const uint8_t *top,
         ff_hevc_pred_angular_mode_10_8_neon(src, top, left, stride, c_idx, 5);
     } else if (mode == 26) {
         ff_hevc_pred_angular_mode_26_8_neon(src, top, left, stride, c_idx, 5);
+    } else if (mode == 18) {
+        ff_hevc_pred_angular_mode_18_32x32_8_neon(src, top, left, stride, c_idx, 5);
     } else {
         pred_angular_c[3](src, top, left, stride, c_idx, mode);
     }
diff --git a/libavcodec/aarch64/hevcpred_neon.S b/libavcodec/aarch64/hevcpred_neon.S
index a7aecb1076..1df1a48e47 100644
--- a/libavcodec/aarch64/hevcpred_neon.S
+++ b/libavcodec/aarch64/hevcpred_neon.S
@@ -1021,3 +1021,228 @@ function ff_hevc_pred_angular_mode_26_8_neon, export=1
 
 9:      ret
 endfunc
+
+// -----------------------------------------------------------------------------
+// pred_angular_mode_18_4x4_8: Mode 18 prediction for 4x4 block
+// Row 0: top[-1], top[0], top[1], top[2]
+// Row 1: left[0], top[-1], top[0], top[1]
+// Row 2: left[1], left[0], top[-1], top[0]
+// Row 3: left[2], left[1], left[0], top[-1]
+// -----------------------------------------------------------------------------
+function ff_hevc_pred_angular_mode_18_4x4_8_neon, export=1
+        // Build ref array in register
+        // ref[-4..-1] = left[3], left[2], left[1], left[0]  (reversed)
+        // ref[0..4] = top[-1..3]
+
+        // Load left[0..3] and reverse
+        ldr             s0, [x2]                // left[0..3]
+        rev32           v0.8b, v0.8b            // v0 = {left[3], left[2], left[1], left[0], ...}
+
+        // Load top[-1..3]
+        sub             x4, x1, #1
+        ldr             d1, [x4]                // top[-1..6]
+
+        // Combine: {left[3,2,1,0], top[-1,0,1,2,3,...]}
+        ins             v0.s[1], v1.s[0]        // v0 = {left[3,2,1,0], top[-1,0,1,2], ...}
+
+        // Row 0: ref[0..3] = top[-1..2] = v0[4..7]
+        ext             v2.8b, v0.8b, v0.8b, #4
+        str             s2, [x0]
+        add             x0, x0, x3
+
+        // Row 1: ref[-1..2] = v0[3..6]
+        ext             v2.8b, v0.8b, v0.8b, #3
+        str             s2, [x0]
+        add             x0, x0, x3
+
+        // Row 2: ref[-2..1] = v0[2..5]
+        ext             v2.8b, v0.8b, v0.8b, #2
+        str             s2, [x0]
+        add             x0, x0, x3
+
+        // Row 3: ref[-3..0] = v0[1..4]
+        ext             v2.8b, v0.8b, v0.8b, #1
+        str             s2, [x0]
+
+        ret
+endfunc
+
+// -----------------------------------------------------------------------------
+// pred_angular_mode_18_8x8_8: Mode 18 prediction for 8x8 block
+// -----------------------------------------------------------------------------
+function ff_hevc_pred_angular_mode_18_8x8_8_neon, export=1
+        // ref[-8..-1] = left[7..0] (reversed)
+        // ref[0..8] = top[-1..7]
+
+        // Load left[0..7] and reverse
+        ldr             d0, [x2]                // left[0..7]
+        rev64           v0.8b, v0.8b            // {left[7..0]}
+
+        // Load top[-1..7]
+        sub             x4, x1, #1
+        ldr             q1, [x4]                // top[-1..14]
+
+        // Combine into v2 (16 bytes): {left[7..0], top[-1..7]}
+        mov             v2.d[0], v0.d[0]        // v2[0..7] = left[7..0]
+        mov             v2.d[1], v1.d[0]        // v2[8..15] = top[-1..6]
+
+        // Row 0: ref[0..7] = top[-1..6] = v2[8..15]
+        mov             v3.d[0], v2.d[1]
+        str             d3, [x0]
+        add             x0, x0, x3
+
+        // Row 1-7: use ext with decreasing offset
+        ext             v3.16b, v2.16b, v2.16b, #7
+        str             d3, [x0]
+        add             x0, x0, x3
+
+        ext             v3.16b, v2.16b, v2.16b, #6
+        str             d3, [x0]
+        add             x0, x0, x3
+
+        ext             v3.16b, v2.16b, v2.16b, #5
+        str             d3, [x0]
+        add             x0, x0, x3
+
+        ext             v3.16b, v2.16b, v2.16b, #4
+        str             d3, [x0]
+        add             x0, x0, x3
+
+        ext             v3.16b, v2.16b, v2.16b, #3
+        str             d3, [x0]
+        add             x0, x0, x3
+
+        ext             v3.16b, v2.16b, v2.16b, #2
+        str             d3, [x0]
+        add             x0, x0, x3
+
+        ext             v3.16b, v2.16b, v2.16b, #1
+        str             d3, [x0]
+
+        ret
+endfunc
+
+// -----------------------------------------------------------------------------
+// pred_angular_mode_18_16x16_8: Mode 18 prediction for 16x16 block
+// ref[-16..-1] = left[15..0] reversed, ref[0..15] = top[-1..14]
+// -----------------------------------------------------------------------------
+function ff_hevc_pred_angular_mode_18_16x16_8_neon, export=1
+        // Register-based approach using EXT to slide a window across {v0:v1}.
+        // v0 = left[15..0] (reversed), v1 = top[-1..14]
+        // Row k: need ref[-k..15-k] = EXT(v0, v1, #16-k) for k=1..15, row 0 = v1.
+
+        ldr             q0, [x2]                // left[0..15]
+        rev64           v0.16b, v0.16b          // reverse in 64-bit lanes
+        ext             v0.16b, v0.16b, v0.16b, #8  // v0 = left[15..0]
+        sub             x4, x1, #1
+        ldr             q1, [x4]                // v1 = top[-1..14]
+
+        // Row 0: ref[0..15] = v1
+        str             q1, [x0]
+        add             x0, x0, x3
+        // Row 1: EXT(v0, v1, #15) = {v0[15], v1[0..14]} = {left[0], top[-1..13]}
+        ext             v2.16b, v0.16b, v1.16b, #15
+        str             q2, [x0]
+        add             x0, x0, x3
+        // Row 2
+        ext             v2.16b, v0.16b, v1.16b, #14
+        str             q2, [x0]
+        add             x0, x0, x3
+        // Row 3
+        ext             v2.16b, v0.16b, v1.16b, #13
+        str             q2, [x0]
+        add             x0, x0, x3
+        // Row 4
+        ext             v2.16b, v0.16b, v1.16b, #12
+        str             q2, [x0]
+        add             x0, x0, x3
+        // Row 5
+        ext             v2.16b, v0.16b, v1.16b, #11
+        str             q2, [x0]
+        add             x0, x0, x3
+        // Row 6
+        ext             v2.16b, v0.16b, v1.16b, #10
+        str             q2, [x0]
+        add             x0, x0, x3
+        // Row 7
+        ext             v2.16b, v0.16b, v1.16b, #9
+        str             q2, [x0]
+        add             x0, x0, x3
+        // Row 8
+        ext             v2.16b, v0.16b, v1.16b, #8
+        str             q2, [x0]
+        add             x0, x0, x3
+        // Row 9
+        ext             v2.16b, v0.16b, v1.16b, #7
+        str             q2, [x0]
+        add             x0, x0, x3
+        // Row 10
+        ext             v2.16b, v0.16b, v1.16b, #6
+        str             q2, [x0]
+        add             x0, x0, x3
+        // Row 11
+        ext             v2.16b, v0.16b, v1.16b, #5
+        str             q2, [x0]
+        add             x0, x0, x3
+        // Row 12
+        ext             v2.16b, v0.16b, v1.16b, #4
+        str             q2, [x0]
+        add             x0, x0, x3
+        // Row 13
+        ext             v2.16b, v0.16b, v1.16b, #3
+        str             q2, [x0]
+        add             x0, x0, x3
+        // Row 14
+        ext             v2.16b, v0.16b, v1.16b, #2
+        str             q2, [x0]
+        add             x0, x0, x3
+        // Row 15
+        ext             v2.16b, v0.16b, v1.16b, #1
+        str             q2, [x0]
+
+        ret
+endfunc
+
+// -----------------------------------------------------------------------------
+// pred_angular_mode_18_32x32_8: Mode 18 prediction for 32x32 block
+// ref[-32..-1] = left[31..0] reversed, ref[0..31] = top[-1..30]
+// -----------------------------------------------------------------------------
+function ff_hevc_pred_angular_mode_18_32x32_8_neon, export=1
+        // Use memory-based approach: load from combined memory layout
+        // For row y: load 32 bytes from top[-1-y..30-y]
+        // When y > 0, some bytes come from extended left reference
+
+        // Build ref array on stack (64 bytes: ref[-32..31])
+        sub             sp, sp, #64
+
+        // Store left[31..0] reversed at sp[0..31] (ref[-32..-1])
+        ldp             q0, q1, [x2]            // left[0..31]
+        rev64           v0.16b, v0.16b
+        ext             v0.16b, v0.16b, v0.16b, #8  // left[15..0]
+        rev64           v1.16b, v1.16b
+        ext             v1.16b, v1.16b, v1.16b, #8  // left[31..16]
+        stp             q1, q0, [sp]            // {left[31..16], left[15..0]}
+
+        // Store top[-1..30] at sp[32..63] (ref[0..31])
+        sub             x4, x1, #1
+        ldp             q2, q3, [x4]            // top[-1..30]
+        stp             q2, q3, [sp, #32]
+
+        // ref_base = sp + 32 (so ref[0] = sp[32], ref[-1] = sp[31], etc.)
+        add             x4, sp, #32
+
+        mov             w5, #0
+1:
+        // Row y: load from ref[-y..31-y] = &ref_base[-y]
+        sub             x6, x4, w5, sxtw
+        ldp             q0, q1, [x6]
+        stp             q0, q1, [x0]
+        add             x0, x0, x3
+
+        add             w5, w5, #1
+        cmp             w5, #32
+        b.lt            1b
+
+        add             sp, sp, #64
+        ret
+endfunc
-- 
2.52.0


From 960b8c1ec464b2fa4ad55ee395363d85a57d2a68 Mon Sep 17 00:00:00 2001
From: Jun Zhao <barryjzhao@tencent.com>
Date: Tue, 27 Jan 2026 18:36:55 +0800
Subject: [PATCH 4/8] lavc/hevc: add aarch64 NEON for angular V positive (modes
 27-34)

Add NEON-optimized implementations for HEVC angular intra prediction
modes 27-34 (vertical positive angles) at 8-bit depth.

These modes use the top reference with positive angles, computing:
- idx = ((y+1) * angle) >> 5
- fact = ((y+1) * angle) & 31
- Interpolate between top[x + idx] and top[x + idx + 1] using fact

Mode 34 (angle=32) is optimized as a pure diagonal copy since fact=0.

Supports all block sizes (4x4, 8x8, 16x16, 32x32).
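
The per-pixel computation above, including the fact = 0 degenerate case
that makes mode 34 a pure copy, can be sketched in scalar C
(angular_v_pos_pixel is a hypothetical helper for illustration, not part
of this patch):

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of one predicted pixel for modes 27-34.  top[-1] is the
 * corner sample, so the spec's ref[k] corresponds to top[k - 1] here. */
static uint8_t angular_v_pos_pixel(const uint8_t *top, int angle, int x, int y)
{
    int pos  = (y + 1) * angle;   /* accumulated angle for this row */
    int idx  = pos >> 5;
    int fact = pos & 31;
    /* when fact == 0 this reduces exactly to top[x + idx] */
    return ((32 - fact) * top[x + idx] + fact * top[x + idx + 1] + 16) >> 5;
}
```

This is also why the asm interpolates unconditionally: the +16 rounding
makes the fact = 0 case collapse to the plain sample with no branch.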

Speedup over C on Apple M4 (checkasm --bench, 10-run average):

    4x4:   1.42x - 2.30x    8x8:   3.19x - 3.39x
    16x16: 1.69x - 7.02x    32x32: 3.12x - 10.30x

Signed-off-by: Jun Zhao <barryjzhao@tencent.com>
---
 libavcodec/aarch64/hevcpred_init_aarch64.c |  22 ++
 libavcodec/aarch64/hevcpred_neon.S         | 304 +++++++++++++++++++++
 2 files changed, 326 insertions(+)

diff --git a/libavcodec/aarch64/hevcpred_init_aarch64.c b/libavcodec/aarch64/hevcpred_init_aarch64.c
index 42e57314f9..3d27c251e1 100644
--- a/libavcodec/aarch64/hevcpred_init_aarch64.c
+++ b/libavcodec/aarch64/hevcpred_init_aarch64.c
@@ -71,6 +71,20 @@ void ff_hevc_pred_angular_mode_18_32x32_8_neon(uint8_t *src, const uint8_t *top,
                                               const uint8_t *left, ptrdiff_t stride,
                                               int c_idx, int log2_size);
 
+// Positive angle vertical modes (mode 27-34)
+void ff_hevc_pred_angular_v_pos_4x4_8_neon(uint8_t *src, const uint8_t *top,
+                                          const uint8_t *left, ptrdiff_t stride,
+                                          int c_idx, int mode);
+void ff_hevc_pred_angular_v_pos_8x8_8_neon(uint8_t *src, const uint8_t *top,
+                                          const uint8_t *left, ptrdiff_t stride,
+                                          int c_idx, int mode);
+void ff_hevc_pred_angular_v_pos_16x16_8_neon(uint8_t *src, const uint8_t *top,
+                                            const uint8_t *left, ptrdiff_t stride,
+                                            int c_idx, int mode);
+void ff_hevc_pred_angular_v_pos_32x32_8_neon(uint8_t *src, const uint8_t *top,
+                                            const uint8_t *left, ptrdiff_t stride,
+                                            int c_idx, int mode);
+
 static void pred_dc_neon(uint8_t *src, const uint8_t *top,
                          const uint8_t *left, ptrdiff_t stride,
                          int log2_size, int c_idx)
@@ -108,6 +122,8 @@ static void pred_angular_0_neon(uint8_t *src, const uint8_t *top,
         ff_hevc_pred_angular_mode_26_8_neon(src, top, left, stride, c_idx, 2);
     } else if (mode == 18) {
         ff_hevc_pred_angular_mode_18_4x4_8_neon(src, top, left, stride, c_idx, 2);
+    } else if (mode >= 27) {
+        ff_hevc_pred_angular_v_pos_4x4_8_neon(src, top, left, stride, c_idx, mode);
     } else {
         pred_angular_c[0](src, top, left, stride, c_idx, mode);
     }
@@ -123,6 +139,8 @@ static void pred_angular_1_neon(uint8_t *src, const uint8_t *top,
         ff_hevc_pred_angular_mode_26_8_neon(src, top, left, stride, c_idx, 3);
     } else if (mode == 18) {
         ff_hevc_pred_angular_mode_18_8x8_8_neon(src, top, left, stride, c_idx, 3);
+    } else if (mode >= 27) {
+        ff_hevc_pred_angular_v_pos_8x8_8_neon(src, top, left, stride, c_idx, mode);
     } else {
         pred_angular_c[1](src, top, left, stride, c_idx, mode);
     }
@@ -138,6 +156,8 @@ static void pred_angular_2_neon(uint8_t *src, const uint8_t *top,
         ff_hevc_pred_angular_mode_26_8_neon(src, top, left, stride, c_idx, 4);
     } else if (mode == 18) {
         ff_hevc_pred_angular_mode_18_16x16_8_neon(src, top, left, stride, c_idx, 4);
+    } else if (mode >= 27) {
+        ff_hevc_pred_angular_v_pos_16x16_8_neon(src, top, left, stride, c_idx, mode);
     } else {
         pred_angular_c[2](src, top, left, stride, c_idx, mode);
     }
@@ -153,6 +173,8 @@ static void pred_angular_3_neon(uint8_t *src, const uint8_t *top,
         ff_hevc_pred_angular_mode_26_8_neon(src, top, left, stride, c_idx, 5);
     } else if (mode == 18) {
         ff_hevc_pred_angular_mode_18_32x32_8_neon(src, top, left, stride, c_idx, 5);
+    } else if (mode >= 27) {
+        ff_hevc_pred_angular_v_pos_32x32_8_neon(src, top, left, stride, c_idx, mode);
     } else {
         pred_angular_c[3](src, top, left, stride, c_idx, mode);
     }
diff --git a/libavcodec/aarch64/hevcpred_neon.S b/libavcodec/aarch64/hevcpred_neon.S
index 1df1a48e47..37ddab42bf 100644
--- a/libavcodec/aarch64/hevcpred_neon.S
+++ b/libavcodec/aarch64/hevcpred_neon.S
@@ -1246,3 +1246,307 @@ function ff_hevc_pred_angular_mode_18_32x32_8_neon, export=1
         add             sp, sp, #64
         ret
 endfunc
+
+// =============================================================================
+// Angular Prediction - Vertical reference modes (Mode 27-34)
+// =============================================================================
+
+// Angle table for V reference positive angles (mode 27-34)
+// angle = intra_pred_angle_v[mode - 27]
+const intra_pred_angle_v, align=4
+        .byte   2       // mode 27
+        .byte   5       // mode 28
+        .byte   9       // mode 29
+        .byte   13      // mode 30
+        .byte   17      // mode 31
+        .byte   21      // mode 32
+        .byte   26      // mode 33
+        .byte   32      // mode 34
+endconst
+
+// -----------------------------------------------------------------------------
+// pred_angular_v_pos_4x4_8: Vertical reference positive angle prediction (mode 27-34)
+// Arguments:
+// x0: src
+// x1: top
+// x2: left (unused for V reference modes)
+// x3: stride
+// w4: c_idx
+// w5: mode
+// -----------------------------------------------------------------------------
+function ff_hevc_pred_angular_v_pos_4x4_8_neon, export=1
+        // Load angle from table
+        movrel          x6, intra_pred_angle_v
+        sub             w7, w5, #27            // mode - 27 (index into angle table)
+        ldrsb           w8, [x6, w7, sxtw]     // angle = intra_pred_angle_v[mode-27]
+
+        // For mode 34 (angle=32), fact is always 0, optimize as pure copy
+        cmp             w8, #32
+        b.eq            .Lv_pos_4x4_mode34
+
+        mov             w10, #0                 // angle_acc = 0
+        movi            v18.16b, #32            // constant 32 for NEON-domain weight computation
+
+.macro v_pos_4x4_row
+        add             w10, w10, w8            // angle_acc = (y+1) * angle
+        asr             w11, w10, #5            // idx = angle_acc >> 5
+        and             w12, w10, #31           // fact = angle_acc & 31
+
+        // Load reference pixels top[idx..idx+4]
+        add             x13, x1, w11, sxtw      // x13 = top + idx
+        ldr             s0, [x13]
+        ldr             s1, [x13, #1]
+
+        // Unconditional interpolation: ((32-fact)*ref[idx+1] + fact*ref[idx+2] + 16) >> 5
+        // When fact=0, this simplifies to ref[idx+1] exactly due to rounding in rshrn.
+        dup             v17.8b, w12             // broadcast fact
+        sub             v16.8b, v18.8b, v17.8b
+
+        umull           v20.8h, v0.8b, v16.8b   // (32-fact) * ref[idx+1]
+        umlal           v20.8h, v1.8b, v17.8b   // + fact * ref[idx+2]
+        rshrn           v0.8b, v20.8h, #5       // (result + 16) >> 5
+
+        st1             {v0.s}[0], [x0], x3
+.endm
+        v_pos_4x4_row
+        v_pos_4x4_row
+        v_pos_4x4_row
+        v_pos_4x4_row
+.purgem v_pos_4x4_row
+
+        ret
+
+.Lv_pos_4x4_mode34:
+        // Mode 34: angle=32, each row copies from top[y+1..y+4]
+        // Row 0: top[1..4], Row 1: top[2..5], Row 2: top[3..6], Row 3: top[4..7]
+        ldr             s0, [x1, #1]
+        st1             {v0.s}[0], [x0], x3
+        ldr             s0, [x1, #2]
+        st1             {v0.s}[0], [x0], x3
+        ldr             s0, [x1, #3]
+        st1             {v0.s}[0], [x0], x3
+        ldr             s0, [x1, #4]
+        str             s0, [x0]
+        ret
+endfunc
+
+// -----------------------------------------------------------------------------
+// pred_angular_v_pos_8x8_8: Vertical reference positive angle prediction (mode 27-34)
+// Arguments:
+// x0: src
+// x1: top
+// x2: left (unused for V reference modes)
+// x3: stride
+// w4: c_idx
+// w5: mode
+// -----------------------------------------------------------------------------
+function ff_hevc_pred_angular_v_pos_8x8_8_neon, export=1
+        // Load angle from table
+        movrel          x6, intra_pred_angle_v
+        sub             w7, w5, #27            // mode - 27 (index into angle table)
+        ldrsb           w8, [x6, w7, sxtw]     // angle = intra_pred_angle_v[mode-27]
+
+        // Mode 34 optimization
+        cmp             w8, #32
+        b.eq            .Lv_pos_8x8_mode34
+
+        mov             w9, #0                  // y = 0
+        mov             w10, #0                 // angle_acc = 0
+        movi            v18.16b, #32            // constant 32 for NEON-domain weight computation
+
+.Lv_pos_8x8_row_loop:
+        add             w10, w10, w8            // angle_acc = (y+1) * angle
+        asr             w11, w10, #5            // idx
+        and             w12, w10, #31           // fact
+
+        add             x13, x1, w11, sxtw
+        ldr             d0, [x13]               // ref[idx+1..idx+8]
+        ldr             d1, [x13, #1]           // ref[idx+2..idx+9]
+
+        // Unconditional interpolation: ((32-fact)*ref[idx+1] + fact*ref[idx+2] + 16) >> 5
+        dup             v17.8b, w12
+        sub             v16.8b, v18.8b, v17.8b
+
+        umull           v20.8h, v0.8b, v16.8b
+        umlal           v20.8h, v1.8b, v17.8b
+        rshrn           v0.8b, v20.8h, #5
+
+        st1             {v0.8b}, [x0], x3
+
+        add             w9, w9, #1
+        cmp             w9, #8
+        b.lt            .Lv_pos_8x8_row_loop
+
+        ret
+
+.Lv_pos_8x8_mode34:
+        // Mode 34: each row copies from top[y+1..y+8]
+        ldr             d0, [x1, #1]
+        st1             {v0.8b}, [x0], x3
+        ldr             d0, [x1, #2]
+        st1             {v0.8b}, [x0], x3
+        ldr             d0, [x1, #3]
+        st1             {v0.8b}, [x0], x3
+        ldr             d0, [x1, #4]
+        st1             {v0.8b}, [x0], x3
+        ldr             d0, [x1, #5]
+        st1             {v0.8b}, [x0], x3
+        ldr             d0, [x1, #6]
+        st1             {v0.8b}, [x0], x3
+        ldr             d0, [x1, #7]
+        st1             {v0.8b}, [x0], x3
+        ldr             d0, [x1, #8]
+        str             d0, [x0]
+        ret
+endfunc
+
+// -----------------------------------------------------------------------------
+// pred_angular_v_pos_16x16_8: Vertical reference positive angle prediction (mode 27-34)
+// Arguments:
+// x0: src
+// x1: top
+// x2: left (unused for V reference modes)
+// x3: stride
+// w4: c_idx
+// w5: mode
+// -----------------------------------------------------------------------------
+function ff_hevc_pred_angular_v_pos_16x16_8_neon, export=1
+        // Load angle from table
+        movrel          x6, intra_pred_angle_v
+        sub             w7, w5, #27            // mode - 27 (index into angle table)
+        ldrsb           w8, [x6, w7, sxtw]     // angle = intra_pred_angle_v[mode-27]
+
+        // Mode 34 optimization
+        cmp             w8, #32
+        b.eq            .Lv_pos_16x16_mode34
+
+        mov             w9, #0                  // y = 0
+        mov             w10, #0                 // angle_acc = 0
+        movi            v18.16b, #32            // constant 32 for NEON-domain weight computation
+
+.Lv_pos_16x16_row_loop:
+        add             w10, w10, w8            // angle_acc = (y+1) * angle
+        asr             w11, w10, #5            // idx
+        and             w12, w10, #31           // fact
+
+        add             x13, x1, w11, sxtw
+        ldr             q0, [x13]               // ref[idx+1..idx+16]
+        ldr             q1, [x13, #1]           // ref[idx+2..idx+17]
+
+        // Unconditional interpolation: ((32-fact)*ref[idx+1] + fact*ref[idx+2] + 16) >> 5
+        dup             v17.16b, w12
+        sub             v16.16b, v18.16b, v17.16b
+
+        // Low 8 bytes
+        umull           v20.8h, v0.8b, v16.8b
+        umlal           v20.8h, v1.8b, v17.8b
+        rshrn           v2.8b, v20.8h, #5
+
+        // High 8 bytes
+        umull2          v21.8h, v0.16b, v16.16b
+        umlal2          v21.8h, v1.16b, v17.16b
+        rshrn2          v2.16b, v21.8h, #5
+
+        st1             {v2.16b}, [x0], x3
+
+        add             w9, w9, #1
+        cmp             w9, #16
+        b.lt            .Lv_pos_16x16_row_loop
+
+        ret
+
+.Lv_pos_16x16_mode34:
+        // Mode 34: each row copies from top[y+1..y+16]
+        mov             w9, #0
+.Lv_pos_16x16_mode34_loop:
+        add             w10, w9, #1
+        add             x13, x1, w10, sxtw
+        ldr             q0, [x13]
+        st1             {v0.16b}, [x0], x3
+        add             w9, w9, #1
+        cmp             w9, #16
+        b.lt            .Lv_pos_16x16_mode34_loop
+        ret
+endfunc
+
+// -----------------------------------------------------------------------------
+// pred_angular_v_pos_32x32_8: Vertical reference positive angle prediction (mode 27-34)
+// Arguments:
+// x0: src
+// x1: top
+// x2: left (unused for V reference modes)
+// x3: stride
+// w4: c_idx
+// w5: mode
+// -----------------------------------------------------------------------------
+function ff_hevc_pred_angular_v_pos_32x32_8_neon, export=1
+        // Load angle from table
+        movrel          x6, intra_pred_angle_v
+        sub             w7, w5, #27            // mode - 27 (index into angle table)
+        ldrsb           w8, [x6, w7, sxtw]     // angle = intra_pred_angle_v[mode-27]
+
+        // Mode 34 optimization
+        cmp             w8, #32
+        b.eq            .Lv_pos_32x32_mode34
+
+        mov             w9, #0                  // y = 0
+        mov             w10, #0                 // angle_acc = 0
+        movi            v18.16b, #32            // constant 32 for NEON-domain weight computation
+
+.Lv_pos_32x32_row_loop:
+        add             w10, w10, w8            // angle_acc = (y+1) * angle
+        asr             w11, w10, #5            // idx
+        and             w12, w10, #31           // fact
+
+        add             x13, x1, w11, sxtw
+
+        // Load 32 bytes + 1 for interpolation (unconditionally)
+        ldr             q0, [x13]               // ref[idx+1..idx+16]
+        ldr             q1, [x13, #1]           // ref[idx+2..idx+17]
+        ldr             q2, [x13, #16]          // ref[idx+17..idx+32]
+        ldr             q3, [x13, #17]          // ref[idx+18..idx+33]
+
+        // Unconditional interpolation: ((32-fact)*ref[idx+1] + fact*ref[idx+2] + 16) >> 5
+        dup             v17.16b, w12
+        sub             v16.16b, v18.16b, v17.16b
+
+        // First 16 bytes
+        umull           v20.8h, v0.8b, v16.8b
+        umlal           v20.8h, v1.8b, v17.8b
+        rshrn           v4.8b, v20.8h, #5
+
+        umull2          v21.8h, v0.16b, v16.16b
+        umlal2          v21.8h, v1.16b, v17.16b
+        rshrn2          v4.16b, v21.8h, #5
+
+        // Second 16 bytes
+        umull           v22.8h, v2.8b, v16.8b
+        umlal           v22.8h, v3.8b, v17.8b
+        rshrn           v5.8b, v22.8h, #5
+
+        umull2          v23.8h, v2.16b, v16.16b
+        umlal2          v23.8h, v3.16b, v17.16b
+        rshrn2          v5.16b, v23.8h, #5
+
+        st1             {v4.16b, v5.16b}, [x0], x3
+
+        add             w9, w9, #1
+        cmp             w9, #32
+        b.lt            .Lv_pos_32x32_row_loop
+
+        ret
+
+.Lv_pos_32x32_mode34:
+        // Mode 34: each row copies from top[y+1..y+32]
+        mov             w9, #0
+.Lv_pos_32x32_mode34_loop:
+        add             w10, w9, #1
+        add             x13, x1, w10, sxtw
+        ldr             q0, [x13]
+        ldr             q1, [x13, #16]
+        st1             {v0.16b, v1.16b}, [x0], x3
+        add             w9, w9, #1
+        cmp             w9, #32
+        b.lt            .Lv_pos_32x32_mode34_loop
+        ret
+endfunc
-- 
2.52.0


From 40e19a87d74258bfd63a571118ac8ca4aa1e35fc Mon Sep 17 00:00:00 2001
From: Jun Zhao <barryjzhao@tencent.com>
Date: Tue, 27 Jan 2026 18:40:31 +0800
Subject: [PATCH 5/8] lavc/hevc: add aarch64 NEON for angular H positive (modes
 2-9)

Add NEON-optimized implementations for HEVC angular intra prediction
modes 2-9 (horizontal positive angles) at 8-bit depth.

These modes use the left reference with positive angles, computing:
- idx = ((x+1) * angle) >> 5
- fact = ((x+1) * angle) & 31
- Interpolate between left[y + idx] and left[y + idx + 1] using fact

Uses batch column computation with matrix transpose to convert
column-oriented interpolation results into contiguous row stores.

Mode 2 (angle=32) is optimized with direct row-wise contiguous writes
since each row copies left[y+1..y+size], avoiding interpolation.

Supports all block sizes (4x4, 8x8, 16x16, 32x32).
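
The horizontal variant is the transpose of the vertical one: the same
interpolation, but driven by the column x and reading the left reference,
which is why the NEON version computes whole columns and transposes them
into row stores.  A scalar C sketch (angular_h_pos_pixel is a
hypothetical helper for illustration, not part of this patch):

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of one predicted pixel for modes 2-9.  left[-1] is the
 * corner sample, so the spec's ref[k] corresponds to left[k - 1] here. */
static uint8_t angular_h_pos_pixel(const uint8_t *left, int angle, int x, int y)
{
    int pos  = (x + 1) * angle;   /* accumulated angle for this column */
    int idx  = pos >> 5;
    int fact = pos & 31;
    return ((32 - fact) * left[y + idx] + fact * left[y + idx + 1] + 16) >> 5;
}
```

For mode 2 (angle = 32), fact is always 0 and idx = x + 1, so row y is the
contiguous run left[y+1..y+size], matching the direct-copy fast path.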

Speedup over C on Apple M4 (checkasm --bench, 10-run average):

    4x4:   2.29x - 3.29x    8x8:   3.40x - 4.73x
    16x16: 5.15x - 13.14x   32x32: 8.19x - 13.18x

Signed-off-by: Jun Zhao <barryjzhao@tencent.com>
---
 libavcodec/aarch64/hevcpred_init_aarch64.c |  22 +
 libavcodec/aarch64/hevcpred_neon.S         | 477 +++++++++++++++++++++
 2 files changed, 499 insertions(+)

diff --git a/libavcodec/aarch64/hevcpred_init_aarch64.c b/libavcodec/aarch64/hevcpred_init_aarch64.c
index 3d27c251e1..2973c005ed 100644
--- a/libavcodec/aarch64/hevcpred_init_aarch64.c
+++ b/libavcodec/aarch64/hevcpred_init_aarch64.c
@@ -85,6 +85,20 @@ void ff_hevc_pred_angular_v_pos_32x32_8_neon(uint8_t *src, const uint8_t *top,
                                             const uint8_t *left, ptrdiff_t stride,
                                             int c_idx, int mode);
 
+// Positive angle horizontal modes (mode 2-9)
+void ff_hevc_pred_angular_h_pos_4x4_8_neon(uint8_t *src, const uint8_t *top,
+                                          const uint8_t *left, ptrdiff_t stride,
+                                          int c_idx, int mode);
+void ff_hevc_pred_angular_h_pos_8x8_8_neon(uint8_t *src, const uint8_t *top,
+                                          const uint8_t *left, ptrdiff_t stride,
+                                          int c_idx, int mode);
+void ff_hevc_pred_angular_h_pos_16x16_8_neon(uint8_t *src, const uint8_t *top,
+                                            const uint8_t *left, ptrdiff_t stride,
+                                            int c_idx, int mode);
+void ff_hevc_pred_angular_h_pos_32x32_8_neon(uint8_t *src, const uint8_t *top,
+                                            const uint8_t *left, ptrdiff_t stride,
+                                            int c_idx, int mode);
+
 static void pred_dc_neon(uint8_t *src, const uint8_t *top,
                          const uint8_t *left, ptrdiff_t stride,
                          int log2_size, int c_idx)
@@ -124,6 +138,8 @@ static void pred_angular_0_neon(uint8_t *src, const uint8_t *top,
         ff_hevc_pred_angular_mode_18_4x4_8_neon(src, top, left, stride, c_idx, 2);
     } else if (mode >= 27) {
         ff_hevc_pred_angular_v_pos_4x4_8_neon(src, top, left, stride, c_idx, mode);
+    } else if (mode >= 2 && mode <= 9) {
+        ff_hevc_pred_angular_h_pos_4x4_8_neon(src, top, left, stride, c_idx, mode);
     } else {
         pred_angular_c[0](src, top, left, stride, c_idx, mode);
     }
@@ -141,6 +157,8 @@ static void pred_angular_1_neon(uint8_t *src, const uint8_t *top,
         ff_hevc_pred_angular_mode_18_8x8_8_neon(src, top, left, stride, c_idx, 3);
     } else if (mode >= 27) {
         ff_hevc_pred_angular_v_pos_8x8_8_neon(src, top, left, stride, c_idx, mode);
+    } else if (mode >= 2 && mode <= 9) {
+        ff_hevc_pred_angular_h_pos_8x8_8_neon(src, top, left, stride, c_idx, mode);
     } else {
         pred_angular_c[1](src, top, left, stride, c_idx, mode);
     }
@@ -158,6 +176,8 @@ static void pred_angular_2_neon(uint8_t *src, const uint8_t *top,
         ff_hevc_pred_angular_mode_18_16x16_8_neon(src, top, left, stride, c_idx, 4);
     } else if (mode >= 27) {
         ff_hevc_pred_angular_v_pos_16x16_8_neon(src, top, left, stride, c_idx, mode);
+    } else if (mode >= 2 && mode <= 9) {
+        ff_hevc_pred_angular_h_pos_16x16_8_neon(src, top, left, stride, c_idx, mode);
     } else {
         pred_angular_c[2](src, top, left, stride, c_idx, mode);
     }
@@ -175,6 +195,8 @@ static void pred_angular_3_neon(uint8_t *src, const uint8_t *top,
         ff_hevc_pred_angular_mode_18_32x32_8_neon(src, top, left, stride, c_idx, 5);
     } else if (mode >= 27) {
         ff_hevc_pred_angular_v_pos_32x32_8_neon(src, top, left, stride, c_idx, mode);
+    } else if (mode >= 2 && mode <= 9) {
+        ff_hevc_pred_angular_h_pos_32x32_8_neon(src, top, left, stride, c_idx, mode);
     } else {
         pred_angular_c[3](src, top, left, stride, c_idx, mode);
     }
diff --git a/libavcodec/aarch64/hevcpred_neon.S b/libavcodec/aarch64/hevcpred_neon.S
index 37ddab42bf..3d982a3589 100644
--- a/libavcodec/aarch64/hevcpred_neon.S
+++ b/libavcodec/aarch64/hevcpred_neon.S
@@ -21,6 +21,7 @@
  */
 
 #include "libavutil/aarch64/asm.S"
+#include "neon.S"
 
 /* HEVC Intra Prediction functions
  *
@@ -1550,3 +1551,479 @@ function ff_hevc_pred_angular_v_pos_32x32_8_neon, export=1
         b.lt            .Lv_pos_32x32_mode34_loop
         ret
 endfunc
+
+// =============================================================================
+// Angular Prediction - Horizontal modes, positive angle (Mode 2-9)
+// =============================================================================
+
+const intra_pred_angle_h, align=4
+        .byte   32      // mode 2
+        .byte   26      // mode 3
+        .byte   21      // mode 4
+        .byte   17      // mode 5
+        .byte   13      // mode 6
+        .byte   9       // mode 7
+        .byte   5       // mode 8
+        .byte   2       // mode 9
+endconst
+
+// -----------------------------------------------------------------------------
+// pred_angular_h_pos_4x4_8: Horizontal reference positive angle prediction (mode 2-9)
+// Arguments:
+// x0: src
+// x1: top (unused for H reference modes)
+// x2: left
+// x3: stride
+// w4: c_idx
+// w5: mode
+// -----------------------------------------------------------------------------
+function ff_hevc_pred_angular_h_pos_4x4_8_neon, export=1
+        // Load angle from table
+        movrel          x6, intra_pred_angle_h
+        sub             w7, w5, #2              // mode - 2 (index into angle table)
+        ldrb            w8, [x6, w7, uxtw]      // angle = intra_pred_angle_h[mode-2]
+
+        // For mode 2 (angle=32), fact is always 0, so optimize as a pure copy
+        cmp             w8, #32
+        b.eq            .Lh_pos_4x4_mode2
+
+        // === Fully unrolled 4-column computation with transpose ===
+        str             d15, [sp, #-16]!
+        mov             w10, #0                 // angle_acc
+        movi            v15.16b, #32            // constant 32 for NEON-domain weight computation
+
+.macro h_pos_4x4_col dst
+        add             w10, w10, w8
+        asr             w11, w10, #5
+        and             w12, w10, #31
+        add             x13, x2, w11, sxtw
+        ldr             d18, [x13]
+        ldr             d19, [x13, #1]
+        dup             v17.8b, w12
+        sub             v16.8b, v15.8b, v17.8b
+        umull           v20.8h, v18.8b, v16.8b
+        umlal           v20.8h, v19.8b, v17.8b
+        rshrn           \dst\().8b, v20.8h, #5
+.endm
+        h_pos_4x4_col  v0
+        h_pos_4x4_col  v1
+        h_pos_4x4_col  v2
+        h_pos_4x4_col  v3
+.purgem h_pos_4x4_col
+
+        transpose_4x8B  v0, v1, v2, v3, v16, v17, v18, v19
+
+        str             s0, [x0]
+        add             x0, x0, x3
+        str             s1, [x0]
+        add             x0, x0, x3
+        str             s2, [x0]
+        add             x0, x0, x3
+        str             s3, [x0]
+        ldr             d15, [sp], #16
+        ret
+
+.Lh_pos_4x4_mode2:
+        // Mode 2: Row-wise optimization
+        // Row y contains left[y+1..y+4], which is a contiguous read + contiguous write
+        // Row 0: left[1..4], Row 1: left[2..5], Row 2: left[3..6], Row 3: left[4..7]
+        add             x5, x2, #1              // left + 1
+        ldr             s0, [x5]                // row 0: left[1..4]
+        ldr             s1, [x5, #1]            // row 1: left[2..5]
+        ldr             s2, [x5, #2]            // row 2: left[3..6]
+        ldr             s3, [x5, #3]            // row 3: left[4..7]
+        str             s0, [x0]
+        add             x0, x0, x3
+        str             s1, [x0]
+        add             x0, x0, x3
+        str             s2, [x0]
+        add             x0, x0, x3
+        str             s3, [x0]
+        ret
+endfunc
+
+// -----------------------------------------------------------------------------
+// pred_angular_h_pos_8x8_8: Horizontal reference positive angle prediction (mode 2-9)
+// Arguments:
+// x0: src
+// x1: top (unused for H reference modes)
+// x2: left
+// x3: stride
+// w4: c_idx
+// w5: mode
+// -----------------------------------------------------------------------------
+function ff_hevc_pred_angular_h_pos_8x8_8_neon, export=1
+        // Load angle from table
+        movrel          x6, intra_pred_angle_h
+        sub             w7, w5, #2
+        ldrb            w8, [x6, w7, uxtw]      // angle
+
+        // Mode 2 optimization
+        cmp             w8, #32
+        b.eq            .Lh_pos_8x8_mode2
+
+        // === Fully unrolled 8-column computation with transpose ===
+        str             d15, [sp, #-16]!
+        mov             w10, #0                 // angle_acc
+        movi            v15.16b, #32            // constant 32 for NEON-domain weight computation
+
+.macro h_pos_8x8_col dst
+        add             w10, w10, w8
+        asr             w11, w10, #5
+        and             w12, w10, #31
+        add             x13, x2, w11, sxtw
+        ldr             d18, [x13]
+        ldr             d19, [x13, #1]
+        dup             v17.8b, w12
+        sub             v16.8b, v15.8b, v17.8b
+        umull           v20.8h, v18.8b, v16.8b
+        umlal           v20.8h, v19.8b, v17.8b
+        rshrn           \dst\().8b, v20.8h, #5
+.endm
+        h_pos_8x8_col  v0
+        h_pos_8x8_col  v1
+        h_pos_8x8_col  v2
+        h_pos_8x8_col  v3
+        h_pos_8x8_col  v4
+        h_pos_8x8_col  v5
+        h_pos_8x8_col  v6
+        h_pos_8x8_col  v7
+.purgem h_pos_8x8_col
+
+        transpose_8x8B  v0, v1, v2, v3, v4, v5, v6, v7, v16, v17
+
+        str             d0, [x0]
+        add             x0, x0, x3
+        str             d1, [x0]
+        add             x0, x0, x3
+        str             d2, [x0]
+        add             x0, x0, x3
+        str             d3, [x0]
+        add             x0, x0, x3
+        str             d4, [x0]
+        add             x0, x0, x3
+        str             d5, [x0]
+        add             x0, x0, x3
+        str             d6, [x0]
+        add             x0, x0, x3
+        str             d7, [x0]
+        ldr             d15, [sp], #16
+        ret
+
+.Lh_pos_8x8_mode2:
+        // Mode 2: Row-wise optimization
+        // Row y contains left[y+1..y+8], contiguous read + contiguous write
+        add             x5, x2, #1              // left + 1
+        ldr             d0, [x5]                // row 0: left[1..8]
+        ldr             d1, [x5, #1]            // row 1: left[2..9]
+        str             d0, [x0]
+        add             x0, x0, x3
+        str             d1, [x0]
+        add             x0, x0, x3
+        ldr             d0, [x5, #2]            // row 2: left[3..10]
+        ldr             d1, [x5, #3]            // row 3: left[4..11]
+        str             d0, [x0]
+        add             x0, x0, x3
+        str             d1, [x0]
+        add             x0, x0, x3
+        ldr             d0, [x5, #4]            // row 4: left[5..12]
+        ldr             d1, [x5, #5]            // row 5: left[6..13]
+        str             d0, [x0]
+        add             x0, x0, x3
+        str             d1, [x0]
+        add             x0, x0, x3
+        ldr             d0, [x5, #6]            // row 6: left[7..14]
+        ldr             d1, [x5, #7]            // row 7: left[8..15]
+        str             d0, [x0]
+        add             x0, x0, x3
+        str             d1, [x0]
+        ret
+endfunc
+
+// -----------------------------------------------------------------------------
+// pred_angular_h_pos_16x16_8: Horizontal reference positive angle prediction (mode 2-9)
+// Arguments:
+// x0: src
+// x1: top (unused for H reference modes)
+// x2: left
+// x3: stride
+// w4: c_idx
+// w5: mode
+// -----------------------------------------------------------------------------
+function ff_hevc_pred_angular_h_pos_16x16_8_neon, export=1
+        // Load angle from table
+        movrel          x6, intra_pred_angle_h
+        sub             w7, w5, #2
+        ldrb            w8, [x6, w7, uxtw]
+
+        // Mode 2 optimization
+        cmp             w8, #32
+        b.eq            .Lh_pos_16x16_mode2
+
+        // === Two batches of 8 columns with 16-byte transpose ===
+        str             d15, [sp, #-16]!
+        mov             x15, x0                 // save base dst
+        movi            v15.16b, #32            // constant 32 for NEON-domain weight computation
+
+.macro h_pos_16x16_col dst
+        add             w10, w10, w8
+        asr             w11, w10, #5
+        and             w12, w10, #31
+        add             x13, x2, w11, sxtw
+        ldr             q18, [x13]
+        ldr             q19, [x13, #1]
+        dup             v17.16b, w12
+        sub             v16.16b, v15.16b, v17.16b
+        umull           v20.8h, v18.8b, v16.8b
+        umlal           v20.8h, v19.8b, v17.8b
+        rshrn           \dst\().8b, v20.8h, #5
+        umull2          v21.8h, v18.16b, v16.16b
+        umlal2          v21.8h, v19.16b, v17.16b
+        rshrn2          \dst\().16b, v21.8h, #5
+.endm
+
+        // Batch 1: columns 0-7
+        mov             w10, #0
+        h_pos_16x16_col v0
+        h_pos_16x16_col v1
+        h_pos_16x16_col v2
+        h_pos_16x16_col v3
+        h_pos_16x16_col v4
+        h_pos_16x16_col v5
+        h_pos_16x16_col v6
+        h_pos_16x16_col v7
+
+        mov             w9, w10                 // save angle_acc
+
+        transpose_8x16B v0, v1, v2, v3, v4, v5, v6, v7, v16, v17
+
+        // Store cols 0-7 of rows 0-7
+        mov             x16, x15
+        .irp reg, v0, v1, v2, v3, v4, v5, v6, v7
+        st1             {\reg\().8b}, [x16], x3
+        .endr
+        // Store cols 0-7 of rows 8-15
+        .irp reg, v0, v1, v2, v3, v4, v5, v6, v7
+        st1             {\reg\().d}[1], [x16], x3
+        .endr
+
+        // Batch 2: columns 8-15
+        mov             w10, w9
+        h_pos_16x16_col v0
+        h_pos_16x16_col v1
+        h_pos_16x16_col v2
+        h_pos_16x16_col v3
+        h_pos_16x16_col v4
+        h_pos_16x16_col v5
+        h_pos_16x16_col v6
+        h_pos_16x16_col v7
+.purgem h_pos_16x16_col
+
+        transpose_8x16B v0, v1, v2, v3, v4, v5, v6, v7, v16, v17
+
+        // Store cols 8-15 of rows 0-7
+        add             x16, x15, #8
+        .irp reg, v0, v1, v2, v3, v4, v5, v6, v7
+        st1             {\reg\().8b}, [x16], x3
+        .endr
+        // Store cols 8-15 of rows 8-15
+        .irp reg, v0, v1, v2, v3, v4, v5, v6, v7
+        st1             {\reg\().d}[1], [x16], x3
+        .endr
+
+        ldr             d15, [sp], #16
+        ret
+
+.Lh_pos_16x16_mode2:
+        // Mode 2: Row-wise optimization with loop unrolling
+        // Row y contains left[y+1..y+16], contiguous read + contiguous write
+        add             x5, x2, #1              // left + 1
+
+        // Rows 0-3
+        ldr             q0, [x5]
+        ldr             q1, [x5, #1]
+        ldr             q2, [x5, #2]
+        ldr             q3, [x5, #3]
+        str             q0, [x0]
+        add             x0, x0, x3
+        str             q1, [x0]
+        add             x0, x0, x3
+        str             q2, [x0]
+        add             x0, x0, x3
+        str             q3, [x0]
+        add             x0, x0, x3
+
+        // Rows 4-7
+        ldr             q0, [x5, #4]
+        ldr             q1, [x5, #5]
+        ldr             q2, [x5, #6]
+        ldr             q3, [x5, #7]
+        str             q0, [x0]
+        add             x0, x0, x3
+        str             q1, [x0]
+        add             x0, x0, x3
+        str             q2, [x0]
+        add             x0, x0, x3
+        str             q3, [x0]
+        add             x0, x0, x3
+
+        // Rows 8-11
+        ldr             q0, [x5, #8]
+        ldr             q1, [x5, #9]
+        ldr             q2, [x5, #10]
+        ldr             q3, [x5, #11]
+        str             q0, [x0]
+        add             x0, x0, x3
+        str             q1, [x0]
+        add             x0, x0, x3
+        str             q2, [x0]
+        add             x0, x0, x3
+        str             q3, [x0]
+        add             x0, x0, x3
+
+        // Rows 12-15
+        ldr             q0, [x5, #12]
+        ldr             q1, [x5, #13]
+        ldr             q2, [x5, #14]
+        ldr             q3, [x5, #15]
+        str             q0, [x0]
+        add             x0, x0, x3
+        str             q1, [x0]
+        add             x0, x0, x3
+        str             q2, [x0]
+        add             x0, x0, x3
+        str             q3, [x0]
+        ret
+endfunc
+
+// -----------------------------------------------------------------------------
+// pred_angular_h_pos_32x32_8: Horizontal reference positive angle prediction (mode 2-9)
+// Arguments:
+// x0: src
+// x1: top (unused for H reference modes)
+// x2: left
+// x3: stride
+// w4: c_idx
+// w5: mode
+// -----------------------------------------------------------------------------
+function ff_hevc_pred_angular_h_pos_32x32_8_neon, export=1
+        // Load angle from table
+        movrel          x6, intra_pred_angle_h
+        sub             w7, w5, #2
+        ldrb            w8, [x6, w7, uxtw]
+
+        // Mode 2 optimization
+        cmp             w8, #32
+        b.eq            .Lh_pos_32x32_mode2
+
+        // === 4 batches of 8 columns with 32-byte transpose ===
+        str             d15, [sp, #-16]!
+        mov             x15, x0                 // save base dst
+        movi            v15.16b, #32            // constant 32 for NEON-domain weight computation
+
+.macro h_pos_32_col dst_hi, dst_lo
+        add             w10, w10, w8
+        asr             w11, w10, #5
+        and             w12, w10, #31
+        add             x13, x2, w11, sxtw
+        ldr             q18, [x13]              // ref rows 0-15
+        ldr             q19, [x13, #1]
+        ldr             q20, [x13, #16]         // ref rows 16-31
+        ldr             q21, [x13, #17]
+        dup             v17.16b, w12
+        sub             v16.16b, v15.16b, v17.16b
+        umull           v22.8h, v18.8b, v16.8b
+        umlal           v22.8h, v19.8b, v17.8b
+        rshrn           \dst_hi\().8b, v22.8h, #5
+        umull2          v23.8h, v18.16b, v16.16b
+        umlal2          v23.8h, v19.16b, v17.16b
+        rshrn2          \dst_hi\().16b, v23.8h, #5
+        umull           v22.8h, v20.8b, v16.8b
+        umlal           v22.8h, v21.8b, v17.8b
+        rshrn           \dst_lo\().8b, v22.8h, #5
+        umull2          v23.8h, v20.16b, v16.16b
+        umlal2          v23.8h, v21.16b, v17.16b
+        rshrn2          \dst_lo\().16b, v23.8h, #5
+.endm
+
+        mov             w10, #0                 // angle_acc
+        mov             x9, #0                  // column byte offset
+
+.Lh_pos_32x32_batch:
+        h_pos_32_col    v0, v24
+        h_pos_32_col    v1, v25
+        h_pos_32_col    v2, v26
+        h_pos_32_col    v3, v27
+        h_pos_32_col    v4, v28
+        h_pos_32_col    v5, v29
+        h_pos_32_col    v6, v30
+        h_pos_32_col    v7, v31
+
+        mov             w11, w10                // save angle_acc
+
+        // Transpose upper half (rows 0-15)
+        transpose_8x16B v0, v1, v2, v3, v4, v5, v6, v7, v16, v17
+        // Transpose lower half (rows 16-31)
+        transpose_8x16B v24, v25, v26, v27, v28, v29, v30, v31, v16, v17
+
+        add             x16, x15, x9
+
+        // Rows 0-7
+        .irp reg, v0, v1, v2, v3, v4, v5, v6, v7
+        st1             {\reg\().8b}, [x16], x3
+        .endr
+        // Rows 8-15
+        .irp reg, v0, v1, v2, v3, v4, v5, v6, v7
+        st1             {\reg\().d}[1], [x16], x3
+        .endr
+        // Rows 16-23
+        .irp reg, v24, v25, v26, v27, v28, v29, v30, v31
+        st1             {\reg\().8b}, [x16], x3
+        .endr
+        // Rows 24-31
+        .irp reg, v24, v25, v26, v27, v28, v29, v30, v31
+        st1             {\reg\().d}[1], [x16], x3
+        .endr
+
+        mov             w10, w11                // restore angle_acc
+        add             x9, x9, #8
+        cmp             x9, #32
+        b.lt            .Lh_pos_32x32_batch
+
+.purgem h_pos_32_col
+
+        ldr             d15, [sp], #16
+        ret
+
+.Lh_pos_32x32_mode2:
+        // Mode 2: Row-wise optimization with loop unrolling (4 rows per iteration)
+        // Row y contains left[y+1..y+32], contiguous read + contiguous write
+        add             x5, x2, #1              // left + 1
+        mov             w6, #0                  // row counter (0, 4, 8, ... 28)
+.Lh_pos_32x32_mode2_row4:
+        // Process 4 rows at a time
+        add             x7, x5, w6, uxtw        // base for row y
+        ldp             q0, q1, [x7]            // row y
+        stp             q0, q1, [x0]
+        add             x0, x0, x3
+
+        add             x8, x7, #1              // base for row y+1
+        ldp             q0, q1, [x8]
+        stp             q0, q1, [x0]
+        add             x0, x0, x3
+
+        add             x8, x7, #2              // base for row y+2
+        ldp             q0, q1, [x8]
+        stp             q0, q1, [x0]
+        add             x0, x0, x3
+
+        add             x8, x7, #3              // base for row y+3
+        ldp             q0, q1, [x8]
+        stp             q0, q1, [x0]
+        add             x0, x0, x3
+
+        add             w6, w6, #4
+        cmp             w6, #32
+        b.lt            .Lh_pos_32x32_mode2_row4
+        ret
+endfunc
-- 
2.52.0


From c036d4456030dcdbcd9f9b4c8decfbaa9b6f2530 Mon Sep 17 00:00:00 2001
From: Jun Zhao <barryjzhao@tencent.com>
Date: Tue, 27 Jan 2026 18:44:17 +0800
Subject: [PATCH 6/8] lavc/hevc: add aarch64 NEON for angular V negative (modes
 19-25)

Add NEON-optimized implementations for HEVC angular intra prediction
modes 19-25 (vertical negative angles) at 8-bit depth.

These modes use the top reference with negative angles, requiring the
reference to be extended from left samples when idx drops below -1:
- idx = ((y+1) * angle) >> 5 (always negative; decreases as y grows)
- Extended reference: ref[x] = left[-1 + ((x * inv_angle + 128) >> 8)]
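The extension step can be sketched in scalar C (illustrative names, not the patch's code; `ref_tmp_base` is assumed to point far enough into a buffer that negative indices stay in bounds, as the assembly arranges with its sp+32 base, and an arithmetic right shift on negative values is assumed, as in the patch):

```c
#include <stdint.h>

/* Build the extended reference for vertical negative-angle modes (19-25).
 * The usable reference is top[-1..size]; indices below -1 are projected
 * onto the left column via inv_angle. On return, ref[x] is valid for
 * x in [last..size+1], where last = (size * angle) >> 5. */
static void extend_ref(uint8_t *ref_tmp_base, const uint8_t *top,
                       const uint8_t *left, int size,
                       int angle, int inv_angle)
{
    uint8_t *ref = ref_tmp_base;
    int last = (size * angle) >> 5;     /* most negative idx needed */

    /* top[-1..size] maps to ref[0..size+1] */
    for (int x = 0; x <= size + 1; x++)
        ref[x] = top[x - 1];
    /* project ref[last..-1] from the left column */
    for (int x = last; x <= -1; x++)
        ref[x] = left[-1 + ((x * inv_angle + 128) >> 8)];
}
```

With size = 4 and mode 19 (angle -26, inv_angle -315), last = -4, so ref[-4..-1] is filled from the left column while ref[0..5] simply mirrors top[-1..4]; when last >= -1 no extension is needed and the assembly predicts straight from top - 1.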

Supports all block sizes (4x4, 8x8, 16x16, 32x32).

Speedup over C on Apple M4 (checkasm --bench, 10-run average):

    4x4:   1.63x - 2.55x    8x8:   2.54x - 3.27x
    16x16: 4.59x - 5.92x    32x32: 9.01x - 10.34x

Signed-off-by: Jun Zhao <barryjzhao@tencent.com>
---
 libavcodec/aarch64/hevcpred_init_aarch64.c |  22 ++
 libavcodec/aarch64/hevcpred_neon.S         | 437 ++++++++++++++++++++-
 2 files changed, 454 insertions(+), 5 deletions(-)

diff --git a/libavcodec/aarch64/hevcpred_init_aarch64.c b/libavcodec/aarch64/hevcpred_init_aarch64.c
index 2973c005ed..73b3031650 100644
--- a/libavcodec/aarch64/hevcpred_init_aarch64.c
+++ b/libavcodec/aarch64/hevcpred_init_aarch64.c
@@ -99,6 +99,20 @@ void ff_hevc_pred_angular_h_pos_32x32_8_neon(uint8_t *src, const uint8_t *top,
                                             const uint8_t *left, ptrdiff_t stride,
                                             int c_idx, int mode);
 
+// Negative angle vertical modes (mode 19-25)
+void ff_hevc_pred_angular_v_neg_4x4_8_neon(uint8_t *src, const uint8_t *top,
+                                          const uint8_t *left, ptrdiff_t stride,
+                                          int c_idx, int mode);
+void ff_hevc_pred_angular_v_neg_8x8_8_neon(uint8_t *src, const uint8_t *top,
+                                          const uint8_t *left, ptrdiff_t stride,
+                                          int c_idx, int mode);
+void ff_hevc_pred_angular_v_neg_16x16_8_neon(uint8_t *src, const uint8_t *top,
+                                            const uint8_t *left, ptrdiff_t stride,
+                                            int c_idx, int mode);
+void ff_hevc_pred_angular_v_neg_32x32_8_neon(uint8_t *src, const uint8_t *top,
+                                            const uint8_t *left, ptrdiff_t stride,
+                                            int c_idx, int mode);
+
 static void pred_dc_neon(uint8_t *src, const uint8_t *top,
                          const uint8_t *left, ptrdiff_t stride,
                          int log2_size, int c_idx)
@@ -138,6 +152,8 @@ static void pred_angular_0_neon(uint8_t *src, const uint8_t *top,
         ff_hevc_pred_angular_mode_18_4x4_8_neon(src, top, left, stride, c_idx, 2);
     } else if (mode >= 27) {
         ff_hevc_pred_angular_v_pos_4x4_8_neon(src, top, left, stride, c_idx, mode);
+    } else if (mode > 18 && mode <= 25) {
+        ff_hevc_pred_angular_v_neg_4x4_8_neon(src, top, left, stride, c_idx, mode);
     } else if (mode >= 2 && mode <= 9) {
         ff_hevc_pred_angular_h_pos_4x4_8_neon(src, top, left, stride, c_idx, mode);
     } else {
@@ -157,6 +173,8 @@ static void pred_angular_1_neon(uint8_t *src, const uint8_t *top,
         ff_hevc_pred_angular_mode_18_8x8_8_neon(src, top, left, stride, c_idx, 3);
     } else if (mode >= 27) {
         ff_hevc_pred_angular_v_pos_8x8_8_neon(src, top, left, stride, c_idx, mode);
+    } else if (mode > 18 && mode <= 25) {
+        ff_hevc_pred_angular_v_neg_8x8_8_neon(src, top, left, stride, c_idx, mode);
     } else if (mode >= 2 && mode <= 9) {
         ff_hevc_pred_angular_h_pos_8x8_8_neon(src, top, left, stride, c_idx, mode);
     } else {
@@ -176,6 +194,8 @@ static void pred_angular_2_neon(uint8_t *src, const uint8_t *top,
         ff_hevc_pred_angular_mode_18_16x16_8_neon(src, top, left, stride, c_idx, 4);
     } else if (mode >= 27) {
         ff_hevc_pred_angular_v_pos_16x16_8_neon(src, top, left, stride, c_idx, mode);
+    } else if (mode > 18 && mode <= 25) {
+        ff_hevc_pred_angular_v_neg_16x16_8_neon(src, top, left, stride, c_idx, mode);
     } else if (mode >= 2 && mode <= 9) {
         ff_hevc_pred_angular_h_pos_16x16_8_neon(src, top, left, stride, c_idx, mode);
     } else {
@@ -195,6 +215,8 @@ static void pred_angular_3_neon(uint8_t *src, const uint8_t *top,
         ff_hevc_pred_angular_mode_18_32x32_8_neon(src, top, left, stride, c_idx, 5);
     } else if (mode >= 27) {
         ff_hevc_pred_angular_v_pos_32x32_8_neon(src, top, left, stride, c_idx, mode);
+    } else if (mode > 18 && mode <= 25) {
+        ff_hevc_pred_angular_v_neg_32x32_8_neon(src, top, left, stride, c_idx, mode);
     } else if (mode >= 2 && mode <= 9) {
         ff_hevc_pred_angular_h_pos_32x32_8_neon(src, top, left, stride, c_idx, mode);
     } else {
diff --git a/libavcodec/aarch64/hevcpred_neon.S b/libavcodec/aarch64/hevcpred_neon.S
index 3d982a3589..845e749087 100644
--- a/libavcodec/aarch64/hevcpred_neon.S
+++ b/libavcodec/aarch64/hevcpred_neon.S
@@ -1551,11 +1551,6 @@ function ff_hevc_pred_angular_v_pos_32x32_8_neon, export=1
         b.lt            .Lv_pos_32x32_mode34_loop
         ret
 endfunc
-
-// =============================================================================
-// Angular Prediction - Horizontal modes, positive angle (Mode 2-9)
-// =============================================================================
-
 const intra_pred_angle_h, align=4
         .byte   32      // mode 2
         .byte   26      // mode 3
@@ -2027,3 +2022,435 @@ function ff_hevc_pred_angular_h_pos_32x32_8_neon, export=1
         b.lt            .Lh_pos_32x32_mode2_row4
         ret
 endfunc
+
+// =============================================================================
+// Angular Prediction - Vertical reference modes, negative angles (Mode 19-25)
+// =============================================================================
+
+// Angle table for V reference negative angles (mode 18-25)
+// angle = intra_pred_angle_v_neg[mode - 18]
+const intra_pred_angle_v_neg, align=4
+        .byte   -32     // mode 18
+        .byte   -26     // mode 19
+        .byte   -21     // mode 20
+        .byte   -17     // mode 21
+        .byte   -13     // mode 22
+        .byte   -9      // mode 23
+        .byte   -5      // mode 24
+        .byte   -2      // mode 25
+endconst
+
+// inv_angle table for reference extension (16-bit values)
+// inv_angle = inv_angle_v_neg[mode - 18]
+// Used to calculate: ref_tmp[x] = left[-1 + ((x * inv_angle + 128) >> 8)]
+const inv_angle_v_neg, align=4
+        .short  -256    // mode 18: inv_angle[7]
+        .short  -315    // mode 19: inv_angle[8]
+        .short  -390    // mode 20: inv_angle[9]
+        .short  -482    // mode 21: inv_angle[10]
+        .short  -630    // mode 22: inv_angle[11]
+        .short  -910    // mode 23: inv_angle[12]
+        .short  -1638   // mode 24: inv_angle[13]
+        .short  -4096   // mode 25: inv_angle[14]
+endconst
+
+// -----------------------------------------------------------------------------
+// pred_angular_v_neg_4x4_8: Vertical reference negative angle prediction (mode 18-25)
+// Arguments:
+// x0: src
+// x1: top
+// x2: left (unused for V reference modes)
+// x3: stride
+// w4: c_idx
+// w5: mode
+// -----------------------------------------------------------------------------
+function ff_hevc_pred_angular_v_neg_4x4_8_neon, export=1
+        // Stack layout: ref_tmp[-32..31] at sp, base at sp+32
+        // last ranges from -4 (mode 19) to -1 (mode 25); data occupies ref_tmp[-4..4]
+        // Allocated 64 bytes (16-byte aligned), offset 32 for negative indexing
+        sub             sp, sp, #64
+        add             x14, sp, #32            // ref_tmp base
+
+        // Load angle from table
+        movrel          x6, intra_pred_angle_v_neg
+        sub             w7, w5, #18             // mode - 18
+        ldrsb           w8, [x6, w7, sxtw]      // angle (negative)
+
+        // Calculate last = (4 * angle) >> 5
+        mov             w15, #4
+        mul             w15, w15, w8
+        asr             w15, w15, #5            // last (negative)
+
+        // Check if extension needed (last < -1)
+        cmn             w15, #1                 // Compare with -1: last + 1 < 0 means last < -1
+        b.ge            .Lv_neg_4x4_no_extend
+
+        // === Reference extension ===
+        // 1. Copy top[-1..4] to ref_tmp[0..5]
+        sub             x16, x1, #1             // top - 1
+        ldr             d0, [x16]               // load 8 bytes (top[-1..6])
+        str             d0, [x14]               // store to ref_tmp[0..7]
+
+        // 2. Load inv_angle
+        movrel          x17, inv_angle_v_neg
+        sxtw            x7, w7                  // sign extend mode index
+        ldrsh           w9, [x17, x7, lsl #1]   // inv_angle (negative)
+
+        // 3. Extend: ref_tmp[x] = left[-1 + ((x * inv_angle + 128) >> 8)]
+        mov             w10, w15                // x = last
+.Lv_neg_4x4_extend:
+        mul             w11, w10, w9            // x * inv_angle
+        add             w11, w11, #128
+        asr             w11, w11, #8            // (x * inv_angle + 128) >> 8
+        sub             w11, w11, #1            // -1 + result
+        ldrb            w12, [x2, w11, sxtw]    // left[index]
+        strb            w12, [x14, w10, sxtw]   // ref_tmp[x]
+        add             w10, w10, #1
+        cmn             w10, #1                 // compare with -1
+        b.le            .Lv_neg_4x4_extend      // loop while x <= -1
+
+        mov             x13, x14                // ref = ref_tmp
+        b               .Lv_neg_4x4_predict
+
+.Lv_neg_4x4_no_extend:
+        sub             x13, x1, #1             // ref = top - 1
+
+.Lv_neg_4x4_predict:
+        // === Standard interpolation loop ===
+        mov             w9, #0                  // y = 0
+        mov             w10, #0                 // angle_acc = 0
+        movi            v18.16b, #32            // constant 32 for NEON-domain weight computation
+
+.Lv_neg_4x4_row:
+        add             w10, w10, w8            // angle_acc = (y+1) * angle
+        asr             w11, w10, #5            // idx
+        and             w12, w10, #31           // fact
+
+        add             x16, x13, w11, sxtw     // ref + idx
+        ldr             s0, [x16, #1]           // ref[idx+1..idx+4]
+        ldr             s1, [x16, #2]           // ref[idx+2..idx+5]
+
+        // Unconditional interpolation
+        // When fact=0: (32*p0 + 0*p1) >> 5 = p0
+        dup             v17.8b, w12
+        sub             v16.8b, v18.8b, v17.8b
+        umull           v20.8h, v0.8b, v16.8b
+        umlal           v20.8h, v1.8b, v17.8b
+        rshrn           v2.8b, v20.8h, #5
+
+        str             s2, [x0]
+        add             x0, x0, x3              // advance to next row
+
+        add             w9, w9, #1
+        cmp             w9, #4
+        b.lt            .Lv_neg_4x4_row
+
+        add             sp, sp, #64
+        ret
+endfunc
+
+// -----------------------------------------------------------------------------
+// pred_angular_v_neg_8x8_8: Vertical reference negative angle prediction (mode 18-25)
+// Arguments:
+// x0: src
+// x1: top
+// x2: left (unused for V reference modes)
+// x3: stride
+// w4: c_idx
+// w5: mode
+// -----------------------------------------------------------------------------
+function ff_hevc_pred_angular_v_neg_8x8_8_neon, export=1
+        // Stack layout: ref_tmp[-32..47] at sp, base at sp+32
+        // last ranges from -7 (mode 19) to -1 (mode 25); data occupies ref_tmp[-7..8]
+        // Allocated 80 bytes (16-byte aligned), offset 32 for negative indexing
+        sub             sp, sp, #80
+        add             x14, sp, #32            // ref_tmp base
+
+        // Load angle
+        movrel          x6, intra_pred_angle_v_neg
+        sub             w7, w5, #18
+        ldrsb           w8, [x6, w7, sxtw]
+
+        // Calculate last = (8 * angle) >> 5
+        mov             w15, #8
+        mul             w15, w15, w8
+        asr             w15, w15, #5
+
+        // Check if extension needed
+        cmn             w15, #1
+        b.ge            .Lv_neg_8x8_no_extend
+
+        // Copy top[-1..8] to ref_tmp[0..9]
+        sub             x16, x1, #1
+        ldr             q0, [x16]               // load 16 bytes
+        str             q0, [x14]
+
+        // Load inv_angle
+        movrel          x17, inv_angle_v_neg
+        sxtw            x7, w7                  // sign extend mode index
+        ldrsh           w9, [x17, x7, lsl #1]
+
+        // Extend loop
+        mov             w10, w15
+.Lv_neg_8x8_extend:
+        mul             w11, w10, w9
+        add             w11, w11, #128
+        asr             w11, w11, #8            // (x * inv_angle + 128) >> 8
+        sub             w11, w11, #1
+        ldrb            w12, [x2, w11, sxtw]
+        strb            w12, [x14, w10, sxtw]
+        add             w10, w10, #1
+        cmn             w10, #1                 // compare with -1
+        b.le            .Lv_neg_8x8_extend      // loop while x <= -1
+
+        mov             x13, x14
+        b               .Lv_neg_8x8_predict
+
+.Lv_neg_8x8_no_extend:
+        sub             x13, x1, #1
+
+.Lv_neg_8x8_predict:
+        mov             w9, #0
+        mov             w10, #0
+        movi            v18.16b, #32            // constant 32 for NEON-domain weight computation
+
+.Lv_neg_8x8_row:
+        add             w10, w10, w8
+        asr             w11, w10, #5
+        and             w12, w10, #31
+
+        add             x16, x13, w11, sxtw
+        ldr             d0, [x16, #1]
+        ldr             d1, [x16, #2]
+
+        // Unconditional interpolation
+        dup             v17.8b, w12
+        sub             v16.8b, v18.8b, v17.8b
+        umull           v20.8h, v0.8b, v16.8b
+        umlal           v20.8h, v1.8b, v17.8b
+        rshrn           v2.8b, v20.8h, #5
+
+        str             d2, [x0]
+        add             x0, x0, x3              // advance to next row
+
+        add             w9, w9, #1
+        cmp             w9, #8
+        b.lt            .Lv_neg_8x8_row
+
+        add             sp, sp, #80
+        ret
+endfunc
+
+// -----------------------------------------------------------------------------
+// pred_angular_v_neg_16x16_8: Vertical reference negative angle prediction (mode 18-25)
+// Arguments:
+// x0: src
+// x1: top
+// x2: left (unused for V reference modes)
+// x3: stride
+// w4: c_idx
+// w5: mode
+// -----------------------------------------------------------------------------
+function ff_hevc_pred_angular_v_neg_16x16_8_neon, export=1
+        // Stack layout: ref_tmp[-64..47] at sp, base at sp+64
+        // last ranges from -13 (mode 19) to -1 (mode 25); data occupies ref_tmp[-13..16]
+        // Allocated 112 bytes (16-byte aligned), offset 64 for negative indexing
+        sub             sp, sp, #112
+        add             x14, sp, #64            // ref_tmp base
+
+        // Load angle
+        movrel          x6, intra_pred_angle_v_neg
+        sub             w7, w5, #18
+        ldrsb           w8, [x6, w7, sxtw]
+
+        // Calculate last = (16 * angle) >> 5
+        mov             w15, #16
+        mul             w15, w15, w8
+        asr             w15, w15, #5
+
+        // Check if extension needed
+        cmn             w15, #1
+        b.ge            .Lv_neg_16x16_no_extend
+
+        // Copy top[-1..16] to ref_tmp[0..17]
+        sub             x16, x1, #1
+        ldp             q0, q1, [x16]           // load 32 bytes
+        stp             q0, q1, [x14]
+
+        // Load inv_angle
+        movrel          x17, inv_angle_v_neg
+        sxtw            x7, w7                  // sign extend mode index
+        ldrsh           w9, [x17, x7, lsl #1]
+
+        // Extend loop
+        mov             w10, w15
+.Lv_neg_16x16_extend:
+        mul             w11, w10, w9
+        add             w11, w11, #128
+        asr             w11, w11, #8            // (x * inv_angle + 128) >> 8
+        sub             w11, w11, #1
+        ldrb            w12, [x2, w11, sxtw]
+        strb            w12, [x14, w10, sxtw]
+        add             w10, w10, #1
+        cmn             w10, #1                 // compare with -1
+        b.le            .Lv_neg_16x16_extend    // loop while x <= -1
+
+        mov             x13, x14
+        b               .Lv_neg_16x16_predict
+
+.Lv_neg_16x16_no_extend:
+        sub             x13, x1, #1
+
+.Lv_neg_16x16_predict:
+        mov             w9, #0
+        mov             w10, #0
+        movi            v18.16b, #32            // constant 32 for NEON-domain weight computation
+
+.Lv_neg_16x16_row:
+        add             w10, w10, w8
+        asr             w11, w10, #5
+        and             w12, w10, #31
+
+        add             x16, x13, w11, sxtw
+        ldr             q0, [x16, #1]
+        ldr             q1, [x16, #2]
+
+        // Unconditional interpolation
+        dup             v17.16b, w12
+        sub             v16.16b, v18.16b, v17.16b
+
+        umull           v20.8h, v0.8b, v16.8b
+        umlal           v20.8h, v1.8b, v17.8b
+        rshrn           v2.8b, v20.8h, #5
+
+        umull2          v21.8h, v0.16b, v16.16b
+        umlal2          v21.8h, v1.16b, v17.16b
+        rshrn2          v2.16b, v21.8h, #5
+
+        str             q2, [x0]
+        add             x0, x0, x3              // advance to next row
+
+        add             w9, w9, #1
+        cmp             w9, #16
+        b.lt            .Lv_neg_16x16_row
+
+        add             sp, sp, #112
+        ret
+endfunc
+
+// -----------------------------------------------------------------------------
+// pred_angular_v_neg_32x32_8: Vertical reference negative angle prediction (mode 18-25)
+// Arguments:
+// x0: src
+// x1: top
+// x2: left (unused for V reference modes)
+// x3: stride
+// w4: c_idx
+// w5: mode
+// -----------------------------------------------------------------------------
+function ff_hevc_pred_angular_v_neg_32x32_8_neon, export=1
+        // Stack layout: ref_tmp[-64..63] at sp, base at sp+64
+        // last ranges from -26 (mode 19) to -2 (mode 25); data occupies ref_tmp[-26..32]
+        // Allocated 128 bytes (16-byte aligned), offset 64 covers the 64-byte ldp/stp copy
+        sub             sp, sp, #128
+        add             x14, sp, #64            // ref_tmp base
+
+        // Load angle
+        movrel          x6, intra_pred_angle_v_neg
+        sub             w7, w5, #18
+        ldrsb           w8, [x6, w7, sxtw]
+
+        // Calculate last = (32 * angle) >> 5
+        mov             w15, #32
+        mul             w15, w15, w8
+        asr             w15, w15, #5
+
+        // Check if extension needed
+        cmn             w15, #1
+        b.ge            .Lv_neg_32x32_no_extend
+
+        // Copy top[-1..32] to ref_tmp[0..33]
+        sub             x16, x1, #1
+        ldp             q0, q1, [x16]
+        stp             q0, q1, [x14]
+        ldp             q2, q3, [x16, #32]
+        stp             q2, q3, [x14, #32]
+
+        // Load inv_angle
+        movrel          x17, inv_angle_v_neg
+        sxtw            x7, w7                  // sign extend mode index
+        ldrsh           w9, [x17, x7, lsl #1]
+
+        // Extend loop
+        mov             w10, w15
+.Lv_neg_32x32_extend:
+        mul             w11, w10, w9
+        add             w11, w11, #128
+        asr             w11, w11, #8            // (x * inv_angle + 128) >> 8
+        sub             w11, w11, #1
+        ldrb            w12, [x2, w11, sxtw]
+        strb            w12, [x14, w10, sxtw]
+        add             w10, w10, #1
+        cmn             w10, #1                 // compare with -1
+        b.le            .Lv_neg_32x32_extend    // loop while x <= -1
+
+        mov             x13, x14
+        b               .Lv_neg_32x32_predict
+
+.Lv_neg_32x32_no_extend:
+        sub             x13, x1, #1
+
+.Lv_neg_32x32_predict:
+        mov             w9, #0
+        mov             w10, #0
+        movi            v18.16b, #32            // constant 32 for NEON-domain weight computation
+
+.Lv_neg_32x32_row:
+        add             w10, w10, w8
+        asr             w11, w10, #5
+        and             w12, w10, #31
+
+        add             x16, x13, w11, sxtw
+
+        // Load the 33 reference bytes ref[idx+1..idx+33] needed for interpolation
+        ldr             q0, [x16, #1]
+        ldr             q1, [x16, #2]
+        ldr             q2, [x16, #17]
+        ldr             q3, [x16, #18]
+
+        // Unconditional interpolation
+        dup             v17.16b, w12
+        sub             v16.16b, v18.16b, v17.16b
+
+        // First 16 bytes
+        umull           v20.8h, v0.8b, v16.8b
+        umlal           v20.8h, v1.8b, v17.8b
+        rshrn           v4.8b, v20.8h, #5
+
+        umull2          v21.8h, v0.16b, v16.16b
+        umlal2          v21.8h, v1.16b, v17.16b
+        rshrn2          v4.16b, v21.8h, #5
+
+        // Second 16 bytes
+        umull           v22.8h, v2.8b, v16.8b
+        umlal           v22.8h, v3.8b, v17.8b
+        rshrn           v5.8b, v22.8h, #5
+
+        umull2          v23.8h, v2.16b, v16.16b
+        umlal2          v23.8h, v3.16b, v17.16b
+        rshrn2          v5.16b, v23.8h, #5
+
+        st1             {v4.16b, v5.16b}, [x0], x3
+
+        add             w9, w9, #1
+        cmp             w9, #32
+        b.lt            .Lv_neg_32x32_row
+
+        add             sp, sp, #128
+        ret
+endfunc
+
+// =============================================================================
+// Angular Prediction - Horizontal reference modes, negative angles (Mode 11-17)
+// =============================================================================
+
-- 
2.52.0


From a9c018826c91073f08637e157d39ec1ad35f8458 Mon Sep 17 00:00:00 2001
From: Jun Zhao <barryjzhao@tencent.com>
Date: Tue, 27 Jan 2026 18:54:51 +0800
Subject: [PATCH 7/8] lavc/hevc: add aarch64 NEON for angular H negative (modes
 11-17)

Add NEON-optimized implementations for HEVC angular intra prediction
modes 11-17 (horizontal negative angles) at 8-bit depth.

These modes use the left reference with negative angles, requiring
reference extension from top samples when the projected index falls
below -1:
- idx = ((x+1) * angle) >> 5 (negative, since angle < 0)
- Extended reference: ref[y] = top[-1 + ((y * inv_angle + 128) >> 8)]

Uses batch column computation with matrix transpose to convert
column-oriented interpolation results into contiguous row stores.

Supports all block sizes (4x4, 8x8, 16x16, 32x32).
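
For reference, the scalar algorithm these kernels vectorize can be
sketched in C roughly as follows. This is an illustrative model only:
the function name and argument layout are invented for the sketch and
do not match the lavc API, and it assumes arithmetic right shift on
negative integers, mirroring the asr instructions in the assembly.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of HEVC angular intra prediction for the horizontal
 * negative-angle modes 11-17, 8-bit samples.  left/top point one past
 * the corner sample, so left[-1] == top[-1] is the corner. */
void pred_angular_h_neg_c(uint8_t *dst, ptrdiff_t stride,
                          const uint8_t *top, const uint8_t *left,
                          int size, int mode)
{
    static const int8_t  angle_tab[7]     = { -2, -5, -9, -13, -17, -21, -26 };
    static const int16_t inv_angle_tab[7] = { -4096, -1638, -910, -630,
                                              -482, -390, -315 };
    const int angle     = angle_tab[mode - 11];
    const int inv_angle = inv_angle_tab[mode - 11];
    const int last      = (size * angle) >> 5;  /* most negative index used */
    uint8_t ref_buf[96];
    uint8_t *ref_tmp = ref_buf + 32;            /* room for ref_tmp[last..] */
    const uint8_t *ref;

    if (last < -1) {
        /* Copy left[-1..size-1] to ref_tmp[0..size], then extend
         * downwards with projected top samples:
         * ref_tmp[x] = top[-1 + ((x * inv_angle + 128) >> 8)]           */
        memcpy(ref_tmp, left - 1, size + 1);
        for (int x = last; x <= -1; x++)
            ref_tmp[x] = top[-1 + ((x * inv_angle + 128) >> 8)];
        ref = ref_tmp;
    } else {
        ref = left - 1;                         /* ref[0] == left[-1] */
    }

    for (int x = 0; x < size; x++) {            /* one column per step */
        const int pos  = (x + 1) * angle;
        const int idx  = pos >> 5;
        const int fact = pos & 31;
        for (int y = 0; y < size; y++)
            dst[y * stride + x] =
                ((32 - fact) * ref[y + idx + 1] +
                 fact        * ref[y + idx + 2] + 16) >> 5;
    }
}
```

The NEON versions compute one column's interpolated pixels per vector
(the inner y loop above collapses into a umull/umlal/rshrn sequence),
then transpose the columns so each row can be stored contiguously.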

Speedup over C on Apple M4 (checkasm --bench, 10-run average):

    4x4:   1.78x - 2.30x    8x8:   2.80x - 3.44x
    16x16: 4.54x - 5.68x    32x32: 7.63x - 8.27x

Signed-off-by: Jun Zhao <barryjzhao@tencent.com>
---
 libavcodec/aarch64/hevcpred_init_aarch64.c |  22 +
 libavcodec/aarch64/hevcpred_neon.S         | 747 +++++++++++++++++----
 2 files changed, 648 insertions(+), 121 deletions(-)

diff --git a/libavcodec/aarch64/hevcpred_init_aarch64.c b/libavcodec/aarch64/hevcpred_init_aarch64.c
index 73b3031650..f523abd97d 100644
--- a/libavcodec/aarch64/hevcpred_init_aarch64.c
+++ b/libavcodec/aarch64/hevcpred_init_aarch64.c
@@ -113,6 +113,20 @@ void ff_hevc_pred_angular_v_neg_32x32_8_neon(uint8_t *src, const uint8_t *top,
                                             const uint8_t *left, ptrdiff_t stride,
                                             int c_idx, int mode);
 
+// Negative angle horizontal modes (mode 11-17)
+void ff_hevc_pred_angular_h_neg_4x4_8_neon(uint8_t *src, const uint8_t *top,
+                                          const uint8_t *left, ptrdiff_t stride,
+                                          int c_idx, int mode);
+void ff_hevc_pred_angular_h_neg_8x8_8_neon(uint8_t *src, const uint8_t *top,
+                                          const uint8_t *left, ptrdiff_t stride,
+                                          int c_idx, int mode);
+void ff_hevc_pred_angular_h_neg_16x16_8_neon(uint8_t *src, const uint8_t *top,
+                                            const uint8_t *left, ptrdiff_t stride,
+                                            int c_idx, int mode);
+void ff_hevc_pred_angular_h_neg_32x32_8_neon(uint8_t *src, const uint8_t *top,
+                                            const uint8_t *left, ptrdiff_t stride,
+                                            int c_idx, int mode);
+
 static void pred_dc_neon(uint8_t *src, const uint8_t *top,
                          const uint8_t *left, ptrdiff_t stride,
                          int log2_size, int c_idx)
@@ -154,6 +168,8 @@ static void pred_angular_0_neon(uint8_t *src, const uint8_t *top,
         ff_hevc_pred_angular_v_pos_4x4_8_neon(src, top, left, stride, c_idx, mode);
     } else if (mode > 18 && mode <= 25) {
         ff_hevc_pred_angular_v_neg_4x4_8_neon(src, top, left, stride, c_idx, mode);
+    } else if (mode >= 11 && mode <= 17) {
+        ff_hevc_pred_angular_h_neg_4x4_8_neon(src, top, left, stride, c_idx, mode);
     } else if (mode >= 2 && mode <= 9) {
         ff_hevc_pred_angular_h_pos_4x4_8_neon(src, top, left, stride, c_idx, mode);
     } else {
@@ -175,6 +191,8 @@ static void pred_angular_1_neon(uint8_t *src, const uint8_t *top,
         ff_hevc_pred_angular_v_pos_8x8_8_neon(src, top, left, stride, c_idx, mode);
     } else if (mode > 18 && mode <= 25) {
         ff_hevc_pred_angular_v_neg_8x8_8_neon(src, top, left, stride, c_idx, mode);
+    } else if (mode >= 11 && mode <= 17) {
+        ff_hevc_pred_angular_h_neg_8x8_8_neon(src, top, left, stride, c_idx, mode);
     } else if (mode >= 2 && mode <= 9) {
         ff_hevc_pred_angular_h_pos_8x8_8_neon(src, top, left, stride, c_idx, mode);
     } else {
@@ -196,6 +214,8 @@ static void pred_angular_2_neon(uint8_t *src, const uint8_t *top,
         ff_hevc_pred_angular_v_pos_16x16_8_neon(src, top, left, stride, c_idx, mode);
     } else if (mode > 18 && mode <= 25) {
         ff_hevc_pred_angular_v_neg_16x16_8_neon(src, top, left, stride, c_idx, mode);
+    } else if (mode >= 11 && mode <= 17) {
+        ff_hevc_pred_angular_h_neg_16x16_8_neon(src, top, left, stride, c_idx, mode);
     } else if (mode >= 2 && mode <= 9) {
         ff_hevc_pred_angular_h_pos_16x16_8_neon(src, top, left, stride, c_idx, mode);
     } else {
@@ -217,6 +237,8 @@ static void pred_angular_3_neon(uint8_t *src, const uint8_t *top,
         ff_hevc_pred_angular_v_pos_32x32_8_neon(src, top, left, stride, c_idx, mode);
     } else if (mode > 18 && mode <= 25) {
         ff_hevc_pred_angular_v_neg_32x32_8_neon(src, top, left, stride, c_idx, mode);
+    } else if (mode >= 11 && mode <= 17) {
+        ff_hevc_pred_angular_h_neg_32x32_8_neon(src, top, left, stride, c_idx, mode);
     } else if (mode >= 2 && mode <= 9) {
         ff_hevc_pred_angular_h_pos_32x32_8_neon(src, top, left, stride, c_idx, mode);
     } else {
diff --git a/libavcodec/aarch64/hevcpred_neon.S b/libavcodec/aarch64/hevcpred_neon.S
index 845e749087..590ec7d0cd 100644
--- a/libavcodec/aarch64/hevcpred_neon.S
+++ b/libavcodec/aarch64/hevcpred_neon.S
@@ -724,8 +724,8 @@ function ff_hevc_pred_planar_32x32_8_neon, export=1
 .purgem planar32_row
 
         ldp             d14, d15, [sp, #48]
-        ldp             d10, d11, [sp, #16]
         ldp             d12, d13, [sp, #32]
+        ldp             d10, d11, [sp, #16]
         ldp             d8, d9, [sp], #64
         ret
 endfunc
@@ -974,22 +974,9 @@ function ff_hevc_pred_angular_mode_26_8_neon, export=1
         sqxtun          v3.8b, v3.8h           // Saturate back to bytes
         sqxtun2         v3.16b, v4.8h
         
-        st1             {v3.b}[0], [x0], x3
-        st1             {v3.b}[1], [x0], x3
-        st1             {v3.b}[2], [x0], x3
-        st1             {v3.b}[3], [x0], x3
-        st1             {v3.b}[4], [x0], x3
-        st1             {v3.b}[5], [x0], x3
-        st1             {v3.b}[6], [x0], x3
-        st1             {v3.b}[7], [x0], x3
-        st1             {v3.b}[8], [x0], x3
-        st1             {v3.b}[9], [x0], x3
-        st1             {v3.b}[10], [x0], x3
-        st1             {v3.b}[11], [x0], x3
-        st1             {v3.b}[12], [x0], x3
-        st1             {v3.b}[13], [x0], x3
-        st1             {v3.b}[14], [x0], x3
-        st1             {v3.b}[15], [x0], x3
+.irp n, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
+        st1             {v3.b}[\n], [x0], x3
+.endr
         b               9f
         
 224:    ldr             s2, [x2]
@@ -998,10 +985,9 @@ function ff_hevc_pred_angular_mode_26_8_neon, export=1
         sshr            v3.8h, v3.8h, #1       // Arithmetic shift right by 1
         uaddw           v3.8h, v3.8h, v5.8b    // Add top[0] from v5 (unsigned extend, top[0] is unsigned)
         sqxtun          v3.8b, v3.8h           // Saturate back to bytes
-        st1             {v3.b}[0], [x0], x3
-        st1             {v3.b}[1], [x0], x3
-        st1             {v3.b}[2], [x0], x3
-        st1             {v3.b}[3], [x0], x3
+.irp n, 0, 1, 2, 3
+        st1             {v3.b}[\n], [x0], x3
+.endr
         b               9f
 
 228:    ldr             d2, [x2]
@@ -1010,14 +996,9 @@ function ff_hevc_pred_angular_mode_26_8_neon, export=1
         sshr            v3.8h, v3.8h, #1       // Arithmetic shift right by 1
         uaddw           v3.8h, v3.8h, v5.8b    // Add top[0] from v5 (unsigned extend, top[0] is unsigned)
         sqxtun          v3.8b, v3.8h           // Saturate back to bytes
-        st1             {v3.b}[0], [x0], x3
-        st1             {v3.b}[1], [x0], x3
-        st1             {v3.b}[2], [x0], x3
-        st1             {v3.b}[3], [x0], x3
-        st1             {v3.b}[4], [x0], x3
-        st1             {v3.b}[5], [x0], x3
-        st1             {v3.b}[6], [x0], x3
-        st1             {v3.b}[7], [x0], x3
+.irp n, 0, 1, 2, 3, 4, 5, 6, 7
+        st1             {v3.b}[\n], [x0], x3
+.endr
         b               9f
 
 9:      ret
@@ -1088,37 +1069,16 @@ function ff_hevc_pred_angular_mode_18_8x8_8_neon, export=1
         mov             v2.d[1], v1.d[0]        // v2[8..15] = top[-1..6]
 
         // Row 0: ref[0..7] = top[-1..6] = v2[8..15]
-        mov             v3.d[0], v2.d[1]
-        str             d3, [x0]
-        add             x0, x0, x3
+        st1             {v2.d}[1], [x0], x3
 
         // Row 1-7: use ext with decreasing offset
-        ext             v3.16b, v2.16b, v2.16b, #7
+.irp offset, 7, 6, 5, 4, 3, 2, 1
+        ext             v3.16b, v2.16b, v2.16b, #\offset
         str             d3, [x0]
+  .ifnc \offset, 1
         add             x0, x0, x3
-
-        ext             v3.16b, v2.16b, v2.16b, #6
-        str             d3, [x0]
-        add             x0, x0, x3
-
-        ext             v3.16b, v2.16b, v2.16b, #5
-        str             d3, [x0]
-        add             x0, x0, x3
-
-        ext             v3.16b, v2.16b, v2.16b, #4
-        str             d3, [x0]
-        add             x0, x0, x3
-
-        ext             v3.16b, v2.16b, v2.16b, #3
-        str             d3, [x0]
-        add             x0, x0, x3
-
-        ext             v3.16b, v2.16b, v2.16b, #2
-        str             d3, [x0]
-        add             x0, x0, x3
-
-        ext             v3.16b, v2.16b, v2.16b, #1
-        str             d3, [x0]
+  .endif
+.endr
 
         ret
 endfunc
@@ -1141,65 +1101,14 @@ function ff_hevc_pred_angular_mode_18_16x16_8_neon, export=1
         // Row 0: ref[0..15] = v1
         str             q1, [x0]
         add             x0, x0, x3
-        // Row 1: EXT(v0, v1, #15) = {v0[15], v1[0..14]} = {left[0], top[-1..13]}
-        ext             v2.16b, v0.16b, v1.16b, #15
+        // Row 1-15: EXT(v0, v1, #N) slides window across {v0:v1}
+.irp offset, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1
+        ext             v2.16b, v0.16b, v1.16b, #\offset
         str             q2, [x0]
+  .ifnc \offset, 1
         add             x0, x0, x3
-        // Row 2
-        ext             v2.16b, v0.16b, v1.16b, #14
-        str             q2, [x0]
-        add             x0, x0, x3
-        // Row 3
-        ext             v2.16b, v0.16b, v1.16b, #13
-        str             q2, [x0]
-        add             x0, x0, x3
-        // Row 4
-        ext             v2.16b, v0.16b, v1.16b, #12
-        str             q2, [x0]
-        add             x0, x0, x3
-        // Row 5
-        ext             v2.16b, v0.16b, v1.16b, #11
-        str             q2, [x0]
-        add             x0, x0, x3
-        // Row 6
-        ext             v2.16b, v0.16b, v1.16b, #10
-        str             q2, [x0]
-        add             x0, x0, x3
-        // Row 7
-        ext             v2.16b, v0.16b, v1.16b, #9
-        str             q2, [x0]
-        add             x0, x0, x3
-        // Row 8
-        ext             v2.16b, v0.16b, v1.16b, #8
-        str             q2, [x0]
-        add             x0, x0, x3
-        // Row 9
-        ext             v2.16b, v0.16b, v1.16b, #7
-        str             q2, [x0]
-        add             x0, x0, x3
-        // Row 10
-        ext             v2.16b, v0.16b, v1.16b, #6
-        str             q2, [x0]
-        add             x0, x0, x3
-        // Row 11
-        ext             v2.16b, v0.16b, v1.16b, #5
-        str             q2, [x0]
-        add             x0, x0, x3
-        // Row 12
-        ext             v2.16b, v0.16b, v1.16b, #4
-        str             q2, [x0]
-        add             x0, x0, x3
-        // Row 13
-        ext             v2.16b, v0.16b, v1.16b, #3
-        str             q2, [x0]
-        add             x0, x0, x3
-        // Row 14
-        ext             v2.16b, v0.16b, v1.16b, #2
-        str             q2, [x0]
-        add             x0, x0, x3
-        // Row 15
-        ext             v2.16b, v0.16b, v1.16b, #1
-        str             q2, [x0]
+  .endif
+.endr
 
         ret
 endfunc
@@ -1551,6 +1460,11 @@ function ff_hevc_pred_angular_v_pos_32x32_8_neon, export=1
         b.lt            .Lv_pos_32x32_mode34_loop
         ret
 endfunc
+
+// =============================================================================
+// Angular Prediction - Horizontal reference modes, positive angle (Mode 2-9)
+// =============================================================================
+
 const intra_pred_angle_h, align=4
         .byte   32      // mode 2
         .byte   26      // mode 3
@@ -2055,11 +1969,11 @@ const inv_angle_v_neg, align=4
 endconst
 
 // -----------------------------------------------------------------------------
-// pred_angular_v_neg_4x4_8: Vertical reference negative angle prediction (mode 18-25)
+// pred_angular_v_neg_4x4_8: Vertical reference negative angle prediction (mode 19-25)
 // Arguments:
 // x0: src
 // x1: top
-// x2: left (unused for V reference modes)
+// x2: left (used for reference extension)
 // x3: stride
 // w4: c_idx
 // w5: mode
@@ -2150,11 +2064,11 @@ function ff_hevc_pred_angular_v_neg_4x4_8_neon, export=1
 endfunc
 
 // -----------------------------------------------------------------------------
-// pred_angular_v_neg_8x8_8: Vertical reference negative angle prediction (mode 18-25)
+// pred_angular_v_neg_8x8_8: Vertical reference negative angle prediction (mode 19-25)
 // Arguments:
 // x0: src
 // x1: top
-// x2: left (unused for V reference modes)
+// x2: left (used for reference extension)
 // x3: stride
 // w4: c_idx
 // w5: mode
@@ -2242,11 +2156,11 @@ function ff_hevc_pred_angular_v_neg_8x8_8_neon, export=1
 endfunc
 
 // -----------------------------------------------------------------------------
-// pred_angular_v_neg_16x16_8: Vertical reference negative angle prediction (mode 18-25)
+// pred_angular_v_neg_16x16_8: Vertical reference negative angle prediction (mode 19-25)
 // Arguments:
 // x0: src
 // x1: top
-// x2: left (unused for V reference modes)
+// x2: left (used for reference extension)
 // x3: stride
 // w4: c_idx
 // w5: mode
@@ -2339,11 +2253,11 @@ function ff_hevc_pred_angular_v_neg_16x16_8_neon, export=1
 endfunc
 
 // -----------------------------------------------------------------------------
-// pred_angular_v_neg_32x32_8: Vertical reference negative angle prediction (mode 18-25)
+// pred_angular_v_neg_32x32_8: Vertical reference negative angle prediction (mode 19-25)
 // Arguments:
 // x0: src
 // x1: top
-// x2: left (unused for V reference modes)
+// x2: left (used for reference extension)
 // x3: stride
 // w4: c_idx
 // w5: mode
@@ -2454,3 +2368,594 @@ endfunc
 // Angular Prediction - Horizontal reference modes, negative angles (Mode 11-17)
 // =============================================================================
 
+// Angle table for H reference negative angles (mode 11-17)
+// angle = intra_pred_angle_h_neg[mode - 11]
+// mode 11: -2, mode 12: -5, mode 13: -9, mode 14: -13,
+// mode 15: -17, mode 16: -21, mode 17: -26
+const intra_pred_angle_h_neg, align=4
+        .byte   -2      // mode 11
+        .byte   -5      // mode 12
+        .byte   -9      // mode 13
+        .byte   -13     // mode 14
+        .byte   -17     // mode 15
+        .byte   -21     // mode 16
+        .byte   -26     // mode 17
+endconst
+
+// inv_angle table for reference extension (16-bit values)
+// inv_angle = inv_angle_h_neg[mode - 11]
+// Used to calculate: ref_tmp[x] = top[-1 + ((x * inv_angle + 128) >> 8)]
+const inv_angle_h_neg, align=4
+        .short  -4096   // mode 11: inv_angle[0]
+        .short  -1638   // mode 12: inv_angle[1]
+        .short  -910    // mode 13: inv_angle[2]
+        .short  -630    // mode 14: inv_angle[3]
+        .short  -482    // mode 15: inv_angle[4]
+        .short  -390    // mode 16: inv_angle[5]
+        .short  -315    // mode 17: inv_angle[6]
+endconst
+
+// -----------------------------------------------------------------------------
+// pred_angular_h_neg_4x4_8: Horizontal reference negative angle prediction (mode 11-17)
+// Arguments:
+// x0: src
+// x1: top (used for reference extension)
+// x2: left
+// x3: stride
+// w4: c_idx
+// w5: mode
+// -----------------------------------------------------------------------------
+function ff_hevc_pred_angular_h_neg_4x4_8_neon, export=1
+        str             d15, [sp, #-16]!
+        // Stack layout: ref_tmp[-64..47] at sp, base at sp+64
+        // last ranges from -4 (mode 17) to -1 (mode 11); data occupies ref_tmp[-4..4]
+        // Allocated 112 bytes (16-byte aligned), offset 64 for negative indexing
+        sub             sp, sp, #112
+        add             x14, sp, #64            // ref_tmp base
+
+        // Load angle from table
+        movrel          x6, intra_pred_angle_h_neg
+        sub             w7, w5, #11             // mode - 11
+        ldrsb           w8, [x6, w7, sxtw]      // angle (negative)
+
+        // Calculate last = (4 * angle) >> 5
+        mov             w15, #4
+        mul             w15, w15, w8
+        asr             w15, w15, #5            // last (negative)
+
+        // Check if extension needed (last < -1)
+        cmn             w15, #1                 // Compare with -1: last + 1 < 0 means last < -1
+        b.ge            .Lh_neg_4x4_no_extend
+
+        // === Reference extension ===
+        // 1. Copy left[-1..4] to ref_tmp[0..5]
+        sub             x16, x2, #1             // left - 1
+        ldr             d0, [x16]               // load 8 bytes (left[-1..6])
+        str             d0, [x14]               // store to ref_tmp[0..7]
+
+        // 2. Load inv_angle
+        movrel          x17, inv_angle_h_neg
+        sxtw            x7, w7                  // sign extend mode index
+        ldrsh           w9, [x17, x7, lsl #1]   // inv_angle (negative)
+
+        // 3. Extend: ref_tmp[x] = top[-1 + ((x * inv_angle + 128) >> 8)]
+        mov             w10, w15                // x = last
+.Lh_neg_4x4_extend:
+        mul             w11, w10, w9            // x * inv_angle
+        add             w11, w11, #128
+        asr             w11, w11, #8            // (x * inv_angle + 128) >> 8
+        sub             w11, w11, #1            // -1 + result
+        ldrb            w12, [x1, w11, sxtw]    // top[index]
+        strb            w12, [x14, w10, sxtw]   // ref_tmp[x]
+        add             w10, w10, #1
+        cmn             w10, #1                 // compare with -1
+        b.le            .Lh_neg_4x4_extend      // loop while x <= -1
+
+        mov             x13, x14                // ref = ref_tmp
+        b               .Lh_neg_4x4_predict
+
+.Lh_neg_4x4_no_extend:
+        sub             x13, x2, #1             // ref = left - 1
+
+.Lh_neg_4x4_predict:
+        // === Fully unrolled 4-column computation ===
+        // Each column produces 4 interpolated pixels in the low 4 bytes of v0-v3.
+        // After computing, transpose 4x4 and store as contiguous rows.
+        mov             w10, #0                 // angle_acc
+        movi            v15.16b, #32            // constant 32 for NEON-domain weight computation
+
+.macro h_neg_4x4_col dst
+        add             w10, w10, w8
+        asr             w11, w10, #5
+        and             w12, w10, #31
+        add             x16, x13, w11, sxtw
+        ldr             d18, [x16, #1]          // ref[idx+1..idx+8]
+        ldr             d19, [x16, #2]          // ref[idx+2..idx+9]
+        dup             v17.8b, w12
+        sub             v16.8b, v15.8b, v17.8b
+        umull           v20.8h, v18.8b, v16.8b
+        umlal           v20.8h, v19.8b, v17.8b
+        rshrn           \dst\().8b, v20.8h, #5
+.endm
+        h_neg_4x4_col  v0              // col0: v0.8b[0..3] = rows 0-3
+        h_neg_4x4_col  v1              // col1
+        h_neg_4x4_col  v2              // col2
+        h_neg_4x4_col  v3              // col3
+.purgem h_neg_4x4_col
+
+        // === 4x4 byte matrix transpose ===
+        // Input:  v0=col0, v1=col1, v2=col2, v3=col3 (low 4 bytes each)
+        // Output: v0=row0, v1=row1, v2=row2, v3=row3 (low 4 bytes each)
+        transpose_4x8B  v0, v1, v2, v3, v16, v17, v18, v19
+
+        // === Store 4 rows with 4-byte writes ===
+        str             s0, [x0]
+        add             x0, x0, x3
+        str             s1, [x0]
+        add             x0, x0, x3
+        str             s2, [x0]
+        add             x0, x0, x3
+        str             s3, [x0]
+
+        add             sp, sp, #112
+        ldr             d15, [sp], #16
+        ret
+endfunc
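For reference, the interpolation each h_neg_*_col invocation performs matches this scalar C model (illustrative sketch, not part of the patch; names are the editor's): the accumulator has been advanced by the negative angle, its integer part selects the reference window and its 5-bit fraction weights a 2-tap filter, i.e. `((32-fact)*ref[idx+1+y] + fact*ref[idx+2+y] + 16) >> 5`.

```c
#include <stdint.h>

/* Scalar model of one h_neg_*_col step (illustrative names).
 * angle_acc has already been advanced by the (negative) angle for this
 * column; idx is its integer part, fact its 5-bit fraction.  Assumes
 * arithmetic >> on negative ints, matching the asm's asr. */
static void h_neg_col_c(uint8_t *col, const uint8_t *ref,
                        int angle_acc, int n)
{
    int idx  = angle_acc >> 5;   /* integer reference step            */
    int fact = angle_acc & 31;   /* fractional weight in [0, 31]      */

    for (int y = 0; y < n; y++)  /* 2-tap filter, rounded (the +16)   */
        col[y] = ((32 - fact) * ref[idx + 1 + y] +
                  fact        * ref[idx + 2 + y] + 16) >> 5;
}
```

In the NEON version the `(32 - fact)` / `fact` weights live in v16/v17, the two shifted reference loads are d18/d19 (or q18/q19), and the `+16 >> 5` rounding is folded into `rshrn`.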
+
+// -----------------------------------------------------------------------------
+// pred_angular_h_neg_8x8_8: Horizontal reference negative angle prediction (mode 11-17)
+//
+// Fully unrolled 8-column computation with 8x8 byte matrix transpose.
+// Each column interpolation result is placed directly into v0-v7, then
+// transposed so rows can be stored with contiguous 8-byte writes.
+//
+// Arguments:
+// x0: src
+// x1: top (used for reference extension)
+// x2: left
+// x3: stride
+// w4: c_idx
+// w5: mode
+// -----------------------------------------------------------------------------
+function ff_hevc_pred_angular_h_neg_8x8_8_neon, export=1
+        str             d15, [sp, #-16]!
+        // Stack layout: ref_tmp[-64..47] at sp, base at sp+64
+        // last ranges from -7 (mode 17) to -1 (mode 11); data occupies ref_tmp[-7..8]
+        // Allocated 112 bytes (16-byte aligned), offset 64 for negative indexing
+        sub             sp, sp, #112
+        add             x14, sp, #64            // ref_tmp base
+
+        // Load angle from table
+        movrel          x6, intra_pred_angle_h_neg
+        sub             w7, w5, #11             // mode - 11
+        ldrsb           w8, [x6, w7, sxtw]      // angle (negative)
+
+        // Calculate last = (8 * angle) >> 5
+        mov             w15, #8
+        mul             w15, w15, w8
+        asr             w15, w15, #5            // last (negative or zero)
+
+        // Check if extension needed (last < -1)
+        cmn             w15, #1
+        b.ge            .Lh_neg_8x8_no_extend
+
+        // Copy left[-1..8] to ref_tmp[0..9]
+        sub             x16, x2, #1
+        ldr             q0, [x16]
+        str             q0, [x14]
+
+        // Load inv_angle
+        movrel          x17, inv_angle_h_neg
+        sxtw            x7, w7
+        ldrsh           w9, [x17, x7, lsl #1]
+
+        // Scalar extend loop: ref_tmp[x] = top[-1 + ((x * inv_angle + 128) >> 8)]
+        mov             w10, w15                // x = last (negative)
+.Lh_neg_8x8_extend:
+        mul             w11, w10, w9
+        add             w11, w11, #128
+        asr             w11, w11, #8
+        sub             w11, w11, #1
+        ldrb            w12, [x1, w11, sxtw]
+        strb            w12, [x14, w10, sxtw]
+        add             w10, w10, #1
+        cmn             w10, #1
+        b.le            .Lh_neg_8x8_extend
+
+        mov             x13, x14                // ref = ref_tmp
+        b               .Lh_neg_8x8_predict
+
+.Lh_neg_8x8_no_extend:
+        sub             x13, x2, #1             // ref = left - 1
+
+.Lh_neg_8x8_predict:
+        // === Fully unrolled 8-column computation ===
+        // Each column produces 8 interpolated pixels stored in v0-v7.
+        mov             w10, #0                 // angle_acc
+        movi            v15.16b, #32            // constant 32 for NEON-domain weight computation
+
+        // Interpolation macro body (inlined 8 times, targeting v0-v7):
+        //   angle_acc += angle; idx = angle_acc >> 5; fact = angle_acc & 31
+        //   load ref[idx+1..idx+8], ref[idx+2..idx+9]
+        //   result = ((32-fact)*a + fact*b + 16) >> 5
+
+.macro h_neg_8x8_col dst
+        add             w10, w10, w8
+        asr             w11, w10, #5
+        and             w12, w10, #31
+        add             x16, x13, w11, sxtw
+        ldr             d18, [x16, #1]
+        ldr             d19, [x16, #2]
+        dup             v17.8b, w12
+        sub             v16.8b, v15.8b, v17.8b
+        umull           v20.8h, v18.8b, v16.8b
+        umlal           v20.8h, v19.8b, v17.8b
+        rshrn           \dst\().8b, v20.8h, #5
+.endm
+        h_neg_8x8_col  v0
+        h_neg_8x8_col  v1
+        h_neg_8x8_col  v2
+        h_neg_8x8_col  v3
+        h_neg_8x8_col  v4
+        h_neg_8x8_col  v5
+        h_neg_8x8_col  v6
+        h_neg_8x8_col  v7
+.purgem h_neg_8x8_col
+
+        // === 8x8 byte matrix transpose ===
+        // Input:  v0=col0 ... v7=col7 (each 8B: rows 0-7 of that column)
+        // Output: v0=row0 ... v7=row7 (each 8B: cols 0-7 of that row)
+        transpose_8x8B  v0, v1, v2, v3, v4, v5, v6, v7, v16, v17
+
+        // === Store 8 rows contiguously ===
+        str             d0, [x0]
+        add             x0, x0, x3
+        str             d1, [x0]
+        add             x0, x0, x3
+        str             d2, [x0]
+        add             x0, x0, x3
+        str             d3, [x0]
+        add             x0, x0, x3
+        str             d4, [x0]
+        add             x0, x0, x3
+        str             d5, [x0]
+        add             x0, x0, x3
+        str             d6, [x0]
+        add             x0, x0, x3
+        str             d7, [x0]
+
+        add             sp, sp, #112
+        ldr             d15, [sp], #16
+        ret
+endfunc
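The reference-extension loops above can be summarized by this scalar sketch (illustrative names, not part of the patch): left[-1..size-1] is copied to the non-negative side of ref_tmp, and when `last < -1` the missing samples before left[-1] are projected from the top row through inv_angle, exactly as the asm comment states.

```c
#include <stdint.h>

/* Scalar model of the .Lh_neg_*_extend path (illustrative names).
 * ref_tmp points at the slot holding left[-1] and has room for negative
 * indices down to last.  Assumes arithmetic >> on negative ints, like
 * the asm's asr; inv_angle values are negative (e.g. -315 for angle -26
 * in FFmpeg's table). */
static void extend_ref_h_neg_c(uint8_t *ref_tmp, const uint8_t *left,
                               const uint8_t *top, int size,
                               int angle, int inv_angle)
{
    int last = (size * angle) >> 5;        /* most negative index needed */

    for (int i = 0; i <= size; i++)        /* copy left[-1 .. size-1]    */
        ref_tmp[i] = left[i - 1];

    if (last < -1)                         /* extension needed           */
        for (int x = last; x <= -1; x++)
            ref_tmp[x] = top[-1 + ((x * inv_angle + 128) >> 8)];
}
```

The asm keeps the same structure: a bulk vector copy for the positive side, then a short scalar loop for the handful of projected samples.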
+
+// -----------------------------------------------------------------------------
+// pred_angular_h_neg_16x16_8: Horizontal reference negative angle prediction (mode 11-17)
+//
+// Process 16 columns in two batches of 8. Each column produces a 16-byte
+// vector (16 rows). transpose_8x16B transposes each batch, then rows are
+// stored with contiguous 16-byte writes by combining both halves.
+//
+// Arguments:
+// x0: src
+// x1: top (used for reference extension)
+// x2: left
+// x3: stride
+// w4: c_idx
+// w5: mode
+// -----------------------------------------------------------------------------
+function ff_hevc_pred_angular_h_neg_16x16_8_neon, export=1
+        stp             x19, x20, [sp, #-16]!
+        str             d15, [sp, #-16]!
+        // Stack layout: ref_tmp[-64..79] at sp, base at sp+64
+        // last ranges from -13 (mode 17) to -1 (mode 11); data occupies ref_tmp[-13..16]
+        // Allocated 144 bytes (16-byte aligned), offset 64 for negative indexing
+        sub             sp, sp, #144
+        add             x14, sp, #64            // ref_tmp base
+
+        // Load angle
+        movrel          x6, intra_pred_angle_h_neg
+        sub             w7, w5, #11
+        ldrsb           w8, [x6, w7, sxtw]
+
+        // Calculate last = (16 * angle) >> 5
+        mov             w15, #16
+        mul             w15, w15, w8
+        asr             w15, w15, #5
+
+        // Check if extension needed
+        cmn             w15, #1
+        b.ge            .Lh_neg_16x16_no_extend
+
+        // Copy left[-1..16] to ref_tmp[0..17]
+        sub             x16, x2, #1
+        ldp             q0, q1, [x16]
+        stp             q0, q1, [x14]
+
+        // Load inv_angle
+        movrel          x17, inv_angle_h_neg
+        sxtw            x7, w7
+        ldrsh           w9, [x17, x7, lsl #1]
+
+        // Extend loop
+        mov             w10, w15
+.Lh_neg_16x16_extend:
+        mul             w11, w10, w9
+        add             w11, w11, #128
+        asr             w11, w11, #8
+        sub             w11, w11, #1
+        ldrb            w12, [x1, w11, sxtw]
+        strb            w12, [x14, w10, sxtw]
+        add             w10, w10, #1
+        cmn             w10, #1
+        b.le            .Lh_neg_16x16_extend
+
+        mov             x13, x14
+        b               .Lh_neg_16x16_predict
+
+.Lh_neg_16x16_no_extend:
+        sub             x13, x2, #1
+
+.Lh_neg_16x16_predict:
+        // Save base pointers
+        mov             x19, x0                 // base dst (callee-saved)
+        mov             x20, x13                // base ref (callee-saved)
+        movi            v15.16b, #32            // constant 32 for NEON-domain weight computation
+
+        // === Column interpolation macro (16-byte column) ===
+.macro h_neg_16x16_col dst
+        add             w10, w10, w8
+        asr             w11, w10, #5
+        and             w12, w10, #31
+        add             x16, x20, w11, sxtw
+        ldr             q18, [x16, #1]          // ref[idx+1..idx+16]
+        ldr             q19, [x16, #2]          // ref[idx+2..idx+17]
+        dup             v17.16b, w12
+        sub             v16.16b, v15.16b, v17.16b
+        umull           v20.8h, v18.8b, v16.8b
+        umlal           v20.8h, v19.8b, v17.8b
+        rshrn           \dst\().8b, v20.8h, #5
+        umull2          v21.8h, v18.16b, v16.16b
+        umlal2          v21.8h, v19.16b, v17.16b
+        rshrn2          \dst\().16b, v21.8h, #5
+.endm
+
+        // === Batch 1: columns 0-7 ===
+        mov             w10, #0                 // angle_acc = 0
+        h_neg_16x16_col v0
+        h_neg_16x16_col v1
+        h_neg_16x16_col v2
+        h_neg_16x16_col v3
+        h_neg_16x16_col v4
+        h_neg_16x16_col v5
+        h_neg_16x16_col v6
+        h_neg_16x16_col v7
+
+        // Save angle_acc for batch 2
+        mov             w19, w10                // reuse x19; stores below use x0, not the saved dst base
+
+        // transpose_8x16B: transposes two independent 8x8 blocks
+        // (low halves and high halves separately).
+        // Input  vi.16b = [row0..row7 | row8..row15] of column i
+        // Output vi.16b: lo8 = col0..col7 of row i, hi8 = col0..col7 of row (i+8)
+        transpose_8x16B v0, v1, v2, v3, v4, v5, v6, v7, v16, v17
+
+        // Store first 8 bytes (cols 0-7) of rows 0-7 and rows 8-15
+        // Rows 0-7: store lo 8 bytes of v0-v7
+        mov             x16, x0
+        .irp reg, v0, v1, v2, v3, v4, v5, v6, v7
+        st1             {\reg\().8b}, [x16], x3
+        .endr
+
+        // Rows 8-15: store hi 8 bytes of v0-v7
+        // The high 8 bytes are accessed by storing from the .d[1] element
+        .irp reg, v0, v1, v2, v3, v4, v5, v6, v7
+        st1             {\reg\().d}[1], [x16], x3
+        .endr
+
+        // === Batch 2: columns 8-15 ===
+        mov             w10, w19                // restore angle_acc from batch 1
+        h_neg_16x16_col v0
+        h_neg_16x16_col v1
+        h_neg_16x16_col v2
+        h_neg_16x16_col v3
+        h_neg_16x16_col v4
+        h_neg_16x16_col v5
+        h_neg_16x16_col v6
+        h_neg_16x16_col v7
+.purgem h_neg_16x16_col
+
+        transpose_8x16B v0, v1, v2, v3, v4, v5, v6, v7, v16, v17
+
+        // Store second 8 bytes (cols 8-15) of rows 0-7
+        add             x16, x0, #8            // offset to column 8 of row 0
+        .irp reg, v0, v1, v2, v3, v4, v5, v6, v7
+        st1             {\reg\().8b}, [x16], x3
+        .endr
+
+        // Store second 8 bytes (cols 8-15) of rows 8-15
+        .irp reg, v0, v1, v2, v3, v4, v5, v6, v7
+        st1             {\reg\().d}[1], [x16], x3
+        .endr
+
+        add             sp, sp, #144
+        ldr             d15, [sp], #16
+        ldp             x19, x20, [sp], #16
+        ret
+endfunc
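Because each column's interpolation results land column-major in a vector register, a byte transpose is needed before rows can be stored contiguously. Per 8x8 byte block, transpose_8x8B / each half of transpose_8x16B is equivalent to this scalar in-place transpose (illustrative sketch; the NEON macros reach the same result through trn1/trn2-style stages rather than element swaps):

```c
#include <stdint.h>

/* In-place 8x8 byte transpose: m[i][j] <-> m[j][i].
 * Scalar equivalent of one 8x8 block handled by the NEON transpose
 * macros used above. */
static void transpose_8x8_c(uint8_t m[8][8])
{
    for (int i = 0; i < 8; i++)
        for (int j = i + 1; j < 8; j++) {
            uint8_t t = m[i][j];
            m[i][j] = m[j][i];
            m[j][i] = t;
        }
}
```

In the 16x16 function, transpose_8x16B transposes the row 0-7 and row 8-15 halves of v0-v7 independently, which is why rows 8-15 are stored from the `.d[1]` lanes afterwards.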
+
+// -----------------------------------------------------------------------------
+// pred_angular_h_neg_32x32_8: Horizontal reference negative angle prediction (mode 11-17)
+//
+// Process 32 columns in 4 batches of 8.  Each column produces 32 pixels
+// stored in two q registers (rows 0-15 and rows 16-31).  Each batch is
+// transposed with two transpose_8x16B calls, then stored as contiguous
+// 8-byte row segments.
+//
+// Arguments:
+// x0: src
+// x1: top (used for reference extension)
+// x2: left
+// x3: stride
+// w4: c_idx
+// w5: mode
+// -----------------------------------------------------------------------------
+function ff_hevc_pred_angular_h_neg_32x32_8_neon, export=1
+        stp             x19, x20, [sp, #-16]!
+        stp             x21, x22, [sp, #-16]!
+        str             d15, [sp, #-16]!
+        // Stack layout: ref_tmp[-64..79] at sp, base at sp+64
+        // last ranges from -26 (mode 17) to -2 (mode 11); data occupies ref_tmp[-26..32]
+        // Allocated 144 bytes (16-byte aligned), offset 64 covers the 64-byte copy
+        sub             sp, sp, #144
+        add             x14, sp, #64            // ref_tmp base
+
+        // Load angle
+        movrel          x6, intra_pred_angle_h_neg
+        sub             w7, w5, #11
+        ldrsb           w8, [x6, w7, sxtw]
+
+        // Calculate last = (32 * angle) >> 5
+        mov             w15, #32
+        mul             w15, w15, w8
+        asr             w15, w15, #5
+
+        // Check if extension needed
+        cmn             w15, #1
+        b.ge            .Lh_neg_32x32_no_extend
+
+        // Copy left[-1..32] to ref_tmp[0..33]
+        sub             x16, x2, #1
+        ldp             q0, q1, [x16]
+        stp             q0, q1, [x14]
+        ldp             q2, q3, [x16, #32]
+        stp             q2, q3, [x14, #32]
+
+        // Load inv_angle
+        movrel          x17, inv_angle_h_neg
+        sxtw            x7, w7
+        ldrsh           w9, [x17, x7, lsl #1]
+
+        // Extend loop
+        mov             w10, w15
+.Lh_neg_32x32_extend:
+        mul             w11, w10, w9
+        add             w11, w11, #128
+        asr             w11, w11, #8
+        sub             w11, w11, #1
+        ldrb            w12, [x1, w11, sxtw]
+        strb            w12, [x14, w10, sxtw]
+        add             w10, w10, #1
+        cmn             w10, #1
+        b.le            .Lh_neg_32x32_extend
+
+        mov             x13, x14
+        b               .Lh_neg_32x32_predict
+
+.Lh_neg_32x32_no_extend:
+        sub             x13, x2, #1
+
+.Lh_neg_32x32_predict:
+        mov             x19, x0                 // save dst base
+        mov             x20, x13                // save ref base
+        movi            v15.16b, #32            // constant 32 for NEON-domain weight computation
+
+        // === Column interpolation macro (32-byte column → two 16B regs) ===
+        // \dst_hi = rows 0-15, \dst_lo = rows 16-31
+.macro h_neg_32_col dst_hi, dst_lo
+        add             w10, w10, w8
+        asr             w11, w10, #5
+        and             w12, w10, #31
+        add             x16, x20, w11, sxtw
+        // Load rows 0-15 ref pair
+        ldr             q18, [x16, #1]
+        ldr             q19, [x16, #2]
+        // Load rows 16-31 ref pair
+        ldr             q20, [x16, #17]
+        ldr             q21, [x16, #18]
+        dup             v17.16b, w12
+        sub             v16.16b, v15.16b, v17.16b
+        // Interpolate rows 0-15
+        umull           v22.8h, v18.8b, v16.8b
+        umlal           v22.8h, v19.8b, v17.8b
+        rshrn           \dst_hi\().8b, v22.8h, #5
+        umull2          v23.8h, v18.16b, v16.16b
+        umlal2          v23.8h, v19.16b, v17.16b
+        rshrn2          \dst_hi\().16b, v23.8h, #5
+        // Interpolate rows 16-31
+        umull           v22.8h, v20.8b, v16.8b
+        umlal           v22.8h, v21.8b, v17.8b
+        rshrn           \dst_lo\().8b, v22.8h, #5
+        umull2          v23.8h, v20.16b, v16.16b
+        umlal2          v23.8h, v21.16b, v17.16b
+        rshrn2          \dst_lo\().16b, v23.8h, #5
+.endm
+
+        // Process 4 batches of 8 columns each.
+        // x21 = current column offset (byte offset into each row)
+        // x22 = saved angle_acc between batches
+        mov             w10, #0                 // angle_acc
+        mov             x21, #0                 // column byte offset
+
+.Lh_neg_32x32_batch:
+        // Compute 8 columns. Upper half (rows 0-15) → v0-v7, lower half (rows 16-31) → v24-v31
+        h_neg_32_col    v0, v24
+        h_neg_32_col    v1, v25
+        h_neg_32_col    v2, v26
+        h_neg_32_col    v3, v27
+        h_neg_32_col    v4, v28
+        h_neg_32_col    v5, v29
+        h_neg_32_col    v6, v30
+        h_neg_32_col    v7, v31
+
+        // Save angle_acc
+        mov             w22, w10
+
+        // Transpose upper half: v0-v7 (each .16b = rows 0-15)
+        // After: vi lo8 = cols of row i (i=0..7), vi hi8 = cols of row i+8
+        transpose_8x16B v0, v1, v2, v3, v4, v5, v6, v7, v16, v17
+
+        // Transpose lower half: v24-v31 (each .16b = rows 16-31)
+        // After: vi lo8 = cols of row i+16, vi hi8 = cols of row i+24
+        transpose_8x16B v24, v25, v26, v27, v28, v29, v30, v31, v16, v17
+
+        // Store this batch's 8 columns for all 32 rows
+        add             x16, x19, x21          // dst + col_offset
+
+        // Rows 0-7: lo8 of v0-v7
+        .irp reg, v0, v1, v2, v3, v4, v5, v6, v7
+        st1             {\reg\().8b}, [x16], x3
+        .endr
+
+        // Rows 8-15: hi8 of v0-v7
+        // x16 now points to row 8 + col_offset
+        .irp reg, v0, v1, v2, v3, v4, v5, v6, v7
+        st1             {\reg\().d}[1], [x16], x3
+        .endr
+
+        // Rows 16-23: lo8 of v24-v31
+        // x16 now points to row 16 + col_offset
+        .irp reg, v24, v25, v26, v27, v28, v29, v30, v31
+        st1             {\reg\().8b}, [x16], x3
+        .endr
+
+        // Rows 24-31: hi8 of v24-v31
+        // x16 now points to row 24 + col_offset
+        .irp reg, v24, v25, v26, v27, v28, v29, v30, v31
+        st1             {\reg\().d}[1], [x16], x3
+        .endr
+
+        // Advance to next batch
+        mov             w10, w22                // restore angle_acc
+        add             x21, x21, #8           // next 8 columns
+        cmp             x21, #32
+        b.lt            .Lh_neg_32x32_batch
+
+.purgem h_neg_32_col
+
+        add             sp, sp, #144
+        ldr             d15, [sp], #16
+        ldp             x21, x22, [sp], #16
+        ldp             x19, x20, [sp], #16
+        ret
+endfunc
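Putting the pieces together, the whole h_neg path for the no-extend case (`last >= -1`, so `ref = left - 1`) matches this scalar model (illustrative sketch, not the patch's C fallback): column x advances the accumulator, interpolates `size` pixels along the vertical reference, and writes them down column x of dst, which is exactly the transposed store the NEON batches implement.

```c
#include <stdint.h>
#include <stddef.h>

/* Scalar model of pred_angular_h_neg, no-extend case (illustrative).
 * angle is negative (modes 11-17); assumes arithmetic >> on negative
 * ints, matching the asm's asr. */
static void pred_h_neg_c(uint8_t *dst, ptrdiff_t stride,
                         const uint8_t *ref, int size, int angle)
{
    int acc = 0;
    for (int x = 0; x < size; x++) {
        acc += angle;                    /* per-column angle step      */
        int idx  = acc >> 5;
        int fact = acc & 31;
        for (int y = 0; y < size; y++)   /* one column of the block    */
            dst[y * stride + x] =
                ((32 - fact) * ref[idx + 1 + y] +
                 fact        * ref[idx + 2 + y] + 16) >> 5;
    }
}
```

The NEON versions vectorize the inner loop (all `size` rows of a column at once), then transpose batches of 8 columns so dst can be written with contiguous row stores instead of per-pixel column writes.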
-- 
2.52.0


From daf5bf2b0c0cc32d0665e2025945f226a0af0f72 Mon Sep 17 00:00:00 2001
From: Jun Zhao <barryjzhao@tencent.com>
Date: Fri, 13 Feb 2026 18:19:44 +0800
Subject: [PATCH 8/8] lavc/hevc: use macro to generate angular prediction
 dispatch functions

Replace four nearly identical pred_angular_N_neon() dispatch functions
with a PRED_ANGULAR_NEON macro that generates them from (index,
log2_size, block_size) parameters.

Signed-off-by: Jun Zhao <barryjzhao@tencent.com>
---
 libavcodec/aarch64/hevcpred_init_aarch64.c | 121 ++++++---------------
 1 file changed, 32 insertions(+), 89 deletions(-)

diff --git a/libavcodec/aarch64/hevcpred_init_aarch64.c b/libavcodec/aarch64/hevcpred_init_aarch64.c
index f523abd97d..16149ef7ea 100644
--- a/libavcodec/aarch64/hevcpred_init_aarch64.c
+++ b/libavcodec/aarch64/hevcpred_init_aarch64.c
@@ -154,97 +154,40 @@ typedef void (*pred_angular_func)(uint8_t *src, const uint8_t *top,
                                   int c_idx, int mode);
 static pred_angular_func pred_angular_c[4];
 
-static void pred_angular_0_neon(uint8_t *src, const uint8_t *top,
-                                const uint8_t *left, ptrdiff_t stride,
-                                int c_idx, int mode)
-{
-    if (mode == 10) {
-        ff_hevc_pred_angular_mode_10_8_neon(src, top, left, stride, c_idx, 2);
-    } else if (mode == 26) {
-        ff_hevc_pred_angular_mode_26_8_neon(src, top, left, stride, c_idx, 2);
-    } else if (mode == 18) {
-        ff_hevc_pred_angular_mode_18_4x4_8_neon(src, top, left, stride, c_idx, 2);
-    } else if (mode >= 27) {
-        ff_hevc_pred_angular_v_pos_4x4_8_neon(src, top, left, stride, c_idx, mode);
-    } else if (mode > 18 && mode <= 25) {
-        ff_hevc_pred_angular_v_neg_4x4_8_neon(src, top, left, stride, c_idx, mode);
-    } else if (mode >= 11 && mode <= 17) {
-        ff_hevc_pred_angular_h_neg_4x4_8_neon(src, top, left, stride, c_idx, mode);
-    } else if (mode >= 2 && mode <= 9) {
-        ff_hevc_pred_angular_h_pos_4x4_8_neon(src, top, left, stride, c_idx, mode);
-    } else {
-        pred_angular_c[0](src, top, left, stride, c_idx, mode);
-    }
+#define PRED_ANGULAR_NEON(IDX, LOG2, SZ)                                      \
+static void pred_angular_##IDX##_neon(uint8_t *src, const uint8_t *top,       \
+                                      const uint8_t *left, ptrdiff_t stride,  \
+                                      int c_idx, int mode)                    \
+{                                                                             \
+    if (mode == 10)                                                           \
+        ff_hevc_pred_angular_mode_10_8_neon(src, top, left, stride,           \
+                                            c_idx, LOG2);                     \
+    else if (mode == 26)                                                      \
+        ff_hevc_pred_angular_mode_26_8_neon(src, top, left, stride,           \
+                                            c_idx, LOG2);                     \
+    else if (mode == 18)                                                      \
+        ff_hevc_pred_angular_mode_18_##SZ##_8_neon(src, top, left, stride,    \
+                                                   c_idx, LOG2);              \
+    else if (mode >= 27)                                                      \
+        ff_hevc_pred_angular_v_pos_##SZ##_8_neon(src, top, left, stride,      \
+                                                 c_idx, mode);                \
+    else if (mode > 18 && mode <= 25)                                         \
+        ff_hevc_pred_angular_v_neg_##SZ##_8_neon(src, top, left, stride,      \
+                                                 c_idx, mode);                \
+    else if (mode >= 11 && mode <= 17)                                        \
+        ff_hevc_pred_angular_h_neg_##SZ##_8_neon(src, top, left, stride,      \
+                                                 c_idx, mode);                \
+    else if (mode >= 2 && mode <= 9)                                          \
+        ff_hevc_pred_angular_h_pos_##SZ##_8_neon(src, top, left, stride,      \
+                                                 c_idx, mode);                \
+    else                                                                      \
+        pred_angular_c[IDX](src, top, left, stride, c_idx, mode);             \
 }
 
-static void pred_angular_1_neon(uint8_t *src, const uint8_t *top,
-                                const uint8_t *left, ptrdiff_t stride,
-                                int c_idx, int mode)
-{
-    if (mode == 10) {
-        ff_hevc_pred_angular_mode_10_8_neon(src, top, left, stride, c_idx, 3);
-    } else if (mode == 26) {
-        ff_hevc_pred_angular_mode_26_8_neon(src, top, left, stride, c_idx, 3);
-    } else if (mode == 18) {
-        ff_hevc_pred_angular_mode_18_8x8_8_neon(src, top, left, stride, c_idx, 3);
-    } else if (mode >= 27) {
-        ff_hevc_pred_angular_v_pos_8x8_8_neon(src, top, left, stride, c_idx, mode);
-    } else if (mode > 18 && mode <= 25) {
-        ff_hevc_pred_angular_v_neg_8x8_8_neon(src, top, left, stride, c_idx, mode);
-    } else if (mode >= 11 && mode <= 17) {
-        ff_hevc_pred_angular_h_neg_8x8_8_neon(src, top, left, stride, c_idx, mode);
-    } else if (mode >= 2 && mode <= 9) {
-        ff_hevc_pred_angular_h_pos_8x8_8_neon(src, top, left, stride, c_idx, mode);
-    } else {
-        pred_angular_c[1](src, top, left, stride, c_idx, mode);
-    }
-}
-
-static void pred_angular_2_neon(uint8_t *src, const uint8_t *top,
-                                const uint8_t *left, ptrdiff_t stride,
-                                int c_idx, int mode)
-{
-    if (mode == 10) {
-        ff_hevc_pred_angular_mode_10_8_neon(src, top, left, stride, c_idx, 4);
-    } else if (mode == 26) {
-        ff_hevc_pred_angular_mode_26_8_neon(src, top, left, stride, c_idx, 4);
-    } else if (mode == 18) {
-        ff_hevc_pred_angular_mode_18_16x16_8_neon(src, top, left, stride, c_idx, 4);
-    } else if (mode >= 27) {
-        ff_hevc_pred_angular_v_pos_16x16_8_neon(src, top, left, stride, c_idx, mode);
-    } else if (mode > 18 && mode <= 25) {
-        ff_hevc_pred_angular_v_neg_16x16_8_neon(src, top, left, stride, c_idx, mode);
-    } else if (mode >= 11 && mode <= 17) {
-        ff_hevc_pred_angular_h_neg_16x16_8_neon(src, top, left, stride, c_idx, mode);
-    } else if (mode >= 2 && mode <= 9) {
-        ff_hevc_pred_angular_h_pos_16x16_8_neon(src, top, left, stride, c_idx, mode);
-    } else {
-        pred_angular_c[2](src, top, left, stride, c_idx, mode);
-    }
-}
-
-static void pred_angular_3_neon(uint8_t *src, const uint8_t *top,
-                                const uint8_t *left, ptrdiff_t stride,
-                                int c_idx, int mode)
-{
-    if (mode == 10) {
-        ff_hevc_pred_angular_mode_10_8_neon(src, top, left, stride, c_idx, 5);
-    } else if (mode == 26) {
-        ff_hevc_pred_angular_mode_26_8_neon(src, top, left, stride, c_idx, 5);
-    } else if (mode == 18) {
-        ff_hevc_pred_angular_mode_18_32x32_8_neon(src, top, left, stride, c_idx, 5);
-    } else if (mode >= 27) {
-        ff_hevc_pred_angular_v_pos_32x32_8_neon(src, top, left, stride, c_idx, mode);
-    } else if (mode > 18 && mode <= 25) {
-        ff_hevc_pred_angular_v_neg_32x32_8_neon(src, top, left, stride, c_idx, mode);
-    } else if (mode >= 11 && mode <= 17) {
-        ff_hevc_pred_angular_h_neg_32x32_8_neon(src, top, left, stride, c_idx, mode);
-    } else if (mode >= 2 && mode <= 9) {
-        ff_hevc_pred_angular_h_pos_32x32_8_neon(src, top, left, stride, c_idx, mode);
-    } else {
-        pred_angular_c[3](src, top, left, stride, c_idx, mode);
-    }
-}
+PRED_ANGULAR_NEON(0, 2, 4x4)
+PRED_ANGULAR_NEON(1, 3, 8x8)
+PRED_ANGULAR_NEON(2, 4, 16x16)
+PRED_ANGULAR_NEON(3, 5, 32x32)
 
 av_cold void ff_hevc_pred_init_aarch64(HEVCPredContext *hpc, int bit_depth)
 {
-- 
2.52.0
