Git Inbox Mirror of the ffmpeg-devel mailing list - see https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
* [FFmpeg-devel] [PATCH 1/6] libavcodec/riscv: add RVV optimizations for qpel_h in HEVC
@ 2026-01-22  4:23 zhanheng.yang--- via ffmpeg-devel
  2026-01-22  4:23 ` [FFmpeg-devel] [PATCH 2/6] libavcodec/riscv: add RVV optimizations for qpel_v " zhanheng.yang--- via ffmpeg-devel
                   ` (4 more replies)
  0 siblings, 5 replies; 6+ messages in thread
From: zhanheng.yang--- via ffmpeg-devel @ 2026-01-22  4:23 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Zhanheng Yang

From: Zhanheng Yang <zhanheng.yang@linux.alibaba.com>

Benchmarked on an A210 C908 core (VLEN 128):
put_hevc_qpel_h4_8_c:                                  275.4 ( 1.00x)
put_hevc_qpel_h4_8_rvv_i32:                            142.9 ( 1.93x)
put_hevc_qpel_h6_8_c:                                  595.2 ( 1.00x)
put_hevc_qpel_h6_8_rvv_i32:                            209.7 ( 2.84x)
put_hevc_qpel_h8_8_c:                                 1044.0 ( 1.00x)
put_hevc_qpel_h8_8_rvv_i32:                            287.2 ( 3.63x)
put_hevc_qpel_h12_8_c:                                2371.0 ( 1.00x)
put_hevc_qpel_h12_8_rvv_i32:                           419.5 ( 5.65x)
put_hevc_qpel_h16_8_c:                                4187.2 ( 1.00x)
put_hevc_qpel_h16_8_rvv_i32:                           530.8 ( 7.89x)
put_hevc_qpel_h24_8_c:                                9276.4 ( 1.00x)
put_hevc_qpel_h24_8_rvv_i32:                          1509.6 ( 6.15x)
put_hevc_qpel_h32_8_c:                               16417.8 ( 1.00x)
put_hevc_qpel_h32_8_rvv_i32:                          1984.3 ( 8.27x)
put_hevc_qpel_h48_8_c:                               36812.8 ( 1.00x)
put_hevc_qpel_h48_8_rvv_i32:                          4390.6 ( 8.38x)
put_hevc_qpel_h64_8_c:                               65296.8 ( 1.00x)
put_hevc_qpel_h64_8_rvv_i32:                          7745.0 ( 8.43x)
put_hevc_qpel_uni_h4_8_c:                              374.8 ( 1.00x)
put_hevc_qpel_uni_h4_8_rvv_i32:                        162.9 ( 2.30x)
put_hevc_qpel_uni_h6_8_c:                              818.6 ( 1.00x)
put_hevc_qpel_uni_h6_8_rvv_i32:                        236.3 ( 3.46x)
put_hevc_qpel_uni_h8_8_c:                             1504.3 ( 1.00x)
put_hevc_qpel_uni_h8_8_rvv_i32:                        309.3 ( 4.86x)
put_hevc_qpel_uni_h12_8_c:                            3239.2 ( 1.00x)
put_hevc_qpel_uni_h12_8_rvv_i32:                       448.0 ( 7.23x)
put_hevc_qpel_uni_h16_8_c:                            5702.9 ( 1.00x)
put_hevc_qpel_uni_h16_8_rvv_i32:                       589.3 ( 9.68x)
put_hevc_qpel_uni_h24_8_c:                           12741.4 ( 1.00x)
put_hevc_qpel_uni_h24_8_rvv_i32:                      1650.3 ( 7.72x)
put_hevc_qpel_uni_h32_8_c:                           22531.3 ( 1.00x)
put_hevc_qpel_uni_h32_8_rvv_i32:                      2189.1 (10.29x)
put_hevc_qpel_uni_h48_8_c:                           50647.0 ( 1.00x)
put_hevc_qpel_uni_h48_8_rvv_i32:                      4817.0 (10.51x)
put_hevc_qpel_uni_h64_8_c:                           89742.9 ( 1.00x)
put_hevc_qpel_uni_h64_8_rvv_i32:                      8497.9 (10.56x)
put_hevc_qpel_uni_hv4_8_c:                             920.4 ( 1.00x)
put_hevc_qpel_uni_hv4_8_rvv_i32:                       532.1 ( 1.73x)
put_hevc_qpel_uni_hv6_8_c:                            1753.0 ( 1.00x)
put_hevc_qpel_uni_hv6_8_rvv_i32:                       691.0 ( 2.54x)
put_hevc_qpel_uni_hv8_8_c:                            2872.7 ( 1.00x)
put_hevc_qpel_uni_hv8_8_rvv_i32:                       836.9 ( 3.43x)
put_hevc_qpel_uni_hv12_8_c:                           5828.4 ( 1.00x)
put_hevc_qpel_uni_hv12_8_rvv_i32:                     1141.2 ( 5.11x)
put_hevc_qpel_uni_hv16_8_c:                           9906.7 ( 1.00x)
put_hevc_qpel_uni_hv16_8_rvv_i32:                     1452.5 ( 6.82x)
put_hevc_qpel_uni_hv24_8_c:                          20871.3 ( 1.00x)
put_hevc_qpel_uni_hv24_8_rvv_i32:                     4094.0 ( 5.10x)
put_hevc_qpel_uni_hv32_8_c:                          36123.3 ( 1.00x)
put_hevc_qpel_uni_hv32_8_rvv_i32:                     5310.5 ( 6.80x)
put_hevc_qpel_uni_hv48_8_c:                          79016.0 ( 1.00x)
put_hevc_qpel_uni_hv48_8_rvv_i32:                    11591.2 ( 6.82x)
put_hevc_qpel_uni_hv64_8_c:                         138779.8 ( 1.00x)
put_hevc_qpel_uni_hv64_8_rvv_i32:                    20321.1 ( 6.83x)
put_hevc_qpel_uni_w_h4_8_c:                            412.1 ( 1.00x)
put_hevc_qpel_uni_w_h4_8_rvv_i32:                      237.3 ( 1.74x)
put_hevc_qpel_uni_w_h6_8_c:                            895.9 ( 1.00x)
put_hevc_qpel_uni_w_h6_8_rvv_i32:                      345.6 ( 2.59x)
put_hevc_qpel_uni_w_h8_8_c:                           1625.4 ( 1.00x)
put_hevc_qpel_uni_w_h8_8_rvv_i32:                      452.4 ( 3.59x)
put_hevc_qpel_uni_w_h12_8_c:                          3541.2 ( 1.00x)
put_hevc_qpel_uni_w_h12_8_rvv_i32:                     663.6 ( 5.34x)
put_hevc_qpel_uni_w_h16_8_c:                          6290.3 ( 1.00x)
put_hevc_qpel_uni_w_h16_8_rvv_i32:                     875.7 ( 7.18x)
put_hevc_qpel_uni_w_h24_8_c:                         13994.9 ( 1.00x)
put_hevc_qpel_uni_w_h24_8_rvv_i32:                    2475.0 ( 5.65x)
put_hevc_qpel_uni_w_h32_8_c:                         24852.3 ( 1.00x)
put_hevc_qpel_uni_w_h32_8_rvv_i32:                    3291.2 ( 7.55x)
put_hevc_qpel_uni_w_h48_8_c:                         55595.5 ( 1.00x)
put_hevc_qpel_uni_w_h48_8_rvv_i32:                    7297.4 ( 7.62x)
put_hevc_qpel_uni_w_h64_8_c:                         98628.2 ( 1.00x)
put_hevc_qpel_uni_w_h64_8_rvv_i32:                   12883.2 ( 7.66x)
put_hevc_qpel_bi_h4_8_c:                               392.6 ( 1.00x)
put_hevc_qpel_bi_h4_8_rvv_i32:                         186.1 ( 2.11x)
put_hevc_qpel_bi_h6_8_c:                               842.3 ( 1.00x)
put_hevc_qpel_bi_h6_8_rvv_i32:                         267.8 ( 3.15x)
put_hevc_qpel_bi_h8_8_c:                              1546.4 ( 1.00x)
put_hevc_qpel_bi_h8_8_rvv_i32:                         353.7 ( 4.37x)
put_hevc_qpel_bi_h12_8_c:                             3317.2 ( 1.00x)
put_hevc_qpel_bi_h12_8_rvv_i32:                        515.1 ( 6.44x)
put_hevc_qpel_bi_h16_8_c:                             5848.3 ( 1.00x)
put_hevc_qpel_bi_h16_8_rvv_i32:                        680.9 ( 8.59x)
put_hevc_qpel_bi_h24_8_c:                            13032.6 ( 1.00x)
put_hevc_qpel_bi_h24_8_rvv_i32:                       1880.8 ( 6.93x)
put_hevc_qpel_bi_h32_8_c:                            23021.1 ( 1.00x)
put_hevc_qpel_bi_h32_8_rvv_i32:                       2498.5 ( 9.21x)
put_hevc_qpel_bi_h48_8_c:                            51655.9 ( 1.00x)
put_hevc_qpel_bi_h48_8_rvv_i32:                       5486.3 ( 9.42x)
put_hevc_qpel_bi_h64_8_c:                            91738.7 ( 1.00x)
put_hevc_qpel_bi_h64_8_rvv_i32:                       9735.0 ( 9.42x)
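
For reference, the horizontal pass being vectorized is the standard HEVC 8-tap luma quarter-pel filter: each 16-bit intermediate sample is a weighted sum of src[x-3]..src[x+4] using the tap rows added below as qpel_filters, written out with a fixed MAX_PB_SIZE row stride. A minimal scalar sketch of the unclipped `h` variant (hypothetical helper names, not FFmpeg's actual C reference code):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PB_SIZE 64

/* Same coefficients as the qpel_filters table in hevcqpel_rvv.S
 * (rows indexed by the fractional position mx = 0..3). */
static const int8_t qpel_taps[4][8] = {
    {  0, 0,   0,  0,  0,   0, 0,  0 },
    { -1, 4, -10, 58, 17,  -5, 1,  0 },
    { -1, 4, -11, 40, 40, -11, 4, -1 },
    {  0, 1,  -5, 17, 58, -10, 4, -1 },
};

/* Unclipped horizontal filter: 16-bit intermediates at a fixed
 * MAX_PB_SIZE element stride, reading src[x-3] .. src[x+4]. */
static void put_qpel_h_ref(int16_t *dst, const uint8_t *src,
                           ptrdiff_t srcstride, int height,
                           int mx, int width)
{
    const int8_t *f = qpel_taps[mx & 3];
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            int sum = 0;
            for (int i = 0; i < 8; i++)
                sum += f[i] * src[x + i - 3];
            dst[x] = (int16_t)sum;       /* fits: |sum| < 2^15 for 8-bit input */
        }
        src += srcstride;
        dst += MAX_PB_SIZE;
    }
}
```

The three non-zero tap rows each sum to 64, which is why the uni/bi variants in the assembly narrow back to 8-bit range with a 6-bit (or, after bi averaging, 7-bit) rounding shift.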

Signed-off-by: Zhanheng Yang <zhanheng.yang@linux.alibaba.com>
---
 libavcodec/riscv/Makefile            |   3 +-
 libavcodec/riscv/h26x/h2656dsp.h     |  12 ++
 libavcodec/riscv/h26x/hevcqpel_rvv.S | 309 +++++++++++++++++++++++++++
 libavcodec/riscv/hevcdsp_init.c      |  55 +++--
 4 files changed, 364 insertions(+), 15 deletions(-)
 create mode 100644 libavcodec/riscv/h26x/hevcqpel_rvv.S

diff --git a/libavcodec/riscv/Makefile b/libavcodec/riscv/Makefile
index 2c53334923..414790ae0c 100644
--- a/libavcodec/riscv/Makefile
+++ b/libavcodec/riscv/Makefile
@@ -36,7 +36,8 @@ RVV-OBJS-$(CONFIG_H264DSP) += riscv/h264addpx_rvv.o riscv/h264dsp_rvv.o \
 OBJS-$(CONFIG_H264QPEL) += riscv/h264qpel_init.o
 RVV-OBJS-$(CONFIG_H264QPEL) += riscv/h264qpel_rvv.o
 OBJS-$(CONFIG_HEVC_DECODER) += riscv/hevcdsp_init.o
-RVV-OBJS-$(CONFIG_HEVC_DECODER)  += riscv/h26x/h2656_inter_rvv.o
+RVV-OBJS-$(CONFIG_HEVC_DECODER)  += riscv/h26x/h2656_inter_rvv.o \
+                                    riscv/h26x/hevcqpel_rvv.o
 OBJS-$(CONFIG_HUFFYUV_DECODER) += riscv/huffyuvdsp_init.o
 RVV-OBJS-$(CONFIG_HUFFYUV_DECODER) += riscv/huffyuvdsp_rvv.o
 OBJS-$(CONFIG_IDCTDSP) += riscv/idctdsp_init.o
diff --git a/libavcodec/riscv/h26x/h2656dsp.h b/libavcodec/riscv/h26x/h2656dsp.h
index 6d2ac55556..028b9ffbfd 100644
--- a/libavcodec/riscv/h26x/h2656dsp.h
+++ b/libavcodec/riscv/h26x/h2656dsp.h
@@ -1,5 +1,6 @@
 /*
  * Copyright (c) 2024 Institute of Software Chinese Academy of Sciences (ISCAS).
+ * Copyright (C) 2026 Alibaba Group Holding Limited.
  *
  * This file is part of FFmpeg.
  *
@@ -24,4 +25,15 @@
 void ff_h2656_put_pixels_8_rvv_256(int16_t *dst, const uint8_t *src, ptrdiff_t srcstride, int height, intptr_t mx, intptr_t my, int width);
 void ff_h2656_put_pixels_8_rvv_128(int16_t *dst, const uint8_t *src, ptrdiff_t srcstride, int height, intptr_t mx, intptr_t my, int width);
 
+void ff_hevc_put_qpel_h_8_m1_rvv(int16_t *dst, const uint8_t *_src, ptrdiff_t _srcstride, int height,
+        intptr_t mx, intptr_t my, int width);
+void ff_hevc_put_qpel_uni_h_8_m1_rvv(uint8_t *_dst, ptrdiff_t _dststride, const uint8_t *_src,
+        ptrdiff_t _srcstride, int height, intptr_t mx, intptr_t my, int width);
+void ff_hevc_put_qpel_uni_w_h_8_m1_rvv(uint8_t *_dst,  ptrdiff_t _dststride,
+        const uint8_t *_src, ptrdiff_t _srcstride,
+        int height, int denom, int wx, int ox,
+        intptr_t mx, intptr_t my, int width);
+void ff_hevc_put_qpel_bi_h_8_m1_rvv(uint8_t *_dst, ptrdiff_t _dststride, const uint8_t *_src,
+        ptrdiff_t _srcstride, const int16_t *src2, int height,
+        intptr_t mx, intptr_t my, int width);
 #endif
diff --git a/libavcodec/riscv/h26x/hevcqpel_rvv.S b/libavcodec/riscv/h26x/hevcqpel_rvv.S
new file mode 100644
index 0000000000..52d7acac33
--- /dev/null
+++ b/libavcodec/riscv/h26x/hevcqpel_rvv.S
@@ -0,0 +1,309 @@
+/*
+ * Copyright (C) 2026 Alibaba Group Holding Limited.
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+.section .rodata
+.align 2
+qpel_filters:
+    .byte   0,  0,   0,  0,  0,   0,  0,  0
+    .byte  -1,  4, -10, 58, 17,  -5,  1,  0
+    .byte  -1,  4, -11, 40, 40, -11,  4, -1
+    .byte   0,  1,  -5, 17, 58, -10,  4, -1
+
+.text
+#include "libavutil/riscv/asm.S"
+#define HEVC_MAX_PB_SIZE 64
+
+.macro  lx rd, addr
+#if (__riscv_xlen == 32)
+        lw      \rd, \addr
+#elif (__riscv_xlen == 64)
+        ld      \rd, \addr
+#else
+        lq      \rd, \addr
+#endif
+.endm
+
+.macro  sx rd, addr
+#if (__riscv_xlen == 32)
+        sw      \rd, \addr
+#elif (__riscv_xlen == 64)
+        sd      \rd, \addr
+#else
+        sq      \rd, \addr
+#endif
+.endm
+
+/* clobbers t0, t1 */
+.macro load_filter m
+        la          t0, qpel_filters
+        slli        t1, \m, 3
+        add         t0, t0, t1
+        lb          s1, 0(t0)
+        lb          s2, 1(t0)
+        lb          s3, 2(t0)
+        lb          s4, 3(t0)
+        lb          s5, 4(t0)
+        lb          s6, 5(t0)
+        lb          s7, 6(t0)
+        lb          s8, 7(t0)
+.endm
+
+/* output is unclipped; clobbers t4 */
+.macro filter_h         vdst, vsrc0, vsrc1, vsrc2, vsrc3, vsrc4, vsrc5, vsrc6, vsrc7, src
+        addi             t4, \src, -3
+        vle8.v           \vsrc0, (t4)
+        addi             t4, \src, -2
+        vmv.v.x          \vsrc3, s1
+        vwmulsu.vv       \vdst, \vsrc3, \vsrc0
+        vle8.v           \vsrc1, (t4)
+        addi             t4, \src, -1
+        vle8.v           \vsrc2, (t4)
+        vle8.v           \vsrc3, (\src)
+        addi             t4, \src, 1
+        vle8.v           \vsrc4, (t4)
+        addi             t4, \src, 2
+        vle8.v           \vsrc5, (t4)
+        addi             t4, \src, 3
+        vle8.v           \vsrc6, (t4)
+        addi             t4, \src, 4
+        vle8.v           \vsrc7, (t4)
+
+        vwmaccsu.vx      \vdst, s2, \vsrc1
+        vwmaccsu.vx      \vdst, s3, \vsrc2
+        vwmaccsu.vx      \vdst, s4, \vsrc3
+        vwmaccsu.vx      \vdst, s5, \vsrc4
+        vwmaccsu.vx      \vdst, s6, \vsrc5
+        vwmaccsu.vx      \vdst, s7, \vsrc6
+        vwmaccsu.vx      \vdst, s8, \vsrc7
+.endm
+
+.macro hevc_qpel_h       lmul, lmul2, lmul4
+func ff_hevc_put_qpel_h_8_\lmul\()_rvv, zve32x
+    addi        sp, sp, -64
+    sx          s1, 0(sp)
+    sx          s2, 8(sp)
+    sx          s3, 16(sp)
+    sx          s4, 24(sp)
+    sx          s5, 32(sp)
+    sx          s6, 40(sp)
+    sx          s7, 48(sp)
+    sx          s8, 56(sp)
+    load_filter a4
+    mv          t3, a6
+    li          t1, 0       # offset
+
+1:
+    vsetvli     t6, t3, e8, \lmul, ta, ma
+    add         t2, a1, t1
+    filter_h    v0, v16, v18, v20, v22, v24, v26, v28, v30, t2
+    vsetvli     zero, zero, e16, \lmul2, ta, ma
+    slli        t2, t1, 1
+    add         t2, a0, t2
+    vse16.v     v0, (t2)
+    sub         t3, t3, t6
+    add         t1, t1, t6
+    bgt         t3, zero, 1b
+    addi        a3, a3, -1
+    mv          t3, a6
+    add         a1, a1, a2
+    addi        a0, a0, 2*HEVC_MAX_PB_SIZE
+    li          t1, 0
+    bgt         a3, zero, 1b
+
+    lx          s1, 0(sp)
+    lx          s2, 8(sp)
+    lx          s3, 16(sp)
+    lx          s4, 24(sp)
+    lx          s5, 32(sp)
+    lx          s6, 40(sp)
+    lx          s7, 48(sp)
+    lx          s8, 56(sp)
+    addi        sp, sp, 64
+    ret
+endfunc
+
+func ff_hevc_put_qpel_uni_h_8_\lmul\()_rvv, zve32x
+    csrwi       vxrm, 0
+    addi        sp, sp, -64
+    sx          s1, 0(sp)
+    sx          s2, 8(sp)
+    sx          s3, 16(sp)
+    sx          s4, 24(sp)
+    sx          s5, 32(sp)
+    sx          s6, 40(sp)
+    sx          s7, 48(sp)
+    sx          s8, 56(sp)
+    load_filter a5
+    mv          t3, a7
+    li          t1, 0       # offset
+
+1:
+    vsetvli     t6, t3, e8, \lmul, ta, ma
+    add         t2, a2, t1
+    filter_h    v0, v16, v18, v20, v22, v24, v26, v28, v30, t2
+    vsetvli     zero, zero, e16, \lmul2, ta, ma
+    vmax.vx     v0, v0, zero
+    vsetvli     zero, zero, e8, \lmul, ta, ma
+    vnclipu.wi  v0, v0, 6
+    add         t2, a0, t1
+    vse8.v      v0, (t2)
+    sub         t3, t3, t6
+    add         t1, t1, t6
+    bgt         t3, zero, 1b
+    addi        a4, a4, -1
+    mv          t3, a7
+    add         a2, a2, a3
+    add         a0, a0, a1
+    li          t1, 0
+    bgt         a4, zero, 1b
+
+    lx          s1, 0(sp)
+    lx          s2, 8(sp)
+    lx          s3, 16(sp)
+    lx          s4, 24(sp)
+    lx          s5, 32(sp)
+    lx          s6, 40(sp)
+    lx          s7, 48(sp)
+    lx          s8, 56(sp)
+    addi        sp, sp, 64
+    ret
+endfunc
+
+func ff_hevc_put_qpel_uni_w_h_8_\lmul\()_rvv, zve32x
+    csrwi       vxrm, 0
+    lx          t2, 0(sp)       # mx
+    addi        a5, a5, 6       # shift
+#if (__riscv_xlen == 32)
+    lw          t3, 8(sp)       # width
+#elif (__riscv_xlen == 64)
+    lw          t3, 16(sp)
+#endif
+    addi        sp, sp, -64
+    sx          s1, 0(sp)
+    sx          s2, 8(sp)
+    sx          s3, 16(sp)
+    sx          s4, 24(sp)
+    sx          s5, 32(sp)
+    sx          s6, 40(sp)
+    sx          s7, 48(sp)
+    sx          s8, 56(sp)
+    load_filter t2
+    li          t2, 0           # offset
+
+1:
+    vsetvli     t6, t3, e8, \lmul, ta, ma
+    add         t1, a2, t2
+    filter_h    v8, v16, v18, v20, v22, v24, v26, v28, v30, t1
+    vsetvli     zero, zero, e16, \lmul2, ta, ma
+    vwmul.vx    v0, v8, a6
+    vsetvli     zero, zero, e32, \lmul4, ta, ma
+    vssra.vx    v0, v0, a5
+    vsadd.vx    v0, v0, a7
+    vmax.vx     v0, v0, zero
+    vsetvli     zero, zero, e16, \lmul2, ta, ma
+    vnclip.wi   v0, v0, 0
+    vsetvli     zero, zero, e8, \lmul, ta, ma
+    vnclipu.wi  v0, v0, 0
+    add         t1, a0, t2
+    vse8.v      v0, (t1)
+    sub         t3, t3, t6
+    add         t2, t2, t6
+    bgt         t3, zero, 1b
+    addi        a4, a4, -1
+#if (__riscv_xlen == 32)
+    lw          t3, 72(sp)
+#elif (__riscv_xlen == 64)
+    lw          t3, 80(sp)
+#endif
+    add         a2, a2, a3
+    add         a0, a0, a1
+    li          t2, 0
+    bgt         a4, zero, 1b
+
+    lx          s1, 0(sp)
+    lx          s2, 8(sp)
+    lx          s3, 16(sp)
+    lx          s4, 24(sp)
+    lx          s5, 32(sp)
+    lx          s6, 40(sp)
+    lx          s7, 48(sp)
+    lx          s8, 56(sp)
+    addi        sp, sp, 64
+    ret
+endfunc
+
+func ff_hevc_put_qpel_bi_h_8_\lmul\()_rvv, zve32x
+    csrwi       vxrm, 0
+    lw          t3, 0(sp)      # width
+    addi        sp, sp, -64
+    sx          s1, 0(sp)
+    sx          s2, 8(sp)
+    sx          s3, 16(sp)
+    sx          s4, 24(sp)
+    sx          s5, 32(sp)
+    sx          s6, 40(sp)
+    sx          s7, 48(sp)
+    sx          s8, 56(sp)
+    load_filter a6
+    li          t1, 0          # offset
+
+1:
+    vsetvli     t6, t3, e16, \lmul2, ta, ma
+    slli        t2, t1, 1
+    add         t2, a4, t2
+    vle16.v     v12, (t2)
+    vsetvli     zero, zero, e8, \lmul, ta, ma
+    add         t2, a2, t1
+    filter_h    v0, v16, v18, v20, v22, v24, v26, v28, v30, t2
+    vsetvli     zero, zero, e16, \lmul2, ta, ma
+    vsadd.vv    v0, v0, v12
+    vmax.vx     v0, v0, zero
+    vsetvli     zero, zero, e8, \lmul, ta, ma
+    vnclipu.wi  v0, v0, 7
+    add         t2, a0, t1
+    vse8.v      v0, (t2)
+    sub         t3, t3, t6
+    add         t1, t1, t6
+    bgt         t3, zero, 1b
+    addi        a5, a5, -1
+    lw          t3, 64(sp)
+    add         a2, a2, a3
+    add         a0, a0, a1
+    addi        a4, a4, 2*HEVC_MAX_PB_SIZE
+    li          t1, 0
+    bgt         a5, zero, 1b
+
+    lx          s1, 0(sp)
+    lx          s2, 8(sp)
+    lx          s3, 16(sp)
+    lx          s4, 24(sp)
+    lx          s5, 32(sp)
+    lx          s6, 40(sp)
+    lx          s7, 48(sp)
+    lx          s8, 56(sp)
+    addi        sp, sp, 64
+    ret
+endfunc
+.endm
+
+hevc_qpel_h m1, m2, m4
\ No newline at end of file
diff --git a/libavcodec/riscv/hevcdsp_init.c b/libavcodec/riscv/hevcdsp_init.c
index 70bc8ebea7..59333740de 100644
--- a/libavcodec/riscv/hevcdsp_init.c
+++ b/libavcodec/riscv/hevcdsp_init.c
@@ -1,5 +1,6 @@
 /*
  * Copyright (c) 2024 Institute of Software Chinese Academy of Sciences (ISCAS).
+ * Copyright (C) 2026 Alibaba Group Holding Limited.
  *
  * This file is part of FFmpeg.
  *
@@ -34,30 +35,56 @@
         member[7][v][h] = ff_h2656_put_pixels_##8_##ext; \
         member[9][v][h] = ff_h2656_put_pixels_##8_##ext;
 
+#define RVV_FNASSIGN_PEL(member, v, h, fn) \
+        member[1][v][h] = fn; \
+        member[2][v][h] = fn; \
+        member[3][v][h] = fn; \
+        member[4][v][h] = fn; \
+        member[5][v][h] = fn; \
+        member[6][v][h] = fn; \
+        member[7][v][h] = fn; \
+        member[8][v][h] = fn; \
+        member[9][v][h] = fn;
+
 void ff_hevc_dsp_init_riscv(HEVCDSPContext *c, const int bit_depth)
 {
 #if HAVE_RVV
     const int flags = av_get_cpu_flags();
     int vlenb;
 
-    if (!(flags & AV_CPU_FLAG_RVV_I32) || !(flags & AV_CPU_FLAG_RVB))
-        return;
+    if ((flags & AV_CPU_FLAG_RVV_I32) && (flags & AV_CPU_FLAG_RVB)) {
+        vlenb = ff_get_rv_vlenb();
+        if (vlenb >= 32) {
+            switch (bit_depth) {
+                case 8:
+                    RVV_FNASSIGN(c->put_hevc_qpel, 0, 0, pel_pixels, rvv_256);
+                    RVV_FNASSIGN(c->put_hevc_epel, 0, 0, pel_pixels, rvv_256);
 
-    vlenb = ff_get_rv_vlenb();
-    if (vlenb >= 32) {
-        switch (bit_depth) {
-            case 8:
-                RVV_FNASSIGN(c->put_hevc_qpel, 0, 0, pel_pixels, rvv_256);
-                RVV_FNASSIGN(c->put_hevc_epel, 0, 0, pel_pixels, rvv_256);
-                break;
-            default:
-                break;
+                    break;
+                default:
+                    break;
+            }
+        } else if (vlenb >= 16) {
+            switch (bit_depth) {
+                case 8:
+                    RVV_FNASSIGN(c->put_hevc_qpel, 0, 0, pel_pixels, rvv_128);
+                    RVV_FNASSIGN(c->put_hevc_epel, 0, 0, pel_pixels, rvv_128);
+
+                    break;
+                default:
+                    break;
+            }
         }
-    } else if (vlenb >= 16) {
+    }
+
+    if (flags & AV_CPU_FLAG_RVV_I32) {
         switch (bit_depth) {
             case 8:
-                RVV_FNASSIGN(c->put_hevc_qpel, 0, 0, pel_pixels, rvv_128);
-                RVV_FNASSIGN(c->put_hevc_epel, 0, 0, pel_pixels, rvv_128);
+                RVV_FNASSIGN_PEL(c->put_hevc_qpel, 0, 1, ff_hevc_put_qpel_h_8_m1_rvv);
+                RVV_FNASSIGN_PEL(c->put_hevc_qpel_uni, 0, 1, ff_hevc_put_qpel_uni_h_8_m1_rvv);
+                RVV_FNASSIGN_PEL(c->put_hevc_qpel_uni_w, 0, 1, ff_hevc_put_qpel_uni_w_h_8_m1_rvv);
+                RVV_FNASSIGN_PEL(c->put_hevc_qpel_bi, 0, 1, ff_hevc_put_qpel_bi_h_8_m1_rvv);
+
                 break;
             default:
                 break;
-- 
2.25.1

_______________________________________________
ffmpeg-devel mailing list -- ffmpeg-devel@ffmpeg.org
To unsubscribe send an email to ffmpeg-devel-leave@ffmpeg.org


* [FFmpeg-devel] [PATCH 2/6] libavcodec/riscv: add RVV optimizations for qpel_v in HEVC
  2026-01-22  4:23 [FFmpeg-devel] [PATCH 1/6] libavcodec/riscv: add RVV optimizations for qpel_h in HEVC zhanheng.yang--- via ffmpeg-devel
@ 2026-01-22  4:23 ` zhanheng.yang--- via ffmpeg-devel
  2026-01-22  4:23 ` [FFmpeg-devel] [PATCH 3/6] libavcodec/riscv: add RVV optimizations for epel_h " zhanheng.yang--- via ffmpeg-devel
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: zhanheng.yang--- via ffmpeg-devel @ 2026-01-22  4:23 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Zhanheng Yang

From: Zhanheng Yang <zhanheng.yang@linux.alibaba.com>

Benchmarked on an A210 C908 core (VLEN 128):
put_hevc_qpel_v4_8_c:                                  265.0 ( 1.00x)
put_hevc_qpel_v4_8_rvv_i32:                            117.0 ( 2.26x)
put_hevc_qpel_v6_8_c:                                  568.8 ( 1.00x)
put_hevc_qpel_v6_8_rvv_i32:                            162.3 ( 3.50x)
put_hevc_qpel_v8_8_c:                                  986.9 ( 1.00x)
put_hevc_qpel_v8_8_rvv_i32:                            200.9 ( 4.91x)
put_hevc_qpel_v12_8_c:                                2236.1 ( 1.00x)
put_hevc_qpel_v12_8_rvv_i32:                           294.8 ( 7.58x)
put_hevc_qpel_v16_8_c:                                3958.8 ( 1.00x)
put_hevc_qpel_v16_8_rvv_i32:                           387.0 (10.23x)
put_hevc_qpel_v24_8_c:                                8707.6 ( 1.00x)
put_hevc_qpel_v24_8_rvv_i32:                          1096.5 ( 7.94x)
put_hevc_qpel_v32_8_c:                               15392.3 ( 1.00x)
put_hevc_qpel_v32_8_rvv_i32:                          1442.4 (10.67x)
put_hevc_qpel_v48_8_c:                               34569.2 ( 1.00x)
put_hevc_qpel_v48_8_rvv_i32:                          3197.1 (10.81x)
put_hevc_qpel_v64_8_c:                               61109.7 ( 1.00x)
put_hevc_qpel_v64_8_rvv_i32:                          5642.4 (10.83x)
put_hevc_qpel_uni_v4_8_c:                              354.9 ( 1.00x)
put_hevc_qpel_uni_v4_8_rvv_i32:                        131.3 ( 2.70x)
put_hevc_qpel_uni_v6_8_c:                              769.3 ( 1.00x)
put_hevc_qpel_uni_v6_8_rvv_i32:                        180.8 ( 4.25x)
put_hevc_qpel_uni_v8_8_c:                             1399.3 ( 1.00x)
put_hevc_qpel_uni_v8_8_rvv_i32:                        223.6 ( 6.26x)
put_hevc_qpel_uni_v12_8_c:                            3031.4 ( 1.00x)
put_hevc_qpel_uni_v12_8_rvv_i32:                       323.2 ( 9.38x)
put_hevc_qpel_uni_v16_8_c:                            5334.2 ( 1.00x)
put_hevc_qpel_uni_v16_8_rvv_i32:                       417.9 (12.76x)
put_hevc_qpel_uni_v24_8_c:                           11908.4 ( 1.00x)
put_hevc_qpel_uni_v24_8_rvv_i32:                      1212.2 ( 9.82x)
put_hevc_qpel_uni_v32_8_c:                           21030.6 ( 1.00x)
put_hevc_qpel_uni_v32_8_rvv_i32:                      1579.5 (13.31x)
put_hevc_qpel_uni_v48_8_c:                           47025.7 ( 1.00x)
put_hevc_qpel_uni_v48_8_rvv_i32:                      3500.2 (13.43x)
put_hevc_qpel_uni_v64_8_c:                           83487.0 ( 1.00x)
put_hevc_qpel_uni_v64_8_rvv_i32:                      6188.4 (13.49x)
put_hevc_qpel_uni_w_v4_8_c:                            396.3 ( 1.00x)
put_hevc_qpel_uni_w_v4_8_rvv_i32:                      200.9 ( 1.97x)
put_hevc_qpel_uni_w_v6_8_c:                            851.4 ( 1.00x)
put_hevc_qpel_uni_w_v6_8_rvv_i32:                      282.1 ( 3.02x)
put_hevc_qpel_uni_w_v8_8_c:                           1544.0 ( 1.00x)
put_hevc_qpel_uni_w_v8_8_rvv_i32:                      356.5 ( 4.33x)
put_hevc_qpel_uni_w_v12_8_c:                          3329.0 ( 1.00x)
put_hevc_qpel_uni_w_v12_8_rvv_i32:                     519.6 ( 6.41x)
put_hevc_qpel_uni_w_v16_8_c:                          5857.9 ( 1.00x)
put_hevc_qpel_uni_w_v16_8_rvv_i32:                     679.6 ( 8.62x)
put_hevc_qpel_uni_w_v24_8_c:                         13050.5 ( 1.00x)
put_hevc_qpel_uni_w_v24_8_rvv_i32:                    1965.5 ( 6.64x)
put_hevc_qpel_uni_w_v32_8_c:                         23219.4 ( 1.00x)
put_hevc_qpel_uni_w_v32_8_rvv_i32:                    2601.6 ( 8.93x)
put_hevc_qpel_uni_w_v48_8_c:                         51925.3 ( 1.00x)
put_hevc_qpel_uni_w_v48_8_rvv_i32:                    5786.7 ( 8.97x)
put_hevc_qpel_uni_w_v64_8_c:                         92075.5 ( 1.00x)
put_hevc_qpel_uni_w_v64_8_rvv_i32:                   10269.8 ( 8.97x)
put_hevc_qpel_bi_v4_8_c:                               376.4 ( 1.00x)
put_hevc_qpel_bi_v4_8_rvv_i32:                         150.2 ( 2.51x)
put_hevc_qpel_bi_v6_8_c:                               808.3 ( 1.00x)
put_hevc_qpel_bi_v6_8_rvv_i32:                         207.1 ( 3.90x)
put_hevc_qpel_bi_v8_8_c:                              1490.1 ( 1.00x)
put_hevc_qpel_bi_v8_8_rvv_i32:                         257.2 ( 5.79x)
put_hevc_qpel_bi_v12_8_c:                             3220.3 ( 1.00x)
put_hevc_qpel_bi_v12_8_rvv_i32:                        375.2 ( 8.58x)
put_hevc_qpel_bi_v16_8_c:                             5657.5 ( 1.00x)
put_hevc_qpel_bi_v16_8_rvv_i32:                        482.5 (11.72x)
put_hevc_qpel_bi_v24_8_c:                            12495.4 ( 1.00x)
put_hevc_qpel_bi_v24_8_rvv_i32:                       1383.8 ( 9.03x)
put_hevc_qpel_bi_v32_8_c:                            22191.6 ( 1.00x)
put_hevc_qpel_bi_v32_8_rvv_i32:                       1822.0 (12.18x)
put_hevc_qpel_bi_v48_8_c:                            49654.0 ( 1.00x)
put_hevc_qpel_bi_v48_8_rvv_i32:                       4046.8 (12.27x)
put_hevc_qpel_bi_v64_8_c:                            88287.8 ( 1.00x)
put_hevc_qpel_bi_v64_8_rvv_i32:                       7196.6 (12.27x)
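
The vertical pass applies the same 8-tap filter across rows; the interesting part of the RVV version is that the filter_v macro keeps the last eight rows live in v16..v30 and rotates them with vmv.v.v, so the steady-state loop issues a single vle8 load per output row. A scalar sketch of the unclipped `v` variant (hypothetical names; assumes the same src -= 3*stride prologue as the assembly, not FFmpeg's actual C reference code):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PB_SIZE 64

/* Same coefficients as the qpel_filters table in hevcqpel_rvv.S. */
static const int8_t qpel_taps[4][8] = {
    {  0, 0,   0,  0,  0,   0, 0,  0 },
    { -1, 4, -10, 58, 17,  -5, 1,  0 },
    { -1, 4, -11, 40, 40, -11, 4, -1 },
    {  0, 1,  -5, 17, 58, -10, 4, -1 },
};

/* Unclipped vertical filter: an 8-row window slides down one row per
 * output line, which is what the register rotation in filter_v avoids
 * re-loading.  16-bit intermediates at a fixed MAX_PB_SIZE stride. */
static void put_qpel_v_ref(int16_t *dst, const uint8_t *src,
                           ptrdiff_t srcstride, int height,
                           int my, int width)
{
    const int8_t *f = qpel_taps[my & 3];
    src -= 3 * srcstride;                  /* as in the asm prologue */
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            int sum = 0;
            for (int i = 0; i < 8; i++)    /* rows y-3 .. y+4 */
                sum += f[i] * src[x + (ptrdiff_t)i * srcstride];
            dst[x] = (int16_t)sum;
        }
        src += srcstride;
        dst += MAX_PB_SIZE;
    }
}
```

As with the horizontal pass, the intermediate is written at a fixed 2*MAX_PB_SIZE-byte row stride so it can feed the bi/hv combinations directly.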

Signed-off-by: Zhanheng Yang <zhanheng.yang@linux.alibaba.com>
---
 libavcodec/riscv/h26x/h2656dsp.h     |  11 +
 libavcodec/riscv/h26x/hevcqpel_rvv.S | 315 ++++++++++++++++++++++++++-
 libavcodec/riscv/hevcdsp_init.c      |   5 +
 3 files changed, 330 insertions(+), 1 deletion(-)

diff --git a/libavcodec/riscv/h26x/h2656dsp.h b/libavcodec/riscv/h26x/h2656dsp.h
index 028b9ffbfd..2dabc16aee 100644
--- a/libavcodec/riscv/h26x/h2656dsp.h
+++ b/libavcodec/riscv/h26x/h2656dsp.h
@@ -36,4 +36,15 @@ void ff_hevc_put_qpel_uni_w_h_8_m1_rvv(uint8_t *_dst,  ptrdiff_t _dststride,
 void ff_hevc_put_qpel_bi_h_8_m1_rvv(uint8_t *_dst, ptrdiff_t _dststride, const uint8_t *_src,
         ptrdiff_t _srcstride, const int16_t *src2, int height,
         intptr_t mx, intptr_t my, int width);
+void ff_hevc_put_qpel_v_8_m1_rvv(int16_t *dst, const uint8_t *_src, ptrdiff_t _srcstride, int height,
+        intptr_t mx, intptr_t my, int width);
+void ff_hevc_put_qpel_uni_v_8_m1_rvv(uint8_t *_dst, ptrdiff_t _dststride, const uint8_t *_src,
+        ptrdiff_t _srcstride, int height, intptr_t mx, intptr_t my, int width);
+void ff_hevc_put_qpel_uni_w_v_8_m1_rvv(uint8_t *_dst,  ptrdiff_t _dststride,
+        const uint8_t *_src, ptrdiff_t _srcstride,
+        int height, int denom, int wx, int ox,
+        intptr_t mx, intptr_t my, int width);
+void ff_hevc_put_qpel_bi_v_8_m1_rvv(uint8_t *_dst, ptrdiff_t _dststride, const uint8_t *_src,
+        ptrdiff_t _srcstride, const int16_t *src2, int height,
+        intptr_t mx, intptr_t my, int width);
 #endif
diff --git a/libavcodec/riscv/h26x/hevcqpel_rvv.S b/libavcodec/riscv/h26x/hevcqpel_rvv.S
index 52d7acac33..8fd3c47bcc 100644
--- a/libavcodec/riscv/h26x/hevcqpel_rvv.S
+++ b/libavcodec/riscv/h26x/hevcqpel_rvv.S
@@ -306,4 +306,317 @@ func ff_hevc_put_qpel_bi_h_8_\lmul\()_rvv, zve32x
 endfunc
 .endm
 
-hevc_qpel_h m1, m2, m4
\ No newline at end of file
+hevc_qpel_h m1, m2, m4
+
+/* output is unclipped; clobbers v4 */
+.macro filter_v         vdst, vsrc0, vsrc1, vsrc2, vsrc3, vsrc4, vsrc5, vsrc6, vsrc7
+        vmv.v.x          v4, s1
+        vwmulsu.vv       \vdst, v4, \vsrc0
+        vwmaccsu.vx      \vdst, s2, \vsrc1
+        vmv.v.v          \vsrc0, \vsrc1
+        vwmaccsu.vx      \vdst, s3, \vsrc2
+        vmv.v.v          \vsrc1, \vsrc2
+        vwmaccsu.vx      \vdst, s4, \vsrc3
+        vmv.v.v          \vsrc2, \vsrc3
+        vwmaccsu.vx      \vdst, s5, \vsrc4
+        vmv.v.v          \vsrc3, \vsrc4
+        vwmaccsu.vx      \vdst, s6, \vsrc5
+        vmv.v.v          \vsrc4, \vsrc5
+        vwmaccsu.vx      \vdst, s7, \vsrc6
+        vmv.v.v          \vsrc5, \vsrc6
+        vwmaccsu.vx      \vdst, s8, \vsrc7
+        vmv.v.v          \vsrc6, \vsrc7
+.endm
+
+.macro hevc_qpel_v       lmul, lmul2, lmul4
+func ff_hevc_put_qpel_v_8_\lmul\()_rvv, zve32x
+    addi        sp, sp, -64
+    sx          s1, 0(sp)
+    sx          s2, 8(sp)
+    sx          s3, 16(sp)
+    sx          s4, 24(sp)
+    sx          s5, 32(sp)
+    sx          s6, 40(sp)
+    sx          s7, 48(sp)
+    sx          s8, 56(sp)
+    load_filter a5
+    slli        t1, a2, 1
+    add         t1, t1, a2
+    sub         a1, a1, t1      # src - 3 * src_stride
+    li          t1, 0           # offset
+    mv          t4, a3
+
+1:
+    add         t2, a1, t1
+    slli        t3, t1, 1
+    add         t3, a0, t3
+
+    vsetvli     t5, a6, e8, \lmul, ta, ma
+    vle8.v      v16, (t2)
+    add         t2, t2, a2
+    vle8.v      v18, (t2)
+    add         t2, t2, a2
+    vle8.v      v20, (t2)
+    add         t2, t2, a2
+    vle8.v      v22, (t2)
+    add         t2, t2, a2
+    vle8.v      v24, (t2)
+    add         t2, t2, a2
+    vle8.v      v26, (t2)
+    add         t2, t2, a2
+    vle8.v      v28, (t2)
+    add         t2, t2, a2
+
+2:
+    vsetvli     zero, zero, e8, \lmul, ta, ma
+    vle8.v      v30, (t2)
+    add         t2, t2, a2
+    filter_v    v0, v16, v18, v20, v22, v24, v26, v28, v30
+    vsetvli     zero, zero, e16, \lmul2, ta, ma
+    vse16.v     v0, (t3)
+    addi        t3, t3, 2*HEVC_MAX_PB_SIZE
+    addi        a3, a3, -1
+    bgt         a3, zero, 2b
+    add         t1, t1, t5
+    sub         a6, a6, t5
+    mv          a3, t4
+    bgt         a6, zero, 1b
+
+    lx          s1, 0(sp)
+    lx          s2, 8(sp)
+    lx          s3, 16(sp)
+    lx          s4, 24(sp)
+    lx          s5, 32(sp)
+    lx          s6, 40(sp)
+    lx          s7, 48(sp)
+    lx          s8, 56(sp)
+    addi        sp, sp, 64
+    ret
+endfunc
+
+func ff_hevc_put_qpel_uni_v_8_\lmul\()_rvv, zve32x
+    csrwi       vxrm, 0
+    addi        sp, sp, -64
+    sx          s1, 0(sp)
+    sx          s2, 8(sp)
+    sx          s3, 16(sp)
+    sx          s4, 24(sp)
+    sx          s5, 32(sp)
+    sx          s6, 40(sp)
+    sx          s7, 48(sp)
+    sx          s8, 56(sp)
+    load_filter a6
+    slli        t1, a3, 1
+    add         t1, t1, a3
+    sub         a2, a2, t1      # src - 3 * src_stride
+    li          t1, 0           # offset
+    mv          t4, a4
+
+1:
+    add         t2, a2, t1
+    add         t3, a0, t1
+
+    vsetvli     t5, a7, e8, \lmul, ta, ma
+    vle8.v      v16, (t2)
+    add         t2, t2, a3
+    vle8.v      v18, (t2)
+    add         t2, t2, a3
+    vle8.v      v20, (t2)
+    add         t2, t2, a3
+    vle8.v      v22, (t2)
+    add         t2, t2, a3
+    vle8.v      v24, (t2)
+    add         t2, t2, a3
+    vle8.v      v26, (t2)
+    add         t2, t2, a3
+    vle8.v      v28, (t2)
+    add         t2, t2, a3
+
+2:
+    vsetvli     zero, zero, e8, \lmul, ta, ma
+    vle8.v      v30, (t2)
+    add         t2, t2, a3
+    filter_v    v0, v16, v18, v20, v22, v24, v26, v28, v30
+    vsetvli     zero, zero, e16, \lmul2, ta, ma
+    vmax.vx     v0, v0, zero
+    vsetvli     zero, zero, e8, \lmul, ta, ma
+    vnclipu.wi  v0, v0, 6
+    vse8.v      v0, (t3)
+    add         t3, t3, a1
+    addi        a4, a4, -1
+    bgt         a4, zero, 2b
+    add         t1, t1, t5
+    sub         a7, a7, t5
+    mv          a4, t4
+    bgt         a7, zero, 1b
+
+    lx          s1, 0(sp)
+    lx          s2, 8(sp)
+    lx          s3, 16(sp)
+    lx          s4, 24(sp)
+    lx          s5, 32(sp)
+    lx          s6, 40(sp)
+    lx          s7, 48(sp)
+    lx          s8, 56(sp)
+    addi        sp, sp, 64
+    ret
+endfunc
+
+func ff_hevc_put_qpel_uni_w_v_8_\lmul\()_rvv, zve32x
+    csrwi       vxrm, 0
+#if (__riscv_xlen == 32)
+    lw          t1, 4(sp)       # my
+    lw          t6, 8(sp)       # width
+#elif (__riscv_xlen == 64)
+    ld          t1, 8(sp)
+    lw          t6, 16(sp)
+#endif
+    addi        sp, sp, -64
+    sx          s1, 0(sp)
+    sx          s2, 8(sp)
+    sx          s3, 16(sp)
+    sx          s4, 24(sp)
+    sx          s5, 32(sp)
+    sx          s6, 40(sp)
+    sx          s7, 48(sp)
+    sx          s8, 56(sp)
+    load_filter t1
+    addi        a5, a5, 6       # shift
+    slli        t1, a3, 1
+    add         t1, t1, a3
+    sub         a2, a2, t1      # src - 3 * src_stride
+    li          t1, 0           # offset
+    mv          t4, a4
+
+1:
+    add         t2, a2, t1
+    add         t3, a0, t1
+
+    vsetvli     t5, t6, e8, \lmul, ta, ma
+    vle8.v      v16, (t2)
+    add         t2, t2, a3
+    vle8.v      v18, (t2)
+    add         t2, t2, a3
+    vle8.v      v20, (t2)
+    add         t2, t2, a3
+    vle8.v      v22, (t2)
+    add         t2, t2, a3
+    vle8.v      v24, (t2)
+    add         t2, t2, a3
+    vle8.v      v26, (t2)
+    add         t2, t2, a3
+    vle8.v      v28, (t2)
+    add         t2, t2, a3
+
+2:
+    vsetvli     zero, zero, e8, \lmul, ta, ma
+    vle8.v      v30, (t2)
+    add         t2, t2, a3
+    filter_v    v0, v16, v18, v20, v22, v24, v26, v28, v30
+    vsetvli     zero, zero, e16, \lmul2, ta, ma
+    vwmul.vx    v8, v0, a6
+    vsetvli     zero, zero, e32, \lmul4, ta, ma
+    vssra.vx    v0, v8, a5
+    vsadd.vx    v0, v0, a7
+    vmax.vx     v0, v0, zero
+    vsetvli     zero, zero, e16, \lmul2, ta, ma
+    vnclip.wi   v0, v0, 0
+    vsetvli     zero, zero, e8, \lmul, ta, ma
+    vnclipu.wi  v0, v0, 0
+    vse8.v      v0, (t3)
+    add         t3, t3, a1
+    addi        a4, a4, -1
+    bgt         a4, zero, 2b
+    add         t1, t1, t5
+    sub         t6, t6, t5
+    mv          a4, t4
+    bgt         t6, zero, 1b
+
+    lx          s1, 0(sp)
+    lx          s2, 8(sp)
+    lx          s3, 16(sp)
+    lx          s4, 24(sp)
+    lx          s5, 32(sp)
+    lx          s6, 40(sp)
+    lx          s7, 48(sp)
+    lx          s8, 56(sp)
+    addi        sp, sp, 64
+    ret
+endfunc
+
+func ff_hevc_put_qpel_bi_v_8_\lmul\()_rvv, zve32x
+    csrwi       vxrm, 0
+    lw          t6, 0(sp)      # width
+    addi        sp, sp, -64
+    sx          s1, 0(sp)
+    sx          s2, 8(sp)
+    sx          s3, 16(sp)
+    sx          s4, 24(sp)
+    sx          s5, 32(sp)
+    sx          s6, 40(sp)
+    sx          s7, 48(sp)
+    sx          s8, 56(sp)
+    load_filter a7
+    slli        t1, a3, 1
+    add         t1, t1, a3
+    sub         a2, a2, t1      # src - 3 * src_stride
+    li          t1, 0           # offset
+    mv          t4, a5
+
+1:
+    add         t2, a2, t1
+    add         t3, a0, t1
+    slli        t0, t1, 1
+    add         t0, a4, t0
+
+    vsetvli     t5, t6, e8, \lmul, ta, ma
+    vle8.v      v16, (t2)
+    add         t2, t2, a3
+    vle8.v      v18, (t2)
+    add         t2, t2, a3
+    vle8.v      v20, (t2)
+    add         t2, t2, a3
+    vle8.v      v22, (t2)
+    add         t2, t2, a3
+    vle8.v      v24, (t2)
+    add         t2, t2, a3
+    vle8.v      v26, (t2)
+    add         t2, t2, a3
+    vle8.v      v28, (t2)
+    add         t2, t2, a3
+
+2:
+    vsetvli     zero, zero, e8, \lmul, ta, ma
+    vle8.v      v30, (t2)
+    add         t2, t2, a3
+    filter_v    v0, v16, v18, v20, v22, v24, v26, v28, v30
+    vsetvli     zero, zero, e16, \lmul2, ta, ma
+    vle16.v     v8, (t0)
+    addi        t0, t0, 2*HEVC_MAX_PB_SIZE
+    vsadd.vv    v0, v0, v8
+    vmax.vx     v0, v0, zero
+    vsetvli     zero, zero, e8, \lmul, ta, ma
+    vnclipu.wi  v0, v0, 7
+    vse8.v      v0, (t3)
+    add         t3, t3, a1
+    addi        a5, a5, -1
+    bgt         a5, zero, 2b
+    add         t1, t1, t5
+    sub         t6, t6, t5
+    mv          a5, t4
+    bgt         t6, zero, 1b
+
+    lx          s1, 0(sp)
+    lx          s2, 8(sp)
+    lx          s3, 16(sp)
+    lx          s4, 24(sp)
+    lx          s5, 32(sp)
+    lx          s6, 40(sp)
+    lx          s7, 48(sp)
+    lx          s8, 56(sp)
+    addi        sp, sp, 64
+    ret
+endfunc
+.endm
+
+hevc_qpel_v m1, m2, m4
\ No newline at end of file
diff --git a/libavcodec/riscv/hevcdsp_init.c b/libavcodec/riscv/hevcdsp_init.c
index 59333740de..480cfd2968 100644
--- a/libavcodec/riscv/hevcdsp_init.c
+++ b/libavcodec/riscv/hevcdsp_init.c
@@ -84,6 +84,10 @@ void ff_hevc_dsp_init_riscv(HEVCDSPContext *c, const int bit_depth)
                 RVV_FNASSIGN_PEL(c->put_hevc_qpel_uni, 0, 1, ff_hevc_put_qpel_uni_h_8_m1_rvv);
                 RVV_FNASSIGN_PEL(c->put_hevc_qpel_uni_w, 0, 1, ff_hevc_put_qpel_uni_w_h_8_m1_rvv);
                 RVV_FNASSIGN_PEL(c->put_hevc_qpel_bi, 0, 1, ff_hevc_put_qpel_bi_h_8_m1_rvv);
+                RVV_FNASSIGN_PEL(c->put_hevc_qpel, 1, 0, ff_hevc_put_qpel_v_8_m1_rvv);
+                RVV_FNASSIGN_PEL(c->put_hevc_qpel_uni, 1, 0, ff_hevc_put_qpel_uni_v_8_m1_rvv);
+                RVV_FNASSIGN_PEL(c->put_hevc_qpel_uni_w, 1, 0, ff_hevc_put_qpel_uni_w_v_8_m1_rvv);
+                RVV_FNASSIGN_PEL(c->put_hevc_qpel_bi, 1, 0, ff_hevc_put_qpel_bi_v_8_m1_rvv);
 
                 break;
             default:
-- 
2.25.1

_______________________________________________
ffmpeg-devel mailing list -- ffmpeg-devel@ffmpeg.org
To unsubscribe send an email to ffmpeg-devel-leave@ffmpeg.org

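For readers cross-checking the vertical filter in the patch above, here is a scalar C sketch of what one lane of `filter_v` plus the `uni` epilogue (`vmax.vx` followed by `vnclipu.wi v0, v0, 6` with vxrm = 0, i.e. round-to-nearest-up) computes. The coefficient rows are the standard HEVC luma qpel table; the function names and the frac-0 identity row are illustrative only, not part of the patch:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Standard HEVC luma qpel coefficients (8 taps, fractions 0..3). */
static const int8_t qpel_coeffs[4][8] = {
    {  0, 0,   0, 64,  0,   0, 0,  0 },
    { -1, 4, -10, 58, 17,  -5, 1,  0 },
    { -1, 4, -11, 40, 40, -11, 4, -1 },
    {  0, 1,  -5, 17, 58, -10, 4, -1 },
};

/* Unclipped 16-bit intermediate, as the plain put_qpel_v stores it. */
static int qpel_v_pixel(const uint8_t *src, ptrdiff_t stride, int my)
{
    const int8_t *f = qpel_coeffs[my & 3];
    int sum = 0;
    for (int i = 0; i < 8; i++)          /* taps span rows -3 .. +4 */
        sum += f[i] * src[(i - 3) * stride];
    return sum;
}

/* uni path: clamp negatives, round-shift by 6, clip to 8 bits. */
static uint8_t qpel_uni_v_pixel(const uint8_t *src, ptrdiff_t stride, int my)
{
    int v = qpel_v_pixel(src, stride, my);
    v = v < 0 ? 0 : v;                   /* vmax.vx v0, v0, zero     */
    v = (v + 32) >> 6;                   /* vnclipu.wi …, 6 rounding */
    return v > 255 ? 255 : (uint8_t)v;
}
```

On a flat input the taps sum to 64, so the unclipped intermediate is 64 times the pixel value and the uni output reproduces it, which is a quick sanity check for the table.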
^ permalink raw reply	[flat|nested] 6+ messages in thread

* [FFmpeg-devel] [PATCH 3/6] libavcodec/riscv: add RVV optimized for epel_h in HEVC.
  2026-01-22  4:23 [FFmpeg-devel] [PATCH 1/6] libavcodec/riscv: add RVV optimized for qpel_h in HEVC zhanheng.yang--- via ffmpeg-devel
  2026-01-22  4:23 ` [FFmpeg-devel] [PATCH 2/6] libavcodec/riscv: add RVV optimized for qpel_v " zhanheng.yang--- via ffmpeg-devel
@ 2026-01-22  4:23 ` zhanheng.yang--- via ffmpeg-devel
  2026-01-22  4:23 ` [FFmpeg-devel] [PATCH 4/6] libavcodec/riscv: add RVV optimized for epel_v " zhanheng.yang--- via ffmpeg-devel
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: zhanheng.yang--- via ffmpeg-devel @ 2026-01-22  4:23 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Zhanheng Yang

From: Zhanheng Yang <zhanheng.yang@linux.alibaba.com>

Benchmarked on an A210 C908 core (VLEN 128).
put_hevc_epel_h4_8_c:                                  146.2 ( 1.00x)
put_hevc_epel_h4_8_rvv_i32:                             81.8 ( 1.79x)
put_hevc_epel_h6_8_c:                                  305.4 ( 1.00x)
put_hevc_epel_h6_8_rvv_i32:                            115.5 ( 2.65x)
put_hevc_epel_h8_8_c:                                  532.7 ( 1.00x)
put_hevc_epel_h8_8_rvv_i32:                            156.7 ( 3.40x)
put_hevc_epel_h12_8_c:                                1233.8 ( 1.00x)
put_hevc_epel_h12_8_rvv_i32:                           225.7 ( 5.47x)
put_hevc_epel_h16_8_c:                                2223.8 ( 1.00x)
put_hevc_epel_h16_8_rvv_i32:                           296.2 ( 7.51x)
put_hevc_epel_h24_8_c:                                4739.4 ( 1.00x)
put_hevc_epel_h24_8_rvv_i32:                           800.7 ( 5.92x)
put_hevc_epel_h32_8_c:                                8344.4 ( 1.00x)
put_hevc_epel_h32_8_rvv_i32:                          1066.0 ( 7.83x)
put_hevc_epel_h48_8_c:                               18595.3 ( 1.00x)
put_hevc_epel_h48_8_rvv_i32:                          2324.3 ( 8.00x)
put_hevc_epel_h64_8_c:                               32911.2 ( 1.00x)
put_hevc_epel_h64_8_rvv_i32:                          4079.8 ( 8.07x)
put_hevc_epel_uni_h4_8_c:                              225.1 ( 1.00x)
put_hevc_epel_uni_h4_8_rvv_i32:                         99.0 ( 2.27x)
put_hevc_epel_uni_h6_8_c:                              500.0 ( 1.00x)
put_hevc_epel_uni_h6_8_rvv_i32:                        138.1 ( 3.62x)
put_hevc_epel_uni_h8_8_c:                              895.6 ( 1.00x)
put_hevc_epel_uni_h8_8_rvv_i32:                        186.3 ( 4.81x)
put_hevc_epel_uni_h12_8_c:                            1925.0 ( 1.00x)
put_hevc_epel_uni_h12_8_rvv_i32:                       264.4 ( 7.28x)
put_hevc_epel_uni_h16_8_c:                            3372.3 ( 1.00x)
put_hevc_epel_uni_h16_8_rvv_i32:                       342.7 ( 9.84x)
put_hevc_epel_uni_h24_8_c:                            7501.4 ( 1.00x)
put_hevc_epel_uni_h24_8_rvv_i32:                       935.6 ( 8.02x)
put_hevc_epel_uni_h32_8_c:                           13232.0 ( 1.00x)
put_hevc_epel_uni_h32_8_rvv_i32:                      1240.0 (10.67x)
put_hevc_epel_uni_h48_8_c:                           29608.1 ( 1.00x)
put_hevc_epel_uni_h48_8_rvv_i32:                      2710.5 (10.92x)
put_hevc_epel_uni_h64_8_c:                           52452.8 ( 1.00x)
put_hevc_epel_uni_h64_8_rvv_i32:                      4775.5 (10.98x)
put_hevc_epel_uni_w_h4_8_c:                            298.5 ( 1.00x)
put_hevc_epel_uni_w_h4_8_rvv_i32:                      176.6 ( 1.69x)
put_hevc_epel_uni_w_h6_8_c:                            645.3 ( 1.00x)
put_hevc_epel_uni_w_h6_8_rvv_i32:                      254.9 ( 2.53x)
put_hevc_epel_uni_w_h8_8_c:                           1187.0 ( 1.00x)
put_hevc_epel_uni_w_h8_8_rvv_i32:                      335.3 ( 3.54x)
put_hevc_epel_uni_w_h12_8_c:                          2535.6 ( 1.00x)
put_hevc_epel_uni_w_h12_8_rvv_i32:                     487.8 ( 5.20x)
put_hevc_epel_uni_w_h16_8_c:                          4491.0 ( 1.00x)
put_hevc_epel_uni_w_h16_8_rvv_i32:                     641.8 ( 7.00x)
put_hevc_epel_uni_w_h24_8_c:                          9974.7 ( 1.00x)
put_hevc_epel_uni_w_h24_8_rvv_i32:                    1791.4 ( 5.57x)
put_hevc_epel_uni_w_h32_8_c:                         17646.1 ( 1.00x)
put_hevc_epel_uni_w_h32_8_rvv_i32:                    2379.0 ( 7.42x)
put_hevc_epel_uni_w_h48_8_c:                         39569.2 ( 1.00x)
put_hevc_epel_uni_w_h48_8_rvv_i32:                    5226.0 ( 7.57x)
put_hevc_epel_uni_w_h64_8_c:                         70274.5 ( 1.00x)
put_hevc_epel_uni_w_h64_8_rvv_i32:                    9214.3 ( 7.63x)
put_hevc_epel_bi_h4_8_c:                               234.5 ( 1.00x)
put_hevc_epel_bi_h4_8_rvv_i32:                         128.3 ( 1.83x)
put_hevc_epel_bi_h6_8_c:                               505.0 ( 1.00x)
put_hevc_epel_bi_h6_8_rvv_i32:                         177.1 ( 2.85x)
put_hevc_epel_bi_h8_8_c:                               958.2 ( 1.00x)
put_hevc_epel_bi_h8_8_rvv_i32:                         235.2 ( 4.07x)
put_hevc_epel_bi_h12_8_c:                             2001.0 ( 1.00x)
put_hevc_epel_bi_h12_8_rvv_i32:                        338.5 ( 5.91x)
put_hevc_epel_bi_h16_8_c:                             3510.2 ( 1.00x)
put_hevc_epel_bi_h16_8_rvv_i32:                        446.5 ( 7.86x)
put_hevc_epel_bi_h24_8_c:                             7803.2 ( 1.00x)
put_hevc_epel_bi_h24_8_rvv_i32:                       1189.6 ( 6.56x)
put_hevc_epel_bi_h32_8_c:                            13764.5 ( 1.00x)
put_hevc_epel_bi_h32_8_rvv_i32:                       1579.3 ( 8.72x)
put_hevc_epel_bi_h48_8_c:                            30827.4 ( 1.00x)
put_hevc_epel_bi_h48_8_rvv_i32:                       3422.3 ( 9.01x)
put_hevc_epel_bi_h64_8_c:                            54715.6 ( 1.00x)
put_hevc_epel_bi_h64_8_rvv_i32:                       6059.8 ( 9.03x)
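
As a cross-check of the numbers above, this is a scalar C model of one output sample of the `filter_h` macro this patch adds (unclipped 16-bit intermediate; coefficient rows match the 4-tap table in the new .S file; the function name is illustrative only):

```c
#include <assert.h>
#include <stdint.h>

/* HEVC chroma (epel) 4-tap coefficients, fractions 0..7,
 * as in the filter table added by this patch. */
static const int8_t epel_coeffs[8][4] = {
    {  0,  0,  0,  0 }, { -2, 58, 10, -2 }, { -4, 54, 16, -2 },
    { -6, 46, 28, -4 }, { -4, 36, 36, -4 }, { -4, 28, 46, -6 },
    { -2, 16, 54, -4 }, { -2, 10, 58, -2 },
};

/* src points at the center pixel; taps span src[-1] .. src[2],
 * matching the -1/0/+1/+2 loads in filter_h. */
static int16_t epel_h_pixel(const uint8_t *src, int mx)
{
    const int8_t *f = epel_coeffs[mx & 7];
    return (int16_t)(f[0] * src[-1] + f[1] * src[0] +
                     f[2] * src[1]  + f[3] * src[2]);
}
```

Every coefficient row sums to 64, so on flat input the intermediate is 64 times the pixel value, which the plain `put` variant stores unshifted.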

Signed-off-by: Zhanheng Yang <zhanheng.yang@linux.alibaba.com>
---
 libavcodec/riscv/Makefile            |   3 +-
 libavcodec/riscv/h26x/h2656dsp.h     |  12 ++
 libavcodec/riscv/h26x/hevcepel_rvv.S | 265 +++++++++++++++++++++++++++
 libavcodec/riscv/hevcdsp_init.c      |   4 +
 4 files changed, 283 insertions(+), 1 deletion(-)
 create mode 100644 libavcodec/riscv/h26x/hevcepel_rvv.S

diff --git a/libavcodec/riscv/Makefile b/libavcodec/riscv/Makefile
index 414790ae0c..bf65e827e7 100644
--- a/libavcodec/riscv/Makefile
+++ b/libavcodec/riscv/Makefile
@@ -37,7 +37,8 @@ OBJS-$(CONFIG_H264QPEL) += riscv/h264qpel_init.o
 RVV-OBJS-$(CONFIG_H264QPEL) += riscv/h264qpel_rvv.o
 OBJS-$(CONFIG_HEVC_DECODER) += riscv/hevcdsp_init.o
 RVV-OBJS-$(CONFIG_HEVC_DECODER)  += riscv/h26x/h2656_inter_rvv.o \
-                                    riscv/h26x/hevcqpel_rvv.o
+                                    riscv/h26x/hevcqpel_rvv.o \
+                                    riscv/h26x/hevcepel_rvv.o
 OBJS-$(CONFIG_HUFFYUV_DECODER) += riscv/huffyuvdsp_init.o
 RVV-OBJS-$(CONFIG_HUFFYUV_DECODER) += riscv/huffyuvdsp_rvv.o
 OBJS-$(CONFIG_IDCTDSP) += riscv/idctdsp_init.o
diff --git a/libavcodec/riscv/h26x/h2656dsp.h b/libavcodec/riscv/h26x/h2656dsp.h
index 2dabc16aee..fa2f5a88e3 100644
--- a/libavcodec/riscv/h26x/h2656dsp.h
+++ b/libavcodec/riscv/h26x/h2656dsp.h
@@ -47,4 +47,16 @@ void ff_hevc_put_qpel_uni_w_v_8_m1_rvv(uint8_t *_dst,  ptrdiff_t _dststride,
 void ff_hevc_put_qpel_bi_v_8_m1_rvv(uint8_t *_dst, ptrdiff_t _dststride, const uint8_t *_src,
         ptrdiff_t _srcstride, const int16_t *src2, int height, intptr_t
         mx, intptr_t my, int width);
+
+void ff_hevc_put_epel_h_8_m1_rvv(int16_t *dst, const uint8_t *_src, ptrdiff_t _srcstride, int height,
+        intptr_t mx, intptr_t my, int width);
+void ff_hevc_put_epel_uni_h_8_m1_rvv(uint8_t *_dst, ptrdiff_t _dststride, const uint8_t *_src,
+        ptrdiff_t _srcstride, int height, intptr_t mx, intptr_t my, int width);
+void ff_hevc_put_epel_uni_w_h_8_m1_rvv(uint8_t *_dst,  ptrdiff_t _dststride,
+        const uint8_t *_src, ptrdiff_t _srcstride,
+        int height, int denom, int wx, int ox,
+        intptr_t mx, intptr_t my, int width);
+void ff_hevc_put_epel_bi_h_8_m1_rvv(uint8_t *_dst, ptrdiff_t _dststride, const uint8_t *_src,
+        ptrdiff_t _srcstride, const int16_t *src2, int height, intptr_t
+        mx, intptr_t my, int width);
 #endif
diff --git a/libavcodec/riscv/h26x/hevcepel_rvv.S b/libavcodec/riscv/h26x/hevcepel_rvv.S
new file mode 100644
index 0000000000..81044846f7
--- /dev/null
+++ b/libavcodec/riscv/h26x/hevcepel_rvv.S
@@ -0,0 +1,265 @@
+/*
+ * Copyright (C) 2026 Alibaba Group Holding Limited.
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+.data
+.align 2
+qpel_filters:
+    .byte  0,  0,  0,  0
+    .byte -2, 58, 10, -2
+    .byte -4, 54, 16, -2
+    .byte -6, 46, 28, -4
+    .byte -4, 36, 36, -4
+    .byte -4, 28, 46, -6
+    .byte -2, 16, 54, -4
+    .byte -2, 10, 58, -2
+
+.text
+#include "libavutil/riscv/asm.S"
+#define HEVC_MAX_PB_SIZE 64
+
+.macro  lx rd, addr
+#if (__riscv_xlen == 32)
+        lw      \rd, \addr
+#elif (__riscv_xlen == 64)
+        ld      \rd, \addr
+#else
+        lq      \rd, \addr
+#endif
+.endm
+
+.macro  sx rd, addr
+#if (__riscv_xlen == 32)
+        sw      \rd, \addr
+#elif (__riscv_xlen == 64)
+        sd      \rd, \addr
+#else
+        sq      \rd, \addr
+#endif
+.endm
+
+/* clobbers t0, t1 */
+.macro load_filter m
+        la          t0, qpel_filters
+        slli        t1, \m, 2
+        add         t0, t0, t1
+        lb          s1, 0(t0)
+        lb          s2, 1(t0)
+        lb          s3, 2(t0)
+        lb          s4, 3(t0)
+.endm
+
+/* output is unclipped; clobbers t4 */
+.macro filter_h         vdst, vsrc0, vsrc1, vsrc2, vsrc3, src
+        addi             t4, \src, -1
+        vle8.v           \vsrc0, (t4)
+        vmv.v.x          \vsrc3, s1
+        vwmulsu.vv       \vdst, \vsrc3, \vsrc0
+        vle8.v           \vsrc1, (\src)
+        addi             t4, \src, 1
+        vle8.v           \vsrc2, (t4)
+        addi             t4, \src, 2
+        vle8.v           \vsrc3, (t4)
+
+        vwmaccsu.vx      \vdst, s2, \vsrc1
+        vwmaccsu.vx      \vdst, s3, \vsrc2
+        vwmaccsu.vx      \vdst, s4, \vsrc3
+.endm
+
+.macro vreg
+
+.endm
+
+.macro hevc_epel_h       lmul, lmul2, lmul4
+func ff_hevc_put_epel_h_8_\lmul\()_rvv, zve32x
+    addi        sp, sp, -32
+    sx          s1, 0(sp)
+    sx          s2, 8(sp)
+    sx          s3, 16(sp)
+    sx          s4, 24(sp)
+    load_filter a4
+    mv          t3, a6
+    li          t1, 0       # offset
+
+1:
+    vsetvli     t6, t3, e8, \lmul, ta, ma
+    add         t2, a1, t1
+    filter_h    v0, v16, v18, v20, v22, t2
+    vsetvli     zero, zero, e16, \lmul2, ta, ma
+    slli        t2, t1, 1
+    add         t2, a0, t2
+    vse16.v     v0, (t2)
+    sub         t3, t3, t6
+    add         t1, t1, t6
+    bgt         t3, zero, 1b
+    addi        a3, a3, -1
+    mv          t3, a6
+    add         a1, a1, a2
+    addi        a0, a0, 2*HEVC_MAX_PB_SIZE
+    li          t1, 0
+    bgt         a3, zero, 1b
+
+    lx          s1, 0(sp)
+    lx          s2, 8(sp)
+    lx          s3, 16(sp)
+    lx          s4, 24(sp)
+    addi        sp, sp, 32
+    ret
+endfunc
+
+func ff_hevc_put_epel_uni_h_8_\lmul\()_rvv, zve32x
+    csrwi       vxrm, 0
+    addi        sp, sp, -32
+    sx          s1, 0(sp)
+    sx          s2, 8(sp)
+    sx          s3, 16(sp)
+    sx          s4, 24(sp)
+    load_filter a5
+    mv          t3, a7
+    li          t1, 0       # offset
+
+1:
+    vsetvli     t6, t3, e8, \lmul, ta, ma
+    add         t2, a2, t1
+    filter_h    v0, v16, v18, v20, v22, t2
+    vsetvli     zero, zero, e16, \lmul2, ta, ma
+    vmax.vx     v0, v0, zero
+    vsetvli     zero, zero, e8, \lmul, ta, ma
+    vnclipu.wi  v0, v0, 6
+    add         t2, a0, t1
+    vse8.v      v0, (t2)
+    sub         t3, t3, t6
+    add         t1, t1, t6
+    bgt         t3, zero, 1b
+    addi        a4, a4, -1
+    mv          t3, a7
+    add         a2, a2, a3
+    add         a0, a0, a1
+    li          t1, 0
+    bgt         a4, zero, 1b
+
+    lx          s1, 0(sp)
+    lx          s2, 8(sp)
+    lx          s3, 16(sp)
+    lx          s4, 24(sp)
+    addi        sp, sp, 32
+    ret
+endfunc
+
+func ff_hevc_put_epel_uni_w_h_8_\lmul\()_rvv, zve32x
+    csrwi       vxrm, 0
+    lx          t2, 0(sp)       # mx
+    addi        a5, a5, 6       # shift
+#if (__riscv_xlen == 32)
+    lw          t3, 8(sp)       # width
+#elif (__riscv_xlen == 64)
+    lw          t3, 16(sp)
+#endif
+    addi        sp, sp, -32
+    sx          s1, 0(sp)
+    sx          s2, 8(sp)
+    sx          s3, 16(sp)
+    sx          s4, 24(sp)
+    load_filter t2
+    li          t2, 0           # offset
+
+1:
+    vsetvli     t6, t3, e8, \lmul, ta, ma
+    add         t1, a2, t2
+    filter_h    v8, v16, v18, v20, v22, t1
+    vsetvli     zero, zero, e16, \lmul2, ta, ma
+    vwmul.vx    v0, v8, a6
+    vsetvli     zero, zero, e32, \lmul4, ta, ma
+    vssra.vx    v0, v0, a5
+    vsadd.vx    v0, v0, a7
+    vmax.vx     v0, v0, zero
+    vsetvli     zero, zero, e16, \lmul2, ta, ma
+    vnclip.wi   v0, v0, 0
+    vsetvli     zero, zero, e8, \lmul, ta, ma
+    vnclipu.wi  v0, v0, 0
+    add         t1, a0, t2
+    vse8.v      v0, (t1)
+    sub         t3, t3, t6
+    add         t2, t2, t6
+    bgt         t3, zero, 1b
+    addi        a4, a4, -1
+#if (__riscv_xlen == 32)
+    lw          t3, 40(sp)
+#elif (__riscv_xlen == 64)
+    lw          t3, 48(sp)
+#endif
+    add         a2, a2, a3
+    add         a0, a0, a1
+    li          t2, 0
+    bgt         a4, zero, 1b
+
+    lx          s1, 0(sp)
+    lx          s2, 8(sp)
+    lx          s3, 16(sp)
+    lx          s4, 24(sp)
+    addi        sp, sp, 32
+    ret
+endfunc
+
+func ff_hevc_put_epel_bi_h_8_\lmul\()_rvv, zve32x
+    csrwi       vxrm, 0
+    lw          t3, 0(sp)      # width
+    addi        sp, sp, -32
+    sx          s1, 0(sp)
+    sx          s2, 8(sp)
+    sx          s3, 16(sp)
+    sx          s4, 24(sp)
+    load_filter a6
+    li          t1, 0          # offset
+
+1:
+    vsetvli     t6, t3, e16, \lmul2, ta, ma
+    slli        t2, t1, 1
+    add         t2, a4, t2
+    vle16.v     v12, (t2)
+    vsetvli     zero, zero, e8, \lmul, ta, ma
+    add         t2, a2, t1
+    filter_h    v0, v16, v18, v20, v22, t2
+    vsetvli     zero, zero, e16, \lmul2, ta, ma
+    vsadd.vv    v0, v0, v12
+    vmax.vx     v0, v0, zero
+    vsetvli     zero, zero, e8, \lmul, ta, ma
+    vnclipu.wi  v0, v0, 7
+    add         t2, a0, t1
+    vse8.v      v0, (t2)
+    sub         t3, t3, t6
+    add         t1, t1, t6
+    bgt         t3, zero, 1b
+    addi        a5, a5, -1
+    lw          t3, 32(sp)
+    add         a2, a2, a3
+    add         a0, a0, a1
+    addi        a4, a4, 2*HEVC_MAX_PB_SIZE
+    li          t1, 0
+    bgt         a5, zero, 1b
+
+    lx          s1, 0(sp)
+    lx          s2, 8(sp)
+    lx          s3, 16(sp)
+    lx          s4, 24(sp)
+    addi        sp, sp, 32
+    ret
+endfunc
+.endm
+
+hevc_epel_h m1, m2, m4
\ No newline at end of file
diff --git a/libavcodec/riscv/hevcdsp_init.c b/libavcodec/riscv/hevcdsp_init.c
index 480cfd2968..8608fdbd19 100644
--- a/libavcodec/riscv/hevcdsp_init.c
+++ b/libavcodec/riscv/hevcdsp_init.c
@@ -90,6 +90,10 @@ void ff_hevc_dsp_init_riscv(HEVCDSPContext *c, const int bit_depth)
                 RVV_FNASSIGN_PEL(c->put_hevc_qpel_uni_w, 1, 0, ff_hevc_put_qpel_uni_w_v_8_m1_rvv);
                 RVV_FNASSIGN_PEL(c->put_hevc_qpel_bi, 1, 0, ff_hevc_put_qpel_bi_v_8_m1_rvv);
 
+                RVV_FNASSIGN_PEL(c->put_hevc_epel, 0, 1, ff_hevc_put_epel_h_8_m1_rvv);
+                RVV_FNASSIGN_PEL(c->put_hevc_epel_uni, 0, 1, ff_hevc_put_epel_uni_h_8_m1_rvv);
+                RVV_FNASSIGN_PEL(c->put_hevc_epel_uni_w, 0, 1, ff_hevc_put_epel_uni_w_h_8_m1_rvv);
+                RVV_FNASSIGN_PEL(c->put_hevc_epel_bi, 0, 1, ff_hevc_put_epel_bi_h_8_m1_rvv);
                 break;
             default:
                 break;
-- 
2.25.1

_______________________________________________
ffmpeg-devel mailing list -- ffmpeg-devel@ffmpeg.org
To unsubscribe send an email to ffmpeg-devel-leave@ffmpeg.org

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [FFmpeg-devel] [PATCH 4/6] libavcodec/riscv: add RVV optimized for epel_v in HEVC.
  2026-01-22  4:23 [FFmpeg-devel] [PATCH 1/6] libavcodec/riscv: add RVV optimized for qpel_h in HEVC zhanheng.yang--- via ffmpeg-devel
  2026-01-22  4:23 ` [FFmpeg-devel] [PATCH 2/6] libavcodec/riscv: add RVV optimized for qpel_v " zhanheng.yang--- via ffmpeg-devel
  2026-01-22  4:23 ` [FFmpeg-devel] [PATCH 3/6] libavcodec/riscv: add RVV optimized for epel_h " zhanheng.yang--- via ffmpeg-devel
@ 2026-01-22  4:23 ` zhanheng.yang--- via ffmpeg-devel
  2026-01-22  4:23 ` [FFmpeg-devel] [PATCH 5/6] libavcodec/riscv: add RVV optimized for qpel_hv " zhanheng.yang--- via ffmpeg-devel
  2026-01-22  4:23 ` [FFmpeg-devel] [PATCH 6/6] libavcodec/riscv: add RVV optimized for epel_hv " zhanheng.yang--- via ffmpeg-devel
  4 siblings, 0 replies; 6+ messages in thread
From: zhanheng.yang--- via ffmpeg-devel @ 2026-01-22  4:23 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Zhanheng Yang

From: Zhanheng Yang <zhanheng.yang@linux.alibaba.com>

Benchmarked on an A210 C908 core (VLEN 128).
put_hevc_epel_v4_8_c:                                  157.8 ( 1.00x)
put_hevc_epel_v4_8_rvv_i32:                             73.2 ( 2.16x)
put_hevc_epel_v6_8_c:                                  314.6 ( 1.00x)
put_hevc_epel_v6_8_rvv_i32:                            101.2 ( 3.11x)
put_hevc_epel_v8_8_c:                                  545.5 ( 1.00x)
put_hevc_epel_v8_8_rvv_i32:                            124.4 ( 4.39x)
put_hevc_epel_v12_8_c:                                1240.8 ( 1.00x)
put_hevc_epel_v12_8_rvv_i32:                           183.6 ( 6.76x)
put_hevc_epel_v16_8_c:                                2170.7 ( 1.00x)
put_hevc_epel_v16_8_rvv_i32:                           235.1 ( 9.23x)
put_hevc_epel_v24_8_c:                                4743.5 ( 1.00x)
put_hevc_epel_v24_8_rvv_i32:                           677.5 ( 7.00x)
put_hevc_epel_v32_8_c:                                8353.4 ( 1.00x)
put_hevc_epel_v32_8_rvv_i32:                           892.1 ( 9.36x)
put_hevc_epel_v48_8_c:                               18608.1 ( 1.00x)
put_hevc_epel_v48_8_rvv_i32:                          1956.1 ( 9.51x)
put_hevc_epel_v64_8_c:                               32934.3 ( 1.00x)
put_hevc_epel_v64_8_rvv_i32:                          3454.1 ( 9.53x)
put_hevc_epel_uni_v4_8_c:                              237.5 ( 1.00x)
put_hevc_epel_uni_v4_8_rvv_i32:                         87.5 ( 2.72x)
put_hevc_epel_uni_v6_8_c:                              509.5 ( 1.00x)
put_hevc_epel_uni_v6_8_rvv_i32:                        119.6 ( 4.26x)
put_hevc_epel_uni_v8_8_c:                              982.8 ( 1.00x)
put_hevc_epel_uni_v8_8_rvv_i32:                        147.1 ( 6.68x)
put_hevc_epel_uni_v12_8_c:                            2027.7 ( 1.00x)
put_hevc_epel_uni_v12_8_rvv_i32:                       211.0 ( 9.61x)
put_hevc_epel_uni_v16_8_c:                            3525.4 ( 1.00x)
put_hevc_epel_uni_v16_8_rvv_i32:                       278.8 (12.64x)
put_hevc_epel_uni_v24_8_c:                            7804.3 ( 1.00x)
put_hevc_epel_uni_v24_8_rvv_i32:                       778.9 (10.02x)
put_hevc_epel_uni_v32_8_c:                           13807.3 ( 1.00x)
put_hevc_epel_uni_v32_8_rvv_i32:                      1028.7 (13.42x)
put_hevc_epel_uni_v48_8_c:                           30934.9 ( 1.00x)
put_hevc_epel_uni_v48_8_rvv_i32:                      2265.1 (13.66x)
put_hevc_epel_uni_v64_8_c:                           54705.5 ( 1.00x)
put_hevc_epel_uni_v64_8_rvv_i32:                      4003.7 (13.66x)
put_hevc_epel_uni_w_v4_8_c:                            313.8 ( 1.00x)
put_hevc_epel_uni_w_v4_8_rvv_i32:                      156.6 ( 2.00x)
put_hevc_epel_uni_w_v6_8_c:                            674.3 ( 1.00x)
put_hevc_epel_uni_w_v6_8_rvv_i32:                      222.8 ( 3.03x)
put_hevc_epel_uni_w_v8_8_c:                           1253.3 ( 1.00x)
put_hevc_epel_uni_w_v8_8_rvv_i32:                      279.4 ( 4.49x)
put_hevc_epel_uni_w_v12_8_c:                          2619.4 ( 1.00x)
put_hevc_epel_uni_w_v12_8_rvv_i32:                     410.2 ( 6.39x)
put_hevc_epel_uni_w_v16_8_c:                          4614.2 ( 1.00x)
put_hevc_epel_uni_w_v16_8_rvv_i32:                     535.8 ( 8.61x)
put_hevc_epel_uni_w_v24_8_c:                         10290.6 ( 1.00x)
put_hevc_epel_uni_w_v24_8_rvv_i32:                    1550.6 ( 6.64x)
put_hevc_epel_uni_w_v32_8_c:                         18169.4 ( 1.00x)
put_hevc_epel_uni_w_v32_8_rvv_i32:                    2047.2 ( 8.88x)
put_hevc_epel_uni_w_v48_8_c:                         40704.3 ( 1.00x)
put_hevc_epel_uni_w_v48_8_rvv_i32:                    4552.4 ( 8.94x)
put_hevc_epel_uni_w_v64_8_c:                         72197.1 ( 1.00x)
put_hevc_epel_uni_w_v64_8_rvv_i32:                    8069.4 ( 8.95x)
put_hevc_epel_bi_v4_8_c:                               262.7 ( 1.00x)
put_hevc_epel_bi_v4_8_rvv_i32:                         105.9 ( 2.48x)
put_hevc_epel_bi_v6_8_c:                               553.0 ( 1.00x)
put_hevc_epel_bi_v6_8_rvv_i32:                         145.4 ( 3.80x)
put_hevc_epel_bi_v8_8_c:                              1045.5 ( 1.00x)
put_hevc_epel_bi_v8_8_rvv_i32:                         180.3 ( 5.80x)
put_hevc_epel_bi_v12_8_c:                             2172.7 ( 1.00x)
put_hevc_epel_bi_v12_8_rvv_i32:                        264.2 ( 8.22x)
put_hevc_epel_bi_v16_8_c:                             3791.6 ( 1.00x)
put_hevc_epel_bi_v16_8_rvv_i32:                        336.5 (11.27x)
put_hevc_epel_bi_v24_8_c:                             8424.1 ( 1.00x)
put_hevc_epel_bi_v24_8_rvv_i32:                        967.2 ( 8.71x)
put_hevc_epel_bi_v32_8_c:                            14910.8 ( 1.00x)
put_hevc_epel_bi_v32_8_rvv_i32:                       1270.7 (11.73x)
put_hevc_epel_bi_v48_8_c:                            33326.5 ( 1.00x)
put_hevc_epel_bi_v48_8_rvv_i32:                       2804.7 (11.88x)
put_hevc_epel_bi_v64_8_c:                            59177.9 ( 1.00x)
put_hevc_epel_bi_v64_8_rvv_i32:                       5022.3 (11.78x)
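
The vertical speedups above come largely from the row-rotation scheme in the new `filter_v` macro: a 3-row window is primed once, then each output row loads exactly one new source row and the `vmv.v.v` moves shift the window down. A scalar C sketch of that scheme for one column (coefficients as in the epel table; the function name is illustrative only):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* HEVC chroma (epel) 4-tap coefficients, fractions 0..7. */
static const int8_t epel_coeffs[8][4] = {
    {  0,  0,  0,  0 }, { -2, 58, 10, -2 }, { -4, 54, 16, -2 },
    { -6, 46, 28, -4 }, { -4, 36, 36, -4 }, { -4, 28, 46, -6 },
    { -2, 16, 54, -4 }, { -2, 10, 58, -2 },
};

/* Filter one column of `height` samples; src points at the first
 * output row, so the taps span rows -1 .. +2. */
static void epel_v_col(int16_t *dst, const uint8_t *src, ptrdiff_t stride,
                       int height, int my)
{
    const int8_t *f = epel_coeffs[my & 7];
    /* prime the window: rows -1, 0, +1 (the three loads before "2:") */
    uint8_t r0 = src[-stride], r1 = src[0], r2 = src[stride];
    for (int y = 0; y < height; y++) {
        uint8_t r3 = src[(y + 2) * stride];   /* one new load per row */
        dst[y] = (int16_t)(f[0] * r0 + f[1] * r1 + f[2] * r2 + f[3] * r3);
        r0 = r1; r1 = r2; r2 = r3;            /* the vmv.v.v rotation */
    }
}
```

Compared with reloading all four rows per output line, this cuts the vertical loop to one `vle8.v` per row, which is why the v variants scale better than the h ones at large block sizes.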

Signed-off-by: Zhanheng Yang <zhanheng.yang@linux.alibaba.com>
---
 libavcodec/riscv/h26x/h2656dsp.h     |  11 ++
 libavcodec/riscv/h26x/hevcepel_rvv.S | 235 ++++++++++++++++++++++++++-
 libavcodec/riscv/hevcdsp_init.c      |   4 +
 3 files changed, 249 insertions(+), 1 deletion(-)

diff --git a/libavcodec/riscv/h26x/h2656dsp.h b/libavcodec/riscv/h26x/h2656dsp.h
index fa2f5a88e3..085ed4cf14 100644
--- a/libavcodec/riscv/h26x/h2656dsp.h
+++ b/libavcodec/riscv/h26x/h2656dsp.h
@@ -59,4 +59,15 @@ void ff_hevc_put_epel_uni_w_h_8_m1_rvv(uint8_t *_dst,  ptrdiff_t _dststride,
 void ff_hevc_put_epel_bi_h_8_m1_rvv(uint8_t *_dst, ptrdiff_t _dststride, const uint8_t *_src,
         ptrdiff_t _srcstride, const int16_t *src2, int height, intptr_t
         mx, intptr_t my, int width);
+void ff_hevc_put_epel_v_8_m1_rvv(int16_t *dst, const uint8_t *_src, ptrdiff_t _srcstride, int height,
+        intptr_t mx, intptr_t my, int width);
+void ff_hevc_put_epel_uni_v_8_m1_rvv(uint8_t *_dst, ptrdiff_t _dststride, const uint8_t *_src,
+        ptrdiff_t _srcstride, int height, intptr_t mx, intptr_t my, int width);
+void ff_hevc_put_epel_uni_w_v_8_m1_rvv(uint8_t *_dst,  ptrdiff_t _dststride,
+        const uint8_t *_src, ptrdiff_t _srcstride,
+        int height, int denom, int wx, int ox,
+        intptr_t mx, intptr_t my, int width);
+void ff_hevc_put_epel_bi_v_8_m1_rvv(uint8_t *_dst, ptrdiff_t _dststride, const uint8_t *_src,
+        ptrdiff_t _srcstride, const int16_t *src2, int height, intptr_t
+        mx, intptr_t my, int width);
 #endif
diff --git a/libavcodec/riscv/h26x/hevcepel_rvv.S b/libavcodec/riscv/h26x/hevcepel_rvv.S
index 81044846f7..caca0b88ab 100644
--- a/libavcodec/riscv/h26x/hevcepel_rvv.S
+++ b/libavcodec/riscv/h26x/hevcepel_rvv.S
@@ -262,4 +262,237 @@ func ff_hevc_put_epel_bi_h_8_\lmul\()_rvv, zve32x
 endfunc
 .endm
 
-hevc_epel_h m1, m2, m4
\ No newline at end of file
+hevc_epel_h m1, m2, m4
+
+/* output is unclipped; clobbers v4 */
+.macro filter_v         vdst, vsrc0, vsrc1, vsrc2, vsrc3
+        vmv.v.x          v4, s1
+        vwmulsu.vv       \vdst, v4, \vsrc0
+        vwmaccsu.vx      \vdst, s2, \vsrc1
+        vmv.v.v          \vsrc0, \vsrc1
+        vwmaccsu.vx      \vdst, s3, \vsrc2
+        vmv.v.v          \vsrc1, \vsrc2
+        vwmaccsu.vx      \vdst, s4, \vsrc3
+        vmv.v.v          \vsrc2, \vsrc3
+.endm
+
+.macro hevc_epel_v       lmul, lmul2, lmul4
+func ff_hevc_put_epel_v_8_\lmul\()_rvv, zve32x
+    addi        sp, sp, -32
+    sx          s1, 0(sp)
+    sx          s2, 8(sp)
+    sx          s3, 16(sp)
+    sx          s4, 24(sp)
+    load_filter a5
+    sub         a1, a1, a2      # src - src_stride
+    li          t1, 0           # offset
+    mv          t4, a3
+
+1:
+    add         t2, a1, t1
+    slli        t3, t1, 1
+    add         t3, a0, t3
+
+    vsetvli     t5, a6, e8, \lmul, ta, ma
+    vle8.v      v16, (t2)
+    add         t2, t2, a2
+    vle8.v      v18, (t2)
+    add         t2, t2, a2
+    vle8.v      v20, (t2)
+    add         t2, t2, a2
+
+2:
+    vsetvli     zero, zero, e8, \lmul, ta, ma
+    vle8.v      v22, (t2)
+    add         t2, t2, a2
+    filter_v    v0, v16, v18, v20, v22
+    vsetvli     zero, zero, e16, \lmul2, ta, ma
+    vse16.v     v0, (t3)
+    add         t3, t3, 2*HEVC_MAX_PB_SIZE
+    addi        a3, a3, -1
+    bgt         a3, zero, 2b
+    add         t1, t1, t5
+    sub         a6, a6, t5
+    mv          a3, t4
+    bgt         a6, zero, 1b
+
+    lx          s1, 0(sp)
+    lx          s2, 8(sp)
+    lx          s3, 16(sp)
+    lx          s4, 24(sp)
+    addi        sp, sp, 32
+    ret
+endfunc
+
+func ff_hevc_put_epel_uni_v_8_\lmul\()_rvv, zve32x
+    csrwi       vxrm, 0
+    addi        sp, sp, -32
+    sx          s1, 0(sp)
+    sx          s2, 8(sp)
+    sx          s3, 16(sp)
+    sx          s4, 24(sp)
+    load_filter a6
+    sub         a2, a2, a3      # src - src_stride
+    li          t1, 0           # offset
+    mv          t4, a4
+
+1:
+    add         t2, a2, t1
+    add         t3, a0, t1
+
+    vsetvli     t5, a7, e8, \lmul, ta, ma
+    vle8.v      v16, (t2)
+    add         t2, t2, a3
+    vle8.v      v18, (t2)
+    add         t2, t2, a3
+    vle8.v      v20, (t2)
+    add         t2, t2, a3
+
+2:
+    vsetvli     zero, zero, e8, \lmul, ta, ma
+    vle8.v      v22, (t2)
+    add         t2, t2, a3
+    filter_v    v0, v16, v18, v20, v22
+    vsetvli     zero, zero, e16, \lmul2, ta, ma
+    vmax.vx     v0, v0, zero
+    vsetvli     zero, zero, e8, \lmul, ta, ma
+    vnclipu.wi  v0, v0, 6
+    vse8.v      v0, (t3)
+    add         t3, t3, a1
+    addi        a4, a4, -1
+    bgt         a4, zero, 2b
+    add         t1, t1, t5
+    sub         a7, a7, t5
+    mv          a4, t4
+    bgt         a7, zero, 1b
+
+    lx          s1, 0(sp)
+    lx          s2, 8(sp)
+    lx          s3, 16(sp)
+    lx          s4, 24(sp)
+    addi        sp, sp, 32
+    ret
+endfunc
+
+func ff_hevc_put_epel_uni_w_v_8_\lmul\()_rvv, zve32x
+    csrwi       vxrm, 0
+#if (__riscv_xlen == 32)
+    lw          t1, 4(sp)       # my
+    lw          t6, 8(sp)       # width
+#elif (__riscv_xlen == 64)
+    ld          t1, 8(sp)
+    lw          t6, 16(sp)
+#endif
+    addi        sp, sp, -32
+    sx          s1, 0(sp)
+    sx          s2, 8(sp)
+    sx          s3, 16(sp)
+    sx          s4, 24(sp)
+    load_filter t1
+    addi        a5, a5, 6       # shift
+    sub         a2, a2, a3      # src - src_stride
+    li          t1, 0           # offset
+    mv          t4, a4
+
+1:
+    add         t2, a2, t1
+    add         t3, a0, t1
+
+    vsetvli     t5, t6, e8, \lmul, ta, ma
+    vle8.v      v16, (t2)
+    add         t2, t2, a3
+    vle8.v      v18, (t2)
+    add         t2, t2, a3
+    vle8.v      v20, (t2)
+    add         t2, t2, a3
+
+2:
+    vsetvli     zero, zero, e8, \lmul, ta, ma
+    vle8.v      v22, (t2)
+    add         t2, t2, a3
+    filter_v    v0, v16, v18, v20, v22
+    vsetvli     zero, zero, e16, \lmul2, ta, ma
+    vwmul.vx    v8, v0, a6
+    vsetvli     zero, zero, e32, \lmul4, ta, ma
+    vssra.vx    v0, v8, a5
+    vsadd.vx    v0, v0, a7
+    vmax.vx     v0, v0, zero
+    vsetvli     zero, zero, e16, \lmul2, ta, ma
+    vnclip.wi   v0, v0, 0
+    vsetvli     zero, zero, e8, \lmul, ta, ma
+    vnclipu.wi  v0, v0, 0
+    vse8.v      v0, (t3)
+    add         t3, t3, a1
+    addi        a4, a4, -1
+    bgt         a4, zero, 2b
+    add         t1, t1, t5
+    sub         t6, t6, t5
+    mv          a4, t4
+    bgt         t6, zero, 1b
+
+    lx          s1, 0(sp)
+    lx          s2, 8(sp)
+    lx          s3, 16(sp)
+    lx          s4, 24(sp)
+    addi        sp, sp, 32
+    ret
+endfunc
+
+func ff_hevc_put_epel_bi_v_8_\lmul\()_rvv, zve32x
+    csrwi       vxrm, 0
+    lw          t6, 0(sp)      # width
+    addi        sp, sp, -32
+    sx          s1, 0(sp)
+    sx          s2, 8(sp)
+    sx          s3, 16(sp)
+    sx          s4, 24(sp)
+    load_filter a7
+    sub         a2, a2, a3      # src - src_stride
+    li          t1, 0           # offset
+    mv          t4, a5
+
+1:
+    add         t2, a2, t1
+    add         t3, a0, t1
+    slli        t0, t1, 1
+    add         t0, a4, t0
+
+    vsetvli     t5, t6, e8, \lmul, ta, ma
+    vle8.v      v16, (t2)
+    add         t2, t2, a3
+    vle8.v      v18, (t2)
+    add         t2, t2, a3
+    vle8.v      v20, (t2)
+    add         t2, t2, a3
+
+2:
+    vsetvli     zero, zero, e8, \lmul, ta, ma
+    vle8.v      v22, (t2)
+    add         t2, t2, a3
+    filter_v    v0, v16, v18, v20, v22
+    vsetvli     zero, zero, e16, \lmul2, ta, ma
+    vle16.v     v8, (t0)
+    addi        t0, t0, 2*HEVC_MAX_PB_SIZE
+    vsadd.vv    v0, v0, v8
+    vmax.vx     v0, v0, zero
+    vsetvli     zero, zero, e8, \lmul, ta, ma
+    vnclipu.wi  v0, v0, 7
+    vse8.v      v0, (t3)
+    add         t3, t3, a1
+    addi        a5, a5, -1
+    bgt         a5, zero, 2b
+    add         t1, t1, t5
+    sub         t6, t6, t5
+    mv          a5, t4
+    bgt         t6, zero, 1b
+
+    lx          s1, 0(sp)
+    lx          s2, 8(sp)
+    lx          s3, 16(sp)
+    lx          s4, 24(sp)
+    addi        sp, sp, 32
+    ret
+endfunc
+.endm
+
+hevc_epel_v m1, m2, m4
diff --git a/libavcodec/riscv/hevcdsp_init.c b/libavcodec/riscv/hevcdsp_init.c
index 8608fdbd19..c7874996a8 100644
--- a/libavcodec/riscv/hevcdsp_init.c
+++ b/libavcodec/riscv/hevcdsp_init.c
@@ -94,6 +94,10 @@ void ff_hevc_dsp_init_riscv(HEVCDSPContext *c, const int bit_depth)
                 RVV_FNASSIGN_PEL(c->put_hevc_epel_uni, 0, 1, ff_hevc_put_epel_uni_h_8_m1_rvv);
                 RVV_FNASSIGN_PEL(c->put_hevc_epel_uni_w, 0, 1, ff_hevc_put_epel_uni_w_h_8_m1_rvv);
                 RVV_FNASSIGN_PEL(c->put_hevc_epel_bi, 0, 1, ff_hevc_put_epel_bi_h_8_m1_rvv);
+                RVV_FNASSIGN_PEL(c->put_hevc_epel, 1, 0, ff_hevc_put_epel_v_8_m1_rvv);
+                RVV_FNASSIGN_PEL(c->put_hevc_epel_uni, 1, 0, ff_hevc_put_epel_uni_v_8_m1_rvv);
+                RVV_FNASSIGN_PEL(c->put_hevc_epel_uni_w, 1, 0, ff_hevc_put_epel_uni_w_v_8_m1_rvv);
+                RVV_FNASSIGN_PEL(c->put_hevc_epel_bi, 1, 0, ff_hevc_put_epel_bi_v_8_m1_rvv);
                 break;
             default:
                 break;
-- 
2.25.1

_______________________________________________
ffmpeg-devel mailing list -- ffmpeg-devel@ffmpeg.org
To unsubscribe send an email to ffmpeg-devel-leave@ffmpeg.org

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [FFmpeg-devel] [PATCH 5/6] libavcodec/riscv: add RVV optimized for qpel_hv in HEVC.
  2026-01-22  4:23 [FFmpeg-devel] [PATCH 1/6] libavcodec/riscv: add RVV optimized for qpel_h in HEVC zhanheng.yang--- via ffmpeg-devel
                   ` (2 preceding siblings ...)
  2026-01-22  4:23 ` [FFmpeg-devel] [PATCH 4/6] libavcodec/riscv: add RVV optimized for epel_v " zhanheng.yang--- via ffmpeg-devel
@ 2026-01-22  4:23 ` zhanheng.yang--- via ffmpeg-devel
  2026-01-22  4:23 ` [FFmpeg-devel] [PATCH 6/6] libavcodec/riscv: add RVV optimized for epel_hv " zhanheng.yang--- via ffmpeg-devel
  4 siblings, 0 replies; 6+ messages in thread
From: zhanheng.yang--- via ffmpeg-devel @ 2026-01-22  4:23 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Zhanheng Yang

From: Zhanheng Yang <zhanheng.yang@linux.alibaba.com>

Benchmarked on an A210 C908 core (VLEN 128).
put_hevc_qpel_hv4_8_c:                                 865.6 ( 1.00x)
put_hevc_qpel_hv4_8_rvv_i32:                           501.8 ( 1.72x)
put_hevc_qpel_hv6_8_c:                                1602.9 ( 1.00x)
put_hevc_qpel_hv6_8_rvv_i32:                           635.4 ( 2.52x)
put_hevc_qpel_hv8_8_c:                                2571.2 ( 1.00x)
put_hevc_qpel_hv8_8_rvv_i32:                           774.1 ( 3.32x)
put_hevc_qpel_hv12_8_c:                               5366.3 ( 1.00x)
put_hevc_qpel_hv12_8_rvv_i32:                         1049.3 ( 5.11x)
put_hevc_qpel_hv16_8_c:                               8959.2 ( 1.00x)
put_hevc_qpel_hv16_8_rvv_i32:                         1328.1 ( 6.75x)
put_hevc_qpel_hv24_8_c:                              18969.7 ( 1.00x)
put_hevc_qpel_hv24_8_rvv_i32:                         3712.5 ( 5.11x)
put_hevc_qpel_hv32_8_c:                              32674.3 ( 1.00x)
put_hevc_qpel_hv32_8_rvv_i32:                         4806.7 ( 6.80x)
put_hevc_qpel_hv48_8_c:                              71309.9 ( 1.00x)
put_hevc_qpel_hv48_8_rvv_i32:                        10465.8 ( 6.81x)
put_hevc_qpel_hv64_8_c:                             124846.0 ( 1.00x)
put_hevc_qpel_hv64_8_rvv_i32:                        18306.5 ( 6.82x)
put_hevc_qpel_uni_hv4_8_c:                             920.4 ( 1.00x)
put_hevc_qpel_uni_hv4_8_rvv_i32:                       532.1 ( 1.73x)
put_hevc_qpel_uni_hv6_8_c:                            1753.0 ( 1.00x)
put_hevc_qpel_uni_hv6_8_rvv_i32:                       691.0 ( 2.54x)
put_hevc_qpel_uni_hv8_8_c:                            2872.7 ( 1.00x)
put_hevc_qpel_uni_hv8_8_rvv_i32:                       836.9 ( 3.43x)
put_hevc_qpel_uni_hv12_8_c:                           5828.4 ( 1.00x)
put_hevc_qpel_uni_hv12_8_rvv_i32:                     1141.2 ( 5.11x)
put_hevc_qpel_uni_hv16_8_c:                           9906.7 ( 1.00x)
put_hevc_qpel_uni_hv16_8_rvv_i32:                     1452.5 ( 6.82x)
put_hevc_qpel_uni_hv24_8_c:                          20871.3 ( 1.00x)
put_hevc_qpel_uni_hv24_8_rvv_i32:                     4094.0 ( 5.10x)
put_hevc_qpel_uni_hv32_8_c:                          36123.3 ( 1.00x)
put_hevc_qpel_uni_hv32_8_rvv_i32:                     5310.5 ( 6.80x)
put_hevc_qpel_uni_hv48_8_c:                          79016.0 ( 1.00x)
put_hevc_qpel_uni_hv48_8_rvv_i32:                    11591.2 ( 6.82x)
put_hevc_qpel_uni_hv64_8_c:                         138779.8 ( 1.00x)
put_hevc_qpel_uni_hv64_8_rvv_i32:                    20321.1 ( 6.83x)
put_hevc_qpel_uni_w_hv4_8_c:                           988.8 ( 1.00x)
put_hevc_qpel_uni_w_hv4_8_rvv_i32:                     580.3 ( 1.70x)
put_hevc_qpel_uni_w_hv6_8_c:                          1871.5 ( 1.00x)
put_hevc_qpel_uni_w_hv6_8_rvv_i32:                     751.7 ( 2.49x)
put_hevc_qpel_uni_w_hv8_8_c:                          3089.8 ( 1.00x)
put_hevc_qpel_uni_w_hv8_8_rvv_i32:                     923.7 ( 3.35x)
put_hevc_qpel_uni_w_hv12_8_c:                         6384.8 ( 1.00x)
put_hevc_qpel_uni_w_hv12_8_rvv_i32:                   1266.7 ( 5.04x)
put_hevc_qpel_uni_w_hv16_8_c:                        10844.7 ( 1.00x)
put_hevc_qpel_uni_w_hv16_8_rvv_i32:                   1612.2 ( 6.73x)
put_hevc_qpel_uni_w_hv24_8_c:                        23060.9 ( 1.00x)
put_hevc_qpel_uni_w_hv24_8_rvv_i32:                   4560.2 ( 5.06x)
put_hevc_qpel_uni_w_hv32_8_c:                        39977.0 ( 1.00x)
put_hevc_qpel_uni_w_hv32_8_rvv_i32:                   5927.0 ( 6.74x)
put_hevc_qpel_uni_w_hv48_8_c:                        87560.3 ( 1.00x)
put_hevc_qpel_uni_w_hv48_8_rvv_i32:                  12978.3 ( 6.75x)
put_hevc_qpel_uni_w_hv64_8_c:                       153980.5 ( 1.00x)
put_hevc_qpel_uni_w_hv64_8_rvv_i32:                  22823.0 ( 6.75x)
put_hevc_qpel_bi_hv4_8_c:                              938.5 ( 1.00x)
put_hevc_qpel_bi_hv4_8_rvv_i32:                        541.4 ( 1.73x)
put_hevc_qpel_bi_hv6_8_c:                             1760.1 ( 1.00x)
put_hevc_qpel_bi_hv6_8_rvv_i32:                        695.9 ( 2.53x)
put_hevc_qpel_bi_hv8_8_c:                             2924.3 ( 1.00x)
put_hevc_qpel_bi_hv8_8_rvv_i32:                        849.3 ( 3.44x)
put_hevc_qpel_bi_hv12_8_c:                            5992.7 ( 1.00x)
put_hevc_qpel_bi_hv12_8_rvv_i32:                      1157.5 ( 5.18x)
put_hevc_qpel_bi_hv16_8_c:                           10065.4 ( 1.00x)
put_hevc_qpel_bi_hv16_8_rvv_i32:                      1473.6 ( 6.83x)
put_hevc_qpel_bi_hv24_8_c:                           21450.2 ( 1.00x)
put_hevc_qpel_bi_hv24_8_rvv_i32:                      4151.3 ( 5.17x)
put_hevc_qpel_bi_hv32_8_c:                           37107.8 ( 1.00x)
put_hevc_qpel_bi_hv32_8_rvv_i32:                      5386.4 ( 6.89x)
put_hevc_qpel_bi_hv48_8_c:                           81401.7 ( 1.00x)
put_hevc_qpel_bi_hv48_8_rvv_i32:                     11761.7 ( 6.92x)
put_hevc_qpel_bi_hv64_8_c:                          143503.3 ( 1.00x)
put_hevc_qpel_bi_hv64_8_rvv_i32:                     20700.3 ( 6.93x)

Signed-off-by: Zhanheng Yang <zhanheng.yang@linux.alibaba.com>
---
 libavcodec/riscv/h26x/h2656dsp.h     |  11 +
 libavcodec/riscv/h26x/hevcqpel_rvv.S | 386 ++++++++++++++++++++++++++-
 libavcodec/riscv/hevcdsp_init.c      |   4 +
 3 files changed, 400 insertions(+), 1 deletion(-)
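
As background for the diff below, the separable 8-tap luma (QPEL) hv
"put" path it vectorizes can be sketched in scalar C as follows. This
is an illustrative model only (function name, the fixed 64-element
intermediate stride standing in for HEVC_MAX_PB_SIZE, and buffer sizes
are assumptions, not FFmpeg's actual C code); the coefficients are the
HEVC luma interpolation table for fractional positions 1-3.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* HEVC luma (QPEL) interpolation coefficients, fractional positions 1-3;
 * each row sums to 64. */
static const int8_t qpel_filters[3][8] = {
    { -1, 4, -10, 58, 17, -5, 1,  0 },
    { -1, 4, -11, 40, 40, -11, 4, -1 },
    {  0, 1,  -5, 17, 58, -10, 4, -1 },
};

/* Illustrative scalar model of the separable hv "put" filter:
 * horizontal pass over height+7 rows into unclipped 16-bit
 * intermediates (starting 3 rows above src, as in the asm), then a
 * vertical pass over the intermediates with a final >> 6. */
static void qpel_hv_put(int16_t *dst, const uint8_t *src,
                        ptrdiff_t srcstride, int height, int width,
                        int mx, int my)
{
    int16_t tmp_buf[(64 + 7) * 64];
    const int8_t *fh = qpel_filters[mx - 1];
    const int8_t *fv = qpel_filters[my - 1];
    int16_t *tmp = tmp_buf;

    src -= 3 * srcstride;                  /* src - 3 * src_stride */
    for (int y = 0; y < height + 7; y++) { /* horizontal pass */
        for (int x = 0; x < width; x++) {
            int sum = 0;
            for (int k = 0; k < 8; k++)
                sum += fh[k] * src[x + k - 3];
            tmp[x] = (int16_t)sum;
        }
        src += srcstride;
        tmp += 64;
    }
    tmp = tmp_buf + 3 * 64;                /* vertical pass */
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            int sum = 0;
            for (int k = 0; k < 8; k++)
                sum += fv[k] * tmp[x + (k - 3) * 64];
            dst[x] = (int16_t)(sum >> 6);
        }
        tmp += 64;
        dst += 64; /* 64 int16 elements, like the 2*HEVC_MAX_PB_SIZE byte stride */
    }
}
```

With a flat source of value 100, the horizontal pass yields 6400
everywhere and the vertical pass yields (6400 * 64) >> 6 = 6400 again,
so both passes can be sanity-checked at once.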

diff --git a/libavcodec/riscv/h26x/h2656dsp.h b/libavcodec/riscv/h26x/h2656dsp.h
index 085ed4cf14..7e320bd795 100644
--- a/libavcodec/riscv/h26x/h2656dsp.h
+++ b/libavcodec/riscv/h26x/h2656dsp.h
@@ -47,6 +47,17 @@ void ff_hevc_put_qpel_uni_w_v_8_m1_rvv(uint8_t *_dst,  ptrdiff_t _dststride,
 void ff_hevc_put_qpel_bi_v_8_m1_rvv(uint8_t *_dst, ptrdiff_t _dststride, const uint8_t *_src,
         ptrdiff_t _srcstride, const int16_t *src2, int height, intptr_t
         mx, intptr_t my, int width);
+void ff_hevc_put_qpel_hv_8_m1_rvv(int16_t *dst, const uint8_t *_src, ptrdiff_t _srcstride, int height,
+        intptr_t mx, intptr_t my, int width);
+void ff_hevc_put_qpel_uni_hv_8_m1_rvv(uint8_t *_dst, ptrdiff_t _dststride, const uint8_t *_src,
+        ptrdiff_t _srcstride, int height, intptr_t mx, intptr_t my, int width);
+void ff_hevc_put_qpel_uni_w_hv_8_m1_rvv(uint8_t *_dst,  ptrdiff_t _dststride,
+        const uint8_t *_src, ptrdiff_t _srcstride,
+        int height, int denom, int wx, int ox,
+        intptr_t mx, intptr_t my, int width);
+void ff_hevc_put_qpel_bi_hv_8_m1_rvv(uint8_t *_dst, ptrdiff_t _dststride, const uint8_t *_src,
+        ptrdiff_t _srcstride, const int16_t *src2, int height, intptr_t
+        mx, intptr_t my, int width);
 
 void ff_hevc_put_epel_h_8_m1_rvv(int16_t *dst, const uint8_t *_src, ptrdiff_t _srcstride, int height,
         intptr_t mx, intptr_t my, int width);
diff --git a/libavcodec/riscv/h26x/hevcqpel_rvv.S b/libavcodec/riscv/h26x/hevcqpel_rvv.S
index 8fd3c47bcc..ed7fa8fe00 100644
--- a/libavcodec/riscv/h26x/hevcqpel_rvv.S
+++ b/libavcodec/riscv/h26x/hevcqpel_rvv.S
@@ -619,4 +619,388 @@ func ff_hevc_put_qpel_bi_v_8_\lmul\()_rvv, zve32x
 endfunc
 .endm
 
-hevc_qpel_v m1, m2, m4
\ No newline at end of file
+hevc_qpel_v m1, m2, m4
+
+/* output is the unclipped widened sum; clobbers t4 */
+.macro filter_v_s         vdst, vsrc0, vsrc1, vsrc2, vsrc3, vsrc4, vsrc5, vsrc6, vsrc7, vf
+        vwmul.vx       \vdst, \vsrc0, s0
+        vwmacc.vx      \vdst, s9, \vsrc1
+        vmv.v.v        \vsrc0, \vsrc1
+        vwmacc.vx      \vdst, s10, \vsrc2
+        vmv.v.v        \vsrc1, \vsrc2
+        vwmacc.vx      \vdst, s11, \vsrc3
+        vmv.v.v        \vsrc2, \vsrc3
+        lb             t4, 4(\vf)
+        vwmacc.vx      \vdst, t4, \vsrc4
+        lb             t4, 5(\vf)
+        vmv.v.v        \vsrc3, \vsrc4
+        vwmacc.vx      \vdst, t4, \vsrc5
+        lb             t4, 6(\vf)
+        vmv.v.v        \vsrc4, \vsrc5
+        vwmacc.vx      \vdst, t4, \vsrc6
+        lb             t4, 7(\vf)
+        vmv.v.v        \vsrc5, \vsrc6
+        vwmacc.vx      \vdst, t4, \vsrc7
+        vmv.v.v        \vsrc6, \vsrc7
+.endm
+
+/* loads the first four coefficients into s0/s9-s11 and returns the filter
+ * pointer in \m; clobbers t0 and t1 (not enough scalar registers to hold
+ * all eight coefficients) */
+.macro load_filter2 m
+        la          t0, qpel_filters
+        slli        t1, \m, 3
+        add         t0, t0, t1
+        lb          s0, 0(t0)
+        lb          s9, 1(t0)
+        lb          s10, 2(t0)
+        lb          s11, 3(t0)
+        mv          \m, t0
+.endm
+
+.macro hevc_qpel_hv       lmul, lmul2, lmul4
+func ff_hevc_put_qpel_hv_8_\lmul\()_rvv, zve32x
+    csrwi       vxrm, 2
+    addi        sp, sp, -96
+    sx          s0, 0(sp)
+    sx          s1, 8(sp)
+    sx          s2, 16(sp)
+    sx          s3, 24(sp)
+    sx          s4, 32(sp)
+    sx          s5, 40(sp)
+    sx          s6, 48(sp)
+    sx          s7, 56(sp)
+    sx          s8, 64(sp)
+    sx          s9, 72(sp)
+    sx          s10, 80(sp)
+    sx          s11, 88(sp)
+    load_filter  a4
+    load_filter2 a5
+    slli        t1, a2, 1
+    add         t1, t1, a2
+    sub         a1, a1, t1     # src - 3 * src_stride
+    mv          t0, a3
+    li          t1, 0          # offset
+
+1:
+    add         t2, a1, t1
+    slli        t3, t1, 1
+    add         t3, a0, t3
+
+    vsetvli     t6, a6, e8, \lmul, ta, ma
+    filter_h    v4, v24, v25, v26, v27, v28, v29, v30, v31, t2
+    add         t2, t2, a2
+    filter_h    v6, v24, v25, v26, v27, v28, v29, v30, v31, t2
+    add         t2, t2, a2
+    filter_h    v8, v24, v25, v26, v27, v28, v29, v30, v31, t2
+    add         t2, t2, a2
+    filter_h    v10, v24, v25, v26, v27, v28, v29, v30, v31, t2
+    add         t2, t2, a2
+    filter_h    v12, v24, v25, v26, v27, v28, v29, v30, v31, t2
+    add         t2, t2, a2
+    filter_h    v14, v24, v25, v26, v27, v28, v29, v30, v31, t2
+    add         t2, t2, a2
+    filter_h    v16, v24, v25, v26, v27, v28, v29, v30, v31, t2
+    add         t2, t2, a2
+
+2:
+    vsetvli     zero, zero, e8, \lmul, ta, ma
+    filter_h    v18, v24, v25, v26, v27, v28, v29, v30, v31, t2
+    add         t2, t2, a2
+
+    vsetvli     zero, zero, e16, \lmul2, ta, ma
+    filter_v_s  v0, v4, v6, v8, v10, v12, v14, v16, v18, a5
+    vnclip.wi   v0, v0, 6
+    vse16.v     v0, (t3)
+    addi        a3, a3, -1
+    addi        t3, t3, 2*HEVC_MAX_PB_SIZE
+    bgt         a3, zero, 2b
+    mv          a3, t0
+    add         t1, t1, t6
+    sub         a6, a6, t6
+    bgt         a6, zero, 1b
+
+    lx          s0, 0(sp)
+    lx          s1, 8(sp)
+    lx          s2, 16(sp)
+    lx          s3, 24(sp)
+    lx          s4, 32(sp)
+    lx          s5, 40(sp)
+    lx          s6, 48(sp)
+    lx          s7, 56(sp)
+    lx          s8, 64(sp)
+    lx          s9, 72(sp)
+    lx          s10, 80(sp)
+    lx          s11, 88(sp)
+    addi        sp, sp, 96
+    ret
+endfunc
+
+func ff_hevc_put_qpel_uni_hv_8_\lmul\()_rvv, zve32x
+    csrwi       vxrm, 0
+    addi        sp, sp, -96
+    sx          s0, 0(sp)
+    sx          s1, 8(sp)
+    sx          s2, 16(sp)
+    sx          s3, 24(sp)
+    sx          s4, 32(sp)
+    sx          s5, 40(sp)
+    sx          s6, 48(sp)
+    sx          s7, 56(sp)
+    sx          s8, 64(sp)
+    sx          s9, 72(sp)
+    sx          s10, 80(sp)
+    sx          s11, 88(sp)
+    load_filter  a5
+    load_filter2 a6
+    slli        t1, a3, 1
+    add         t1, t1, a3
+    sub         a2, a2, t1     # src - 3 * src_stride
+    mv          t0, a4
+    li          t1, 0          # offset
+
+1:
+    add         t2, a2, t1
+    add         t3, a0, t1
+
+    vsetvli     t6, a7, e8, \lmul, ta, ma
+    filter_h    v4, v24, v25, v26, v27, v28, v29, v30, v31, t2
+    add         t2, t2, a3
+    filter_h    v6, v24, v25, v26, v27, v28, v29, v30, v31, t2
+    add         t2, t2, a3
+    filter_h    v8, v24, v25, v26, v27, v28, v29, v30, v31, t2
+    add         t2, t2, a3
+    filter_h    v10, v24, v25, v26, v27, v28, v29, v30, v31, t2
+    add         t2, t2, a3
+    filter_h    v12, v24, v25, v26, v27, v28, v29, v30, v31, t2
+    add         t2, t2, a3
+    filter_h    v14, v24, v25, v26, v27, v28, v29, v30, v31, t2
+    add         t2, t2, a3
+    filter_h    v16, v24, v25, v26, v27, v28, v29, v30, v31, t2
+    add         t2, t2, a3
+
+2:
+    vsetvli     zero, zero, e8, \lmul, ta, ma
+    filter_h    v18, v24, v25, v26, v27, v28, v29, v30, v31, t2
+    add         t2, t2, a3
+
+    vsetvli     zero, zero, e16, \lmul2, ta, ma
+    filter_v_s  v0, v4, v6, v8, v10, v12, v14, v16, v18, a6
+    vsetvli     zero, zero, e32, \lmul4, ta, ma
+    vsra.vi     v0, v0, 6
+    vmax.vx     v0, v0, zero
+    vsetvli     zero, zero, e16, \lmul2, ta, ma
+    vnclipu.wi  v0, v0, 6
+    vsetvli     zero, zero, e8, \lmul, ta, ma
+    vnclipu.wi  v0, v0, 0
+    vse8.v      v0, (t3)
+    addi        a4, a4, -1
+    add         t3, t3, a1
+    bgt         a4, zero, 2b
+    mv          a4, t0
+    add         t1, t1, t6
+    sub         a7, a7, t6
+    bgt         a7, zero, 1b
+
+    lx          s0, 0(sp)
+    lx          s1, 8(sp)
+    lx          s2, 16(sp)
+    lx          s3, 24(sp)
+    lx          s4, 32(sp)
+    lx          s5, 40(sp)
+    lx          s6, 48(sp)
+    lx          s7, 56(sp)
+    lx          s8, 64(sp)
+    lx          s9, 72(sp)
+    lx          s10, 80(sp)
+    lx          s11, 88(sp)
+    addi        sp, sp, 96
+    ret
+endfunc
+
+func ff_hevc_put_qpel_uni_w_hv_8_\lmul\()_rvv, zve32x
+    csrwi       vxrm, 0
+    lx          t2, 0(sp)       # mx
+#if (__riscv_xlen == 32)
+    lw          t4, 4(sp)       # my
+    lw          t5, 8(sp)       # width
+#elif (__riscv_xlen == 64)
+    ld          t4, 8(sp)
+    lw          t5, 16(sp)
+#endif
+    addi        a5, a5, 6       # shift
+    addi        sp, sp, -104
+    sx          s0, 0(sp)
+    sx          s1, 8(sp)
+    sx          s2, 16(sp)
+    sx          s3, 24(sp)
+    sx          s4, 32(sp)
+    sx          s5, 40(sp)
+    sx          s6, 48(sp)
+    sx          s7, 56(sp)
+    sx          s8, 64(sp)
+    sx          s9, 72(sp)
+    sx          s10, 80(sp)
+    sx          s11, 88(sp)
+    sx          ra, 96(sp)
+    mv          ra, t4
+    load_filter  t2
+    load_filter2 ra
+    slli        t1, a3, 1
+    add         t1, t1, a3
+    sub         a2, a2, t1     # src - 3 * src_stride
+    mv          t0, a4
+    li          t1, 0          # offset
+
+1:
+    add         t2, a2, t1
+    add         t3, a0, t1
+
+    vsetvli     t6, t5, e8, \lmul, ta, ma
+    filter_h    v4, v24, v25, v26, v27, v28, v29, v30, v31, t2
+    add         t2, t2, a3
+    filter_h    v6, v24, v25, v26, v27, v28, v29, v30, v31, t2
+    add         t2, t2, a3
+    filter_h    v8, v24, v25, v26, v27, v28, v29, v30, v31, t2
+    add         t2, t2, a3
+    filter_h    v10, v24, v25, v26, v27, v28, v29, v30, v31, t2
+    add         t2, t2, a3
+    filter_h    v12, v24, v25, v26, v27, v28, v29, v30, v31, t2
+    add         t2, t2, a3
+    filter_h    v14, v24, v25, v26, v27, v28, v29, v30, v31, t2
+    add         t2, t2, a3
+    filter_h    v16, v24, v25, v26, v27, v28, v29, v30, v31, t2
+    add         t2, t2, a3
+
+2:
+    vsetvli     zero, zero, e8, \lmul, ta, ma
+    filter_h    v18, v24, v25, v26, v27, v28, v29, v30, v31, t2
+    add         t2, t2, a3
+
+    vsetvli     zero, zero, e16, \lmul2, ta, ma
+    filter_v_s  v0, v4, v6, v8, v10, v12, v14, v16, v18, ra
+    vsetvli     zero, zero, e32, \lmul4, ta, ma
+    vsra.vi     v0, v0, 6
+    vmul.vx     v0, v0, a6
+    vssra.vx    v0, v0, a5
+    vsadd.vx    v0, v0, a7
+    vmax.vx     v0, v0, zero
+    vsetvli     zero, zero, e16, \lmul2, ta, ma
+    vnclip.wi   v0, v0, 0
+    vsetvli     zero, zero, e8, \lmul, ta, ma
+    vnclipu.wi  v0, v0, 0
+    vse8.v      v0, (t3)
+    addi        a4, a4, -1
+    add         t3, t3, a1
+    bgt         a4, zero, 2b
+    mv          a4, t0
+    add         t1, t1, t6
+    sub         t5, t5, t6
+    bgt         t5, zero, 1b
+
+    lx          s0, 0(sp)
+    lx          s1, 8(sp)
+    lx          s2, 16(sp)
+    lx          s3, 24(sp)
+    lx          s4, 32(sp)
+    lx          s5, 40(sp)
+    lx          s6, 48(sp)
+    lx          s7, 56(sp)
+    lx          s8, 64(sp)
+    lx          s9, 72(sp)
+    lx          s10, 80(sp)
+    lx          s11, 88(sp)
+    lx          ra, 96(sp)
+    addi        sp, sp, 104
+    ret
+endfunc
+
+func ff_hevc_put_qpel_bi_hv_8_\lmul\()_rvv, zve32x
+    csrwi       vxrm, 0
+    lw          t3, 0(sp)      # width
+    addi        sp, sp, -96
+    sx          s0, 0(sp)
+    sx          s1, 8(sp)
+    sx          s2, 16(sp)
+    sx          s3, 24(sp)
+    sx          s4, 32(sp)
+    sx          s5, 40(sp)
+    sx          s6, 48(sp)
+    sx          s7, 56(sp)
+    sx          s8, 64(sp)
+    sx          s9, 72(sp)
+    sx          s10, 80(sp)
+    sx          s11, 88(sp)
+    load_filter  a6
+    load_filter2 a7
+    mv          a6, t3
+    slli        t1, a3, 1
+    add         t1, t1, a3
+    sub         a2, a2, t1     # src - 3 * src_stride
+    mv          t0, a5
+    li          t1, 0          # offset
+
+1:
+    add         t2, a2, t1
+    add         t3, a0, t1
+    slli        t5, t1, 1
+    add         t5, a4, t5
+
+    vsetvli     t6, a6, e8, \lmul, ta, ma
+    filter_h    v4, v24, v25, v26, v27, v28, v29, v30, v31, t2
+    add         t2, t2, a3
+    filter_h    v6, v24, v25, v26, v27, v28, v29, v30, v31, t2
+    add         t2, t2, a3
+    filter_h    v8, v24, v25, v26, v27, v28, v29, v30, v31, t2
+    add         t2, t2, a3
+    filter_h    v10, v24, v25, v26, v27, v28, v29, v30, v31, t2
+    add         t2, t2, a3
+    filter_h    v12, v24, v25, v26, v27, v28, v29, v30, v31, t2
+    add         t2, t2, a3
+    filter_h    v14, v24, v25, v26, v27, v28, v29, v30, v31, t2
+    add         t2, t2, a3
+    filter_h    v16, v24, v25, v26, v27, v28, v29, v30, v31, t2
+    add         t2, t2, a3
+
+2:
+    vsetvli     zero, zero, e8, \lmul, ta, ma
+    filter_h    v18, v24, v25, v26, v27, v28, v29, v30, v31, t2
+    add         t2, t2, a3
+
+    vsetvli     zero, zero, e16, \lmul2, ta, ma
+    vle16.v     v24, (t5)
+    addi        t5, t5, 2*HEVC_MAX_PB_SIZE
+    filter_v_s  v0, v4, v6, v8, v10, v12, v14, v16, v18, a7
+    vsetvli     zero, zero, e32, \lmul4, ta, ma
+    vsra.vi     v0, v0, 6
+    vsetvli     zero, zero, e16, \lmul2, ta, ma
+    vwadd.wv    v0, v0, v24
+    vnclip.wi   v0, v0, 7
+    vmax.vx     v0, v0, zero
+    vsetvli     zero, zero, e8, \lmul, ta, ma
+    vnclipu.wi  v0, v0, 0
+    vse8.v      v0, (t3)
+    addi        a5, a5, -1
+    add         t3, t3, a1
+    bgt         a5, zero, 2b
+    mv          a5, t0
+    add         t1, t1, t6
+    sub         a6, a6, t6
+    bgt         a6, zero, 1b
+
+    lx          s0, 0(sp)
+    lx          s1, 8(sp)
+    lx          s2, 16(sp)
+    lx          s3, 24(sp)
+    lx          s4, 32(sp)
+    lx          s5, 40(sp)
+    lx          s6, 48(sp)
+    lx          s7, 56(sp)
+    lx          s8, 64(sp)
+    lx          s9, 72(sp)
+    lx          s10, 80(sp)
+    lx          s11, 88(sp)
+    addi        sp, sp, 96
+    ret
+endfunc
+.endm
+
+hevc_qpel_hv m1, m2, m4
diff --git a/libavcodec/riscv/hevcdsp_init.c b/libavcodec/riscv/hevcdsp_init.c
index c7874996a8..53c800626f 100644
--- a/libavcodec/riscv/hevcdsp_init.c
+++ b/libavcodec/riscv/hevcdsp_init.c
@@ -89,6 +89,10 @@ void ff_hevc_dsp_init_riscv(HEVCDSPContext *c, const int bit_depth)
                 RVV_FNASSIGN_PEL(c->put_hevc_qpel_uni, 1, 0, ff_hevc_put_qpel_uni_v_8_m1_rvv);
                 RVV_FNASSIGN_PEL(c->put_hevc_qpel_uni_w, 1, 0, ff_hevc_put_qpel_uni_w_v_8_m1_rvv);
                 RVV_FNASSIGN_PEL(c->put_hevc_qpel_bi, 1, 0, ff_hevc_put_qpel_bi_v_8_m1_rvv);
+                RVV_FNASSIGN_PEL(c->put_hevc_qpel, 1, 1, ff_hevc_put_qpel_hv_8_m1_rvv);
+                RVV_FNASSIGN_PEL(c->put_hevc_qpel_uni, 1, 1, ff_hevc_put_qpel_uni_hv_8_m1_rvv);
+                RVV_FNASSIGN_PEL(c->put_hevc_qpel_uni_w, 1, 1, ff_hevc_put_qpel_uni_w_hv_8_m1_rvv);
+                RVV_FNASSIGN_PEL(c->put_hevc_qpel_bi, 1, 1, ff_hevc_put_qpel_bi_hv_8_m1_rvv);
 
                 RVV_FNASSIGN_PEL(c->put_hevc_epel, 0, 1, ff_hevc_put_epel_h_8_m1_rvv);
                 RVV_FNASSIGN_PEL(c->put_hevc_epel_uni, 0, 1, ff_hevc_put_epel_uni_h_8_m1_rvv);
-- 
2.25.1

_______________________________________________
ffmpeg-devel mailing list -- ffmpeg-devel@ffmpeg.org
To unsubscribe send an email to ffmpeg-devel-leave@ffmpeg.org

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [FFmpeg-devel] [PATCH 6/6] libavcodec/riscv: add RVV optimized for epel_hv in HEVC.
  2026-01-22  4:23 [FFmpeg-devel] [PATCH 1/6] libavcodec/riscv: add RVV optimized for qpel_h in HEVC zhanheng.yang--- via ffmpeg-devel
                   ` (3 preceding siblings ...)
  2026-01-22  4:23 ` [FFmpeg-devel] [PATCH 5/6] libavcodec/riscv: add RVV optimized for qpel_hv " zhanheng.yang--- via ffmpeg-devel
@ 2026-01-22  4:23 ` zhanheng.yang--- via ffmpeg-devel
  4 siblings, 0 replies; 6+ messages in thread
From: zhanheng.yang--- via ffmpeg-devel @ 2026-01-22  4:23 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Zhanheng Yang

From: Zhanheng Yang <zhanheng.yang@linux.alibaba.com>

Benchmarked on an A210 C908 core (VLEN 128).
put_hevc_epel_hv4_8_c:                                 390.0 ( 1.00x)
put_hevc_epel_hv4_8_rvv_i32:                           213.0 ( 1.83x)
put_hevc_epel_hv6_8_c:                                 749.8 ( 1.00x)
put_hevc_epel_hv6_8_rvv_i32:                           290.8 ( 2.58x)
put_hevc_epel_hv8_8_c:                                1215.5 ( 1.00x)
put_hevc_epel_hv8_8_rvv_i32:                           360.7 ( 3.37x)
put_hevc_epel_hv12_8_c:                               2602.5 ( 1.00x)
put_hevc_epel_hv12_8_rvv_i32:                          515.4 ( 5.05x)
put_hevc_epel_hv16_8_c:                               4417.0 ( 1.00x)
put_hevc_epel_hv16_8_rvv_i32:                          661.8 ( 6.67x)
put_hevc_epel_hv24_8_c:                               9524.8 ( 1.00x)
put_hevc_epel_hv24_8_rvv_i32:                         1909.2 ( 4.99x)
put_hevc_epel_hv32_8_c:                              16589.1 ( 1.00x)
put_hevc_epel_hv32_8_rvv_i32:                         2508.0 ( 6.61x)
put_hevc_epel_hv48_8_c:                              37145.4 ( 1.00x)
put_hevc_epel_hv48_8_rvv_i32:                         5526.8 ( 6.72x)
put_hevc_epel_hv64_8_c:                              65015.9 ( 1.00x)
put_hevc_epel_hv64_8_rvv_i32:                         9751.9 ( 6.67x)
put_hevc_epel_uni_hv4_8_c:                             434.8 ( 1.00x)
put_hevc_epel_uni_hv4_8_rvv_i32:                       238.8 ( 1.82x)
put_hevc_epel_uni_hv6_8_c:                             856.8 ( 1.00x)
put_hevc_epel_uni_hv6_8_rvv_i32:                       329.6 ( 2.60x)
put_hevc_epel_uni_hv8_8_c:                            1474.2 ( 1.00x)
put_hevc_epel_uni_hv8_8_rvv_i32:                       412.9 ( 3.57x)
put_hevc_epel_uni_hv12_8_c:                           2995.9 ( 1.00x)
put_hevc_epel_uni_hv12_8_rvv_i32:                      593.9 ( 5.04x)
put_hevc_epel_uni_hv16_8_c:                           5128.2 ( 1.00x)
put_hevc_epel_uni_hv16_8_rvv_i32:                      770.6 ( 6.66x)
put_hevc_epel_uni_hv24_8_c:                          11159.5 ( 1.00x)
put_hevc_epel_uni_hv24_8_rvv_i32:                     2223.1 ( 5.02x)
put_hevc_epel_uni_hv32_8_c:                          19462.3 ( 1.00x)
put_hevc_epel_uni_hv32_8_rvv_i32:                     2925.1 ( 6.65x)
put_hevc_epel_uni_hv48_8_c:                          43480.5 ( 1.00x)
put_hevc_epel_uni_hv48_8_rvv_i32:                     6476.7 ( 6.71x)
put_hevc_epel_uni_hv64_8_c:                          76411.2 ( 1.00x)
put_hevc_epel_uni_hv64_8_rvv_i32:                    11456.7 ( 6.67x)
put_hevc_epel_uni_w_hv4_8_c:                           557.8 ( 1.00x)
put_hevc_epel_uni_w_hv4_8_rvv_i32:                     287.9 ( 1.94x)
put_hevc_epel_uni_w_hv6_8_c:                          1068.0 ( 1.00x)
put_hevc_epel_uni_w_hv6_8_rvv_i32:                     399.4 ( 2.67x)
put_hevc_epel_uni_w_hv8_8_c:                          1835.2 ( 1.00x)
put_hevc_epel_uni_w_hv8_8_rvv_i32:                     507.3 ( 3.62x)
put_hevc_epel_uni_w_hv12_8_c:                         3758.9 ( 1.00x)
put_hevc_epel_uni_w_hv12_8_rvv_i32:                    729.2 ( 5.15x)
put_hevc_epel_uni_w_hv16_8_c:                         6524.5 ( 1.00x)
put_hevc_epel_uni_w_hv16_8_rvv_i32:                    954.7 ( 6.83x)
put_hevc_epel_uni_w_hv24_8_c:                        14094.2 ( 1.00x)
put_hevc_epel_uni_w_hv24_8_rvv_i32:                   2764.9 ( 5.10x)
put_hevc_epel_uni_w_hv32_8_c:                        24887.0 ( 1.00x)
put_hevc_epel_uni_w_hv32_8_rvv_i32:                   3640.5 ( 6.84x)
put_hevc_epel_uni_w_hv48_8_c:                        55341.0 ( 1.00x)
put_hevc_epel_uni_w_hv48_8_rvv_i32:                   8083.8 ( 6.85x)
put_hevc_epel_uni_w_hv64_8_c:                        97377.8 ( 1.00x)
put_hevc_epel_uni_w_hv64_8_rvv_i32:                  14322.9 ( 6.80x)
put_hevc_epel_bi_hv4_8_c:                              472.2 ( 1.00x)
put_hevc_epel_bi_hv4_8_rvv_i32:                        250.0 ( 1.89x)
put_hevc_epel_bi_hv6_8_c:                              903.1 ( 1.00x)
put_hevc_epel_bi_hv6_8_rvv_i32:                        341.3 ( 2.65x)
put_hevc_epel_bi_hv8_8_c:                             1583.5 ( 1.00x)
put_hevc_epel_bi_hv8_8_rvv_i32:                        433.1 ( 3.66x)
put_hevc_epel_bi_hv12_8_c:                            3205.8 ( 1.00x)
put_hevc_epel_bi_hv12_8_rvv_i32:                       615.0 ( 5.21x)
put_hevc_epel_bi_hv16_8_c:                            5504.1 ( 1.00x)
put_hevc_epel_bi_hv16_8_rvv_i32:                       800.3 ( 6.88x)
put_hevc_epel_bi_hv24_8_c:                           11897.2 ( 1.00x)
put_hevc_epel_bi_hv24_8_rvv_i32:                      2309.9 ( 5.15x)
put_hevc_epel_bi_hv32_8_c:                           20823.8 ( 1.00x)
put_hevc_epel_bi_hv32_8_rvv_i32:                      3031.2 ( 6.87x)
put_hevc_epel_bi_hv48_8_c:                           46854.5 ( 1.00x)
put_hevc_epel_bi_hv48_8_rvv_i32:                      6713.2 ( 6.98x)
put_hevc_epel_bi_hv64_8_c:                           82399.2 ( 1.00x)
put_hevc_epel_bi_hv64_8_rvv_i32:                     11901.4 ( 6.92x)

Signed-off-by: Zhanheng Yang <zhanheng.yang@linux.alibaba.com>
---
 libavcodec/riscv/h26x/h2656dsp.h     |  11 +
 libavcodec/riscv/h26x/hevcepel_rvv.S | 325 +++++++++++++++++++++++++--
 libavcodec/riscv/hevcdsp_init.c      |   4 +
 3 files changed, 325 insertions(+), 15 deletions(-)

diff --git a/libavcodec/riscv/h26x/h2656dsp.h b/libavcodec/riscv/h26x/h2656dsp.h
index 7e320bd795..b8a116bdf7 100644
--- a/libavcodec/riscv/h26x/h2656dsp.h
+++ b/libavcodec/riscv/h26x/h2656dsp.h
@@ -81,4 +81,15 @@ void ff_hevc_put_epel_uni_w_v_8_m1_rvv(uint8_t *_dst,  ptrdiff_t _dststride,
 void ff_hevc_put_epel_bi_v_8_m1_rvv(uint8_t *_dst, ptrdiff_t _dststride, const uint8_t *_src,
         ptrdiff_t _srcstride, const int16_t *src2, int height, intptr_t
         mx, intptr_t my, int width);
+void ff_hevc_put_epel_hv_8_m1_rvv(int16_t *dst, const uint8_t *_src, ptrdiff_t _srcstride, int height,
+        intptr_t mx, intptr_t my, int width);
+void ff_hevc_put_epel_uni_hv_8_m1_rvv(uint8_t *_dst, ptrdiff_t _dststride, const uint8_t *_src,
+        ptrdiff_t _srcstride, int height, intptr_t mx, intptr_t my, int width);
+void ff_hevc_put_epel_uni_w_hv_8_m1_rvv(uint8_t *_dst,  ptrdiff_t _dststride,
+        const uint8_t *_src, ptrdiff_t _srcstride,
+        int height, int denom, int wx, int ox,
+        intptr_t mx, intptr_t my, int width);
+void ff_hevc_put_epel_bi_hv_8_m1_rvv(uint8_t *_dst, ptrdiff_t _dststride, const uint8_t *_src,
+        ptrdiff_t _srcstride, const int16_t *src2, int height, intptr_t
+        mx, intptr_t my, int width);
 #endif
diff --git a/libavcodec/riscv/h26x/hevcepel_rvv.S b/libavcodec/riscv/h26x/hevcepel_rvv.S
index caca0b88ab..7a4a3f3318 100644
--- a/libavcodec/riscv/h26x/hevcepel_rvv.S
+++ b/libavcodec/riscv/h26x/hevcepel_rvv.S
@@ -285,8 +285,8 @@ func ff_hevc_put_epel_v_8_\lmul\()_rvv, zve32x
     sx          s4, 24(sp)
     load_filter a5
     sub         a1, a1, a2      # src - src_stride
-    li          t1, 0           # offset   
-    mv          t4, a3 
+    li          t1, 0           # offset
+    mv          t4, a3
 
 1:
     add         t2, a1, t1
@@ -310,7 +310,7 @@ func ff_hevc_put_epel_v_8_\lmul\()_rvv, zve32x
     vse16.v     v0, (t3)
     add         t3, t3, 2*HEVC_MAX_PB_SIZE
     addi        a3, a3, -1
-    bgt         a3, zero, 2b    
+    bgt         a3, zero, 2b
     add         t1, t1, t5
     sub         a6, a6, t5
     mv          a3, t4
@@ -325,7 +325,7 @@ func ff_hevc_put_epel_v_8_\lmul\()_rvv, zve32x
 endfunc
 
 func ff_hevc_put_epel_uni_v_8_\lmul\()_rvv, zve32x
-    csrwi       vxrm, 0 
+    csrwi       vxrm, 0
     addi        sp, sp, -32
     sx          s1, 0(sp)
     sx          s2, 8(sp)
@@ -333,8 +333,8 @@ func ff_hevc_put_epel_uni_v_8_\lmul\()_rvv, zve32x
     sx          s4, 24(sp)
     load_filter a6
     sub         a2, a2, a3      # src - src_stride
-    li          t1, 0           # offset   
-    mv          t4, a4 
+    li          t1, 0           # offset
+    mv          t4, a4
 
 1:
     add         t2, a2, t1
@@ -360,7 +360,7 @@ func ff_hevc_put_epel_uni_v_8_\lmul\()_rvv, zve32x
     vse8.v      v0, (t3)
     add         t3, t3, a1
     addi        a4, a4, -1
-    bgt         a4, zero, 2b    
+    bgt         a4, zero, 2b
     add         t1, t1, t5
     sub         a7, a7, t5
     mv          a4, t4
@@ -375,7 +375,7 @@ func ff_hevc_put_epel_uni_v_8_\lmul\()_rvv, zve32x
 endfunc
 
 func ff_hevc_put_epel_uni_w_v_8_\lmul\()_rvv, zve32x
-    csrwi       vxrm, 0 
+    csrwi       vxrm, 0
 #if (__riscv_xlen == 32)
     lw          t1, 4(sp)       # my
     lw          t6, 8(sp)       # width
@@ -391,8 +391,8 @@ func ff_hevc_put_epel_uni_w_v_8_\lmul\()_rvv, zve32x
     load_filter t1
     addi        a5, a5, 6       # shift
     sub         a2, a2, a3      # src - src_stride
-    li          t1, 0           # offset   
-    mv          t4, a4 
+    li          t1, 0           # offset
+    mv          t4, a4
 
 1:
     add         t2, a2, t1
@@ -424,7 +424,7 @@ func ff_hevc_put_epel_uni_w_v_8_\lmul\()_rvv, zve32x
     vse8.v      v0, (t3)
     add         t3, t3, a1
     addi        a4, a4, -1
-    bgt         a4, zero, 2b    
+    bgt         a4, zero, 2b
     add         t1, t1, t5
     sub         t6, t6, t5
     mv          a4, t4
@@ -439,7 +439,7 @@ func ff_hevc_put_epel_uni_w_v_8_\lmul\()_rvv, zve32x
 endfunc
 
 func ff_hevc_put_epel_bi_v_8_\lmul\()_rvv, zve32x
-    csrwi       vxrm, 0 
+    csrwi       vxrm, 0
     lw          t6, 0(sp)      # width
     addi        sp, sp, -32
     sx          s1, 0(sp)
@@ -448,8 +448,8 @@ func ff_hevc_put_epel_bi_v_8_\lmul\()_rvv, zve32x
     sx          s4, 24(sp)
     load_filter a7
     sub         a2, a2, a3      # src - src_stride
-    li          t1, 0           # offset   
-    mv          t4, a5 
+    li          t1, 0           # offset
+    mv          t4, a5
 
 1:
     add         t2, a2, t1
@@ -495,4 +495,299 @@ func ff_hevc_put_epel_bi_v_8_\lmul\()_rvv, zve32x
 endfunc
 .endm
 
-hevc_epel_v m1, m2, m4
\ No newline at end of file
+hevc_epel_v m1, m2, m4
+
+.macro filter_v_s         vdst, vsrc0, vsrc1, vsrc2, vsrc3
+        vwmul.vx       \vdst, \vsrc0, s5
+        vwmacc.vx      \vdst, s6, \vsrc1
+        vmv.v.v        \vsrc0, \vsrc1
+        vwmacc.vx      \vdst, s7, \vsrc2
+        vmv.v.v        \vsrc1, \vsrc2
+        vwmacc.vx      \vdst, s8, \vsrc3
+        vmv.v.v        \vsrc2, \vsrc3
+.endm
+
+/* clobbers t0, t1 */
+.macro load_filter2 m
+        la          t0, qpel_filters
+        slli        t1, \m, 2
+        add         t0, t0, t1
+        lb          s5, 0(t0)
+        lb          s6, 1(t0)
+        lb          s7, 2(t0)
+        lb          s8, 3(t0)
+.endm
+
+.macro hevc_epel_hv       lmul, lmul2, lmul4
+func ff_hevc_put_epel_hv_8_\lmul\()_rvv, zve32x
+    csrwi       vxrm, 2
+    addi        sp, sp, -64
+    sx          s1, 0(sp)
+    sx          s2, 8(sp)
+    sx          s3, 16(sp)
+    sx          s4, 24(sp)
+    sx          s5, 32(sp)
+    sx          s6, 40(sp)
+    sx          s7, 48(sp)
+    sx          s8, 56(sp)
+    load_filter  a4
+    load_filter2 a5
+    sub         a1, a1, a2     # src - src_stride
+    mv          t0, a3
+    li          t1, 0          # offset
+
+1:
+    add         t2, a1, t1
+    slli        t3, t1, 1
+    add         t3, a0, t3
+
+    vsetvli     t6, a6, e8, \lmul, ta, ma
+    filter_h    v4, v24, v26, v28, v30, t2
+    add         t2, t2, a2
+    filter_h    v8, v24, v26, v28, v30, t2
+    add         t2, t2, a2
+    filter_h    v12, v24, v26, v28, v30, t2
+    add         t2, t2, a2
+
+2:
+    vsetvli     zero, zero, e8, \lmul, ta, ma
+    filter_h    v16, v24, v26, v28, v30, t2
+    add         t2, t2, a2
+
+    vsetvli     zero, zero, e16, \lmul2, ta, ma
+    filter_v_s  v0, v4, v8, v12, v16
+    vnclip.wi   v0, v0, 6
+    vse16.v     v0, (t3)
+    addi        a3, a3, -1
+    addi        t3, t3, 2*HEVC_MAX_PB_SIZE
+    bgt         a3, zero, 2b
+    mv          a3, t0
+    add         t1, t1, t6
+    sub         a6, a6, t6
+    bgt         a6, zero, 1b
+
+    lx          s1, 0(sp)
+    lx          s2, 8(sp)
+    lx          s3, 16(sp)
+    lx          s4, 24(sp)
+    lx          s5, 32(sp)
+    lx          s6, 40(sp)
+    lx          s7, 48(sp)
+    lx          s8, 56(sp)
+    addi        sp, sp, 64
+    ret
+endfunc
+
+func ff_hevc_put_epel_uni_hv_8_\lmul\()_rvv, zve32x
+    csrwi       vxrm, 0
+    addi        sp, sp, -64
+    sx          s1, 0(sp)
+    sx          s2, 8(sp)
+    sx          s3, 16(sp)
+    sx          s4, 24(sp)
+    sx          s5, 32(sp)
+    sx          s6, 40(sp)
+    sx          s7, 48(sp)
+    sx          s8, 56(sp)
+    load_filter  a5
+    load_filter2 a6
+    sub         a2, a2, a3     # src - src_stride
+    mv          t0, a4
+    li          t1, 0          # offset
+
+1:
+    add         t2, a2, t1
+    add         t3, a0, t1
+
+    vsetvli     t6, a7, e8, \lmul, ta, ma
+    filter_h    v4, v24, v26, v28, v30, t2
+    add         t2, t2, a3
+    filter_h    v8, v24, v26, v28, v30, t2
+    add         t2, t2, a3
+    filter_h    v12, v24, v26, v28, v30, t2
+    add         t2, t2, a3
+
+2:
+    vsetvli     zero, zero, e8, \lmul, ta, ma
+    filter_h    v16, v24, v26, v28, v30, t2
+    add         t2, t2, a3
+
+    vsetvli     zero, zero, e16, \lmul2, ta, ma
+    filter_v_s  v0, v4, v8, v12, v16
+    vsetvli     zero, zero, e32, \lmul4, ta, ma
+    vsra.vi     v0, v0, 6
+    vmax.vx     v0, v0, zero
+    vsetvli     zero, zero, e16, \lmul2, ta, ma
+    vnclipu.wi  v0, v0, 6
+    vsetvli     zero, zero, e8, \lmul, ta, ma
+    vnclipu.wi  v0, v0, 0
+    vse8.v      v0, (t3)
+    addi        a4, a4, -1
+    add         t3, t3, a1
+    bgt         a4, zero, 2b
+    mv          a4, t0
+    add         t1, t1, t6
+    sub         a7, a7, t6
+    bgt         a7, zero, 1b
+
+    lx          s1, 0(sp)
+    lx          s2, 8(sp)
+    lx          s3, 16(sp)
+    lx          s4, 24(sp)
+    lx          s5, 32(sp)
+    lx          s6, 40(sp)
+    lx          s7, 48(sp)
+    lx          s8, 56(sp)
+    addi        sp, sp, 64
+    ret
+endfunc
+
+func ff_hevc_put_epel_uni_w_hv_8_\lmul\()_rvv, zve32x
+    csrwi       vxrm, 0
+    lx          t2, 0(sp)       # mx
+#if (__riscv_xlen == 32)
+    lw          t4, 4(sp)       # my
+    lw          t5, 8(sp)       # width
+#elif (__riscv_xlen == 64)
+    ld          t4, 8(sp)
+    lw          t5, 16(sp)
+#endif
+    addi        a5, a5, 6       # shift
+    addi        sp, sp, -64
+    sx          s1, 0(sp)
+    sx          s2, 8(sp)
+    sx          s3, 16(sp)
+    sx          s4, 24(sp)
+    sx          s5, 32(sp)
+    sx          s6, 40(sp)
+    sx          s7, 48(sp)
+    sx          s8, 56(sp)
+    load_filter  t2
+    load_filter2 t4
+    sub         a2, a2, a3     # src - src_stride
+    mv          t0, a4
+    li          t1, 0          # offset
+
+1:
+    add         t2, a2, t1
+    add         t3, a0, t1
+
+    vsetvli     t6, t5, e8, \lmul, ta, ma
+    filter_h    v4, v24, v26, v28, v30, t2
+    add         t2, t2, a3
+    filter_h    v8, v24, v26, v28, v30, t2
+    add         t2, t2, a3
+    filter_h    v12, v24, v26, v28, v30, t2
+    add         t2, t2, a3
+
+2:
+    vsetvli     zero, zero, e8, \lmul, ta, ma
+    filter_h    v16, v24, v26, v28, v30, t2
+    add         t2, t2, a3
+
+    vsetvli     zero, zero, e16, \lmul2, ta, ma
+    filter_v_s  v0, v4, v8, v12, v16
+    vsetvli     zero, zero, e32, \lmul4, ta, ma
+    vsra.vi     v0, v0, 6
+    vmul.vx     v0, v0, a6
+    vssra.vx    v0, v0, a5
+    vsadd.vx    v0, v0, a7
+    vmax.vx     v0, v0, zero
+    vsetvli     zero, zero, e16, \lmul2, ta, ma
+    vnclip.wi   v0, v0, 0
+    vsetvli     zero, zero, e8, \lmul, ta, ma
+    vnclipu.wi  v0, v0, 0
+    vse8.v      v0, (t3)
+    addi        a4, a4, -1
+    add         t3, t3, a1
+    bgt         a4, zero, 2b
+    mv          a4, t0
+    add         t1, t1, t6
+    sub         t5, t5, t6
+    bgt         t5, zero, 1b
+
+    lx          s1, 0(sp)
+    lx          s2, 8(sp)
+    lx          s3, 16(sp)
+    lx          s4, 24(sp)
+    lx          s5, 32(sp)
+    lx          s6, 40(sp)
+    lx          s7, 48(sp)
+    lx          s8, 56(sp)
+    addi        sp, sp, 64
+    ret
+endfunc
+
+func ff_hevc_put_epel_bi_hv_8_\lmul\()_rvv, zve32x
+    csrwi       vxrm, 0
+    lw          t3, 0(sp)      # width
+    addi        sp, sp, -64
+    sx          s1, 0(sp)
+    sx          s2, 8(sp)
+    sx          s3, 16(sp)
+    sx          s4, 24(sp)
+    sx          s5, 32(sp)
+    sx          s6, 40(sp)
+    sx          s7, 48(sp)
+    sx          s8, 56(sp)
+    load_filter  a6
+    load_filter2 a7
+    mv          a6, t3
+    sub         a2, a2, a3     # src - src_stride
+    mv          t0, a5
+    li          t1, 0          # offset
+
+1:
+    add         t2, a2, t1
+    add         t3, a0, t1
+    slli        t5, t1, 1
+    add         t5, a4, t5
+
+    vsetvli     t6, a6, e8, \lmul, ta, ma
+    filter_h    v4, v24, v26, v28, v30, t2
+    add         t2, t2, a3
+    filter_h    v8, v24, v26, v28, v30, t2
+    add         t2, t2, a3
+    filter_h    v12, v24, v26, v28, v30, t2
+    add         t2, t2, a3
+
+2:
+    vsetvli     zero, zero, e8, \lmul, ta, ma
+    filter_h    v16, v24, v26, v28, v30, t2
+    add         t2, t2, a3
+
+    vsetvli     zero, zero, e16, \lmul2, ta, ma
+    vle16.v     v24, (t5)
+    addi        t5, t5, 2*HEVC_MAX_PB_SIZE
+    filter_v_s  v0, v4, v8, v12, v16
+    vsetvli     zero, zero, e32, \lmul4, ta, ma
+    vsra.vi     v0, v0, 6
+    vsetvli     zero, zero, e16, \lmul2, ta, ma
+    vwadd.wv    v0, v0, v24
+    vnclip.wi   v0, v0, 7
+    vmax.vx     v0, v0, zero
+    vsetvli     zero, zero, e8, \lmul, ta, ma
+    vnclipu.wi  v0, v0, 0
+    vse8.v      v0, (t3)
+    addi        a5, a5, -1
+    add         t3, t3, a1
+    bgt         a5, zero, 2b
+    mv          a5, t0
+    add         t1, t1, t6
+    sub         a6, a6, t6
+    bgt         a6, zero, 1b
+
+    lx          s1, 0(sp)
+    lx          s2, 8(sp)
+    lx          s3, 16(sp)
+    lx          s4, 24(sp)
+    lx          s5, 32(sp)
+    lx          s6, 40(sp)
+    lx          s7, 48(sp)
+    lx          s8, 56(sp)
+    addi        sp, sp, 64
+    ret
+endfunc
+.endm
+
+hevc_epel_hv m1, m2, m4
diff --git a/libavcodec/riscv/hevcdsp_init.c b/libavcodec/riscv/hevcdsp_init.c
index 53c800626f..1df7eb654a 100644
--- a/libavcodec/riscv/hevcdsp_init.c
+++ b/libavcodec/riscv/hevcdsp_init.c
@@ -102,6 +102,10 @@ void ff_hevc_dsp_init_riscv(HEVCDSPContext *c, const int bit_depth)
                 RVV_FNASSIGN_PEL(c->put_hevc_epel_uni, 1, 0, ff_hevc_put_epel_uni_v_8_m1_rvv);
                 RVV_FNASSIGN_PEL(c->put_hevc_epel_uni_w, 1, 0, ff_hevc_put_epel_uni_w_v_8_m1_rvv);
                 RVV_FNASSIGN_PEL(c->put_hevc_epel_bi, 1, 0, ff_hevc_put_epel_bi_v_8_m1_rvv);
+                RVV_FNASSIGN_PEL(c->put_hevc_epel, 1, 1, ff_hevc_put_epel_hv_8_m1_rvv);
+                RVV_FNASSIGN_PEL(c->put_hevc_epel_uni, 1, 1, ff_hevc_put_epel_uni_hv_8_m1_rvv);
+                RVV_FNASSIGN_PEL(c->put_hevc_epel_uni_w, 1, 1, ff_hevc_put_epel_uni_w_hv_8_m1_rvv);
+                RVV_FNASSIGN_PEL(c->put_hevc_epel_bi, 1, 1, ff_hevc_put_epel_bi_hv_8_m1_rvv);
                 break;
             default:
                 break;
-- 
2.25.1

