Git Inbox Mirror of the ffmpeg-devel mailing list - see https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
* [FFmpeg-devel] [PATCH v2] lavc/vvc_mc: R-V V avg w_avg
@ 2024-06-01 18:01 uk7b
  2024-06-01 18:02 ` flow gg
  2024-06-01 19:54 ` Rémi Denis-Courmont
  0 siblings, 2 replies; 4+ messages in thread
From: uk7b @ 2024-06-01 18:01 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: sunyuechi

From: sunyuechi <sunyuechi@iscas.ac.cn>

                                                      C908   X60
avg_8_2x2_c                                        :    1.0    1.0
avg_8_2x2_rvv_i32                                  :    1.0    1.0
avg_8_2x4_c                                        :    1.7    2.0
avg_8_2x4_rvv_i32                                  :    1.2    1.2
avg_8_2x8_c                                        :    3.7    4.0
avg_8_2x8_rvv_i32                                  :    2.0    2.0
avg_8_2x16_c                                       :    7.2    7.5
avg_8_2x16_rvv_i32                                 :    3.2    3.0
avg_8_2x32_c                                       :   14.2   15.0
avg_8_2x32_rvv_i32                                 :    5.7    5.0
avg_8_2x64_c                                       :   46.7   44.2
avg_8_2x64_rvv_i32                                 :   39.2   36.0
avg_8_2x128_c                                      :   99.7   80.0
avg_8_2x128_rvv_i32                                :   86.2   65.5
avg_8_4x2_c                                        :    2.0    2.0
avg_8_4x2_rvv_i32                                  :    1.0    1.0
avg_8_4x4_c                                        :    3.5    3.7
avg_8_4x4_rvv_i32                                  :    1.5    1.2
avg_8_4x8_c                                        :    6.5    7.0
avg_8_4x8_rvv_i32                                  :    2.0    1.7
avg_8_4x16_c                                       :   13.5   14.0
avg_8_4x16_rvv_i32                                 :    3.2    2.7
avg_8_4x32_c                                       :   26.2   27.5
avg_8_4x32_rvv_i32                                 :    5.7    5.0
avg_8_4x64_c                                       :   75.0   65.7
avg_8_4x64_rvv_i32                                 :   44.0   32.0
avg_8_4x128_c                                      :  165.2  118.5
avg_8_4x128_rvv_i32                                :   81.5   71.0
avg_8_8x2_c                                        :    3.2    3.5
avg_8_8x2_rvv_i32                                  :    1.2    1.0
avg_8_8x4_c                                        :    6.5    6.5
avg_8_8x4_rvv_i32                                  :    1.5    1.5
avg_8_8x8_c                                        :   12.5   13.2
avg_8_8x8_rvv_i32                                  :    2.2    1.7
avg_8_8x16_c                                       :   25.2   26.5
avg_8_8x16_rvv_i32                                 :    3.7    2.7
avg_8_8x32_c                                       :   50.0   52.5
avg_8_8x32_rvv_i32                                 :    6.7    5.2
avg_8_8x64_c                                       :  120.7  119.0
avg_8_8x64_rvv_i32                                 :   43.2   33.5
avg_8_8x128_c                                      :  247.5  217.7
avg_8_8x128_rvv_i32                                :  100.5   74.7
avg_8_16x2_c                                       :    6.2    6.5
avg_8_16x2_rvv_i32                                 :    1.2    1.0
avg_8_16x4_c                                       :   12.2   13.0
avg_8_16x4_rvv_i32                                 :    2.0    1.2
avg_8_16x8_c                                       :   24.5   25.7
avg_8_16x8_rvv_i32                                 :    3.2    2.0
avg_8_16x16_c                                      :   48.7   51.2
avg_8_16x16_rvv_i32                                :    5.7    3.2
avg_8_16x32_c                                      :   97.5  102.7
avg_8_16x32_rvv_i32                                :   10.7    6.0
avg_8_16x64_c                                      :  213.0  215.0
avg_8_16x64_rvv_i32                                :   51.5   33.5
avg_8_16x128_c                                     :  408.5  417.0
avg_8_16x128_rvv_i32                               :  102.0   71.5
avg_8_32x2_c                                       :   12.2   13.0
avg_8_32x2_rvv_i32                                 :    2.0    1.2
avg_8_32x4_c                                       :   24.5   25.5
avg_8_32x4_rvv_i32                                 :    3.2    1.7
avg_8_32x8_c                                       :   48.5   50.7
avg_8_32x8_rvv_i32                                 :    5.7    3.0
avg_8_32x16_c                                      :   96.5  101.5
avg_8_32x16_rvv_i32                                :   10.5    5.0
avg_8_32x32_c                                      :  210.2  202.5
avg_8_32x32_rvv_i32                                :   20.2    9.7
avg_8_32x64_c                                      :  431.7  417.2
avg_8_32x64_rvv_i32                                :   68.0   46.0
avg_8_32x128_c                                     :  822.2  819.0
avg_8_32x128_rvv_i32                               :  152.2   69.0
avg_8_64x2_c                                       :   24.0   25.2
avg_8_64x2_rvv_i32                                 :    3.0    1.7
avg_8_64x4_c                                       :   48.2   51.0
avg_8_64x4_rvv_i32                                 :    5.5    2.7
avg_8_64x8_c                                       :   96.7  101.5
avg_8_64x8_rvv_i32                                 :   10.0    5.0
avg_8_64x16_c                                      :  193.5  203.0
avg_8_64x16_rvv_i32                                :   19.2    9.2
avg_8_64x32_c                                      :  404.2  405.7
avg_8_64x32_rvv_i32                                :   37.7   18.0
avg_8_64x64_c                                      :  846.2  841.7
avg_8_64x64_rvv_i32                                :  136.5   35.2
avg_8_64x128_c                                     : 1659.5 1662.2
avg_8_64x128_rvv_i32                               :  236.7  177.7
avg_8_128x2_c                                      :   48.7   51.0
avg_8_128x2_rvv_i32                                :    5.2    2.7
avg_8_128x4_c                                      :   96.7  101.0
avg_8_128x4_rvv_i32                                :    9.7    4.7
avg_8_128x8_c                                      :  226.0  201.5
avg_8_128x8_rvv_i32                                :   18.7    8.7
avg_8_128x16_c                                     :  402.7  402.7
avg_8_128x16_rvv_i32                               :   37.0   17.2
avg_8_128x32_c                                     :  791.2  805.0
avg_8_128x32_rvv_i32                               :   73.5   33.5
avg_8_128x64_c                                     : 1616.7 1645.2
avg_8_128x64_rvv_i32                               :  223.7   68.2
avg_8_128x128_c                                    : 3202.0 3235.2
avg_8_128x128_rvv_i32                              :  390.0  314.2
w_avg_8_2x2_c                                      :    1.7    1.5
w_avg_8_2x2_rvv_i32                                :    1.7    1.5
w_avg_8_2x4_c                                      :    2.7    2.5
w_avg_8_2x4_rvv_i32                                :    2.7    2.5
w_avg_8_2x8_c                                      :    5.0    5.0
w_avg_8_2x8_rvv_i32                                :    4.5    4.0
w_avg_8_2x16_c                                     :   26.5    9.5
w_avg_8_2x16_rvv_i32                               :    8.0    7.0
w_avg_8_2x32_c                                     :   18.7   18.5
w_avg_8_2x32_rvv_i32                               :   15.0   13.2
w_avg_8_2x64_c                                     :   58.5   46.5
w_avg_8_2x64_rvv_i32                               :   49.7   38.7
w_avg_8_2x128_c                                    :  121.7   85.2
w_avg_8_2x128_rvv_i32                              :   89.7   81.0
w_avg_8_4x2_c                                      :    2.5    2.5
w_avg_8_4x2_rvv_i32                                :    1.7    1.5
w_avg_8_4x4_c                                      :    4.7    4.7
w_avg_8_4x4_rvv_i32                                :    2.7    2.2
w_avg_8_4x8_c                                      :    9.0    9.0
w_avg_8_4x8_rvv_i32                                :    4.5    4.0
w_avg_8_4x16_c                                     :   17.7   17.7
w_avg_8_4x16_rvv_i32                               :    8.0    7.0
w_avg_8_4x32_c                                     :   35.0   35.0
w_avg_8_4x32_rvv_i32                               :   15.0   13.5
w_avg_8_4x64_c                                     :   95.2   80.2
w_avg_8_4x64_rvv_i32                               :   47.7   38.0
w_avg_8_4x128_c                                    :  197.7  164.7
w_avg_8_4x128_rvv_i32                              :  101.7   81.5
w_avg_8_8x2_c                                      :    4.5    4.5
w_avg_8_8x2_rvv_i32                                :    2.0    1.7
w_avg_8_8x4_c                                      :    8.7    8.7
w_avg_8_8x4_rvv_i32                                :    2.7    2.5
w_avg_8_8x8_c                                      :   33.5   17.0
w_avg_8_8x8_rvv_i32                                :    4.7    4.0
w_avg_8_8x16_c                                     :   34.0   34.0
w_avg_8_8x16_rvv_i32                               :    8.5    7.2
w_avg_8_8x32_c                                     :   85.5   67.7
w_avg_8_8x32_rvv_i32                               :   16.2   13.5
w_avg_8_8x64_c                                     :  162.5  148.2
w_avg_8_8x64_rvv_i32                               :   50.0   36.5
w_avg_8_8x128_c                                    :  380.2  301.5
w_avg_8_8x128_rvv_i32                              :   87.2   79.5
w_avg_8_16x2_c                                     :    8.5    8.7
w_avg_8_16x2_rvv_i32                               :    2.2    1.7
w_avg_8_16x4_c                                     :   16.7   17.0
w_avg_8_16x4_rvv_i32                               :    3.7    2.5
w_avg_8_16x8_c                                     :   33.2   33.7
w_avg_8_16x8_rvv_i32                               :    6.5    4.2
w_avg_8_16x16_c                                    :   66.2   66.5
w_avg_8_16x16_rvv_i32                              :   12.0    7.5
w_avg_8_16x32_c                                    :  133.2  134.0
w_avg_8_16x32_rvv_i32                              :   23.0   14.2
w_avg_8_16x64_c                                    :  296.0  276.7
w_avg_8_16x64_rvv_i32                              :   66.7   38.2
w_avg_8_16x128_c                                   :  625.2  539.7
w_avg_8_16x128_rvv_i32                             :  135.5   79.2
w_avg_8_32x2_c                                     :   16.7   16.7
w_avg_8_32x2_rvv_i32                               :    3.5    2.0
w_avg_8_32x4_c                                     :   33.2   33.2
w_avg_8_32x4_rvv_i32                               :    6.0    3.5
w_avg_8_32x8_c                                     :   65.7   66.2
w_avg_8_32x8_rvv_i32                               :   11.2    5.7
w_avg_8_32x16_c                                    :  132.0  132.0
w_avg_8_32x16_rvv_i32                              :   21.5   10.7
w_avg_8_32x32_c                                    :  261.7  272.2
w_avg_8_32x32_rvv_i32                              :   42.2   20.5
w_avg_8_32x64_c                                    :  528.2  562.7
w_avg_8_32x64_rvv_i32                              :   83.5   59.2
w_avg_8_32x128_c                                   : 1135.5 1070.0
w_avg_8_32x128_rvv_i32                             :  208.7   96.5
w_avg_8_64x2_c                                     :   33.0   33.0
w_avg_8_64x2_rvv_i32                               :    6.0    3.0
w_avg_8_64x4_c                                     :   65.5   67.0
w_avg_8_64x4_rvv_i32                               :   11.0    5.2
w_avg_8_64x8_c                                     :  150.0  134.7
w_avg_8_64x8_rvv_i32                               :   21.5   10.0
w_avg_8_64x16_c                                    :  265.2  273.7
w_avg_8_64x16_rvv_i32                              :   42.2   19.0
w_avg_8_64x32_c                                    :  629.7  541.7
w_avg_8_64x32_rvv_i32                              :   83.7   37.7
w_avg_8_64x64_c                                    : 1259.0 1237.7
w_avg_8_64x64_rvv_i32                              :  190.7   76.0
w_avg_8_64x128_c                                   : 2967.0 2209.5
w_avg_8_64x128_rvv_i32                             :  437.0  190.5
w_avg_8_128x2_c                                    :   65.7   66.0
w_avg_8_128x2_rvv_i32                              :   11.2    5.5
w_avg_8_128x4_c                                    :  131.7  134.7
w_avg_8_128x4_rvv_i32                              :   21.5   10.0
w_avg_8_128x8_c                                    :  270.2  264.2
w_avg_8_128x8_rvv_i32                              :   42.2   19.2
w_avg_8_128x16_c                                   :  580.0  554.2
w_avg_8_128x16_rvv_i32                             :   83.5   37.5
w_avg_8_128x32_c                                   : 1141.0 1206.2
w_avg_8_128x32_rvv_i32                             :  166.2   74.2
w_avg_8_128x64_c                                   : 2295.5 2403.5
w_avg_8_128x64_rvv_i32                             :  408.2  159.0
w_avg_8_128x128_c                                  : 5367.5 4915.2
w_avg_8_128x128_rvv_i32                            :  741.2  331.5

test

u makefile
---
 libavcodec/riscv/vvc/Makefile      |   2 +
 libavcodec/riscv/vvc/vvc_mc_rvv.S  | 295 +++++++++++++++++++++++++++++
 libavcodec/riscv/vvc/vvcdsp_init.c |  71 +++++++
 libavcodec/vvc/dsp.c               |   4 +-
 libavcodec/vvc/dsp.h               |   1 +
 5 files changed, 372 insertions(+), 1 deletion(-)
 create mode 100644 libavcodec/riscv/vvc/Makefile
 create mode 100644 libavcodec/riscv/vvc/vvc_mc_rvv.S
 create mode 100644 libavcodec/riscv/vvc/vvcdsp_init.c

diff --git a/libavcodec/riscv/vvc/Makefile b/libavcodec/riscv/vvc/Makefile
new file mode 100644
index 0000000000..582b051579
--- /dev/null
+++ b/libavcodec/riscv/vvc/Makefile
@@ -0,0 +1,2 @@
+OBJS-$(CONFIG_VVC_DECODER) += riscv/vvc/vvcdsp_init.o
+RVV-OBJS-$(CONFIG_VVC_DECODER) += riscv/vvc/vvc_mc_rvv.o
diff --git a/libavcodec/riscv/vvc/vvc_mc_rvv.S b/libavcodec/riscv/vvc/vvc_mc_rvv.S
new file mode 100644
index 0000000000..f2128fa776
--- /dev/null
+++ b/libavcodec/riscv/vvc/vvc_mc_rvv.S
@@ -0,0 +1,295 @@
+/*
+ * Copyright (c) 2024 Institute of Software Chinese Academy of Sciences (ISCAS).
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "libavutil/riscv/asm.S"
+
+.macro vsetvlstatic8 w, vlen, is_w
+        .if \w <= 2
+                vsetivli        zero, \w, e8, mf8, ta, ma
+        .elseif \w <= 4 && \vlen == 128
+                vsetivli        zero, \w, e8, mf4, ta, ma
+        .elseif \w <= 4 && \vlen >= 256
+                vsetivli        zero, \w, e8, mf8, ta, ma
+        .elseif \w <= 8 && \vlen == 128
+                vsetivli        zero, \w, e8, mf2, ta, ma
+        .elseif \w <= 8 && \vlen >= 256
+                vsetivli        zero, \w, e8, mf4, ta, ma
+        .elseif \w <= 16 && \vlen == 128
+                vsetivli        zero, \w, e8, m1, ta, ma
+        .elseif \w <= 16 && \vlen >= 256
+                vsetivli        zero, \w, e8, mf2, ta, ma
+        .elseif \w <= 32 && \vlen >= 256
+                li t0, \w
+                vsetvli         zero, t0, e8, m1, ta, ma
+        .elseif \w <= (\vlen / 4) || \is_w
+                li t0, 64
+                vsetvli         zero, t0, e8, m2, ta, ma
+        .else
+                li t0, \w
+                vsetvli         zero, t0, e8, m4, ta, ma
+        .endif
+.endm
+
+.macro vsetvlstatic16 w, vlen, is_w
+        .if \w <= 2
+                vsetivli        zero, \w, e16, mf4, ta, ma
+        .elseif \w <= 4 && \vlen == 128
+                vsetivli        zero, \w, e16, mf2, ta, ma
+        .elseif \w <= 4 && \vlen >= 256
+                vsetivli        zero, \w, e16, mf4, ta, ma
+        .elseif \w <= 8 && \vlen == 128
+                vsetivli        zero, \w, e16, m1, ta, ma
+        .elseif \w <= 8 && \vlen >= 256
+                vsetivli        zero, \w, e16, mf2, ta, ma
+        .elseif \w <= 16 && \vlen == 128
+                vsetivli        zero, \w, e16, m2, ta, ma
+        .elseif \w <= 16 && \vlen >= 256
+                vsetivli        zero, \w, e16, m1, ta, ma
+        .elseif \w <= 32 && \vlen >= 256
+                li t0, \w
+                vsetvli         zero, t0, e16, m2, ta, ma
+        .elseif \w <= (\vlen / 4) || \is_w
+                li t0, 64
+                vsetvli         zero, t0, e16, m4, ta, ma
+        .else
+                li t0, \w
+                vsetvli         zero, t0, e16, m8, ta, ma
+        .endif
+.endm
+
+.macro vsetvlstatic32 w, vlen
+        .if \w <= 2
+                vsetivli        zero, \w, e32, mf2, ta, ma
+        .elseif \w <= 4 && \vlen == 128
+                vsetivli        zero, \w, e32, m1, ta, ma
+        .elseif \w <= 4 && \vlen >= 256
+                vsetivli        zero, \w, e32, mf2, ta, ma
+        .elseif \w <= 8 && \vlen == 128
+                vsetivli        zero, \w, e32, m2, ta, ma
+        .elseif \w <= 8 && \vlen >= 256
+                vsetivli        zero, \w, e32, m1, ta, ma
+        .elseif \w <= 16 && \vlen == 128
+                vsetivli        zero, \w, e32, m4, ta, ma
+        .elseif \w <= 16 && \vlen >= 256
+                vsetivli        zero, \w, e32, m2, ta, ma
+        .elseif \w <= 32 && \vlen >= 256
+                li t0, \w
+                vsetvli         zero, t0, e32, m4, ta, ma
+        .else
+                li t0, \w
+                vsetvli         zero, t0, e32, m8, ta, ma
+        .endif
+.endm
+
+.macro avg_nx1 w, vlen
+        vsetvlstatic16    \w, \vlen, 0
+        vle16.v           v0, (a2)
+        vle16.v           v8, (a3)
+        vadd.vv           v8, v8, v0
+        vmax.vx           v8, v8, zero
+        vsetvlstatic8     \w, \vlen, 0
+        vnclipu.wi        v8, v8, 7
+        vse8.v            v8, (a0)
+.endm
+
+.macro avg w, vlen, id
+\id\w\vlen:
+.if \w < 128
+        vsetvlstatic16    \w, \vlen, 0
+        addi              t0, a2, 128*2
+        addi              t1, a3, 128*2
+        add               t2, a0, a1
+        vle16.v           v0, (a2)
+        vle16.v           v8, (a3)
+        addi              a5, a5, -2
+        vle16.v           v16, (t0)
+        vle16.v           v24, (t1)
+        vadd.vv           v8, v8, v0
+        vadd.vv           v24, v24, v16
+        vmax.vx           v8, v8, zero
+        vmax.vx           v24, v24, zero
+        vsetvlstatic8     \w, \vlen, 0
+        addi              a2, a2, 128*4
+        vnclipu.wi        v8, v8, 7
+        vnclipu.wi        v24, v24, 7
+        addi              a3, a3, 128*4
+        vse8.v            v8, (a0)
+        vse8.v            v24, (t2)
+        sh1add            a0, a1, a0
+.else
+        avg_nx1           128, \vlen
+        addi              a5, a5, -1
+        .if \vlen == 128
+        addi              a2, a2, 64*2
+        addi              a3, a3, 64*2
+        addi              a0, a0, 64
+        avg_nx1           128, \vlen
+        addi              a0, a0, -64
+        addi              a2, a2, 128
+        addi              a3, a3, 128
+        .else
+        addi              a2, a2, 128*2
+        addi              a3, a3, 128*2
+        .endif
+        add               a0, a0, a1
+.endif
+        bnez              a5, \id\w\vlen\()b
+        ret
+.endm
+
+
+.macro AVG_JMP_TABLE id, vlen
+const jmp_table_\id\vlen
+    .8byte \id\()2\vlen\()f
+    .8byte \id\()4\vlen\()f
+    .8byte \id\()8\vlen\()f
+    .8byte \id\()16\vlen\()f
+    .8byte \id\()32\vlen\()f
+    .8byte \id\()64\vlen\()f
+    .8byte \id\()128\vlen\()f
+endconst
+.endm
+
+.macro AVG_J vlen, id
+        clz               a4, a4
+        li                t0, __riscv_xlen-2
+        sub               a4, t0, a4
+        lla               t0, jmp_table_\id\vlen
+        sh3add            t0, a4, t0
+        ld                t0, 0(t0)
+        jr                t0
+.endm
+
+.macro func_avg vlen
+func ff_vvc_avg_8_rvv_\vlen\(), zve32x
+        AVG_JMP_TABLE     1, \vlen
+        csrw              vxrm, zero
+        AVG_J             \vlen, 1
+        .irp w,2,4,8,16,32,64,128
+        avg               \w, \vlen, 1
+        .endr
+endfunc
+.endm
+
+.macro w_avg_nx1 w, vlen
+        vsetvlstatic16    \w, \vlen, 1
+        vle16.v           v0, (a2)
+        vle16.v           v8, (a3)
+        vwmul.vx          v16, v0, a7
+        vwmacc.vx         v16, t3, v8
+        vsetvlstatic32    \w, \vlen
+        vadd.vx           v16, v16, t4
+        vsetvlstatic16    \w, \vlen, 1
+        vnsrl.wx          v16, v16, t6
+        vmax.vx           v16, v16, zero
+        vsetvlstatic8     \w, \vlen, 1
+        vnclipu.wi        v16, v16, 0
+        vse8.v            v16, (a0)
+.endm
+
+.macro w_avg w, vlen, id
+\id\w\vlen:
+.if \vlen <= 16
+        vsetvlstatic16    \w, \vlen, 1
+        addi              t0, a2, 128*2
+        addi              t1, a3, 128*2
+        vle16.v           v0, (a2)
+        vle16.v           v8, (a3)
+        addi              a5, a5, -2
+        vle16.v           v20, (t0)
+        vle16.v           v24, (t1)
+        vwmul.vx          v16, v0, a7
+        vwmul.vx          v28, v20, a7
+        vwmacc.vx         v16, t3, v8
+        vwmacc.vx         v28, t3, v24
+        vsetvlstatic32    \w, \vlen
+        add               t2, a0, a1
+        vadd.vx           v16, v16, t4
+        vadd.vx           v28, v28, t4
+        vsetvlstatic16    \w, \vlen, 1
+        vnsrl.wx          v16, v16, t6
+        vnsrl.wx          v28, v28, t6
+        vmax.vx           v16, v16, zero
+        vmax.vx           v28, v28, zero
+        vsetvlstatic8     \w, \vlen, 1
+        addi              a2, a2, 128*4
+        vnclipu.wi        v16, v16, 0
+        vnclipu.wi        v28, v28, 0
+        vse8.v            v16, (a0)
+        addi              a3, a3, 128*4
+        vse8.v            v28, (t2)
+        sh1add            a0, a1, a0
+.else
+        w_avg_nx1         \w, \vlen
+        addi              a5, a5, -1
+        .if \w == (\vlen / 2)
+        addi              a2, a2, (\vlen / 2)
+        addi              a3, a3, (\vlen / 2)
+        addi              a0, a0, (\vlen / 4)
+        w_avg_nx1         \w, \vlen
+        addi              a2, a2, -(\vlen / 2)
+        addi              a3, a3, -(\vlen / 2)
+        addi              a0, a0, -(\vlen / 4)
+        .elseif \w == 128 && \vlen == 128
+        .rept 3
+        addi              a2, a2, (\vlen / 2)
+        addi              a3, a3, (\vlen / 2)
+        addi              a0, a0, (\vlen / 4)
+        w_avg_nx1         \w, \vlen
+        .endr
+        addi              a2, a2, -(\vlen / 2) * 3
+        addi              a3, a3, -(\vlen / 2) * 3
+        addi              a0, a0, -(\vlen / 4) * 3
+        .endif
+
+        addi              a2, a2, 128*2
+        addi              a3, a3, 128*2
+        add               a0, a0, a1
+.endif
+        bnez              a5, \id\w\vlen\()b
+        ret
+.endm
+
+
+.macro func_w_avg vlen
+func ff_vvc_w_avg_8_rvv_\vlen\(), zve32x
+        AVG_JMP_TABLE     2, \vlen
+        csrw              vxrm, zero
+        addi              t6, a6, 7
+        ld                t3, (sp)
+        ld                t4, 8(sp)
+        ld                t5, 16(sp)
+        add               t4, t4, t5
+        addi              t4, t4, 1       // o0 + o1 + 1
+        addi              t5, t6, -1      // shift - 1
+        sll               t4, t4, t5
+        AVG_J             \vlen, 2
+        .irp w,2,4,8,16,32,64,128
+        w_avg             \w, \vlen, 2
+        .endr
+endfunc
+.endm
+
+func_avg 128
+func_avg 256
+#if (__riscv_xlen == 64)
+func_w_avg 128
+func_w_avg 256
+#endif
diff --git a/libavcodec/riscv/vvc/vvcdsp_init.c b/libavcodec/riscv/vvc/vvcdsp_init.c
new file mode 100644
index 0000000000..85b1ede061
--- /dev/null
+++ b/libavcodec/riscv/vvc/vvcdsp_init.c
@@ -0,0 +1,71 @@
+/*
+ * Copyright (c) 2024 Institute of Software Chinese Academy of Sciences (ISCAS).
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "config.h"
+
+#include "libavutil/attributes.h"
+#include "libavutil/cpu.h"
+#include "libavutil/riscv/cpu.h"
+#include "libavcodec/vvc/dsp.h"
+
+#define bf(fn, bd,  opt) fn##_##bd##_##opt
+
+#define AVG_PROTOTYPES(bd, opt)                                                                      \
+void bf(ff_vvc_avg, bd, opt)(uint8_t *dst, ptrdiff_t dst_stride,                                     \
+    const int16_t *src0, const int16_t *src1, int width, int height);                                \
+void bf(ff_vvc_w_avg, bd, opt)(uint8_t *dst, ptrdiff_t dst_stride,                                   \
+    const int16_t *src0, const int16_t *src1, int width, int height,                                 \
+    int denom, int w0, int w1, int o0, int o1);
+
+AVG_PROTOTYPES(8, rvv_128)
+AVG_PROTOTYPES(8, rvv_256)
+
+void ff_vvc_dsp_init_riscv(VVCDSPContext *const c, const int bd)
+{
+#if HAVE_RVV
+    const int flags = av_get_cpu_flags();
+
+    if ((flags & AV_CPU_FLAG_RVV_I32) && (flags & AV_CPU_FLAG_RVB_ADDR) &&
+        ff_rv_vlen_least(256)) {
+        switch (bd) {
+            case 8:
+                c->inter.avg    = ff_vvc_avg_8_rvv_256;
+# if (__riscv_xlen == 64)
+                c->inter.w_avg    = ff_vvc_w_avg_8_rvv_256;
+# endif
+                break;
+            default:
+                break;
+        }
+    } else if ((flags & AV_CPU_FLAG_RVV_I32) && (flags & AV_CPU_FLAG_RVB_ADDR) &&
+               ff_rv_vlen_least(128)) {
+        switch (bd) {
+            case 8:
+                c->inter.avg    = ff_vvc_avg_8_rvv_128;
+# if (__riscv_xlen == 64)
+                c->inter.w_avg    = ff_vvc_w_avg_8_rvv_128;
+# endif
+                break;
+            default:
+                break;
+        }
+    }
+#endif
+}
diff --git a/libavcodec/vvc/dsp.c b/libavcodec/vvc/dsp.c
index 41e830a98a..c55a37d255 100644
--- a/libavcodec/vvc/dsp.c
+++ b/libavcodec/vvc/dsp.c
@@ -121,7 +121,9 @@ void ff_vvc_dsp_init(VVCDSPContext *vvcdsp, int bit_depth)
         break;
     }
 
-#if ARCH_X86
+#if ARCH_RISCV
+    ff_vvc_dsp_init_riscv(vvcdsp, bit_depth);
+#elif ARCH_X86
     ff_vvc_dsp_init_x86(vvcdsp, bit_depth);
 #endif
 }
diff --git a/libavcodec/vvc/dsp.h b/libavcodec/vvc/dsp.h
index 1f14096c41..e03236dd76 100644
--- a/libavcodec/vvc/dsp.h
+++ b/libavcodec/vvc/dsp.h
@@ -180,6 +180,7 @@ typedef struct VVCDSPContext {
 
 void ff_vvc_dsp_init(VVCDSPContext *hpc, int bit_depth);
 
+void ff_vvc_dsp_init_riscv(VVCDSPContext *hpc, const int bit_depth);
 void ff_vvc_dsp_init_x86(VVCDSPContext *hpc, const int bit_depth);
 
 #endif /* AVCODEC_VVC_DSP_H */
-- 
2.45.1

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [FFmpeg-devel] [PATCH v2] lavc/vvc_mc: R-V V avg w_avg
  2024-06-01 18:01 [FFmpeg-devel] [PATCH v2] lavc/vvc_mc: R-V V avg w_avg uk7b
@ 2024-06-01 18:02 ` flow gg
  2024-06-01 19:54 ` Rémi Denis-Courmont
  1 sibling, 0 replies; 4+ messages in thread
From: flow gg @ 2024-06-01 18:02 UTC (permalink / raw)
  To: FFmpeg development discussions and patches

> In keeping in line with the rest of the project, that should probably go into
> **libavcodec/riscv/vvc/**
> Expanding the macro 49 times, with up to 14 **branches** to get there, is
> maybe not such a great idea. It might look nice on the checkasm µbenchmarks
> because the branches under test get predicted and cached.
>
> But in real use, branch prediction will not work so well, and the I-cache
> will be filled with all variants of the same function.
>
> Indeed, this seems to result in about .5 MiB of code.
>
> Even if only one half is needed (128-bit or 256+-bit variants), that's a lot.
>
> For comparison, x86 uses just about 10 KiB, also with two variants.
>
> What I make out from the arcane forbidden CISC arts there:
>
> - functions are specialised only in one dimension, not both,
> - dispatch tables avoid multiplying branches.

Referring to the x86 version, the code has been updated: the code size is now
about 6 KiB, and a jump table has been added.
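
For reference, the dispatch idea can be sketched in C roughly as follows. This
is only an illustration with assumed, simplified names and signatures (DEF_AVG,
avg_wN, avg_tab and avg_dispatch are hypothetical, and the per-pixel arithmetic
is reduced to a plain shift); the patch itself does the equivalent in RISC-V
assembly, indexing a table of code labels by log2(width).

/* Hypothetical sketch of dispatch-by-width through a lookup table,
 * indexed by log2(width) - 1, instead of a chain of compare-and-branch.
 * Names and arithmetic are simplified and do not match the FFmpeg API. */
#include <stdint.h>
#include <stdio.h>

typedef void (*avg_fn)(uint8_t *dst, const int16_t *src0,
                       const int16_t *src1, int height);

/* One specialised routine per width, stamped out by a macro. */
#define DEF_AVG(W)                                                        \
static void avg_w##W(uint8_t *dst, const int16_t *src0,                   \
                     const int16_t *src1, int height)                     \
{                                                                         \
    for (int y = 0; y < height; y++)                                      \
        for (int x = 0; x < W; x++)                                       \
            dst[y * W + x] =                                              \
                (uint8_t)((src0[y * W + x] + src1[y * W + x]) >> 1);      \
}
DEF_AVG(2)  DEF_AVG(4)  DEF_AVG(8)   DEF_AVG(16)
DEF_AVG(32) DEF_AVG(64) DEF_AVG(128)

/* Slot 0 holds width 2, slot 1 width 4, ..., slot 6 width 128. */
static const avg_fn avg_tab[] = {
    avg_w2, avg_w4, avg_w8, avg_w16, avg_w32, avg_w64, avg_w128
};

static void avg_dispatch(uint8_t *dst, const int16_t *src0,
                         const int16_t *src1, int width, int height)
{
    /* log2(width) - 1, mirroring the clz-based index in the assembly. */
    int idx = 31 - __builtin_clz((unsigned)width) - 1;
    avg_tab[idx](dst, src0, src1, height);
}

int main(void)
{
    int16_t a[8 * 2] = {100, 200}, b[8 * 2] = {50, 80};
    uint8_t d[8 * 2];
    avg_dispatch(d, a, b, 8, 2);
    printf("%d %d\n", d[0], d[1]);   /* prints: 75 140 */
    return 0;
}

Each width is then reached with a single indexed load and an indirect jump
rather than a ladder of conditional branches, which keeps both the branch
count and the duplicated code in check.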

<uk7b@foxmail.com> wrote on Sun, Jun 2, 2024 at 02:01:

> From: sunyuechi <sunyuechi@iscas.ac.cn>
>
>                                                       C908   X60
> avg_8_2x2_c                                        :    1.0    1.0
> avg_8_2x2_rvv_i32                                  :    1.0    1.0
> avg_8_2x4_c                                        :    1.7    2.0
> avg_8_2x4_rvv_i32                                  :    1.2    1.2
> avg_8_2x8_c                                        :    3.7    4.0
> avg_8_2x8_rvv_i32                                  :    2.0    2.0
> avg_8_2x16_c                                       :    7.2    7.5
> avg_8_2x16_rvv_i32                                 :    3.2    3.0
> avg_8_2x32_c                                       :   14.2   15.0
> avg_8_2x32_rvv_i32                                 :    5.7    5.0
> avg_8_2x64_c                                       :   46.7   44.2
> avg_8_2x64_rvv_i32                                 :   39.2   36.0
> avg_8_2x128_c                                      :   99.7   80.0
> avg_8_2x128_rvv_i32                                :   86.2   65.5
> avg_8_4x2_c                                        :    2.0    2.0
> avg_8_4x2_rvv_i32                                  :    1.0    1.0
> avg_8_4x4_c                                        :    3.5    3.7
> avg_8_4x4_rvv_i32                                  :    1.5    1.2
> avg_8_4x8_c                                        :    6.5    7.0
> avg_8_4x8_rvv_i32                                  :    2.0    1.7
> avg_8_4x16_c                                       :   13.5   14.0
> avg_8_4x16_rvv_i32                                 :    3.2    2.7
> avg_8_4x32_c                                       :   26.2   27.5
> avg_8_4x32_rvv_i32                                 :    5.7    5.0
> avg_8_4x64_c                                       :   75.0   65.7
> avg_8_4x64_rvv_i32                                 :   44.0   32.0
> avg_8_4x128_c                                      :  165.2  118.5
> avg_8_4x128_rvv_i32                                :   81.5   71.0
> avg_8_8x2_c                                        :    3.2    3.5
> avg_8_8x2_rvv_i32                                  :    1.2    1.0
> avg_8_8x4_c                                        :    6.5    6.5
> avg_8_8x4_rvv_i32                                  :    1.5    1.5
> avg_8_8x8_c                                        :   12.5   13.2
> avg_8_8x8_rvv_i32                                  :    2.2    1.7
> avg_8_8x16_c                                       :   25.2   26.5
> avg_8_8x16_rvv_i32                                 :    3.7    2.7
> avg_8_8x32_c                                       :   50.0   52.5
> avg_8_8x32_rvv_i32                                 :    6.7    5.2
> avg_8_8x64_c                                       :  120.7  119.0
> avg_8_8x64_rvv_i32                                 :   43.2   33.5
> avg_8_8x128_c                                      :  247.5  217.7
> avg_8_8x128_rvv_i32                                :  100.5   74.7
> avg_8_16x2_c                                       :    6.2    6.5
> avg_8_16x2_rvv_i32                                 :    1.2    1.0
> avg_8_16x4_c                                       :   12.2   13.0
> avg_8_16x4_rvv_i32                                 :    2.0    1.2
> avg_8_16x8_c                                       :   24.5   25.7
> avg_8_16x8_rvv_i32                                 :    3.2    2.0
> avg_8_16x16_c                                      :   48.7   51.2
> avg_8_16x16_rvv_i32                                :    5.7    3.2
> avg_8_16x32_c                                      :   97.5  102.7
> avg_8_16x32_rvv_i32                                :   10.7    6.0
> avg_8_16x64_c                                      :  213.0  215.0
> avg_8_16x64_rvv_i32                                :   51.5   33.5
> avg_8_16x128_c                                     :  408.5  417.0
> avg_8_16x128_rvv_i32                               :  102.0   71.5
> avg_8_32x2_c                                       :   12.2   13.0
> avg_8_32x2_rvv_i32                                 :    2.0    1.2
> avg_8_32x4_c                                       :   24.5   25.5
> avg_8_32x4_rvv_i32                                 :    3.2    1.7
> avg_8_32x8_c                                       :   48.5   50.7
> avg_8_32x8_rvv_i32                                 :    5.7    3.0
> avg_8_32x16_c                                      :   96.5  101.5
> avg_8_32x16_rvv_i32                                :   10.5    5.0
> avg_8_32x32_c                                      :  210.2  202.5
> avg_8_32x32_rvv_i32                                :   20.2    9.7
> avg_8_32x64_c                                      :  431.7  417.2
> avg_8_32x64_rvv_i32                                :   68.0   46.0
> avg_8_32x128_c                                     :  822.2  819.0
> avg_8_32x128_rvv_i32                               :  152.2   69.0
> avg_8_64x2_c                                       :   24.0   25.2
> avg_8_64x2_rvv_i32                                 :    3.0    1.7
> avg_8_64x4_c                                       :   48.2   51.0
> avg_8_64x4_rvv_i32                                 :    5.5    2.7
> avg_8_64x8_c                                       :   96.7  101.5
> avg_8_64x8_rvv_i32                                 :   10.0    5.0
> avg_8_64x16_c                                      :  193.5  203.0
> avg_8_64x16_rvv_i32                                :   19.2    9.2
> avg_8_64x32_c                                      :  404.2  405.7
> avg_8_64x32_rvv_i32                                :   37.7   18.0
> avg_8_64x64_c                                      :  846.2  841.7
> avg_8_64x64_rvv_i32                                :  136.5   35.2
> avg_8_64x128_c                                     : 1659.5 1662.2
> avg_8_64x128_rvv_i32                               :  236.7  177.7
> avg_8_128x2_c                                      :   48.7   51.0
> avg_8_128x2_rvv_i32                                :    5.2    2.7
> avg_8_128x4_c                                      :   96.7  101.0
> avg_8_128x4_rvv_i32                                :    9.7    4.7
> avg_8_128x8_c                                      :  226.0  201.5
> avg_8_128x8_rvv_i32                                :   18.7    8.7
> avg_8_128x16_c                                     :  402.7  402.7
> avg_8_128x16_rvv_i32                               :   37.0   17.2
> avg_8_128x32_c                                     :  791.2  805.0
> avg_8_128x32_rvv_i32                               :   73.5   33.5
> avg_8_128x64_c                                     : 1616.7 1645.2
> avg_8_128x64_rvv_i32                               :  223.7   68.2
> avg_8_128x128_c                                    : 3202.0 3235.2
> avg_8_128x128_rvv_i32                              :  390.0  314.2
> w_avg_8_2x2_c                                      :    1.7    1.5
> w_avg_8_2x2_rvv_i32                                :    1.7    1.5
> w_avg_8_2x4_c                                      :    2.7    2.5
> w_avg_8_2x4_rvv_i32                                :    2.7    2.5
> w_avg_8_2x8_c                                      :    5.0    5.0
> w_avg_8_2x8_rvv_i32                                :    4.5    4.0
> w_avg_8_2x16_c                                     :   26.5    9.5
> w_avg_8_2x16_rvv_i32                               :    8.0    7.0
> w_avg_8_2x32_c                                     :   18.7   18.5
> w_avg_8_2x32_rvv_i32                               :   15.0   13.2
> w_avg_8_2x64_c                                     :   58.5   46.5
> w_avg_8_2x64_rvv_i32                               :   49.7   38.7
> w_avg_8_2x128_c                                    :  121.7   85.2
> w_avg_8_2x128_rvv_i32                              :   89.7   81.0
> w_avg_8_4x2_c                                      :    2.5    2.5
> w_avg_8_4x2_rvv_i32                                :    1.7    1.5
> w_avg_8_4x4_c                                      :    4.7    4.7
> w_avg_8_4x4_rvv_i32                                :    2.7    2.2
> w_avg_8_4x8_c                                      :    9.0    9.0
> w_avg_8_4x8_rvv_i32                                :    4.5    4.0
> w_avg_8_4x16_c                                     :   17.7   17.7
> w_avg_8_4x16_rvv_i32                               :    8.0    7.0
> w_avg_8_4x32_c                                     :   35.0   35.0
> w_avg_8_4x32_rvv_i32                               :   15.0   13.5
> w_avg_8_4x64_c                                     :   95.2   80.2
> w_avg_8_4x64_rvv_i32                               :   47.7   38.0
> w_avg_8_4x128_c                                    :  197.7  164.7
> w_avg_8_4x128_rvv_i32                              :  101.7   81.5
> w_avg_8_8x2_c                                      :    4.5    4.5
> w_avg_8_8x2_rvv_i32                                :    2.0    1.7
> w_avg_8_8x4_c                                      :    8.7    8.7
> w_avg_8_8x4_rvv_i32                                :    2.7    2.5
> w_avg_8_8x8_c                                      :   33.5   17.0
> w_avg_8_8x8_rvv_i32                                :    4.7    4.0
> w_avg_8_8x16_c                                     :   34.0   34.0
> w_avg_8_8x16_rvv_i32                               :    8.5    7.2
> w_avg_8_8x32_c                                     :   85.5   67.7
> w_avg_8_8x32_rvv_i32                               :   16.2   13.5
> w_avg_8_8x64_c                                     :  162.5  148.2
> w_avg_8_8x64_rvv_i32                               :   50.0   36.5
> w_avg_8_8x128_c                                    :  380.2  301.5
> w_avg_8_8x128_rvv_i32                              :   87.2   79.5
> w_avg_8_16x2_c                                     :    8.5    8.7
> w_avg_8_16x2_rvv_i32                               :    2.2    1.7
> w_avg_8_16x4_c                                     :   16.7   17.0
> w_avg_8_16x4_rvv_i32                               :    3.7    2.5
> w_avg_8_16x8_c                                     :   33.2   33.7
> w_avg_8_16x8_rvv_i32                               :    6.5    4.2
> w_avg_8_16x16_c                                    :   66.2   66.5
> w_avg_8_16x16_rvv_i32                              :   12.0    7.5
> w_avg_8_16x32_c                                    :  133.2  134.0
> w_avg_8_16x32_rvv_i32                              :   23.0   14.2
> w_avg_8_16x64_c                                    :  296.0  276.7
> w_avg_8_16x64_rvv_i32                              :   66.7   38.2
> w_avg_8_16x128_c                                   :  625.2  539.7
> w_avg_8_16x128_rvv_i32                             :  135.5   79.2
> w_avg_8_32x2_c                                     :   16.7   16.7
> w_avg_8_32x2_rvv_i32                               :    3.5    2.0
> w_avg_8_32x4_c                                     :   33.2   33.2
> w_avg_8_32x4_rvv_i32                               :    6.0    3.5
> w_avg_8_32x8_c                                     :   65.7   66.2
> w_avg_8_32x8_rvv_i32                               :   11.2    5.7
> w_avg_8_32x16_c                                    :  132.0  132.0
> w_avg_8_32x16_rvv_i32                              :   21.5   10.7
> w_avg_8_32x32_c                                    :  261.7  272.2
> w_avg_8_32x32_rvv_i32                              :   42.2   20.5
> w_avg_8_32x64_c                                    :  528.2  562.7
> w_avg_8_32x64_rvv_i32                              :   83.5   59.2
> w_avg_8_32x128_c                                   : 1135.5 1070.0
> w_avg_8_32x128_rvv_i32                             :  208.7   96.5
> w_avg_8_64x2_c                                     :   33.0   33.0
> w_avg_8_64x2_rvv_i32                               :    6.0    3.0
> w_avg_8_64x4_c                                     :   65.5   67.0
> w_avg_8_64x4_rvv_i32                               :   11.0    5.2
> w_avg_8_64x8_c                                     :  150.0  134.7
> w_avg_8_64x8_rvv_i32                               :   21.5   10.0
> w_avg_8_64x16_c                                    :  265.2  273.7
> w_avg_8_64x16_rvv_i32                              :   42.2   19.0
> w_avg_8_64x32_c                                    :  629.7  541.7
> w_avg_8_64x32_rvv_i32                              :   83.7   37.7
> w_avg_8_64x64_c                                    : 1259.0 1237.7
> w_avg_8_64x64_rvv_i32                              :  190.7   76.0
> w_avg_8_64x128_c                                   : 2967.0 2209.5
> w_avg_8_64x128_rvv_i32                             :  437.0  190.5
> w_avg_8_128x2_c                                    :   65.7   66.0
> w_avg_8_128x2_rvv_i32                              :   11.2    5.5
> w_avg_8_128x4_c                                    :  131.7  134.7
> w_avg_8_128x4_rvv_i32                              :   21.5   10.0
> w_avg_8_128x8_c                                    :  270.2  264.2
> w_avg_8_128x8_rvv_i32                              :   42.2   19.2
> w_avg_8_128x16_c                                   :  580.0  554.2
> w_avg_8_128x16_rvv_i32                             :   83.5   37.5
> w_avg_8_128x32_c                                   : 1141.0 1206.2
> w_avg_8_128x32_rvv_i32                             :  166.2   74.2
> w_avg_8_128x64_c                                   : 2295.5 2403.5
> w_avg_8_128x64_rvv_i32                             :  408.2  159.0
> w_avg_8_128x128_c                                  : 5367.5 4915.2
> w_avg_8_128x128_rvv_i32                            :  741.2  331.5
>
> test
>
> u makefile
> ---
>  libavcodec/riscv/vvc/Makefile      |   2 +
>  libavcodec/riscv/vvc/vvc_mc_rvv.S  | 295 +++++++++++++++++++++++++++++
>  libavcodec/riscv/vvc/vvcdsp_init.c |  71 +++++++
>  libavcodec/vvc/dsp.c               |   4 +-
>  libavcodec/vvc/dsp.h               |   1 +
>  5 files changed, 372 insertions(+), 1 deletion(-)
>  create mode 100644 libavcodec/riscv/vvc/Makefile
>  create mode 100644 libavcodec/riscv/vvc/vvc_mc_rvv.S
>  create mode 100644 libavcodec/riscv/vvc/vvcdsp_init.c
>
> diff --git a/libavcodec/riscv/vvc/Makefile b/libavcodec/riscv/vvc/Makefile
> new file mode 100644
> index 0000000000..582b051579
> --- /dev/null
> +++ b/libavcodec/riscv/vvc/Makefile
> @@ -0,0 +1,2 @@
> +OBJS-$(CONFIG_VVC_DECODER) += riscv/vvc/vvcdsp_init.o
> +RVV-OBJS-$(CONFIG_VVC_DECODER) += riscv/vvc/vvc_mc_rvv.o
> diff --git a/libavcodec/riscv/vvc/vvc_mc_rvv.S
> b/libavcodec/riscv/vvc/vvc_mc_rvv.S
> new file mode 100644
> index 0000000000..f2128fa776
> --- /dev/null
> +++ b/libavcodec/riscv/vvc/vvc_mc_rvv.S
> @@ -0,0 +1,295 @@
> +/*
> + * Copyright (c) 2024 Institue of Software Chinese Academy of Sciences
> (ISCAS).
> + *
> + * This file is part of FFmpeg.
> + *
> + * FFmpeg is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU Lesser General Public
> + * License as published by the Free Software Foundation; either
> + * version 2.1 of the License, or (at your option) any later version.
> + *
> + * FFmpeg is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * Lesser General Public License for more details.
> + *
> + * You should have received a copy of the GNU Lesser General Public
> + * License along with FFmpeg; if not, write to the Free Software
> + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
> 02110-1301 USA
> + */
> +
> +#include "libavutil/riscv/asm.S"
> +
> +.macro vsetvlstatic8 w, vlen, is_w
> +        .if \w <= 2
> +                vsetivli        zero, \w, e8, mf8, ta, ma
> +        .elseif \w <= 4 && \vlen == 128
> +                vsetivli        zero, \w, e8, mf4, ta, ma
> +        .elseif \w <= 4 && \vlen >= 256
> +                vsetivli        zero, \w, e8, mf8, ta, ma
> +        .elseif \w <= 8 && \vlen == 128
> +                vsetivli        zero, \w, e8, mf2, ta, ma
> +        .elseif \w <= 8 && \vlen >= 256
> +                vsetivli        zero, \w, e8, mf4, ta, ma
> +        .elseif \w <= 16 && \vlen == 128
> +                vsetivli        zero, \w, e8, m1, ta, ma
> +        .elseif \w <= 16 && \vlen >= 256
> +                vsetivli        zero, \w, e8, mf2, ta, ma
> +        .elseif \w <= 32 && \vlen >= 256
> +                li t0, \w
> +                vsetvli         zero, t0, e8, m1, ta, ma
> +        .elseif \w <= (\vlen / 4) || \is_w
> +                li t0, 64
> +                vsetvli         zero, t0, e8, m2, ta, ma
> +        .else
> +                li t0, \w
> +                vsetvli         zero, t0, e8, m4, ta, ma
> +        .endif
> +.endm
> +
> +.macro vsetvlstatic16 w, vlen, is_w
> +        .if \w <= 2
> +                vsetivli        zero, \w, e16, mf4, ta, ma
> +        .elseif \w <= 4 && \vlen == 128
> +                vsetivli        zero, \w, e16, mf2, ta, ma
> +        .elseif \w <= 4 && \vlen >= 256
> +                vsetivli        zero, \w, e16, mf4, ta, ma
> +        .elseif \w <= 8 && \vlen == 128
> +                vsetivli        zero, \w, e16, m1, ta, ma
> +        .elseif \w <= 8 && \vlen >= 256
> +                vsetivli        zero, \w, e16, mf2, ta, ma
> +        .elseif \w <= 16 && \vlen == 128
> +                vsetivli        zero, \w, e16, m2, ta, ma
> +        .elseif \w <= 16 && \vlen >= 256
> +                vsetivli        zero, \w, e16, m1, ta, ma
> +        .elseif \w <= 32 && \vlen >= 256
> +                li t0, \w
> +                vsetvli         zero, t0, e16, m2, ta, ma
> +        .elseif \w <= (\vlen / 4) || \is_w
> +                li t0, 64
> +                vsetvli         zero, t0, e16, m4, ta, ma
> +        .else
> +                li t0, \w
> +                vsetvli         zero, t0, e16, m8, ta, ma
> +        .endif
> +.endm
> +
> +.macro vsetvlstatic32 w, vlen
> +        .if \w <= 2
> +                vsetivli        zero, \w, e32, mf2, ta, ma
> +        .elseif \w <= 4 && \vlen == 128
> +                vsetivli        zero, \w, e32, m1, ta, ma
> +        .elseif \w <= 4 && \vlen >= 256
> +                vsetivli        zero, \w, e32, mf2, ta, ma
> +        .elseif \w <= 8 && \vlen == 128
> +                vsetivli        zero, \w, e32, m2, ta, ma
> +        .elseif \w <= 8 && \vlen >= 256
> +                vsetivli        zero, \w, e32, m1, ta, ma
> +        .elseif \w <= 16 && \vlen == 128
> +                vsetivli        zero, \w, e32, m4, ta, ma
> +        .elseif \w <= 16 && \vlen >= 256
> +                vsetivli        zero, \w, e32, m2, ta, ma
> +        .elseif \w <= 32 && \vlen >= 256
> +                li t0, \w
> +                vsetvli         zero, t0, e32, m4, ta, ma
> +        .else
> +                li t0, \w
> +                vsetvli         zero, t0, e32, m8, ta, ma
> +        .endif
> +.endm
> +
> +.macro avg_nx1 w, vlen
> +        vsetvlstatic16    \w, \vlen, 0
> +        vle16.v           v0, (a2)
> +        vle16.v           v8, (a3)
> +        vadd.vv           v8, v8, v0
> +        vmax.vx           v8, v8, zero
> +        vsetvlstatic8     \w, \vlen, 0
> +        vnclipu.wi        v8, v8, 7
> +        vse8.v            v8, (a0)
> +.endm
> +
> +.macro avg w, vlen, id
> +\id\w\vlen:
> +.if \w < 128
> +        vsetvlstatic16    \w, \vlen, 0
> +        addi              t0, a2, 128*2
> +        addi              t1, a3, 128*2
> +        add               t2, a0, a1
> +        vle16.v           v0, (a2)
> +        vle16.v           v8, (a3)
> +        addi              a5, a5, -2
> +        vle16.v           v16, (t0)
> +        vle16.v           v24, (t1)
> +        vadd.vv           v8, v8, v0
> +        vadd.vv           v24, v24, v16
> +        vmax.vx           v8, v8, zero
> +        vmax.vx           v24, v24, zero
> +        vsetvlstatic8     \w, \vlen, 0
> +        addi              a2, a2, 128*4
> +        vnclipu.wi        v8, v8, 7
> +        vnclipu.wi        v24, v24, 7
> +        addi              a3, a3, 128*4
> +        vse8.v            v8, (a0)
> +        vse8.v            v24, (t2)
> +        sh1add            a0, a1, a0
> +.else
> +        avg_nx1           128, \vlen
> +        addi              a5, a5, -1
> +        .if \vlen == 128
> +        addi              a2, a2, 64*2
> +        addi              a3, a3, 64*2
> +        addi              a0, a0, 64
> +        avg_nx1           128, \vlen
> +        addi              a0, a0, -64
> +        addi              a2, a2, 128
> +        addi              a3, a3, 128
> +        .else
> +        addi              a2, a2, 128*2
> +        addi              a3, a3, 128*2
> +        .endif
> +        add               a0, a0, a1
> +.endif
> +        bnez              a5, \id\w\vlen\()b
> +        ret
> +.endm
> +
> +
> +.macro AVG_JMP_TABLE id, vlen
> +const jmp_table_\id\vlen
> +    .8byte \id\()2\vlen\()f
> +    .8byte \id\()4\vlen\()f
> +    .8byte \id\()8\vlen\()f
> +    .8byte \id\()16\vlen\()f
> +    .8byte \id\()32\vlen\()f
> +    .8byte \id\()64\vlen\()f
> +    .8byte \id\()128\vlen\()f
> +endconst
> +.endm
> +
> +.macro AVG_J vlen, id
> +        clz               a4, a4
> +        li                t0, __riscv_xlen-2
> +        sub               a4, t0, a4
> +        lla               t0, jmp_table_\id\vlen
> +        sh3add            t0, a4, t0
> +        ld                t0, 0(t0)
> +        jr                t0
> +.endm
> +
> +.macro func_avg vlen
> +func ff_vvc_avg_8_rvv_\vlen\(), zve32x
> +        AVG_JMP_TABLE     1, \vlen
> +        csrw              vxrm, zero
> +        AVG_J             \vlen, 1
> +        .irp w,2,4,8,16,32,64,128
> +        avg               \w, \vlen, 1
> +        .endr
> +endfunc
> +.endm
> +
> +.macro w_avg_nx1 w, vlen
> +        vsetvlstatic16    \w, \vlen, 1
> +        vle16.v           v0, (a2)
> +        vle16.v           v8, (a3)
> +        vwmul.vx          v16, v0, a7
> +        vwmacc.vx         v16, t3, v8
> +        vsetvlstatic32    \w, \vlen
> +        vadd.vx           v16, v16, t4
> +        vsetvlstatic16    \w, \vlen, 1
> +        vnsrl.wx          v16, v16, t6
> +        vmax.vx           v16, v16, zero
> +        vsetvlstatic8     \w, \vlen, 1
> +        vnclipu.wi        v16, v16, 0
> +        vse8.v            v16, (a0)
> +.endm
> +
> +.macro w_avg w, vlen, id
> +\id\w\vlen:
> +.if \vlen <= 16
> +        vsetvlstatic16    \w, \vlen, 1
> +        addi              t0, a2, 128*2
> +        addi              t1, a3, 128*2
> +        vle16.v           v0, (a2)
> +        vle16.v           v8, (a3)
> +        addi              a5, a5, -2
> +        vle16.v           v20, (t0)
> +        vle16.v           v24, (t1)
> +        vwmul.vx          v16, v0, a7
> +        vwmul.vx          v28, v20, a7
> +        vwmacc.vx         v16, t3, v8
> +        vwmacc.vx         v28, t3, v24
> +        vsetvlstatic32    \w, \vlen
> +        add               t2, a0, a1
> +        vadd.vx           v16, v16, t4
> +        vadd.vx           v28, v28, t4
> +        vsetvlstatic16    \w, \vlen, 1
> +        vnsrl.wx          v16, v16, t6
> +        vnsrl.wx          v28, v28, t6
> +        vmax.vx           v16, v16, zero
> +        vmax.vx           v28, v28, zero
> +        vsetvlstatic8     \w, \vlen, 1
> +        addi              a2, a2, 128*4
> +        vnclipu.wi        v16, v16, 0
> +        vnclipu.wi        v28, v28, 0
> +        vse8.v            v16, (a0)
> +        addi              a3, a3, 128*4
> +        vse8.v            v28, (t2)
> +        sh1add            a0, a1, a0
> +.else
> +        w_avg_nx1         \w, \vlen
> +        addi              a5, a5, -1
> +        .if \w == (\vlen / 2)
> +        addi              a2, a2, (\vlen / 2)
> +        addi              a3, a3, (\vlen / 2)
> +        addi              a0, a0, (\vlen / 4)
> +        w_avg_nx1         \w, \vlen
> +        addi              a2, a2, -(\vlen / 2)
> +        addi              a3, a3, -(\vlen / 2)
> +        addi              a0, a0, -(\vlen / 4)
> +        .elseif \w == 128 && \vlen == 128
> +        .rept 3
> +        addi              a2, a2, (\vlen / 2)
> +        addi              a3, a3, (\vlen / 2)
> +        addi              a0, a0, (\vlen / 4)
> +        w_avg_nx1         \w, \vlen
> +        .endr
> +        addi              a2, a2, -(\vlen / 2) * 3
> +        addi              a3, a3, -(\vlen / 2) * 3
> +        addi              a0, a0, -(\vlen / 4) * 3
> +        .endif
> +
> +        addi              a2, a2, 128*2
> +        addi              a3, a3, 128*2
> +        add               a0, a0, a1
> +.endif
> +        bnez              a5, \id\w\vlen\()b
> +        ret
> +.endm
> +
> +
> +.macro func_w_avg vlen
> +func ff_vvc_w_avg_8_rvv_\vlen\(), zve32x
> +        AVG_JMP_TABLE     2, \vlen
> +        csrw              vxrm, zero
> +        addi              t6, a6, 7
> +        ld                t3, (sp)
> +        ld                t4, 8(sp)
> +        ld                t5, 16(sp)
> +        add               t4, t4, t5
> +        addi              t4, t4, 1       // o0 + o1 + 1
> +        addi              t5, t6, -1      // shift - 1
> +        sll               t4, t4, t5
> +        AVG_J             \vlen, 2
> +        .irp w,2,4,8,16,32,64,128
> +        w_avg             \w, \vlen, 2
> +        .endr
> +endfunc
> +.endm
> +
> +func_avg 128
> +func_avg 256
> +#if (__riscv_xlen == 64)
> +func_w_avg 128
> +func_w_avg 256
> +#endif
> diff --git a/libavcodec/riscv/vvc/vvcdsp_init.c
> b/libavcodec/riscv/vvc/vvcdsp_init.c
> new file mode 100644
> index 0000000000..85b1ede061
> --- /dev/null
> +++ b/libavcodec/riscv/vvc/vvcdsp_init.c
> @@ -0,0 +1,71 @@
> +/*
> + * Copyright (c) 2024 Institue of Software Chinese Academy of Sciences
> (ISCAS).
> + *
> + * This file is part of FFmpeg.
> + *
> + * FFmpeg is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU Lesser General Public
> + * License as published by the Free Software Foundation; either
> + * version 2.1 of the License, or (at your option) any later version.
> + *
> + * FFmpeg is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * Lesser General Public License for more details.
> + *
> + * You should have received a copy of the GNU Lesser General Public
> + * License along with FFmpeg; if not, write to the Free Software
> + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
> + */
> +
> +#include "config.h"
> +
> +#include "libavutil/attributes.h"
> +#include "libavutil/cpu.h"
> +#include "libavutil/riscv/cpu.h"
> +#include "libavcodec/vvc/dsp.h"
> +
> +#define bf(fn, bd,  opt) fn##_##bd##_##opt
> +
> +#define AVG_PROTOTYPES(bd, opt)                                              \
> +void bf(ff_vvc_avg, bd, opt)(uint8_t *dst, ptrdiff_t dst_stride,             \
> +    const int16_t *src0, const int16_t *src1, int width, int height);        \
> +void bf(ff_vvc_w_avg, bd, opt)(uint8_t *dst, ptrdiff_t dst_stride,           \
> +    const int16_t *src0, const int16_t *src1, int width, int height,         \
> +    int denom, int w0, int w1, int o0, int o1);
> +
> +AVG_PROTOTYPES(8, rvv_128)
> +AVG_PROTOTYPES(8, rvv_256)
> +
> +void ff_vvc_dsp_init_riscv(VVCDSPContext *const c, const int bd)
> +{
> +#if HAVE_RVV
> +    const int flags = av_get_cpu_flags();
> +
> +    if ((flags & AV_CPU_FLAG_RVV_I32) && (flags & AV_CPU_FLAG_RVB_ADDR) &&
> +        ff_rv_vlen_least(256)) {
> +        switch (bd) {
> +            case 8:
> +                c->inter.avg    = ff_vvc_avg_8_rvv_256;
> +# if (__riscv_xlen == 64)
> +                c->inter.w_avg    = ff_vvc_w_avg_8_rvv_256;
> +# endif
> +                break;
> +            default:
> +                break;
> +        }
> +    } else if ((flags & AV_CPU_FLAG_RVV_I32) && (flags & AV_CPU_FLAG_RVB_ADDR) &&
> +               ff_rv_vlen_least(128)) {
> +        switch (bd) {
> +            case 8:
> +                c->inter.avg    = ff_vvc_avg_8_rvv_128;
> +# if (__riscv_xlen == 64)
> +                c->inter.w_avg    = ff_vvc_w_avg_8_rvv_128;
> +# endif
> +                break;
> +            default:
> +                break;
> +        }
> +    }
> +#endif
> +}
> diff --git a/libavcodec/vvc/dsp.c b/libavcodec/vvc/dsp.c
> index 41e830a98a..c55a37d255 100644
> --- a/libavcodec/vvc/dsp.c
> +++ b/libavcodec/vvc/dsp.c
> @@ -121,7 +121,9 @@ void ff_vvc_dsp_init(VVCDSPContext *vvcdsp, int bit_depth)
>          break;
>      }
>
> -#if ARCH_X86
> +#if ARCH_RISCV
> +    ff_vvc_dsp_init_riscv(vvcdsp, bit_depth);
> +#elif ARCH_X86
>      ff_vvc_dsp_init_x86(vvcdsp, bit_depth);
>  #endif
>  }
> diff --git a/libavcodec/vvc/dsp.h b/libavcodec/vvc/dsp.h
> index 1f14096c41..e03236dd76 100644
> --- a/libavcodec/vvc/dsp.h
> +++ b/libavcodec/vvc/dsp.h
> @@ -180,6 +180,7 @@ typedef struct VVCDSPContext {
>
>  void ff_vvc_dsp_init(VVCDSPContext *hpc, int bit_depth);
>
> +void ff_vvc_dsp_init_riscv(VVCDSPContext *hpc, const int bit_depth);
>  void ff_vvc_dsp_init_x86(VVCDSPContext *hpc, const int bit_depth);
>
>  #endif /* AVCODEC_VVC_DSP_H */
> --
> 2.45.1
>
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [FFmpeg-devel] [PATCH v2] lavc/vvc_mc: R-V V avg w_avg
  2024-06-01 18:01 [FFmpeg-devel] [PATCH v2] lavc/vvc_mc: R-V V avg w_avg uk7b
  2024-06-01 18:02 ` flow gg
@ 2024-06-01 19:54 ` Rémi Denis-Courmont
  2024-06-01 20:11   ` flow gg
  1 sibling, 1 reply; 4+ messages in thread
From: Rémi Denis-Courmont @ 2024-06-01 19:54 UTC (permalink / raw)
  To: ffmpeg-devel

On Saturday, 1 June 2024 at 21:01:16 EEST, uk7b@foxmail.com wrote:
> From: sunyuechi <sunyuechi@iscas.ac.cn>
> 
>                                                       C908   X60
> avg_8_2x2_c                                        :    1.0    1.0
> avg_8_2x2_rvv_i32                                  :    1.0    1.0

I think we can drop the 2x2 transforms. In all likelihood, scalar code will 
end up faster than vector code on future hardware, especially out-of-order 
pipelines.
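
For scale, one output pixel of the 8-bit avg boils down to a handful of scalar
instructions, roughly (an illustrative sketch only, assuming Zbb for max/minu;
register choice is arbitrary):

        lh      t0, (a2)        # src0 sample, int16
        lh      t1, (a3)        # src1 sample, int16
        add     t0, t0, t1
        max     t0, t0, zero    # clamp negative sums to 0
        addi    t0, t0, 64      # round to nearest, matching vxrm = rnu
        srli    t0, t0, 7
        li      t1, 255
        minu    t0, t0, t1      # saturate to 8 bits
        sb      t0, (a0)

A 2x2 block is only four of these, with no vsetvl overhead at all.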

> avg_8_2x4_c                                        :    1.7    2.0
> avg_8_2x4_rvv_i32                                  :    1.2    1.2
> avg_8_2x8_c                                        :    3.7    4.0
> avg_8_2x8_rvv_i32                                  :    2.0    2.0
> avg_8_2x16_c                                       :    7.2    7.5
> avg_8_2x16_rvv_i32                                 :    3.2    3.0
> avg_8_2x32_c                                       :   14.2   15.0
> avg_8_2x32_rvv_i32                                 :    5.7    5.0
> avg_8_2x64_c                                       :   46.7   44.2
> avg_8_2x64_rvv_i32                                 :   39.2   36.0
> avg_8_2x128_c                                      :   99.7   80.0
> avg_8_2x128_rvv_i32                                :   86.2   65.5
> avg_8_4x2_c                                        :    2.0    2.0
> avg_8_4x2_rvv_i32                                  :    1.0    1.0
> avg_8_4x4_c                                        :    3.5    3.7
> avg_8_4x4_rvv_i32                                  :    1.5    1.2
> avg_8_4x8_c                                        :    6.5    7.0
> avg_8_4x8_rvv_i32                                  :    2.0    1.7
> avg_8_4x16_c                                       :   13.5   14.0
> avg_8_4x16_rvv_i32                                 :    3.2    2.7
> avg_8_4x32_c                                       :   26.2   27.5
> avg_8_4x32_rvv_i32                                 :    5.7    5.0
> avg_8_4x64_c                                       :   75.0   65.7
> avg_8_4x64_rvv_i32                                 :   44.0   32.0
> avg_8_4x128_c                                      :  165.2  118.5
> avg_8_4x128_rvv_i32                                :   81.5   71.0
> avg_8_8x2_c                                        :    3.2    3.5
> avg_8_8x2_rvv_i32                                  :    1.2    1.0
> avg_8_8x4_c                                        :    6.5    6.5
> avg_8_8x4_rvv_i32                                  :    1.5    1.5
> avg_8_8x8_c                                        :   12.5   13.2
> avg_8_8x8_rvv_i32                                  :    2.2    1.7
> avg_8_8x16_c                                       :   25.2   26.5
> avg_8_8x16_rvv_i32                                 :    3.7    2.7
> avg_8_8x32_c                                       :   50.0   52.5
> avg_8_8x32_rvv_i32                                 :    6.7    5.2
> avg_8_8x64_c                                       :  120.7  119.0
> avg_8_8x64_rvv_i32                                 :   43.2   33.5
> avg_8_8x128_c                                      :  247.5  217.7
> avg_8_8x128_rvv_i32                                :  100.5   74.7
> avg_8_16x2_c                                       :    6.2    6.5
> avg_8_16x2_rvv_i32                                 :    1.2    1.0
> avg_8_16x4_c                                       :   12.2   13.0
> avg_8_16x4_rvv_i32                                 :    2.0    1.2
> avg_8_16x8_c                                       :   24.5   25.7
> avg_8_16x8_rvv_i32                                 :    3.2    2.0
> avg_8_16x16_c                                      :   48.7   51.2
> avg_8_16x16_rvv_i32                                :    5.7    3.2
> avg_8_16x32_c                                      :   97.5  102.7
> avg_8_16x32_rvv_i32                                :   10.7    6.0
> avg_8_16x64_c                                      :  213.0  215.0
> avg_8_16x64_rvv_i32                                :   51.5   33.5
> avg_8_16x128_c                                     :  408.5  417.0
> avg_8_16x128_rvv_i32                               :  102.0   71.5
> avg_8_32x2_c                                       :   12.2   13.0
> avg_8_32x2_rvv_i32                                 :    2.0    1.2
> avg_8_32x4_c                                       :   24.5   25.5
> avg_8_32x4_rvv_i32                                 :    3.2    1.7
> avg_8_32x8_c                                       :   48.5   50.7
> avg_8_32x8_rvv_i32                                 :    5.7    3.0
> avg_8_32x16_c                                      :   96.5  101.5
> avg_8_32x16_rvv_i32                                :   10.5    5.0
> avg_8_32x32_c                                      :  210.2  202.5
> avg_8_32x32_rvv_i32                                :   20.2    9.7
> avg_8_32x64_c                                      :  431.7  417.2
> avg_8_32x64_rvv_i32                                :   68.0   46.0
> avg_8_32x128_c                                     :  822.2  819.0
> avg_8_32x128_rvv_i32                               :  152.2   69.0
> avg_8_64x2_c                                       :   24.0   25.2
> avg_8_64x2_rvv_i32                                 :    3.0    1.7
> avg_8_64x4_c                                       :   48.2   51.0
> avg_8_64x4_rvv_i32                                 :    5.5    2.7
> avg_8_64x8_c                                       :   96.7  101.5
> avg_8_64x8_rvv_i32                                 :   10.0    5.0
> avg_8_64x16_c                                      :  193.5  203.0
> avg_8_64x16_rvv_i32                                :   19.2    9.2
> avg_8_64x32_c                                      :  404.2  405.7
> avg_8_64x32_rvv_i32                                :   37.7   18.0
> avg_8_64x64_c                                      :  846.2  841.7
> avg_8_64x64_rvv_i32                                :  136.5   35.2
> avg_8_64x128_c                                     : 1659.5 1662.2
> avg_8_64x128_rvv_i32                               :  236.7  177.7
> avg_8_128x2_c                                      :   48.7   51.0
> avg_8_128x2_rvv_i32                                :    5.2    2.7
> avg_8_128x4_c                                      :   96.7  101.0
> avg_8_128x4_rvv_i32                                :    9.7    4.7
> avg_8_128x8_c                                      :  226.0  201.5
> avg_8_128x8_rvv_i32                                :   18.7    8.7
> avg_8_128x16_c                                     :  402.7  402.7
> avg_8_128x16_rvv_i32                               :   37.0   17.2
> avg_8_128x32_c                                     :  791.2  805.0
> avg_8_128x32_rvv_i32                               :   73.5   33.5
> avg_8_128x64_c                                     : 1616.7 1645.2
> avg_8_128x64_rvv_i32                               :  223.7   68.2
> avg_8_128x128_c                                    : 3202.0 3235.2
> avg_8_128x128_rvv_i32                              :  390.0  314.2
> w_avg_8_2x2_c                                      :    1.7    1.5
> w_avg_8_2x2_rvv_i32                                :    1.7    1.5
> w_avg_8_2x4_c                                      :    2.7    2.5
> w_avg_8_2x4_rvv_i32                                :    2.7    2.5
> w_avg_8_2x8_c                                      :    5.0    5.0
> w_avg_8_2x8_rvv_i32                                :    4.5    4.0
> w_avg_8_2x16_c                                     :   26.5    9.5
> w_avg_8_2x16_rvv_i32                               :    8.0    7.0
> w_avg_8_2x32_c                                     :   18.7   18.5
> w_avg_8_2x32_rvv_i32                               :   15.0   13.2
> w_avg_8_2x64_c                                     :   58.5   46.5
> w_avg_8_2x64_rvv_i32                               :   49.7   38.7
> w_avg_8_2x128_c                                    :  121.7   85.2
> w_avg_8_2x128_rvv_i32                              :   89.7   81.0
> w_avg_8_4x2_c                                      :    2.5    2.5
> w_avg_8_4x2_rvv_i32                                :    1.7    1.5
> w_avg_8_4x4_c                                      :    4.7    4.7
> w_avg_8_4x4_rvv_i32                                :    2.7    2.2
> w_avg_8_4x8_c                                      :    9.0    9.0
> w_avg_8_4x8_rvv_i32                                :    4.5    4.0
> w_avg_8_4x16_c                                     :   17.7   17.7
> w_avg_8_4x16_rvv_i32                               :    8.0    7.0
> w_avg_8_4x32_c                                     :   35.0   35.0
> w_avg_8_4x32_rvv_i32                               :   15.0   13.5
> w_avg_8_4x64_c                                     :   95.2   80.2
> w_avg_8_4x64_rvv_i32                               :   47.7   38.0
> w_avg_8_4x128_c                                    :  197.7  164.7
> w_avg_8_4x128_rvv_i32                              :  101.7   81.5
> w_avg_8_8x2_c                                      :    4.5    4.5
> w_avg_8_8x2_rvv_i32                                :    2.0    1.7
> w_avg_8_8x4_c                                      :    8.7    8.7
> w_avg_8_8x4_rvv_i32                                :    2.7    2.5
> w_avg_8_8x8_c                                      :   33.5   17.0
> w_avg_8_8x8_rvv_i32                                :    4.7    4.0
> w_avg_8_8x16_c                                     :   34.0   34.0
> w_avg_8_8x16_rvv_i32                               :    8.5    7.2
> w_avg_8_8x32_c                                     :   85.5   67.7
> w_avg_8_8x32_rvv_i32                               :   16.2   13.5
> w_avg_8_8x64_c                                     :  162.5  148.2
> w_avg_8_8x64_rvv_i32                               :   50.0   36.5
> w_avg_8_8x128_c                                    :  380.2  301.5
> w_avg_8_8x128_rvv_i32                              :   87.2   79.5
> w_avg_8_16x2_c                                     :    8.5    8.7
> w_avg_8_16x2_rvv_i32                               :    2.2    1.7
> w_avg_8_16x4_c                                     :   16.7   17.0
> w_avg_8_16x4_rvv_i32                               :    3.7    2.5
> w_avg_8_16x8_c                                     :   33.2   33.7
> w_avg_8_16x8_rvv_i32                               :    6.5    4.2
> w_avg_8_16x16_c                                    :   66.2   66.5
> w_avg_8_16x16_rvv_i32                              :   12.0    7.5
> w_avg_8_16x32_c                                    :  133.2  134.0
> w_avg_8_16x32_rvv_i32                              :   23.0   14.2
> w_avg_8_16x64_c                                    :  296.0  276.7
> w_avg_8_16x64_rvv_i32                              :   66.7   38.2
> w_avg_8_16x128_c                                   :  625.2  539.7
> w_avg_8_16x128_rvv_i32                             :  135.5   79.2
> w_avg_8_32x2_c                                     :   16.7   16.7
> w_avg_8_32x2_rvv_i32                               :    3.5    2.0
> w_avg_8_32x4_c                                     :   33.2   33.2
> w_avg_8_32x4_rvv_i32                               :    6.0    3.5
> w_avg_8_32x8_c                                     :   65.7   66.2
> w_avg_8_32x8_rvv_i32                               :   11.2    5.7
> w_avg_8_32x16_c                                    :  132.0  132.0
> w_avg_8_32x16_rvv_i32                              :   21.5   10.7
> w_avg_8_32x32_c                                    :  261.7  272.2
> w_avg_8_32x32_rvv_i32                              :   42.2   20.5
> w_avg_8_32x64_c                                    :  528.2  562.7
> w_avg_8_32x64_rvv_i32                              :   83.5   59.2
> w_avg_8_32x128_c                                   : 1135.5 1070.0
> w_avg_8_32x128_rvv_i32                             :  208.7   96.5
> w_avg_8_64x2_c                                     :   33.0   33.0
> w_avg_8_64x2_rvv_i32                               :    6.0    3.0
> w_avg_8_64x4_c                                     :   65.5   67.0
> w_avg_8_64x4_rvv_i32                               :   11.0    5.2
> w_avg_8_64x8_c                                     :  150.0  134.7
> w_avg_8_64x8_rvv_i32                               :   21.5   10.0
> w_avg_8_64x16_c                                    :  265.2  273.7
> w_avg_8_64x16_rvv_i32                              :   42.2   19.0
> w_avg_8_64x32_c                                    :  629.7  541.7
> w_avg_8_64x32_rvv_i32                              :   83.7   37.7
> w_avg_8_64x64_c                                    : 1259.0 1237.7
> w_avg_8_64x64_rvv_i32                              :  190.7   76.0
> w_avg_8_64x128_c                                   : 2967.0 2209.5
> w_avg_8_64x128_rvv_i32                             :  437.0  190.5
> w_avg_8_128x2_c                                    :   65.7   66.0
> w_avg_8_128x2_rvv_i32                              :   11.2    5.5
> w_avg_8_128x4_c                                    :  131.7  134.7
> w_avg_8_128x4_rvv_i32                              :   21.5   10.0
> w_avg_8_128x8_c                                    :  270.2  264.2
> w_avg_8_128x8_rvv_i32                              :   42.2   19.2
> w_avg_8_128x16_c                                   :  580.0  554.2
> w_avg_8_128x16_rvv_i32                             :   83.5   37.5
> w_avg_8_128x32_c                                   : 1141.0 1206.2
> w_avg_8_128x32_rvv_i32                             :  166.2   74.2
> w_avg_8_128x64_c                                   : 2295.5 2403.5
> w_avg_8_128x64_rvv_i32                             :  408.2  159.0
> w_avg_8_128x128_c                                  : 5367.5 4915.2
> w_avg_8_128x128_rvv_i32                            :  741.2  331.5
> 
> test
> 
> u makefile

Squash leftovers? (The "test" / "u makefile" lines above look like stray commit-message notes.)

> ---
>  libavcodec/riscv/vvc/Makefile      |   2 +
>  libavcodec/riscv/vvc/vvc_mc_rvv.S  | 295 +++++++++++++++++++++++++++++
>  libavcodec/riscv/vvc/vvcdsp_init.c |  71 +++++++
>  libavcodec/vvc/dsp.c               |   4 +-
>  libavcodec/vvc/dsp.h               |   1 +
>  5 files changed, 372 insertions(+), 1 deletion(-)
>  create mode 100644 libavcodec/riscv/vvc/Makefile
>  create mode 100644 libavcodec/riscv/vvc/vvc_mc_rvv.S
>  create mode 100644 libavcodec/riscv/vvc/vvcdsp_init.c
> 
> diff --git a/libavcodec/riscv/vvc/Makefile b/libavcodec/riscv/vvc/Makefile
> new file mode 100644
> index 0000000000..582b051579
> --- /dev/null
> +++ b/libavcodec/riscv/vvc/Makefile
> @@ -0,0 +1,2 @@
> +OBJS-$(CONFIG_VVC_DECODER) += riscv/vvc/vvcdsp_init.o
> +RVV-OBJS-$(CONFIG_VVC_DECODER) += riscv/vvc/vvc_mc_rvv.o
> diff --git a/libavcodec/riscv/vvc/vvc_mc_rvv.S b/libavcodec/riscv/vvc/vvc_mc_rvv.S
> new file mode 100644
> index 0000000000..f2128fa776
> --- /dev/null
> +++ b/libavcodec/riscv/vvc/vvc_mc_rvv.S
> @@ -0,0 +1,295 @@
> +/*
> + * Copyright (c) 2024 Institute of Software Chinese Academy of Sciences (ISCAS).
> + *
> + * This file is part of FFmpeg.
> + *
> + * FFmpeg is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU Lesser General Public
> + * License as published by the Free Software Foundation; either
> + * version 2.1 of the License, or (at your option) any later version.
> + *
> + * FFmpeg is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * Lesser General Public License for more details.
> + *
> + * You should have received a copy of the GNU Lesser General Public
> + * License along with FFmpeg; if not, write to the Free Software
> + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
> + */
> +
> +#include "libavutil/riscv/asm.S"
> +
> +.macro vsetvlstatic8 w, vlen, is_w
> +        .if \w <= 2
> +                vsetivli        zero, \w, e8, mf8, ta, ma
> +        .elseif \w <= 4 && \vlen == 128
> +                vsetivli        zero, \w, e8, mf4, ta, ma
> +        .elseif \w <= 4 && \vlen >= 256
> +                vsetivli        zero, \w, e8, mf8, ta, ma
> +        .elseif \w <= 8 && \vlen == 128
> +                vsetivli        zero, \w, e8, mf2, ta, ma
> +        .elseif \w <= 8 && \vlen >= 256
> +                vsetivli        zero, \w, e8, mf4, ta, ma
> +        .elseif \w <= 16 && \vlen == 128
> +                vsetivli        zero, \w, e8, m1, ta, ma
> +        .elseif \w <= 16 && \vlen >= 256
> +                vsetivli        zero, \w, e8, mf2, ta, ma
> +        .elseif \w <= 32 && \vlen >= 256
> +                li t0, \w
> +                vsetvli         zero, t0, e8, m1, ta, ma
> +        .elseif \w <= (\vlen / 4) || \is_w
> +                li t0, 64
> +                vsetvli         zero, t0, e8, m2, ta, ma
> +        .else
> +                li t0, \w
> +                vsetvli         zero, t0, e8, m4, ta, ma
> +        .endif
> +.endm
> +
> +.macro vsetvlstatic16 w, vlen, is_w
> +        .if \w <= 2
> +                vsetivli        zero, \w, e16, mf4, ta, ma
> +        .elseif \w <= 4 && \vlen == 128
> +                vsetivli        zero, \w, e16, mf2, ta, ma
> +        .elseif \w <= 4 && \vlen >= 256
> +                vsetivli        zero, \w, e16, mf4, ta, ma
> +        .elseif \w <= 8 && \vlen == 128
> +                vsetivli        zero, \w, e16, m1, ta, ma
> +        .elseif \w <= 8 && \vlen >= 256
> +                vsetivli        zero, \w, e16, mf2, ta, ma
> +        .elseif \w <= 16 && \vlen == 128
> +                vsetivli        zero, \w, e16, m2, ta, ma
> +        .elseif \w <= 16 && \vlen >= 256
> +                vsetivli        zero, \w, e16, m1, ta, ma
> +        .elseif \w <= 32 && \vlen >= 256
> +                li t0, \w
> +                vsetvli         zero, t0, e16, m2, ta, ma
> +        .elseif \w <= (\vlen / 4) || \is_w
> +                li t0, 64
> +                vsetvli         zero, t0, e16, m4, ta, ma
> +        .else
> +                li t0, \w
> +                vsetvli         zero, t0, e16, m8, ta, ma
> +        .endif
> +.endm
> +
> +.macro vsetvlstatic32 w, vlen
> +        .if \w <= 2
> +                vsetivli        zero, \w, e32, mf2, ta, ma
> +        .elseif \w <= 4 && \vlen == 128
> +                vsetivli        zero, \w, e32, m1, ta, ma
> +        .elseif \w <= 4 && \vlen >= 256
> +                vsetivli        zero, \w, e32, mf2, ta, ma
> +        .elseif \w <= 8 && \vlen == 128
> +                vsetivli        zero, \w, e32, m2, ta, ma
> +        .elseif \w <= 8 && \vlen >= 256
> +                vsetivli        zero, \w, e32, m1, ta, ma
> +        .elseif \w <= 16 && \vlen == 128
> +                vsetivli        zero, \w, e32, m4, ta, ma
> +        .elseif \w <= 16 && \vlen >= 256
> +                vsetivli        zero, \w, e32, m2, ta, ma
> +        .elseif \w <= 32 && \vlen >= 256
> +                li t0, \w
> +                vsetvli         zero, t0, e32, m4, ta, ma
> +        .else
> +                li t0, \w
> +                vsetvli         zero, t0, e32, m8, ta, ma
> +        .endif
> +.endm
> +
> +.macro avg_nx1 w, vlen
> +        vsetvlstatic16    \w, \vlen, 0
> +        vle16.v           v0, (a2)
> +        vle16.v           v8, (a3)
> +        vadd.vv           v8, v8, v0
> +        vmax.vx           v8, v8, zero
> +        vsetvlstatic8     \w, \vlen, 0
> +        vnclipu.wi        v8, v8, 7
> +        vse8.v            v8, (a0)
> +.endm
> +
> +.macro avg w, vlen, id
> +\id\w\vlen:
> +.if \w < 128
> +        vsetvlstatic16    \w, \vlen, 0
> +        addi              t0, a2, 128*2
> +        addi              t1, a3, 128*2
> +        add               t2, a0, a1
> +        vle16.v           v0, (a2)
> +        vle16.v           v8, (a3)
> +        addi              a5, a5, -2
> +        vle16.v           v16, (t0)
> +        vle16.v           v24, (t1)
> +        vadd.vv           v8, v8, v0
> +        vadd.vv           v24, v24, v16
> +        vmax.vx           v8, v8, zero
> +        vmax.vx           v24, v24, zero
> +        vsetvlstatic8     \w, \vlen, 0
> +        addi              a2, a2, 128*4
> +        vnclipu.wi        v8, v8, 7
> +        vnclipu.wi        v24, v24, 7
> +        addi              a3, a3, 128*4
> +        vse8.v            v8, (a0)
> +        vse8.v            v24, (t2)
> +        sh1add            a0, a1, a0
> +.else
> +        avg_nx1           128, \vlen
> +        addi              a5, a5, -1
> +        .if \vlen == 128
> +        addi              a2, a2, 64*2
> +        addi              a3, a3, 64*2
> +        addi              a0, a0, 64
> +        avg_nx1           128, \vlen
> +        addi              a0, a0, -64
> +        addi              a2, a2, 128
> +        addi              a3, a3, 128
> +        .else
> +        addi              a2, a2, 128*2
> +        addi              a3, a3, 128*2
> +        .endif
> +        add               a0, a0, a1
> +.endif
> +        bnez              a5, \id\w\vlen\()b
> +        ret
> +.endm
> +
> +
> +.macro AVG_JMP_TABLE id, vlen
> +const jmp_table_\id\vlen
> +    .8byte \id\()2\vlen\()f
> +    .8byte \id\()4\vlen\()f
> +    .8byte \id\()8\vlen\()f
> +    .8byte \id\()16\vlen\()f
> +    .8byte \id\()32\vlen\()f
> +    .8byte \id\()64\vlen\()f
> +    .8byte \id\()128\vlen\()f

AFAIU, this will generate relocations. I wonder if the linker is smart enough to 
put that into .data.rel.ro rather than whine that it cannot live in .rodata?

In assembler, we can dodge the problem entirely by storing relative offsets 
rather than addresses. You can also stick to 4- or even 2-byte values then. 
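
For instance, with 4-byte offsets stored relative to the table itself, the two
macros could look roughly like this (untested sketch, reusing the patch's own
label scheme):

.macro AVG_JMP_TABLE id, vlen
const jmp_table_\id\vlen
        .4byte    \id\()2\vlen\()f   - jmp_table_\id\vlen
        .4byte    \id\()4\vlen\()f   - jmp_table_\id\vlen
        .4byte    \id\()8\vlen\()f   - jmp_table_\id\vlen
        .4byte    \id\()16\vlen\()f  - jmp_table_\id\vlen
        .4byte    \id\()32\vlen\()f  - jmp_table_\id\vlen
        .4byte    \id\()64\vlen\()f  - jmp_table_\id\vlen
        .4byte    \id\()128\vlen\()f - jmp_table_\id\vlen
endconst
.endm

.macro AVG_J vlen, id
        clz               a4, a4
        li                t0, __riscv_xlen-2
        sub               a4, t0, a4
        lla               t0, jmp_table_\id\vlen
        sh2add            t1, a4, t0    # 4-byte entries
        lw                t1, (t1)      # signed offset from the table base
        add               t0, t0, t1
        jr                t0
.endm

The label-minus-label differences are resolved at static link time, so nothing
needs a runtime relocation.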

> +endconst
> +.endm
> +
> +.macro AVG_J vlen, id
> +        clz               a4, a4
> +        li                t0, __riscv_xlen-2
> +        sub               a4, t0, a4
> +        lla               t0, jmp_table_\id\vlen
> +        sh3add            t0, a4, t0
> +        ld                t0, 0(t0)

LLA expands to AUIPC; ADDI. You can avoid that ADDI by folding the low bits 
into LD. See how ff_h263_loop_filter_strength is addressed in h263dsp_rvv.S.
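
With the 8-byte table kept as is, that would be roughly (untested sketch):

.macro AVG_J vlen, id
        clz               a4, a4
        li                t0, __riscv_xlen-2
        sub               a4, t0, a4
1:
        auipc             t0, %pcrel_hi(jmp_table_\id\vlen)
        sh3add            t0, a4, t0
        ld                t0, %pcrel_lo(1b)(t0)
        jr                t0
.endm

The %pcrel_lo still refers back to the AUIPC at 1:, so the scaled index can be
added in between.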

> +        jr                t0
> +.endm
> +
> +.macro func_avg vlen
> +func ff_vvc_avg_8_rvv_\vlen\(), zve32x
> +        AVG_JMP_TABLE     1, \vlen
> +        csrw              vxrm, zero
> +        AVG_J             \vlen, 1
> +        .irp w,2,4,8,16,32,64,128
> +        avg               \w, \vlen, 1
> +        .endr
> +endfunc
> +.endm
> +
> +.macro w_avg_nx1 w, vlen
> +        vsetvlstatic16    \w, \vlen, 1
> +        vle16.v           v0, (a2)
> +        vle16.v           v8, (a3)
> +        vwmul.vx          v16, v0, a7
> +        vwmacc.vx         v16, t3, v8
> +        vsetvlstatic32    \w, \vlen
> +        vadd.vx           v16, v16, t4
> +        vsetvlstatic16    \w, \vlen, 1
> +        vnsrl.wx          v16, v16, t6
> +        vmax.vx           v16, v16, zero
> +        vsetvlstatic8     \w, \vlen, 1
> +        vnclipu.wi        v16, v16, 0
> +        vse8.v            v16, (a0)
> +.endm
> +
> +.macro w_avg w, vlen, id
> +\id\w\vlen:
> +.if \vlen <= 16
> +        vsetvlstatic16    \w, \vlen, 1
> +        addi              t0, a2, 128*2
> +        addi              t1, a3, 128*2
> +        vle16.v           v0, (a2)
> +        vle16.v           v8, (a3)
> +        addi              a5, a5, -2
> +        vle16.v           v20, (t0)
> +        vle16.v           v24, (t1)
> +        vwmul.vx          v16, v0, a7
> +        vwmul.vx          v28, v20, a7
> +        vwmacc.vx         v16, t3, v8
> +        vwmacc.vx         v28, t3, v24
> +        vsetvlstatic32    \w, \vlen
> +        add               t2, a0, a1
> +        vadd.vx           v16, v16, t4
> +        vadd.vx           v28, v28, t4
> +        vsetvlstatic16    \w, \vlen, 1
> +        vnsrl.wx          v16, v16, t6
> +        vnsrl.wx          v28, v28, t6
> +        vmax.vx           v16, v16, zero
> +        vmax.vx           v28, v28, zero
> +        vsetvlstatic8     \w, \vlen, 1
> +        addi              a2, a2, 128*4
> +        vnclipu.wi        v16, v16, 0
> +        vnclipu.wi        v28, v28, 0
> +        vse8.v            v16, (a0)
> +        addi              a3, a3, 128*4
> +        vse8.v            v28, (t2)
> +        sh1add            a0, a1, a0
> +.else
> +        w_avg_nx1         \w, \vlen
> +        addi              a5, a5, -1
> +        .if \w == (\vlen / 2)
> +        addi              a2, a2, (\vlen / 2)
> +        addi              a3, a3, (\vlen / 2)
> +        addi              a0, a0, (\vlen / 4)
> +        w_avg_nx1         \w, \vlen
> +        addi              a2, a2, -(\vlen / 2)
> +        addi              a3, a3, -(\vlen / 2)
> +        addi              a0, a0, -(\vlen / 4)
> +        .elseif \w == 128 && \vlen == 128
> +        .rept 3
> +        addi              a2, a2, (\vlen / 2)
> +        addi              a3, a3, (\vlen / 2)
> +        addi              a0, a0, (\vlen / 4)
> +        w_avg_nx1         \w, \vlen
> +        .endr
> +        addi              a2, a2, -(\vlen / 2) * 3
> +        addi              a3, a3, -(\vlen / 2) * 3
> +        addi              a0, a0, -(\vlen / 4) * 3
> +        .endif
> +
> +        addi              a2, a2, 128*2
> +        addi              a3, a3, 128*2
> +        add               a0, a0, a1
> +.endif
> +        bnez              a5, \id\w\vlen\()b
> +        ret
> +.endm
> +
> +
> +.macro func_w_avg vlen
> +func ff_vvc_w_avg_8_rvv_\vlen\(), zve32x
> +        AVG_JMP_TABLE     2, \vlen
> +        csrw              vxrm, zero
> +        addi              t6, a6, 7
> +        ld                t3, (sp)
> +        ld                t4, 8(sp)
> +        ld                t5, 16(sp)
> +        add               t4, t4, t5
> +        addi              t4, t4, 1       // o0 + o1 + 1
> +        addi              t5, t6, -1      // shift - 1
> +        sll               t4, t4, t5
> +        AVG_J             \vlen, 2
> +        .irp w,2,4,8,16,32,64,128
> +        w_avg             \w, \vlen, 2
> +        .endr
> +endfunc
> +.endm
> +
> +func_avg 128
> +func_avg 256
> +#if (__riscv_xlen == 64)
> +func_w_avg 128
> +func_w_avg 256
> +#endif
> diff --git a/libavcodec/riscv/vvc/vvcdsp_init.c b/libavcodec/riscv/vvc/vvcdsp_init.c
> new file mode 100644
> index 0000000000..85b1ede061
> --- /dev/null
> +++ b/libavcodec/riscv/vvc/vvcdsp_init.c
> @@ -0,0 +1,71 @@
> +/*
> + * Copyright (c) 2024 Institute of Software Chinese Academy of Sciences (ISCAS).
> + *
> + * This file is part of FFmpeg.
> + *
> + * FFmpeg is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU Lesser General Public
> + * License as published by the Free Software Foundation; either
> + * version 2.1 of the License, or (at your option) any later version.
> + *
> + * FFmpeg is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * Lesser General Public License for more details.
> + *
> + * You should have received a copy of the GNU Lesser General Public
> + * License along with FFmpeg; if not, write to the Free Software
> + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
> + */
> +
> +#include "config.h"
> +
> +#include "libavutil/attributes.h"
> +#include "libavutil/cpu.h"
> +#include "libavutil/riscv/cpu.h"
> +#include "libavcodec/vvc/dsp.h"
> +
> +#define bf(fn, bd,  opt) fn##_##bd##_##opt
> +
> +#define AVG_PROTOTYPES(bd, opt)                                              \
> +void bf(ff_vvc_avg, bd, opt)(uint8_t *dst, ptrdiff_t dst_stride,             \
> +    const int16_t *src0, const int16_t *src1, int width, int height);        \
> +void bf(ff_vvc_w_avg, bd, opt)(uint8_t *dst, ptrdiff_t dst_stride,           \
> +    const int16_t *src0, const int16_t *src1, int width, int height,         \
> +    int denom, int w0, int w1, int o0, int o1);
> +
> +AVG_PROTOTYPES(8, rvv_128)
> +AVG_PROTOTYPES(8, rvv_256)
> +
> +void ff_vvc_dsp_init_riscv(VVCDSPContext *const c, const int bd)
> +{
> +#if HAVE_RVV
> +    const int flags = av_get_cpu_flags();
> +
> +    if ((flags & AV_CPU_FLAG_RVV_I32) && (flags & AV_CPU_FLAG_RVB_ADDR) &&
> +        ff_rv_vlen_least(256)) {
> +        switch (bd) {
> +            case 8:
> +                c->inter.avg    = ff_vvc_avg_8_rvv_256;
> +# if (__riscv_xlen == 64)
> +                c->inter.w_avg    = ff_vvc_w_avg_8_rvv_256;
> +# endif
> +                break;
> +            default:
> +                break;
> +        }
> +    } else if ((flags & AV_CPU_FLAG_RVV_I32) && (flags & AV_CPU_FLAG_RVB_ADDR) &&
> +               ff_rv_vlen_least(128)) {
> +        switch (bd) {
> +            case 8:
> +                c->inter.avg    = ff_vvc_avg_8_rvv_128;
> +# if (__riscv_xlen == 64)
> +                c->inter.w_avg    = ff_vvc_w_avg_8_rvv_128;
> +# endif
> +                break;
> +            default:
> +                break;
> +        }
> +    }
> +#endif
> +}
> diff --git a/libavcodec/vvc/dsp.c b/libavcodec/vvc/dsp.c
> index 41e830a98a..c55a37d255 100644
> --- a/libavcodec/vvc/dsp.c
> +++ b/libavcodec/vvc/dsp.c
> @@ -121,7 +121,9 @@ void ff_vvc_dsp_init(VVCDSPContext *vvcdsp, int bit_depth)
>          break;
>      }
> 
> -#if ARCH_X86
> +#if ARCH_RISCV
> +    ff_vvc_dsp_init_riscv(vvcdsp, bit_depth);
> +#elif ARCH_X86
>      ff_vvc_dsp_init_x86(vvcdsp, bit_depth);
>  #endif
>  }
> diff --git a/libavcodec/vvc/dsp.h b/libavcodec/vvc/dsp.h
> index 1f14096c41..e03236dd76 100644
> --- a/libavcodec/vvc/dsp.h
> +++ b/libavcodec/vvc/dsp.h
> @@ -180,6 +180,7 @@ typedef struct VVCDSPContext {
> 
>  void ff_vvc_dsp_init(VVCDSPContext *hpc, int bit_depth);
> 
> +void ff_vvc_dsp_init_riscv(VVCDSPContext *hpc, const int bit_depth);
>  void ff_vvc_dsp_init_x86(VVCDSPContext *hpc, const int bit_depth);
> 
>  #endif /* AVCODEC_VVC_DSP_H */


-- 
雷米‧德尼-库尔蒙
http://www.remlab.net/



_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [FFmpeg-devel] [PATCH v2] lavc/vvc_mc: R-V V avg w_avg
  2024-06-01 19:54 ` Rémi Denis-Courmont
@ 2024-06-01 20:11   ` flow gg
  0 siblings, 0 replies; 4+ messages in thread
From: flow gg @ 2024-06-01 20:11 UTC (permalink / raw)
  To: FFmpeg development discussions and patches

> I think we can drop the 2x2 transforms. In all likelihood, scalar code will
> end up faster than vector code on future hardware, especially out-of-order
> pipelines.

I want to drop 2x2, but since there is only one function handling all block
sizes instead of 7*7 separate functions, how can I drop only the 2x2 case?

Rémi Denis-Courmont <remi@remlab.net> 于2024年6月2日周日 03:54写道:

> Le lauantaina 1. kesäkuuta 2024, 21.01.16 EEST uk7b@foxmail.com a écrit :
> > From: sunyuechi <sunyuechi@iscas.ac.cn>
> >
> >                                                       C908   X60
> > avg_8_2x2_c                                        :    1.0    1.0
> > avg_8_2x2_rvv_i32                                  :    1.0    1.0
>
> I think we can drop the 2x2 transforms. In all likelihood, scalar code
> will
> end up faster than vector code on future hardware, especially out-of-order
> pipelines.
>
> > avg_8_2x4_c                                        :    1.7    2.0
> > avg_8_2x4_rvv_i32                                  :    1.2    1.2
> > avg_8_2x8_c                                        :    3.7    4.0
> > avg_8_2x8_rvv_i32                                  :    2.0    2.0
> > avg_8_2x16_c                                       :    7.2    7.5
> > avg_8_2x16_rvv_i32                                 :    3.2    3.0
> > avg_8_2x32_c                                       :   14.2   15.0
> > avg_8_2x32_rvv_i32                                 :    5.7    5.0
> > avg_8_2x64_c                                       :   46.7   44.2
> > avg_8_2x64_rvv_i32                                 :   39.2   36.0
> > avg_8_2x128_c                                      :   99.7   80.0
> > avg_8_2x128_rvv_i32                                :   86.2   65.5
> > avg_8_4x2_c                                        :    2.0    2.0
> > avg_8_4x2_rvv_i32                                  :    1.0    1.0
> > avg_8_4x4_c                                        :    3.5    3.7
> > avg_8_4x4_rvv_i32                                  :    1.5    1.2
> > avg_8_4x8_c                                        :    6.5    7.0
> > avg_8_4x8_rvv_i32                                  :    2.0    1.7
> > avg_8_4x16_c                                       :   13.5   14.0
> > avg_8_4x16_rvv_i32                                 :    3.2    2.7
> > avg_8_4x32_c                                       :   26.2   27.5
> > avg_8_4x32_rvv_i32                                 :    5.7    5.0
> > avg_8_4x64_c                                       :   75.0   65.7
> > avg_8_4x64_rvv_i32                                 :   44.0   32.0
> > avg_8_4x128_c                                      :  165.2  118.5
> > avg_8_4x128_rvv_i32                                :   81.5   71.0
> > avg_8_8x2_c                                        :    3.2    3.5
> > avg_8_8x2_rvv_i32                                  :    1.2    1.0
> > avg_8_8x4_c                                        :    6.5    6.5
> > avg_8_8x4_rvv_i32                                  :    1.5    1.5
> > avg_8_8x8_c                                        :   12.5   13.2
> > avg_8_8x8_rvv_i32                                  :    2.2    1.7
> > avg_8_8x16_c                                       :   25.2   26.5
> > avg_8_8x16_rvv_i32                                 :    3.7    2.7
> > avg_8_8x32_c                                       :   50.0   52.5
> > avg_8_8x32_rvv_i32                                 :    6.7    5.2
> > avg_8_8x64_c                                       :  120.7  119.0
> > avg_8_8x64_rvv_i32                                 :   43.2   33.5
> > avg_8_8x128_c                                      :  247.5  217.7
> > avg_8_8x128_rvv_i32                                :  100.5   74.7
> > avg_8_16x2_c                                       :    6.2    6.5
> > avg_8_16x2_rvv_i32                                 :    1.2    1.0
> > avg_8_16x4_c                                       :   12.2   13.0
> > avg_8_16x4_rvv_i32                                 :    2.0    1.2
> > avg_8_16x8_c                                       :   24.5   25.7
> > avg_8_16x8_rvv_i32                                 :    3.2    2.0
> > avg_8_16x16_c                                      :   48.7   51.2
> > avg_8_16x16_rvv_i32                                :    5.7    3.2
> > avg_8_16x32_c                                      :   97.5  102.7
> > avg_8_16x32_rvv_i32                                :   10.7    6.0
> > avg_8_16x64_c                                      :  213.0  215.0
> > avg_8_16x64_rvv_i32                                :   51.5   33.5
> > avg_8_16x128_c                                     :  408.5  417.0
> > avg_8_16x128_rvv_i32                               :  102.0   71.5
> > avg_8_32x2_c                                       :   12.2   13.0
> > avg_8_32x2_rvv_i32                                 :    2.0    1.2
> > avg_8_32x4_c                                       :   24.5   25.5
> > avg_8_32x4_rvv_i32                                 :    3.2    1.7
> > avg_8_32x8_c                                       :   48.5   50.7
> > avg_8_32x8_rvv_i32                                 :    5.7    3.0
> > avg_8_32x16_c                                      :   96.5  101.5
> > avg_8_32x16_rvv_i32                                :   10.5    5.0
> > avg_8_32x32_c                                      :  210.2  202.5
> > avg_8_32x32_rvv_i32                                :   20.2    9.7
> > avg_8_32x64_c                                      :  431.7  417.2
> > avg_8_32x64_rvv_i32                                :   68.0   46.0
> > avg_8_32x128_c                                     :  822.2  819.0
> > avg_8_32x128_rvv_i32                               :  152.2   69.0
> > avg_8_64x2_c                                       :   24.0   25.2
> > avg_8_64x2_rvv_i32                                 :    3.0    1.7
> > avg_8_64x4_c                                       :   48.2   51.0
> > avg_8_64x4_rvv_i32                                 :    5.5    2.7
> > avg_8_64x8_c                                       :   96.7  101.5
> > avg_8_64x8_rvv_i32                                 :   10.0    5.0
> > avg_8_64x16_c                                      :  193.5  203.0
> > avg_8_64x16_rvv_i32                                :   19.2    9.2
> > avg_8_64x32_c                                      :  404.2  405.7
> > avg_8_64x32_rvv_i32                                :   37.7   18.0
> > avg_8_64x64_c                                      :  846.2  841.7
> > avg_8_64x64_rvv_i32                                :  136.5   35.2
> > avg_8_64x128_c                                     : 1659.5 1662.2
> > avg_8_64x128_rvv_i32                               :  236.7  177.7
> > avg_8_128x2_c                                      :   48.7   51.0
> > avg_8_128x2_rvv_i32                                :    5.2    2.7
> > avg_8_128x4_c                                      :   96.7  101.0
> > avg_8_128x4_rvv_i32                                :    9.7    4.7
> > avg_8_128x8_c                                      :  226.0  201.5
> > avg_8_128x8_rvv_i32                                :   18.7    8.7
> > avg_8_128x16_c                                     :  402.7  402.7
> > avg_8_128x16_rvv_i32                               :   37.0   17.2
> > avg_8_128x32_c                                     :  791.2  805.0
> > avg_8_128x32_rvv_i32                               :   73.5   33.5
> > avg_8_128x64_c                                     : 1616.7 1645.2
> > avg_8_128x64_rvv_i32                               :  223.7   68.2
> > avg_8_128x128_c                                    : 3202.0 3235.2
> > avg_8_128x128_rvv_i32                              :  390.0  314.2
> > w_avg_8_2x2_c                                      :    1.7    1.5
> > w_avg_8_2x2_rvv_i32                                :    1.7    1.5
> > w_avg_8_2x4_c                                      :    2.7    2.5
> > w_avg_8_2x4_rvv_i32                                :    2.7    2.5
> > w_avg_8_2x8_c                                      :    5.0    5.0
> > w_avg_8_2x8_rvv_i32                                :    4.5    4.0
> > w_avg_8_2x16_c                                     :   26.5    9.5
> > w_avg_8_2x16_rvv_i32                               :    8.0    7.0
> > w_avg_8_2x32_c                                     :   18.7   18.5
> > w_avg_8_2x32_rvv_i32                               :   15.0   13.2
> > w_avg_8_2x64_c                                     :   58.5   46.5
> > w_avg_8_2x64_rvv_i32                               :   49.7   38.7
> > w_avg_8_2x128_c                                    :  121.7   85.2
> > w_avg_8_2x128_rvv_i32                              :   89.7   81.0
> > w_avg_8_4x2_c                                      :    2.5    2.5
> > w_avg_8_4x2_rvv_i32                                :    1.7    1.5
> > w_avg_8_4x4_c                                      :    4.7    4.7
> > w_avg_8_4x4_rvv_i32                                :    2.7    2.2
> > w_avg_8_4x8_c                                      :    9.0    9.0
> > w_avg_8_4x8_rvv_i32                                :    4.5    4.0
> > w_avg_8_4x16_c                                     :   17.7   17.7
> > w_avg_8_4x16_rvv_i32                               :    8.0    7.0
> > w_avg_8_4x32_c                                     :   35.0   35.0
> > w_avg_8_4x32_rvv_i32                               :   15.0   13.5
> > w_avg_8_4x64_c                                     :   95.2   80.2
> > w_avg_8_4x64_rvv_i32                               :   47.7   38.0
> > w_avg_8_4x128_c                                    :  197.7  164.7
> > w_avg_8_4x128_rvv_i32                              :  101.7   81.5
> > w_avg_8_8x2_c                                      :    4.5    4.5
> > w_avg_8_8x2_rvv_i32                                :    2.0    1.7
> > w_avg_8_8x4_c                                      :    8.7    8.7
> > w_avg_8_8x4_rvv_i32                                :    2.7    2.5
> > w_avg_8_8x8_c                                      :   33.5   17.0
> > w_avg_8_8x8_rvv_i32                                :    4.7    4.0
> > w_avg_8_8x16_c                                     :   34.0   34.0
> > w_avg_8_8x16_rvv_i32                               :    8.5    7.2
> > w_avg_8_8x32_c                                     :   85.5   67.7
> > w_avg_8_8x32_rvv_i32                               :   16.2   13.5
> > w_avg_8_8x64_c                                     :  162.5  148.2
> > w_avg_8_8x64_rvv_i32                               :   50.0   36.5
> > w_avg_8_8x128_c                                    :  380.2  301.5
> > w_avg_8_8x128_rvv_i32                              :   87.2   79.5
> > w_avg_8_16x2_c                                     :    8.5    8.7
> > w_avg_8_16x2_rvv_i32                               :    2.2    1.7
> > w_avg_8_16x4_c                                     :   16.7   17.0
> > w_avg_8_16x4_rvv_i32                               :    3.7    2.5
> > w_avg_8_16x8_c                                     :   33.2   33.7
> > w_avg_8_16x8_rvv_i32                               :    6.5    4.2
> > w_avg_8_16x16_c                                    :   66.2   66.5
> > w_avg_8_16x16_rvv_i32                              :   12.0    7.5
> > w_avg_8_16x32_c                                    :  133.2  134.0
> > w_avg_8_16x32_rvv_i32                              :   23.0   14.2
> > w_avg_8_16x64_c                                    :  296.0  276.7
> > w_avg_8_16x64_rvv_i32                              :   66.7   38.2
> > w_avg_8_16x128_c                                   :  625.2  539.7
> > w_avg_8_16x128_rvv_i32                             :  135.5   79.2
> > w_avg_8_32x2_c                                     :   16.7   16.7
> > w_avg_8_32x2_rvv_i32                               :    3.5    2.0
> > w_avg_8_32x4_c                                     :   33.2   33.2
> > w_avg_8_32x4_rvv_i32                               :    6.0    3.5
> > w_avg_8_32x8_c                                     :   65.7   66.2
> > w_avg_8_32x8_rvv_i32                               :   11.2    5.7
> > w_avg_8_32x16_c                                    :  132.0  132.0
> > w_avg_8_32x16_rvv_i32                              :   21.5   10.7
> > w_avg_8_32x32_c                                    :  261.7  272.2
> > w_avg_8_32x32_rvv_i32                              :   42.2   20.5
> > w_avg_8_32x64_c                                    :  528.2  562.7
> > w_avg_8_32x64_rvv_i32                              :   83.5   59.2
> > w_avg_8_32x128_c                                   : 1135.5 1070.0
> > w_avg_8_32x128_rvv_i32                             :  208.7   96.5
> > w_avg_8_64x2_c                                     :   33.0   33.0
> > w_avg_8_64x2_rvv_i32                               :    6.0    3.0
> > w_avg_8_64x4_c                                     :   65.5   67.0
> > w_avg_8_64x4_rvv_i32                               :   11.0    5.2
> > w_avg_8_64x8_c                                     :  150.0  134.7
> > w_avg_8_64x8_rvv_i32                               :   21.5   10.0
> > w_avg_8_64x16_c                                    :  265.2  273.7
> > w_avg_8_64x16_rvv_i32                              :   42.2   19.0
> > w_avg_8_64x32_c                                    :  629.7  541.7
> > w_avg_8_64x32_rvv_i32                              :   83.7   37.7
> > w_avg_8_64x64_c                                    : 1259.0 1237.7
> > w_avg_8_64x64_rvv_i32                              :  190.7   76.0
> > w_avg_8_64x128_c                                   : 2967.0 2209.5
> > w_avg_8_64x128_rvv_i32                             :  437.0  190.5
> > w_avg_8_128x2_c                                    :   65.7   66.0
> > w_avg_8_128x2_rvv_i32                              :   11.2    5.5
> > w_avg_8_128x4_c                                    :  131.7  134.7
> > w_avg_8_128x4_rvv_i32                              :   21.5   10.0
> > w_avg_8_128x8_c                                    :  270.2  264.2
> > w_avg_8_128x8_rvv_i32                              :   42.2   19.2
> > w_avg_8_128x16_c                                   :  580.0  554.2
> > w_avg_8_128x16_rvv_i32                             :   83.5   37.5
> > w_avg_8_128x32_c                                   : 1141.0 1206.2
> > w_avg_8_128x32_rvv_i32                             :  166.2   74.2
> > w_avg_8_128x64_c                                   : 2295.5 2403.5
> > w_avg_8_128x64_rvv_i32                             :  408.2  159.0
> > w_avg_8_128x128_c                                  : 5367.5 4915.2
> > w_avg_8_128x128_rvv_i32                            :  741.2  331.5
> >
> > test
> >
> > u makefile
>
> Squash left overs?
>
> > ---
> >  libavcodec/riscv/vvc/Makefile      |   2 +
> >  libavcodec/riscv/vvc/vvc_mc_rvv.S  | 295 +++++++++++++++++++++++++++++
> >  libavcodec/riscv/vvc/vvcdsp_init.c |  71 +++++++
> >  libavcodec/vvc/dsp.c               |   4 +-
> >  libavcodec/vvc/dsp.h               |   1 +
> >  5 files changed, 372 insertions(+), 1 deletion(-)
> >  create mode 100644 libavcodec/riscv/vvc/Makefile
> >  create mode 100644 libavcodec/riscv/vvc/vvc_mc_rvv.S
> >  create mode 100644 libavcodec/riscv/vvc/vvcdsp_init.c
> >
> > diff --git a/libavcodec/riscv/vvc/Makefile
> b/libavcodec/riscv/vvc/Makefile
> > new file mode 100644
> > index 0000000000..582b051579
> > --- /dev/null
> > +++ b/libavcodec/riscv/vvc/Makefile
> > @@ -0,0 +1,2 @@
> > +OBJS-$(CONFIG_VVC_DECODER) += riscv/vvc/vvcdsp_init.o
> > +RVV-OBJS-$(CONFIG_VVC_DECODER) += riscv/vvc/vvc_mc_rvv.o
> > diff --git a/libavcodec/riscv/vvc/vvc_mc_rvv.S
> > b/libavcodec/riscv/vvc/vvc_mc_rvv.S new file mode 100644
> > index 0000000000..f2128fa776
> > --- /dev/null
> > +++ b/libavcodec/riscv/vvc/vvc_mc_rvv.S
> > @@ -0,0 +1,295 @@
> > +/*
> > + * Copyright (c) 2024 Institue of Software Chinese Academy of Sciences
> > (ISCAS). + *
> > + * This file is part of FFmpeg.
> > + *
> > + * FFmpeg is free software; you can redistribute it and/or
> > + * modify it under the terms of the GNU Lesser General Public
> > + * License as published by the Free Software Foundation; either
> > + * version 2.1 of the License, or (at your option) any later version.
> > + *
> > + * FFmpeg is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > + * Lesser General Public License for more details.
> > + *
> > + * You should have received a copy of the GNU Lesser General Public
> > + * License along with FFmpeg; if not, write to the Free Software
> > + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
> 02110-1301
> > USA + */
> > +
> > +#include "libavutil/riscv/asm.S"
> > +
> > +.macro vsetvlstatic8 w, vlen, is_w
> > +        .if \w <= 2
> > +                vsetivli        zero, \w, e8, mf8, ta, ma
> > +        .elseif \w <= 4 && \vlen == 128
> > +                vsetivli        zero, \w, e8, mf4, ta, ma
> > +        .elseif \w <= 4 && \vlen >= 256
> > +                vsetivli        zero, \w, e8, mf8, ta, ma
> > +        .elseif \w <= 8 && \vlen == 128
> > +                vsetivli        zero, \w, e8, mf2, ta, ma
> > +        .elseif \w <= 8 && \vlen >= 256
> > +                vsetivli        zero, \w, e8, mf4, ta, ma
> > +        .elseif \w <= 16 && \vlen == 128
> > +                vsetivli        zero, \w, e8, m1, ta, ma
> > +        .elseif \w <= 16 && \vlen >= 256
> > +                vsetivli        zero, \w, e8, mf2, ta, ma
> > +        .elseif \w <= 32 && \vlen >= 256
> > +                li t0, \w
> > +                vsetvli         zero, t0, e8, m1, ta, ma
> > +        .elseif \w <= (\vlen / 4) || \is_w
> > +                li t0, 64
> > +                vsetvli         zero, t0, e8, m2, ta, ma
> > +        .else
> > +                li t0, \w
> > +                vsetvli         zero, t0, e8, m4, ta, ma
> > +        .endif
> > +.endm
> > +
> > +.macro vsetvlstatic16 w, vlen, is_w
> > +        .if \w <= 2
> > +                vsetivli        zero, \w, e16, mf4, ta, ma
> > +        .elseif \w <= 4 && \vlen == 128
> > +                vsetivli        zero, \w, e16, mf2, ta, ma
> > +        .elseif \w <= 4 && \vlen >= 256
> > +                vsetivli        zero, \w, e16, mf4, ta, ma
> > +        .elseif \w <= 8 && \vlen == 128
> > +                vsetivli        zero, \w, e16, m1, ta, ma
> > +        .elseif \w <= 8 && \vlen >= 256
> > +                vsetivli        zero, \w, e16, mf2, ta, ma
> > +        .elseif \w <= 16 && \vlen == 128
> > +                vsetivli        zero, \w, e16, m2, ta, ma
> > +        .elseif \w <= 16 && \vlen >= 256
> > +                vsetivli        zero, \w, e16, m1, ta, ma
> > +        .elseif \w <= 32 && \vlen >= 256
> > +                li t0, \w
> > +                vsetvli         zero, t0, e16, m2, ta, ma
> > +        .elseif \w <= (\vlen / 4) || \is_w
> > +                li t0, 64
> > +                vsetvli         zero, t0, e16, m4, ta, ma
> > +        .else
> > +                li t0, \w
> > +                vsetvli         zero, t0, e16, m8, ta, ma
> > +        .endif
> > +.endm
> > +
> > +.macro vsetvlstatic32 w, vlen
> > +        .if \w <= 2
> > +                vsetivli        zero, \w, e32, mf2, ta, ma
> > +        .elseif \w <= 4 && \vlen == 128
> > +                vsetivli        zero, \w, e32, m1, ta, ma
> > +        .elseif \w <= 4 && \vlen >= 256
> > +                vsetivli        zero, \w, e32, mf2, ta, ma
> > +        .elseif \w <= 8 && \vlen == 128
> > +                vsetivli        zero, \w, e32, m2, ta, ma
> > +        .elseif \w <= 8 && \vlen >= 256
> > +                vsetivli        zero, \w, e32, m1, ta, ma
> > +        .elseif \w <= 16 && \vlen == 128
> > +                vsetivli        zero, \w, e32, m4, ta, ma
> > +        .elseif \w <= 16 && \vlen >= 256
> > +                vsetivli        zero, \w, e32, m2, ta, ma
> > +        .elseif \w <= 32 && \vlen >= 256
> > +                li t0, \w
> > +                vsetvli         zero, t0, e32, m4, ta, ma
> > +        .else
> > +                li t0, \w
> > +                vsetvli         zero, t0, e32, m8, ta, ma
> > +        .endif
> > +.endm
> > +
> > +.macro avg_nx1 w, vlen
> > +        vsetvlstatic16    \w, \vlen, 0
> > +        vle16.v           v0, (a2)
> > +        vle16.v           v8, (a3)
> > +        vadd.vv           v8, v8, v0
> > +        vmax.vx           v8, v8, zero
> > +        vsetvlstatic8     \w, \vlen, 0
> > +        vnclipu.wi        v8, v8, 7
> > +        vse8.v            v8, (a0)
> > +.endm
> > +
> > +.macro avg w, vlen, id
> > +\id\w\vlen:
> > +.if \w < 128
> > +        vsetvlstatic16    \w, \vlen, 0
> > +        addi              t0, a2, 128*2
> > +        addi              t1, a3, 128*2
> > +        add               t2, a0, a1
> > +        vle16.v           v0, (a2)
> > +        vle16.v           v8, (a3)
> > +        addi              a5, a5, -2
> > +        vle16.v           v16, (t0)
> > +        vle16.v           v24, (t1)
> > +        vadd.vv           v8, v8, v0
> > +        vadd.vv           v24, v24, v16
> > +        vmax.vx           v8, v8, zero
> > +        vmax.vx           v24, v24, zero
> > +        vsetvlstatic8     \w, \vlen, 0
> > +        addi              a2, a2, 128*4
> > +        vnclipu.wi        v8, v8, 7
> > +        vnclipu.wi        v24, v24, 7
> > +        addi              a3, a3, 128*4
> > +        vse8.v            v8, (a0)
> > +        vse8.v            v24, (t2)
> > +        sh1add            a0, a1, a0
> > +.else
> > +        avg_nx1           128, \vlen
> > +        addi              a5, a5, -1
> > +        .if \vlen == 128
> > +        addi              a2, a2, 64*2
> > +        addi              a3, a3, 64*2
> > +        addi              a0, a0, 64
> > +        avg_nx1           128, \vlen
> > +        addi              a0, a0, -64
> > +        addi              a2, a2, 128
> > +        addi              a3, a3, 128
> > +        .else
> > +        addi              a2, a2, 128*2
> > +        addi              a3, a3, 128*2
> > +        .endif
> > +        add               a0, a0, a1
> > +.endif
> > +        bnez              a5, \id\w\vlen\()b
> > +        ret
> > +.endm
> > +
> > +
> > +.macro AVG_JMP_TABLE id, vlen
> > +const jmp_table_\id\vlen
> > +    .8byte \id\()2\vlen\()f
> > +    .8byte \id\()4\vlen\()f
> > +    .8byte \id\()8\vlen\()f
> > +    .8byte \id\()16\vlen\()f
> > +    .8byte \id\()32\vlen\()f
> > +    .8byte \id\()64\vlen\()f
> > +    .8byte \id\()128\vlen\()f
>
> AFAIU, this will generate relocations. I wonder whether the linker is smart
> enough to put that into .data.relro, rather than whine that it cannot leave
> it in .rodata?
>
> In assembler, we can dodge the problem entirely by storing relative offsets
> rather than addresses. You can also stick to 4- or even 2-byte values then.
>
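To make the relative-offset suggestion concrete, here is a rough, untested
sketch (the table and label naming follows the patch, everything else is only
illustrative): each entry stores a 32-bit offset from the table base instead
of an absolute address, and the dispatch adds the loaded offset back to the
base, so the table needs no run-time relocations and can stay in .rodata.

    .macro AVG_JMP_TABLE id, vlen
    const jmp_table_\id\vlen
            .4byte  \id\()2\vlen\()f - jmp_table_\id\vlen
            .4byte  \id\()4\vlen\()f - jmp_table_\id\vlen
            // ... one entry per remaining width (8 .. 128) ...
    endconst
    .endm

    .macro AVG_J vlen, id
            clz             a4, a4
            li              t0, __riscv_xlen - 2
            sub             a4, t0, a4
            lla             t1, jmp_table_\id\vlen   // table base
            sh2add          t0, a4, t1               // &table[index], 4-byte entries
            lw              t0, (t0)                 // sign-extended 32-bit offset
            add             t0, t0, t1               // base + offset = branch target
            jr              t0
    .endm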
> > +endconst
> > +.endm
> > +
> > +.macro AVG_J vlen, id
> > +        clz               a4, a4
> > +        li                t0, __riscv_xlen-2
> > +        sub               a4, t0, a4
> > +        lla               t0, jmp_table_\id\vlen
> > +        sh3add            t0, a4, t0
> > +        ld                t0, 0(t0)
>
> LLA is an alias for AUIPC; ADD. You can avoid that ADD by folding the low
> bits into LD. See how ff_h263_loop_filter_strength is addressed in
> h263dsp_rvv.S.
>
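For reference, the folding would look roughly like this (untested sketch, same
register usage as the patch): expand LLA into an explicit AUIPC, add the scaled
index, and let %pcrel_lo of the AUIPC's label serve as the load displacement,
which removes the separate ADD.

    .macro AVG_J vlen, id
            clz             a4, a4
            li              t0, __riscv_xlen - 2
            sub             a4, t0, a4
    1:
            auipc           t0, %pcrel_hi(jmp_table_\id\vlen)
            sh3add          t0, a4, t0               // high part + 8 * index
            ld              t0, %pcrel_lo(1b)(t0)    // low part folded into the load
            jr              t0
    .endm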
> > +        jr                t0
> > +.endm
> > +
> > +.macro func_avg vlen
> > +func ff_vvc_avg_8_rvv_\vlen\(), zve32x
> > +        AVG_JMP_TABLE     1, \vlen
> > +        csrw              vxrm, zero
> > +        AVG_J             \vlen, 1
> > +        .irp w,2,4,8,16,32,64,128
> > +        avg               \w, \vlen, 1
> > +        .endr
> > +endfunc
> > +.endm
> > +
> > +.macro w_avg_nx1 w, vlen
> > +        vsetvlstatic16    \w, \vlen, 1
> > +        vle16.v           v0, (a2)
> > +        vle16.v           v8, (a3)
> > +        vwmul.vx          v16, v0, a7
> > +        vwmacc.vx         v16, t3, v8
> > +        vsetvlstatic32    \w, \vlen
> > +        vadd.vx           v16, v16, t4
> > +        vsetvlstatic16    \w, \vlen, 1
> > +        vnsrl.wx          v16, v16, t6
> > +        vmax.vx           v16, v16, zero
> > +        vsetvlstatic8     \w, \vlen, 1
> > +        vnclipu.wi        v16, v16, 0
> > +        vse8.v            v16, (a0)
> > +.endm
> > +
> > +.macro w_avg w, vlen, id
> > +\id\w\vlen:
> > +.if \vlen <= 16
> > +        vsetvlstatic16    \w, \vlen, 1
> > +        addi              t0, a2, 128*2
> > +        addi              t1, a3, 128*2
> > +        vle16.v           v0, (a2)
> > +        vle16.v           v8, (a3)
> > +        addi              a5, a5, -2
> > +        vle16.v           v20, (t0)
> > +        vle16.v           v24, (t1)
> > +        vwmul.vx          v16, v0, a7
> > +        vwmul.vx          v28, v20, a7
> > +        vwmacc.vx         v16, t3, v8
> > +        vwmacc.vx         v28, t3, v24
> > +        vsetvlstatic32    \w, \vlen
> > +        add               t2, a0, a1
> > +        vadd.vx           v16, v16, t4
> > +        vadd.vx           v28, v28, t4
> > +        vsetvlstatic16    \w, \vlen, 1
> > +        vnsrl.wx          v16, v16, t6
> > +        vnsrl.wx          v28, v28, t6
> > +        vmax.vx           v16, v16, zero
> > +        vmax.vx           v28, v28, zero
> > +        vsetvlstatic8     \w, \vlen, 1
> > +        addi              a2, a2, 128*4
> > +        vnclipu.wi        v16, v16, 0
> > +        vnclipu.wi        v28, v28, 0
> > +        vse8.v            v16, (a0)
> > +        addi              a3, a3, 128*4
> > +        vse8.v            v28, (t2)
> > +        sh1add            a0, a1, a0
> > +.else
> > +        w_avg_nx1         \w, \vlen
> > +        addi              a5, a5, -1
> > +        .if \w == (\vlen / 2)
> > +        addi              a2, a2, (\vlen / 2)
> > +        addi              a3, a3, (\vlen / 2)
> > +        addi              a0, a0, (\vlen / 4)
> > +        w_avg_nx1         \w, \vlen
> > +        addi              a2, a2, -(\vlen / 2)
> > +        addi              a3, a3, -(\vlen / 2)
> > +        addi              a0, a0, -(\vlen / 4)
> > +        .elseif \w == 128 && \vlen == 128
> > +        .rept 3
> > +        addi              a2, a2, (\vlen / 2)
> > +        addi              a3, a3, (\vlen / 2)
> > +        addi              a0, a0, (\vlen / 4)
> > +        w_avg_nx1         \w, \vlen
> > +        .endr
> > +        addi              a2, a2, -(\vlen / 2) * 3
> > +        addi              a3, a3, -(\vlen / 2) * 3
> > +        addi              a0, a0, -(\vlen / 4) * 3
> > +        .endif
> > +
> > +        addi              a2, a2, 128*2
> > +        addi              a3, a3, 128*2
> > +        add               a0, a0, a1
> > +.endif
> > +        bnez              a5, \id\w\vlen\()b
> > +        ret
> > +.endm
> > +
> > +
> > +.macro func_w_avg vlen
> > +func ff_vvc_w_avg_8_rvv_\vlen\(), zve32x
> > +        AVG_JMP_TABLE     2, \vlen
> > +        csrw              vxrm, zero
> > +        addi              t6, a6, 7
> > +        ld                t3, (sp)
> > +        ld                t4, 8(sp)
> > +        ld                t5, 16(sp)
> > +        add               t4, t4, t5
> > +        addi              t4, t4, 1       // o0 + o1 + 1
> > +        addi              t5, t6, -1      // shift - 1
> > +        sll               t4, t4, t5
> > +        AVG_J             \vlen, 2
> > +        .irp w,2,4,8,16,32,64,128
> > +        w_avg             \w, \vlen, 2
> > +        .endr
> > +endfunc
> > +.endm
> > +
> > +func_avg 128
> > +func_avg 256
> > +#if (__riscv_xlen == 64)
> > +func_w_avg 128
> > +func_w_avg 256
> > +#endif
> > diff --git a/libavcodec/riscv/vvc/vvcdsp_init.c b/libavcodec/riscv/vvc/vvcdsp_init.c
> > new file mode 100644
> > index 0000000000..85b1ede061
> > --- /dev/null
> > +++ b/libavcodec/riscv/vvc/vvcdsp_init.c
> > @@ -0,0 +1,71 @@
> > +/*
> > + * Copyright (c) 2024 Institute of Software Chinese Academy of Sciences (ISCAS).
> > + *
> > + * This file is part of FFmpeg.
> > + *
> > + * FFmpeg is free software; you can redistribute it and/or
> > + * modify it under the terms of the GNU Lesser General Public
> > + * License as published by the Free Software Foundation; either
> > + * version 2.1 of the License, or (at your option) any later version.
> > + *
> > + * FFmpeg is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > + * Lesser General Public License for more details.
> > + *
> > + * You should have received a copy of the GNU Lesser General Public
> > + * License along with FFmpeg; if not, write to the Free Software
> > + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
> > + */
> > +
> > +#include "config.h"
> > +
> > +#include "libavutil/attributes.h"
> > +#include "libavutil/cpu.h"
> > +#include "libavutil/riscv/cpu.h"
> > +#include "libavcodec/vvc/dsp.h"
> > +
> > +#define bf(fn, bd,  opt) fn##_##bd##_##opt
> > +
> > +#define AVG_PROTOTYPES(bd, opt)                                                                 \
> > +void bf(ff_vvc_avg, bd, opt)(uint8_t *dst, ptrdiff_t dst_stride,                                \
> > +    const int16_t *src0, const int16_t *src1, int width, int height);                           \
> > +void bf(ff_vvc_w_avg, bd, opt)(uint8_t *dst, ptrdiff_t dst_stride,                              \
> > +    const int16_t *src0, const int16_t *src1, int width, int height,                            \
> > +    int denom, int w0, int w1, int o0, int o1);
> > +
> > +AVG_PROTOTYPES(8, rvv_128)
> > +AVG_PROTOTYPES(8, rvv_256)
> > +
> > +void ff_vvc_dsp_init_riscv(VVCDSPContext *const c, const int bd)
> > +{
> > +#if HAVE_RVV
> > +    const int flags = av_get_cpu_flags();
> > +
> > +    if ((flags & AV_CPU_FLAG_RVV_I32) && (flags & AV_CPU_FLAG_RVB_ADDR)
> &&
> > +        ff_rv_vlen_least(256)) {
> > +        switch (bd) {
> > +            case 8:
> > +                c->inter.avg    = ff_vvc_avg_8_rvv_256;
> > +# if (__riscv_xlen == 64)
> > +                c->inter.w_avg    = ff_vvc_w_avg_8_rvv_256;
> > +# endif
> > +                break;
> > +            default:
> > +                break;
> > +        }
> > +    } else if ((flags & AV_CPU_FLAG_RVV_I32) && (flags & AV_CPU_FLAG_RVB_ADDR) &&
> > +               ff_rv_vlen_least(128)) {
> > +        switch (bd) {
> > +            case 8:
> > +                c->inter.avg    = ff_vvc_avg_8_rvv_128;
> > +# if (__riscv_xlen == 64)
> > +                c->inter.w_avg    = ff_vvc_w_avg_8_rvv_128;
> > +# endif
> > +                break;
> > +            default:
> > +                break;
> > +        }
> > +    }
> > +#endif
> > +}
> > diff --git a/libavcodec/vvc/dsp.c b/libavcodec/vvc/dsp.c
> > index 41e830a98a..c55a37d255 100644
> > --- a/libavcodec/vvc/dsp.c
> > +++ b/libavcodec/vvc/dsp.c
> > @@ -121,7 +121,9 @@ void ff_vvc_dsp_init(VVCDSPContext *vvcdsp, int bit_depth)
> >          break;
> >      }
> >
> > -#if ARCH_X86
> > +#if ARCH_RISCV
> > +    ff_vvc_dsp_init_riscv(vvcdsp, bit_depth);
> > +#elif ARCH_X86
> >      ff_vvc_dsp_init_x86(vvcdsp, bit_depth);
> >  #endif
> >  }
> > diff --git a/libavcodec/vvc/dsp.h b/libavcodec/vvc/dsp.h
> > index 1f14096c41..e03236dd76 100644
> > --- a/libavcodec/vvc/dsp.h
> > +++ b/libavcodec/vvc/dsp.h
> > @@ -180,6 +180,7 @@ typedef struct VVCDSPContext {
> >
> >  void ff_vvc_dsp_init(VVCDSPContext *hpc, int bit_depth);
> >
> > +void ff_vvc_dsp_init_riscv(VVCDSPContext *hpc, const int bit_depth);
> >  void ff_vvc_dsp_init_x86(VVCDSPContext *hpc, const int bit_depth);
> >
> >  #endif /* AVCODEC_VVC_DSP_H */
>
>
> --
> 雷米‧德尼-库尔蒙
> http://www.remlab.net/
>
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".


end of thread, other threads:[~2024-06-01 20:12 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-06-01 18:01 [FFmpeg-devel] [PATCH v2] lavc/vvc_mc: R-V V avg w_avg uk7b
2024-06-01 18:02 ` flow gg
2024-06-01 19:54 ` Rémi Denis-Courmont
2024-06-01 20:11   ` flow gg
