* [FFmpeg-devel] [PATCH v2 1/7] avcodec/hevc: Add init for sao_edge_filter
2023-12-27 4:50 [FFmpeg-devel] [PATCH v2] [loongarch] Add hevc 128-bit & 256-bit asm optimizations jinbo
@ 2023-12-27 4:50 ` jinbo
2023-12-27 4:50 ` [FFmpeg-devel] [PATCH v2 2/7] avcodec/hevc: Add add_residual_4/8/16/32 asm opt jinbo
` (5 subsequent siblings)
6 siblings, 0 replies; 10+ messages in thread
From: jinbo @ 2023-12-27 4:50 UTC (permalink / raw)
To: ffmpeg-devel; +Cc: jinbo
Forgot to init c->sao_edge_filter[idx] when idx=0/1/2/3.
After this patch, the speedup of decoding H265 4K 30FPS
30Mbps on 3A6000 is about 7% (42fps==>45fps).
Change-Id: I521999b397fa72b931a23c165cf45f276440cdfb
---
libavcodec/loongarch/hevcdsp_init_loongarch.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/libavcodec/loongarch/hevcdsp_init_loongarch.c b/libavcodec/loongarch/hevcdsp_init_loongarch.c
index 22739c6f5b..5a96f3a4c9 100644
--- a/libavcodec/loongarch/hevcdsp_init_loongarch.c
+++ b/libavcodec/loongarch/hevcdsp_init_loongarch.c
@@ -167,6 +167,10 @@ void ff_hevc_dsp_init_loongarch(HEVCDSPContext *c, const int bit_depth)
c->put_hevc_qpel_uni_w[8][1][1] = ff_hevc_put_hevc_uni_w_qpel_hv48_8_lsx;
c->put_hevc_qpel_uni_w[9][1][1] = ff_hevc_put_hevc_uni_w_qpel_hv64_8_lsx;
+ c->sao_edge_filter[0] = ff_hevc_sao_edge_filter_8_lsx;
+ c->sao_edge_filter[1] = ff_hevc_sao_edge_filter_8_lsx;
+ c->sao_edge_filter[2] = ff_hevc_sao_edge_filter_8_lsx;
+ c->sao_edge_filter[3] = ff_hevc_sao_edge_filter_8_lsx;
c->sao_edge_filter[4] = ff_hevc_sao_edge_filter_8_lsx;
c->hevc_h_loop_filter_luma = ff_hevc_loop_filter_luma_h_8_lsx;
--
2.20.1
* [FFmpeg-devel] [PATCH v2 2/7] avcodec/hevc: Add add_residual_4/8/16/32 asm opt
2023-12-27 4:50 [FFmpeg-devel] [PATCH v2] [loongarch] Add hevc 128-bit & 256-bit asm optimizations jinbo
2023-12-27 4:50 ` [FFmpeg-devel] [PATCH v2 1/7] avcodec/hevc: Add init for sao_edge_filter jinbo
@ 2023-12-27 4:50 ` jinbo
2023-12-27 4:50 ` [FFmpeg-devel] [PATCH v2 3/7] avcodec/hevc: Add pel_uni_w_pixels4/6/8/12/16/24/32/48/64 " jinbo
` (4 subsequent siblings)
6 siblings, 0 replies; 10+ messages in thread
From: jinbo @ 2023-12-27 4:50 UTC (permalink / raw)
To: ffmpeg-devel; +Cc: jinbo
After this patch, the performance of decoding H265 4K 30FPS 30Mbps
on 3A6000 with 8 threads improves by 2fps (45fps-->47fps).
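For reference, a rough scalar equivalent of what these LSX routines
vectorize, sketched from the generic 8-bit C path (not the code added
here): add the inverse-transform residual to the prediction and clip to
[0, 255]. av_clip_uint8() is assumed from libavutil/common.h.

    #include <stdint.h>
    #include <stddef.h>
    #include "libavutil/common.h"

    /* Hedged sketch; the assembly below implements this per block
     * size (4/8/16/32) with LSX vectors. */
    static void add_residual_8_ref(uint8_t *dst, const int16_t *res,
                                   ptrdiff_t stride, int size)
    {
        for (int y = 0; y < size; y++) {
            for (int x = 0; x < size; x++)
                dst[x] = av_clip_uint8(dst[x] + res[x]); /* saturate to 8 bits */
            dst += stride;  /* next picture row            */
            res += size;    /* residual is stored densely  */
        }
    }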
---
libavcodec/loongarch/Makefile | 3 +-
libavcodec/loongarch/hevc_add_res.S | 162 ++++++++++++++++++
libavcodec/loongarch/hevcdsp_init_loongarch.c | 5 +
libavcodec/loongarch/hevcdsp_lsx.h | 5 +
4 files changed, 174 insertions(+), 1 deletion(-)
create mode 100644 libavcodec/loongarch/hevc_add_res.S
diff --git a/libavcodec/loongarch/Makefile b/libavcodec/loongarch/Makefile
index 06cfab5c20..07ea97f803 100644
--- a/libavcodec/loongarch/Makefile
+++ b/libavcodec/loongarch/Makefile
@@ -27,7 +27,8 @@ LSX-OBJS-$(CONFIG_HEVC_DECODER) += loongarch/hevcdsp_lsx.o \
loongarch/hevc_lpf_sao_lsx.o \
loongarch/hevc_mc_bi_lsx.o \
loongarch/hevc_mc_uni_lsx.o \
- loongarch/hevc_mc_uniw_lsx.o
+ loongarch/hevc_mc_uniw_lsx.o \
+ loongarch/hevc_add_res.o
LSX-OBJS-$(CONFIG_H264DSP) += loongarch/h264idct.o \
loongarch/h264idct_loongarch.o \
loongarch/h264dsp.o
diff --git a/libavcodec/loongarch/hevc_add_res.S b/libavcodec/loongarch/hevc_add_res.S
new file mode 100644
index 0000000000..dd2d820af8
--- /dev/null
+++ b/libavcodec/loongarch/hevc_add_res.S
@@ -0,0 +1,162 @@
+/*
+ * Loongson LSX optimized add_residual functions for HEVC decoding
+ *
+ * Copyright (c) 2023 Loongson Technology Corporation Limited
+ * Contributed by jinbo <jinbo@loongson.cn>
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "loongson_asm.S"
+
+/*
+ * void ff_hevc_add_residual4x4_lsx(uint8_t *dst, const int16_t *res, ptrdiff_t stride)
+ */
+.macro ADD_RES_LSX_4x4_8
+ vldrepl.w vr0, a0, 0
+ add.d t0, a0, a2
+ vldrepl.w vr1, t0, 0
+ vld vr2, a1, 0
+
+ vilvl.w vr1, vr1, vr0
+ vsllwil.hu.bu vr1, vr1, 0
+ vadd.h vr1, vr1, vr2
+ vssrani.bu.h vr1, vr1, 0
+
+ vstelm.w vr1, a0, 0, 0
+ vstelm.w vr1, t0, 0, 1
+.endm
+
+function ff_hevc_add_residual4x4_8_lsx
+ ADD_RES_LSX_4x4_8
+ alsl.d a0, a2, a0, 1
+ addi.d a1, a1, 16
+ ADD_RES_LSX_4x4_8
+endfunc
+
+/*
+ * void ff_hevc_add_residual8x8_8_lsx(uint8_t *dst, const int16_t *res, ptrdiff_t stride)
+ */
+.macro ADD_RES_LSX_8x8_8
+ vldrepl.d vr0, a0, 0
+ add.d t0, a0, a2
+ vldrepl.d vr1, t0, 0
+ add.d t1, t0, a2
+ vldrepl.d vr2, t1, 0
+ add.d t2, t1, a2
+ vldrepl.d vr3, t2, 0
+
+ vld vr4, a1, 0
+ addi.d t3, zero, 16
+ vldx vr5, a1, t3
+ addi.d t4, a1, 32
+ vld vr6, t4, 0
+ vldx vr7, t4, t3
+
+ vsllwil.hu.bu vr0, vr0, 0
+ vsllwil.hu.bu vr1, vr1, 0
+ vsllwil.hu.bu vr2, vr2, 0
+ vsllwil.hu.bu vr3, vr3, 0
+ vadd.h vr0, vr0, vr4
+ vadd.h vr1, vr1, vr5
+ vadd.h vr2, vr2, vr6
+ vadd.h vr3, vr3, vr7
+ vssrani.bu.h vr1, vr0, 0
+ vssrani.bu.h vr3, vr2, 0
+
+ vstelm.d vr1, a0, 0, 0
+ vstelm.d vr1, t0, 0, 1
+ vstelm.d vr3, t1, 0, 0
+ vstelm.d vr3, t2, 0, 1
+.endm
+
+function ff_hevc_add_residual8x8_8_lsx
+ ADD_RES_LSX_8x8_8
+ alsl.d a0, a2, a0, 2
+ addi.d a1, a1, 64
+ ADD_RES_LSX_8x8_8
+endfunc
+
+/*
+ * void ff_hevc_add_residual16x16_8_lsx(uint8_t *dst, const int16_t *res, ptrdiff_t stride)
+ */
+function ff_hevc_add_residual16x16_8_lsx
+.rept 8
+ vld vr0, a0, 0
+ vldx vr2, a0, a2
+
+ vld vr4, a1, 0
+ addi.d t0, zero, 16
+ vldx vr5, a1, t0
+ addi.d t1, a1, 32
+ vld vr6, t1, 0
+ vldx vr7, t1, t0
+
+ vexth.hu.bu vr1, vr0
+ vsllwil.hu.bu vr0, vr0, 0
+ vexth.hu.bu vr3, vr2
+ vsllwil.hu.bu vr2, vr2, 0
+ vadd.h vr0, vr0, vr4
+ vadd.h vr1, vr1, vr5
+ vadd.h vr2, vr2, vr6
+ vadd.h vr3, vr3, vr7
+
+ vssrani.bu.h vr1, vr0, 0
+ vssrani.bu.h vr3, vr2, 0
+
+ vst vr1, a0, 0
+ vstx vr3, a0, a2
+
+ alsl.d a0, a2, a0, 1
+ addi.d a1, a1, 64
+.endr
+endfunc
+
+/*
+ * void ff_hevc_add_residual32x32_8_lsx(uint8_t *dst, const int16_t *res, ptrdiff_t stride)
+ */
+function ff_hevc_add_residual32x32_8_lsx
+.rept 32
+ vld vr0, a0, 0
+ addi.w t0, zero, 16
+ vldx vr2, a0, t0
+
+ vld vr4, a1, 0
+ vldx vr5, a1, t0
+ addi.d t1, a1, 32
+ vld vr6, t1, 0
+ vldx vr7, t1, t0
+
+ vexth.hu.bu vr1, vr0
+ vsllwil.hu.bu vr0, vr0, 0
+ vexth.hu.bu vr3, vr2
+ vsllwil.hu.bu vr2, vr2, 0
+ vadd.h vr0, vr0, vr4
+ vadd.h vr1, vr1, vr5
+ vadd.h vr2, vr2, vr6
+ vadd.h vr3, vr3, vr7
+
+ vssrani.bu.h vr1, vr0, 0
+ vssrani.bu.h vr3, vr2, 0
+
+ vst vr1, a0, 0
+ vstx vr3, a0, t0
+
+ add.d a0, a0, a2
+ addi.d a1, a1, 64
+.endr
+endfunc
diff --git a/libavcodec/loongarch/hevcdsp_init_loongarch.c b/libavcodec/loongarch/hevcdsp_init_loongarch.c
index 5a96f3a4c9..a8f753dc86 100644
--- a/libavcodec/loongarch/hevcdsp_init_loongarch.c
+++ b/libavcodec/loongarch/hevcdsp_init_loongarch.c
@@ -189,6 +189,11 @@ void ff_hevc_dsp_init_loongarch(HEVCDSPContext *c, const int bit_depth)
c->idct[1] = ff_hevc_idct_8x8_lsx;
c->idct[2] = ff_hevc_idct_16x16_lsx;
c->idct[3] = ff_hevc_idct_32x32_lsx;
+
+ c->add_residual[0] = ff_hevc_add_residual4x4_8_lsx;
+ c->add_residual[1] = ff_hevc_add_residual8x8_8_lsx;
+ c->add_residual[2] = ff_hevc_add_residual16x16_8_lsx;
+ c->add_residual[3] = ff_hevc_add_residual32x32_8_lsx;
}
}
}
diff --git a/libavcodec/loongarch/hevcdsp_lsx.h b/libavcodec/loongarch/hevcdsp_lsx.h
index 0d54196caf..ac509984fd 100644
--- a/libavcodec/loongarch/hevcdsp_lsx.h
+++ b/libavcodec/loongarch/hevcdsp_lsx.h
@@ -227,4 +227,9 @@ void ff_hevc_idct_8x8_lsx(int16_t *coeffs, int col_limit);
void ff_hevc_idct_16x16_lsx(int16_t *coeffs, int col_limit);
void ff_hevc_idct_32x32_lsx(int16_t *coeffs, int col_limit);
+void ff_hevc_add_residual4x4_8_lsx(uint8_t *dst, const int16_t *res, ptrdiff_t stride);
+void ff_hevc_add_residual8x8_8_lsx(uint8_t *dst, const int16_t *res, ptrdiff_t stride);
+void ff_hevc_add_residual16x16_8_lsx(uint8_t *dst, const int16_t *res, ptrdiff_t stride);
+void ff_hevc_add_residual32x32_8_lsx(uint8_t *dst, const int16_t *res, ptrdiff_t stride);
+
#endif // #ifndef AVCODEC_LOONGARCH_HEVCDSP_LSX_H
--
2.20.1
* [FFmpeg-devel] [PATCH v2 3/7] avcodec/hevc: Add pel_uni_w_pixels4/6/8/12/16/24/32/48/64 asm opt
2023-12-27 4:50 [FFmpeg-devel] [PATCH v2] [loongarch] Add hevc 128-bit & 256-bit asm optimizations jinbo
2023-12-27 4:50 ` [FFmpeg-devel] [PATCH v2 1/7] avcodec/hevc: Add init for sao_edge_filter jinbo
2023-12-27 4:50 ` [FFmpeg-devel] [PATCH v2 2/7] avcodec/hevc: Add add_residual_4/8/16/32 asm opt jinbo
@ 2023-12-27 4:50 ` jinbo
2023-12-28 2:18 ` yinshiyou-hf
2023-12-27 4:50 ` [FFmpeg-devel] [PATCH v2 4/7] avcodec/hevc: Add qpel_uni_w_v|h4/6/8/12/16/24/32/48/64 " jinbo
` (3 subsequent siblings)
6 siblings, 1 reply; 10+ messages in thread
From: jinbo @ 2023-12-27 4:50 UTC (permalink / raw)
To: ffmpeg-devel; +Cc: jinbo
tests/checkasm/checkasm:            C       LSX     LASX
put_hevc_pel_uni_w_pixels4_8_c:     2.7     1.0
put_hevc_pel_uni_w_pixels6_8_c:     6.2     2.0     1.5
put_hevc_pel_uni_w_pixels8_8_c:     10.7    2.5     1.7
put_hevc_pel_uni_w_pixels12_8_c:    23.0    5.5     5.0
put_hevc_pel_uni_w_pixels16_8_c:    41.0    8.2     5.0
put_hevc_pel_uni_w_pixels24_8_c:    91.0    19.7    13.2
put_hevc_pel_uni_w_pixels32_8_c:    161.7   32.5    16.2
put_hevc_pel_uni_w_pixels48_8_c:    354.5   73.7    43.0
put_hevc_pel_uni_w_pixels64_8_c:    641.5   130.0   64.2
Speedup of decoding H265 4K 30FPS 30Mbps on 3A6000 with
8 threads is 1fps (47fps-->48fps).
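For context, a hedged scalar sketch of the 8-bit weighted uni-prediction
copy these functions implement, mirroring what LOAD_VAR sets up below
(shift = denom + 6, offset = 1 << (shift - 1)); av_clip_uint8() is
assumed from libavutil, and mx/my are unused in the pure pixel-copy case.

    static void pel_uni_w_pixels_8_ref(uint8_t *dst, ptrdiff_t dst_stride,
                                       const uint8_t *src, ptrdiff_t src_stride,
                                       int height, int denom, int wx, int ox,
                                       int width)
    {
        const int shift  = denom + 6;        /* denom + 14 - bit_depth(8) */
        const int offset = 1 << (shift - 1); /* rounding term             */

        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++)
                dst[x] = av_clip_uint8((((src[x] << 6) * wx + offset) >> shift) + ox);
            dst += dst_stride;
            src += src_stride;
        }
    }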
---
libavcodec/loongarch/Makefile | 3 +-
libavcodec/loongarch/hevc_mc.S | 471 ++++++++++++++++++
libavcodec/loongarch/hevcdsp_init_loongarch.c | 43 ++
libavcodec/loongarch/hevcdsp_lasx.h | 53 ++
libavcodec/loongarch/hevcdsp_lsx.h | 27 +
5 files changed, 596 insertions(+), 1 deletion(-)
create mode 100644 libavcodec/loongarch/hevc_mc.S
create mode 100644 libavcodec/loongarch/hevcdsp_lasx.h
diff --git a/libavcodec/loongarch/Makefile b/libavcodec/loongarch/Makefile
index 07ea97f803..ad98cd4054 100644
--- a/libavcodec/loongarch/Makefile
+++ b/libavcodec/loongarch/Makefile
@@ -28,7 +28,8 @@ LSX-OBJS-$(CONFIG_HEVC_DECODER) += loongarch/hevcdsp_lsx.o \
loongarch/hevc_mc_bi_lsx.o \
loongarch/hevc_mc_uni_lsx.o \
loongarch/hevc_mc_uniw_lsx.o \
- loongarch/hevc_add_res.o
+ loongarch/hevc_add_res.o \
+ loongarch/hevc_mc.o
LSX-OBJS-$(CONFIG_H264DSP) += loongarch/h264idct.o \
loongarch/h264idct_loongarch.o \
loongarch/h264dsp.o
diff --git a/libavcodec/loongarch/hevc_mc.S b/libavcodec/loongarch/hevc_mc.S
new file mode 100644
index 0000000000..c5d553effe
--- /dev/null
+++ b/libavcodec/loongarch/hevc_mc.S
@@ -0,0 +1,471 @@
+/*
+ * Copyright (c) 2023 Loongson Technology Corporation Limited
+ * Contributed by jinbo <jinbo@loongson.cn>
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "loongson_asm.S"
+
+.macro LOAD_VAR bit
+ addi.w t1, a5, 6 //shift
+ addi.w t3, zero, 1 //one
+ sub.w t4, t1, t3
+ sll.w t3, t3, t4 //offset
+.if \bit == 128
+ vreplgr2vr.w vr1, a6 //wx
+ vreplgr2vr.w vr2, t3 //offset
+ vreplgr2vr.w vr3, t1 //shift
+ vreplgr2vr.w vr4, a7 //ox
+.else
+ xvreplgr2vr.w xr1, a6
+ xvreplgr2vr.w xr2, t3
+ xvreplgr2vr.w xr3, t1
+ xvreplgr2vr.w xr4, a7
+.endif
+.endm
+
+.macro HEVC_PEL_UNI_W_PIXELS8_LSX src0, dst0, w
+ vldrepl.d vr0, \src0, 0
+ vsllwil.hu.bu vr0, vr0, 0
+ vexth.wu.hu vr5, vr0
+ vsllwil.wu.hu vr0, vr0, 0
+ vslli.w vr0, vr0, 6
+ vslli.w vr5, vr5, 6
+ vmul.w vr0, vr0, vr1
+ vmul.w vr5, vr5, vr1
+ vadd.w vr0, vr0, vr2
+ vadd.w vr5, vr5, vr2
+ vsra.w vr0, vr0, vr3
+ vsra.w vr5, vr5, vr3
+ vadd.w vr0, vr0, vr4
+ vadd.w vr5, vr5, vr4
+ vssrani.h.w vr5, vr0, 0
+ vssrani.bu.h vr5, vr5, 0
+.if \w == 6
+ fst.s f5, \dst0, 0
+ vstelm.h vr5, \dst0, 4, 2
+.else
+ fst.d f5, \dst0, 0
+.endif
+.endm
+
+.macro HEVC_PEL_UNI_W_PIXELS8x2_LASX src0, dst0, w
+ vldrepl.d vr0, \src0, 0
+ add.d t2, \src0, a3
+ vldrepl.d vr5, t2, 0
+ xvpermi.q xr0, xr5, 0x02
+ xvsllwil.hu.bu xr0, xr0, 0
+ xvexth.wu.hu xr5, xr0
+ xvsllwil.wu.hu xr0, xr0, 0
+ xvslli.w xr0, xr0, 6
+ xvslli.w xr5, xr5, 6
+ xvmul.w xr0, xr0, xr1
+ xvmul.w xr5, xr5, xr1
+ xvadd.w xr0, xr0, xr2
+ xvadd.w xr5, xr5, xr2
+ xvsra.w xr0, xr0, xr3
+ xvsra.w xr5, xr5, xr3
+ xvadd.w xr0, xr0, xr4
+ xvadd.w xr5, xr5, xr4
+ xvssrani.h.w xr5, xr0, 0
+ xvpermi.q xr0, xr5, 0x01
+ xvssrani.bu.h xr0, xr5, 0
+ add.d t3, \dst0, a1
+.if \w == 6
+ vstelm.w vr0, \dst0, 0, 0
+ vstelm.h vr0, \dst0, 4, 2
+ vstelm.w vr0, t3, 0, 2
+ vstelm.h vr0, t3, 4, 6
+.else
+ vstelm.d vr0, \dst0, 0, 0
+ vstelm.d vr0, t3, 0, 1
+.endif
+.endm
+
+.macro HEVC_PEL_UNI_W_PIXELS16_LSX src0, dst0
+ vld vr0, \src0, 0
+ vexth.hu.bu vr7, vr0
+ vexth.wu.hu vr8, vr7
+ vsllwil.wu.hu vr7, vr7, 0
+ vsllwil.hu.bu vr5, vr0, 0
+ vexth.wu.hu vr6, vr5
+ vsllwil.wu.hu vr5, vr5, 0
+ vslli.w vr5, vr5, 6
+ vslli.w vr6, vr6, 6
+ vslli.w vr7, vr7, 6
+ vslli.w vr8, vr8, 6
+ vmul.w vr5, vr5, vr1
+ vmul.w vr6, vr6, vr1
+ vmul.w vr7, vr7, vr1
+ vmul.w vr8, vr8, vr1
+ vadd.w vr5, vr5, vr2
+ vadd.w vr6, vr6, vr2
+ vadd.w vr7, vr7, vr2
+ vadd.w vr8, vr8, vr2
+ vsra.w vr5, vr5, vr3
+ vsra.w vr6, vr6, vr3
+ vsra.w vr7, vr7, vr3
+ vsra.w vr8, vr8, vr3
+ vadd.w vr5, vr5, vr4
+ vadd.w vr6, vr6, vr4
+ vadd.w vr7, vr7, vr4
+ vadd.w vr8, vr8, vr4
+ vssrani.h.w vr6, vr5, 0
+ vssrani.h.w vr8, vr7, 0
+ vssrani.bu.h vr8, vr6, 0
+ vst vr8, \dst0, 0
+.endm
+
+.macro HEVC_PEL_UNI_W_PIXELS16_LASX src0, dst0
+ vld vr0, \src0, 0
+ xvpermi.d xr0, xr0, 0xd8
+ xvsllwil.hu.bu xr0, xr0, 0
+ xvexth.wu.hu xr6, xr0
+ xvsllwil.wu.hu xr5, xr0, 0
+ xvslli.w xr5, xr5, 6
+ xvslli.w xr6, xr6, 6
+ xvmul.w xr5, xr5, xr1
+ xvmul.w xr6, xr6, xr1
+ xvadd.w xr5, xr5, xr2
+ xvadd.w xr6, xr6, xr2
+ xvsra.w xr5, xr5, xr3
+ xvsra.w xr6, xr6, xr3
+ xvadd.w xr5, xr5, xr4
+ xvadd.w xr6, xr6, xr4
+ xvssrani.h.w xr6, xr5, 0
+ xvpermi.q xr7, xr6, 0x01
+ xvssrani.bu.h xr7, xr6, 0
+ vst vr7, \dst0, 0
+.endm
+
+.macro HEVC_PEL_UNI_W_PIXELS32_LASX src0, dst0, w
+.if \w == 16
+ vld vr0, \src0, 0
+ add.d t2, \src0, a3
+ vld vr5, t2, 0
+ xvpermi.q xr0, xr5, 0x02
+.else //w=24/32
+ xvld xr0, \src0, 0
+.endif
+ xvexth.hu.bu xr7, xr0
+ xvexth.wu.hu xr8, xr7
+ xvsllwil.wu.hu xr7, xr7, 0
+ xvsllwil.hu.bu xr5, xr0, 0
+ xvexth.wu.hu xr6, xr5
+ xvsllwil.wu.hu xr5, xr5, 0
+ xvslli.w xr5, xr5, 6
+ xvslli.w xr6, xr6, 6
+ xvslli.w xr7, xr7, 6
+ xvslli.w xr8, xr8, 6
+ xvmul.w xr5, xr5, xr1
+ xvmul.w xr6, xr6, xr1
+ xvmul.w xr7, xr7, xr1
+ xvmul.w xr8, xr8, xr1
+ xvadd.w xr5, xr5, xr2
+ xvadd.w xr6, xr6, xr2
+ xvadd.w xr7, xr7, xr2
+ xvadd.w xr8, xr8, xr2
+ xvsra.w xr5, xr5, xr3
+ xvsra.w xr6, xr6, xr3
+ xvsra.w xr7, xr7, xr3
+ xvsra.w xr8, xr8, xr3
+ xvadd.w xr5, xr5, xr4
+ xvadd.w xr6, xr6, xr4
+ xvadd.w xr7, xr7, xr4
+ xvadd.w xr8, xr8, xr4
+ xvssrani.h.w xr6, xr5, 0
+ xvssrani.h.w xr8, xr7, 0
+ xvssrani.bu.h xr8, xr6, 0
+.if \w == 16
+ vst vr8, \dst0, 0
+ add.d t2, \dst0, a1
+ xvpermi.q xr8, xr8, 0x01
+ vst vr8, t2, 0
+.elseif \w == 24
+ vst vr8, \dst0, 0
+ xvstelm.d xr8, \dst0, 16, 2
+.else
+ xvst xr8, \dst0, 0
+.endif
+.endm
+
+function ff_hevc_put_hevc_pel_uni_w_pixels4_8_lsx
+ LOAD_VAR 128
+ srli.w t0, a4, 1
+.LOOP_PIXELS4:
+ vldrepl.w vr0, a2, 0
+ add.d t1, a2, a3
+ vldrepl.w vr5, t1, 0
+ vsllwil.hu.bu vr0, vr0, 0
+ vsllwil.wu.hu vr0, vr0, 0
+ vsllwil.hu.bu vr5, vr5, 0
+ vsllwil.wu.hu vr5, vr5, 0
+ vslli.w vr0, vr0, 6
+ vslli.w vr5, vr5, 6
+ vmul.w vr0, vr0, vr1
+ vmul.w vr5, vr5, vr1
+ vadd.w vr0, vr0, vr2
+ vadd.w vr5, vr5, vr2
+ vsra.w vr0, vr0, vr3
+ vsra.w vr5, vr5, vr3
+ vadd.w vr0, vr0, vr4
+ vadd.w vr5, vr5, vr4
+ vssrani.h.w vr5, vr0, 0
+ vssrani.bu.h vr5, vr5, 0
+ fst.s f5, a0, 0
+ add.d t2, a0, a1
+ vstelm.w vr5, t2, 0, 1
+ alsl.d a2, a3, a2, 1
+ alsl.d a0, a1, a0, 1
+ addi.w t0, t0, -1
+ bnez t0, .LOOP_PIXELS4
+endfunc
+
+function ff_hevc_put_hevc_pel_uni_w_pixels6_8_lsx
+ LOAD_VAR 128
+.LOOP_PIXELS6:
+ HEVC_PEL_UNI_W_PIXELS8_LSX a2, a0, 6
+ add.d a2, a2, a3
+ add.d a0, a0, a1
+ addi.w a4, a4, -1
+ bnez a4, .LOOP_PIXELS6
+endfunc
+
+function ff_hevc_put_hevc_pel_uni_w_pixels6_8_lasx
+ LOAD_VAR 256
+ srli.w t0, a4, 1
+.LOOP_PIXELS6_LASX:
+ HEVC_PEL_UNI_W_PIXELS8x2_LASX a2, a0, 6
+ alsl.d a2, a3, a2, 1
+ alsl.d a0, a1, a0, 1
+ addi.w t0, t0, -1
+ bnez t0, .LOOP_PIXELS6_LASX
+endfunc
+
+function ff_hevc_put_hevc_pel_uni_w_pixels8_8_lsx
+ LOAD_VAR 128
+.LOOP_PIXELS8:
+ HEVC_PEL_UNI_W_PIXELS8_LSX a2, a0, 8
+ add.d a2, a2, a3
+ add.d a0, a0, a1
+ addi.w a4, a4, -1
+ bnez a4, .LOOP_PIXELS8
+endfunc
+
+function ff_hevc_put_hevc_pel_uni_w_pixels8_8_lasx
+ LOAD_VAR 256
+ srli.w t0, a4, 1
+.LOOP_PIXELS8_LASX:
+ HEVC_PEL_UNI_W_PIXELS8x2_LASX a2, a0, 8
+ alsl.d a2, a3, a2, 1
+ alsl.d a0, a1, a0, 1
+ addi.w t0, t0, -1
+ bnez t0, .LOOP_PIXELS8_LASX
+endfunc
+
+function ff_hevc_put_hevc_pel_uni_w_pixels12_8_lsx
+ LOAD_VAR 128
+.LOOP_PIXELS12:
+ vld vr0, a2, 0
+ vexth.hu.bu vr7, vr0
+ vsllwil.wu.hu vr7, vr7, 0
+ vsllwil.hu.bu vr5, vr0, 0
+ vexth.wu.hu vr6, vr5
+ vsllwil.wu.hu vr5, vr5, 0
+ vslli.w vr5, vr5, 6
+ vslli.w vr6, vr6, 6
+ vslli.w vr7, vr7, 6
+ vmul.w vr5, vr5, vr1
+ vmul.w vr6, vr6, vr1
+ vmul.w vr7, vr7, vr1
+ vadd.w vr5, vr5, vr2
+ vadd.w vr6, vr6, vr2
+ vadd.w vr7, vr7, vr2
+ vsra.w vr5, vr5, vr3
+ vsra.w vr6, vr6, vr3
+ vsra.w vr7, vr7, vr3
+ vadd.w vr5, vr5, vr4
+ vadd.w vr6, vr6, vr4
+ vadd.w vr7, vr7, vr4
+ vssrani.h.w vr6, vr5, 0
+ vssrani.h.w vr7, vr7, 0
+ vssrani.bu.h vr7, vr6, 0
+ fst.d f7, a0, 0
+ vstelm.w vr7, a0, 8, 2
+ add.d a2, a2, a3
+ add.d a0, a0, a1
+ addi.w a4, a4, -1
+ bnez a4, .LOOP_PIXELS12
+endfunc
+
+function ff_hevc_put_hevc_pel_uni_w_pixels12_8_lasx
+ LOAD_VAR 256
+.LOOP_PIXELS12_LASX:
+ vld vr0, a2, 0
+ xvpermi.d xr0, xr0, 0xd8
+ xvsllwil.hu.bu xr0, xr0, 0
+ xvexth.wu.hu xr6, xr0
+ xvsllwil.wu.hu xr5, xr0, 0
+ xvslli.w xr5, xr5, 6
+ xvslli.w xr6, xr6, 6
+ xvmul.w xr5, xr5, xr1
+ xvmul.w xr6, xr6, xr1
+ xvadd.w xr5, xr5, xr2
+ xvadd.w xr6, xr6, xr2
+ xvsra.w xr5, xr5, xr3
+ xvsra.w xr6, xr6, xr3
+ xvadd.w xr5, xr5, xr4
+ xvadd.w xr6, xr6, xr4
+ xvssrani.h.w xr6, xr5, 0
+ xvpermi.q xr7, xr6, 0x01
+ xvssrani.bu.h xr7, xr6, 0
+ fst.d f7, a0, 0
+ vstelm.w vr7, a0, 8, 2
+ add.d a2, a2, a3
+ add.d a0, a0, a1
+ addi.w a4, a4, -1
+ bnez a4, .LOOP_PIXELS12_LASX
+endfunc
+
+function ff_hevc_put_hevc_pel_uni_w_pixels16_8_lsx
+ LOAD_VAR 128
+.LOOP_PIXELS16:
+ HEVC_PEL_UNI_W_PIXELS16_LSX a2, a0
+ add.d a2, a2, a3
+ add.d a0, a0, a1
+ addi.w a4, a4, -1
+ bnez a4, .LOOP_PIXELS16
+endfunc
+
+function ff_hevc_put_hevc_pel_uni_w_pixels16_8_lasx
+ LOAD_VAR 256
+ srli.w t0, a4, 1
+.LOOP_PIXELS16_LASX:
+ HEVC_PEL_UNI_W_PIXELS32_LASX a2, a0, 16
+ alsl.d a2, a3, a2, 1
+ alsl.d a0, a1, a0, 1
+ addi.w t0, t0, -1
+ bnez t0, .LOOP_PIXELS16_LASX
+endfunc
+
+function ff_hevc_put_hevc_pel_uni_w_pixels24_8_lsx
+ LOAD_VAR 128
+.LOOP_PIXELS24:
+ HEVC_PEL_UNI_W_PIXELS16_LSX a2, a0
+ addi.d t0, a2, 16
+ addi.d t1, a0, 16
+ HEVC_PEL_UNI_W_PIXELS8_LSX t0, t1, 8
+ add.d a2, a2, a3
+ add.d a0, a0, a1
+ addi.w a4, a4, -1
+ bnez a4, .LOOP_PIXELS24
+endfunc
+
+function ff_hevc_put_hevc_pel_uni_w_pixels24_8_lasx
+ LOAD_VAR 256
+.LOOP_PIXELS24_LASX:
+ HEVC_PEL_UNI_W_PIXELS32_LASX a2, a0, 24
+ add.d a2, a2, a3
+ add.d a0, a0, a1
+ addi.w a4, a4, -1
+ bnez a4, .LOOP_PIXELS24_LASX
+endfunc
+
+function ff_hevc_put_hevc_pel_uni_w_pixels32_8_lsx
+ LOAD_VAR 128
+.LOOP_PIXELS32:
+ HEVC_PEL_UNI_W_PIXELS16_LSX a2, a0
+ addi.d t0, a2, 16
+ addi.d t1, a0, 16
+ HEVC_PEL_UNI_W_PIXELS16_LSX t0, t1
+ add.d a2, a2, a3
+ add.d a0, a0, a1
+ addi.w a4, a4, -1
+ bnez a4, .LOOP_PIXELS32
+endfunc
+
+function ff_hevc_put_hevc_pel_uni_w_pixels32_8_lasx
+ LOAD_VAR 256
+.LOOP_PIXELS32_LASX:
+ HEVC_PEL_UNI_W_PIXELS32_LASX a2, a0, 32
+ add.d a2, a2, a3
+ add.d a0, a0, a1
+ addi.w a4, a4, -1
+ bnez a4, .LOOP_PIXELS32_LASX
+endfunc
+
+function ff_hevc_put_hevc_pel_uni_w_pixels48_8_lsx
+ LOAD_VAR 128
+.LOOP_PIXELS48:
+ HEVC_PEL_UNI_W_PIXELS16_LSX a2, a0
+ addi.d t0, a2, 16
+ addi.d t1, a0, 16
+ HEVC_PEL_UNI_W_PIXELS16_LSX t0, t1
+ addi.d t0, a2, 32
+ addi.d t1, a0, 32
+ HEVC_PEL_UNI_W_PIXELS16_LSX t0, t1
+ add.d a2, a2, a3
+ add.d a0, a0, a1
+ addi.w a4, a4, -1
+ bnez a4, .LOOP_PIXELS48
+endfunc
+
+function ff_hevc_put_hevc_pel_uni_w_pixels48_8_lasx
+ LOAD_VAR 256
+.LOOP_PIXELS48_LASX:
+ HEVC_PEL_UNI_W_PIXELS32_LASX a2, a0, 32
+ addi.d t0, a2, 32
+ addi.d t1, a0, 32
+ HEVC_PEL_UNI_W_PIXELS16_LASX t0, t1
+ add.d a2, a2, a3
+ add.d a0, a0, a1
+ addi.w a4, a4, -1
+ bnez a4, .LOOP_PIXELS48_LASX
+endfunc
+
+function ff_hevc_put_hevc_pel_uni_w_pixels64_8_lsx
+ LOAD_VAR 128
+.LOOP_PIXELS64:
+ HEVC_PEL_UNI_W_PIXELS16_LSX a2, a0
+ addi.d t0, a2, 16
+ addi.d t1, a0, 16
+ HEVC_PEL_UNI_W_PIXELS16_LSX t0, t1
+ addi.d t0, a2, 32
+ addi.d t1, a0, 32
+ HEVC_PEL_UNI_W_PIXELS16_LSX t0, t1
+ addi.d t0, a2, 48
+ addi.d t1, a0, 48
+ HEVC_PEL_UNI_W_PIXELS16_LSX t0, t1
+ add.d a2, a2, a3
+ add.d a0, a0, a1
+ addi.w a4, a4, -1
+ bnez a4, .LOOP_PIXELS64
+endfunc
+
+function ff_hevc_put_hevc_pel_uni_w_pixels64_8_lasx
+ LOAD_VAR 256
+.LOOP_PIXELS64_LASX:
+ HEVC_PEL_UNI_W_PIXELS32_LASX a2, a0, 32
+ addi.d t0, a2, 32
+ addi.d t1, a0, 32
+ HEVC_PEL_UNI_W_PIXELS32_LASX t0, t1, 32
+ add.d a2, a2, a3
+ add.d a0, a0, a1
+ addi.w a4, a4, -1
+ bnez a4, .LOOP_PIXELS64_LASX
+endfunc
diff --git a/libavcodec/loongarch/hevcdsp_init_loongarch.c b/libavcodec/loongarch/hevcdsp_init_loongarch.c
index a8f753dc86..d0ee99d6b5 100644
--- a/libavcodec/loongarch/hevcdsp_init_loongarch.c
+++ b/libavcodec/loongarch/hevcdsp_init_loongarch.c
@@ -22,6 +22,7 @@
#include "libavutil/loongarch/cpu.h"
#include "hevcdsp_lsx.h"
+#include "hevcdsp_lasx.h"
void ff_hevc_dsp_init_loongarch(HEVCDSPContext *c, const int bit_depth)
{
@@ -160,6 +161,26 @@ void ff_hevc_dsp_init_loongarch(HEVCDSPContext *c, const int bit_depth)
c->put_hevc_epel_uni[6][1][1] = ff_hevc_put_hevc_uni_epel_hv24_8_lsx;
c->put_hevc_epel_uni[7][1][1] = ff_hevc_put_hevc_uni_epel_hv32_8_lsx;
+ c->put_hevc_qpel_uni_w[1][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels4_8_lsx;
+ c->put_hevc_qpel_uni_w[2][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels6_8_lsx;
+ c->put_hevc_qpel_uni_w[3][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels8_8_lsx;
+ c->put_hevc_qpel_uni_w[4][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels12_8_lsx;
+ c->put_hevc_qpel_uni_w[5][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels16_8_lsx;
+ c->put_hevc_qpel_uni_w[6][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels24_8_lsx;
+ c->put_hevc_qpel_uni_w[7][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels32_8_lsx;
+ c->put_hevc_qpel_uni_w[8][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels48_8_lsx;
+ c->put_hevc_qpel_uni_w[9][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels64_8_lsx;
+
+ c->put_hevc_epel_uni_w[1][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels4_8_lsx;
+ c->put_hevc_epel_uni_w[2][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels6_8_lsx;
+ c->put_hevc_epel_uni_w[3][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels8_8_lsx;
+ c->put_hevc_epel_uni_w[4][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels12_8_lsx;
+ c->put_hevc_epel_uni_w[5][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels16_8_lsx;
+ c->put_hevc_epel_uni_w[6][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels24_8_lsx;
+ c->put_hevc_epel_uni_w[7][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels32_8_lsx;
+ c->put_hevc_epel_uni_w[8][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels48_8_lsx;
+ c->put_hevc_epel_uni_w[9][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels64_8_lsx;
+
c->put_hevc_qpel_uni_w[3][1][1] = ff_hevc_put_hevc_uni_w_qpel_hv8_8_lsx;
c->put_hevc_qpel_uni_w[5][1][1] = ff_hevc_put_hevc_uni_w_qpel_hv16_8_lsx;
c->put_hevc_qpel_uni_w[6][1][1] = ff_hevc_put_hevc_uni_w_qpel_hv24_8_lsx;
@@ -196,4 +217,26 @@ void ff_hevc_dsp_init_loongarch(HEVCDSPContext *c, const int bit_depth)
c->add_residual[3] = ff_hevc_add_residual32x32_8_lsx;
}
}
+
+ if (have_lasx(cpu_flags)) {
+ if (bit_depth == 8) {
+ c->put_hevc_qpel_uni_w[2][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels6_8_lasx;
+ c->put_hevc_qpel_uni_w[3][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels8_8_lasx;
+ c->put_hevc_qpel_uni_w[4][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels12_8_lasx;
+ c->put_hevc_qpel_uni_w[5][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels16_8_lasx;
+ c->put_hevc_qpel_uni_w[6][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels24_8_lasx;
+ c->put_hevc_qpel_uni_w[7][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels32_8_lasx;
+ c->put_hevc_qpel_uni_w[8][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels48_8_lasx;
+ c->put_hevc_qpel_uni_w[9][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels64_8_lasx;
+
+ c->put_hevc_epel_uni_w[2][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels6_8_lasx;
+ c->put_hevc_epel_uni_w[3][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels8_8_lasx;
+ c->put_hevc_epel_uni_w[4][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels12_8_lasx;
+ c->put_hevc_epel_uni_w[5][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels16_8_lasx;
+ c->put_hevc_epel_uni_w[6][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels24_8_lasx;
+ c->put_hevc_epel_uni_w[7][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels32_8_lasx;
+ c->put_hevc_epel_uni_w[8][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels48_8_lasx;
+ c->put_hevc_epel_uni_w[9][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels64_8_lasx;
+ }
+ }
}
diff --git a/libavcodec/loongarch/hevcdsp_lasx.h b/libavcodec/loongarch/hevcdsp_lasx.h
new file mode 100644
index 0000000000..819c3c3ecf
--- /dev/null
+++ b/libavcodec/loongarch/hevcdsp_lasx.h
@@ -0,0 +1,53 @@
+/*
+ * Copyright (c) 2023 Loongson Technology Corporation Limited
+ * Contributed by jinbo <jinbo@loongson.cn>
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#ifndef AVCODEC_LOONGARCH_HEVCDSP_LASX_H
+#define AVCODEC_LOONGARCH_HEVCDSP_LASX_H
+
+#include "libavcodec/hevcdsp.h"
+
+#define PEL_UNI_W(PEL, DIR, WIDTH) \
+void ff_hevc_put_hevc_##PEL##_uni_w_##DIR##WIDTH##_8_lasx(uint8_t *dst, \
+ ptrdiff_t \
+ dst_stride, \
+ const uint8_t *src, \
+ ptrdiff_t \
+ src_stride, \
+ int height, \
+ int denom, \
+ int wx, \
+ int ox, \
+ intptr_t mx, \
+ intptr_t my, \
+ int width)
+
+PEL_UNI_W(pel, pixels, 6);
+PEL_UNI_W(pel, pixels, 8);
+PEL_UNI_W(pel, pixels, 12);
+PEL_UNI_W(pel, pixels, 16);
+PEL_UNI_W(pel, pixels, 24);
+PEL_UNI_W(pel, pixels, 32);
+PEL_UNI_W(pel, pixels, 48);
+PEL_UNI_W(pel, pixels, 64);
+
+#undef PEL_UNI_W
+
+#endif // #ifndef AVCODEC_LOONGARCH_HEVCDSP_LASX_H
diff --git a/libavcodec/loongarch/hevcdsp_lsx.h b/libavcodec/loongarch/hevcdsp_lsx.h
index ac509984fd..0d724a90ef 100644
--- a/libavcodec/loongarch/hevcdsp_lsx.h
+++ b/libavcodec/loongarch/hevcdsp_lsx.h
@@ -232,4 +232,31 @@ void ff_hevc_add_residual8x8_8_lsx(uint8_t *dst, const int16_t *res, ptrdiff_t s
void ff_hevc_add_residual16x16_8_lsx(uint8_t *dst, const int16_t *res, ptrdiff_t stride);
void ff_hevc_add_residual32x32_8_lsx(uint8_t *dst, const int16_t *res, ptrdiff_t stride);
+#define PEL_UNI_W(PEL, DIR, WIDTH) \
+void ff_hevc_put_hevc_##PEL##_uni_w_##DIR##WIDTH##_8_lsx(uint8_t *dst, \
+ ptrdiff_t \
+ dst_stride, \
+ const uint8_t *src, \
+ ptrdiff_t \
+ src_stride, \
+ int height, \
+ int denom, \
+ int wx, \
+ int ox, \
+ intptr_t mx, \
+ intptr_t my, \
+ int width)
+
+PEL_UNI_W(pel, pixels, 4);
+PEL_UNI_W(pel, pixels, 6);
+PEL_UNI_W(pel, pixels, 8);
+PEL_UNI_W(pel, pixels, 12);
+PEL_UNI_W(pel, pixels, 16);
+PEL_UNI_W(pel, pixels, 24);
+PEL_UNI_W(pel, pixels, 32);
+PEL_UNI_W(pel, pixels, 48);
+PEL_UNI_W(pel, pixels, 64);
+
+#undef PEL_UNI_W
+
#endif // #ifndef AVCODEC_LOONGARCH_HEVCDSP_LSX_H
--
2.20.1
* Re: [FFmpeg-devel] [PATCH v2 3/7] avcodec/hevc: Add pel_uni_w_pixels4/6/8/12/16/24/32/48/64 asm opt
2023-12-27 4:50 ` [FFmpeg-devel] [PATCH v2 3/7] avcodec/hevc: Add pel_uni_w_pixels4/6/8/12/16/24/32/48/64 " jinbo
@ 2023-12-28 2:18 ` yinshiyou-hf
0 siblings, 0 replies; 10+ messages in thread
From: yinshiyou-hf @ 2023-12-28 2:18 UTC (permalink / raw)
To: FFmpeg development discussions and patches; +Cc: jinbo
> -----Original Message-----
> From: jinbo <jinbo@loongson.cn>
> Sent: 2023-12-27 12:50:15 (Wednesday)
> To: ffmpeg-devel@ffmpeg.org
> Cc: jinbo <jinbo@loongson.cn>
> Subject: [FFmpeg-devel] [PATCH v2 3/7] avcodec/hevc: Add pel_uni_w_pixels4/6/8/12/16/24/32/48/64 asm opt
>
> +
> +.macro HEVC_PEL_UNI_W_PIXELS8_LSX src0, dst0, w
> + vldrepl.d vr0, \src0, 0
> + vsllwil.hu.bu vr0, vr0, 0
> + vexth.wu.hu vr5, vr0
> + vsllwil.wu.hu vr0, vr0, 0
> + vslli.w vr0, vr0, 6
> + vslli.w vr5, vr5, 6
> + vmul.w vr0, vr0, vr1
> + vmul.w vr5, vr5, vr1
> + vadd.w vr0, vr0, vr2
> + vadd.w vr5, vr5, vr2
You can use 'vmadd.w' here.
> + vsra.w vr0, vr0, vr3
> + vsra.w vr5, vr5, vr3
> + vadd.w vr0, vr0, vr4
> + vadd.w vr5, vr5, vr4
> + vssrani.h.w vr5, vr0, 0
> + vssrani.bu.h vr5, vr5, 0
> +.if \w == 6
> + fst.s f5, \dst0, 0
> + vstelm.h vr5, \dst0, 4, 2
> +.else
> + fst.d f5, \dst0, 0
> +.endif
> +.endm
> +
> +.macro HEVC_PEL_UNI_W_PIXELS8x2_LASX src0, dst0, w
> + vldrepl.d vr0, \src0, 0
> + add.d t2, \src0, a3
> + vldrepl.d vr5, t2, 0
> + xvpermi.q xr0, xr5, 0x02
> + xvsllwil.hu.bu xr0, xr0, 0
> + xvexth.wu.hu xr5, xr0
> + xvsllwil.wu.hu xr0, xr0, 0
> + xvslli.w xr0, xr0, 6
> + xvslli.w xr5, xr5, 6
> + xvmul.w xr0, xr0, xr1
> + xvmul.w xr5, xr5, xr1
> + xvadd.w xr0, xr0, xr2
> + xvadd.w xr5, xr5, xr2
Using 'vmadd.w' would be better.
> + xvsra.w xr0, xr0, xr3
> + xvsra.w xr5, xr5, xr3
> + xvadd.w xr0, xr0, xr4
> + xvadd.w xr5, xr5, xr4
> + xvssrani.h.w xr5, xr0, 0
> + xvpermi.q xr0, xr5, 0x01
> + xvssrani.bu.h xr0, xr5, 0
> + add.d t3, \dst0, a1
> +.if \w == 6
> + vstelm.w vr0, \dst0, 0, 0
> + vstelm.h vr0, \dst0, 4, 2
> + vstelm.w vr0, t3, 0, 2
> + vstelm.h vr0, t3, 4, 6
> +.else
> + vstelm.d vr0, \dst0, 0, 0
> + vstelm.d vr0, t3, 0, 1
> +.endif
> +.endm
> +
> +.macro HEVC_PEL_UNI_W_PIXELS16_LSX src0, dst0
> + vld vr0, \src0, 0
> + vexth.hu.bu vr7, vr0
> + vexth.wu.hu vr8, vr7
> + vsllwil.wu.hu vr7, vr7, 0
> + vsllwil.hu.bu vr5, vr0, 0
> + vexth.wu.hu vr6, vr5
> + vsllwil.wu.hu vr5, vr5, 0
> + vslli.w vr5, vr5, 6
> + vslli.w vr6, vr6, 6
> + vslli.w vr7, vr7, 6
> + vslli.w vr8, vr8, 6
> + vmul.w vr5, vr5, vr1
> + vmul.w vr6, vr6, vr1
> + vmul.w vr7, vr7, vr1
> + vmul.w vr8, vr8, vr1
> + vadd.w vr5, vr5, vr2
> + vadd.w vr6, vr6, vr2
> + vadd.w vr7, vr7, vr2
> + vadd.w vr8, vr8, vr2
Use 'vmadd.w'; please check the rest of your code for the same pattern.
* [FFmpeg-devel] [PATCH v2 4/7] avcodec/hevc: Add qpel_uni_w_v|h4/6/8/12/16/24/32/48/64 asm opt
2023-12-27 4:50 [FFmpeg-devel] [PATCH v2] [loongarch] Add hevc 128-bit & 256-bit asm optimizations jinbo
` (2 preceding siblings ...)
2023-12-27 4:50 ` [FFmpeg-devel] [PATCH v2 3/7] avcodec/hevc: Add pel_uni_w_pixels4/6/8/12/16/24/32/48/64 " jinbo
@ 2023-12-27 4:50 ` jinbo
2023-12-27 4:50 ` [FFmpeg-devel] [PATCH v2 5/7] avcodec/hevc: Add epel_uni_w_hv4/6/8/12/16/24/32/48/64 " jinbo
` (2 subsequent siblings)
6 siblings, 0 replies; 10+ messages in thread
From: jinbo @ 2023-12-27 4:50 UTC (permalink / raw)
To: ffmpeg-devel; +Cc: jinbo
tests/checkasm/checkasm:        C        LSX      LASX
put_hevc_qpel_uni_w_h4_8_c:     6.5      1.7      1.2
put_hevc_qpel_uni_w_h6_8_c:     14.5     4.5      3.7
put_hevc_qpel_uni_w_h8_8_c:     24.5     5.7      4.5
put_hevc_qpel_uni_w_h12_8_c:    54.7     17.5     12.0
put_hevc_qpel_uni_w_h16_8_c:    96.5     22.7     13.2
put_hevc_qpel_uni_w_h24_8_c:    216.0    51.2     33.2
put_hevc_qpel_uni_w_h32_8_c:    385.7    87.0     53.2
put_hevc_qpel_uni_w_h48_8_c:    860.5    192.0    113.2
put_hevc_qpel_uni_w_h64_8_c:    1531.0   334.2    200.0
put_hevc_qpel_uni_w_v4_8_c:     8.0      1.7
put_hevc_qpel_uni_w_v6_8_c:     17.2     4.5
put_hevc_qpel_uni_w_v8_8_c:     29.5     6.0      5.2
put_hevc_qpel_uni_w_v12_8_c:    65.2     16.0     11.7
put_hevc_qpel_uni_w_v16_8_c:    116.5    20.5     14.0
put_hevc_qpel_uni_w_v24_8_c:    259.2    48.5     37.2
put_hevc_qpel_uni_w_v32_8_c:    459.5    80.5     56.0
put_hevc_qpel_uni_w_v48_8_c:    1028.5   180.2    126.5
put_hevc_qpel_uni_w_v64_8_c:    1831.2   319.2    224.2
Speedup of decoding H265 4K 30FPS 30Mbps on
3A6000 with 8 threads is 4fps (48fps-->52fps).
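For orientation, a hedged scalar sketch of the vertical 8-bit case these
routines vectorize (the horizontal variants are the same with a sample
stride of 1 and mx selecting the filter). The 8 signed taps are read from
ff_hevc_qpel_filters as the assembly does; av_clip_uint8() is assumed from
libavutil, and the table layout is assumed to match the generic C code.

    static void qpel_uni_w_v_8_ref(uint8_t *dst, ptrdiff_t dst_stride,
                                   const uint8_t *src, ptrdiff_t src_stride,
                                   int height, int denom, int wx, int ox,
                                   int my, int width)
    {
        const int8_t *filter = ff_hevc_qpel_filters[my - 1]; /* 8 signed taps */
        const int shift  = denom + 6;
        const int offset = 1 << (shift - 1);

        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int sum = 0;
                for (int k = 0; k < 8; k++)   /* QPEL_FILTER(): rows -3..+4 */
                    sum += filter[k] * src[x + (k - 3) * src_stride];
                dst[x] = av_clip_uint8(((sum * wx + offset) >> shift) + ox);
            }
            dst += dst_stride;
            src += src_stride;
        }
    }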
Change-Id: I1178848541d90083869225ba98a02e6aa8bb8c5a
---
libavcodec/loongarch/hevc_mc.S | 1294 +++++++++++++++++
libavcodec/loongarch/hevcdsp_init_loongarch.c | 38 +
libavcodec/loongarch/hevcdsp_lasx.h | 18 +
libavcodec/loongarch/hevcdsp_lsx.h | 20 +
4 files changed, 1370 insertions(+)
diff --git a/libavcodec/loongarch/hevc_mc.S b/libavcodec/loongarch/hevc_mc.S
index c5d553effe..2ee338fb8e 100644
--- a/libavcodec/loongarch/hevc_mc.S
+++ b/libavcodec/loongarch/hevc_mc.S
@@ -21,6 +21,8 @@
#include "loongson_asm.S"
+.extern ff_hevc_qpel_filters
+
.macro LOAD_VAR bit
addi.w t1, a5, 6 //shift
addi.w t3, zero, 1 //one
@@ -469,3 +471,1295 @@ function ff_hevc_put_hevc_pel_uni_w_pixels64_8_lasx
addi.w a4, a4, -1
bnez a4, .LOOP_PIXELS64_LASX
endfunc
+
+.macro vhaddw.d.h in0
+ vhaddw.w.h \in0, \in0, \in0
+ vhaddw.d.w \in0, \in0, \in0
+.endm
+
+.macro xvhaddw.d.h in0
+ xvhaddw.w.h \in0, \in0, \in0
+ xvhaddw.d.w \in0, \in0, \in0
+.endm
+
+function ff_hevc_put_hevc_qpel_uni_w_v4_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ add.d t2, t1, a3 //stride * 4
+ sub.d a2, a2, t1 //src -= stride*3
+ fld.s f6, a2, 0 //0
+ fldx.s f7, a2, a3 //1
+ fldx.s f8, a2, t0 //2
+ add.d a2, a2, t1
+ fld.s f9, a2, 0 //3
+ fldx.s f10, a2, a3 //4
+ fldx.s f11, a2, t0 //5
+ fldx.s f12, a2, t1 //6
+ add.d a2, a2, t2
+ vilvl.b vr6, vr7, vr6
+ vilvl.b vr7, vr9, vr8
+ vilvl.b vr8, vr11, vr10
+ vilvl.b vr9, vr13, vr12
+ vilvl.h vr6, vr7, vr6
+ vilvl.h vr7, vr9, vr8
+ vilvl.w vr8, vr7, vr6
+ vilvh.w vr9, vr7, vr6
+.LOOP_V4:
+ fld.s f13, a2, 0 //7
+ fldx.s f14, a2, a3 //8 next loop
+ add.d a2, a2, t0
+ vextrins.b vr8, vr13, 0x70
+ vextrins.b vr8, vr13, 0xf1
+ vextrins.b vr9, vr13, 0x72
+ vextrins.b vr9, vr13, 0xf3
+ vbsrl.v vr10, vr8, 1
+ vbsrl.v vr11, vr9, 1
+ vextrins.b vr10, vr14, 0x70
+ vextrins.b vr10, vr14, 0xf1
+ vextrins.b vr11, vr14, 0x72
+ vextrins.b vr11, vr14, 0xf3
+ vdp2.h.bu.b vr6, vr8, vr5 //QPEL_FILTER(src, stride)
+ vdp2.h.bu.b vr7, vr9, vr5
+ vdp2.h.bu.b vr12, vr10, vr5
+ vdp2.h.bu.b vr13, vr11, vr5
+ vbsrl.v vr8, vr10, 1
+ vbsrl.v vr9, vr11, 1
+ vhaddw.d.h vr6
+ vhaddw.d.h vr7
+ vhaddw.d.h vr12
+ vhaddw.d.h vr13
+ vpickev.w vr6, vr7, vr6
+ vpickev.w vr12, vr13, vr12
+ vmulwev.w.h vr6, vr6, vr1 //QPEL_FILTER(src, stride) * wx
+ vmulwev.w.h vr12, vr12, vr1
+ vadd.w vr6, vr6, vr2
+ vsra.w vr6, vr6, vr3
+ vadd.w vr6, vr6, vr4
+ vadd.w vr12, vr12, vr2
+ vsra.w vr12, vr12, vr3
+ vadd.w vr12, vr12, vr4
+ vssrani.h.w vr12, vr6, 0
+ vssrani.bu.h vr12, vr12, 0
+ fst.s f12, a0, 0
+ add.d a0, a0, a1
+ vstelm.w vr12, a0, 0, 1
+ add.d a0, a0, a1
+ addi.d a4, a4, -2
+ bnez a4, .LOOP_V4
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_v6_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ add.d t2, t1, a3 //stride * 4
+ sub.d a2, a2, t1 //src -= stride*3
+ fld.d f6, a2, 0
+ fldx.d f7, a2, a3
+ fldx.d f8, a2, t0
+ add.d a2, a2, t1
+ fld.d f9, a2, 0
+ fldx.d f10, a2, a3
+ fldx.d f11, a2, t0
+ fldx.d f12, a2, t1
+ add.d a2, a2, t2
+ vilvl.b vr6, vr7, vr6 //transpose 8x6 to 3x16
+ vilvl.b vr7, vr9, vr8
+ vilvl.b vr8, vr11, vr10
+ vilvl.b vr9, vr13, vr12
+ vilvl.h vr10, vr7, vr6
+ vilvh.h vr11, vr7, vr6
+ vilvl.h vr12, vr9, vr8
+ vilvh.h vr13, vr9, vr8
+ vilvl.w vr6, vr12, vr10
+ vilvh.w vr7, vr12, vr10
+ vilvl.w vr8, vr13, vr11
+.LOOP_V6:
+ fld.d f13, a2, 0
+ add.d a2, a2, a3
+ vextrins.b vr6, vr13, 0x70
+ vextrins.b vr6, vr13, 0xf1
+ vextrins.b vr7, vr13, 0x72
+ vextrins.b vr7, vr13, 0xf3
+ vextrins.b vr8, vr13, 0x74
+ vextrins.b vr8, vr13, 0xf5
+ vdp2.h.bu.b vr10, vr6, vr5 //QPEL_FILTER(src, stride)
+ vdp2.h.bu.b vr11, vr7, vr5
+ vdp2.h.bu.b vr12, vr8, vr5
+ vbsrl.v vr6, vr6, 1
+ vbsrl.v vr7, vr7, 1
+ vbsrl.v vr8, vr8, 1
+ vhaddw.d.h vr10
+ vhaddw.d.h vr11
+ vhaddw.d.h vr12
+ vpickev.w vr10, vr11, vr10
+ vpickev.w vr11, vr13, vr12
+ vmulwev.w.h vr10, vr10, vr1 //QPEL_FILTER(src, stride) * wx
+ vmulwev.w.h vr11, vr11, vr1
+ vadd.w vr10, vr10, vr2
+ vadd.w vr11, vr11, vr2
+ vsra.w vr10, vr10, vr3
+ vsra.w vr11, vr11, vr3
+ vadd.w vr10, vr10, vr4
+ vadd.w vr11, vr11, vr4
+ vssrani.h.w vr11, vr10, 0
+ vssrani.bu.h vr11, vr11, 0
+ fst.s f11, a0, 0
+ vstelm.h vr11, a0, 4, 2
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_V6
+endfunc
+
+// transpose 8x8b to 4x16b
+.macro TRANSPOSE8X8B_LSX in0, in1, in2, in3, in4, in5, in6, in7, \
+ out0, out1, out2, out3
+ vilvl.b \in0, \in1, \in0
+ vilvl.b \in1, \in3, \in2
+ vilvl.b \in2, \in5, \in4
+ vilvl.b \in3, \in7, \in6
+ vilvl.h \in4, \in1, \in0
+ vilvh.h \in5, \in1, \in0
+ vilvl.h \in6, \in3, \in2
+ vilvh.h \in7, \in3, \in2
+ vilvl.w \out0, \in6, \in4
+ vilvh.w \out1, \in6, \in4
+ vilvl.w \out2, \in7, \in5
+ vilvh.w \out3, \in7, \in5
+.endm
+
+.macro PUT_HEVC_QPEL_UNI_W_V8_LSX in0, in1, in2, in3, out0, out1, pos
+.if \pos == 0
+ vextrins.b \in0, vr13, 0x70 //insert the 8th load
+ vextrins.b \in0, vr13, 0xf1
+ vextrins.b \in1, vr13, 0x72
+ vextrins.b \in1, vr13, 0xf3
+ vextrins.b \in2, vr13, 0x74
+ vextrins.b \in2, vr13, 0xf5
+ vextrins.b \in3, vr13, 0x76
+ vextrins.b \in3, vr13, 0xf7
+.else// \pos == 8
+ vextrins.b \in0, vr13, 0x78
+ vextrins.b \in0, vr13, 0xf9
+ vextrins.b \in1, vr13, 0x7a
+ vextrins.b \in1, vr13, 0xfb
+ vextrins.b \in2, vr13, 0x7c
+ vextrins.b \in2, vr13, 0xfd
+ vextrins.b \in3, vr13, 0x7e
+ vextrins.b \in3, vr13, 0xff
+.endif
+ vdp2.h.bu.b \out0, \in0, vr5 //QPEL_FILTER(src, stride)
+ vdp2.h.bu.b \out1, \in1, vr5
+ vdp2.h.bu.b vr12, \in2, vr5
+ vdp2.h.bu.b vr20, \in3, vr5
+ vbsrl.v \in0, \in0, 1 //Back up previous 7 loaded data,
+ vbsrl.v \in1, \in1, 1 //so just need to insert the 8th
+ vbsrl.v \in2, \in2, 1 //load in the next loop.
+ vbsrl.v \in3, \in3, 1
+ vhaddw.d.h \out0
+ vhaddw.d.h \out1
+ vhaddw.d.h vr12
+ vhaddw.d.h vr20
+ vpickev.w \out0, \out1, \out0
+ vpickev.w \out1, vr20, vr12
+ vmulwev.w.h \out0, \out0, vr1 //QPEL_FILTER(src, stride) * wx
+ vmulwev.w.h \out1, \out1, vr1
+ vadd.w \out0, \out0, vr2
+ vadd.w \out1, \out1, vr2
+ vsra.w \out0, \out0, vr3
+ vsra.w \out1, \out1, vr3
+ vadd.w \out0, \out0, vr4
+ vadd.w \out1, \out1, vr4
+.endm
+
+function ff_hevc_put_hevc_qpel_uni_w_v8_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ add.d t2, t1, a3 //stride * 4
+ sub.d a2, a2, t1 //src -= stride*3
+ fld.d f6, a2, 0
+ fldx.d f7, a2, a3
+ fldx.d f8, a2, t0
+ add.d a2, a2, t1
+ fld.d f9, a2, 0
+ fldx.d f10, a2, a3
+ fldx.d f11, a2, t0
+ fldx.d f12, a2, t1
+ add.d a2, a2, t2
+ TRANSPOSE8X8B_LSX vr6, vr7, vr8, vr9, vr10, vr11, vr12, vr13, \
+ vr6, vr7, vr8, vr9
+.LOOP_V8:
+ fld.d f13, a2, 0 //the 8th load
+ add.d a2, a2, a3
+ PUT_HEVC_QPEL_UNI_W_V8_LSX vr6, vr7, vr8, vr9, vr10, vr11, 0
+ vssrani.h.w vr11, vr10, 0
+ vssrani.bu.h vr11, vr11, 0
+ fst.d f11, a0, 0
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_V8
+endfunc
+
+.macro PUT_HEVC_UNI_W_V8_LASX w
+ fld.d f6, a2, 0
+ fldx.d f7, a2, a3
+ fldx.d f8, a2, t0
+ add.d a2, a2, t1
+ fld.d f9, a2, 0
+ fldx.d f10, a2, a3
+ fldx.d f11, a2, t0
+ fldx.d f12, a2, t1
+ add.d a2, a2, t2
+ TRANSPOSE8X8B_LSX vr6, vr7, vr8, vr9, vr10, vr11, vr12, vr13, \
+ vr6, vr7, vr8, vr9
+ xvpermi.q xr6, xr7, 0x02
+ xvpermi.q xr8, xr9, 0x02
+.LOOP_V8_LASX_\w:
+ fld.d f13, a2, 0 // 0 1 2 3 4 5 6 7 the 8th load
+ add.d a2, a2, a3
+ vshuf4i.h vr13, vr13, 0xd8
+ vbsrl.v vr14, vr13, 4
+ xvpermi.q xr13, xr14, 0x02 //0 1 4 5 * * * * 2 3 6 7 * * * *
+ xvextrins.b xr6, xr13, 0x70 //begin to insert the 8th load
+ xvextrins.b xr6, xr13, 0xf1
+ xvextrins.b xr8, xr13, 0x72
+ xvextrins.b xr8, xr13, 0xf3
+ xvdp2.h.bu.b xr20, xr6, xr5 //QPEL_FILTER(src, stride)
+ xvdp2.h.bu.b xr21, xr8, xr5
+ xvbsrl.v xr6, xr6, 1
+ xvbsrl.v xr8, xr8, 1
+ xvhaddw.d.h xr20
+ xvhaddw.d.h xr21
+ xvpickev.w xr20, xr21, xr20
+ xvpermi.d xr20, xr20, 0xd8
+ xvmulwev.w.h xr20, xr20, xr1 //QPEL_FILTER(src, stride) * wx
+ xvadd.w xr20, xr20, xr2
+ xvsra.w xr20, xr20, xr3
+ xvadd.w xr10, xr20, xr4
+ xvpermi.q xr11, xr10, 0x01
+ vssrani.h.w vr11, vr10, 0
+ vssrani.bu.h vr11, vr11, 0
+ fst.d f11, a0, 0
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_V8_LASX_\w
+.endm
+
+function ff_hevc_put_hevc_qpel_uni_w_v8_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ xvreplve0.q xr5, xr5
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ add.d t2, t1, a3 //stride * 4
+ sub.d a2, a2, t1 //src -= stride*3
+ PUT_HEVC_UNI_W_V8_LASX 8
+endfunc
+
+.macro PUT_HEVC_QPEL_UNI_W_V16_LSX w
+ vld vr6, a2, 0
+ vldx vr7, a2, a3
+ vldx vr8, a2, t0
+ add.d a2, a2, t1
+ vld vr9, a2, 0
+ vldx vr10, a2, a3
+ vldx vr11, a2, t0
+ vldx vr12, a2, t1
+ add.d a2, a2, t2
+.if \w > 8
+ vilvh.d vr14, vr14, vr6
+ vilvh.d vr15, vr15, vr7
+ vilvh.d vr16, vr16, vr8
+ vilvh.d vr17, vr17, vr9
+ vilvh.d vr18, vr18, vr10
+ vilvh.d vr19, vr19, vr11
+ vilvh.d vr20, vr20, vr12
+.endif
+ TRANSPOSE8X8B_LSX vr6, vr7, vr8, vr9, vr10, vr11, vr12, vr13, \
+ vr6, vr7, vr8, vr9
+.if \w > 8
+ TRANSPOSE8X8B_LSX vr14, vr15, vr16, vr17, vr18, vr19, vr20, vr21, \
+ vr14, vr15, vr16, vr17
+.endif
+.LOOP_HORI_16_\w:
+ vld vr13, a2, 0
+ add.d a2, a2, a3
+ PUT_HEVC_QPEL_UNI_W_V8_LSX vr6, vr7, vr8, vr9, vr10, vr11, 0
+.if \w > 8
+ PUT_HEVC_QPEL_UNI_W_V8_LSX vr14, vr15, vr16, vr17, vr18, vr19, 8
+.endif
+ vssrani.h.w vr11, vr10, 0
+.if \w > 8
+ vssrani.h.w vr19, vr18, 0
+ vssrani.bu.h vr19, vr11, 0
+.else
+ vssrani.bu.h vr11, vr11, 0
+.endif
+.if \w == 8
+ fst.d f11, a0, 0
+.elseif \w == 12
+ fst.d f19, a0, 0
+ vstelm.w vr19, a0, 8, 2
+.else
+ vst vr19, a0, 0
+.endif
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_HORI_16_\w
+.endm
+
+function ff_hevc_put_hevc_qpel_uni_w_v16_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ add.d t2, t1, a3 //stride * 4
+ sub.d a2, a2, t1 //src -= stride*3
+ PUT_HEVC_QPEL_UNI_W_V16_LSX 16
+endfunc
+
+.macro PUT_HEVC_QPEL_UNI_W_V16_LASX w
+ vld vr6, a2, 0
+ vldx vr7, a2, a3
+ vldx vr8, a2, t0
+ add.d a2, a2, t1
+ vld vr9, a2, 0
+ vldx vr10, a2, a3
+ vldx vr11, a2, t0
+ vldx vr12, a2, t1
+ add.d a2, a2, t2
+ xvpermi.q xr6, xr10, 0x02 //pack and transpose the 8x16 to 4x32 begin
+ xvpermi.q xr7, xr11, 0x02
+ xvpermi.q xr8, xr12, 0x02
+ xvpermi.q xr9, xr13, 0x02
+ xvilvl.b xr14, xr7, xr6 //0 2
+ xvilvh.b xr15, xr7, xr6 //1 3
+ xvilvl.b xr16, xr9, xr8 //0 2
+ xvilvh.b xr17, xr9, xr8 //1 3
+ xvpermi.d xr14, xr14, 0xd8
+ xvpermi.d xr15, xr15, 0xd8
+ xvpermi.d xr16, xr16, 0xd8
+ xvpermi.d xr17, xr17, 0xd8
+ xvilvl.h xr6, xr16, xr14
+ xvilvh.h xr7, xr16, xr14
+ xvilvl.h xr8, xr17, xr15
+ xvilvh.h xr9, xr17, xr15
+ xvilvl.w xr14, xr7, xr6 //0 1 4 5
+ xvilvh.w xr15, xr7, xr6 //2 3 6 7
+ xvilvl.w xr16, xr9, xr8 //8 9 12 13
+ xvilvh.w xr17, xr9, xr8 //10 11 14 15 end
+.LOOP_HORI_16_LASX_\w:
+ vld vr13, a2, 0 //the 8th load
+ add.d a2, a2, a3
+ vshuf4i.w vr13, vr13, 0xd8
+ vbsrl.v vr12, vr13, 8
+ xvpermi.q xr13, xr12, 0x02
+ xvextrins.b xr14, xr13, 0x70 //insert the 8th load
+ xvextrins.b xr14, xr13, 0xf1
+ xvextrins.b xr15, xr13, 0x72
+ xvextrins.b xr15, xr13, 0xf3
+ xvextrins.b xr16, xr13, 0x74
+ xvextrins.b xr16, xr13, 0xf5
+ xvextrins.b xr17, xr13, 0x76
+ xvextrins.b xr17, xr13, 0xf7
+ xvdp2.h.bu.b xr6, xr14, xr5 //QPEL_FILTER(src, stride)
+ xvdp2.h.bu.b xr7, xr15, xr5
+ xvdp2.h.bu.b xr8, xr16, xr5
+ xvdp2.h.bu.b xr9, xr17, xr5
+ xvhaddw.d.h xr6
+ xvhaddw.d.h xr7
+ xvhaddw.d.h xr8
+ xvhaddw.d.h xr9
+ xvbsrl.v xr14, xr14, 1 //Back up previous 7 loaded data,
+ xvbsrl.v xr15, xr15, 1 //so just need to insert the 8th
+ xvbsrl.v xr16, xr16, 1 //load in next loop.
+ xvbsrl.v xr17, xr17, 1
+ xvpickev.w xr6, xr7, xr6 //0 1 2 3 4 5 6 7
+ xvpickev.w xr7, xr9, xr8 //8 9 10 11 12 13 14 15
+ xvmulwev.w.h xr6, xr6, xr1 //QPEL_FILTER(src, stride) * wx
+ xvmulwev.w.h xr7, xr7, xr1
+ xvadd.w xr6, xr6, xr2
+ xvadd.w xr7, xr7, xr2
+ xvsra.w xr6, xr6, xr3
+ xvsra.w xr7, xr7, xr3
+ xvadd.w xr6, xr6, xr4
+ xvadd.w xr7, xr7, xr4
+ xvssrani.h.w xr7, xr6, 0 //0 1 2 3 8 9 10 11 4 5 6 7 12 13 14 15
+ xvpermi.q xr6, xr7, 0x01
+ vssrani.bu.h vr6, vr7, 0
+ vshuf4i.w vr6, vr6, 0xd8
+.if \w == 12
+ fst.d f6, a0, 0
+ vstelm.w vr6, a0, 8, 2
+.else
+ vst vr6, a0, 0
+.endif
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_HORI_16_LASX_\w
+.endm
+
+function ff_hevc_put_hevc_qpel_uni_w_v16_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ xvreplve0.q xr5, xr5
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ add.d t2, t1, a3 //stride * 4
+ sub.d a2, a2, t1 //src -= stride*3
+ PUT_HEVC_QPEL_UNI_W_V16_LASX 16
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_v12_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ add.d t2, t1, a3 //stride * 4
+ sub.d a2, a2, t1 //src -= stride*3
+ PUT_HEVC_QPEL_UNI_W_V16_LSX 12
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_v12_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ xvreplve0.q xr5, xr5
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ add.d t2, t1, a3 //stride * 4
+ sub.d a2, a2, t1 //src -= stride*3
+ PUT_HEVC_QPEL_UNI_W_V16_LASX 12
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_v24_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ add.d t2, t1, a3 //stride * 4
+ sub.d a2, a2, t1 //src -= stride*3
+ addi.d t4, a0, 0 //save dst
+ addi.d t5, a2, 0 //save src
+ addi.d t6, a4, 0
+ PUT_HEVC_QPEL_UNI_W_V16_LSX 24
+ addi.d a0, t4, 16
+ addi.d a2, t5, 16
+ addi.d a4, t6, 0
+ PUT_HEVC_QPEL_UNI_W_V16_LSX 8
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_v24_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ xvreplve0.q xr5, xr5
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ add.d t2, t1, a3 //stride * 4
+ sub.d a2, a2, t1 //src -= stride*3
+ addi.d t4, a0, 0 //save dst
+ addi.d t5, a2, 0 //save src
+ addi.d t6, a4, 0
+ PUT_HEVC_QPEL_UNI_W_V16_LASX 24
+ addi.d a0, t4, 16
+ addi.d a2, t5, 16
+ addi.d a4, t6, 0
+ PUT_HEVC_UNI_W_V8_LASX 24
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_v32_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ add.d t2, t1, a3 //stride * 4
+ sub.d a2, a2, t1 //src -= stride*3
+ addi.d t3, zero, 2
+ addi.d t4, a0, 0 //save dst
+ addi.d t5, a2, 0 //save src
+ addi.d t6, a4, 0
+.LOOP_V32:
+ PUT_HEVC_QPEL_UNI_W_V16_LSX 32
+ addi.d t3, t3, -1
+ addi.d a0, t4, 16
+ addi.d a2, t5, 16
+ addi.d a4, t6, 0
+ bnez t3, .LOOP_V32
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_v32_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ xvreplve0.q xr5, xr5
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ add.d t2, t1, a3 //stride * 4
+ sub.d a2, a2, t1 //src -= stride*3
+ addi.d t3, zero, 2
+ addi.d t4, a0, 0 //save dst
+ addi.d t5, a2, 0 //save src
+ addi.d t6, a4, 0
+.LOOP_V32_LASX:
+ PUT_HEVC_QPEL_UNI_W_V16_LASX 32
+ addi.d t3, t3, -1
+ addi.d a0, t4, 16
+ addi.d a2, t5, 16
+ addi.d a4, t6, 0
+ bnez t3, .LOOP_V32_LASX
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_v48_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ add.d t2, t1, a3 //stride * 4
+ sub.d a2, a2, t1 //src -= stride*3
+ addi.d t3, zero, 3
+ addi.d t4, a0, 0 //save dst
+ addi.d t5, a2, 0 //save src
+ addi.d t6, a4, 0
+.LOOP_V48:
+ PUT_HEVC_QPEL_UNI_W_V16_LSX 48
+ addi.d t3, t3, -1
+ addi.d a0, t4, 16
+ addi.d t4, t4, 16
+ addi.d a2, t5, 16
+ addi.d t5, t5, 16
+ addi.d a4, t6, 0
+ bnez t3, .LOOP_V48
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_v48_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ xvreplve0.q xr5, xr5
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ add.d t2, t1, a3 //stride * 4
+ sub.d a2, a2, t1 //src -= stride*3
+ addi.d t3, zero, 3
+ addi.d t4, a0, 0 //save dst
+ addi.d t5, a2, 0 //save src
+ addi.d t6, a4, 0
+.LOOP_V48_LASX:
+ PUT_HEVC_QPEL_UNI_W_V16_LASX 48
+ addi.d t3, t3, -1
+ addi.d a0, t4, 16
+ addi.d t4, t4, 16
+ addi.d a2, t5, 16
+ addi.d t5, t5, 16
+ addi.d a4, t6, 0
+ bnez t3, .LOOP_V48_LASX
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_v64_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ add.d t2, t1, a3 //stride * 4
+ sub.d a2, a2, t1 //src -= stride*3
+ addi.d t3, zero, 4
+ addi.d t4, a0, 0 //save dst
+ addi.d t5, a2, 0 //save src
+ addi.d t6, a4, 0
+.LOOP_V64:
+ PUT_HEVC_QPEL_UNI_W_V16_LSX 64
+ addi.d t3, t3, -1
+ addi.d a0, t4, 16
+ addi.d t4, t4, 16
+ addi.d a2, t5, 16
+ addi.d t5, t5, 16
+ addi.d a4, t6, 0
+ bnez t3, .LOOP_V64
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_v64_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ xvreplve0.q xr5, xr5
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ add.d t2, t1, a3 //stride * 4
+ sub.d a2, a2, t1 //src -= stride*3
+ addi.d t3, zero, 4
+ addi.d t4, a0, 0 //save dst
+ addi.d t5, a2, 0 //save src
+ addi.d t6, a4, 0
+.LOOP_V64_LASX:
+ PUT_HEVC_QPEL_UNI_W_V16_LASX 64
+ addi.d t3, t3, -1
+ addi.d a0, t4, 16
+ addi.d t4, t4, 16
+ addi.d a2, t5, 16
+ addi.d t5, t5, 16
+ addi.d a4, t6, 0
+ bnez t3, .LOOP_V64_LASX
+endfunc
+
+.macro PUT_HEVC_QPEL_UNI_W_H8_LSX in0, out0, out1
+ vbsrl.v vr7, \in0, 1
+ vbsrl.v vr8, \in0, 2
+ vbsrl.v vr9, \in0, 3
+ vbsrl.v vr10, \in0, 4
+ vbsrl.v vr11, \in0, 5
+ vbsrl.v vr12, \in0, 6
+ vbsrl.v vr13, \in0, 7
+ vilvl.d vr6, vr7, \in0
+ vilvl.d vr7, vr9, vr8
+ vilvl.d vr8, vr11, vr10
+ vilvl.d vr9, vr13, vr12
+ vdp2.h.bu.b vr10, vr6, vr5
+ vdp2.h.bu.b vr11, vr7, vr5
+ vdp2.h.bu.b vr12, vr8, vr5
+ vdp2.h.bu.b vr13, vr9, vr5
+ vhaddw.d.h vr10
+ vhaddw.d.h vr11
+ vhaddw.d.h vr12
+ vhaddw.d.h vr13
+ vpickev.w vr10, vr11, vr10
+ vpickev.w vr11, vr13, vr12
+ vmulwev.w.h vr10, vr10, vr1
+ vmulwev.w.h vr11, vr11, vr1
+ vadd.w vr10, vr10, vr2
+ vadd.w vr11, vr11, vr2
+ vsra.w vr10, vr10, vr3
+ vsra.w vr11, vr11, vr3
+ vadd.w \out0, vr10, vr4
+ vadd.w \out1, vr11, vr4
+.endm
+
+.macro PUT_HEVC_QPEL_UNI_W_H8_LASX in0, out0
+ xvbsrl.v xr7, \in0, 4
+ xvpermi.q xr7, \in0, 0x20
+ xvbsrl.v xr8, xr7, 1
+ xvbsrl.v xr9, xr7, 2
+ xvbsrl.v xr10, xr7, 3
+ xvpackev.d xr7, xr8, xr7
+ xvpackev.d xr8, xr10, xr9
+ xvdp2.h.bu.b xr10, xr7, xr5
+ xvdp2.h.bu.b xr11, xr8, xr5
+ xvhaddw.d.h xr10
+ xvhaddw.d.h xr11
+ xvpickev.w xr10, xr11, xr10
+ xvmulwev.w.h xr10, xr10, xr1
+ xvadd.w xr10, xr10, xr2
+ xvsra.w xr10, xr10, xr3
+ xvadd.w \out0, xr10, xr4
+.endm
+
+.macro PUT_HEVC_QPEL_UNI_W_H16_LASX in0, out0
+ xvpermi.d xr6, \in0, 0x94
+ xvbsrl.v xr7, xr6, 1
+ xvbsrl.v xr8, xr6, 2
+ xvbsrl.v xr9, xr6, 3
+ xvbsrl.v xr10, xr6, 4
+ xvbsrl.v xr11, xr6, 5
+ xvbsrl.v xr12, xr6, 6
+ xvbsrl.v xr13, xr6, 7
+ xvpackev.d xr6, xr7, xr6
+ xvpackev.d xr7, xr9, xr8
+ xvpackev.d xr8, xr11, xr10
+ xvpackev.d xr9, xr13, xr12
+ xvdp2.h.bu.b xr10, xr6, xr5
+ xvdp2.h.bu.b xr11, xr7, xr5
+ xvdp2.h.bu.b xr12, xr8, xr5
+ xvdp2.h.bu.b xr13, xr9, xr5
+ xvhaddw.d.h xr10
+ xvhaddw.d.h xr11
+ xvhaddw.d.h xr12
+ xvhaddw.d.h xr13
+ xvpickev.w xr10, xr11, xr10
+ xvpickev.w xr11, xr13, xr12
+ xvmulwev.w.h xr10, xr10, xr1
+ xvmulwev.w.h xr11, xr11, xr1
+ xvadd.w xr10, xr10, xr2
+ xvadd.w xr11, xr11, xr2
+ xvsra.w xr10, xr10, xr3
+ xvsra.w xr11, xr11, xr3
+ xvadd.w xr10, xr10, xr4
+ xvadd.w xr11, xr11, xr4
+ xvssrani.h.w xr11, xr10, 0
+ xvpermi.q \out0, xr11, 0x01
+ xvssrani.bu.h \out0, xr11, 0
+.endm
+
+function ff_hevc_put_hevc_qpel_uni_w_h4_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ addi.d a2, a2, -3 //src -= 3
+.LOOP_H4:
+ vld vr18, a2, 0
+ vldx vr19, a2, a3
+ alsl.d a2, a3, a2, 1
+ vbsrl.v vr6, vr18, 1
+ vbsrl.v vr7, vr18, 2
+ vbsrl.v vr8, vr18, 3
+ vbsrl.v vr9, vr19, 1
+ vbsrl.v vr10, vr19, 2
+ vbsrl.v vr11, vr19, 3
+ vilvl.d vr6, vr6, vr18
+ vilvl.d vr7, vr8, vr7
+ vilvl.d vr8, vr9, vr19
+ vilvl.d vr9, vr11, vr10
+ vdp2.h.bu.b vr10, vr6, vr5
+ vdp2.h.bu.b vr11, vr7, vr5
+ vdp2.h.bu.b vr12, vr8, vr5
+ vdp2.h.bu.b vr13, vr9, vr5
+ vhaddw.d.h vr10
+ vhaddw.d.h vr11
+ vhaddw.d.h vr12
+ vhaddw.d.h vr13
+ vpickev.w vr10, vr11, vr10
+ vpickev.w vr11, vr13, vr12
+ vmulwev.w.h vr10, vr10, vr1
+ vmulwev.w.h vr11, vr11, vr1
+ vadd.w vr10, vr10, vr2
+ vadd.w vr11, vr11, vr2
+ vsra.w vr10, vr10, vr3
+ vsra.w vr11, vr11, vr3
+ vadd.w vr10, vr10, vr4
+ vadd.w vr11, vr11, vr4
+ vssrani.h.w vr11, vr10, 0
+ vssrani.bu.h vr11, vr11, 0
+ fst.s f11, a0, 0
+ vbsrl.v vr11, vr11, 4
+ fstx.s f11, a0, a1
+ alsl.d a0, a1, a0, 1
+ addi.d a4, a4, -2
+ bnez a4, .LOOP_H4
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_h4_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ xvreplve0.q xr5, xr5
+ addi.d a2, a2, -3 //src -= 3
+.LOOP_H4_LASX:
+ vld vr18, a2, 0
+ vldx vr19, a2, a3
+ alsl.d a2, a3, a2, 1
+ xvpermi.q xr18, xr19, 0x02
+ xvbsrl.v xr6, xr18, 1
+ xvbsrl.v xr7, xr18, 2
+ xvbsrl.v xr8, xr18, 3
+ xvpackev.d xr6, xr6, xr18
+ xvpackev.d xr7, xr8, xr7
+ xvdp2.h.bu.b xr10, xr6, xr5
+ xvdp2.h.bu.b xr11, xr7, xr5
+ xvhaddw.d.h xr10
+ xvhaddw.d.h xr11
+ xvpickev.w xr10, xr11, xr10
+ xvmulwev.w.h xr10, xr10, xr1
+ xvadd.w xr10, xr10, xr2
+ xvsra.w xr10, xr10, xr3
+ xvadd.w xr10, xr10, xr4
+ xvpermi.q xr11, xr10, 0x01
+ vssrani.h.w vr11, vr10, 0
+ vssrani.bu.h vr11, vr11, 0
+ fst.s f11, a0, 0
+ vbsrl.v vr11, vr11, 4
+ fstx.s f11, a0, a1
+ alsl.d a0, a1, a0, 1
+ addi.d a4, a4, -2
+ bnez a4, .LOOP_H4_LASX
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_h6_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ addi.d a2, a2, -3 //src -= 3
+.LOOP_H6:
+ vld vr6, a2, 0
+ add.d a2, a2, a3
+ PUT_HEVC_QPEL_UNI_W_H8_LSX vr6, vr10, vr11
+ vssrani.h.w vr11, vr10, 0
+ vssrani.bu.h vr11, vr11, 0
+ fst.s f11, a0, 0
+ vstelm.h vr11, a0, 4, 2
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_H6
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_h6_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ xvreplve0.q xr5, xr5
+ addi.d a2, a2, -3 //src -= 3
+.LOOP_H6_LASX:
+ vld vr6, a2, 0
+ add.d a2, a2, a3
+ PUT_HEVC_QPEL_UNI_W_H8_LASX xr6, xr10
+ xvpermi.q xr11, xr10, 0x01
+ vssrani.h.w vr11, vr10, 0
+ vssrani.bu.h vr11, vr11, 0
+ fst.s f11, a0, 0
+ vstelm.h vr11, a0, 4, 2
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_H6_LASX
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_h8_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ addi.d a2, a2, -3 //src -= 3
+.LOOP_H8:
+ vld vr6, a2, 0
+ add.d a2, a2, a3
+ PUT_HEVC_QPEL_UNI_W_H8_LSX vr6, vr10, vr11
+ vssrani.h.w vr11, vr10, 0
+ vssrani.bu.h vr11, vr11, 0
+ fst.d f11, a0, 0
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_H8
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_h8_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ xvreplve0.q xr5, xr5
+ addi.d a2, a2, -3 //src -= 3
+.LOOP_H8_LASX:
+ vld vr6, a2, 0
+ add.d a2, a2, a3
+ PUT_HEVC_QPEL_UNI_W_H8_LASX xr6, xr10
+ xvpermi.q xr11, xr10, 0x01
+ vssrani.h.w vr11, vr10, 0
+ vssrani.bu.h vr11, vr11, 0
+ fst.d f11, a0, 0
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_H8_LASX
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_h12_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ addi.d a2, a2, -3 //src -= 3
+.LOOP_H12:
+ vld vr6, a2, 0
+ PUT_HEVC_QPEL_UNI_W_H8_LSX vr6, vr14, vr15
+ vld vr6, a2, 8
+ PUT_HEVC_QPEL_UNI_W_H8_LSX vr6, vr16, vr17
+ add.d a2, a2, a3
+ vssrani.h.w vr15, vr14, 0
+ vssrani.h.w vr17, vr16, 0
+ vssrani.bu.h vr17, vr15, 0
+ fst.d f17, a0, 0
+ vbsrl.v vr17, vr17, 8
+ fst.s f17, a0, 8
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_H12
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_h12_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ xvreplve0.q xr5, xr5
+ addi.d a2, a2, -3 //src -= 3
+.LOOP_H12_LASX:
+ xvld xr6, a2, 0
+ add.d a2, a2, a3
+ PUT_HEVC_QPEL_UNI_W_H16_LASX xr6, xr14
+ fst.d f14, a0, 0
+ vstelm.w vr14, a0, 8, 2
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_H12_LASX
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_h16_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ addi.d a2, a2, -3 //src -= 3
+.LOOP_H16:
+ vld vr6, a2, 0
+ PUT_HEVC_QPEL_UNI_W_H8_LSX vr6, vr14, vr15
+ vld vr6, a2, 8
+ PUT_HEVC_QPEL_UNI_W_H8_LSX vr6, vr16, vr17
+ add.d a2, a2, a3
+ vssrani.h.w vr15, vr14, 0
+ vssrani.h.w vr17, vr16, 0
+ vssrani.bu.h vr17, vr15, 0
+ vst vr17, a0, 0
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_H16
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_h16_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ xvreplve0.q xr5, xr5
+ addi.d a2, a2, -3 //src -= 3
+.LOOP_H16_LASX:
+ xvld xr6, a2, 0
+ add.d a2, a2, a3
+ PUT_HEVC_QPEL_UNI_W_H16_LASX xr6, xr10
+ vst vr10, a0, 0
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_H16_LASX
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_h24_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ addi.d a2, a2, -3 //src -= 3
+.LOOP_H24:
+ vld vr18, a2, 0
+ vld vr19, a2, 16
+ add.d a2, a2, a3
+ PUT_HEVC_QPEL_UNI_W_H8_LSX vr18, vr14, vr15
+ vshuf4i.d vr18, vr19, 0x09
+ PUT_HEVC_QPEL_UNI_W_H8_LSX vr18, vr16, vr17
+ vssrani.h.w vr15, vr14, 0
+ vssrani.h.w vr17, vr16, 0
+ vssrani.bu.h vr17, vr15, 0
+ vst vr17, a0, 0
+ PUT_HEVC_QPEL_UNI_W_H8_LSX vr19, vr14, vr15
+ vssrani.h.w vr15, vr14, 0
+ vssrani.bu.h vr15, vr15, 0
+ fst.d f15, a0, 16
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_H24
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_h24_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ xvreplve0.q xr5, xr5
+ addi.d a2, a2, -3 //src -= 3
+.LOOP_H24_LASX:
+ xvld xr18, a2, 0
+ add.d a2, a2, a3
+ PUT_HEVC_QPEL_UNI_W_H16_LASX xr18, xr20
+ xvpermi.q xr19, xr18, 0x01
+ vst vr20, a0, 0
+ PUT_HEVC_QPEL_UNI_W_H8_LASX xr19, xr20
+ xvpermi.q xr21, xr20, 0x01
+ vssrani.h.w vr21, vr20, 0
+ vssrani.bu.h vr21, vr21, 0
+ fst.d f21, a0, 16
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_H24_LASX
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_h32_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ addi.d a2, a2, -3 //src -= 3
+.LOOP_H32:
+ vld vr18, a2, 0
+ vld vr19, a2, 16
+ vld vr20, a2, 32
+ add.d a2, a2, a3
+ PUT_HEVC_QPEL_UNI_W_H8_LSX vr18, vr14, vr15
+ vshuf4i.d vr18, vr19, 0x09
+ PUT_HEVC_QPEL_UNI_W_H8_LSX vr18, vr16, vr17
+ vssrani.h.w vr15, vr14, 0
+ vssrani.h.w vr17, vr16, 0
+ vssrani.bu.h vr17, vr15, 0
+ vst vr17, a0, 0
+ PUT_HEVC_QPEL_UNI_W_H8_LSX vr19, vr14, vr15
+ vshuf4i.d vr19, vr20, 0x09
+ PUT_HEVC_QPEL_UNI_W_H8_LSX vr19, vr16, vr17
+ vssrani.h.w vr15, vr14, 0
+ vssrani.h.w vr17, vr16, 0
+ vssrani.bu.h vr17, vr15, 0
+ vst vr17, a0, 16
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_H32
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_h32_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ xvreplve0.q xr5, xr5
+ addi.d a2, a2, -3 //src -= 3
+.LOOP_H32_LASX:
+ xvld xr18, a2, 0
+ xvld xr19, a2, 16
+ add.d a2, a2, a3
+ PUT_HEVC_QPEL_UNI_W_H16_LASX xr18, xr20
+ PUT_HEVC_QPEL_UNI_W_H16_LASX xr19, xr21
+ xvpermi.q xr20, xr21, 0x02
+ xvst xr20, a0, 0
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_H32_LASX
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_h48_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ addi.d a2, a2, -3 //src -= 3
+.LOOP_H48:
+ vld vr18, a2, 0
+ vld vr19, a2, 16
+ vld vr20, a2, 32
+ vld vr21, a2, 48
+ add.d a2, a2, a3
+ PUT_HEVC_QPEL_UNI_W_H8_LSX vr18, vr14, vr15
+ vshuf4i.d vr18, vr19, 0x09
+ PUT_HEVC_QPEL_UNI_W_H8_LSX vr18, vr16, vr17
+ vssrani.h.w vr15, vr14, 0
+ vssrani.h.w vr17, vr16, 0
+ vssrani.bu.h vr17, vr15, 0
+ vst vr17, a0, 0
+ PUT_HEVC_QPEL_UNI_W_H8_LSX vr19, vr14, vr15
+ vshuf4i.d vr19, vr20, 0x09
+ PUT_HEVC_QPEL_UNI_W_H8_LSX vr19, vr16, vr17
+ vssrani.h.w vr15, vr14, 0
+ vssrani.h.w vr17, vr16, 0
+ vssrani.bu.h vr17, vr15, 0
+ vst vr17, a0, 16
+ PUT_HEVC_QPEL_UNI_W_H8_LSX vr20, vr14, vr15
+ vshuf4i.d vr20, vr21, 0x09
+ PUT_HEVC_QPEL_UNI_W_H8_LSX vr20, vr16, vr17
+ vssrani.h.w vr15, vr14, 0
+ vssrani.h.w vr17, vr16, 0
+ vssrani.bu.h vr17, vr15, 0
+ vst vr17, a0, 32
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_H48
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_h48_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ xvreplve0.q xr5, xr5
+ addi.d a2, a2, -3 //src -= 3
+.LOOP_H48_LASX:
+ xvld xr18, a2, 0
+ xvld xr19, a2, 32
+ add.d a2, a2, a3
+ PUT_HEVC_QPEL_UNI_W_H16_LASX xr18, xr20
+ xvpermi.q xr18, xr19, 0x03
+ PUT_HEVC_QPEL_UNI_W_H16_LASX xr18, xr21
+ xvpermi.q xr20, xr21, 0x02
+ xvst xr20, a0, 0
+ PUT_HEVC_QPEL_UNI_W_H16_LASX xr19, xr20
+ vst vr20, a0, 32
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_H48_LASX
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_h64_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ addi.d a2, a2, -3 //src -= 3
+.LOOP_H64:
+ vld vr18, a2, 0
+ vld vr19, a2, 16
+ vld vr20, a2, 32
+ vld vr21, a2, 48
+ vld vr22, a2, 64
+ add.d a2, a2, a3
+ PUT_HEVC_QPEL_UNI_W_H8_LSX vr18, vr14, vr15
+ vshuf4i.d vr18, vr19, 0x09
+ PUT_HEVC_QPEL_UNI_W_H8_LSX vr18, vr16, vr17
+ vssrani.h.w vr15, vr14, 0
+ vssrani.h.w vr17, vr16, 0
+ vssrani.bu.h vr17, vr15, 0
+ vst vr17, a0, 0
+ PUT_HEVC_QPEL_UNI_W_H8_LSX vr19, vr14, vr15
+ vshuf4i.d vr19, vr20, 0x09
+ PUT_HEVC_QPEL_UNI_W_H8_LSX vr19, vr16, vr17
+ vssrani.h.w vr15, vr14, 0
+ vssrani.h.w vr17, vr16, 0
+ vssrani.bu.h vr17, vr15, 0
+ vst vr17, a0, 16
+ PUT_HEVC_QPEL_UNI_W_H8_LSX vr20, vr14, vr15
+ vshuf4i.d vr20, vr21, 0x09
+ PUT_HEVC_QPEL_UNI_W_H8_LSX vr20, vr16, vr17
+ vssrani.h.w vr15, vr14, 0
+ vssrani.h.w vr17, vr16, 0
+ vssrani.bu.h vr17, vr15, 0
+ vst vr17, a0, 32
+ PUT_HEVC_QPEL_UNI_W_H8_LSX vr21, vr14, vr15
+ vshuf4i.d vr21, vr22, 0x09
+ PUT_HEVC_QPEL_UNI_W_H8_LSX vr21, vr16, vr17
+ vssrani.h.w vr15, vr14, 0
+ vssrani.h.w vr17, vr16, 0
+ vssrani.bu.h vr17, vr15, 0
+ vst vr17, a0, 48
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_H64
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_h64_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ xvreplve0.q xr5, xr5
+ addi.d a2, a2, -3 //src -= 3
+.LOOP_H64_LASX:
+ xvld xr18, a2, 0
+ xvld xr19, a2, 32
+ xvld xr20, a2, 64
+ add.d a2, a2, a3
+ PUT_HEVC_QPEL_UNI_W_H16_LASX xr18, xr21
+ xvpermi.q xr18, xr19, 0x03
+ PUT_HEVC_QPEL_UNI_W_H16_LASX xr18, xr22
+ xvpermi.q xr21, xr22, 0x02
+ xvst xr21, a0, 0
+ PUT_HEVC_QPEL_UNI_W_H16_LASX xr19, xr21
+ xvpermi.q xr19, xr20, 0x03
+ PUT_HEVC_QPEL_UNI_W_H16_LASX xr19, xr22
+ xvpermi.q xr21, xr22, 0x02
+ xvst xr21, a0, 32
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_H64_LASX
+endfunc
diff --git a/libavcodec/loongarch/hevcdsp_init_loongarch.c b/libavcodec/loongarch/hevcdsp_init_loongarch.c
index d0ee99d6b5..3cdb3fb2d7 100644
--- a/libavcodec/loongarch/hevcdsp_init_loongarch.c
+++ b/libavcodec/loongarch/hevcdsp_init_loongarch.c
@@ -188,6 +188,26 @@ void ff_hevc_dsp_init_loongarch(HEVCDSPContext *c, const int bit_depth)
c->put_hevc_qpel_uni_w[8][1][1] = ff_hevc_put_hevc_uni_w_qpel_hv48_8_lsx;
c->put_hevc_qpel_uni_w[9][1][1] = ff_hevc_put_hevc_uni_w_qpel_hv64_8_lsx;
+ c->put_hevc_qpel_uni_w[1][1][0] = ff_hevc_put_hevc_qpel_uni_w_v4_8_lsx;
+ c->put_hevc_qpel_uni_w[2][1][0] = ff_hevc_put_hevc_qpel_uni_w_v6_8_lsx;
+ c->put_hevc_qpel_uni_w[3][1][0] = ff_hevc_put_hevc_qpel_uni_w_v8_8_lsx;
+ c->put_hevc_qpel_uni_w[4][1][0] = ff_hevc_put_hevc_qpel_uni_w_v12_8_lsx;
+ c->put_hevc_qpel_uni_w[5][1][0] = ff_hevc_put_hevc_qpel_uni_w_v16_8_lsx;
+ c->put_hevc_qpel_uni_w[6][1][0] = ff_hevc_put_hevc_qpel_uni_w_v24_8_lsx;
+ c->put_hevc_qpel_uni_w[7][1][0] = ff_hevc_put_hevc_qpel_uni_w_v32_8_lsx;
+ c->put_hevc_qpel_uni_w[8][1][0] = ff_hevc_put_hevc_qpel_uni_w_v48_8_lsx;
+ c->put_hevc_qpel_uni_w[9][1][0] = ff_hevc_put_hevc_qpel_uni_w_v64_8_lsx;
+
+ c->put_hevc_qpel_uni_w[1][0][1] = ff_hevc_put_hevc_qpel_uni_w_h4_8_lsx;
+ c->put_hevc_qpel_uni_w[2][0][1] = ff_hevc_put_hevc_qpel_uni_w_h6_8_lsx;
+ c->put_hevc_qpel_uni_w[3][0][1] = ff_hevc_put_hevc_qpel_uni_w_h8_8_lsx;
+ c->put_hevc_qpel_uni_w[4][0][1] = ff_hevc_put_hevc_qpel_uni_w_h12_8_lsx;
+ c->put_hevc_qpel_uni_w[5][0][1] = ff_hevc_put_hevc_qpel_uni_w_h16_8_lsx;
+ c->put_hevc_qpel_uni_w[6][0][1] = ff_hevc_put_hevc_qpel_uni_w_h24_8_lsx;
+ c->put_hevc_qpel_uni_w[7][0][1] = ff_hevc_put_hevc_qpel_uni_w_h32_8_lsx;
+ c->put_hevc_qpel_uni_w[8][0][1] = ff_hevc_put_hevc_qpel_uni_w_h48_8_lsx;
+ c->put_hevc_qpel_uni_w[9][0][1] = ff_hevc_put_hevc_qpel_uni_w_h64_8_lsx;
+
c->sao_edge_filter[0] = ff_hevc_sao_edge_filter_8_lsx;
c->sao_edge_filter[1] = ff_hevc_sao_edge_filter_8_lsx;
c->sao_edge_filter[2] = ff_hevc_sao_edge_filter_8_lsx;
@@ -237,6 +257,24 @@ void ff_hevc_dsp_init_loongarch(HEVCDSPContext *c, const int bit_depth)
c->put_hevc_epel_uni_w[7][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels32_8_lasx;
c->put_hevc_epel_uni_w[8][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels48_8_lasx;
c->put_hevc_epel_uni_w[9][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels64_8_lasx;
+
+ c->put_hevc_qpel_uni_w[3][1][0] = ff_hevc_put_hevc_qpel_uni_w_v8_8_lasx;
+ c->put_hevc_qpel_uni_w[4][1][0] = ff_hevc_put_hevc_qpel_uni_w_v12_8_lasx;
+ c->put_hevc_qpel_uni_w[5][1][0] = ff_hevc_put_hevc_qpel_uni_w_v16_8_lasx;
+ c->put_hevc_qpel_uni_w[6][1][0] = ff_hevc_put_hevc_qpel_uni_w_v24_8_lasx;
+ c->put_hevc_qpel_uni_w[7][1][0] = ff_hevc_put_hevc_qpel_uni_w_v32_8_lasx;
+ c->put_hevc_qpel_uni_w[8][1][0] = ff_hevc_put_hevc_qpel_uni_w_v48_8_lasx;
+ c->put_hevc_qpel_uni_w[9][1][0] = ff_hevc_put_hevc_qpel_uni_w_v64_8_lasx;
+
+ c->put_hevc_qpel_uni_w[1][0][1] = ff_hevc_put_hevc_qpel_uni_w_h4_8_lasx;
+ c->put_hevc_qpel_uni_w[2][0][1] = ff_hevc_put_hevc_qpel_uni_w_h6_8_lasx;
+ c->put_hevc_qpel_uni_w[3][0][1] = ff_hevc_put_hevc_qpel_uni_w_h8_8_lasx;
+ c->put_hevc_qpel_uni_w[4][0][1] = ff_hevc_put_hevc_qpel_uni_w_h12_8_lasx;
+ c->put_hevc_qpel_uni_w[5][0][1] = ff_hevc_put_hevc_qpel_uni_w_h16_8_lasx;
+ c->put_hevc_qpel_uni_w[6][0][1] = ff_hevc_put_hevc_qpel_uni_w_h24_8_lasx;
+ c->put_hevc_qpel_uni_w[7][0][1] = ff_hevc_put_hevc_qpel_uni_w_h32_8_lasx;
+ c->put_hevc_qpel_uni_w[8][0][1] = ff_hevc_put_hevc_qpel_uni_w_h48_8_lasx;
+ c->put_hevc_qpel_uni_w[9][0][1] = ff_hevc_put_hevc_qpel_uni_w_h64_8_lasx;
}
}
}
diff --git a/libavcodec/loongarch/hevcdsp_lasx.h b/libavcodec/loongarch/hevcdsp_lasx.h
index 819c3c3ecf..8a9266d375 100644
--- a/libavcodec/loongarch/hevcdsp_lasx.h
+++ b/libavcodec/loongarch/hevcdsp_lasx.h
@@ -48,6 +48,24 @@ PEL_UNI_W(pel, pixels, 32);
PEL_UNI_W(pel, pixels, 48);
PEL_UNI_W(pel, pixels, 64);
+PEL_UNI_W(qpel, v, 8);
+PEL_UNI_W(qpel, v, 12);
+PEL_UNI_W(qpel, v, 16);
+PEL_UNI_W(qpel, v, 24);
+PEL_UNI_W(qpel, v, 32);
+PEL_UNI_W(qpel, v, 48);
+PEL_UNI_W(qpel, v, 64);
+
+PEL_UNI_W(qpel, h, 4);
+PEL_UNI_W(qpel, h, 6);
+PEL_UNI_W(qpel, h, 8);
+PEL_UNI_W(qpel, h, 12);
+PEL_UNI_W(qpel, h, 16);
+PEL_UNI_W(qpel, h, 24);
+PEL_UNI_W(qpel, h, 32);
+PEL_UNI_W(qpel, h, 48);
+PEL_UNI_W(qpel, h, 64);
+
#undef PEL_UNI_W
#endif // #ifndef AVCODEC_LOONGARCH_HEVCDSP_LASX_H
diff --git a/libavcodec/loongarch/hevcdsp_lsx.h b/libavcodec/loongarch/hevcdsp_lsx.h
index 0d724a90ef..3291294ed9 100644
--- a/libavcodec/loongarch/hevcdsp_lsx.h
+++ b/libavcodec/loongarch/hevcdsp_lsx.h
@@ -257,6 +257,26 @@ PEL_UNI_W(pel, pixels, 32);
PEL_UNI_W(pel, pixels, 48);
PEL_UNI_W(pel, pixels, 64);
+PEL_UNI_W(qpel, v, 4);
+PEL_UNI_W(qpel, v, 6);
+PEL_UNI_W(qpel, v, 8);
+PEL_UNI_W(qpel, v, 12);
+PEL_UNI_W(qpel, v, 16);
+PEL_UNI_W(qpel, v, 24);
+PEL_UNI_W(qpel, v, 32);
+PEL_UNI_W(qpel, v, 48);
+PEL_UNI_W(qpel, v, 64);
+
+PEL_UNI_W(qpel, h, 4);
+PEL_UNI_W(qpel, h, 6);
+PEL_UNI_W(qpel, h, 8);
+PEL_UNI_W(qpel, h, 12);
+PEL_UNI_W(qpel, h, 16);
+PEL_UNI_W(qpel, h, 24);
+PEL_UNI_W(qpel, h, 32);
+PEL_UNI_W(qpel, h, 48);
+PEL_UNI_W(qpel, h, 64);
+
#undef PEL_UNI_W
#endif // #ifndef AVCODEC_LOONGARCH_HEVCDSP_LSX_H
--
2.20.1
* [FFmpeg-devel] [PATCH v2 5/7] avcodec/hevc: Add epel_uni_w_hv4/6/8/12/16/24/32/48/64 asm opt
2023-12-27 4:50 [FFmpeg-devel] [PATCH v2] [loongarch] Add hevc 128-bit & 256-bit asm optimizatons jinbo
` (3 preceding siblings ...)
2023-12-27 4:50 ` [FFmpeg-devel] [PATCH v2 4/7] avcodec/hevc: Add qpel_uni_w_v|h4/6/8/12/16/24/32/48/64 " jinbo
@ 2023-12-27 4:50 ` jinbo
2023-12-27 4:50 ` [FFmpeg-devel] [PATCH v2 6/7] avcodec/hevc: Add asm opt for the following functions jinbo
2023-12-27 4:50 ` [FFmpeg-devel] [PATCH v2 7/7] avcodec/hevc: Add ff_hevc_idct_32x32_lasx asm opt jinbo
6 siblings, 0 replies; 10+ messages in thread
From: jinbo @ 2023-12-27 4:50 UTC (permalink / raw)
To: ffmpeg-devel; +Cc: jinbo
tests/checkasm/checkasm: C LSX LASX
put_hevc_epel_uni_w_hv4_8_c: 9.5 2.2
put_hevc_epel_uni_w_hv6_8_c: 18.5 5.0 3.7
put_hevc_epel_uni_w_hv8_8_c: 30.7 6.0 4.5
put_hevc_epel_uni_w_hv12_8_c: 63.7 14.0 10.7
put_hevc_epel_uni_w_hv16_8_c: 107.5 22.7 17.0
put_hevc_epel_uni_w_hv24_8_c: 236.7 50.2 31.7
put_hevc_epel_uni_w_hv32_8_c: 414.5 88.0 53.0
put_hevc_epel_uni_w_hv48_8_c: 917.5 197.7 118.5
put_hevc_epel_uni_w_hv64_8_c: 1617.0 349.5 203.0
After this patch, the performance of decoding H265 4K 30FPS 30Mbps
on 3A6000 with 8 threads improves by 3fps (52fps-->55fps).
Change-Id: If067e394cec4685c62193e7adb829ac93ba4804d
---
libavcodec/loongarch/hevc_mc.S | 821 ++++++++++++++++++
libavcodec/loongarch/hevcdsp_init_loongarch.c | 19 +
libavcodec/loongarch/hevcdsp_lasx.h | 9 +
libavcodec/loongarch/hevcdsp_lsx.h | 10 +
4 files changed, 859 insertions(+)
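
For readers following the patch, the scalar arithmetic that these epel_uni_w_hv
routines vectorize is the chain annotated in the asm comments below: a horizontal
4-tap EPEL filter into a 16-bit tmp buffer, a vertical 4-tap filter over that
buffer, then the weighted-prediction scaling ">> 6, * wx, + offset, >> shift,
+ ox, clip". The following is only an illustrative 8-bit C sketch of that chain,
not the FFmpeg reference; epel_filters[] and the helper name are placeholders.

    #include <stddef.h>
    #include <stdint.h>

    #define MAX_PB_SIZE 64

    static uint8_t clip_u8(int v) { return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v; }

    static void epel_uni_w_hv_ref(uint8_t *dst, ptrdiff_t dststride,
                                  const uint8_t *src, ptrdiff_t srcstride,
                                  int height, int denom, int wx, int ox,
                                  int mx, int my, int width,
                                  const int8_t epel_filters[7][4])
    {
        int16_t tmp[(MAX_PB_SIZE + 3) * MAX_PB_SIZE];
        int16_t *t = tmp;
        const int8_t *fh = epel_filters[mx - 1];
        const int8_t *fv = epel_filters[my - 1];
        int shift  = denom + 6;          /* denom + 14 - 8 for 8-bit */
        int offset = 1 << (shift - 1);

        src -= srcstride + 1;            /* one row above, one column left (EPEL_EXTRA) */
        for (int y = 0; y < height + 3; y++, src += srcstride, t += MAX_PB_SIZE)
            for (int x = 0; x < width; x++)          /* horizontal 4-tap pass */
                t[x] = fh[0] * src[x]     + fh[1] * src[x + 1] +
                       fh[2] * src[x + 2] + fh[3] * src[x + 3];

        t = tmp + MAX_PB_SIZE;           /* first output row */
        for (int y = 0; y < height; y++, t += MAX_PB_SIZE, dst += dststride)
            for (int x = 0; x < width; x++) {        /* vertical 4-tap pass + weighting */
                int v = fv[0] * t[x - MAX_PB_SIZE] + fv[1] * t[x] +
                        fv[2] * t[x + MAX_PB_SIZE] + fv[3] * t[x + 2 * MAX_PB_SIZE];
                dst[x] = clip_u8((((v >> 6) * wx + offset) >> shift) + ox);
            }
    }
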
diff --git a/libavcodec/loongarch/hevc_mc.S b/libavcodec/loongarch/hevc_mc.S
index 2ee338fb8e..0b0647546b 100644
--- a/libavcodec/loongarch/hevc_mc.S
+++ b/libavcodec/loongarch/hevc_mc.S
@@ -22,6 +22,7 @@
#include "loongson_asm.S"
.extern ff_hevc_qpel_filters
+.extern ff_hevc_epel_filters
.macro LOAD_VAR bit
addi.w t1, a5, 6 //shift
@@ -206,6 +207,12 @@
.endif
.endm
+/*
+ * void FUNC(put_hevc_pel_uni_w_pixels)(uint8_t *_dst, ptrdiff_t _dststride,
+ * const uint8_t *_src, ptrdiff_t _srcstride,
+ * int height, int denom, int wx, int ox,
+ * intptr_t mx, intptr_t my, int width)
+ */
function ff_hevc_put_hevc_pel_uni_w_pixels4_8_lsx
LOAD_VAR 128
srli.w t0, a4, 1
@@ -482,6 +489,12 @@ endfunc
xvhaddw.d.w \in0, \in0, \in0
.endm
+/*
+ * void FUNC(put_hevc_qpel_uni_w_v)(uint8_t *_dst, ptrdiff_t _dststride,
+ * const uint8_t *_src, ptrdiff_t _srcstride,
+ * int height, int denom, int wx, int ox,
+ * intptr_t mx, intptr_t my, int width)
+ */
function ff_hevc_put_hevc_qpel_uni_w_v4_8_lsx
LOAD_VAR 128
ld.d t0, sp, 8 //my
@@ -1253,6 +1266,12 @@ endfunc
xvssrani.bu.h \out0, xr11, 0
.endm
+/*
+ * void FUNC(put_hevc_qpel_uni_w_h)(uint8_t *_dst, ptrdiff_t _dststride,
+ * const uint8_t *_src, ptrdiff_t _srcstride,
+ * int height, int denom, int wx, int ox,
+ * intptr_t mx, intptr_t my, int width)
+ */
function ff_hevc_put_hevc_qpel_uni_w_h4_8_lsx
LOAD_VAR 128
ld.d t0, sp, 0 //mx
@@ -1763,3 +1782,805 @@ function ff_hevc_put_hevc_qpel_uni_w_h64_8_lasx
addi.d a4, a4, -1
bnez a4, .LOOP_H64_LASX
endfunc
+
+const shufb
+ .byte 0,1,2,3, 1,2,3,4 ,2,3,4,5, 3,4,5,6
+ .byte 4,5,6,7, 5,6,7,8 ,6,7,8,9, 7,8,9,10
+endconst
+
+.macro PUT_HEVC_EPEL_UNI_W_HV4_LSX w
+ fld.d f7, a2, 0 // start to load src
+ fldx.d f8, a2, a3
+ alsl.d a2, a3, a2, 1
+ fld.d f9, a2, 0
+ vshuf.b vr7, vr7, vr7, vr0 // 0123 1234 2345 3456
+ vshuf.b vr8, vr8, vr8, vr0
+ vshuf.b vr9, vr9, vr9, vr0
+ vdp2.h.bu.b vr10, vr7, vr5 // EPEL_FILTER(src, 1)
+ vdp2.h.bu.b vr11, vr8, vr5
+ vdp2.h.bu.b vr12, vr9, vr5
+ vhaddw.w.h vr10, vr10, vr10 // tmp[0/1/2/3]
+ vhaddw.w.h vr11, vr11, vr11 // vr10,vr11,vr12 corresponding to EPEL_EXTRA
+ vhaddw.w.h vr12, vr12, vr12
+.LOOP_HV4_\w:
+ add.d a2, a2, a3
+ fld.d f14, a2, 0 // height loop begin
+ vshuf.b vr14, vr14, vr14, vr0
+ vdp2.h.bu.b vr13, vr14, vr5
+ vhaddw.w.h vr13, vr13, vr13
+ vmul.w vr14, vr10, vr16 // EPEL_FILTER(tmp, MAX_PB_SIZE)
+ vmadd.w vr14, vr11, vr17
+ vmadd.w vr14, vr12, vr18
+ vmadd.w vr14, vr13, vr19
+ vaddi.wu vr10, vr11, 0 //back up previous value
+ vaddi.wu vr11, vr12, 0
+ vaddi.wu vr12, vr13, 0
+ vsrai.w vr14, vr14, 6 // >> 6
+ vmul.w vr14, vr14, vr1 // * wx
+ vadd.w vr14, vr14, vr2 // + offset
+ vsra.w vr14, vr14, vr3 // >> shift
+ vadd.w vr14, vr14, vr4 // + ox
+ vssrani.h.w vr14, vr14, 0
+ vssrani.bu.h vr14, vr14, 0 // clip
+ fst.s f14, a0, 0
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_HV4_\w
+.endm
+
+/*
+ * void FUNC(put_hevc_epel_uni_w_hv)(uint8_t *_dst, ptrdiff_t _dststride,
+ * const uint8_t *_src, ptrdiff_t _srcstride,
+ * int height, int denom, int wx, int ox,
+ * intptr_t mx, intptr_t my, int width)
+ */
+function ff_hevc_put_hevc_epel_uni_w_hv4_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 0 // mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr5, t1, t0 // ff_hevc_epel_filters[mx - 1];
+ vreplvei.w vr5, vr5, 0
+ ld.d t0, sp, 8 // my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ vldx vr6, t1, t0 // ff_hevc_epel_filters[my - 1];
+ vsllwil.h.b vr6, vr6, 0
+ vsllwil.w.h vr6, vr6, 0
+ vreplvei.w vr16, vr6, 0
+ vreplvei.w vr17, vr6, 1
+ vreplvei.w vr18, vr6, 2
+ vreplvei.w vr19, vr6, 3
+ la.local t1, shufb
+ vld vr0, t1, 0
+ sub.d a2, a2, a3 // src -= srcstride
+ addi.d a2, a2, -1
+ PUT_HEVC_EPEL_UNI_W_HV4_LSX 4
+endfunc
+
+.macro PUT_HEVC_EPEL_UNI_W_HV8_LSX w
+ vld vr7, a2, 0 // start to load src
+ vldx vr8, a2, a3
+ alsl.d a2, a3, a2, 1
+ vld vr9, a2, 0
+ vshuf.b vr10, vr7, vr7, vr0 // 0123 1234 2345 3456
+ vshuf.b vr11, vr8, vr8, vr0
+ vshuf.b vr12, vr9, vr9, vr0
+ vshuf.b vr7, vr7, vr7, vr22 // 4567 5678 6789 78910
+ vshuf.b vr8, vr8, vr8, vr22
+ vshuf.b vr9, vr9, vr9, vr22
+ vdp2.h.bu.b vr13, vr10, vr5 // EPEL_FILTER(src, 1)
+ vdp2.h.bu.b vr14, vr11, vr5
+ vdp2.h.bu.b vr15, vr12, vr5
+ vdp2.h.bu.b vr23, vr7, vr5
+ vdp2.h.bu.b vr20, vr8, vr5
+ vdp2.h.bu.b vr21, vr9, vr5
+ vhaddw.w.h vr7, vr13, vr13
+ vhaddw.w.h vr8, vr14, vr14
+ vhaddw.w.h vr9, vr15, vr15
+ vhaddw.w.h vr10, vr23, vr23
+ vhaddw.w.h vr11, vr20, vr20
+ vhaddw.w.h vr12, vr21, vr21
+.LOOP_HV8_HORI_\w:
+ add.d a2, a2, a3
+ vld vr15, a2, 0
+ vshuf.b vr23, vr15, vr15, vr0
+ vshuf.b vr15, vr15, vr15, vr22
+ vdp2.h.bu.b vr13, vr23, vr5
+ vdp2.h.bu.b vr14, vr15, vr5
+ vhaddw.w.h vr13, vr13, vr13 //789--13
+ vhaddw.w.h vr14, vr14, vr14 //101112--14
+ vmul.w vr15, vr7, vr16 //EPEL_FILTER(tmp, MAX_PB_SIZE)
+ vmadd.w vr15, vr8, vr17
+ vmadd.w vr15, vr9, vr18
+ vmadd.w vr15, vr13, vr19
+ vmul.w vr20, vr10, vr16
+ vmadd.w vr20, vr11, vr17
+ vmadd.w vr20, vr12, vr18
+ vmadd.w vr20, vr14, vr19
+ vaddi.wu vr7, vr8, 0 //back up previous value
+ vaddi.wu vr8, vr9, 0
+ vaddi.wu vr9, vr13, 0
+ vaddi.wu vr10, vr11, 0
+ vaddi.wu vr11, vr12, 0
+ vaddi.wu vr12, vr14, 0
+ vsrai.w vr15, vr15, 6 // >> 6
+ vsrai.w vr20, vr20, 6
+ vmul.w vr15, vr15, vr1 // * wx
+ vmul.w vr20, vr20, vr1
+ vadd.w vr15, vr15, vr2 // + offset
+ vadd.w vr20, vr20, vr2
+ vsra.w vr15, vr15, vr3 // >> shift
+ vsra.w vr20, vr20, vr3
+ vadd.w vr15, vr15, vr4 // + ox
+ vadd.w vr20, vr20, vr4
+ vssrani.h.w vr20, vr15, 0
+ vssrani.bu.h vr20, vr20, 0
+.if \w > 6
+ fst.d f20, a0, 0
+.else
+ fst.s f20, a0, 0
+ vstelm.h vr20, a0, 4, 2
+.endif
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_HV8_HORI_\w
+.endm
+
+.macro PUT_HEVC_EPEL_UNI_W_HV8_LASX w
+ vld vr7, a2, 0 // start to load src
+ vldx vr8, a2, a3
+ alsl.d a2, a3, a2, 1
+ vld vr9, a2, 0
+ xvreplve0.q xr7, xr7
+ xvreplve0.q xr8, xr8
+ xvreplve0.q xr9, xr9
+ xvshuf.b xr10, xr7, xr7, xr0 // 0123 1234 2345 3456
+ xvshuf.b xr11, xr8, xr8, xr0
+ xvshuf.b xr12, xr9, xr9, xr0
+ xvdp2.h.bu.b xr13, xr10, xr5 // EPEL_FILTER(src, 1)
+ xvdp2.h.bu.b xr14, xr11, xr5
+ xvdp2.h.bu.b xr15, xr12, xr5
+ xvhaddw.w.h xr7, xr13, xr13
+ xvhaddw.w.h xr8, xr14, xr14
+ xvhaddw.w.h xr9, xr15, xr15
+.LOOP_HV8_HORI_LASX_\w:
+ add.d a2, a2, a3
+ vld vr15, a2, 0
+ xvreplve0.q xr15, xr15
+ xvshuf.b xr23, xr15, xr15, xr0
+ xvdp2.h.bu.b xr10, xr23, xr5
+ xvhaddw.w.h xr10, xr10, xr10
+ xvmul.w xr15, xr7, xr16 //EPEL_FILTER(tmp, MAX_PB_SIZE)
+ xvmadd.w xr15, xr8, xr17
+ xvmadd.w xr15, xr9, xr18
+ xvmadd.w xr15, xr10, xr19
+ xvaddi.wu xr7, xr8, 0 //back up previous value
+ xvaddi.wu xr8, xr9, 0
+ xvaddi.wu xr9, xr10, 0
+ xvsrai.w xr15, xr15, 6 // >> 6
+ xvmul.w xr15, xr15, xr1 // * wx
+ xvadd.w xr15, xr15, xr2 // + offset
+ xvsra.w xr15, xr15, xr3 // >> shift
+ xvadd.w xr15, xr15, xr4 // + ox
+ xvpermi.q xr20, xr15, 0x01
+ vssrani.h.w vr20, vr15, 0
+ vssrani.bu.h vr20, vr20, 0
+.if \w > 6
+ fst.d f20, a0, 0
+.else
+ fst.s f20, a0, 0
+ vstelm.h vr20, a0, 4, 2
+.endif
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_HV8_HORI_LASX_\w
+.endm
+
+.macro PUT_HEVC_EPEL_UNI_W_HV16_LASX w
+ xvld xr7, a2, 0 // start to load src
+ xvldx xr8, a2, a3
+ alsl.d a2, a3, a2, 1
+ xvld xr9, a2, 0
+ xvpermi.d xr10, xr7, 0x09 //8..18
+ xvpermi.d xr11, xr8, 0x09
+ xvpermi.d xr12, xr9, 0x09
+ xvreplve0.q xr7, xr7
+ xvreplve0.q xr8, xr8
+ xvreplve0.q xr9, xr9
+ xvshuf.b xr13, xr7, xr7, xr0 // 0123 1234 2345 3456
+ xvshuf.b xr14, xr8, xr8, xr0
+ xvshuf.b xr15, xr9, xr9, xr0
+ xvdp2.h.bu.b xr20, xr13, xr5 // EPEL_FILTER(src, 1)
+ xvdp2.h.bu.b xr21, xr14, xr5
+ xvdp2.h.bu.b xr22, xr15, xr5
+ xvhaddw.w.h xr7, xr20, xr20
+ xvhaddw.w.h xr8, xr21, xr21
+ xvhaddw.w.h xr9, xr22, xr22
+ xvreplve0.q xr10, xr10
+ xvreplve0.q xr11, xr11
+ xvreplve0.q xr12, xr12
+ xvshuf.b xr13, xr10, xr10, xr0
+ xvshuf.b xr14, xr11, xr11, xr0
+ xvshuf.b xr15, xr12, xr12, xr0
+ xvdp2.h.bu.b xr20, xr13, xr5
+ xvdp2.h.bu.b xr21, xr14, xr5
+ xvdp2.h.bu.b xr22, xr15, xr5
+ xvhaddw.w.h xr10, xr20, xr20
+ xvhaddw.w.h xr11, xr21, xr21
+ xvhaddw.w.h xr12, xr22, xr22
+.LOOP_HV16_HORI_LASX_\w:
+ add.d a2, a2, a3
+ xvld xr15, a2, 0
+ xvpermi.d xr20, xr15, 0x09 //8...18
+ xvreplve0.q xr15, xr15
+ xvreplve0.q xr20, xr20
+ xvshuf.b xr21, xr15, xr15, xr0
+ xvshuf.b xr22, xr20, xr20, xr0
+ xvdp2.h.bu.b xr13, xr21, xr5
+ xvdp2.h.bu.b xr14, xr22, xr5
+ xvhaddw.w.h xr13, xr13, xr13
+ xvhaddw.w.h xr14, xr14, xr14
+ xvmul.w xr15, xr7, xr16 //EPEL_FILTER(tmp, MAX_PB_SIZE)
+ xvmadd.w xr15, xr8, xr17
+ xvmadd.w xr15, xr9, xr18
+ xvmadd.w xr15, xr13, xr19
+ xvmul.w xr20, xr10, xr16
+ xvmadd.w xr20, xr11, xr17
+ xvmadd.w xr20, xr12, xr18
+ xvmadd.w xr20, xr14, xr19
+ xvaddi.wu xr7, xr8, 0 //back up previous value
+ xvaddi.wu xr8, xr9, 0
+ xvaddi.wu xr9, xr13, 0
+ xvaddi.wu xr10, xr11, 0
+ xvaddi.wu xr11, xr12, 0
+ xvaddi.wu xr12, xr14, 0
+ xvsrai.w xr15, xr15, 6 // >> 6
+ xvsrai.w xr20, xr20, 6 // >> 6
+ xvmul.w xr15, xr15, xr1 // * wx
+ xvmul.w xr20, xr20, xr1 // * wx
+ xvadd.w xr15, xr15, xr2 // + offset
+ xvadd.w xr20, xr20, xr2 // + offset
+ xvsra.w xr15, xr15, xr3 // >> shift
+ xvsra.w xr20, xr20, xr3 // >> shift
+ xvadd.w xr15, xr15, xr4 // + ox
+ xvadd.w xr20, xr20, xr4 // + ox
+ xvssrani.h.w xr20, xr15, 0
+ xvpermi.q xr21, xr20, 0x01
+ vssrani.bu.h vr21, vr20, 0
+ vpermi.w vr21, vr21, 0xd8
+.if \w < 16
+ fst.d f21, a0, 0
+ vstelm.w vr21, a0, 8, 2
+.else
+ vst vr21, a0, 0
+.endif
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_HV16_HORI_LASX_\w
+.endm
+
+function ff_hevc_put_hevc_epel_uni_w_hv6_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 0 // mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr5, t1, t0 // ff_hevc_epel_filters[mx - 1];
+ vreplvei.w vr5, vr5, 0
+ ld.d t0, sp, 8 // my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ vldx vr6, t1, t0 // ff_hevc_epel_filters[my - 1];
+ vsllwil.h.b vr6, vr6, 0
+ vsllwil.w.h vr6, vr6, 0
+ vreplvei.w vr16, vr6, 0
+ vreplvei.w vr17, vr6, 1
+ vreplvei.w vr18, vr6, 2
+ vreplvei.w vr19, vr6, 3
+ la.local t1, shufb
+ vld vr0, t1, 0
+ vaddi.bu vr22, vr0, 4 // update shufb to get high part
+ sub.d a2, a2, a3 // src -= srcstride
+ addi.d a2, a2, -1
+ PUT_HEVC_EPEL_UNI_W_HV8_LSX 6
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_hv6_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 0 // mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr5, t1, t0 // ff_hevc_epel_filters[mx - 1];
+ xvreplve0.w xr5, xr5
+ ld.d t0, sp, 8 // my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ vldx vr6, t1, t0 // ff_hevc_epel_filters[my - 1];
+ vsllwil.h.b vr6, vr6, 0
+ vsllwil.w.h vr6, vr6, 0
+ xvreplve0.q xr6, xr6
+ xvrepl128vei.w xr16, xr6, 0
+ xvrepl128vei.w xr17, xr6, 1
+ xvrepl128vei.w xr18, xr6, 2
+ xvrepl128vei.w xr19, xr6, 3
+ la.local t1, shufb
+ xvld xr0, t1, 0
+ sub.d a2, a2, a3 // src -= srcstride
+ addi.d a2, a2, -1
+ PUT_HEVC_EPEL_UNI_W_HV8_LASX 6
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_hv8_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 0 // mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr5, t1, t0 // ff_hevc_epel_filters[mx - 1];
+ vreplvei.w vr5, vr5, 0
+ ld.d t0, sp, 8 // my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ vldx vr6, t1, t0 // ff_hevc_epel_filters[my - 1];
+ vsllwil.h.b vr6, vr6, 0
+ vsllwil.w.h vr6, vr6, 0
+ vreplvei.w vr16, vr6, 0
+ vreplvei.w vr17, vr6, 1
+ vreplvei.w vr18, vr6, 2
+ vreplvei.w vr19, vr6, 3
+ la.local t1, shufb
+ vld vr0, t1, 0
+ vaddi.bu vr22, vr0, 4 // update shufb to get high part
+ sub.d a2, a2, a3 // src -= srcstride
+ addi.d a2, a2, -1
+ PUT_HEVC_EPEL_UNI_W_HV8_LSX 8
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_hv8_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 0 // mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr5, t1, t0 // ff_hevc_epel_filters[mx - 1];
+ xvreplve0.w xr5, xr5
+ ld.d t0, sp, 8 // my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ vldx vr6, t1, t0 // ff_hevc_epel_filters[my - 1];
+ vsllwil.h.b vr6, vr6, 0
+ vsllwil.w.h vr6, vr6, 0
+ xvreplve0.q xr6, xr6
+ xvrepl128vei.w xr16, xr6, 0
+ xvrepl128vei.w xr17, xr6, 1
+ xvrepl128vei.w xr18, xr6, 2
+ xvrepl128vei.w xr19, xr6, 3
+ la.local t1, shufb
+ xvld xr0, t1, 0
+ sub.d a2, a2, a3 // src -= srcstride
+ addi.d a2, a2, -1
+ PUT_HEVC_EPEL_UNI_W_HV8_LASX 8
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_hv12_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 0 // mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr5, t1, t0 // ff_hevc_epel_filters[mx - 1];
+ vreplvei.w vr5, vr5, 0
+ ld.d t0, sp, 8 // my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ vldx vr6, t1, t0 // ff_hevc_epel_filters[my - 1];
+ vsllwil.h.b vr6, vr6, 0
+ vsllwil.w.h vr6, vr6, 0
+ vreplvei.w vr16, vr6, 0
+ vreplvei.w vr17, vr6, 1
+ vreplvei.w vr18, vr6, 2
+ vreplvei.w vr19, vr6, 3
+ la.local t1, shufb
+ vld vr0, t1, 0
+ vaddi.bu vr22, vr0, 4 // update shufb to get high part
+ sub.d a2, a2, a3 // src -= srcstride
+ addi.d a2, a2, -1
+ addi.d t2, a0, 0
+ addi.d t3, a2, 0
+ addi.d t4, a4, 0
+ PUT_HEVC_EPEL_UNI_W_HV8_LSX 12
+ addi.d a0, t2, 8
+ addi.d a2, t3, 8
+ addi.d a4, t4, 0
+ PUT_HEVC_EPEL_UNI_W_HV4_LSX 12
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_hv12_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 0 // mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr5, t1, t0 // ff_hevc_epel_filters[mx - 1];
+ xvreplve0.w xr5, xr5
+ ld.d t0, sp, 8 // my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ vldx vr6, t1, t0 // ff_hevc_epel_filters[my - 1];
+ vsllwil.h.b vr6, vr6, 0
+ vsllwil.w.h vr6, vr6, 0
+ xvreplve0.q xr6, xr6
+ xvrepl128vei.w xr16, xr6, 0
+ xvrepl128vei.w xr17, xr6, 1
+ xvrepl128vei.w xr18, xr6, 2
+ xvrepl128vei.w xr19, xr6, 3
+ la.local t1, shufb
+ xvld xr0, t1, 0
+ sub.d a2, a2, a3 // src -= srcstride
+ addi.d a2, a2, -1
+ PUT_HEVC_EPEL_UNI_W_HV16_LASX 12
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_hv16_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 0 // mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr5, t1, t0 // ff_hevc_epel_filters[mx - 1];
+ vreplvei.w vr5, vr5, 0
+ ld.d t0, sp, 8 // my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ vldx vr6, t1, t0 // ff_hevc_epel_filters[my - 1];
+ vsllwil.h.b vr6, vr6, 0
+ vsllwil.w.h vr6, vr6, 0
+ vreplvei.w vr16, vr6, 0
+ vreplvei.w vr17, vr6, 1
+ vreplvei.w vr18, vr6, 2
+ vreplvei.w vr19, vr6, 3
+ la.local t1, shufb
+ vld vr0, t1, 0
+ vaddi.bu vr22, vr0, 4 // update shufb to get high part
+ sub.d a2, a2, a3 // src -= srcstride
+ addi.d a2, a2, -1
+ addi.d t2, a0, 0
+ addi.d t3, a2, 0
+ addi.d t4, a4, 0
+ addi.d t5, zero, 2
+.LOOP_HV16:
+ PUT_HEVC_EPEL_UNI_W_HV8_LSX 16
+ addi.d a0, t2, 8
+ addi.d a2, t3, 8
+ addi.d a4, t4, 0
+ addi.d t5, t5, -1
+ bnez t5, .LOOP_HV16
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_hv16_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 0 // mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr5, t1, t0 // ff_hevc_epel_filters[mx - 1];
+ xvreplve0.w xr5, xr5
+ ld.d t0, sp, 8 // my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ vldx vr6, t1, t0 // ff_hevc_epel_filters[my - 1];
+ vsllwil.h.b vr6, vr6, 0
+ vsllwil.w.h vr6, vr6, 0
+ xvreplve0.q xr6, xr6
+ xvrepl128vei.w xr16, xr6, 0
+ xvrepl128vei.w xr17, xr6, 1
+ xvrepl128vei.w xr18, xr6, 2
+ xvrepl128vei.w xr19, xr6, 3
+ la.local t1, shufb
+ xvld xr0, t1, 0
+ sub.d a2, a2, a3 // src -= srcstride
+ addi.d a2, a2, -1
+ PUT_HEVC_EPEL_UNI_W_HV16_LASX 16
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_hv24_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 0 // mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr5, t1, t0 // ff_hevc_epel_filters[mx - 1];
+ vreplvei.w vr5, vr5, 0
+ ld.d t0, sp, 8 // my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ vldx vr6, t1, t0 // ff_hevc_epel_filters[my - 1];
+ vsllwil.h.b vr6, vr6, 0
+ vsllwil.w.h vr6, vr6, 0
+ vreplvei.w vr16, vr6, 0
+ vreplvei.w vr17, vr6, 1
+ vreplvei.w vr18, vr6, 2
+ vreplvei.w vr19, vr6, 3
+ la.local t1, shufb
+ vld vr0, t1, 0
+ vaddi.bu vr22, vr0, 4 // update shufb to get high part
+ sub.d a2, a2, a3 // src -= srcstride
+ addi.d a2, a2, -1
+ addi.d t2, a0, 0
+ addi.d t3, a2, 0
+ addi.d t4, a4, 0
+ addi.d t5, zero, 3
+.LOOP_HV24:
+ PUT_HEVC_EPEL_UNI_W_HV8_LSX 24
+ addi.d a0, t2, 8
+ addi.d t2, t2, 8
+ addi.d a2, t3, 8
+ addi.d t3, t3, 8
+ addi.d a4, t4, 0
+ addi.d t5, t5, -1
+ bnez t5, .LOOP_HV24
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_hv24_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 0 // mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr5, t1, t0 // ff_hevc_epel_filters[mx - 1];
+ xvreplve0.w xr5, xr5
+ ld.d t0, sp, 8 // my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ vldx vr6, t1, t0 // ff_hevc_epel_filters[my - 1];
+ vsllwil.h.b vr6, vr6, 0
+ vsllwil.w.h vr6, vr6, 0
+ xvreplve0.q xr6, xr6
+ xvrepl128vei.w xr16, xr6, 0
+ xvrepl128vei.w xr17, xr6, 1
+ xvrepl128vei.w xr18, xr6, 2
+ xvrepl128vei.w xr19, xr6, 3
+ la.local t1, shufb
+ xvld xr0, t1, 0
+ sub.d a2, a2, a3 // src -= srcstride
+ addi.d a2, a2, -1
+ addi.d t2, a0, 0
+ addi.d t3, a2, 0
+ addi.d t4, a4, 0
+ PUT_HEVC_EPEL_UNI_W_HV16_LASX 24
+ addi.d a0, t2, 16
+ addi.d a2, t3, 16
+ addi.d a4, t4, 0
+ PUT_HEVC_EPEL_UNI_W_HV8_LASX 24
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_hv32_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 0 // mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr5, t1, t0 // ff_hevc_epel_filters[mx - 1];
+ vreplvei.w vr5, vr5, 0
+ ld.d t0, sp, 8 // my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ vldx vr6, t1, t0 // ff_hevc_epel_filters[my - 1];
+ vsllwil.h.b vr6, vr6, 0
+ vsllwil.w.h vr6, vr6, 0
+ vreplvei.w vr16, vr6, 0
+ vreplvei.w vr17, vr6, 1
+ vreplvei.w vr18, vr6, 2
+ vreplvei.w vr19, vr6, 3
+ la.local t1, shufb
+ vld vr0, t1, 0
+ vaddi.bu vr22, vr0, 4 // update shufb to get high part
+ sub.d a2, a2, a3 // src -= srcstride
+ addi.d a2, a2, -1
+ addi.d t2, a0, 0
+ addi.d t3, a2, 0
+ addi.d t4, a4, 0
+ addi.d t5, zero, 4
+.LOOP_HV32:
+ PUT_HEVC_EPEL_UNI_W_HV8_LSX 32
+ addi.d a0, t2, 8
+ addi.d t2, t2, 8
+ addi.d a2, t3, 8
+ addi.d t3, t3, 8
+ addi.d a4, t4, 0
+ addi.d t5, t5, -1
+ bnez t5, .LOOP_HV32
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_hv32_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 0 // mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr5, t1, t0 // ff_hevc_epel_filters[mx - 1];
+ xvreplve0.w xr5, xr5
+ ld.d t0, sp, 8 // my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ vldx vr6, t1, t0 // ff_hevc_epel_filters[my - 1];
+ vsllwil.h.b vr6, vr6, 0
+ vsllwil.w.h vr6, vr6, 0
+ xvreplve0.q xr6, xr6
+ xvrepl128vei.w xr16, xr6, 0
+ xvrepl128vei.w xr17, xr6, 1
+ xvrepl128vei.w xr18, xr6, 2
+ xvrepl128vei.w xr19, xr6, 3
+ la.local t1, shufb
+ xvld xr0, t1, 0
+ sub.d a2, a2, a3 // src -= srcstride
+ addi.d a2, a2, -1
+ addi.d t2, a0, 0
+ addi.d t3, a2, 0
+ addi.d t4, a4, 0
+ addi.d t5, zero, 2
+.LOOP_HV32_LASX:
+ PUT_HEVC_EPEL_UNI_W_HV16_LASX 32
+ addi.d a0, t2, 16
+ addi.d t2, t2, 16
+ addi.d a2, t3, 16
+ addi.d t3, t3, 16
+ addi.d a4, t4, 0
+ addi.d t5, t5, -1
+ bnez t5, .LOOP_HV32_LASX
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_hv48_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 0 // mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr5, t1, t0 // ff_hevc_epel_filters[mx - 1];
+ vreplvei.w vr5, vr5, 0
+ ld.d t0, sp, 8 // my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ vldx vr6, t1, t0 // ff_hevc_epel_filters[my - 1];
+ vsllwil.h.b vr6, vr6, 0
+ vsllwil.w.h vr6, vr6, 0
+ vreplvei.w vr16, vr6, 0
+ vreplvei.w vr17, vr6, 1
+ vreplvei.w vr18, vr6, 2
+ vreplvei.w vr19, vr6, 3
+ la.local t1, shufb
+ vld vr0, t1, 0
+ vaddi.bu vr22, vr0, 4 // update shufb to get high part
+ sub.d a2, a2, a3 // src -= srcstride
+ addi.d a2, a2, -1
+ addi.d t2, a0, 0
+ addi.d t3, a2, 0
+ addi.d t4, a4, 0
+ addi.d t5, zero, 6
+.LOOP_HV48:
+ PUT_HEVC_EPEL_UNI_W_HV8_LSX 48
+ addi.d a0, t2, 8
+ addi.d t2, t2, 8
+ addi.d a2, t3, 8
+ addi.d t3, t3, 8
+ addi.d a4, t4, 0
+ addi.d t5, t5, -1
+ bnez t5, .LOOP_HV48
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_hv48_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 0 // mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr5, t1, t0 // ff_hevc_epel_filters[mx - 1];
+ xvreplve0.w xr5, xr5
+ ld.d t0, sp, 8 // my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ vldx vr6, t1, t0 // ff_hevc_epel_filters[my - 1];
+ vsllwil.h.b vr6, vr6, 0
+ vsllwil.w.h vr6, vr6, 0
+ xvreplve0.q xr6, xr6
+ xvrepl128vei.w xr16, xr6, 0
+ xvrepl128vei.w xr17, xr6, 1
+ xvrepl128vei.w xr18, xr6, 2
+ xvrepl128vei.w xr19, xr6, 3
+ la.local t1, shufb
+ xvld xr0, t1, 0
+ sub.d a2, a2, a3 // src -= srcstride
+ addi.d a2, a2, -1
+ addi.d t2, a0, 0
+ addi.d t3, a2, 0
+ addi.d t4, a4, 0
+ addi.d t5, zero, 3
+.LOOP_HV48_LASX:
+ PUT_HEVC_EPEL_UNI_W_HV16_LASX 48
+ addi.d a0, t2, 16
+ addi.d t2, t2, 16
+ addi.d a2, t3, 16
+ addi.d t3, t3, 16
+ addi.d a4, t4, 0
+ addi.d t5, t5, -1
+ bnez t5, .LOOP_HV48_LASX
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_hv64_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 0 // mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr5, t1, t0 // ff_hevc_epel_filters[mx - 1];
+ vreplvei.w vr5, vr5, 0
+ ld.d t0, sp, 8 // my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ vldx vr6, t1, t0 // ff_hevc_epel_filters[my - 1];
+ vsllwil.h.b vr6, vr6, 0
+ vsllwil.w.h vr6, vr6, 0
+ vreplvei.w vr16, vr6, 0
+ vreplvei.w vr17, vr6, 1
+ vreplvei.w vr18, vr6, 2
+ vreplvei.w vr19, vr6, 3
+ la.local t1, shufb
+ vld vr0, t1, 0
+ vaddi.bu vr22, vr0, 4 // update shufb to get high part
+ sub.d a2, a2, a3 // src -= srcstride
+ addi.d a2, a2, -1
+ addi.d t2, a0, 0
+ addi.d t3, a2, 0
+ addi.d t4, a4, 0
+ addi.d t5, zero, 8
+.LOOP_HV64:
+ PUT_HEVC_EPEL_UNI_W_HV8_LSX 64
+ addi.d a0, t2, 8
+ addi.d t2, t2, 8
+ addi.d a2, t3, 8
+ addi.d t3, t3, 8
+ addi.d a4, t4, 0
+ addi.d t5, t5, -1
+ bnez t5, .LOOP_HV64
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_hv64_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 0 // mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr5, t1, t0 // ff_hevc_epel_filters[mx - 1];
+ xvreplve0.w xr5, xr5
+ ld.d t0, sp, 8 // my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ vldx vr6, t1, t0 // ff_hevc_epel_filters[my - 1];
+ vsllwil.h.b vr6, vr6, 0
+ vsllwil.w.h vr6, vr6, 0
+ xvreplve0.q xr6, xr6
+ xvrepl128vei.w xr16, xr6, 0
+ xvrepl128vei.w xr17, xr6, 1
+ xvrepl128vei.w xr18, xr6, 2
+ xvrepl128vei.w xr19, xr6, 3
+ la.local t1, shufb
+ xvld xr0, t1, 0
+ sub.d a2, a2, a3 // src -= srcstride
+ addi.d a2, a2, -1
+ addi.d t2, a0, 0
+ addi.d t3, a2, 0
+ addi.d t4, a4, 0
+ addi.d t5, zero, 4
+.LOOP_HV64_LASX:
+ PUT_HEVC_EPEL_UNI_W_HV16_LASX 64
+ addi.d a0, t2, 16
+ addi.d t2, t2, 16
+ addi.d a2, t3, 16
+ addi.d t3, t3, 16
+ addi.d a4, t4, 0
+ addi.d t5, t5, -1
+ bnez t5, .LOOP_HV64_LASX
+endfunc
diff --git a/libavcodec/loongarch/hevcdsp_init_loongarch.c b/libavcodec/loongarch/hevcdsp_init_loongarch.c
index 3cdb3fb2d7..245a833947 100644
--- a/libavcodec/loongarch/hevcdsp_init_loongarch.c
+++ b/libavcodec/loongarch/hevcdsp_init_loongarch.c
@@ -171,6 +171,16 @@ void ff_hevc_dsp_init_loongarch(HEVCDSPContext *c, const int bit_depth)
c->put_hevc_qpel_uni_w[8][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels48_8_lsx;
c->put_hevc_qpel_uni_w[9][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels64_8_lsx;
+ c->put_hevc_epel_uni_w[1][1][1] = ff_hevc_put_hevc_epel_uni_w_hv4_8_lsx;
+ c->put_hevc_epel_uni_w[2][1][1] = ff_hevc_put_hevc_epel_uni_w_hv6_8_lsx;
+ c->put_hevc_epel_uni_w[3][1][1] = ff_hevc_put_hevc_epel_uni_w_hv8_8_lsx;
+ c->put_hevc_epel_uni_w[4][1][1] = ff_hevc_put_hevc_epel_uni_w_hv12_8_lsx;
+ c->put_hevc_epel_uni_w[5][1][1] = ff_hevc_put_hevc_epel_uni_w_hv16_8_lsx;
+ c->put_hevc_epel_uni_w[6][1][1] = ff_hevc_put_hevc_epel_uni_w_hv24_8_lsx;
+ c->put_hevc_epel_uni_w[7][1][1] = ff_hevc_put_hevc_epel_uni_w_hv32_8_lsx;
+ c->put_hevc_epel_uni_w[8][1][1] = ff_hevc_put_hevc_epel_uni_w_hv48_8_lsx;
+ c->put_hevc_epel_uni_w[9][1][1] = ff_hevc_put_hevc_epel_uni_w_hv64_8_lsx;
+
c->put_hevc_epel_uni_w[1][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels4_8_lsx;
c->put_hevc_epel_uni_w[2][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels6_8_lsx;
c->put_hevc_epel_uni_w[3][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels8_8_lsx;
@@ -258,6 +268,15 @@ void ff_hevc_dsp_init_loongarch(HEVCDSPContext *c, const int bit_depth)
c->put_hevc_epel_uni_w[8][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels48_8_lasx;
c->put_hevc_epel_uni_w[9][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels64_8_lasx;
+ c->put_hevc_epel_uni_w[2][1][1] = ff_hevc_put_hevc_epel_uni_w_hv6_8_lasx;
+ c->put_hevc_epel_uni_w[3][1][1] = ff_hevc_put_hevc_epel_uni_w_hv8_8_lasx;
+ c->put_hevc_epel_uni_w[4][1][1] = ff_hevc_put_hevc_epel_uni_w_hv12_8_lasx;
+ c->put_hevc_epel_uni_w[5][1][1] = ff_hevc_put_hevc_epel_uni_w_hv16_8_lasx;
+ c->put_hevc_epel_uni_w[6][1][1] = ff_hevc_put_hevc_epel_uni_w_hv24_8_lasx;
+ c->put_hevc_epel_uni_w[7][1][1] = ff_hevc_put_hevc_epel_uni_w_hv32_8_lasx;
+ c->put_hevc_epel_uni_w[8][1][1] = ff_hevc_put_hevc_epel_uni_w_hv48_8_lasx;
+ c->put_hevc_epel_uni_w[9][1][1] = ff_hevc_put_hevc_epel_uni_w_hv64_8_lasx;
+
c->put_hevc_qpel_uni_w[3][1][0] = ff_hevc_put_hevc_qpel_uni_w_v8_8_lasx;
c->put_hevc_qpel_uni_w[4][1][0] = ff_hevc_put_hevc_qpel_uni_w_v12_8_lasx;
c->put_hevc_qpel_uni_w[5][1][0] = ff_hevc_put_hevc_qpel_uni_w_v16_8_lasx;
diff --git a/libavcodec/loongarch/hevcdsp_lasx.h b/libavcodec/loongarch/hevcdsp_lasx.h
index 8a9266d375..7f09d0943a 100644
--- a/libavcodec/loongarch/hevcdsp_lasx.h
+++ b/libavcodec/loongarch/hevcdsp_lasx.h
@@ -66,6 +66,15 @@ PEL_UNI_W(qpel, h, 32);
PEL_UNI_W(qpel, h, 48);
PEL_UNI_W(qpel, h, 64);
+PEL_UNI_W(epel, hv, 6);
+PEL_UNI_W(epel, hv, 8);
+PEL_UNI_W(epel, hv, 12);
+PEL_UNI_W(epel, hv, 16);
+PEL_UNI_W(epel, hv, 24);
+PEL_UNI_W(epel, hv, 32);
+PEL_UNI_W(epel, hv, 48);
+PEL_UNI_W(epel, hv, 64);
+
#undef PEL_UNI_W
#endif // #ifndef AVCODEC_LOONGARCH_HEVCDSP_LASX_H
diff --git a/libavcodec/loongarch/hevcdsp_lsx.h b/libavcodec/loongarch/hevcdsp_lsx.h
index 3291294ed9..7769cf25ae 100644
--- a/libavcodec/loongarch/hevcdsp_lsx.h
+++ b/libavcodec/loongarch/hevcdsp_lsx.h
@@ -277,6 +277,16 @@ PEL_UNI_W(qpel, h, 32);
PEL_UNI_W(qpel, h, 48);
PEL_UNI_W(qpel, h, 64);
+PEL_UNI_W(epel, hv, 4);
+PEL_UNI_W(epel, hv, 6);
+PEL_UNI_W(epel, hv, 8);
+PEL_UNI_W(epel, hv, 12);
+PEL_UNI_W(epel, hv, 16);
+PEL_UNI_W(epel, hv, 24);
+PEL_UNI_W(epel, hv, 32);
+PEL_UNI_W(epel, hv, 48);
+PEL_UNI_W(epel, hv, 64);
+
#undef PEL_UNI_W
#endif // #ifndef AVCODEC_LOONGARCH_HEVCDSP_LSX_H
--
2.20.1
* [FFmpeg-devel] [PATCH v2 6/7] avcodec/hevc: Add asm opt for the following functions
2023-12-27 4:50 [FFmpeg-devel] [PATCH v2] [loongarch] Add hevc 128-bit & 256-bit asm optimizatons jinbo
` (4 preceding siblings ...)
2023-12-27 4:50 ` [FFmpeg-devel] [PATCH v2 5/7] avcodec/hevc: Add epel_uni_w_hv4/6/8/12/16/24/32/48/64 " jinbo
@ 2023-12-27 4:50 ` jinbo
2023-12-27 4:50 ` [FFmpeg-devel] [PATCH v2 7/7] avcodec/hevc: Add ff_hevc_idct_32x32_lasx asm opt jinbo
6 siblings, 0 replies; 10+ messages in thread
From: jinbo @ 2023-12-27 4:50 UTC (permalink / raw)
To: ffmpeg-devel; +Cc: jinbo
tests/checkasm/checkasm: C LSX LASX
put_hevc_qpel_uni_h4_8_c: 5.7 1.2
put_hevc_qpel_uni_h6_8_c: 12.2 2.7
put_hevc_qpel_uni_h8_8_c: 21.5 3.2
put_hevc_qpel_uni_h12_8_c: 47.2 9.2 7.2
put_hevc_qpel_uni_h16_8_c: 87.0 11.7 9.0
put_hevc_qpel_uni_h24_8_c: 188.2 27.5 21.0
put_hevc_qpel_uni_h32_8_c: 335.2 46.7 28.5
put_hevc_qpel_uni_h48_8_c: 772.5 104.5 65.2
put_hevc_qpel_uni_h64_8_c: 1383.2 142.2 109.0
put_hevc_epel_uni_w_v4_8_c: 5.0 1.5
put_hevc_epel_uni_w_v6_8_c: 10.7 3.5 2.5
put_hevc_epel_uni_w_v8_8_c: 18.2 3.7 3.0
put_hevc_epel_uni_w_v12_8_c: 40.2 10.7 7.5
put_hevc_epel_uni_w_v16_8_c: 70.2 13.0 9.2
put_hevc_epel_uni_w_v24_8_c: 158.2 30.2 22.5
put_hevc_epel_uni_w_v32_8_c: 281.0 52.0 36.5
put_hevc_epel_uni_w_v48_8_c: 631.7 116.7 82.7
put_hevc_epel_uni_w_v64_8_c: 1108.2 207.5 142.2
put_hevc_epel_uni_w_h4_8_c: 4.7 1.2
put_hevc_epel_uni_w_h6_8_c: 9.7 3.5 2.7
put_hevc_epel_uni_w_h8_8_c: 17.2 4.2 3.5
put_hevc_epel_uni_w_h12_8_c: 38.0 11.5 7.2
put_hevc_epel_uni_w_h16_8_c: 69.2 14.5 9.2
put_hevc_epel_uni_w_h24_8_c: 152.0 34.7 22.5
put_hevc_epel_uni_w_h32_8_c: 271.0 58.0 40.0
put_hevc_epel_uni_w_h48_8_c: 597.5 136.7 95.0
put_hevc_epel_uni_w_h64_8_c: 1074.0 252.2 168.0
put_hevc_epel_bi_h4_8_c: 4.5 0.7
put_hevc_epel_bi_h6_8_c: 9.0 1.5
put_hevc_epel_bi_h8_8_c: 15.2 1.7
put_hevc_epel_bi_h12_8_c: 33.5 4.2 3.7
put_hevc_epel_bi_h16_8_c: 59.7 5.2 4.7
put_hevc_epel_bi_h24_8_c: 132.2 11.0
put_hevc_epel_bi_h32_8_c: 232.7 20.2 13.2
put_hevc_epel_bi_h48_8_c: 521.7 45.2 31.2
put_hevc_epel_bi_h64_8_c: 949.0 71.5 51.0
After this patch, the performance of decoding H265 4K 30FPS
30Mbps on 3A6000 with 8 threads improves by 1fps (55fps-->56fps).
Change-Id: I8cc1e41daa63ca478039bc55d1ee8934a7423f51
---
libavcodec/loongarch/hevc_mc.S | 1991 ++++++++++++++++-
libavcodec/loongarch/hevcdsp_init_loongarch.c | 66 +
libavcodec/loongarch/hevcdsp_lasx.h | 54 +
libavcodec/loongarch/hevcdsp_lsx.h | 36 +-
4 files changed, 2144 insertions(+), 3 deletions(-)
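
Among the functions listed above, the qpel_uni_h family is the simplest: an 8-tap
horizontal filter followed by the unweighted "+ 32, >> 6, clip" rounding that the
asm comments show. A minimal 8-bit C sketch under the same caveat as before
(qpel_filters[] and the helper name are placeholders, not the FFmpeg reference):

    #include <stddef.h>
    #include <stdint.h>

    static void qpel_uni_h_ref(uint8_t *dst, ptrdiff_t dststride,
                               const uint8_t *src, ptrdiff_t srcstride,
                               int height, int mx, int width,
                               const int8_t qpel_filters[3][8])
    {
        const int8_t *f = qpel_filters[mx - 1];

        src -= 3;                                /* same as "addi.d a2, a2, -3" */
        for (int y = 0; y < height; y++, src += srcstride, dst += dststride)
            for (int x = 0; x < width; x++) {
                int v = 0;
                for (int k = 0; k < 8; k++)      /* 8-tap QPEL_FILTER(src, 1) */
                    v += f[k] * src[x + k];
                v = (v + 32) >> 6;               /* round and shift, as in the asm */
                dst[x] = v < 0 ? 0 : v > 255 ? 255 : v;
            }
    }
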
diff --git a/libavcodec/loongarch/hevc_mc.S b/libavcodec/loongarch/hevc_mc.S
index 0b0647546b..a0e5938fbd 100644
--- a/libavcodec/loongarch/hevc_mc.S
+++ b/libavcodec/loongarch/hevc_mc.S
@@ -1784,8 +1784,12 @@ function ff_hevc_put_hevc_qpel_uni_w_h64_8_lasx
endfunc
const shufb
- .byte 0,1,2,3, 1,2,3,4 ,2,3,4,5, 3,4,5,6
- .byte 4,5,6,7, 5,6,7,8 ,6,7,8,9, 7,8,9,10
+ .byte 0,1,2,3, 1,2,3,4 ,2,3,4,5, 3,4,5,6 //mask for epel_uni_w(128-bit)
+ .byte 4,5,6,7, 5,6,7,8 ,6,7,8,9, 7,8,9,10 //mask for epel_uni_w(256-bit)
+ .byte 0,1,2,3, 4,5,6,7 ,1,2,3,4, 5,6,7,8 //mask for qpel_uni_h4
+ .byte 0,1,1,2, 2,3,3,4 ,4,5,5,6, 6,7,7,8 //mask for qpel_uni_h/v6/8...
+ .byte 0,1,2,3, 1,2,3,4 ,2,3,4,5, 3,4,5,6, 4,5,6,7, 5,6,7,8, 6,7,8,9, 7,8,9,10 //epel_uni_w_h16/24/32/48/64
+ .byte 0,1,1,2, 2,3,3,4 ,4,5,5,6, 6,7,7,8, 0,1,1,2, 2,3,3,4 ,4,5,5,6, 6,7,7,8 //mask for bi_epel_h16/24/32/48/64
endconst
.macro PUT_HEVC_EPEL_UNI_W_HV4_LSX w
@@ -2584,3 +2588,1986 @@ function ff_hevc_put_hevc_epel_uni_w_hv64_8_lasx
addi.d t5, t5, -1
bnez t5, .LOOP_HV64_LASX
endfunc
+
+/*
+ * void FUNC(put_hevc_qpel_uni_h)(uint8_t *_dst, ptrdiff_t _dststride,
+ * const uint8_t *_src, ptrdiff_t _srcstride,
+ * int height, intptr_t mx, intptr_t my,
+ * int width)
+ */
+function ff_hevc_put_hevc_uni_qpel_h4_8_lsx
+ addi.d t0, a5, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ addi.d a2, a2, -3 //src -= 3
+ addi.w t1, zero, 32
+ vreplgr2vr.h vr1, t1
+ la.local t1, shufb
+ vld vr2, t1, 32 //mask0 0 1
+ vaddi.bu vr3, vr2, 2 //mask1 2 3
+.LOOP_UNI_H4:
+ vld vr18, a2, 0
+ vldx vr19, a2, a3
+ alsl.d a2, a3, a2, 1
+ vshuf.b vr6, vr18, vr18, vr2
+ vshuf.b vr7, vr18, vr18, vr3
+ vshuf.b vr8, vr19, vr19, vr2
+ vshuf.b vr9, vr19, vr19, vr3
+ vdp2.h.bu.b vr10, vr6, vr5
+ vdp2.h.bu.b vr11, vr7, vr5
+ vdp2.h.bu.b vr12, vr8, vr5
+ vdp2.h.bu.b vr13, vr9, vr5
+ vhaddw.d.h vr10
+ vhaddw.d.h vr11
+ vhaddw.d.h vr12
+ vhaddw.d.h vr13
+ vpickev.w vr10, vr11, vr10
+ vpickev.w vr11, vr13, vr12
+ vpickev.h vr10, vr11, vr10
+ vadd.h vr10, vr10, vr1
+ vsrai.h vr10, vr10, 6
+ vssrani.bu.h vr10, vr10, 0
+ fst.s f10, a0, 0
+ vbsrl.v vr10, vr10, 4
+ fstx.s f10, a0, a1
+ alsl.d a0, a1, a0, 1
+ addi.d a4, a4, -2
+ bnez a4, .LOOP_UNI_H4
+endfunc
+
+.macro HEVC_UNI_QPEL_H8_LSX in0, out0
+ vshuf.b vr10, \in0, \in0, vr5
+ vshuf.b vr11, \in0, \in0, vr6
+ vshuf.b vr12, \in0, \in0, vr7
+ vshuf.b vr13, \in0, \in0, vr8
+ vdp2.h.bu.b \out0, vr10, vr0 // QPEL_FILTER(src, 1)
+ vdp2add.h.bu.b \out0, vr11, vr1
+ vdp2add.h.bu.b \out0, vr12, vr2
+ vdp2add.h.bu.b \out0, vr13, vr3
+ vadd.h \out0, \out0, vr4
+ vsrai.h \out0, \out0, 6
+.endm
+
+.macro HEVC_UNI_QPEL_H16_LASX in0, out0
+ xvshuf.b xr10, \in0, \in0, xr5
+ xvshuf.b xr11, \in0, \in0, xr6
+ xvshuf.b xr12, \in0, \in0, xr7
+ xvshuf.b xr13, \in0, \in0, xr8
+ xvdp2.h.bu.b \out0, xr10, xr0 // QPEL_FILTER(src, 1)
+ xvdp2add.h.bu.b \out0, xr11, xr1
+ xvdp2add.h.bu.b \out0, xr12, xr2
+ xvdp2add.h.bu.b \out0, xr13, xr3
+ xvadd.h \out0, \out0, xr4
+ xvsrai.h \out0, \out0, 6
+.endm
+
+function ff_hevc_put_hevc_uni_qpel_h6_8_lsx
+ addi.d t0, a5, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr0, t1, t0 //filter abcdefgh
+ vreplvei.h vr1, vr0, 1 //cd...
+ vreplvei.h vr2, vr0, 2 //ef...
+ vreplvei.h vr3, vr0, 3 //gh...
+ vreplvei.h vr0, vr0, 0 //ab...
+ addi.d a2, a2, -3 //src -= 3
+ addi.w t1, zero, 32
+ vreplgr2vr.h vr4, t1
+ la.local t1, shufb
+ vld vr5, t1, 48
+ vaddi.bu vr6, vr5, 2
+ vaddi.bu vr7, vr5, 4
+ vaddi.bu vr8, vr5, 6
+.LOOP_UNI_H6:
+ vld vr9, a2, 0
+ add.d a2, a2, a3
+ HEVC_UNI_QPEL_H8_LSX vr9, vr14
+ vssrani.bu.h vr14, vr14, 0
+ fst.s f14, a0, 0
+ vstelm.h vr14, a0, 4, 2
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_H6
+endfunc
+
+function ff_hevc_put_hevc_uni_qpel_h8_8_lsx
+ addi.d t0, a5, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr0, t1, t0 //filter abcdefgh
+ vreplvei.h vr1, vr0, 1 //cd...
+ vreplvei.h vr2, vr0, 2 //ef...
+ vreplvei.h vr3, vr0, 3 //gh...
+ vreplvei.h vr0, vr0, 0 //ab...
+ addi.d a2, a2, -3 //src -= 3
+ addi.w t1, zero, 32
+ vreplgr2vr.h vr4, t1
+ la.local t1, shufb
+ vld vr5, t1, 48
+ vaddi.bu vr6, vr5, 2
+ vaddi.bu vr7, vr5, 4
+ vaddi.bu vr8, vr5, 6
+.LOOP_UNI_H8:
+ vld vr9, a2, 0
+ add.d a2, a2, a3
+ HEVC_UNI_QPEL_H8_LSX vr9, vr14
+ vssrani.bu.h vr14, vr14, 0
+ fst.d f14, a0, 0
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_H8
+endfunc
+
+function ff_hevc_put_hevc_uni_qpel_h12_8_lsx
+ addi.d t0, a5, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr0, t1, t0 //filter abcdefgh
+ vreplvei.h vr1, vr0, 1 //cd...
+ vreplvei.h vr2, vr0, 2 //ef...
+ vreplvei.h vr3, vr0, 3 //gh...
+ vreplvei.h vr0, vr0, 0 //ab...
+ addi.d a2, a2, -3 //src -= 3
+ addi.w t1, zero, 32
+ vreplgr2vr.h vr4, t1
+ la.local t1, shufb
+ vld vr5, t1, 48
+ vaddi.bu vr6, vr5, 2
+ vaddi.bu vr7, vr5, 4
+ vaddi.bu vr8, vr5, 6
+.LOOP_UNI_H12:
+ vld vr9, a2, 0
+ HEVC_UNI_QPEL_H8_LSX vr9, vr14
+ vld vr9, a2, 8
+ add.d a2, a2, a3
+ HEVC_UNI_QPEL_H8_LSX vr9, vr15
+ vssrani.bu.h vr15, vr14, 0
+ fst.d f15, a0, 0
+ vstelm.w vr15, a0, 8, 2
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_H12
+endfunc
+
+function ff_hevc_put_hevc_uni_qpel_h12_8_lasx
+ addi.d t0, a5, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr0, t1, t0 //filter abcdefgh
+ xvreplve0.q xr0, xr0
+ xvrepl128vei.h xr1, xr0, 1 //cd...
+ xvrepl128vei.h xr2, xr0, 2 //ef...
+ xvrepl128vei.h xr3, xr0, 3 //gh...
+ xvrepl128vei.h xr0, xr0, 0 //ab...
+ addi.d a2, a2, -3 //src -= 3
+ addi.w t1, zero, 32
+ xvreplgr2vr.h xr4, t1
+ la.local t1, shufb
+ vld vr5, t1, 48
+ xvreplve0.q xr5, xr5
+ xvaddi.bu xr6, xr5, 2
+ xvaddi.bu xr7, xr5, 4
+ xvaddi.bu xr8, xr5, 6
+.LOOP_UNI_H12_LASX:
+ xvld xr9, a2, 0
+ add.d a2, a2, a3
+ xvpermi.d xr9, xr9, 0x94 //rearrange data
+ HEVC_UNI_QPEL_H16_LASX xr9, xr14
+ xvpermi.q xr15, xr14, 0x01
+ vssrani.bu.h vr15, vr14, 0
+ fst.d f15, a0, 0
+ vstelm.w vr15, a0, 8, 2
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_H12_LASX
+endfunc
+
+function ff_hevc_put_hevc_uni_qpel_h16_8_lsx
+ addi.d t0, a5, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr0, t1, t0 //filter abcdefgh
+ vreplvei.h vr1, vr0, 1 //cd...
+ vreplvei.h vr2, vr0, 2 //ef...
+ vreplvei.h vr3, vr0, 3 //gh...
+ vreplvei.h vr0, vr0, 0 //ab...
+ addi.d a2, a2, -3 //src -= 3
+ addi.w t1, zero, 32
+ vreplgr2vr.h vr4, t1
+ la.local t1, shufb
+ vld vr5, t1, 48
+ vaddi.bu vr6, vr5, 2
+ vaddi.bu vr7, vr5, 4
+ vaddi.bu vr8, vr5, 6
+.LOOP_UNI_H16:
+ vld vr9, a2, 0
+ HEVC_UNI_QPEL_H8_LSX vr9, vr14
+ vld vr9, a2, 8
+ add.d a2, a2, a3
+ HEVC_UNI_QPEL_H8_LSX vr9, vr15
+ vssrani.bu.h vr15, vr14, 0
+ vst vr15, a0, 0
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_H16
+endfunc
+
+function ff_hevc_put_hevc_uni_qpel_h16_8_lasx
+ addi.d t0, a5, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr0, t1, t0 //filter abcdefgh
+ xvreplve0.q xr0, xr0
+ xvrepl128vei.h xr1, xr0, 1 //cd...
+ xvrepl128vei.h xr2, xr0, 2 //ef...
+ xvrepl128vei.h xr3, xr0, 3 //gh...
+ xvrepl128vei.h xr0, xr0, 0 //ab...
+ addi.d a2, a2, -3 //src -= 3
+ addi.w t1, zero, 32
+ xvreplgr2vr.h xr4, t1
+ la.local t1, shufb
+ vld vr5, t1, 48
+ xvreplve0.q xr5, xr5
+ xvaddi.bu xr6, xr5, 2
+ xvaddi.bu xr7, xr5, 4
+ xvaddi.bu xr8, xr5, 6
+.LOOP_UNI_H16_LASX:
+ xvld xr9, a2, 0
+ add.d a2, a2, a3
+ xvpermi.d xr9, xr9, 0x94 //rearrange data
+ HEVC_UNI_QPEL_H16_LASX xr9, xr14
+ xvpermi.q xr15, xr14, 0x01
+ vssrani.bu.h vr15, vr14, 0
+ vst vr15, a0, 0
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_H16_LASX
+endfunc
+
+function ff_hevc_put_hevc_uni_qpel_h24_8_lsx
+ addi.d t0, a5, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr0, t1, t0 //filter abcdefgh
+ vreplvei.h vr1, vr0, 1 //cd...
+ vreplvei.h vr2, vr0, 2 //ef...
+ vreplvei.h vr3, vr0, 3 //gh...
+ vreplvei.h vr0, vr0, 0 //ab...
+ addi.d a2, a2, -3 //src -= 3
+ addi.w t1, zero, 32
+ vreplgr2vr.h vr4, t1
+ la.local t1, shufb
+ vld vr5, t1, 48
+ vaddi.bu vr6, vr5, 2
+ vaddi.bu vr7, vr5, 4
+ vaddi.bu vr8, vr5, 6
+.LOOP_UNI_H24:
+ vld vr9, a2, 0
+ HEVC_UNI_QPEL_H8_LSX vr9, vr14
+ vld vr9, a2, 8
+ HEVC_UNI_QPEL_H8_LSX vr9, vr15
+ vld vr9, a2, 16
+ add.d a2, a2, a3
+ HEVC_UNI_QPEL_H8_LSX vr9, vr16
+ vssrani.bu.h vr15, vr14, 0
+ vssrani.bu.h vr16, vr16, 0
+ vst vr15, a0, 0
+ fst.d f16, a0, 16
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_H24
+endfunc
+
+function ff_hevc_put_hevc_uni_qpel_h24_8_lasx
+ addi.d t0, a5, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr0, t1, t0 //filter abcdefgh
+ xvreplve0.q xr0, xr0
+ xvrepl128vei.h xr1, xr0, 1 //cd...
+ xvrepl128vei.h xr2, xr0, 2 //ef...
+ xvrepl128vei.h xr3, xr0, 3 //gh...
+ xvrepl128vei.h xr0, xr0, 0 //ab...
+ addi.d a2, a2, -3 //src -= 3
+ addi.w t1, zero, 32
+ xvreplgr2vr.h xr4, t1
+ la.local t1, shufb
+ vld vr5, t1, 48
+ xvreplve0.q xr5, xr5
+ xvaddi.bu xr6, xr5, 2
+ xvaddi.bu xr7, xr5, 4
+ xvaddi.bu xr8, xr5, 6
+.LOOP_UNI_H24_LASX:
+ xvld xr9, a2, 0
+ xvpermi.q xr19, xr9, 0x01 //16...23
+ add.d a2, a2, a3
+ xvpermi.d xr9, xr9, 0x94 //rearrange data
+ HEVC_UNI_QPEL_H16_LASX xr9, xr14
+ xvpermi.q xr15, xr14, 0x01
+ vssrani.bu.h vr15, vr14, 0
+ vst vr15, a0, 0
+ HEVC_UNI_QPEL_H8_LSX vr19, vr16
+ vssrani.bu.h vr16, vr16, 0
+ fst.d f16, a0, 16
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_H24_LASX
+endfunc
+
+function ff_hevc_put_hevc_uni_qpel_h32_8_lsx
+ addi.d t0, a5, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr0, t1, t0 //filter abcdefgh
+ vreplvei.h vr1, vr0, 1 //cd...
+ vreplvei.h vr2, vr0, 2 //ef...
+ vreplvei.h vr3, vr0, 3 //gh...
+ vreplvei.h vr0, vr0, 0 //ab...
+ addi.d a2, a2, -3 //src -= 3
+ addi.w t1, zero, 32
+ vreplgr2vr.h vr4, t1
+ la.local t1, shufb
+ vld vr5, t1, 48
+ vaddi.bu vr6, vr5, 2
+ vaddi.bu vr7, vr5, 4
+ vaddi.bu vr8, vr5, 6
+.LOOP_UNI_H32:
+ vld vr9, a2, 0
+ HEVC_UNI_QPEL_H8_LSX vr9, vr14
+ vld vr9, a2, 8
+ HEVC_UNI_QPEL_H8_LSX vr9, vr15
+ vld vr9, a2, 16
+ HEVC_UNI_QPEL_H8_LSX vr9, vr16
+ vld vr9, a2, 24
+ add.d a2, a2, a3
+ HEVC_UNI_QPEL_H8_LSX vr9, vr17
+ vssrani.bu.h vr15, vr14, 0
+ vssrani.bu.h vr17, vr16, 0
+ vst vr15, a0, 0
+ vst vr17, a0, 16
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_H32
+endfunc
+
+function ff_hevc_put_hevc_uni_qpel_h32_8_lasx
+ addi.d t0, a5, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr0, t1, t0 //filter abcdefgh
+ xvreplve0.q xr0, xr0
+ xvrepl128vei.h xr1, xr0, 1 //cd...
+ xvrepl128vei.h xr2, xr0, 2 //ef...
+ xvrepl128vei.h xr3, xr0, 3 //gh...
+ xvrepl128vei.h xr0, xr0, 0 //ab...
+ addi.d a2, a2, -3 //src -= 3
+ addi.w t1, zero, 32
+ xvreplgr2vr.h xr4, t1
+ la.local t1, shufb
+ vld vr5, t1, 48
+ xvreplve0.q xr5, xr5
+ xvaddi.bu xr6, xr5, 2
+ xvaddi.bu xr7, xr5, 4
+ xvaddi.bu xr8, xr5, 6
+.LOOP_UNI_H32_LASX:
+ xvld xr9, a2, 0
+ xvpermi.d xr9, xr9, 0x94
+ HEVC_UNI_QPEL_H16_LASX xr9, xr14
+ xvld xr9, a2, 16
+ xvpermi.d xr9, xr9, 0x94
+ HEVC_UNI_QPEL_H16_LASX xr9, xr15
+ add.d a2, a2, a3
+ xvssrani.bu.h xr15, xr14, 0
+ xvpermi.d xr15, xr15, 0xd8
+ xvst xr15, a0, 0
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_H32_LASX
+endfunc
+
+function ff_hevc_put_hevc_uni_qpel_h48_8_lsx
+ addi.d t0, a5, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr0, t1, t0 //filter abcdefgh
+ vreplvei.h vr1, vr0, 1 //cd...
+ vreplvei.h vr2, vr0, 2 //ef...
+ vreplvei.h vr3, vr0, 3 //gh...
+ vreplvei.h vr0, vr0, 0 //ab...
+ addi.d a2, a2, -3 //src -= 3
+ addi.w t1, zero, 32
+ vreplgr2vr.h vr4, t1
+ la.local t1, shufb
+ vld vr5, t1, 48
+ vaddi.bu vr6, vr5, 2
+ vaddi.bu vr7, vr5, 4
+ vaddi.bu vr8, vr5, 6
+.LOOP_UNI_H48:
+ vld vr9, a2, 0
+ HEVC_UNI_QPEL_H8_LSX vr9, vr14
+ vld vr9, a2, 8
+ HEVC_UNI_QPEL_H8_LSX vr9, vr15
+ vld vr9, a2, 16
+ HEVC_UNI_QPEL_H8_LSX vr9, vr16
+ vld vr9, a2, 24
+ HEVC_UNI_QPEL_H8_LSX vr9, vr17
+ vld vr9, a2, 32
+ HEVC_UNI_QPEL_H8_LSX vr9, vr18
+ vld vr9, a2, 40
+ add.d a2, a2, a3
+ HEVC_UNI_QPEL_H8_LSX vr9, vr19
+ vssrani.bu.h vr15, vr14, 0
+ vssrani.bu.h vr17, vr16, 0
+ vssrani.bu.h vr19, vr18, 0
+ vst vr15, a0, 0
+ vst vr17, a0, 16
+ vst vr19, a0, 32
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_H48
+endfunc
+
+function ff_hevc_put_hevc_uni_qpel_h48_8_lasx
+ addi.d t0, a5, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr0, t1, t0 //filter abcdefgh
+ xvreplve0.q xr0, xr0
+ xvrepl128vei.h xr1, xr0, 1 //cd...
+ xvrepl128vei.h xr2, xr0, 2 //ef...
+ xvrepl128vei.h xr3, xr0, 3 //gh...
+ xvrepl128vei.h xr0, xr0, 0 //ab...
+ addi.d a2, a2, -3 //src -= 3
+ addi.w t1, zero, 32
+ xvreplgr2vr.h xr4, t1
+ la.local t1, shufb
+ vld vr5, t1, 48
+ xvreplve0.q xr5, xr5
+ xvaddi.bu xr6, xr5, 2
+ xvaddi.bu xr7, xr5, 4
+ xvaddi.bu xr8, xr5, 6
+.LOOP_UNI_H48_LASX:
+ xvld xr9, a2, 0
+ xvpermi.d xr9, xr9, 0x94
+ HEVC_UNI_QPEL_H16_LASX xr9, xr14
+ xvld xr9, a2, 16
+ xvpermi.d xr9, xr9, 0x94
+ HEVC_UNI_QPEL_H16_LASX xr9, xr15
+ xvld xr9, a2, 32
+ xvpermi.d xr9, xr9, 0x94
+ HEVC_UNI_QPEL_H16_LASX xr9, xr16
+ add.d a2, a2, a3
+ xvssrani.bu.h xr15, xr14, 0
+ xvpermi.d xr15, xr15, 0xd8
+ xvst xr15, a0, 0
+ xvpermi.q xr17, xr16, 0x01
+ vssrani.bu.h vr17, vr16, 0
+ vst vr17, a0, 32
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_H48_LASX
+endfunc
+
+function ff_hevc_put_hevc_uni_qpel_h64_8_lasx
+ addi.d t0, a5, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr0, t1, t0 //filter abcdefgh
+ xvreplve0.q xr0, xr0
+ xvrepl128vei.h xr1, xr0, 1 //cd...
+ xvrepl128vei.h xr2, xr0, 2 //ef...
+ xvrepl128vei.h xr3, xr0, 3 //gh...
+ xvrepl128vei.h xr0, xr0, 0 //ab...
+ addi.d a2, a2, -3 //src -= 3
+ addi.w t1, zero, 32
+ xvreplgr2vr.h xr4, t1
+ la.local t1, shufb
+ vld vr5, t1, 48
+ xvreplve0.q xr5, xr5
+ xvaddi.bu xr6, xr5, 2
+ xvaddi.bu xr7, xr5, 4
+ xvaddi.bu xr8, xr5, 6
+.LOOP_UNI_H64_LASX:
+ xvld xr9, a2, 0
+ xvpermi.d xr9, xr9, 0x94
+ HEVC_UNI_QPEL_H16_LASX xr9, xr14
+ xvld xr9, a2, 16
+ xvpermi.d xr9, xr9, 0x94
+ HEVC_UNI_QPEL_H16_LASX xr9, xr15
+ xvld xr9, a2, 32
+ xvpermi.d xr9, xr9, 0x94
+ HEVC_UNI_QPEL_H16_LASX xr9, xr16
+ xvld xr9, a2, 48
+ xvpermi.d xr9, xr9, 0x94
+ HEVC_UNI_QPEL_H16_LASX xr9, xr17
+ add.d a2, a2, a3
+ xvssrani.bu.h xr15, xr14, 0
+ xvpermi.d xr15, xr15, 0xd8
+ xvst xr15, a0, 0
+ xvssrani.bu.h xr17, xr16, 0
+ xvpermi.d xr17, xr17, 0xd8
+ xvst xr17, a0, 32
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_H64_LASX
+endfunc
+
+/*
+ * void FUNC(put_hevc_epel_uni_w_v)(uint8_t *_dst, ptrdiff_t _dststride,
+ * const uint8_t *_src, ptrdiff_t _srcstride,
+ * int height, int denom, int wx, int ox,
+ * intptr_t mx, intptr_t my, int width)
+ */
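+/*
+ * Rough scalar sketch of the 8-bit path implemented below, following the
+ * inline comments; EPEL_FILTER(), offset and shift are assumed to match the
+ * generic C template in libavcodec/hevcdsp_template.c:
+ *
+ *     for (y = 0; y < height; y++, dst += dststride, src += srcstride)
+ *         for (x = 0; x < width; x++)
+ *             dst[x] = av_clip_uint8((((EPEL_FILTER(src, srcstride) * wx)
+ *                                       + offset) >> shift) + ox);
+ */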
+function ff_hevc_put_hevc_epel_uni_w_v4_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ sub.d a2, a2, a3 //src -= stride
+ fld.s f6, a2, 0 //0
+ fldx.s f7, a2, a3 //1
+ fldx.s f8, a2, t0 //2
+ add.d a2, a2, t1
+ vilvl.b vr6, vr7, vr6
+ vilvl.b vr7, vr8, vr8
+ vilvl.h vr6, vr7, vr6
+ vreplvei.w vr0, vr0, 0
+.LOOP_UNI_V4:
+ fld.s f9, a2, 0 //3
+ fldx.s f10, a2, a3 //4
+ add.d a2, a2, t0
+ vextrins.b vr6, vr9, 0x30 //insert the 3rd load
+ vextrins.b vr6, vr9, 0x71
+ vextrins.b vr6, vr9, 0xb2
+ vextrins.b vr6, vr9, 0xf3
+ vbsrl.v vr7, vr6, 1
+ vextrins.b vr7, vr10, 0x30 //insert the 4th load
+ vextrins.b vr7, vr10, 0x71
+ vextrins.b vr7, vr10, 0xb2
+ vextrins.b vr7, vr10, 0xf3
+ vdp2.h.bu.b vr8, vr6, vr0 //EPEL_FILTER(src, stride)
+ vdp2.h.bu.b vr9, vr7, vr0
+ vhaddw.w.h vr10, vr8, vr8
+ vhaddw.w.h vr11, vr9, vr9
+ vmulwev.w.h vr10, vr10, vr1 //EPEL_FILTER(src, stride) * wx
+ vmulwev.w.h vr11, vr11, vr1
+ vadd.w vr10, vr10, vr2 // + offset
+ vadd.w vr11, vr11, vr2
+ vsra.w vr10, vr10, vr3 // >> shift
+ vsra.w vr11, vr11, vr3
+ vadd.w vr10, vr10, vr4 // + ox
+ vadd.w vr11, vr11, vr4
+ vssrani.h.w vr11, vr10, 0
+ vssrani.bu.h vr10, vr11, 0
+ vbsrl.v vr6, vr7, 1
+ fst.s f10, a0, 0
+ vbsrl.v vr10, vr10, 4
+ fstx.s f10, a0, a1
+ alsl.d a0, a1, a0, 1
+ addi.d a4, a4, -2
+ bnez a4, .LOOP_UNI_V4
+endfunc
+
+.macro CALC_EPEL_FILTER_LSX out0, out1
+ vdp2.h.bu.b vr12, vr10, vr0 //EPEL_FILTER(src, stride)
+ vdp2add.h.bu.b vr12, vr11, vr5
+ vexth.w.h vr13, vr12
+ vsllwil.w.h vr12, vr12, 0
+ vmulwev.w.h vr12, vr12, vr1 //EPEL_FILTER(src, stride) * wx
+ vmulwev.w.h vr13, vr13, vr1 //EPEL_FILTER(src, stride) * wx
+ vadd.w vr12, vr12, vr2 // + offset
+ vadd.w vr13, vr13, vr2
+ vsra.w vr12, vr12, vr3 // >> shift
+ vsra.w vr13, vr13, vr3
+ vadd.w \out0, vr12, vr4 // + ox
+ vadd.w \out1, vr13, vr4
+.endm
+
+.macro CALC_EPEL_FILTER_LASX out0
+ xvdp2.h.bu.b xr11, xr12, xr0 //EPEL_FILTER(src, stride)
+ xvhaddw.w.h xr12, xr11, xr11
+ xvmulwev.w.h xr12, xr12, xr1 //EPEL_FILTER(src, stride) * wx
+ xvadd.w xr12, xr12, xr2 // + offset
+ xvsra.w xr12, xr12, xr3 // >> shift
+ xvadd.w \out0, xr12, xr4 // + ox
+.endm
+
+//w is used both as a label suffix and as the condition of the ".if" directive below.
+.macro PUT_HEVC_EPEL_UNI_W_V8_LSX w
+ fld.d f6, a2, 0 //0
+ fldx.d f7, a2, a3 //1
+ fldx.d f8, a2, t0 //2
+ add.d a2, a2, t1
+.LOOP_UNI_V8_\w:
+ fld.d f9, a2, 0 // 3
+ add.d a2, a2, a3
+ vilvl.b vr10, vr7, vr6
+ vilvl.b vr11, vr9, vr8
+ vaddi.bu vr6, vr7, 0 //back up previous value
+ vaddi.bu vr7, vr8, 0
+ vaddi.bu vr8, vr9, 0
+ CALC_EPEL_FILTER_LSX vr12, vr13
+ vssrani.h.w vr13, vr12, 0
+ vssrani.bu.h vr13, vr13, 0
+.if \w < 8
+ fst.s f13, a0, 0
+ vstelm.h vr13, a0, 4, 2
+.else
+ fst.d f13, a0, 0
+.endif
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_V8_\w
+.endm
+
+//w is used both as a label suffix and as the condition of the ".if" directive below.
+.macro PUT_HEVC_EPEL_UNI_W_V8_LASX w
+ fld.d f6, a2, 0 //0
+ fldx.d f7, a2, a3 //1
+ fldx.d f8, a2, t0 //2
+ add.d a2, a2, t1
+.LOOP_UNI_V8_LASX_\w:
+ fld.d f9, a2, 0 // 3
+ add.d a2, a2, a3
+ vilvl.b vr10, vr7, vr6
+ vilvl.b vr11, vr9, vr8
+ xvilvl.h xr12, xr11, xr10
+ xvilvh.h xr13, xr11, xr10
+ xvpermi.q xr12, xr13, 0x02
+ vaddi.bu vr6, vr7, 0 //back up previous value
+ vaddi.bu vr7, vr8, 0
+ vaddi.bu vr8, vr9, 0
+ CALC_EPEL_FILTER_LASX xr12
+ xvpermi.q xr13, xr12, 0x01
+ vssrani.h.w vr13, vr12, 0
+ vssrani.bu.h vr13, vr13, 0
+.if \w < 8
+ fst.s f13, a0, 0
+ vstelm.h vr13, a0, 4, 2
+.else
+ fst.d f13, a0, 0
+.endif
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_V8_LASX_\w
+.endm
+
+function ff_hevc_put_hevc_epel_uni_w_v6_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ sub.d a2, a2, a3 //src -= stride
+ vreplvei.h vr5, vr0, 1
+ vreplvei.h vr0, vr0, 0
+ PUT_HEVC_EPEL_UNI_W_V8_LSX 6
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_v6_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ xvreplve0.w xr0, xr0
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ sub.d a2, a2, a3 //src -= stride
+ PUT_HEVC_EPEL_UNI_W_V8_LASX 6
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_v8_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ sub.d a2, a2, a3 //src -= stride
+ vreplvei.h vr5, vr0, 1
+ vreplvei.h vr0, vr0, 0
+ PUT_HEVC_EPEL_UNI_W_V8_LSX 8
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_v8_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ xvreplve0.w xr0, xr0
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ sub.d a2, a2, a3 //src -= stride
+ PUT_HEVC_EPEL_UNI_W_V8_LASX 8
+endfunc
+
+//w is used both as a label suffix and as the condition of the ".if" directive below.
+.macro PUT_HEVC_EPEL_UNI_W_V16_LSX w
+ vld vr6, a2, 0 //0
+ vldx vr7, a2, a3 //1
+ vldx vr8, a2, t0 //2
+ add.d a2, a2, t1
+.LOOP_UNI_V16_\w:
+ vld vr9, a2, 0 //3
+ add.d a2, a2, a3
+ vilvl.b vr10, vr7, vr6
+ vilvl.b vr11, vr9, vr8
+ CALC_EPEL_FILTER_LSX vr14, vr15
+ vilvh.b vr10, vr7, vr6
+ vilvh.b vr11, vr9, vr8
+ CALC_EPEL_FILTER_LSX vr16, vr17
+ vssrani.h.w vr15, vr14, 0
+ vssrani.h.w vr17, vr16, 0
+ vssrani.bu.h vr17, vr15, 0
+ vaddi.bu vr6, vr7, 0 //back up previous value
+ vaddi.bu vr7, vr8, 0
+ vaddi.bu vr8, vr9, 0
+.if \w < 16
+ fst.d f17, a0, 0
+ vstelm.w vr17, a0, 8, 2
+.else
+ vst vr17, a0, 0
+.endif
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_V16_\w
+.endm
+
+//w is used both as a label suffix and as the condition of the ".if" directive below.
+.macro PUT_HEVC_EPEL_UNI_W_V16_LASX w
+ vld vr6, a2, 0 //0
+ vldx vr7, a2, a3 //1
+ vldx vr8, a2, t0 //2
+ add.d a2, a2, t1
+.LOOP_UNI_V16_LASX_\w:
+ vld vr9, a2, 0 //3
+ add.d a2, a2, a3
+ xvilvl.b xr10, xr7, xr6
+ xvilvh.b xr11, xr7, xr6
+ xvpermi.q xr11, xr10, 0x20
+ xvilvl.b xr12, xr9, xr8
+ xvilvh.b xr13, xr9, xr8
+ xvpermi.q xr13, xr12, 0x20
+ xvdp2.h.bu.b xr10, xr11, xr0 //EPEL_FILTER(src, stride)
+ xvdp2add.h.bu.b xr10, xr13, xr5
+ xvexth.w.h xr11, xr10
+ xvsllwil.w.h xr10, xr10, 0
+ xvmulwev.w.h xr10, xr10, xr1 //EPEL_FILTER(src, stride) * wx
+ xvmulwev.w.h xr11, xr11, xr1
+ xvadd.w xr10, xr10, xr2 // + offset
+ xvadd.w xr11, xr11, xr2
+ xvsra.w xr10, xr10, xr3 // >> shift
+ xvsra.w xr11, xr11, xr3
+ xvadd.w xr10, xr10, xr4 // + ox
+ xvadd.w xr11, xr11, xr4
+ xvssrani.h.w xr11, xr10, 0
+ xvpermi.q xr10, xr11, 0x01
+ vssrani.bu.h vr10, vr11, 0
+ vaddi.bu vr6, vr7, 0 //back up previous value
+ vaddi.bu vr7, vr8, 0
+ vaddi.bu vr8, vr9, 0
+.if \w < 16
+ fst.d f10, a0, 0
+ vstelm.w vr10, a0, 8, 2
+.else
+ vst vr10, a0, 0
+.endif
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_V16_LASX_\w
+.endm
+
+function ff_hevc_put_hevc_epel_uni_w_v12_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ sub.d a2, a2, a3 //src -= stride
+ vreplvei.h vr5, vr0, 1
+ vreplvei.h vr0, vr0, 0
+ PUT_HEVC_EPEL_UNI_W_V16_LSX 12
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_v12_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ xvreplve0.q xr0, xr0
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ sub.d a2, a2, a3 //src -= stride
+ xvrepl128vei.h xr5, xr0, 1
+ xvrepl128vei.h xr0, xr0, 0
+ PUT_HEVC_EPEL_UNI_W_V16_LASX 12
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_v16_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ sub.d a2, a2, a3 //src -= stride
+ vreplvei.h vr5, vr0, 1
+ vreplvei.h vr0, vr0, 0
+ PUT_HEVC_EPEL_UNI_W_V16_LSX 16
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_v16_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ xvreplve0.q xr0, xr0
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ sub.d a2, a2, a3 //src -= stride
+ xvrepl128vei.h xr5, xr0, 1
+ xvrepl128vei.h xr0, xr0, 0
+ PUT_HEVC_EPEL_UNI_W_V16_LASX 16
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_v24_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ sub.d a2, a2, a3 //src -= stride
+ vreplvei.h vr5, vr0, 1
+ vreplvei.h vr0, vr0, 0
+ addi.d t2, a0, 0 //save init
+ addi.d t3, a2, 0
+ addi.d t4, a4, 0
+ PUT_HEVC_EPEL_UNI_W_V16_LSX 24
+ addi.d a0, t2, 16 //increase step
+ addi.d a2, t3, 16
+ addi.d a4, t4, 0
+ PUT_HEVC_EPEL_UNI_W_V8_LSX 24
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_v24_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ xvreplve0.w xr20, xr0 //save xr0
+ xvreplve0.q xr0, xr0
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ sub.d a2, a2, a3 //src -= stride
+ xvrepl128vei.h xr5, xr0, 1
+ xvrepl128vei.h xr0, xr0, 0
+ addi.d t2, a0, 0 //save init
+ addi.d t3, a2, 0
+ addi.d t4, a4, 0
+ PUT_HEVC_EPEL_UNI_W_V16_LASX 24
+ addi.d a0, t2, 16 //increase step
+ addi.d a2, t3, 16
+ addi.d a4, t4, 0
+ xvaddi.bu xr0, xr20, 0
+ PUT_HEVC_EPEL_UNI_W_V8_LASX 24
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_v32_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ sub.d a2, a2, a3 //src -= stride
+ vreplvei.h vr5, vr0, 1
+ vreplvei.h vr0, vr0, 0
+ addi.d t2, a0, 0
+ addi.d t3, a2, 0
+ addi.d t4, a4, 0
+ PUT_HEVC_EPEL_UNI_W_V16_LSX 32
+ addi.d a0, t2, 16
+ addi.d a2, t3, 16
+ addi.d a4, t4, 0
+ PUT_HEVC_EPEL_UNI_W_V16_LSX 33
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_v32_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ xvreplve0.q xr0, xr0
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ sub.d a2, a2, a3 //src -= stride
+ xvrepl128vei.h xr5, xr0, 1
+ xvrepl128vei.h xr0, xr0, 0
+ addi.d t2, a0, 0
+ addi.d t3, a2, 0
+ addi.d t4, a4, 0
+ PUT_HEVC_EPEL_UNI_W_V16_LASX 32
+ addi.d a0, t2, 16
+ addi.d a2, t3, 16
+ addi.d a4, t4, 0
+ PUT_HEVC_EPEL_UNI_W_V16_LASX 33
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_v48_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ sub.d a2, a2, a3 //src -= stride
+ vreplvei.h vr5, vr0, 1
+ vreplvei.h vr0, vr0, 0
+ addi.d t2, a0, 0
+ addi.d t3, a2, 0
+ addi.d t4, a4, 0
+ PUT_HEVC_EPEL_UNI_W_V16_LSX 48
+ addi.d a0, t2, 16
+ addi.d a2, t3, 16
+ addi.d a4, t4, 0
+ PUT_HEVC_EPEL_UNI_W_V16_LSX 49
+ addi.d a0, t2, 32
+ addi.d a2, t3, 32
+ addi.d a4, t4, 0
+ PUT_HEVC_EPEL_UNI_W_V16_LSX 50
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_v48_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ xvreplve0.q xr0, xr0
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ sub.d a2, a2, a3 //src -= stride
+ xvrepl128vei.h xr5, xr0, 1
+ xvrepl128vei.h xr0, xr0, 0
+ addi.d t2, a0, 0
+ addi.d t3, a2, 0
+ addi.d t4, a4, 0
+ PUT_HEVC_EPEL_UNI_W_V16_LASX 48
+ addi.d a0, t2, 16
+ addi.d a2, t3, 16
+ addi.d a4, t4, 0
+ PUT_HEVC_EPEL_UNI_W_V16_LASX 49
+ addi.d a0, t2, 32
+ addi.d a2, t3, 32
+ addi.d a4, t4, 0
+ PUT_HEVC_EPEL_UNI_W_V16_LASX 50
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_v64_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ sub.d a2, a2, a3 //src -= stride
+ vreplvei.h vr5, vr0, 1
+ vreplvei.h vr0, vr0, 0
+ addi.d t2, a0, 0
+ addi.d t3, a2, 0
+ addi.d t4, a4, 0
+ PUT_HEVC_EPEL_UNI_W_V16_LSX 64
+ addi.d a0, t2, 16
+ addi.d a2, t3, 16
+ addi.d a4, t4, 0
+ PUT_HEVC_EPEL_UNI_W_V16_LSX 65
+ addi.d a0, t2, 32
+ addi.d a2, t3, 32
+ addi.d a4, t4, 0
+ PUT_HEVC_EPEL_UNI_W_V16_LSX 66
+ addi.d a0, t2, 48
+ addi.d a2, t3, 48
+ addi.d a4, t4, 0
+ PUT_HEVC_EPEL_UNI_W_V16_LSX 67
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_v64_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ xvreplve0.q xr0, xr0
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ sub.d a2, a2, a3 //src -= stride
+ xvrepl128vei.h xr5, xr0, 1
+ xvrepl128vei.h xr0, xr0, 0
+ addi.d t2, a0, 0
+ addi.d t3, a2, 0
+ addi.d t4, a4, 0
+ PUT_HEVC_EPEL_UNI_W_V16_LASX 64
+ addi.d a0, t2, 16
+ addi.d a2, t3, 16
+ addi.d a4, t4, 0
+ PUT_HEVC_EPEL_UNI_W_V16_LASX 65
+ addi.d a0, t2, 32
+ addi.d a2, t3, 32
+ addi.d a4, t4, 0
+ PUT_HEVC_EPEL_UNI_W_V16_LASX 66
+ addi.d a0, t2, 48
+ addi.d a2, t3, 48
+ addi.d a4, t4, 0
+ PUT_HEVC_EPEL_UNI_W_V16_LASX 67
+endfunc
+
+/*
+ * void FUNC(put_hevc_epel_uni_w_h)(uint8_t *_dst, ptrdiff_t _dststride,
+ * const uint8_t *_src, ptrdiff_t _srcstride,
+ * int height, int denom, int wx, int ox,
+ * intptr_t mx, intptr_t my, int width)
+ */
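+/*
+ * Rough scalar sketch of the 8-bit path implemented below (horizontal taps,
+ * hence EPEL_FILTER(src, 1)); names follow the generic C template in
+ * libavcodec/hevcdsp_template.c and are assumptions here:
+ *
+ *     for (y = 0; y < height; y++, dst += dststride, src += srcstride)
+ *         for (x = 0; x < width; x++)
+ *             dst[x] = av_clip_uint8((((EPEL_FILTER(src, 1) * wx)
+ *                                       + offset) >> shift) + ox);
+ */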
+function ff_hevc_put_hevc_epel_uni_w_h4_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ vreplvei.w vr0, vr0, 0
+ la.local t1, shufb
+ vld vr5, t1, 0
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ addi.d a2, a2, -1 //src -= 1
+.LOOP_UNI_W_H4:
+ fld.d f6, a2, 0
+ add.d a2, a2, a3
+ vshuf.b vr6, vr6, vr6, vr5
+ vdp2.h.bu.b vr7, vr6, vr0
+ vhaddw.w.h vr7, vr7, vr7
+ vmulwev.w.h vr7, vr7, vr1
+ vadd.w vr7, vr7, vr2
+ vsra.w vr7, vr7, vr3
+ vadd.w vr7, vr7, vr4
+ vssrani.h.w vr7, vr7, 0
+ vssrani.bu.h vr7, vr7, 0
+ fst.s f7, a0, 0
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_W_H4
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_h6_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ vreplvei.w vr0, vr0, 0
+ la.local t1, shufb
+ vld vr6, t1, 48
+ vaddi.bu vr7, vr6, 2
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ addi.d a2, a2, -1 //src -= 1
+ vreplvei.h vr5, vr0, 1
+ vreplvei.h vr0, vr0, 0
+.LOOP_UNI_W_H6:
+ vld vr8, a2, 0
+ add.d a2, a2, a3
+ vshuf.b vr10, vr8, vr8, vr6
+ vshuf.b vr11, vr8, vr8, vr7
+ CALC_EPEL_FILTER_LSX vr14, vr15
+ vssrani.h.w vr15, vr14, 0
+ vssrani.bu.h vr15, vr15, 0
+ fst.s f15, a0, 0
+ vstelm.h vr15, a0, 4, 2
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_W_H6
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_h6_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ xvreplve0.w xr0, xr0
+ la.local t1, shufb
+ xvld xr6, t1, 64
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ addi.d a2, a2, -1 //src -= 1
+.LOOP_UNI_W_H6_LASX:
+ vld vr8, a2, 0
+ xvreplve0.q xr8, xr8
+ add.d a2, a2, a3
+ xvshuf.b xr12, xr8, xr8, xr6
+ CALC_EPEL_FILTER_LASX xr14
+ xvpermi.q xr15, xr14, 0x01
+ vssrani.h.w vr15, vr14, 0
+ vssrani.bu.h vr15, vr15, 0
+ fst.s f15, a0, 0
+ vstelm.h vr15, a0, 4, 2
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_W_H6_LASX
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_h8_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ vreplvei.w vr0, vr0, 0
+ la.local t1, shufb
+ vld vr6, t1, 48
+ vaddi.bu vr7, vr6, 2
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ addi.d a2, a2, -1 //src -= 1
+ vreplvei.h vr5, vr0, 1
+ vreplvei.h vr0, vr0, 0
+.LOOP_UNI_W_H8:
+ vld vr8, a2, 0
+ add.d a2, a2, a3
+ vshuf.b vr10, vr8, vr8, vr6
+ vshuf.b vr11, vr8, vr8, vr7
+ CALC_EPEL_FILTER_LSX vr14, vr15
+ vssrani.h.w vr15, vr14, 0
+ vssrani.bu.h vr15, vr15, 0
+ fst.d f15, a0, 0
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_W_H8
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_h8_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ xvreplve0.w xr0, xr0
+ la.local t1, shufb
+ xvld xr6, t1, 64
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ addi.d a2, a2, -1 //src -= 1
+.LOOP_UNI_W_H8_LASX:
+ vld vr8, a2, 0
+ xvreplve0.q xr8, xr8
+ add.d a2, a2, a3
+ xvshuf.b xr12, xr8, xr8, xr6
+ CALC_EPEL_FILTER_LASX xr14
+ xvpermi.q xr15, xr14, 0x01
+ vssrani.h.w vr15, vr14, 0
+ vssrani.bu.h vr15, vr15, 0
+ fst.d f15, a0, 0
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_W_H8_LASX
+endfunc
+
+.macro EPEL_UNI_W_H16_LOOP_LSX idx0, idx1, idx2
+ vld vr8, a2, \idx0
+ vshuf.b vr10, vr8, vr8, vr6
+ vshuf.b vr11, vr8, vr8, vr7
+ CALC_EPEL_FILTER_LSX vr14, vr15
+ vld vr8, a2, \idx1
+ vshuf.b vr10, vr8, vr8, vr6
+ vshuf.b vr11, vr8, vr8, vr7
+ CALC_EPEL_FILTER_LSX vr16, vr17
+ vssrani.h.w vr15, vr14, 0
+ vssrani.h.w vr17, vr16, 0
+ vssrani.bu.h vr17, vr15, 0
+ vst vr17, a0, \idx2
+.endm
+
+.macro EPEL_UNI_W_H16_LOOP_LASX idx0, idx2, w
+ xvld xr8, a2, \idx0
+ xvpermi.d xr9, xr8, 0x09
+ xvreplve0.q xr8, xr8
+ xvshuf.b xr12, xr8, xr8, xr6
+ CALC_EPEL_FILTER_LASX xr14
+ xvreplve0.q xr8, xr9
+ xvshuf.b xr12, xr8, xr8, xr6
+ CALC_EPEL_FILTER_LASX xr16
+ xvssrani.h.w xr16, xr14, 0
+ xvpermi.q xr17, xr16, 0x01
+ vssrani.bu.h vr17, vr16, 0
+ vpermi.w vr17, vr17, 0xd8
+.if \w == 12
+ fst.d f17, a0, 0
+ vstelm.w vr17, a0, 8, 2
+.else
+ vst vr17, a0, \idx2
+.endif
+.endm
+
+function ff_hevc_put_hevc_epel_uni_w_h12_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ vreplvei.w vr0, vr0, 0
+ la.local t1, shufb
+ vld vr6, t1, 48
+ vaddi.bu vr7, vr6, 2
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ addi.d a2, a2, -1 //src -= 1
+ vreplvei.h vr5, vr0, 1
+ vreplvei.h vr0, vr0, 0
+.LOOP_UNI_W_H12:
+ vld vr8, a2, 0
+ vshuf.b vr10, vr8, vr8, vr6
+ vshuf.b vr11, vr8, vr8, vr7
+ CALC_EPEL_FILTER_LSX vr14, vr15
+ vld vr8, a2, 8
+ vshuf.b vr10, vr8, vr8, vr6
+ vshuf.b vr11, vr8, vr8, vr7
+ CALC_EPEL_FILTER_LSX vr16, vr17
+ vssrani.h.w vr15, vr14, 0
+ vssrani.h.w vr17, vr16, 0
+ vssrani.bu.h vr17, vr15, 0
+ fst.d f17, a0, 0
+ vstelm.w vr17, a0, 8, 2
+ add.d a2, a2, a3
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_W_H12
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_h12_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ xvreplve0.w xr0, xr0
+ la.local t1, shufb
+ xvld xr6, t1, 64
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ addi.d a2, a2, -1 //src -= 1
+.LOOP_UNI_W_H12_LASX:
+ EPEL_UNI_W_H16_LOOP_LASX 0, 0, 12
+ add.d a2, a2, a3
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_W_H12_LASX
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_h16_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ vreplvei.w vr0, vr0, 0
+ la.local t1, shufb
+ vld vr6, t1, 48
+ vaddi.bu vr7, vr6, 2
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ addi.d a2, a2, -1 //src -= 1
+ vreplvei.h vr5, vr0, 1
+ vreplvei.h vr0, vr0, 0
+.LOOP_UNI_W_H16:
+ EPEL_UNI_W_H16_LOOP_LSX 0, 8, 0
+ add.d a2, a2, a3
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_W_H16
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_h16_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ xvreplve0.w xr0, xr0
+ la.local t1, shufb
+ xvld xr6, t1, 64
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ addi.d a2, a2, -1 //src -= 1
+.LOOP_UNI_W_H16_LASX:
+ EPEL_UNI_W_H16_LOOP_LASX 0, 0, 16
+ add.d a2, a2, a3
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_W_H16_LASX
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_h24_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ vreplvei.w vr0, vr0, 0
+ la.local t1, shufb
+ vld vr6, t1, 48
+ vaddi.bu vr7, vr6, 2
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ addi.d a2, a2, -1 //src -= 1
+ vreplvei.h vr5, vr0, 1
+ vreplvei.h vr0, vr0, 0
+.LOOP_UNI_W_H24:
+ EPEL_UNI_W_H16_LOOP_LSX 0, 8, 0
+ vld vr8, a2, 16
+ add.d a2, a2, a3
+ vshuf.b vr10, vr8, vr8, vr6
+ vshuf.b vr11, vr8, vr8, vr7
+ CALC_EPEL_FILTER_LSX vr18, vr19
+ vssrani.h.w vr19, vr18, 0
+ vssrani.bu.h vr19, vr19, 0
+ fst.d f19, a0, 16
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_W_H24
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_h24_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ xvreplve0.w xr0, xr0
+ la.local t1, shufb
+ xvld xr6, t1, 64
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ addi.d a2, a2, -1 //src -= 1
+.LOOP_UNI_W_H24_LASX:
+ EPEL_UNI_W_H16_LOOP_LASX 0, 0, 24
+ vld vr8, a2, 16
+ add.d a2, a2, a3
+ xvreplve0.q xr8, xr8
+ xvshuf.b xr12, xr8, xr8, xr6
+ CALC_EPEL_FILTER_LASX xr14
+ xvpermi.q xr15, xr14, 0x01
+ vssrani.h.w vr15, vr14, 0
+ vssrani.bu.h vr15, vr15, 0
+ fst.d f15, a0, 16
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_W_H24_LASX
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_h32_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ vreplvei.w vr0, vr0, 0
+ la.local t1, shufb
+ vld vr6, t1, 48
+ vaddi.bu vr7, vr6, 2
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ addi.d a2, a2, -1 //src -= 1
+ vreplvei.h vr5, vr0, 1
+ vreplvei.h vr0, vr0, 0
+.LOOP_UNI_W_H32:
+ EPEL_UNI_W_H16_LOOP_LSX 0, 8, 0
+ EPEL_UNI_W_H16_LOOP_LSX 16, 24, 16
+ add.d a2, a2, a3
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_W_H32
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_h32_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ xvreplve0.w xr0, xr0
+ la.local t1, shufb
+ xvld xr6, t1, 64
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ addi.d a2, a2, -1 //src -= 1
+.LOOP_UNI_W_H32_LASX:
+ EPEL_UNI_W_H16_LOOP_LASX 0, 0, 32
+ EPEL_UNI_W_H16_LOOP_LASX 16, 16, 32
+ add.d a2, a2, a3
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_W_H32_LASX
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_h48_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ vreplvei.w vr0, vr0, 0
+ la.local t1, shufb
+ vld vr6, t1, 48
+ vaddi.bu vr7, vr6, 2
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ addi.d a2, a2, -1 //src -= 1
+ vreplvei.h vr5, vr0, 1
+ vreplvei.h vr0, vr0, 0
+.LOOP_UNI_W_H48:
+ EPEL_UNI_W_H16_LOOP_LSX 0, 8, 0
+ EPEL_UNI_W_H16_LOOP_LSX 16, 24, 16
+ EPEL_UNI_W_H16_LOOP_LSX 32, 40, 32
+ add.d a2, a2, a3
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_W_H48
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_h48_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ xvreplve0.w xr0, xr0
+ la.local t1, shufb
+ xvld xr6, t1, 64
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ addi.d a2, a2, -1 //src -= 1
+.LOOP_UNI_W_H48_LASX:
+ EPEL_UNI_W_H16_LOOP_LASX 0, 0, 48
+ EPEL_UNI_W_H16_LOOP_LASX 16, 16, 48
+ EPEL_UNI_W_H16_LOOP_LASX 32, 32, 48
+ add.d a2, a2, a3
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_W_H48_LASX
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_h64_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ vreplvei.w vr0, vr0, 0
+ la.local t1, shufb
+ vld vr6, t1, 48
+ vaddi.bu vr7, vr6, 2
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ addi.d a2, a2, -1 //src -= 1
+ vreplvei.h vr5, vr0, 1
+ vreplvei.h vr0, vr0, 0
+.LOOP_UNI_W_H64:
+ EPEL_UNI_W_H16_LOOP_LSX 0, 8, 0
+ EPEL_UNI_W_H16_LOOP_LSX 16, 24, 16
+ EPEL_UNI_W_H16_LOOP_LSX 32, 40, 32
+ EPEL_UNI_W_H16_LOOP_LSX 48, 56, 48
+ add.d a2, a2, a3
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_W_H64
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_h64_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ xvreplve0.w xr0, xr0
+ la.local t1, shufb
+ xvld xr6, t1, 64
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ addi.d a2, a2, -1 //src -= 1
+.LOOP_UNI_W_H64_LASX:
+ EPEL_UNI_W_H16_LOOP_LASX 0, 0, 64
+ EPEL_UNI_W_H16_LOOP_LASX 16, 16, 64
+ EPEL_UNI_W_H16_LOOP_LASX 32, 32, 64
+ EPEL_UNI_W_H16_LOOP_LASX 48, 48, 64
+ add.d a2, a2, a3
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_W_H64_LASX
+endfunc
+
+/*
+ * void FUNC(put_hevc_epel_bi_h)(uint8_t *_dst, ptrdiff_t _dststride,
+ * const uint8_t *_src, ptrdiff_t _srcstride,
+ * const int16_t *src2, int height, intptr_t mx,
+ * intptr_t my, int width)
+ */
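+/*
+ * Rough scalar sketch of the 8-bit path implemented below; src2 is the 16-bit
+ * intermediate buffer with a fixed stride of MAX_PB_SIZE elements (hence the
+ * "addi.d a4, a4, 128" per row), and the final vssrarni by 7 is assumed to
+ * correspond to the rounding shift of the generic C template:
+ *
+ *     for (y = 0; y < height; y++, dst += dststride,
+ *          src += srcstride, src2 += MAX_PB_SIZE)
+ *         for (x = 0; x < width; x++)
+ *             dst[x] = av_clip_uint8((EPEL_FILTER(src, 1) + src2[x] + 64) >> 7);
+ */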
+function ff_hevc_put_hevc_bi_epel_h4_8_lsx
+ addi.d a6, a6, -1
+ slli.w a6, a6, 2
+ la.local t0, ff_hevc_epel_filters
+ vldx vr0, t0, a6 // filter
+ vreplvei.w vr0, vr0, 0
+ la.local t0, shufb
+ vld vr1, t0, 0 // mask
+ addi.d a2, a2, -1 // src -= 1
+.LOOP_BI_EPEL_H4:
+ vld vr4, a4, 0 // src2
+ vld vr5, a2, 0
+ add.d a2, a2, a3
+ addi.d a4, a4, 128
+ vshuf.b vr5, vr5, vr5, vr1
+ vdp2.h.bu.b vr6, vr5, vr0 // EPEL_FILTER(src, 1)
+ vsllwil.w.h vr4, vr4, 0
+ vhaddw.w.h vr6, vr6, vr6
+ vadd.w vr6, vr6, vr4 // src2[x]
+ vssrani.h.w vr6, vr6, 0
+ vssrarni.bu.h vr6, vr6, 7
+ fst.s f6, a0, 0
+ add.d a0, a0, a1
+ addi.d a5, a5, -1
+ bnez a5, .LOOP_BI_EPEL_H4
+endfunc
+
+.macro PUT_HEVC_BI_EPEL_H8_LSX in0, in1, in2, in3, out0
+ vshuf.b vr6, \in1, \in0, \in2
+ vshuf.b vr7, \in1, \in0, \in3
+ vdp2.h.bu.b vr8, vr6, vr0 // EPEL_FILTER(src, 1)
+ vdp2add.h.bu.b vr8, vr7, vr1 // EPEL_FILTER(src, 1)
+ vsadd.h \out0, vr8, vr4 // src2[x]
+.endm
+
+.macro PUT_HEVC_BI_EPEL_H16_LASX in0, in1, in2, in3, out0
+ xvshuf.b xr6, \in1, \in0, \in2
+ xvshuf.b xr7, \in1, \in0, \in3
+ xvdp2.h.bu.b xr8, xr6, xr0 // EPEL_FILTER(src, 1)
+ xvdp2add.h.bu.b xr8, xr7, xr1 // EPEL_FILTER(src, 1)
+ xvsadd.h \out0, xr8, xr4 // src2[x]
+.endm
+
+function ff_hevc_put_hevc_bi_epel_h6_8_lsx
+ addi.d a6, a6, -1
+ slli.w a6, a6, 2
+ la.local t0, ff_hevc_epel_filters
+ vldx vr0, t0, a6 // filter
+ vreplvei.h vr1, vr0, 1
+ vreplvei.h vr0, vr0, 0
+ la.local t0, shufb
+ vld vr2, t0, 48 // mask
+ vaddi.bu vr3, vr2, 2
+ addi.d a2, a2, -1 // src -= 1
+.LOOP_BI_EPEL_H6:
+ vld vr4, a4, 0 // src2
+ vld vr5, a2, 0
+ add.d a2, a2, a3
+ addi.d a4, a4, 128
+ PUT_HEVC_BI_EPEL_H8_LSX vr5, vr5, vr2, vr3, vr7
+ vssrarni.bu.h vr7, vr7, 7
+ fst.s f7, a0, 0
+ vstelm.h vr7, a0, 4, 2
+ add.d a0, a0, a1
+ addi.d a5, a5, -1
+ bnez a5, .LOOP_BI_EPEL_H6
+endfunc
+
+function ff_hevc_put_hevc_bi_epel_h8_8_lsx
+ addi.d a6, a6, -1
+ slli.w a6, a6, 2
+ la.local t0, ff_hevc_epel_filters
+ vldx vr0, t0, a6 // filter
+ vreplvei.h vr1, vr0, 1
+ vreplvei.h vr0, vr0, 0
+ la.local t0, shufb
+ vld vr2, t0, 48 // mask
+ vaddi.bu vr3, vr2, 2
+ addi.d a2, a2, -1 // src -= 1
+.LOOP_BI_EPEL_H8:
+ vld vr4, a4, 0 // src2
+ vld vr5, a2, 0
+ add.d a2, a2, a3
+ addi.d a4, a4, 128
+ PUT_HEVC_BI_EPEL_H8_LSX vr5, vr5, vr2, vr3, vr7
+ vssrarni.bu.h vr7, vr7, 7
+ fst.d f7, a0, 0
+ add.d a0, a0, a1
+ addi.d a5, a5, -1
+ bnez a5, .LOOP_BI_EPEL_H8
+endfunc
+
+function ff_hevc_put_hevc_bi_epel_h12_8_lsx
+ addi.d a6, a6, -1
+ slli.w a6, a6, 2
+ la.local t0, ff_hevc_epel_filters
+ vldx vr0, t0, a6 // filter
+ vreplvei.h vr1, vr0, 1
+ vreplvei.h vr0, vr0, 0
+ la.local t0, shufb
+ vld vr2, t0, 48 // mask
+ vaddi.bu vr3, vr2, 2
+ addi.d a2, a2, -1 // src -= 1
+.LOOP_BI_EPEL_H12:
+ vld vr4, a4, 0 // src2
+ vld vr5, a2, 0
+ PUT_HEVC_BI_EPEL_H8_LSX vr5, vr5, vr2, vr3, vr11
+ vld vr5, a2, 8
+ vld vr4, a4, 16
+ PUT_HEVC_BI_EPEL_H8_LSX vr5, vr5, vr2, vr3, vr12
+ vssrarni.bu.h vr12, vr11, 7
+ fst.d f12, a0, 0
+ vstelm.w vr12, a0, 8, 2
+ add.d a2, a2, a3
+ addi.d a4, a4, 128
+ add.d a0, a0, a1
+ addi.d a5, a5, -1
+ bnez a5, .LOOP_BI_EPEL_H12
+endfunc
+
+function ff_hevc_put_hevc_bi_epel_h12_8_lasx
+ addi.d a6, a6, -1
+ slli.w a6, a6, 2
+ la.local t0, ff_hevc_epel_filters
+ vldx vr0, t0, a6 // filter
+ xvreplve0.q xr0, xr0
+ xvrepl128vei.h xr1, xr0, 1
+ xvrepl128vei.h xr0, xr0, 0
+ la.local t0, shufb
+ xvld xr2, t0, 96 // mask
+ xvaddi.bu xr3, xr2, 2
+ addi.d a2, a2, -1 // src -= 1
+.LOOP_BI_EPEL_H12_LASX:
+ xvld xr4, a4, 0 // src2
+ xvld xr5, a2, 0
+ xvpermi.d xr5, xr5, 0x94
+ PUT_HEVC_BI_EPEL_H16_LASX xr5, xr5, xr2, xr3, xr9
+ xvpermi.q xr10, xr9, 0x01
+ vssrarni.bu.h vr10, vr9, 7
+ fst.d f10, a0, 0
+ vstelm.w vr10, a0, 8, 2
+ add.d a2, a2, a3
+ addi.d a4, a4, 128
+ add.d a0, a0, a1
+ addi.d a5, a5, -1
+ bnez a5, .LOOP_BI_EPEL_H12_LASX
+endfunc
+
+function ff_hevc_put_hevc_bi_epel_h16_8_lsx
+ addi.d a6, a6, -1
+ slli.w a6, a6, 2
+ la.local t0, ff_hevc_epel_filters
+ vldx vr0, t0, a6 // filter
+ vreplvei.h vr1, vr0, 1
+ vreplvei.h vr0, vr0, 0
+ la.local t0, shufb
+ vld vr2, t0, 48 // mask
+ vaddi.bu vr3, vr2, 2
+ addi.d a2, a2, -1 // src -= 1
+.LOOP_BI_EPEL_H16:
+ vld vr4, a4, 0 // src2
+ vld vr5, a2, 0
+ PUT_HEVC_BI_EPEL_H8_LSX vr5, vr5, vr2, vr3, vr11
+ vld vr5, a2, 8
+ vld vr4, a4, 16
+ PUT_HEVC_BI_EPEL_H8_LSX vr5, vr5, vr2, vr3, vr12
+ vssrarni.bu.h vr12, vr11, 7
+ vst vr12, a0, 0
+ add.d a2, a2, a3
+ addi.d a4, a4, 128
+ add.d a0, a0, a1
+ addi.d a5, a5, -1
+ bnez a5, .LOOP_BI_EPEL_H16
+endfunc
+
+function ff_hevc_put_hevc_bi_epel_h16_8_lasx
+ addi.d a6, a6, -1
+ slli.w a6, a6, 2
+ la.local t0, ff_hevc_epel_filters
+ vldx vr0, t0, a6 // filter
+ xvreplve0.q xr0, xr0
+ xvrepl128vei.h xr1, xr0, 1
+ xvrepl128vei.h xr0, xr0, 0
+ la.local t0, shufb
+ xvld xr2, t0, 96 // mask
+ xvaddi.bu xr3, xr2, 2
+ addi.d a2, a2, -1 // src -= 1
+.LOOP_BI_EPEL_H16_LASX:
+ xvld xr4, a4, 0 // src2
+ xvld xr5, a2, 0
+ xvpermi.d xr5, xr5, 0x94
+ PUT_HEVC_BI_EPEL_H16_LASX xr5, xr5, xr2, xr3, xr9
+ xvpermi.q xr10, xr9, 0x01
+ vssrarni.bu.h vr10, vr9, 7
+ vst vr10, a0, 0
+ add.d a2, a2, a3
+ addi.d a4, a4, 128
+ add.d a0, a0, a1
+ addi.d a5, a5, -1
+ bnez a5, .LOOP_BI_EPEL_H16_LASX
+endfunc
+
+function ff_hevc_put_hevc_bi_epel_h32_8_lasx
+ addi.d a6, a6, -1
+ slli.w a6, a6, 2
+ la.local t0, ff_hevc_epel_filters
+ vldx vr0, t0, a6 // filter
+ xvreplve0.q xr0, xr0
+ xvrepl128vei.h xr1, xr0, 1
+ xvrepl128vei.h xr0, xr0, 0
+ la.local t0, shufb
+ xvld xr2, t0, 96 // mask
+ xvaddi.bu xr3, xr2, 2
+ addi.d a2, a2, -1 // src -= 1
+.LOOP_BI_EPEL_H32_LASX:
+ xvld xr4, a4, 0 // src2
+ xvld xr5, a2, 0
+ xvpermi.q xr15, xr5, 0x01
+ xvpermi.d xr5, xr5, 0x94
+ PUT_HEVC_BI_EPEL_H16_LASX xr5, xr5, xr2, xr3, xr9
+ xvld xr4, a4, 32
+ xvld xr15, a2, 16
+ xvpermi.d xr15, xr15, 0x94
+ PUT_HEVC_BI_EPEL_H16_LASX xr15, xr15, xr2, xr3, xr11
+ xvssrarni.bu.h xr11, xr9, 7
+ xvpermi.d xr11, xr11, 0xd8
+ xvst xr11, a0, 0
+ add.d a2, a2, a3
+ addi.d a4, a4, 128
+ add.d a0, a0, a1
+ addi.d a5, a5, -1
+ bnez a5, .LOOP_BI_EPEL_H32_LASX
+endfunc
+
+function ff_hevc_put_hevc_bi_epel_h48_8_lsx
+ addi.d a6, a6, -1
+ slli.w a6, a6, 2
+ la.local t0, ff_hevc_epel_filters
+ vldx vr0, t0, a6 // filter
+ vreplvei.h vr1, vr0, 1
+ vreplvei.h vr0, vr0, 0
+ la.local t0, shufb
+ vld vr2, t0, 48 // mask
+ vaddi.bu vr3, vr2, 2
+ vaddi.bu vr21, vr2, 8
+ vaddi.bu vr22, vr2, 10
+ addi.d a2, a2, -1 // src -= 1
+.LOOP_BI_EPEL_H48:
+ vld vr4, a4, 0 // src2
+ vld vr5, a2, 0
+ vld vr9, a2, 16
+ vld vr10, a2, 32
+ vld vr11, a2, 48
+ PUT_HEVC_BI_EPEL_H8_LSX vr5, vr5, vr2, vr3, vr12
+ vld vr4, a4, 16
+ PUT_HEVC_BI_EPEL_H8_LSX vr5, vr9, vr21, vr22, vr13
+ vld vr4, a4, 32
+ PUT_HEVC_BI_EPEL_H8_LSX vr9, vr9, vr2, vr3, vr14
+ vld vr4, a4, 48
+ PUT_HEVC_BI_EPEL_H8_LSX vr9, vr10, vr21, vr22, vr15
+ vld vr4, a4, 64
+ PUT_HEVC_BI_EPEL_H8_LSX vr10, vr10, vr2, vr3, vr16
+ vld vr4, a4, 80
+ PUT_HEVC_BI_EPEL_H8_LSX vr10, vr11, vr21, vr22, vr17
+ vssrarni.bu.h vr13, vr12, 7
+ vssrarni.bu.h vr15, vr14, 7
+ vssrarni.bu.h vr17, vr16, 7
+ vst vr13, a0, 0
+ vst vr15, a0, 16
+ vst vr17, a0, 32
+ add.d a2, a2, a3
+ addi.d a4, a4, 128
+ add.d a0, a0, a1
+ addi.d a5, a5, -1
+ bnez a5, .LOOP_BI_EPEL_H48
+endfunc
+
+function ff_hevc_put_hevc_bi_epel_h48_8_lasx
+ addi.d a6, a6, -1
+ slli.w a6, a6, 2
+ la.local t0, ff_hevc_epel_filters
+ vldx vr0, t0, a6 // filter
+ xvreplve0.q xr0, xr0
+ xvrepl128vei.h xr1, xr0, 1
+ xvrepl128vei.h xr0, xr0, 0
+ la.local t0, shufb
+ xvld xr2, t0, 96 // mask
+ xvaddi.bu xr3, xr2, 2
+ addi.d a2, a2, -1 // src -= 1
+.LOOP_BI_EPEL_H48_LASX:
+ xvld xr4, a4, 0 // src2
+ xvld xr5, a2, 0
+ xvld xr9, a2, 32
+ xvpermi.d xr10, xr9, 0x94
+ xvpermi.q xr9, xr5, 0x21
+ xvpermi.d xr9, xr9, 0x94
+ xvpermi.d xr5, xr5, 0x94
+ PUT_HEVC_BI_EPEL_H16_LASX xr5, xr5, xr2, xr3, xr11
+ xvld xr4, a4, 32
+ PUT_HEVC_BI_EPEL_H16_LASX xr9, xr9, xr2, xr3, xr12
+ xvld xr4, a4, 64
+ PUT_HEVC_BI_EPEL_H16_LASX xr10, xr10, xr2, xr3, xr13
+ xvssrarni.bu.h xr12, xr11, 7
+ xvpermi.d xr12, xr12, 0xd8
+ xvpermi.q xr14, xr13, 0x01
+ vssrarni.bu.h vr14, vr13, 7
+ xvst xr12, a0, 0
+ vst vr14, a0, 32
+ add.d a2, a2, a3
+ addi.d a4, a4, 128
+ add.d a0, a0, a1
+ addi.d a5, a5, -1
+ bnez a5, .LOOP_BI_EPEL_H48_LASX
+endfunc
+
+function ff_hevc_put_hevc_bi_epel_h64_8_lsx
+ addi.d a6, a6, -1
+ slli.w a6, a6, 2
+ la.local t0, ff_hevc_epel_filters
+ vldx vr0, t0, a6 // filter
+ vreplvei.h vr1, vr0, 1
+ vreplvei.h vr0, vr0, 0
+ la.local t0, shufb
+ vld vr2, t0, 48 // mask
+ vaddi.bu vr3, vr2, 2
+ vaddi.bu vr21, vr2, 8
+ vaddi.bu vr22, vr2, 10
+ addi.d a2, a2, -1 // src -= 1
+.LOOP_BI_EPEL_H64:
+ vld vr4, a4, 0 // src2
+ vld vr5, a2, 0
+ vld vr9, a2, 16
+ vld vr10, a2, 32
+ vld vr11, a2, 48
+ vld vr12, a2, 64
+ PUT_HEVC_BI_EPEL_H8_LSX vr5, vr5, vr2, vr3, vr13
+ vld vr4, a4, 16
+ PUT_HEVC_BI_EPEL_H8_LSX vr5, vr9, vr21, vr22, vr14
+ vld vr4, a4, 32
+ PUT_HEVC_BI_EPEL_H8_LSX vr9, vr9, vr2, vr3, vr15
+ vld vr4, a4, 48
+ PUT_HEVC_BI_EPEL_H8_LSX vr9, vr10, vr21, vr22, vr16
+ vld vr4, a4, 64
+ PUT_HEVC_BI_EPEL_H8_LSX vr10, vr10, vr2, vr3, vr17
+ vld vr4, a4, 80
+ PUT_HEVC_BI_EPEL_H8_LSX vr10, vr11, vr21, vr22, vr18
+ vld vr4, a4, 96
+ PUT_HEVC_BI_EPEL_H8_LSX vr11, vr11, vr2, vr3, vr19
+ vld vr4, a4, 112
+ PUT_HEVC_BI_EPEL_H8_LSX vr11, vr12, vr21, vr22, vr20
+ vssrarni.bu.h vr14, vr13, 7
+ vssrarni.bu.h vr16, vr15, 7
+ vssrarni.bu.h vr18, vr17, 7
+ vssrarni.bu.h vr20, vr19, 7
+ vst vr14, a0, 0
+ vst vr16, a0, 16
+ vst vr18, a0, 32
+ vst vr20, a0, 48
+ add.d a2, a2, a3
+ addi.d a4, a4, 128
+ add.d a0, a0, a1
+ addi.d a5, a5, -1
+ bnez a5, .LOOP_BI_EPEL_H64
+endfunc
+
+function ff_hevc_put_hevc_bi_epel_h64_8_lasx
+ addi.d a6, a6, -1
+ slli.w a6, a6, 2
+ la.local t0, ff_hevc_epel_filters
+ vldx vr0, t0, a6 // filter
+ xvreplve0.q xr0, xr0
+ xvrepl128vei.h xr1, xr0, 1
+ xvrepl128vei.h xr0, xr0, 0
+ la.local t0, shufb
+ xvld xr2, t0, 96 // mask
+ xvaddi.bu xr3, xr2, 2
+ addi.d a2, a2, -1 // src -= 1
+.LOOP_BI_EPEL_H64_LASX:
+ xvld xr4, a4, 0 // src2
+ xvld xr5, a2, 0
+ xvld xr9, a2, 32
+ xvld xr11, a2, 48
+ xvpermi.d xr11, xr11, 0x94
+ xvpermi.d xr10, xr9, 0x94
+ xvpermi.q xr9, xr5, 0x21
+ xvpermi.d xr9, xr9, 0x94
+ xvpermi.d xr5, xr5, 0x94
+ PUT_HEVC_BI_EPEL_H16_LASX xr5, xr5, xr2, xr3, xr12
+ xvld xr4, a4, 32
+ PUT_HEVC_BI_EPEL_H16_LASX xr9, xr9, xr2, xr3, xr13
+ xvld xr4, a4, 64
+ PUT_HEVC_BI_EPEL_H16_LASX xr10, xr10, xr2, xr3, xr14
+ xvld xr4, a4, 96
+ PUT_HEVC_BI_EPEL_H16_LASX xr11, xr11, xr2, xr3, xr15
+ xvssrarni.bu.h xr13, xr12, 7
+ xvssrarni.bu.h xr15, xr14, 7
+ xvpermi.d xr13, xr13, 0xd8
+ xvpermi.d xr15, xr15, 0xd8
+ xvst xr13, a0, 0
+ xvst xr15, a0, 32
+ add.d a2, a2, a3
+ addi.d a4, a4, 128
+ add.d a0, a0, a1
+ addi.d a5, a5, -1
+ bnez a5, .LOOP_BI_EPEL_H64_LASX
+endfunc
diff --git a/libavcodec/loongarch/hevcdsp_init_loongarch.c b/libavcodec/loongarch/hevcdsp_init_loongarch.c
index 245a833947..2756755733 100644
--- a/libavcodec/loongarch/hevcdsp_init_loongarch.c
+++ b/libavcodec/loongarch/hevcdsp_init_loongarch.c
@@ -124,8 +124,15 @@ void ff_hevc_dsp_init_loongarch(HEVCDSPContext *c, const int bit_depth)
c->put_hevc_qpel_bi[8][0][1] = ff_hevc_put_hevc_bi_qpel_h48_8_lsx;
c->put_hevc_qpel_bi[9][0][1] = ff_hevc_put_hevc_bi_qpel_h64_8_lsx;
+ c->put_hevc_epel_bi[1][0][1] = ff_hevc_put_hevc_bi_epel_h4_8_lsx;
+ c->put_hevc_epel_bi[2][0][1] = ff_hevc_put_hevc_bi_epel_h6_8_lsx;
+ c->put_hevc_epel_bi[3][0][1] = ff_hevc_put_hevc_bi_epel_h8_8_lsx;
+ c->put_hevc_epel_bi[4][0][1] = ff_hevc_put_hevc_bi_epel_h12_8_lsx;
+ c->put_hevc_epel_bi[5][0][1] = ff_hevc_put_hevc_bi_epel_h16_8_lsx;
c->put_hevc_epel_bi[6][0][1] = ff_hevc_put_hevc_bi_epel_h24_8_lsx;
c->put_hevc_epel_bi[7][0][1] = ff_hevc_put_hevc_bi_epel_h32_8_lsx;
+ c->put_hevc_epel_bi[8][0][1] = ff_hevc_put_hevc_bi_epel_h48_8_lsx;
+ c->put_hevc_epel_bi[9][0][1] = ff_hevc_put_hevc_bi_epel_h64_8_lsx;
c->put_hevc_epel_bi[4][1][0] = ff_hevc_put_hevc_bi_epel_v12_8_lsx;
c->put_hevc_epel_bi[5][1][0] = ff_hevc_put_hevc_bi_epel_v16_8_lsx;
@@ -138,6 +145,14 @@ void ff_hevc_dsp_init_loongarch(HEVCDSPContext *c, const int bit_depth)
c->put_hevc_epel_bi[6][1][1] = ff_hevc_put_hevc_bi_epel_hv24_8_lsx;
c->put_hevc_epel_bi[7][1][1] = ff_hevc_put_hevc_bi_epel_hv32_8_lsx;
+ c->put_hevc_qpel_uni[1][0][1] = ff_hevc_put_hevc_uni_qpel_h4_8_lsx;
+ c->put_hevc_qpel_uni[2][0][1] = ff_hevc_put_hevc_uni_qpel_h6_8_lsx;
+ c->put_hevc_qpel_uni[3][0][1] = ff_hevc_put_hevc_uni_qpel_h8_8_lsx;
+ c->put_hevc_qpel_uni[4][0][1] = ff_hevc_put_hevc_uni_qpel_h12_8_lsx;
+ c->put_hevc_qpel_uni[5][0][1] = ff_hevc_put_hevc_uni_qpel_h16_8_lsx;
+ c->put_hevc_qpel_uni[6][0][1] = ff_hevc_put_hevc_uni_qpel_h24_8_lsx;
+ c->put_hevc_qpel_uni[7][0][1] = ff_hevc_put_hevc_uni_qpel_h32_8_lsx;
+ c->put_hevc_qpel_uni[8][0][1] = ff_hevc_put_hevc_uni_qpel_h48_8_lsx;
c->put_hevc_qpel_uni[9][0][1] = ff_hevc_put_hevc_uni_qpel_h64_8_lsx;
c->put_hevc_qpel_uni[6][1][0] = ff_hevc_put_hevc_uni_qpel_v24_8_lsx;
@@ -191,6 +206,26 @@ void ff_hevc_dsp_init_loongarch(HEVCDSPContext *c, const int bit_depth)
c->put_hevc_epel_uni_w[8][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels48_8_lsx;
c->put_hevc_epel_uni_w[9][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels64_8_lsx;
+ c->put_hevc_epel_uni_w[1][0][1] = ff_hevc_put_hevc_epel_uni_w_h4_8_lsx;
+ c->put_hevc_epel_uni_w[2][0][1] = ff_hevc_put_hevc_epel_uni_w_h6_8_lsx;
+ c->put_hevc_epel_uni_w[3][0][1] = ff_hevc_put_hevc_epel_uni_w_h8_8_lsx;
+ c->put_hevc_epel_uni_w[4][0][1] = ff_hevc_put_hevc_epel_uni_w_h12_8_lsx;
+ c->put_hevc_epel_uni_w[5][0][1] = ff_hevc_put_hevc_epel_uni_w_h16_8_lsx;
+ c->put_hevc_epel_uni_w[6][0][1] = ff_hevc_put_hevc_epel_uni_w_h24_8_lsx;
+ c->put_hevc_epel_uni_w[7][0][1] = ff_hevc_put_hevc_epel_uni_w_h32_8_lsx;
+ c->put_hevc_epel_uni_w[8][0][1] = ff_hevc_put_hevc_epel_uni_w_h48_8_lsx;
+ c->put_hevc_epel_uni_w[9][0][1] = ff_hevc_put_hevc_epel_uni_w_h64_8_lsx;
+
+ c->put_hevc_epel_uni_w[1][1][0] = ff_hevc_put_hevc_epel_uni_w_v4_8_lsx;
+ c->put_hevc_epel_uni_w[2][1][0] = ff_hevc_put_hevc_epel_uni_w_v6_8_lsx;
+ c->put_hevc_epel_uni_w[3][1][0] = ff_hevc_put_hevc_epel_uni_w_v8_8_lsx;
+ c->put_hevc_epel_uni_w[4][1][0] = ff_hevc_put_hevc_epel_uni_w_v12_8_lsx;
+ c->put_hevc_epel_uni_w[5][1][0] = ff_hevc_put_hevc_epel_uni_w_v16_8_lsx;
+ c->put_hevc_epel_uni_w[6][1][0] = ff_hevc_put_hevc_epel_uni_w_v24_8_lsx;
+ c->put_hevc_epel_uni_w[7][1][0] = ff_hevc_put_hevc_epel_uni_w_v32_8_lsx;
+ c->put_hevc_epel_uni_w[8][1][0] = ff_hevc_put_hevc_epel_uni_w_v48_8_lsx;
+ c->put_hevc_epel_uni_w[9][1][0] = ff_hevc_put_hevc_epel_uni_w_v64_8_lsx;
+
c->put_hevc_qpel_uni_w[3][1][1] = ff_hevc_put_hevc_uni_w_qpel_hv8_8_lsx;
c->put_hevc_qpel_uni_w[5][1][1] = ff_hevc_put_hevc_uni_w_qpel_hv16_8_lsx;
c->put_hevc_qpel_uni_w[6][1][1] = ff_hevc_put_hevc_uni_w_qpel_hv24_8_lsx;
@@ -277,6 +312,15 @@ void ff_hevc_dsp_init_loongarch(HEVCDSPContext *c, const int bit_depth)
c->put_hevc_epel_uni_w[8][1][1] = ff_hevc_put_hevc_epel_uni_w_hv48_8_lasx;
c->put_hevc_epel_uni_w[9][1][1] = ff_hevc_put_hevc_epel_uni_w_hv64_8_lasx;
+ c->put_hevc_epel_uni_w[2][0][1] = ff_hevc_put_hevc_epel_uni_w_h6_8_lasx;
+ c->put_hevc_epel_uni_w[3][0][1] = ff_hevc_put_hevc_epel_uni_w_h8_8_lasx;
+ c->put_hevc_epel_uni_w[4][0][1] = ff_hevc_put_hevc_epel_uni_w_h12_8_lasx;
+ c->put_hevc_epel_uni_w[5][0][1] = ff_hevc_put_hevc_epel_uni_w_h16_8_lasx;
+ c->put_hevc_epel_uni_w[6][0][1] = ff_hevc_put_hevc_epel_uni_w_h24_8_lasx;
+ c->put_hevc_epel_uni_w[7][0][1] = ff_hevc_put_hevc_epel_uni_w_h32_8_lasx;
+ c->put_hevc_epel_uni_w[8][0][1] = ff_hevc_put_hevc_epel_uni_w_h48_8_lasx;
+ c->put_hevc_epel_uni_w[9][0][1] = ff_hevc_put_hevc_epel_uni_w_h64_8_lasx;
+
c->put_hevc_qpel_uni_w[3][1][0] = ff_hevc_put_hevc_qpel_uni_w_v8_8_lasx;
c->put_hevc_qpel_uni_w[4][1][0] = ff_hevc_put_hevc_qpel_uni_w_v12_8_lasx;
c->put_hevc_qpel_uni_w[5][1][0] = ff_hevc_put_hevc_qpel_uni_w_v16_8_lasx;
@@ -285,6 +329,15 @@ void ff_hevc_dsp_init_loongarch(HEVCDSPContext *c, const int bit_depth)
c->put_hevc_qpel_uni_w[8][1][0] = ff_hevc_put_hevc_qpel_uni_w_v48_8_lasx;
c->put_hevc_qpel_uni_w[9][1][0] = ff_hevc_put_hevc_qpel_uni_w_v64_8_lasx;
+ c->put_hevc_epel_uni_w[2][1][0] = ff_hevc_put_hevc_epel_uni_w_v6_8_lasx;
+ c->put_hevc_epel_uni_w[3][1][0] = ff_hevc_put_hevc_epel_uni_w_v8_8_lasx;
+ c->put_hevc_epel_uni_w[4][1][0] = ff_hevc_put_hevc_epel_uni_w_v12_8_lasx;
+ c->put_hevc_epel_uni_w[5][1][0] = ff_hevc_put_hevc_epel_uni_w_v16_8_lasx;
+ c->put_hevc_epel_uni_w[6][1][0] = ff_hevc_put_hevc_epel_uni_w_v24_8_lasx;
+ c->put_hevc_epel_uni_w[7][1][0] = ff_hevc_put_hevc_epel_uni_w_v32_8_lasx;
+ c->put_hevc_epel_uni_w[8][1][0] = ff_hevc_put_hevc_epel_uni_w_v48_8_lasx;
+ c->put_hevc_epel_uni_w[9][1][0] = ff_hevc_put_hevc_epel_uni_w_v64_8_lasx;
+
c->put_hevc_qpel_uni_w[1][0][1] = ff_hevc_put_hevc_qpel_uni_w_h4_8_lasx;
c->put_hevc_qpel_uni_w[2][0][1] = ff_hevc_put_hevc_qpel_uni_w_h6_8_lasx;
c->put_hevc_qpel_uni_w[3][0][1] = ff_hevc_put_hevc_qpel_uni_w_h8_8_lasx;
@@ -294,6 +347,19 @@ void ff_hevc_dsp_init_loongarch(HEVCDSPContext *c, const int bit_depth)
c->put_hevc_qpel_uni_w[7][0][1] = ff_hevc_put_hevc_qpel_uni_w_h32_8_lasx;
c->put_hevc_qpel_uni_w[8][0][1] = ff_hevc_put_hevc_qpel_uni_w_h48_8_lasx;
c->put_hevc_qpel_uni_w[9][0][1] = ff_hevc_put_hevc_qpel_uni_w_h64_8_lasx;
+
+ c->put_hevc_qpel_uni[4][0][1] = ff_hevc_put_hevc_uni_qpel_h12_8_lasx;
+ c->put_hevc_qpel_uni[5][0][1] = ff_hevc_put_hevc_uni_qpel_h16_8_lasx;
+ c->put_hevc_qpel_uni[6][0][1] = ff_hevc_put_hevc_uni_qpel_h24_8_lasx;
+ c->put_hevc_qpel_uni[7][0][1] = ff_hevc_put_hevc_uni_qpel_h32_8_lasx;
+ c->put_hevc_qpel_uni[8][0][1] = ff_hevc_put_hevc_uni_qpel_h48_8_lasx;
+ c->put_hevc_qpel_uni[9][0][1] = ff_hevc_put_hevc_uni_qpel_h64_8_lasx;
+
+ c->put_hevc_epel_bi[4][0][1] = ff_hevc_put_hevc_bi_epel_h12_8_lasx;
+ c->put_hevc_epel_bi[5][0][1] = ff_hevc_put_hevc_bi_epel_h16_8_lasx;
+ c->put_hevc_epel_bi[7][0][1] = ff_hevc_put_hevc_bi_epel_h32_8_lasx;
+ c->put_hevc_epel_bi[8][0][1] = ff_hevc_put_hevc_bi_epel_h48_8_lasx;
+ c->put_hevc_epel_bi[9][0][1] = ff_hevc_put_hevc_bi_epel_h64_8_lasx;
}
}
}
diff --git a/libavcodec/loongarch/hevcdsp_lasx.h b/libavcodec/loongarch/hevcdsp_lasx.h
index 7f09d0943a..5db35eed47 100644
--- a/libavcodec/loongarch/hevcdsp_lasx.h
+++ b/libavcodec/loongarch/hevcdsp_lasx.h
@@ -75,6 +75,60 @@ PEL_UNI_W(epel, hv, 32);
PEL_UNI_W(epel, hv, 48);
PEL_UNI_W(epel, hv, 64);
+PEL_UNI_W(epel, v, 6);
+PEL_UNI_W(epel, v, 8);
+PEL_UNI_W(epel, v, 12);
+PEL_UNI_W(epel, v, 16);
+PEL_UNI_W(epel, v, 24);
+PEL_UNI_W(epel, v, 32);
+PEL_UNI_W(epel, v, 48);
+PEL_UNI_W(epel, v, 64);
+
+PEL_UNI_W(epel, h, 6);
+PEL_UNI_W(epel, h, 8);
+PEL_UNI_W(epel, h, 12);
+PEL_UNI_W(epel, h, 16);
+PEL_UNI_W(epel, h, 24);
+PEL_UNI_W(epel, h, 32);
+PEL_UNI_W(epel, h, 48);
+PEL_UNI_W(epel, h, 64);
+
#undef PEL_UNI_W
+#define UNI_MC(PEL, DIR, WIDTH) \
+void ff_hevc_put_hevc_uni_##PEL##_##DIR##WIDTH##_8_lasx(uint8_t *dst, \
+ ptrdiff_t dst_stride, \
+ const uint8_t *src, \
+ ptrdiff_t src_stride, \
+ int height, \
+ intptr_t mx, \
+ intptr_t my, \
+ int width)
+UNI_MC(qpel, h, 12);
+UNI_MC(qpel, h, 16);
+UNI_MC(qpel, h, 24);
+UNI_MC(qpel, h, 32);
+UNI_MC(qpel, h, 48);
+UNI_MC(qpel, h, 64);
+
+#undef UNI_MC
+
+#define BI_MC(PEL, DIR, WIDTH) \
+void ff_hevc_put_hevc_bi_##PEL##_##DIR##WIDTH##_8_lasx(uint8_t *dst, \
+ ptrdiff_t dst_stride, \
+ const uint8_t *src, \
+ ptrdiff_t src_stride, \
+ const int16_t *src_16bit, \
+ int height, \
+ intptr_t mx, \
+ intptr_t my, \
+ int width)
+BI_MC(epel, h, 12);
+BI_MC(epel, h, 16);
+BI_MC(epel, h, 32);
+BI_MC(epel, h, 48);
+BI_MC(epel, h, 64);
+
+#undef BI_MC
+
#endif // #ifndef AVCODEC_LOONGARCH_HEVCDSP_LASX_H
diff --git a/libavcodec/loongarch/hevcdsp_lsx.h b/libavcodec/loongarch/hevcdsp_lsx.h
index 7769cf25ae..a5ef237b5d 100644
--- a/libavcodec/loongarch/hevcdsp_lsx.h
+++ b/libavcodec/loongarch/hevcdsp_lsx.h
@@ -126,8 +126,15 @@ BI_MC(qpel, hv, 32);
BI_MC(qpel, hv, 48);
BI_MC(qpel, hv, 64);
+BI_MC(epel, h, 4);
+BI_MC(epel, h, 6);
+BI_MC(epel, h, 8);
+BI_MC(epel, h, 12);
+BI_MC(epel, h, 16);
BI_MC(epel, h, 24);
BI_MC(epel, h, 32);
+BI_MC(epel, h, 48);
+BI_MC(epel, h, 64);
BI_MC(epel, v, 12);
BI_MC(epel, v, 16);
@@ -151,7 +158,14 @@ void ff_hevc_put_hevc_uni_##PEL##_##DIR##WIDTH##_8_lsx(uint8_t *dst, \
intptr_t mx, \
intptr_t my, \
int width)
-
+UNI_MC(qpel, h, 4);
+UNI_MC(qpel, h, 6);
+UNI_MC(qpel, h, 8);
+UNI_MC(qpel, h, 12);
+UNI_MC(qpel, h, 16);
+UNI_MC(qpel, h, 24);
+UNI_MC(qpel, h, 32);
+UNI_MC(qpel, h, 48);
UNI_MC(qpel, h, 64);
UNI_MC(qpel, v, 24);
@@ -287,6 +301,26 @@ PEL_UNI_W(epel, hv, 32);
PEL_UNI_W(epel, hv, 48);
PEL_UNI_W(epel, hv, 64);
+PEL_UNI_W(epel, h, 4);
+PEL_UNI_W(epel, h, 6);
+PEL_UNI_W(epel, h, 8);
+PEL_UNI_W(epel, h, 12);
+PEL_UNI_W(epel, h, 16);
+PEL_UNI_W(epel, h, 24);
+PEL_UNI_W(epel, h, 32);
+PEL_UNI_W(epel, h, 48);
+PEL_UNI_W(epel, h, 64);
+
+PEL_UNI_W(epel, v, 4);
+PEL_UNI_W(epel, v, 6);
+PEL_UNI_W(epel, v, 8);
+PEL_UNI_W(epel, v, 12);
+PEL_UNI_W(epel, v, 16);
+PEL_UNI_W(epel, v, 24);
+PEL_UNI_W(epel, v, 32);
+PEL_UNI_W(epel, v, 48);
+PEL_UNI_W(epel, v, 64);
+
#undef PEL_UNI_W
#endif // #ifndef AVCODEC_LOONGARCH_HEVCDSP_LSX_H
--
2.20.1
* [FFmpeg-devel] [PATCH v2 7/7] avcodec/hevc: Add ff_hevc_idct_32x32_lasx asm opt
2023-12-27 4:50 [FFmpeg-devel] [PATCH v2] [loongarch] Add hevc 128-bit & 256-bit asm optimizatons jinbo
` (5 preceding siblings ...)
2023-12-27 4:50 ` [FFmpeg-devel] [PATCH v2 6/7] avcodec/hevc: Add asm opt for the following functions jinbo
@ 2023-12-27 4:50 ` jinbo
2023-12-28 7:29 ` yinshiyou-hf
6 siblings, 1 reply; 10+ messages in thread
From: jinbo @ 2023-12-27 4:50 UTC (permalink / raw)
To: ffmpeg-devel; +Cc: yuanhecai
From: yuanhecai <yuanhecai@loongson.cn>
tests/checkasm/checkasm:
                        C        LSX      LASX
hevc_idct_32x32_8_c:    1243.0   211.7    101.7
The speedup of decoding H265 4K 30FPS 30Mbps on 3A6000
with 8 threads is 1 fps (56fps --> 57fps).
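(For reference, the checkasm numbers above are presumably gathered with
something like "make checkasm && tests/checkasm/checkasm --bench
--test=hevc_idct"; the exact option spelling is an assumption here.)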
---
libavcodec/loongarch/Makefile | 3 +-
libavcodec/loongarch/hevc_idct.S | 863 ++++++++++++++++++
libavcodec/loongarch/hevc_idct_lsx.c | 10 +-
libavcodec/loongarch/hevcdsp_init_loongarch.c | 2 +
libavcodec/loongarch/hevcdsp_lasx.h | 2 +
5 files changed, 874 insertions(+), 6 deletions(-)
create mode 100644 libavcodec/loongarch/hevc_idct.S
diff --git a/libavcodec/loongarch/Makefile b/libavcodec/loongarch/Makefile
index ad98cd4054..07da2964e4 100644
--- a/libavcodec/loongarch/Makefile
+++ b/libavcodec/loongarch/Makefile
@@ -29,7 +29,8 @@ LSX-OBJS-$(CONFIG_HEVC_DECODER) += loongarch/hevcdsp_lsx.o \
loongarch/hevc_mc_uni_lsx.o \
loongarch/hevc_mc_uniw_lsx.o \
loongarch/hevc_add_res.o \
- loongarch/hevc_mc.o
+ loongarch/hevc_mc.o \
+ loongarch/hevc_idct.o
LSX-OBJS-$(CONFIG_H264DSP) += loongarch/h264idct.o \
loongarch/h264idct_loongarch.o \
loongarch/h264dsp.o
diff --git a/libavcodec/loongarch/hevc_idct.S b/libavcodec/loongarch/hevc_idct.S
new file mode 100644
index 0000000000..5593e5fd73
--- /dev/null
+++ b/libavcodec/loongarch/hevc_idct.S
@@ -0,0 +1,863 @@
+/*
+ * Copyright (c) 2023 Loongson Technology Corporation Limited
+ * Contributed by Hecai Yuan <yuanhecai@loongson.cn>
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "loongson_asm.S"
+
+.macro fr_store
+ addi.d sp, sp, -64
+ fst.d f24, sp, 0
+ fst.d f25, sp, 8
+ fst.d f26, sp, 16
+ fst.d f27, sp, 24
+ fst.d f28, sp, 32
+ fst.d f29, sp, 40
+ fst.d f30, sp, 48
+ fst.d f31, sp, 56
+.endm
+
+.macro fr_recover
+ fld.d f24, sp, 0
+ fld.d f25, sp, 8
+ fld.d f26, sp, 16
+ fld.d f27, sp, 24
+ fld.d f28, sp, 32
+ fld.d f29, sp, 40
+ fld.d f30, sp, 48
+ fld.d f31, sp, 56
+ addi.d sp, sp, 64
+.endm
+
+.macro malloc_space number
+ li.w t0, \number
+ sub.d sp, sp, t0
+ fr_store
+.endm
+
+.macro free_space number
+ fr_recover
+ li.w t0, \number
+ add.d sp, sp, t0
+.endm
+
+.extern gt32x32_cnst1
+
+.extern gt32x32_cnst2
+
+.extern gt8x8_cnst
+
+.extern gt32x32_cnst0
+
+.macro idct_16x32_step1_lasx
+ xvldrepl.w xr20, t1, 0
+ xvldrepl.w xr21, t1, 4
+ xvldrepl.w xr22, t1, 8
+ xvldrepl.w xr23, t1, 12
+
+ xvmulwev.w.h xr16, xr8, xr20
+ xvmaddwod.w.h xr16, xr8, xr20
+ xvmulwev.w.h xr17, xr9, xr20
+ xvmaddwod.w.h xr17, xr9, xr20
+
+ xvmaddwev.w.h xr16, xr10, xr21
+ xvmaddwod.w.h xr16, xr10, xr21
+ xvmaddwev.w.h xr17, xr11, xr21
+ xvmaddwod.w.h xr17, xr11, xr21
+
+ xvmaddwev.w.h xr16, xr12, xr22
+ xvmaddwod.w.h xr16, xr12, xr22
+ xvmaddwev.w.h xr17, xr13, xr22
+ xvmaddwod.w.h xr17, xr13, xr22
+
+ xvmaddwev.w.h xr16, xr14, xr23
+ xvmaddwod.w.h xr16, xr14, xr23
+ xvmaddwev.w.h xr17, xr15, xr23
+ xvmaddwod.w.h xr17, xr15, xr23
+
+ xvld xr0, t2, 0
+ xvld xr1, t2, 32
+
+ xvadd.w xr18, xr0, xr16
+ xvadd.w xr19, xr1, xr17
+ xvsub.w xr0, xr0, xr16
+ xvsub.w xr1, xr1, xr17
+
+ xvst xr18, t2, 0
+ xvst xr19, t2, 32
+ xvst xr0, t3, 0
+ xvst xr1, t3, 32
+.endm
+
+.macro idct_16x32_step2_lasx in0, in1, in2, in3, in4, in5, in6, in7, out0, out1
+
+ xvldrepl.w xr20, t1, 0
+ xvldrepl.w xr21, t1, 4
+ xvldrepl.w xr22, t1, 8
+ xvldrepl.w xr23, t1, 12
+
+ xvmulwev.w.h \out0, \in0, xr20
+ xvmaddwod.w.h \out0, \in0, xr20
+ xvmulwev.w.h \out1, \in1, xr20
+ xvmaddwod.w.h \out1, \in1, xr20
+ xvmaddwev.w.h \out0, \in2, xr21
+ xvmaddwod.w.h \out0, \in2, xr21
+ xvmaddwev.w.h \out1, \in3, xr21
+ xvmaddwod.w.h \out1, \in3, xr21
+ xvmaddwev.w.h \out0, \in4, xr22
+ xvmaddwod.w.h \out0, \in4, xr22
+ xvmaddwev.w.h \out1, \in5, xr22
+ xvmaddwod.w.h \out1, \in5, xr22
+ xvmaddwev.w.h \out0, \in6, xr23
+ xvmaddwod.w.h \out0, \in6, xr23
+ xvmaddwev.w.h \out1, \in7, xr23 // sum0_r
+ xvmaddwod.w.h \out1, \in7, xr23 // sum0_l
+.endm
+
+ /* loop for all columns of filter constants */
+.macro idct_16x32_step3_lasx round
+ xvadd.w xr16, xr16, xr30
+ xvadd.w xr17, xr17, xr31
+
+ xvld xr0, t2, 0
+ xvld xr1, t2, 32
+
+ xvadd.w xr30, xr0, xr16
+ xvadd.w xr31, xr1, xr17
+ xvsub.w xr16, xr0, xr16
+ xvsub.w xr17, xr1, xr17
+ xvssrarni.h.w xr31, xr30, \round
+ xvssrarni.h.w xr17, xr16, \round
+ xvst xr31, t4, 0
+ xvst xr17, t5, 0
+.endm
+
+.macro idct_16x32_lasx buf_pitch, round
+ addi.d t2, sp, 64
+
+ addi.d t0, a0, \buf_pitch*4*2
+
+ // 4 12 20 28
+ xvld xr0, t0, 0
+ xvld xr1, t0, \buf_pitch*8*2
+ xvld xr2, t0, \buf_pitch*16*2
+ xvld xr3, t0, \buf_pitch*24*2
+
+ xvilvl.h xr10, xr1, xr0
+ xvilvh.h xr11, xr1, xr0
+ xvilvl.h xr12, xr3, xr2
+ xvilvh.h xr13, xr3, xr2
+
+ la.local t1, gt32x32_cnst2
+
+ xvldrepl.w xr20, t1, 0
+ xvldrepl.w xr21, t1, 4
+ xvmulwev.w.h xr14, xr10, xr20
+ xvmaddwod.w.h xr14, xr10, xr20
+ xvmulwev.w.h xr15, xr11, xr20
+ xvmaddwod.w.h xr15, xr11, xr20
+ xvmaddwev.w.h xr14, xr12, xr21
+ xvmaddwod.w.h xr14, xr12, xr21
+ xvmaddwev.w.h xr15, xr13, xr21
+ xvmaddwod.w.h xr15, xr13, xr21
+
+ xvldrepl.w xr20, t1, 8
+ xvldrepl.w xr21, t1, 12
+ xvmulwev.w.h xr16, xr10, xr20
+ xvmaddwod.w.h xr16, xr10, xr20
+ xvmulwev.w.h xr17, xr11, xr20
+ xvmaddwod.w.h xr17, xr11, xr20
+ xvmaddwev.w.h xr16, xr12, xr21
+ xvmaddwod.w.h xr16, xr12, xr21
+ xvmaddwev.w.h xr17, xr13, xr21
+ xvmaddwod.w.h xr17, xr13, xr21
+
+ xvldrepl.w xr20, t1, 16
+ xvldrepl.w xr21, t1, 20
+ xvmulwev.w.h xr18, xr10, xr20
+ xvmaddwod.w.h xr18, xr10, xr20
+ xvmulwev.w.h xr19, xr11, xr20
+ xvmaddwod.w.h xr19, xr11, xr20
+ xvmaddwev.w.h xr18, xr12, xr21
+ xvmaddwod.w.h xr18, xr12, xr21
+ xvmaddwev.w.h xr19, xr13, xr21
+ xvmaddwod.w.h xr19, xr13, xr21
+
+ xvldrepl.w xr20, t1, 24
+ xvldrepl.w xr21, t1, 28
+ xvmulwev.w.h xr22, xr10, xr20
+ xvmaddwod.w.h xr22, xr10, xr20
+ xvmulwev.w.h xr23, xr11, xr20
+ xvmaddwod.w.h xr23, xr11, xr20
+ xvmaddwev.w.h xr22, xr12, xr21
+ xvmaddwod.w.h xr22, xr12, xr21
+ xvmaddwev.w.h xr23, xr13, xr21
+ xvmaddwod.w.h xr23, xr13, xr21
+
+ /* process coeff 0, 8, 16, 24 */
+ la.local t1, gt8x8_cnst
+
+ xvld xr0, a0, 0
+ xvld xr1, a0, \buf_pitch*8*2
+ xvld xr2, a0, \buf_pitch*16*2
+ xvld xr3, a0, \buf_pitch*24*2
+
+ xvldrepl.w xr20, t1, 0
+ xvldrepl.w xr21, t1, 4
+
+ xvilvl.h xr10, xr2, xr0
+ xvilvh.h xr11, xr2, xr0
+ xvilvl.h xr12, xr3, xr1
+ xvilvh.h xr13, xr3, xr1
+
+ xvmulwev.w.h xr4, xr10, xr20
+ xvmaddwod.w.h xr4, xr10, xr20 // sum0_r
+ xvmulwev.w.h xr5, xr11, xr20
+ xvmaddwod.w.h xr5, xr11, xr20 // sum0_l
+ xvmulwev.w.h xr6, xr12, xr21
+ xvmaddwod.w.h xr6, xr12, xr21 // tmp1_r
+ xvmulwev.w.h xr7, xr13, xr21
+ xvmaddwod.w.h xr7, xr13, xr21 // tmp1_l
+
+ xvsub.w xr0, xr4, xr6 // sum1_r
+ xvadd.w xr1, xr4, xr6 // sum0_r
+ xvsub.w xr2, xr5, xr7 // sum1_l
+ xvadd.w xr3, xr5, xr7 // sum0_l
+
+ // HEVC_EVEN16_CALC
+ xvsub.w xr24, xr1, xr14 // 7
+ xvsub.w xr25, xr3, xr15
+ xvadd.w xr14, xr1, xr14 // 0
+ xvadd.w xr15, xr3, xr15
+ xvst xr24, t2, 7*16*4 // 448=16*28=7*16*4
+ xvst xr25, t2, 7*16*4+32 // 480
+ xvst xr14, t2, 0
+ xvst xr15, t2, 32
+
+ xvsub.w xr26, xr0, xr22 // 4
+ xvsub.w xr27, xr2, xr23
+ xvadd.w xr22, xr0, xr22 // 3
+ xvadd.w xr23, xr2, xr23
+ xvst xr26, t2, 4*16*4 // 256=4*16*4
+ xvst xr27, t2, 4*16*4+32 // 288
+ xvst xr22, t2, 3*16*4 // 192=3*16*4
+ xvst xr23, t2, 3*16*4+32 // 224
+
+ xvldrepl.w xr20, t1, 16
+ xvldrepl.w xr21, t1, 20
+
+ xvmulwev.w.h xr4, xr10, xr20
+ xvmaddwod.w.h xr4, xr10, xr20
+ xvmulwev.w.h xr5, xr11, xr20
+ xvmaddwod.w.h xr5, xr11, xr20
+ xvmulwev.w.h xr6, xr12, xr21
+ xvmaddwod.w.h xr6, xr12, xr21
+ xvmulwev.w.h xr7, xr13, xr21
+ xvmaddwod.w.h xr7, xr13, xr21
+
+ xvsub.w xr0, xr4, xr6 // sum1_r
+ xvadd.w xr1, xr4, xr6 // sum0_r
+ xvsub.w xr2, xr5, xr7 // sum1_l
+ xvadd.w xr3, xr5, xr7 // sum0_l
+
+ // HEVC_EVEN16_CALC
+ xvsub.w xr24, xr1, xr16 // 6
+ xvsub.w xr25, xr3, xr17
+ xvadd.w xr16, xr1, xr16 // 1
+ xvadd.w xr17, xr3, xr17
+ xvst xr24, t2, 6*16*4 // 384=6*16*4
+ xvst xr25, t2, 6*16*4+32 // 416
+ xvst xr16, t2, 1*16*4 // 64=1*16*4
+ xvst xr17, t2, 1*16*4+32 // 96
+
+ xvsub.w xr26, xr0, xr18 // 5
+ xvsub.w xr27, xr2, xr19
+ xvadd.w xr18, xr0, xr18 // 2
+ xvadd.w xr19, xr2, xr19
+ xvst xr26, t2, 5*16*4 // 320=5*16*4
+ xvst xr27, t2, 5*16*4+32 // 352
+ xvst xr18, t2, 2*16*4 // 128=2*16*4
+ xvst xr19, t2, 2*16*4+32 // 160
+
+ /* process coeff 2 6 10 14 18 22 26 30 */
+ addi.d t0, a0, \buf_pitch*2*2
+
+ xvld xr0, t0, 0
+ xvld xr1, t0, \buf_pitch*4*2
+ xvld xr2, t0, \buf_pitch*8*2
+ xvld xr3, t0, \buf_pitch*12*2
+
+ xvld xr4, t0, \buf_pitch*16*2
+ xvld xr5, t0, \buf_pitch*20*2
+ xvld xr6, t0, \buf_pitch*24*2
+ xvld xr7, t0, \buf_pitch*28*2
+
+ xvilvl.h xr8, xr1, xr0
+ xvilvh.h xr9, xr1, xr0
+ xvilvl.h xr10, xr3, xr2
+ xvilvh.h xr11, xr3, xr2
+ xvilvl.h xr12, xr5, xr4
+ xvilvh.h xr13, xr5, xr4
+ xvilvl.h xr14, xr7, xr6
+ xvilvh.h xr15, xr7, xr6
+
+ la.local t1, gt32x32_cnst1
+
+ addi.d t2, sp, 64
+ addi.d t3, sp, 64+960 // 30*32
+
+ idct_16x32_step1_lasx
+
+.rept 7
+ addi.d t1, t1, 16
+ addi.d t2, t2, 64
+ addi.d t3, t3, -64
+ idct_16x32_step1_lasx
+.endr
+
+ addi.d t0, a0, \buf_pitch*2
+
+ xvld xr0, t0, 0
+ xvld xr1, t0, \buf_pitch*2*2
+ xvld xr2, t0, \buf_pitch*4*2
+ xvld xr3, t0, \buf_pitch*6*2
+ xvld xr4, t0, \buf_pitch*8*2
+ xvld xr5, t0, \buf_pitch*10*2
+ xvld xr6, t0, \buf_pitch*12*2
+ xvld xr7, t0, \buf_pitch*14*2
+
+ xvilvl.h xr8, xr1, xr0
+ xvilvh.h xr9, xr1, xr0
+ xvilvl.h xr10, xr3, xr2
+ xvilvh.h xr11, xr3, xr2
+ xvilvl.h xr12, xr5, xr4
+ xvilvh.h xr13, xr5, xr4
+ xvilvl.h xr14, xr7, xr6
+ xvilvh.h xr15, xr7, xr6
+
+ la.local t1, gt32x32_cnst0
+
+ idct_16x32_step2_lasx xr8, xr9, xr10, xr11, xr12, xr13, \
+ xr14, xr15, xr16, xr17
+
+ addi.d t0, a0, \buf_pitch*16*2+\buf_pitch*2
+
+ xvld xr0, t0, 0
+ xvld xr1, t0, \buf_pitch*2*2
+ xvld xr2, t0, \buf_pitch*4*2
+ xvld xr3, t0, \buf_pitch*6*2
+ xvld xr4, t0, \buf_pitch*8*2
+ xvld xr5, t0, \buf_pitch*10*2
+ xvld xr6, t0, \buf_pitch*12*2
+ xvld xr7, t0, \buf_pitch*14*2
+
+ xvilvl.h xr18, xr1, xr0
+ xvilvh.h xr19, xr1, xr0
+ xvilvl.h xr24, xr3, xr2
+ xvilvh.h xr25, xr3, xr2
+ xvilvl.h xr26, xr5, xr4
+ xvilvh.h xr27, xr5, xr4
+ xvilvl.h xr28, xr7, xr6
+ xvilvh.h xr29, xr7, xr6
+
+ addi.d t1, t1, 16
+ idct_16x32_step2_lasx xr18, xr19, xr24, xr25, xr26, xr27, \
+ xr28, xr29, xr30, xr31
+
+ addi.d t4, a0, 0
+ addi.d t5, a0, \buf_pitch*31*2
+ addi.d t2, sp, 64
+
+ idct_16x32_step3_lasx \round
+
+.rept 15
+
+ addi.d t1, t1, 16
+ idct_16x32_step2_lasx xr8, xr9, xr10, xr11, xr12, xr13, \
+ xr14, xr15, xr16, xr17
+
+ addi.d t1, t1, 16
+ idct_16x32_step2_lasx xr18, xr19, xr24, xr25, xr26, xr27, \
+ xr28, xr29, xr30, xr31
+
+ addi.d t2, t2, 64
+ addi.d t4, t4, \buf_pitch*2
+ addi.d t5, t5, -\buf_pitch*2
+
+ idct_16x32_step3_lasx \round
+.endr
+
+.endm
+
+function hevc_idct_16x32_column_step1_lasx
+ malloc_space 512+512+512
+
+ idct_16x32_lasx 32, 7
+
+ free_space 512+512+512
+endfunc
+
+function hevc_idct_16x32_column_step2_lasx
+ malloc_space 512+512+512
+
+ idct_16x32_lasx 16, 12
+
+ free_space 512+512+512
+endfunc
+
+function hevc_idct_transpose_32x16_to_16x32_lasx
+ fr_store
+
+ xvld xr0, a0, 0
+ xvld xr1, a0, 64
+ xvld xr2, a0, 128
+ xvld xr3, a0, 192
+ xvld xr4, a0, 256
+ xvld xr5, a0, 320
+ xvld xr6, a0, 384
+ xvld xr7, a0, 448
+
+ xvpermi.q xr8, xr0, 0x01
+ xvpermi.q xr9, xr1, 0x01
+ xvpermi.q xr10, xr2, 0x01
+ xvpermi.q xr11, xr3, 0x01
+ xvpermi.q xr12, xr4, 0x01
+ xvpermi.q xr13, xr5, 0x01
+ xvpermi.q xr14, xr6, 0x01
+ xvpermi.q xr15, xr7, 0x01
+
+ LSX_TRANSPOSE8x8_H vr0, vr1, vr2, vr3, vr4, vr5, vr6, vr7, \
+ vr0, vr1, vr2, vr3, vr4, vr5, vr6, vr7, \
+ vr16, vr17, vr18, vr19, vr20, vr21, vr22, vr23
+
+ LSX_TRANSPOSE8x8_H vr8, vr9, vr10, vr11, vr12, vr13, vr14, vr15, \
+ vr8, vr9, vr10, vr11, vr12, vr13, vr14, vr15, \
+ vr16, vr17, vr18, vr19, vr20, vr21, vr22, vr23
+
+ addi.d a0, a0, 512
+
+ vld vr24, a0, 0
+ vld vr25, a0, 64
+ vld vr26, a0, 128
+ vld vr27, a0, 192
+ vld vr28, a0, 256
+ vld vr29, a0, 320
+ vld vr30, a0, 384
+ vld vr31, a0, 448
+
+ LSX_TRANSPOSE8x8_H vr24, vr25, vr26, vr27, vr28, vr29, vr30, vr31, \
+ vr24, vr25, vr26, vr27, vr28, vr29, vr30, vr31, \
+ vr16, vr17, vr18, vr19, vr20, vr21, vr22, vr23
+
+ xvpermi.q xr0, xr24, 0x02
+ xvpermi.q xr1, xr25, 0x02
+ xvpermi.q xr2, xr26, 0x02
+ xvpermi.q xr3, xr27, 0x02
+ xvpermi.q xr4, xr28, 0x02
+ xvpermi.q xr5, xr29, 0x02
+ xvpermi.q xr6, xr30, 0x02
+ xvpermi.q xr7, xr31, 0x02
+
+ xvst xr0, a1, 0
+ xvst xr1, a1, 32
+ xvst xr2, a1, 64
+ xvst xr3, a1, 96
+ xvst xr4, a1, 128
+ xvst xr5, a1, 160
+ xvst xr6, a1, 192
+ xvst xr7, a1, 224
+
+ addi.d a1, a1, 256
+ addi.d a0, a0, 16
+
+ vld vr24, a0, 0
+ vld vr25, a0, 64
+ vld vr26, a0, 128
+ vld vr27, a0, 192
+ vld vr28, a0, 256
+ vld vr29, a0, 320
+ vld vr30, a0, 384
+ vld vr31, a0, 448
+
+ LSX_TRANSPOSE8x8_H vr24, vr25, vr26, vr27, vr28, vr29, vr30, vr31, \
+ vr24, vr25, vr26, vr27, vr28, vr29, vr30, vr31, \
+ vr16, vr17, vr18, vr19, vr20, vr21, vr22, vr23
+
+ xvpermi.q xr8, xr24, 0x02
+ xvpermi.q xr9, xr25, 0x02
+ xvpermi.q xr10, xr26, 0x02
+ xvpermi.q xr11, xr27, 0x02
+ xvpermi.q xr12, xr28, 0x02
+ xvpermi.q xr13, xr29, 0x02
+ xvpermi.q xr14, xr30, 0x02
+ xvpermi.q xr15, xr31, 0x02
+
+ xvst xr8, a1, 0
+ xvst xr9, a1, 32
+ xvst xr10, a1, 64
+ xvst xr11, a1, 96
+ xvst xr12, a1, 128
+ xvst xr13, a1, 160
+ xvst xr14, a1, 192
+ xvst xr15, a1, 224
+
+ // second
+ addi.d a0, a0, 32-512-16
+
+ xvld xr0, a0, 0
+ xvld xr1, a0, 64
+ xvld xr2, a0, 128
+ xvld xr3, a0, 192
+ xvld xr4, a0, 256
+ xvld xr5, a0, 320
+ xvld xr6, a0, 384
+ xvld xr7, a0, 448
+
+ xvpermi.q xr8, xr0, 0x01
+ xvpermi.q xr9, xr1, 0x01
+ xvpermi.q xr10, xr2, 0x01
+ xvpermi.q xr11, xr3, 0x01
+ xvpermi.q xr12, xr4, 0x01
+ xvpermi.q xr13, xr5, 0x01
+ xvpermi.q xr14, xr6, 0x01
+ xvpermi.q xr15, xr7, 0x01
+
+ LSX_TRANSPOSE8x8_H vr0, vr1, vr2, vr3, vr4, vr5, vr6, vr7, \
+ vr0, vr1, vr2, vr3, vr4, vr5, vr6, vr7, \
+ vr16, vr17, vr18, vr19, vr20, vr21, vr22, vr23
+
+ LSX_TRANSPOSE8x8_H vr8, vr9, vr10, vr11, vr12, vr13, vr14, vr15, \
+ vr8, vr9, vr10, vr11, vr12, vr13, vr14, vr15, \
+ vr16, vr17, vr18, vr19, vr20, vr21, vr22, vr23
+
+ addi.d a0, a0, 512
+
+ vld vr24, a0, 0
+ vld vr25, a0, 64
+ vld vr26, a0, 128
+ vld vr27, a0, 192
+ vld vr28, a0, 256
+ vld vr29, a0, 320
+ vld vr30, a0, 384
+ vld vr31, a0, 448
+
+ LSX_TRANSPOSE8x8_H vr24, vr25, vr26, vr27, vr28, vr29, vr30, vr31, \
+ vr24, vr25, vr26, vr27, vr28, vr29, vr30, vr31, \
+ vr16, vr17, vr18, vr19, vr20, vr21, vr22, vr23
+
+ xvpermi.q xr0, xr24, 0x02
+ xvpermi.q xr1, xr25, 0x02
+ xvpermi.q xr2, xr26, 0x02
+ xvpermi.q xr3, xr27, 0x02
+ xvpermi.q xr4, xr28, 0x02
+ xvpermi.q xr5, xr29, 0x02
+ xvpermi.q xr6, xr30, 0x02
+ xvpermi.q xr7, xr31, 0x02
+
+ addi.d a1, a1, 256
+ xvst xr0, a1, 0
+ xvst xr1, a1, 32
+ xvst xr2, a1, 64
+ xvst xr3, a1, 96
+ xvst xr4, a1, 128
+ xvst xr5, a1, 160
+ xvst xr6, a1, 192
+ xvst xr7, a1, 224
+
+ addi.d a1, a1, 256
+ addi.d a0, a0, 16
+
+ vld vr24, a0, 0
+ vld vr25, a0, 64
+ vld vr26, a0, 128
+ vld vr27, a0, 192
+ vld vr28, a0, 256
+ vld vr29, a0, 320
+ vld vr30, a0, 384
+ vld vr31, a0, 448
+
+ LSX_TRANSPOSE8x8_H vr24, vr25, vr26, vr27, vr28, vr29, vr30, vr31, \
+ vr24, vr25, vr26, vr27, vr28, vr29, vr30, vr31, \
+ vr16, vr17, vr18, vr19, vr20, vr21, vr22, vr23
+
+ xvpermi.q xr8, xr24, 0x02
+ xvpermi.q xr9, xr25, 0x02
+ xvpermi.q xr10, xr26, 0x02
+ xvpermi.q xr11, xr27, 0x02
+ xvpermi.q xr12, xr28, 0x02
+ xvpermi.q xr13, xr29, 0x02
+ xvpermi.q xr14, xr30, 0x02
+ xvpermi.q xr15, xr31, 0x02
+
+ xvst xr8, a1, 0
+ xvst xr9, a1, 32
+ xvst xr10, a1, 64
+ xvst xr11, a1, 96
+ xvst xr12, a1, 128
+ xvst xr13, a1, 160
+ xvst xr14, a1, 192
+ xvst xr15, a1, 224
+
+ fr_recover
+endfunc
+
+function hevc_idct_transpose_16x32_to_32x16_lasx
+ fr_store
+
+ xvld xr0, a0, 0
+ xvld xr1, a0, 32
+ xvld xr2, a0, 64
+ xvld xr3, a0, 96
+ xvld xr4, a0, 128
+ xvld xr5, a0, 160
+ xvld xr6, a0, 192
+ xvld xr7, a0, 224
+
+ xvpermi.q xr8, xr0, 0x01
+ xvpermi.q xr9, xr1, 0x01
+ xvpermi.q xr10, xr2, 0x01
+ xvpermi.q xr11, xr3, 0x01
+ xvpermi.q xr12, xr4, 0x01
+ xvpermi.q xr13, xr5, 0x01
+ xvpermi.q xr14, xr6, 0x01
+ xvpermi.q xr15, xr7, 0x01
+
+ LSX_TRANSPOSE8x8_H vr0, vr1, vr2, vr3, vr4, vr5, vr6, vr7, \
+ vr0, vr1, vr2, vr3, vr4, vr5, vr6, vr7, \
+ vr16, vr17, vr18, vr19, vr20, vr21, vr22, vr23
+
+ LSX_TRANSPOSE8x8_H vr8, vr9, vr10, vr11, vr12, vr13, vr14, vr15, \
+ vr8, vr9, vr10, vr11, vr12, vr13, vr14, vr15, \
+ vr16, vr17, vr18, vr19, vr20, vr21, vr22, vr23
+
+ addi.d a0, a0, 256
+
+ vld vr24, a0, 0
+ vld vr25, a0, 32
+ vld vr26, a0, 64
+ vld vr27, a0, 96
+ vld vr28, a0, 128
+ vld vr29, a0, 160
+ vld vr30, a0, 192
+ vld vr31, a0, 224
+
+ LSX_TRANSPOSE8x8_H vr24, vr25, vr26, vr27, vr28, vr29, vr30, vr31, \
+ vr24, vr25, vr26, vr27, vr28, vr29, vr30, vr31, \
+ vr16, vr17, vr18, vr19, vr20, vr21, vr22, vr23
+
+ xvpermi.q xr0, xr24, 0x02
+ xvpermi.q xr1, xr25, 0x02
+ xvpermi.q xr2, xr26, 0x02
+ xvpermi.q xr3, xr27, 0x02
+ xvpermi.q xr4, xr28, 0x02
+ xvpermi.q xr5, xr29, 0x02
+ xvpermi.q xr6, xr30, 0x02
+ xvpermi.q xr7, xr31, 0x02
+
+ xvst xr0, a1, 0
+ xvst xr1, a1, 64
+ xvst xr2, a1, 128
+ xvst xr3, a1, 192
+ xvst xr4, a1, 256
+ xvst xr5, a1, 320
+ xvst xr6, a1, 384
+ xvst xr7, a1, 448
+
+ addi.d a1, a1, 512
+ addi.d a0, a0, 16
+
+ vld vr24, a0, 0
+ vld vr25, a0, 32
+ vld vr26, a0, 64
+ vld vr27, a0, 96
+ vld vr28, a0, 128
+ vld vr29, a0, 160
+ vld vr30, a0, 192
+ vld vr31, a0, 224
+
+ LSX_TRANSPOSE8x8_H vr24, vr25, vr26, vr27, vr28, vr29, vr30, vr31, \
+ vr24, vr25, vr26, vr27, vr28, vr29, vr30, vr31, \
+ vr16, vr17, vr18, vr19, vr20, vr21, vr22, vr23
+
+ xvpermi.q xr8, xr24, 0x02
+ xvpermi.q xr9, xr25, 0x02
+ xvpermi.q xr10, xr26, 0x02
+ xvpermi.q xr11, xr27, 0x02
+ xvpermi.q xr12, xr28, 0x02
+ xvpermi.q xr13, xr29, 0x02
+ xvpermi.q xr14, xr30, 0x02
+ xvpermi.q xr15, xr31, 0x02
+
+ xvst xr8, a1, 0
+ xvst xr9, a1, 64
+ xvst xr10, a1, 128
+ xvst xr11, a1, 192
+ xvst xr12, a1, 256
+ xvst xr13, a1, 320
+ xvst xr14, a1, 384
+ xvst xr15, a1, 448
+
+ // second
+ addi.d a0, a0, 256-16
+
+ xvld xr0, a0, 0
+ xvld xr1, a0, 32
+ xvld xr2, a0, 64
+ xvld xr3, a0, 96
+ xvld xr4, a0, 128
+ xvld xr5, a0, 160
+ xvld xr6, a0, 192
+ xvld xr7, a0, 224
+
+ xvpermi.q xr8, xr0, 0x01
+ xvpermi.q xr9, xr1, 0x01
+ xvpermi.q xr10, xr2, 0x01
+ xvpermi.q xr11, xr3, 0x01
+ xvpermi.q xr12, xr4, 0x01
+ xvpermi.q xr13, xr5, 0x01
+ xvpermi.q xr14, xr6, 0x01
+ xvpermi.q xr15, xr7, 0x01
+
+ LSX_TRANSPOSE8x8_H vr0, vr1, vr2, vr3, vr4, vr5, vr6, vr7, \
+ vr0, vr1, vr2, vr3, vr4, vr5, vr6, vr7, \
+ vr16, vr17, vr18, vr19, vr20, vr21, vr22, vr23
+
+ LSX_TRANSPOSE8x8_H vr8, vr9, vr10, vr11, vr12, vr13, vr14, vr15, \
+ vr8, vr9, vr10, vr11, vr12, vr13, vr14, vr15, \
+ vr16, vr17, vr18, vr19, vr20, vr21, vr22, vr23
+
+ addi.d a0, a0, 256
+
+ vld vr24, a0, 0
+ vld vr25, a0, 32
+ vld vr26, a0, 64
+ vld vr27, a0, 96
+ vld vr28, a0, 128
+ vld vr29, a0, 160
+ vld vr30, a0, 192
+ vld vr31, a0, 224
+
+ LSX_TRANSPOSE8x8_H vr24, vr25, vr26, vr27, vr28, vr29, vr30, vr31, \
+ vr24, vr25, vr26, vr27, vr28, vr29, vr30, vr31, \
+ vr16, vr17, vr18, vr19, vr20, vr21, vr22, vr23
+
+ xvpermi.q xr0, xr24, 0x02
+ xvpermi.q xr1, xr25, 0x02
+ xvpermi.q xr2, xr26, 0x02
+ xvpermi.q xr3, xr27, 0x02
+ xvpermi.q xr4, xr28, 0x02
+ xvpermi.q xr5, xr29, 0x02
+ xvpermi.q xr6, xr30, 0x02
+ xvpermi.q xr7, xr31, 0x02
+
+ addi.d a1, a1, -512+32
+
+ xvst xr0, a1, 0
+ xvst xr1, a1, 64
+ xvst xr2, a1, 128
+ xvst xr3, a1, 192
+ xvst xr4, a1, 256
+ xvst xr5, a1, 320
+ xvst xr6, a1, 384
+ xvst xr7, a1, 448
+
+ addi.d a1, a1, 512
+ addi.d a0, a0, 16
+
+ vld vr24, a0, 0
+ vld vr25, a0, 32
+ vld vr26, a0, 64
+ vld vr27, a0, 96
+ vld vr28, a0, 128
+ vld vr29, a0, 160
+ vld vr30, a0, 192
+ vld vr31, a0, 224
+
+ LSX_TRANSPOSE8x8_H vr24, vr25, vr26, vr27, vr28, vr29, vr30, vr31, \
+ vr24, vr25, vr26, vr27, vr28, vr29, vr30, vr31, \
+ vr16, vr17, vr18, vr19, vr20, vr21, vr22, vr23
+
+ xvpermi.q xr8, xr24, 0x02
+ xvpermi.q xr9, xr25, 0x02
+ xvpermi.q xr10, xr26, 0x02
+ xvpermi.q xr11, xr27, 0x02
+ xvpermi.q xr12, xr28, 0x02
+ xvpermi.q xr13, xr29, 0x02
+ xvpermi.q xr14, xr30, 0x02
+ xvpermi.q xr15, xr31, 0x02
+
+ xvst xr8, a1, 0
+ xvst xr9, a1, 64
+ xvst xr10, a1, 128
+ xvst xr11, a1, 192
+ xvst xr12, a1, 256
+ xvst xr13, a1, 320
+ xvst xr14, a1, 384
+ xvst xr15, a1, 448
+
+ fr_recover
+endfunc
+
+function ff_hevc_idct_32x32_lasx
+
+ addi.d t7, a0, 0
+ addi.d t6, a1, 0
+
+ addi.d sp, sp, -8
+ st.d ra, sp, 0
+
+ bl hevc_idct_16x32_column_step1_lasx
+
+ addi.d a0, a0, 32
+
+ bl hevc_idct_16x32_column_step1_lasx
+
+ malloc_space (16*32+31)*2
+
+ addi.d t8, sp, 64+31*2 // tmp_buf_ptr
+
+ addi.d a0, t7, 0
+ addi.d a1, t8, 0
+ bl hevc_idct_transpose_32x16_to_16x32_lasx
+
+ addi.d a0, t8, 0
+ bl hevc_idct_16x32_column_step2_lasx
+
+ addi.d a0, t8, 0
+ addi.d a1, t7, 0
+ bl hevc_idct_transpose_16x32_to_32x16_lasx
+
+ // second
+ addi.d a0, t7, 32*8*2*2
+ addi.d a1, t8, 0
+ bl hevc_idct_transpose_32x16_to_16x32_lasx
+
+ addi.d a0, t8, 0
+ bl hevc_idct_16x32_column_step2_lasx
+
+ addi.d a0, t8, 0
+ addi.d a1, t7, 32*8*2*2
+ bl hevc_idct_transpose_16x32_to_32x16_lasx
+
+ free_space (16*32+31)*2
+
+ ld.d ra, sp, 0
+ addi.d sp, sp, 8
+
+endfunc
diff --git a/libavcodec/loongarch/hevc_idct_lsx.c b/libavcodec/loongarch/hevc_idct_lsx.c
index 2193b27546..527279d85d 100644
--- a/libavcodec/loongarch/hevc_idct_lsx.c
+++ b/libavcodec/loongarch/hevc_idct_lsx.c
@@ -23,18 +23,18 @@
#include "libavutil/loongarch/loongson_intrinsics.h"
#include "hevcdsp_lsx.h"
-static const int16_t gt8x8_cnst[16] __attribute__ ((aligned (64))) = {
+const int16_t gt8x8_cnst[16] __attribute__ ((aligned (64))) = {
64, 64, 83, 36, 89, 50, 18, 75, 64, -64, 36, -83, 75, -89, -50, -18
};
-static const int16_t gt16x16_cnst[64] __attribute__ ((aligned (64))) = {
+const int16_t gt16x16_cnst[64] __attribute__ ((aligned (64))) = {
64, 83, 64, 36, 89, 75, 50, 18, 90, 80, 57, 25, 70, 87, 9, 43,
64, 36, -64, -83, 75, -18, -89, -50, 87, 9, -80, -70, -43, 57, -25, -90,
64, -36, -64, 83, 50, -89, 18, 75, 80, -70, -25, 90, -87, 9, 43, 57,
64, -83, 64, -36, 18, -50, 75, -89, 70, -87, 90, -80, 9, -43, -57, 25
};
-static const int16_t gt32x32_cnst0[256] __attribute__ ((aligned (64))) = {
+const int16_t gt32x32_cnst0[256] __attribute__ ((aligned (64))) = {
90, 90, 88, 85, 82, 78, 73, 67, 61, 54, 46, 38, 31, 22, 13, 4,
90, 82, 67, 46, 22, -4, -31, -54, -73, -85, -90, -88, -78, -61, -38, -13,
88, 67, 31, -13, -54, -82, -90, -78, -46, -4, 38, 73, 90, 85, 61, 22,
@@ -53,14 +53,14 @@ static const int16_t gt32x32_cnst0[256] __attribute__ ((aligned (64))) = {
4, -13, 22, -31, 38, -46, 54, -61, 67, -73, 78, -82, 85, -88, 90, -90
};
-static const int16_t gt32x32_cnst1[64] __attribute__ ((aligned (64))) = {
+const int16_t gt32x32_cnst1[64] __attribute__ ((aligned (64))) = {
90, 87, 80, 70, 57, 43, 25, 9, 87, 57, 9, -43, -80, -90, -70, -25,
80, 9, -70, -87, -25, 57, 90, 43, 70, -43, -87, 9, 90, 25, -80, -57,
57, -80, -25, 90, -9, -87, 43, 70, 43, -90, 57, 25, -87, 70, 9, -80,
25, -70, 90, -80, 43, 9, -57, 87, 9, -25, 43, -57, 70, -80, 87, -90
};
-static const int16_t gt32x32_cnst2[16] __attribute__ ((aligned (64))) = {
+const int16_t gt32x32_cnst2[16] __attribute__ ((aligned (64))) = {
89, 75, 50, 18, 75, -18, -89, -50, 50, -89, 18, 75, 18, -50, 75, -89
};
diff --git a/libavcodec/loongarch/hevcdsp_init_loongarch.c b/libavcodec/loongarch/hevcdsp_init_loongarch.c
index 2756755733..1585bda276 100644
--- a/libavcodec/loongarch/hevcdsp_init_loongarch.c
+++ b/libavcodec/loongarch/hevcdsp_init_loongarch.c
@@ -360,6 +360,8 @@ void ff_hevc_dsp_init_loongarch(HEVCDSPContext *c, const int bit_depth)
c->put_hevc_epel_bi[7][0][1] = ff_hevc_put_hevc_bi_epel_h32_8_lasx;
c->put_hevc_epel_bi[8][0][1] = ff_hevc_put_hevc_bi_epel_h48_8_lasx;
c->put_hevc_epel_bi[9][0][1] = ff_hevc_put_hevc_bi_epel_h64_8_lasx;
+
+ c->idct[3] = ff_hevc_idct_32x32_lasx;
}
}
}
diff --git a/libavcodec/loongarch/hevcdsp_lasx.h b/libavcodec/loongarch/hevcdsp_lasx.h
index 5db35eed47..714cbf5880 100644
--- a/libavcodec/loongarch/hevcdsp_lasx.h
+++ b/libavcodec/loongarch/hevcdsp_lasx.h
@@ -131,4 +131,6 @@ BI_MC(epel, h, 64);
#undef BI_MC
+void ff_hevc_idct_32x32_lasx(int16_t *coeffs, int col_limit);
+
#endif // #ifndef AVCODEC_LOONGARCH_HEVCDSP_LASX_H
--
2.20.1
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [FFmpeg-devel] [PATCH v2 7/7] avcodec/hevc: Add ff_hevc_idct_32x32_lasx asm opt
2023-12-27 4:50 ` [FFmpeg-devel] [PATCH v2 7/7] avcodec/hevc: Add ff_hevc_idct_32x32_lasx asm opt jinbo
@ 2023-12-28 7:29 ` yinshiyou-hf
0 siblings, 0 replies; 10+ messages in thread
From: yinshiyou-hf @ 2023-12-28 7:29 UTC (permalink / raw)
To: FFmpeg development discussions and patches; +Cc: yuanhecai
> -----Original Message-----
> From: jinbo <jinbo@loongson.cn>
> Sent: 2023-12-27 12:50:19 (Wednesday)
> To: ffmpeg-devel@ffmpeg.org
> Cc: yuanhecai <yuanhecai@loongson.cn>
> Subject: [FFmpeg-devel] [PATCH v2 7/7] avcodec/hevc: Add ff_hevc_idct_32x32_lasx asm opt
> +
> +.macro malloc_space number
> + li.w t0, \number
> + sub.d sp, sp, t0
> + fr_store
> +.endm
> +
> +.macro free_space number
> + fr_recover
> + li.w t0, \number
> + add.d sp, sp, t0
> +.endm
> +
Use subi and addi directly; these two macros are not needed.
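A minimal sketch of what that could look like (my reading of the suggestion,
not part of the submitted patch): the frame sizes used here, 512+512+512 = 1536
and (16*32+31)*2 = 1086, both fit in the signed 12-bit immediate of addi.d
(LoongArch has no subi.d, so the subtraction is an addi.d with a negative
immediate), so the li.w/sub.d and li.w/add.d pairs can each be folded into a
single instruction, e.g. for the 1536-byte frame:

        // prologue: allocate the frame, then save callee-saved FP regs
        addi.d    sp, sp, -1536
        fr_store
        ...
        // epilogue: restore FP regs, then release the frame
        fr_recover
        addi.d    sp, sp, 1536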
This email and its attachments contain confidential information from Loongson Technology , which is intended only for the person or entity whose address is listed above. Any use of the information contained herein in any way (including, but not limited to, total or partial disclosure, reproduction or dissemination) by persons other than the intended recipient(s) is prohibited. If you receive this email in error, please notify the sender by phone or email immediately and delete it.
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
^ permalink raw reply [flat|nested] 10+ messages in thread