* [FFmpeg-devel] [PATCH v2 1/7] avcodec/hevc: Add init for sao_edge_filter
2023-12-27 4:50 [FFmpeg-devel] [PATCH v2] [loongarch] Add hevc 128-bit & 256-bit asm optimizations jinbo
@ 2023-12-27 4:50 ` jinbo
2023-12-27 4:50 ` [FFmpeg-devel] [PATCH v2 2/7] avcodec/hevc: Add add_residual_4/8/16/32 asm opt jinbo
` (5 subsequent siblings)
6 siblings, 0 replies; 10+ messages in thread
From: jinbo @ 2023-12-27 4:50 UTC (permalink / raw)
To: ffmpeg-devel; +Cc: jinbo
Forgot to init c->sao_edge_filter[idx] when idx=0/1/2/3.
After this patch, the speedup of decoding H265 4K 30FPS
30Mbps on 3A6000 is about 7% (42fps==>45fps).
Change-Id: I521999b397fa72b931a23c165cf45f276440cdfb
---
libavcodec/loongarch/hevcdsp_init_loongarch.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/libavcodec/loongarch/hevcdsp_init_loongarch.c b/libavcodec/loongarch/hevcdsp_init_loongarch.c
index 22739c6f5b..5a96f3a4c9 100644
--- a/libavcodec/loongarch/hevcdsp_init_loongarch.c
+++ b/libavcodec/loongarch/hevcdsp_init_loongarch.c
@@ -167,6 +167,10 @@ void ff_hevc_dsp_init_loongarch(HEVCDSPContext *c, const int bit_depth)
c->put_hevc_qpel_uni_w[8][1][1] = ff_hevc_put_hevc_uni_w_qpel_hv48_8_lsx;
c->put_hevc_qpel_uni_w[9][1][1] = ff_hevc_put_hevc_uni_w_qpel_hv64_8_lsx;
+ c->sao_edge_filter[0] = ff_hevc_sao_edge_filter_8_lsx;
+ c->sao_edge_filter[1] = ff_hevc_sao_edge_filter_8_lsx;
+ c->sao_edge_filter[2] = ff_hevc_sao_edge_filter_8_lsx;
+ c->sao_edge_filter[3] = ff_hevc_sao_edge_filter_8_lsx;
c->sao_edge_filter[4] = ff_hevc_sao_edge_filter_8_lsx;
c->hevc_h_loop_filter_luma = ff_hevc_loop_filter_luma_h_8_lsx;
--
2.20.1
* [FFmpeg-devel] [PATCH v2 2/7] avcodec/hevc: Add add_residual_4/8/16/32 asm opt
2023-12-27 4:50 [FFmpeg-devel] [PATCH v2] [loongarch] Add hevc 128-bit & 256-bit asm optimizations jinbo
2023-12-27 4:50 ` [FFmpeg-devel] [PATCH v2 1/7] avcodec/hevc: Add init for sao_edge_filter jinbo
@ 2023-12-27 4:50 ` jinbo
2023-12-27 4:50 ` [FFmpeg-devel] [PATCH v2 3/7] avcodec/hevc: Add pel_uni_w_pixels4/6/8/12/16/24/32/48/64 " jinbo
` (4 subsequent siblings)
6 siblings, 0 replies; 10+ messages in thread
From: jinbo @ 2023-12-27 4:50 UTC (permalink / raw)
To: ffmpeg-devel; +Cc: jinbo
After this patch, the performance of decoding H265 4K 30FPS 30Mbps
on 3A6000 with 8 threads improves by 2fps (45fps-->47fps).
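For reference, a rough scalar equivalent of what these LSX routines
vectorize, sketched from the generic 8-bit C path (not the code added
here): add the inverse-transform residual to the prediction and clip to
[0, 255]. av_clip_uint8() is assumed from libavutil/common.h.

    #include <stdint.h>
    #include <stddef.h>
    #include "libavutil/common.h"

    /* Hedged sketch; the assembly below implements this per block
     * size (4/8/16/32) with LSX vectors. */
    static void add_residual_8_ref(uint8_t *dst, const int16_t *res,
                                   ptrdiff_t stride, int size)
    {
        for (int y = 0; y < size; y++) {
            for (int x = 0; x < size; x++)
                dst[x] = av_clip_uint8(dst[x] + res[x]); /* saturate to 8 bits */
            dst += stride;  /* next picture row            */
            res += size;    /* residual is stored densely  */
        }
    }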
---
libavcodec/loongarch/Makefile | 3 +-
libavcodec/loongarch/hevc_add_res.S | 162 ++++++++++++++++++
libavcodec/loongarch/hevcdsp_init_loongarch.c | 5 +
libavcodec/loongarch/hevcdsp_lsx.h | 5 +
4 files changed, 174 insertions(+), 1 deletion(-)
create mode 100644 libavcodec/loongarch/hevc_add_res.S
diff --git a/libavcodec/loongarch/Makefile b/libavcodec/loongarch/Makefile
index 06cfab5c20..07ea97f803 100644
--- a/libavcodec/loongarch/Makefile
+++ b/libavcodec/loongarch/Makefile
@@ -27,7 +27,8 @@ LSX-OBJS-$(CONFIG_HEVC_DECODER) += loongarch/hevcdsp_lsx.o \
loongarch/hevc_lpf_sao_lsx.o \
loongarch/hevc_mc_bi_lsx.o \
loongarch/hevc_mc_uni_lsx.o \
- loongarch/hevc_mc_uniw_lsx.o
+ loongarch/hevc_mc_uniw_lsx.o \
+ loongarch/hevc_add_res.o
LSX-OBJS-$(CONFIG_H264DSP) += loongarch/h264idct.o \
loongarch/h264idct_loongarch.o \
loongarch/h264dsp.o
diff --git a/libavcodec/loongarch/hevc_add_res.S b/libavcodec/loongarch/hevc_add_res.S
new file mode 100644
index 0000000000..dd2d820af8
--- /dev/null
+++ b/libavcodec/loongarch/hevc_add_res.S
@@ -0,0 +1,162 @@
+/*
+ * Loongson LSX optimized add_residual functions for HEVC decoding
+ *
+ * Copyright (c) 2023 Loongson Technology Corporation Limited
+ * Contributed by jinbo <jinbo@loongson.cn>
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "loongson_asm.S"
+
+/*
+ * void ff_hevc_add_residual4x4_lsx(uint8_t *dst, const int16_t *res, ptrdiff_t stride)
+ */
+.macro ADD_RES_LSX_4x4_8
+ vldrepl.w vr0, a0, 0
+ add.d t0, a0, a2
+ vldrepl.w vr1, t0, 0
+ vld vr2, a1, 0
+
+ vilvl.w vr1, vr1, vr0
+ vsllwil.hu.bu vr1, vr1, 0
+ vadd.h vr1, vr1, vr2
+ vssrani.bu.h vr1, vr1, 0
+
+ vstelm.w vr1, a0, 0, 0
+ vstelm.w vr1, t0, 0, 1
+.endm
+
+function ff_hevc_add_residual4x4_8_lsx
+ ADD_RES_LSX_4x4_8
+ alsl.d a0, a2, a0, 1
+ addi.d a1, a1, 16
+ ADD_RES_LSX_4x4_8
+endfunc
+
+/*
+ * void ff_hevc_add_residual8x8_8_lsx(uint8_t *dst, const int16_t *res, ptrdiff_t stride)
+ */
+.macro ADD_RES_LSX_8x8_8
+ vldrepl.d vr0, a0, 0
+ add.d t0, a0, a2
+ vldrepl.d vr1, t0, 0
+ add.d t1, t0, a2
+ vldrepl.d vr2, t1, 0
+ add.d t2, t1, a2
+ vldrepl.d vr3, t2, 0
+
+ vld vr4, a1, 0
+ addi.d t3, zero, 16
+ vldx vr5, a1, t3
+ addi.d t4, a1, 32
+ vld vr6, t4, 0
+ vldx vr7, t4, t3
+
+ vsllwil.hu.bu vr0, vr0, 0
+ vsllwil.hu.bu vr1, vr1, 0
+ vsllwil.hu.bu vr2, vr2, 0
+ vsllwil.hu.bu vr3, vr3, 0
+ vadd.h vr0, vr0, vr4
+ vadd.h vr1, vr1, vr5
+ vadd.h vr2, vr2, vr6
+ vadd.h vr3, vr3, vr7
+ vssrani.bu.h vr1, vr0, 0
+ vssrani.bu.h vr3, vr2, 0
+
+ vstelm.d vr1, a0, 0, 0
+ vstelm.d vr1, t0, 0, 1
+ vstelm.d vr3, t1, 0, 0
+ vstelm.d vr3, t2, 0, 1
+.endm
+
+function ff_hevc_add_residual8x8_8_lsx
+ ADD_RES_LSX_8x8_8
+ alsl.d a0, a2, a0, 2
+ addi.d a1, a1, 64
+ ADD_RES_LSX_8x8_8
+endfunc
+
+/*
+ * void ff_hevc_add_residual16x16_8_lsx(uint8_t *dst, const int16_t *res, ptrdiff_t stride)
+ */
+function ff_hevc_add_residual16x16_8_lsx
+.rept 8
+ vld vr0, a0, 0
+ vldx vr2, a0, a2
+
+ vld vr4, a1, 0
+ addi.d t0, zero, 16
+ vldx vr5, a1, t0
+ addi.d t1, a1, 32
+ vld vr6, t1, 0
+ vldx vr7, t1, t0
+
+ vexth.hu.bu vr1, vr0
+ vsllwil.hu.bu vr0, vr0, 0
+ vexth.hu.bu vr3, vr2
+ vsllwil.hu.bu vr2, vr2, 0
+ vadd.h vr0, vr0, vr4
+ vadd.h vr1, vr1, vr5
+ vadd.h vr2, vr2, vr6
+ vadd.h vr3, vr3, vr7
+
+ vssrani.bu.h vr1, vr0, 0
+ vssrani.bu.h vr3, vr2, 0
+
+ vst vr1, a0, 0
+ vstx vr3, a0, a2
+
+ alsl.d a0, a2, a0, 1
+ addi.d a1, a1, 64
+.endr
+endfunc
+
+/*
+ * void ff_hevc_add_residual32x32_8_lsx(uint8_t *dst, const int16_t *res, ptrdiff_t stride)
+ */
+function ff_hevc_add_residual32x32_8_lsx
+.rept 32
+ vld vr0, a0, 0
+ addi.w t0, zero, 16
+ vldx vr2, a0, t0
+
+ vld vr4, a1, 0
+ vldx vr5, a1, t0
+ addi.d t1, a1, 32
+ vld vr6, t1, 0
+ vldx vr7, t1, t0
+
+ vexth.hu.bu vr1, vr0
+ vsllwil.hu.bu vr0, vr0, 0
+ vexth.hu.bu vr3, vr2
+ vsllwil.hu.bu vr2, vr2, 0
+ vadd.h vr0, vr0, vr4
+ vadd.h vr1, vr1, vr5
+ vadd.h vr2, vr2, vr6
+ vadd.h vr3, vr3, vr7
+
+ vssrani.bu.h vr1, vr0, 0
+ vssrani.bu.h vr3, vr2, 0
+
+ vst vr1, a0, 0
+ vstx vr3, a0, t0
+
+ add.d a0, a0, a2
+ addi.d a1, a1, 64
+.endr
+endfunc
diff --git a/libavcodec/loongarch/hevcdsp_init_loongarch.c b/libavcodec/loongarch/hevcdsp_init_loongarch.c
index 5a96f3a4c9..a8f753dc86 100644
--- a/libavcodec/loongarch/hevcdsp_init_loongarch.c
+++ b/libavcodec/loongarch/hevcdsp_init_loongarch.c
@@ -189,6 +189,11 @@ void ff_hevc_dsp_init_loongarch(HEVCDSPContext *c, const int bit_depth)
c->idct[1] = ff_hevc_idct_8x8_lsx;
c->idct[2] = ff_hevc_idct_16x16_lsx;
c->idct[3] = ff_hevc_idct_32x32_lsx;
+
+ c->add_residual[0] = ff_hevc_add_residual4x4_8_lsx;
+ c->add_residual[1] = ff_hevc_add_residual8x8_8_lsx;
+ c->add_residual[2] = ff_hevc_add_residual16x16_8_lsx;
+ c->add_residual[3] = ff_hevc_add_residual32x32_8_lsx;
}
}
}
diff --git a/libavcodec/loongarch/hevcdsp_lsx.h b/libavcodec/loongarch/hevcdsp_lsx.h
index 0d54196caf..ac509984fd 100644
--- a/libavcodec/loongarch/hevcdsp_lsx.h
+++ b/libavcodec/loongarch/hevcdsp_lsx.h
@@ -227,4 +227,9 @@ void ff_hevc_idct_8x8_lsx(int16_t *coeffs, int col_limit);
void ff_hevc_idct_16x16_lsx(int16_t *coeffs, int col_limit);
void ff_hevc_idct_32x32_lsx(int16_t *coeffs, int col_limit);
+void ff_hevc_add_residual4x4_8_lsx(uint8_t *dst, const int16_t *res, ptrdiff_t stride);
+void ff_hevc_add_residual8x8_8_lsx(uint8_t *dst, const int16_t *res, ptrdiff_t stride);
+void ff_hevc_add_residual16x16_8_lsx(uint8_t *dst, const int16_t *res, ptrdiff_t stride);
+void ff_hevc_add_residual32x32_8_lsx(uint8_t *dst, const int16_t *res, ptrdiff_t stride);
+
#endif // #ifndef AVCODEC_LOONGARCH_HEVCDSP_LSX_H
--
2.20.1
* [FFmpeg-devel] [PATCH v2 3/7] avcodec/hevc: Add pel_uni_w_pixels4/6/8/12/16/24/32/48/64 asm opt
2023-12-27 4:50 [FFmpeg-devel] [PATCH v2] [loongarch] Add hevc 128-bit & 256-bit asm optimizations jinbo
2023-12-27 4:50 ` [FFmpeg-devel] [PATCH v2 1/7] avcodec/hevc: Add init for sao_edge_filter jinbo
2023-12-27 4:50 ` [FFmpeg-devel] [PATCH v2 2/7] avcodec/hevc: Add add_residual_4/8/16/32 asm opt jinbo
@ 2023-12-27 4:50 ` jinbo
2023-12-28 2:18 ` yinshiyou-hf
2023-12-27 4:50 ` [FFmpeg-devel] [PATCH v2 4/7] avcodec/hevc: Add qpel_uni_w_v|h4/6/8/12/16/24/32/48/64 " jinbo
` (3 subsequent siblings)
6 siblings, 1 reply; 10+ messages in thread
From: jinbo @ 2023-12-27 4:50 UTC (permalink / raw)
To: ffmpeg-devel; +Cc: jinbo
tests/checkasm/checkasm:            C       LSX     LASX
put_hevc_pel_uni_w_pixels4_8_c:     2.7     1.0
put_hevc_pel_uni_w_pixels6_8_c:     6.2     2.0     1.5
put_hevc_pel_uni_w_pixels8_8_c:     10.7    2.5     1.7
put_hevc_pel_uni_w_pixels12_8_c:    23.0    5.5     5.0
put_hevc_pel_uni_w_pixels16_8_c:    41.0    8.2     5.0
put_hevc_pel_uni_w_pixels24_8_c:    91.0    19.7    13.2
put_hevc_pel_uni_w_pixels32_8_c:    161.7   32.5    16.2
put_hevc_pel_uni_w_pixels48_8_c:    354.5   73.7    43.0
put_hevc_pel_uni_w_pixels64_8_c:    641.5   130.0   64.2
Speedup of decoding H265 4K 30FPS 30Mbps on 3A6000 with
8 threads is 1fps (47fps-->48fps).
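For context, a hedged scalar sketch of the 8-bit weighted uni-prediction
copy these functions implement, mirroring what LOAD_VAR sets up below
(shift = denom + 6, offset = 1 << (shift - 1)); av_clip_uint8() is
assumed from libavutil, and mx/my are unused in the pure pixel-copy case.

    static void pel_uni_w_pixels_8_ref(uint8_t *dst, ptrdiff_t dst_stride,
                                       const uint8_t *src, ptrdiff_t src_stride,
                                       int height, int denom, int wx, int ox,
                                       int width)
    {
        const int shift  = denom + 6;        /* denom + 14 - bit_depth(8) */
        const int offset = 1 << (shift - 1); /* rounding term             */

        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++)
                dst[x] = av_clip_uint8((((src[x] << 6) * wx + offset) >> shift) + ox);
            dst += dst_stride;
            src += src_stride;
        }
    }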
---
libavcodec/loongarch/Makefile | 3 +-
libavcodec/loongarch/hevc_mc.S | 471 ++++++++++++++++++
libavcodec/loongarch/hevcdsp_init_loongarch.c | 43 ++
libavcodec/loongarch/hevcdsp_lasx.h | 53 ++
libavcodec/loongarch/hevcdsp_lsx.h | 27 +
5 files changed, 596 insertions(+), 1 deletion(-)
create mode 100644 libavcodec/loongarch/hevc_mc.S
create mode 100644 libavcodec/loongarch/hevcdsp_lasx.h
diff --git a/libavcodec/loongarch/Makefile b/libavcodec/loongarch/Makefile
index 07ea97f803..ad98cd4054 100644
--- a/libavcodec/loongarch/Makefile
+++ b/libavcodec/loongarch/Makefile
@@ -28,7 +28,8 @@ LSX-OBJS-$(CONFIG_HEVC_DECODER) += loongarch/hevcdsp_lsx.o \
loongarch/hevc_mc_bi_lsx.o \
loongarch/hevc_mc_uni_lsx.o \
loongarch/hevc_mc_uniw_lsx.o \
- loongarch/hevc_add_res.o
+ loongarch/hevc_add_res.o \
+ loongarch/hevc_mc.o
LSX-OBJS-$(CONFIG_H264DSP) += loongarch/h264idct.o \
loongarch/h264idct_loongarch.o \
loongarch/h264dsp.o
diff --git a/libavcodec/loongarch/hevc_mc.S b/libavcodec/loongarch/hevc_mc.S
new file mode 100644
index 0000000000..c5d553effe
--- /dev/null
+++ b/libavcodec/loongarch/hevc_mc.S
@@ -0,0 +1,471 @@
+/*
+ * Copyright (c) 2023 Loongson Technology Corporation Limited
+ * Contributed by jinbo <jinbo@loongson.cn>
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "loongson_asm.S"
+
+.macro LOAD_VAR bit
+ addi.w t1, a5, 6 //shift
+ addi.w t3, zero, 1 //one
+ sub.w t4, t1, t3
+ sll.w t3, t3, t4 //offset
+.if \bit == 128
+ vreplgr2vr.w vr1, a6 //wx
+ vreplgr2vr.w vr2, t3 //offset
+ vreplgr2vr.w vr3, t1 //shift
+ vreplgr2vr.w vr4, a7 //ox
+.else
+ xvreplgr2vr.w xr1, a6
+ xvreplgr2vr.w xr2, t3
+ xvreplgr2vr.w xr3, t1
+ xvreplgr2vr.w xr4, a7
+.endif
+.endm
+
+.macro HEVC_PEL_UNI_W_PIXELS8_LSX src0, dst0, w
+ vldrepl.d vr0, \src0, 0
+ vsllwil.hu.bu vr0, vr0, 0
+ vexth.wu.hu vr5, vr0
+ vsllwil.wu.hu vr0, vr0, 0
+ vslli.w vr0, vr0, 6
+ vslli.w vr5, vr5, 6
+ vmul.w vr0, vr0, vr1
+ vmul.w vr5, vr5, vr1
+ vadd.w vr0, vr0, vr2
+ vadd.w vr5, vr5, vr2
+ vsra.w vr0, vr0, vr3
+ vsra.w vr5, vr5, vr3
+ vadd.w vr0, vr0, vr4
+ vadd.w vr5, vr5, vr4
+ vssrani.h.w vr5, vr0, 0
+ vssrani.bu.h vr5, vr5, 0
+.if \w == 6
+ fst.s f5, \dst0, 0
+ vstelm.h vr5, \dst0, 4, 2
+.else
+ fst.d f5, \dst0, 0
+.endif
+.endm
+
+.macro HEVC_PEL_UNI_W_PIXELS8x2_LASX src0, dst0, w
+ vldrepl.d vr0, \src0, 0
+ add.d t2, \src0, a3
+ vldrepl.d vr5, t2, 0
+ xvpermi.q xr0, xr5, 0x02
+ xvsllwil.hu.bu xr0, xr0, 0
+ xvexth.wu.hu xr5, xr0
+ xvsllwil.wu.hu xr0, xr0, 0
+ xvslli.w xr0, xr0, 6
+ xvslli.w xr5, xr5, 6
+ xvmul.w xr0, xr0, xr1
+ xvmul.w xr5, xr5, xr1
+ xvadd.w xr0, xr0, xr2
+ xvadd.w xr5, xr5, xr2
+ xvsra.w xr0, xr0, xr3
+ xvsra.w xr5, xr5, xr3
+ xvadd.w xr0, xr0, xr4
+ xvadd.w xr5, xr5, xr4
+ xvssrani.h.w xr5, xr0, 0
+ xvpermi.q xr0, xr5, 0x01
+ xvssrani.bu.h xr0, xr5, 0
+ add.d t3, \dst0, a1
+.if \w == 6
+ vstelm.w vr0, \dst0, 0, 0
+ vstelm.h vr0, \dst0, 4, 2
+ vstelm.w vr0, t3, 0, 2
+ vstelm.h vr0, t3, 4, 6
+.else
+ vstelm.d vr0, \dst0, 0, 0
+ vstelm.d vr0, t3, 0, 1
+.endif
+.endm
+
+.macro HEVC_PEL_UNI_W_PIXELS16_LSX src0, dst0
+ vld vr0, \src0, 0
+ vexth.hu.bu vr7, vr0
+ vexth.wu.hu vr8, vr7
+ vsllwil.wu.hu vr7, vr7, 0
+ vsllwil.hu.bu vr5, vr0, 0
+ vexth.wu.hu vr6, vr5
+ vsllwil.wu.hu vr5, vr5, 0
+ vslli.w vr5, vr5, 6
+ vslli.w vr6, vr6, 6
+ vslli.w vr7, vr7, 6
+ vslli.w vr8, vr8, 6
+ vmul.w vr5, vr5, vr1
+ vmul.w vr6, vr6, vr1
+ vmul.w vr7, vr7, vr1
+ vmul.w vr8, vr8, vr1
+ vadd.w vr5, vr5, vr2
+ vadd.w vr6, vr6, vr2
+ vadd.w vr7, vr7, vr2
+ vadd.w vr8, vr8, vr2
+ vsra.w vr5, vr5, vr3
+ vsra.w vr6, vr6, vr3
+ vsra.w vr7, vr7, vr3
+ vsra.w vr8, vr8, vr3
+ vadd.w vr5, vr5, vr4
+ vadd.w vr6, vr6, vr4
+ vadd.w vr7, vr7, vr4
+ vadd.w vr8, vr8, vr4
+ vssrani.h.w vr6, vr5, 0
+ vssrani.h.w vr8, vr7, 0
+ vssrani.bu.h vr8, vr6, 0
+ vst vr8, \dst0, 0
+.endm
+
+.macro HEVC_PEL_UNI_W_PIXELS16_LASX src0, dst0
+ vld vr0, \src0, 0
+ xvpermi.d xr0, xr0, 0xd8
+ xvsllwil.hu.bu xr0, xr0, 0
+ xvexth.wu.hu xr6, xr0
+ xvsllwil.wu.hu xr5, xr0, 0
+ xvslli.w xr5, xr5, 6
+ xvslli.w xr6, xr6, 6
+ xvmul.w xr5, xr5, xr1
+ xvmul.w xr6, xr6, xr1
+ xvadd.w xr5, xr5, xr2
+ xvadd.w xr6, xr6, xr2
+ xvsra.w xr5, xr5, xr3
+ xvsra.w xr6, xr6, xr3
+ xvadd.w xr5, xr5, xr4
+ xvadd.w xr6, xr6, xr4
+ xvssrani.h.w xr6, xr5, 0
+ xvpermi.q xr7, xr6, 0x01
+ xvssrani.bu.h xr7, xr6, 0
+ vst vr7, \dst0, 0
+.endm
+
+.macro HEVC_PEL_UNI_W_PIXELS32_LASX src0, dst0, w
+.if \w == 16
+ vld vr0, \src0, 0
+ add.d t2, \src0, a3
+ vld vr5, t2, 0
+ xvpermi.q xr0, xr5, 0x02
+.else //w=24/32
+ xvld xr0, \src0, 0
+.endif
+ xvexth.hu.bu xr7, xr0
+ xvexth.wu.hu xr8, xr7
+ xvsllwil.wu.hu xr7, xr7, 0
+ xvsllwil.hu.bu xr5, xr0, 0
+ xvexth.wu.hu xr6, xr5
+ xvsllwil.wu.hu xr5, xr5, 0
+ xvslli.w xr5, xr5, 6
+ xvslli.w xr6, xr6, 6
+ xvslli.w xr7, xr7, 6
+ xvslli.w xr8, xr8, 6
+ xvmul.w xr5, xr5, xr1
+ xvmul.w xr6, xr6, xr1
+ xvmul.w xr7, xr7, xr1
+ xvmul.w xr8, xr8, xr1
+ xvadd.w xr5, xr5, xr2
+ xvadd.w xr6, xr6, xr2
+ xvadd.w xr7, xr7, xr2
+ xvadd.w xr8, xr8, xr2
+ xvsra.w xr5, xr5, xr3
+ xvsra.w xr6, xr6, xr3
+ xvsra.w xr7, xr7, xr3
+ xvsra.w xr8, xr8, xr3
+ xvadd.w xr5, xr5, xr4
+ xvadd.w xr6, xr6, xr4
+ xvadd.w xr7, xr7, xr4
+ xvadd.w xr8, xr8, xr4
+ xvssrani.h.w xr6, xr5, 0
+ xvssrani.h.w xr8, xr7, 0
+ xvssrani.bu.h xr8, xr6, 0
+.if \w == 16
+ vst vr8, \dst0, 0
+ add.d t2, \dst0, a1
+ xvpermi.q xr8, xr8, 0x01
+ vst vr8, t2, 0
+.elseif \w == 24
+ vst vr8, \dst0, 0
+ xvstelm.d xr8, \dst0, 16, 2
+.else
+ xvst xr8, \dst0, 0
+.endif
+.endm
+
+function ff_hevc_put_hevc_pel_uni_w_pixels4_8_lsx
+ LOAD_VAR 128
+ srli.w t0, a4, 1
+.LOOP_PIXELS4:
+ vldrepl.w vr0, a2, 0
+ add.d t1, a2, a3
+ vldrepl.w vr5, t1, 0
+ vsllwil.hu.bu vr0, vr0, 0
+ vsllwil.wu.hu vr0, vr0, 0
+ vsllwil.hu.bu vr5, vr5, 0
+ vsllwil.wu.hu vr5, vr5, 0
+ vslli.w vr0, vr0, 6
+ vslli.w vr5, vr5, 6
+ vmul.w vr0, vr0, vr1
+ vmul.w vr5, vr5, vr1
+ vadd.w vr0, vr0, vr2
+ vadd.w vr5, vr5, vr2
+ vsra.w vr0, vr0, vr3
+ vsra.w vr5, vr5, vr3
+ vadd.w vr0, vr0, vr4
+ vadd.w vr5, vr5, vr4
+ vssrani.h.w vr5, vr0, 0
+ vssrani.bu.h vr5, vr5, 0
+ fst.s f5, a0, 0
+ add.d t2, a0, a1
+ vstelm.w vr5, t2, 0, 1
+ alsl.d a2, a3, a2, 1
+ alsl.d a0, a1, a0, 1
+ addi.w t0, t0, -1
+ bnez t0, .LOOP_PIXELS4
+endfunc
+
+function ff_hevc_put_hevc_pel_uni_w_pixels6_8_lsx
+ LOAD_VAR 128
+.LOOP_PIXELS6:
+ HEVC_PEL_UNI_W_PIXELS8_LSX a2, a0, 6
+ add.d a2, a2, a3
+ add.d a0, a0, a1
+ addi.w a4, a4, -1
+ bnez a4, .LOOP_PIXELS6
+endfunc
+
+function ff_hevc_put_hevc_pel_uni_w_pixels6_8_lasx
+ LOAD_VAR 256
+ srli.w t0, a4, 1
+.LOOP_PIXELS6_LASX:
+ HEVC_PEL_UNI_W_PIXELS8x2_LASX a2, a0, 6
+ alsl.d a2, a3, a2, 1
+ alsl.d a0, a1, a0, 1
+ addi.w t0, t0, -1
+ bnez t0, .LOOP_PIXELS6_LASX
+endfunc
+
+function ff_hevc_put_hevc_pel_uni_w_pixels8_8_lsx
+ LOAD_VAR 128
+.LOOP_PIXELS8:
+ HEVC_PEL_UNI_W_PIXELS8_LSX a2, a0, 8
+ add.d a2, a2, a3
+ add.d a0, a0, a1
+ addi.w a4, a4, -1
+ bnez a4, .LOOP_PIXELS8
+endfunc
+
+function ff_hevc_put_hevc_pel_uni_w_pixels8_8_lasx
+ LOAD_VAR 256
+ srli.w t0, a4, 1
+.LOOP_PIXELS8_LASX:
+ HEVC_PEL_UNI_W_PIXELS8x2_LASX a2, a0, 8
+ alsl.d a2, a3, a2, 1
+ alsl.d a0, a1, a0, 1
+ addi.w t0, t0, -1
+ bnez t0, .LOOP_PIXELS8_LASX
+endfunc
+
+function ff_hevc_put_hevc_pel_uni_w_pixels12_8_lsx
+ LOAD_VAR 128
+.LOOP_PIXELS12:
+ vld vr0, a2, 0
+ vexth.hu.bu vr7, vr0
+ vsllwil.wu.hu vr7, vr7, 0
+ vsllwil.hu.bu vr5, vr0, 0
+ vexth.wu.hu vr6, vr5
+ vsllwil.wu.hu vr5, vr5, 0
+ vslli.w vr5, vr5, 6
+ vslli.w vr6, vr6, 6
+ vslli.w vr7, vr7, 6
+ vmul.w vr5, vr5, vr1
+ vmul.w vr6, vr6, vr1
+ vmul.w vr7, vr7, vr1
+ vadd.w vr5, vr5, vr2
+ vadd.w vr6, vr6, vr2
+ vadd.w vr7, vr7, vr2
+ vsra.w vr5, vr5, vr3
+ vsra.w vr6, vr6, vr3
+ vsra.w vr7, vr7, vr3
+ vadd.w vr5, vr5, vr4
+ vadd.w vr6, vr6, vr4
+ vadd.w vr7, vr7, vr4
+ vssrani.h.w vr6, vr5, 0
+ vssrani.h.w vr7, vr7, 0
+ vssrani.bu.h vr7, vr6, 0
+ fst.d f7, a0, 0
+ vstelm.w vr7, a0, 8, 2
+ add.d a2, a2, a3
+ add.d a0, a0, a1
+ addi.w a4, a4, -1
+ bnez a4, .LOOP_PIXELS12
+endfunc
+
+function ff_hevc_put_hevc_pel_uni_w_pixels12_8_lasx
+ LOAD_VAR 256
+.LOOP_PIXELS12_LASX:
+ vld vr0, a2, 0
+ xvpermi.d xr0, xr0, 0xd8
+ xvsllwil.hu.bu xr0, xr0, 0
+ xvexth.wu.hu xr6, xr0
+ xvsllwil.wu.hu xr5, xr0, 0
+ xvslli.w xr5, xr5, 6
+ xvslli.w xr6, xr6, 6
+ xvmul.w xr5, xr5, xr1
+ xvmul.w xr6, xr6, xr1
+ xvadd.w xr5, xr5, xr2
+ xvadd.w xr6, xr6, xr2
+ xvsra.w xr5, xr5, xr3
+ xvsra.w xr6, xr6, xr3
+ xvadd.w xr5, xr5, xr4
+ xvadd.w xr6, xr6, xr4
+ xvssrani.h.w xr6, xr5, 0
+ xvpermi.q xr7, xr6, 0x01
+ xvssrani.bu.h xr7, xr6, 0
+ fst.d f7, a0, 0
+ vstelm.w vr7, a0, 8, 2
+ add.d a2, a2, a3
+ add.d a0, a0, a1
+ addi.w a4, a4, -1
+ bnez a4, .LOOP_PIXELS12_LASX
+endfunc
+
+function ff_hevc_put_hevc_pel_uni_w_pixels16_8_lsx
+ LOAD_VAR 128
+.LOOP_PIXELS16:
+ HEVC_PEL_UNI_W_PIXELS16_LSX a2, a0
+ add.d a2, a2, a3
+ add.d a0, a0, a1
+ addi.w a4, a4, -1
+ bnez a4, .LOOP_PIXELS16
+endfunc
+
+function ff_hevc_put_hevc_pel_uni_w_pixels16_8_lasx
+ LOAD_VAR 256
+ srli.w t0, a4, 1
+.LOOP_PIXELS16_LASX:
+ HEVC_PEL_UNI_W_PIXELS32_LASX a2, a0, 16
+ alsl.d a2, a3, a2, 1
+ alsl.d a0, a1, a0, 1
+ addi.w t0, t0, -1
+ bnez t0, .LOOP_PIXELS16_LASX
+endfunc
+
+function ff_hevc_put_hevc_pel_uni_w_pixels24_8_lsx
+ LOAD_VAR 128
+.LOOP_PIXELS24:
+ HEVC_PEL_UNI_W_PIXELS16_LSX a2, a0
+ addi.d t0, a2, 16
+ addi.d t1, a0, 16
+ HEVC_PEL_UNI_W_PIXELS8_LSX t0, t1, 8
+ add.d a2, a2, a3
+ add.d a0, a0, a1
+ addi.w a4, a4, -1
+ bnez a4, .LOOP_PIXELS24
+endfunc
+
+function ff_hevc_put_hevc_pel_uni_w_pixels24_8_lasx
+ LOAD_VAR 256
+.LOOP_PIXELS24_LASX:
+ HEVC_PEL_UNI_W_PIXELS32_LASX a2, a0, 24
+ add.d a2, a2, a3
+ add.d a0, a0, a1
+ addi.w a4, a4, -1
+ bnez a4, .LOOP_PIXELS24_LASX
+endfunc
+
+function ff_hevc_put_hevc_pel_uni_w_pixels32_8_lsx
+ LOAD_VAR 128
+.LOOP_PIXELS32:
+ HEVC_PEL_UNI_W_PIXELS16_LSX a2, a0
+ addi.d t0, a2, 16
+ addi.d t1, a0, 16
+ HEVC_PEL_UNI_W_PIXELS16_LSX t0, t1
+ add.d a2, a2, a3
+ add.d a0, a0, a1
+ addi.w a4, a4, -1
+ bnez a4, .LOOP_PIXELS32
+endfunc
+
+function ff_hevc_put_hevc_pel_uni_w_pixels32_8_lasx
+ LOAD_VAR 256
+.LOOP_PIXELS32_LASX:
+ HEVC_PEL_UNI_W_PIXELS32_LASX a2, a0, 32
+ add.d a2, a2, a3
+ add.d a0, a0, a1
+ addi.w a4, a4, -1
+ bnez a4, .LOOP_PIXELS32_LASX
+endfunc
+
+function ff_hevc_put_hevc_pel_uni_w_pixels48_8_lsx
+ LOAD_VAR 128
+.LOOP_PIXELS48:
+ HEVC_PEL_UNI_W_PIXELS16_LSX a2, a0
+ addi.d t0, a2, 16
+ addi.d t1, a0, 16
+ HEVC_PEL_UNI_W_PIXELS16_LSX t0, t1
+ addi.d t0, a2, 32
+ addi.d t1, a0, 32
+ HEVC_PEL_UNI_W_PIXELS16_LSX t0, t1
+ add.d a2, a2, a3
+ add.d a0, a0, a1
+ addi.w a4, a4, -1
+ bnez a4, .LOOP_PIXELS48
+endfunc
+
+function ff_hevc_put_hevc_pel_uni_w_pixels48_8_lasx
+ LOAD_VAR 256
+.LOOP_PIXELS48_LASX:
+ HEVC_PEL_UNI_W_PIXELS32_LASX a2, a0, 32
+ addi.d t0, a2, 32
+ addi.d t1, a0, 32
+ HEVC_PEL_UNI_W_PIXELS16_LASX t0, t1
+ add.d a2, a2, a3
+ add.d a0, a0, a1
+ addi.w a4, a4, -1
+ bnez a4, .LOOP_PIXELS48_LASX
+endfunc
+
+function ff_hevc_put_hevc_pel_uni_w_pixels64_8_lsx
+ LOAD_VAR 128
+.LOOP_PIXELS64:
+ HEVC_PEL_UNI_W_PIXELS16_LSX a2, a0
+ addi.d t0, a2, 16
+ addi.d t1, a0, 16
+ HEVC_PEL_UNI_W_PIXELS16_LSX t0, t1
+ addi.d t0, a2, 32
+ addi.d t1, a0, 32
+ HEVC_PEL_UNI_W_PIXELS16_LSX t0, t1
+ addi.d t0, a2, 48
+ addi.d t1, a0, 48
+ HEVC_PEL_UNI_W_PIXELS16_LSX t0, t1
+ add.d a2, a2, a3
+ add.d a0, a0, a1
+ addi.w a4, a4, -1
+ bnez a4, .LOOP_PIXELS64
+endfunc
+
+function ff_hevc_put_hevc_pel_uni_w_pixels64_8_lasx
+ LOAD_VAR 256
+.LOOP_PIXELS64_LASX:
+ HEVC_PEL_UNI_W_PIXELS32_LASX a2, a0, 32
+ addi.d t0, a2, 32
+ addi.d t1, a0, 32
+ HEVC_PEL_UNI_W_PIXELS32_LASX t0, t1, 32
+ add.d a2, a2, a3
+ add.d a0, a0, a1
+ addi.w a4, a4, -1
+ bnez a4, .LOOP_PIXELS64_LASX
+endfunc
diff --git a/libavcodec/loongarch/hevcdsp_init_loongarch.c b/libavcodec/loongarch/hevcdsp_init_loongarch.c
index a8f753dc86..d0ee99d6b5 100644
--- a/libavcodec/loongarch/hevcdsp_init_loongarch.c
+++ b/libavcodec/loongarch/hevcdsp_init_loongarch.c
@@ -22,6 +22,7 @@
#include "libavutil/loongarch/cpu.h"
#include "hevcdsp_lsx.h"
+#include "hevcdsp_lasx.h"
void ff_hevc_dsp_init_loongarch(HEVCDSPContext *c, const int bit_depth)
{
@@ -160,6 +161,26 @@ void ff_hevc_dsp_init_loongarch(HEVCDSPContext *c, const int bit_depth)
c->put_hevc_epel_uni[6][1][1] = ff_hevc_put_hevc_uni_epel_hv24_8_lsx;
c->put_hevc_epel_uni[7][1][1] = ff_hevc_put_hevc_uni_epel_hv32_8_lsx;
+ c->put_hevc_qpel_uni_w[1][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels4_8_lsx;
+ c->put_hevc_qpel_uni_w[2][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels6_8_lsx;
+ c->put_hevc_qpel_uni_w[3][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels8_8_lsx;
+ c->put_hevc_qpel_uni_w[4][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels12_8_lsx;
+ c->put_hevc_qpel_uni_w[5][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels16_8_lsx;
+ c->put_hevc_qpel_uni_w[6][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels24_8_lsx;
+ c->put_hevc_qpel_uni_w[7][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels32_8_lsx;
+ c->put_hevc_qpel_uni_w[8][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels48_8_lsx;
+ c->put_hevc_qpel_uni_w[9][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels64_8_lsx;
+
+ c->put_hevc_epel_uni_w[1][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels4_8_lsx;
+ c->put_hevc_epel_uni_w[2][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels6_8_lsx;
+ c->put_hevc_epel_uni_w[3][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels8_8_lsx;
+ c->put_hevc_epel_uni_w[4][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels12_8_lsx;
+ c->put_hevc_epel_uni_w[5][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels16_8_lsx;
+ c->put_hevc_epel_uni_w[6][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels24_8_lsx;
+ c->put_hevc_epel_uni_w[7][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels32_8_lsx;
+ c->put_hevc_epel_uni_w[8][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels48_8_lsx;
+ c->put_hevc_epel_uni_w[9][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels64_8_lsx;
+
c->put_hevc_qpel_uni_w[3][1][1] = ff_hevc_put_hevc_uni_w_qpel_hv8_8_lsx;
c->put_hevc_qpel_uni_w[5][1][1] = ff_hevc_put_hevc_uni_w_qpel_hv16_8_lsx;
c->put_hevc_qpel_uni_w[6][1][1] = ff_hevc_put_hevc_uni_w_qpel_hv24_8_lsx;
@@ -196,4 +217,26 @@ void ff_hevc_dsp_init_loongarch(HEVCDSPContext *c, const int bit_depth)
c->add_residual[3] = ff_hevc_add_residual32x32_8_lsx;
}
}
+
+ if (have_lasx(cpu_flags)) {
+ if (bit_depth == 8) {
+ c->put_hevc_qpel_uni_w[2][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels6_8_lasx;
+ c->put_hevc_qpel_uni_w[3][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels8_8_lasx;
+ c->put_hevc_qpel_uni_w[4][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels12_8_lasx;
+ c->put_hevc_qpel_uni_w[5][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels16_8_lasx;
+ c->put_hevc_qpel_uni_w[6][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels24_8_lasx;
+ c->put_hevc_qpel_uni_w[7][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels32_8_lasx;
+ c->put_hevc_qpel_uni_w[8][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels48_8_lasx;
+ c->put_hevc_qpel_uni_w[9][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels64_8_lasx;
+
+ c->put_hevc_epel_uni_w[2][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels6_8_lasx;
+ c->put_hevc_epel_uni_w[3][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels8_8_lasx;
+ c->put_hevc_epel_uni_w[4][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels12_8_lasx;
+ c->put_hevc_epel_uni_w[5][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels16_8_lasx;
+ c->put_hevc_epel_uni_w[6][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels24_8_lasx;
+ c->put_hevc_epel_uni_w[7][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels32_8_lasx;
+ c->put_hevc_epel_uni_w[8][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels48_8_lasx;
+ c->put_hevc_epel_uni_w[9][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels64_8_lasx;
+ }
+ }
}
diff --git a/libavcodec/loongarch/hevcdsp_lasx.h b/libavcodec/loongarch/hevcdsp_lasx.h
new file mode 100644
index 0000000000..819c3c3ecf
--- /dev/null
+++ b/libavcodec/loongarch/hevcdsp_lasx.h
@@ -0,0 +1,53 @@
+/*
+ * Copyright (c) 2023 Loongson Technology Corporation Limited
+ * Contributed by jinbo <jinbo@loongson.cn>
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#ifndef AVCODEC_LOONGARCH_HEVCDSP_LASX_H
+#define AVCODEC_LOONGARCH_HEVCDSP_LASX_H
+
+#include "libavcodec/hevcdsp.h"
+
+#define PEL_UNI_W(PEL, DIR, WIDTH) \
+void ff_hevc_put_hevc_##PEL##_uni_w_##DIR##WIDTH##_8_lasx(uint8_t *dst, \
+ ptrdiff_t \
+ dst_stride, \
+ const uint8_t *src, \
+ ptrdiff_t \
+ src_stride, \
+ int height, \
+ int denom, \
+ int wx, \
+ int ox, \
+ intptr_t mx, \
+ intptr_t my, \
+ int width)
+
+PEL_UNI_W(pel, pixels, 6);
+PEL_UNI_W(pel, pixels, 8);
+PEL_UNI_W(pel, pixels, 12);
+PEL_UNI_W(pel, pixels, 16);
+PEL_UNI_W(pel, pixels, 24);
+PEL_UNI_W(pel, pixels, 32);
+PEL_UNI_W(pel, pixels, 48);
+PEL_UNI_W(pel, pixels, 64);
+
+#undef PEL_UNI_W
+
+#endif // #ifndef AVCODEC_LOONGARCH_HEVCDSP_LASX_H
diff --git a/libavcodec/loongarch/hevcdsp_lsx.h b/libavcodec/loongarch/hevcdsp_lsx.h
index ac509984fd..0d724a90ef 100644
--- a/libavcodec/loongarch/hevcdsp_lsx.h
+++ b/libavcodec/loongarch/hevcdsp_lsx.h
@@ -232,4 +232,31 @@ void ff_hevc_add_residual8x8_8_lsx(uint8_t *dst, const int16_t *res, ptrdiff_t s
void ff_hevc_add_residual16x16_8_lsx(uint8_t *dst, const int16_t *res, ptrdiff_t stride);
void ff_hevc_add_residual32x32_8_lsx(uint8_t *dst, const int16_t *res, ptrdiff_t stride);
+#define PEL_UNI_W(PEL, DIR, WIDTH) \
+void ff_hevc_put_hevc_##PEL##_uni_w_##DIR##WIDTH##_8_lsx(uint8_t *dst, \
+ ptrdiff_t \
+ dst_stride, \
+ const uint8_t *src, \
+ ptrdiff_t \
+ src_stride, \
+ int height, \
+ int denom, \
+ int wx, \
+ int ox, \
+ intptr_t mx, \
+ intptr_t my, \
+ int width)
+
+PEL_UNI_W(pel, pixels, 4);
+PEL_UNI_W(pel, pixels, 6);
+PEL_UNI_W(pel, pixels, 8);
+PEL_UNI_W(pel, pixels, 12);
+PEL_UNI_W(pel, pixels, 16);
+PEL_UNI_W(pel, pixels, 24);
+PEL_UNI_W(pel, pixels, 32);
+PEL_UNI_W(pel, pixels, 48);
+PEL_UNI_W(pel, pixels, 64);
+
+#undef PEL_UNI_W
+
#endif // #ifndef AVCODEC_LOONGARCH_HEVCDSP_LSX_H
--
2.20.1
* Re: [FFmpeg-devel] [PATCH v2 3/7] avcodec/hevc: Add pel_uni_w_pixels4/6/8/12/16/24/32/48/64 asm opt
2023-12-27 4:50 ` [FFmpeg-devel] [PATCH v2 3/7] avcodec/hevc: Add pel_uni_w_pixels4/6/8/12/16/24/32/48/64 " jinbo
@ 2023-12-28 2:18 ` yinshiyou-hf
0 siblings, 0 replies; 10+ messages in thread
From: yinshiyou-hf @ 2023-12-28 2:18 UTC (permalink / raw)
To: FFmpeg development discussions and patches; +Cc: jinbo
> -----Original Message-----
> From: jinbo <jinbo@loongson.cn>
> Sent: 2023-12-27 12:50:15 (Wednesday)
> To: ffmpeg-devel@ffmpeg.org
> Cc: jinbo <jinbo@loongson.cn>
> Subject: [FFmpeg-devel] [PATCH v2 3/7] avcodec/hevc: Add pel_uni_w_pixels4/6/8/12/16/24/32/48/64 asm opt
>
> +
> +.macro HEVC_PEL_UNI_W_PIXELS8_LSX src0, dst0, w
> + vldrepl.d vr0, \src0, 0
> + vsllwil.hu.bu vr0, vr0, 0
> + vexth.wu.hu vr5, vr0
> + vsllwil.wu.hu vr0, vr0, 0
> + vslli.w vr0, vr0, 6
> + vslli.w vr5, vr5, 6
> + vmul.w vr0, vr0, vr1
> + vmul.w vr5, vr5, vr1
> + vadd.w vr0, vr0, vr2
> + vadd.w vr5, vr5, vr2
You can use 'vmadd.w' here.
> + vsra.w vr0, vr0, vr3
> + vsra.w vr5, vr5, vr3
> + vadd.w vr0, vr0, vr4
> + vadd.w vr5, vr5, vr4
> + vssrani.h.w vr5, vr0, 0
> + vssrani.bu.h vr5, vr5, 0
> +.if \w == 6
> + fst.s f5, \dst0, 0
> + vstelm.h vr5, \dst0, 4, 2
> +.else
> + fst.d f5, \dst0, 0
> +.endif
> +.endm
> +
> +.macro HEVC_PEL_UNI_W_PIXELS8x2_LASX src0, dst0, w
> + vldrepl.d vr0, \src0, 0
> + add.d t2, \src0, a3
> + vldrepl.d vr5, t2, 0
> + xvpermi.q xr0, xr5, 0x02
> + xvsllwil.hu.bu xr0, xr0, 0
> + xvexth.wu.hu xr5, xr0
> + xvsllwil.wu.hu xr0, xr0, 0
> + xvslli.w xr0, xr0, 6
> + xvslli.w xr5, xr5, 6
> + xvmul.w xr0, xr0, xr1
> + xvmul.w xr5, xr5, xr1
> + xvadd.w xr0, xr0, xr2
> + xvadd.w xr5, xr5, xr2
Using 'vmadd.w' would be better.
> + xvsra.w xr0, xr0, xr3
> + xvsra.w xr5, xr5, xr3
> + xvadd.w xr0, xr0, xr4
> + xvadd.w xr5, xr5, xr4
> + xvssrani.h.w xr5, xr0, 0
> + xvpermi.q xr0, xr5, 0x01
> + xvssrani.bu.h xr0, xr5, 0
> + add.d t3, \dst0, a1
> +.if \w == 6
> + vstelm.w vr0, \dst0, 0, 0
> + vstelm.h vr0, \dst0, 4, 2
> + vstelm.w vr0, t3, 0, 2
> + vstelm.h vr0, t3, 4, 6
> +.else
> + vstelm.d vr0, \dst0, 0, 0
> + vstelm.d vr0, t3, 0, 1
> +.endif
> +.endm
> +
> +.macro HEVC_PEL_UNI_W_PIXELS16_LSX src0, dst0
> + vld vr0, \src0, 0
> + vexth.hu.bu vr7, vr0
> + vexth.wu.hu vr8, vr7
> + vsllwil.wu.hu vr7, vr7, 0
> + vsllwil.hu.bu vr5, vr0, 0
> + vexth.wu.hu vr6, vr5
> + vsllwil.wu.hu vr5, vr5, 0
> + vslli.w vr5, vr5, 6
> + vslli.w vr6, vr6, 6
> + vslli.w vr7, vr7, 6
> + vslli.w vr8, vr8, 6
> + vmul.w vr5, vr5, vr1
> + vmul.w vr6, vr6, vr1
> + vmul.w vr7, vr7, vr1
> + vmul.w vr8, vr8, vr1
> + vadd.w vr5, vr5, vr2
> + vadd.w vr6, vr6, vr2
> + vadd.w vr7, vr7, vr2
> + vadd.w vr8, vr8, vr2
Use 'vmadd.w'; please check the rest of your code for the same pattern.
* [FFmpeg-devel] [PATCH v2 4/7] avcodec/hevc: Add qpel_uni_w_v|h4/6/8/12/16/24/32/48/64 asm opt
2023-12-27 4:50 [FFmpeg-devel] [PATCH v2] [loongarch] Add hevc 128-bit & 256-bit asm optimizations jinbo
` (2 preceding siblings ...)
2023-12-27 4:50 ` [FFmpeg-devel] [PATCH v2 3/7] avcodec/hevc: Add pel_uni_w_pixels4/6/8/12/16/24/32/48/64 " jinbo
@ 2023-12-27 4:50 ` jinbo
2023-12-27 4:50 ` [FFmpeg-devel] [PATCH v2 5/7] avcodec/hevc: Add epel_uni_w_hv4/6/8/12/16/24/32/48/64 " jinbo
` (2 subsequent siblings)
6 siblings, 0 replies; 10+ messages in thread
From: jinbo @ 2023-12-27 4:50 UTC (permalink / raw)
To: ffmpeg-devel; +Cc: jinbo
tests/checkasm/checkasm:        C        LSX      LASX
put_hevc_qpel_uni_w_h4_8_c:     6.5      1.7      1.2
put_hevc_qpel_uni_w_h6_8_c:     14.5     4.5      3.7
put_hevc_qpel_uni_w_h8_8_c:     24.5     5.7      4.5
put_hevc_qpel_uni_w_h12_8_c:    54.7     17.5     12.0
put_hevc_qpel_uni_w_h16_8_c:    96.5     22.7     13.2
put_hevc_qpel_uni_w_h24_8_c:    216.0    51.2     33.2
put_hevc_qpel_uni_w_h32_8_c:    385.7    87.0     53.2
put_hevc_qpel_uni_w_h48_8_c:    860.5    192.0    113.2
put_hevc_qpel_uni_w_h64_8_c:    1531.0   334.2    200.0
put_hevc_qpel_uni_w_v4_8_c:     8.0      1.7
put_hevc_qpel_uni_w_v6_8_c:     17.2     4.5
put_hevc_qpel_uni_w_v8_8_c:     29.5     6.0      5.2
put_hevc_qpel_uni_w_v12_8_c:    65.2     16.0     11.7
put_hevc_qpel_uni_w_v16_8_c:    116.5    20.5     14.0
put_hevc_qpel_uni_w_v24_8_c:    259.2    48.5     37.2
put_hevc_qpel_uni_w_v32_8_c:    459.5    80.5     56.0
put_hevc_qpel_uni_w_v48_8_c:    1028.5   180.2    126.5
put_hevc_qpel_uni_w_v64_8_c:    1831.2   319.2    224.2
Speedup of decoding H265 4K 30FPS 30Mbps on
3A6000 with 8 threads is 4fps (48fps-->52fps).
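For orientation, a hedged scalar sketch of the vertical 8-bit case these
routines vectorize (the horizontal variants are the same with a sample
stride of 1 and mx selecting the filter). The 8 signed taps are read from
ff_hevc_qpel_filters as the assembly does; av_clip_uint8() is assumed from
libavutil, and the table layout is assumed to match the generic C code.

    static void qpel_uni_w_v_8_ref(uint8_t *dst, ptrdiff_t dst_stride,
                                   const uint8_t *src, ptrdiff_t src_stride,
                                   int height, int denom, int wx, int ox,
                                   int my, int width)
    {
        const int8_t *filter = ff_hevc_qpel_filters[my - 1]; /* 8 signed taps */
        const int shift  = denom + 6;
        const int offset = 1 << (shift - 1);

        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int sum = 0;
                for (int k = 0; k < 8; k++)   /* QPEL_FILTER(): rows -3..+4 */
                    sum += filter[k] * src[x + (k - 3) * src_stride];
                dst[x] = av_clip_uint8(((sum * wx + offset) >> shift) + ox);
            }
            dst += dst_stride;
            src += src_stride;
        }
    }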
Change-Id: I1178848541d90083869225ba98a02e6aa8bb8c5a
---
libavcodec/loongarch/hevc_mc.S | 1294 +++++++++++++++++
libavcodec/loongarch/hevcdsp_init_loongarch.c | 38 +
libavcodec/loongarch/hevcdsp_lasx.h | 18 +
libavcodec/loongarch/hevcdsp_lsx.h | 20 +
4 files changed, 1370 insertions(+)
diff --git a/libavcodec/loongarch/hevc_mc.S b/libavcodec/loongarch/hevc_mc.S
index c5d553effe..2ee338fb8e 100644
--- a/libavcodec/loongarch/hevc_mc.S
+++ b/libavcodec/loongarch/hevc_mc.S
@@ -21,6 +21,8 @@
#include "loongson_asm.S"
+.extern ff_hevc_qpel_filters
+
.macro LOAD_VAR bit
addi.w t1, a5, 6 //shift
addi.w t3, zero, 1 //one
@@ -469,3 +471,1295 @@ function ff_hevc_put_hevc_pel_uni_w_pixels64_8_lasx
addi.w a4, a4, -1
bnez a4, .LOOP_PIXELS64_LASX
endfunc
+
+.macro vhaddw.d.h in0
+ vhaddw.w.h \in0, \in0, \in0
+ vhaddw.d.w \in0, \in0, \in0
+.endm
+
+.macro xvhaddw.d.h in0
+ xvhaddw.w.h \in0, \in0, \in0
+ xvhaddw.d.w \in0, \in0, \in0
+.endm
+
+function ff_hevc_put_hevc_qpel_uni_w_v4_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ add.d t2, t1, a3 //stride * 4
+ sub.d a2, a2, t1 //src -= stride*3
+ fld.s f6, a2, 0 //0
+ fldx.s f7, a2, a3 //1
+ fldx.s f8, a2, t0 //2
+ add.d a2, a2, t1
+ fld.s f9, a2, 0 //3
+ fldx.s f10, a2, a3 //4
+ fldx.s f11, a2, t0 //5
+ fldx.s f12, a2, t1 //6
+ add.d a2, a2, t2
+ vilvl.b vr6, vr7, vr6
+ vilvl.b vr7, vr9, vr8
+ vilvl.b vr8, vr11, vr10
+ vilvl.b vr9, vr13, vr12
+ vilvl.h vr6, vr7, vr6
+ vilvl.h vr7, vr9, vr8
+ vilvl.w vr8, vr7, vr6
+ vilvh.w vr9, vr7, vr6
+.LOOP_V4:
+ fld.s f13, a2, 0 //7
+ fldx.s f14, a2, a3 //8 next loop
+ add.d a2, a2, t0
+ vextrins.b vr8, vr13, 0x70
+ vextrins.b vr8, vr13, 0xf1
+ vextrins.b vr9, vr13, 0x72
+ vextrins.b vr9, vr13, 0xf3
+ vbsrl.v vr10, vr8, 1
+ vbsrl.v vr11, vr9, 1
+ vextrins.b vr10, vr14, 0x70
+ vextrins.b vr10, vr14, 0xf1
+ vextrins.b vr11, vr14, 0x72
+ vextrins.b vr11, vr14, 0xf3
+ vdp2.h.bu.b vr6, vr8, vr5 //QPEL_FILTER(src, stride)
+ vdp2.h.bu.b vr7, vr9, vr5
+ vdp2.h.bu.b vr12, vr10, vr5
+ vdp2.h.bu.b vr13, vr11, vr5
+ vbsrl.v vr8, vr10, 1
+ vbsrl.v vr9, vr11, 1
+ vhaddw.d.h vr6
+ vhaddw.d.h vr7
+ vhaddw.d.h vr12
+ vhaddw.d.h vr13
+ vpickev.w vr6, vr7, vr6
+ vpickev.w vr12, vr13, vr12
+ vmulwev.w.h vr6, vr6, vr1 //QPEL_FILTER(src, stride) * wx
+ vmulwev.w.h vr12, vr12, vr1
+ vadd.w vr6, vr6, vr2
+ vsra.w vr6, vr6, vr3
+ vadd.w vr6, vr6, vr4
+ vadd.w vr12, vr12, vr2
+ vsra.w vr12, vr12, vr3
+ vadd.w vr12, vr12, vr4
+ vssrani.h.w vr12, vr6, 0
+ vssrani.bu.h vr12, vr12, 0
+ fst.s f12, a0, 0
+ add.d a0, a0, a1
+ vstelm.w vr12, a0, 0, 1
+ add.d a0, a0, a1
+ addi.d a4, a4, -2
+ bnez a4, .LOOP_V4
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_v6_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ add.d t2, t1, a3 //stride * 4
+ sub.d a2, a2, t1 //src -= stride*3
+ fld.d f6, a2, 0
+ fldx.d f7, a2, a3
+ fldx.d f8, a2, t0
+ add.d a2, a2, t1
+ fld.d f9, a2, 0
+ fldx.d f10, a2, a3
+ fldx.d f11, a2, t0
+ fldx.d f12, a2, t1
+ add.d a2, a2, t2
+ vilvl.b vr6, vr7, vr6 //transpose 8x6 to 3x16
+ vilvl.b vr7, vr9, vr8
+ vilvl.b vr8, vr11, vr10
+ vilvl.b vr9, vr13, vr12
+ vilvl.h vr10, vr7, vr6
+ vilvh.h vr11, vr7, vr6
+ vilvl.h vr12, vr9, vr8
+ vilvh.h vr13, vr9, vr8
+ vilvl.w vr6, vr12, vr10
+ vilvh.w vr7, vr12, vr10
+ vilvl.w vr8, vr13, vr11
+.LOOP_V6:
+ fld.d f13, a2, 0
+ add.d a2, a2, a3
+ vextrins.b vr6, vr13, 0x70
+ vextrins.b vr6, vr13, 0xf1
+ vextrins.b vr7, vr13, 0x72
+ vextrins.b vr7, vr13, 0xf3
+ vextrins.b vr8, vr13, 0x74
+ vextrins.b vr8, vr13, 0xf5
+ vdp2.h.bu.b vr10, vr6, vr5 //QPEL_FILTER(src, stride)
+ vdp2.h.bu.b vr11, vr7, vr5
+ vdp2.h.bu.b vr12, vr8, vr5
+ vbsrl.v vr6, vr6, 1
+ vbsrl.v vr7, vr7, 1
+ vbsrl.v vr8, vr8, 1
+ vhaddw.d.h vr10
+ vhaddw.d.h vr11
+ vhaddw.d.h vr12
+ vpickev.w vr10, vr11, vr10
+ vpickev.w vr11, vr13, vr12
+ vmulwev.w.h vr10, vr10, vr1 //QPEL_FILTER(src, stride) * wx
+ vmulwev.w.h vr11, vr11, vr1
+ vadd.w vr10, vr10, vr2
+ vadd.w vr11, vr11, vr2
+ vsra.w vr10, vr10, vr3
+ vsra.w vr11, vr11, vr3
+ vadd.w vr10, vr10, vr4
+ vadd.w vr11, vr11, vr4
+ vssrani.h.w vr11, vr10, 0
+ vssrani.bu.h vr11, vr11, 0
+ fst.s f11, a0, 0
+ vstelm.h vr11, a0, 4, 2
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_V6
+endfunc
+
+// transpose 8x8b to 4x16b
+.macro TRANSPOSE8X8B_LSX in0, in1, in2, in3, in4, in5, in6, in7, \
+ out0, out1, out2, out3
+ vilvl.b \in0, \in1, \in0
+ vilvl.b \in1, \in3, \in2
+ vilvl.b \in2, \in5, \in4
+ vilvl.b \in3, \in7, \in6
+ vilvl.h \in4, \in1, \in0
+ vilvh.h \in5, \in1, \in0
+ vilvl.h \in6, \in3, \in2
+ vilvh.h \in7, \in3, \in2
+ vilvl.w \out0, \in6, \in4
+ vilvh.w \out1, \in6, \in4
+ vilvl.w \out2, \in7, \in5
+ vilvh.w \out3, \in7, \in5
+.endm
+
+.macro PUT_HEVC_QPEL_UNI_W_V8_LSX in0, in1, in2, in3, out0, out1, pos
+.if \pos == 0
+ vextrins.b \in0, vr13, 0x70 //insert the 8th load
+ vextrins.b \in0, vr13, 0xf1
+ vextrins.b \in1, vr13, 0x72
+ vextrins.b \in1, vr13, 0xf3
+ vextrins.b \in2, vr13, 0x74
+ vextrins.b \in2, vr13, 0xf5
+ vextrins.b \in3, vr13, 0x76
+ vextrins.b \in3, vr13, 0xf7
+.else// \pos == 8
+ vextrins.b \in0, vr13, 0x78
+ vextrins.b \in0, vr13, 0xf9
+ vextrins.b \in1, vr13, 0x7a
+ vextrins.b \in1, vr13, 0xfb
+ vextrins.b \in2, vr13, 0x7c
+ vextrins.b \in2, vr13, 0xfd
+ vextrins.b \in3, vr13, 0x7e
+ vextrins.b \in3, vr13, 0xff
+.endif
+ vdp2.h.bu.b \out0, \in0, vr5 //QPEL_FILTER(src, stride)
+ vdp2.h.bu.b \out1, \in1, vr5
+ vdp2.h.bu.b vr12, \in2, vr5
+ vdp2.h.bu.b vr20, \in3, vr5
+ vbsrl.v \in0, \in0, 1 //Back up previous 7 loaded data,
+ vbsrl.v \in1, \in1, 1 //so just need to insert the 8th
+ vbsrl.v \in2, \in2, 1 //load in the next loop.
+ vbsrl.v \in3, \in3, 1
+ vhaddw.d.h \out0
+ vhaddw.d.h \out1
+ vhaddw.d.h vr12
+ vhaddw.d.h vr20
+ vpickev.w \out0, \out1, \out0
+ vpickev.w \out1, vr20, vr12
+ vmulwev.w.h \out0, \out0, vr1 //QPEL_FILTER(src, stride) * wx
+ vmulwev.w.h \out1, \out1, vr1
+ vadd.w \out0, \out0, vr2
+ vadd.w \out1, \out1, vr2
+ vsra.w \out0, \out0, vr3
+ vsra.w \out1, \out1, vr3
+ vadd.w \out0, \out0, vr4
+ vadd.w \out1, \out1, vr4
+.endm
+
+function ff_hevc_put_hevc_qpel_uni_w_v8_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ add.d t2, t1, a3 //stride * 4
+ sub.d a2, a2, t1 //src -= stride*3
+ fld.d f6, a2, 0
+ fldx.d f7, a2, a3
+ fldx.d f8, a2, t0
+ add.d a2, a2, t1
+ fld.d f9, a2, 0
+ fldx.d f10, a2, a3
+ fldx.d f11, a2, t0
+ fldx.d f12, a2, t1
+ add.d a2, a2, t2
+ TRANSPOSE8X8B_LSX vr6, vr7, vr8, vr9, vr10, vr11, vr12, vr13, \
+ vr6, vr7, vr8, vr9
+.LOOP_V8:
+ fld.d f13, a2, 0 //the 8th load
+ add.d a2, a2, a3
+ PUT_HEVC_QPEL_UNI_W_V8_LSX vr6, vr7, vr8, vr9, vr10, vr11, 0
+ vssrani.h.w vr11, vr10, 0
+ vssrani.bu.h vr11, vr11, 0
+ fst.d f11, a0, 0
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_V8
+endfunc
+
+.macro PUT_HEVC_UNI_W_V8_LASX w
+ fld.d f6, a2, 0
+ fldx.d f7, a2, a3
+ fldx.d f8, a2, t0
+ add.d a2, a2, t1
+ fld.d f9, a2, 0
+ fldx.d f10, a2, a3
+ fldx.d f11, a2, t0
+ fldx.d f12, a2, t1
+ add.d a2, a2, t2
+ TRANSPOSE8X8B_LSX vr6, vr7, vr8, vr9, vr10, vr11, vr12, vr13, \
+ vr6, vr7, vr8, vr9
+ xvpermi.q xr6, xr7, 0x02
+ xvpermi.q xr8, xr9, 0x02
+.LOOP_V8_LASX_\w:
+ fld.d f13, a2, 0 // 0 1 2 3 4 5 6 7 the 8th load
+ add.d a2, a2, a3
+ vshuf4i.h vr13, vr13, 0xd8
+ vbsrl.v vr14, vr13, 4
+ xvpermi.q xr13, xr14, 0x02 //0 1 4 5 * * * * 2 3 6 7 * * * *
+ xvextrins.b xr6, xr13, 0x70 //begin to insert the 8th load
+ xvextrins.b xr6, xr13, 0xf1
+ xvextrins.b xr8, xr13, 0x72
+ xvextrins.b xr8, xr13, 0xf3
+ xvdp2.h.bu.b xr20, xr6, xr5 //QPEL_FILTER(src, stride)
+ xvdp2.h.bu.b xr21, xr8, xr5
+ xvbsrl.v xr6, xr6, 1
+ xvbsrl.v xr8, xr8, 1
+ xvhaddw.d.h xr20
+ xvhaddw.d.h xr21
+ xvpickev.w xr20, xr21, xr20
+ xvpermi.d xr20, xr20, 0xd8
+ xvmulwev.w.h xr20, xr20, xr1 //QPEL_FILTER(src, stride) * wx
+ xvadd.w xr20, xr20, xr2
+ xvsra.w xr20, xr20, xr3
+ xvadd.w xr10, xr20, xr4
+ xvpermi.q xr11, xr10, 0x01
+ vssrani.h.w vr11, vr10, 0
+ vssrani.bu.h vr11, vr11, 0
+ fst.d f11, a0, 0
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_V8_LASX_\w
+.endm
+
+function ff_hevc_put_hevc_qpel_uni_w_v8_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ xvreplve0.q xr5, xr5
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ add.d t2, t1, a3 //stride * 4
+ sub.d a2, a2, t1 //src -= stride*3
+ PUT_HEVC_UNI_W_V8_LASX 8
+endfunc
+
+.macro PUT_HEVC_QPEL_UNI_W_V16_LSX w
+ vld vr6, a2, 0
+ vldx vr7, a2, a3
+ vldx vr8, a2, t0
+ add.d a2, a2, t1
+ vld vr9, a2, 0
+ vldx vr10, a2, a3
+ vldx vr11, a2, t0
+ vldx vr12, a2, t1
+ add.d a2, a2, t2
+.if \w > 8
+ vilvh.d vr14, vr14, vr6
+ vilvh.d vr15, vr15, vr7
+ vilvh.d vr16, vr16, vr8
+ vilvh.d vr17, vr17, vr9
+ vilvh.d vr18, vr18, vr10
+ vilvh.d vr19, vr19, vr11
+ vilvh.d vr20, vr20, vr12
+.endif
+ TRANSPOSE8X8B_LSX vr6, vr7, vr8, vr9, vr10, vr11, vr12, vr13, \
+ vr6, vr7, vr8, vr9
+.if \w > 8
+ TRANSPOSE8X8B_LSX vr14, vr15, vr16, vr17, vr18, vr19, vr20, vr21, \
+ vr14, vr15, vr16, vr17
+.endif
+.LOOP_HORI_16_\w:
+ vld vr13, a2, 0
+ add.d a2, a2, a3
+ PUT_HEVC_QPEL_UNI_W_V8_LSX vr6, vr7, vr8, vr9, vr10, vr11, 0
+.if \w > 8
+ PUT_HEVC_QPEL_UNI_W_V8_LSX vr14, vr15, vr16, vr17, vr18, vr19, 8
+.endif
+ vssrani.h.w vr11, vr10, 0
+.if \w > 8
+ vssrani.h.w vr19, vr18, 0
+ vssrani.bu.h vr19, vr11, 0
+.else
+ vssrani.bu.h vr11, vr11, 0
+.endif
+.if \w == 8
+ fst.d f11, a0, 0
+.elseif \w == 12
+ fst.d f19, a0, 0
+ vstelm.w vr19, a0, 8, 2
+.else
+ vst vr19, a0, 0
+.endif
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_HORI_16_\w
+.endm
+
+function ff_hevc_put_hevc_qpel_uni_w_v16_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ add.d t2, t1, a3 //stride * 4
+ sub.d a2, a2, t1 //src -= stride*3
+ PUT_HEVC_QPEL_UNI_W_V16_LSX 16
+endfunc
+
+.macro PUT_HEVC_QPEL_UNI_W_V16_LASX w
+ vld vr6, a2, 0
+ vldx vr7, a2, a3
+ vldx vr8, a2, t0
+ add.d a2, a2, t1
+ vld vr9, a2, 0
+ vldx vr10, a2, a3
+ vldx vr11, a2, t0
+ vldx vr12, a2, t1
+ add.d a2, a2, t2
+ xvpermi.q xr6, xr10, 0x02 //pack and transpose the 8x16 to 4x32 begin
+ xvpermi.q xr7, xr11, 0x02
+ xvpermi.q xr8, xr12, 0x02
+ xvpermi.q xr9, xr13, 0x02
+ xvilvl.b xr14, xr7, xr6 //0 2
+ xvilvh.b xr15, xr7, xr6 //1 3
+ xvilvl.b xr16, xr9, xr8 //0 2
+ xvilvh.b xr17, xr9, xr8 //1 3
+ xvpermi.d xr14, xr14, 0xd8
+ xvpermi.d xr15, xr15, 0xd8
+ xvpermi.d xr16, xr16, 0xd8
+ xvpermi.d xr17, xr17, 0xd8
+ xvilvl.h xr6, xr16, xr14
+ xvilvh.h xr7, xr16, xr14
+ xvilvl.h xr8, xr17, xr15
+ xvilvh.h xr9, xr17, xr15
+ xvilvl.w xr14, xr7, xr6 //0 1 4 5
+ xvilvh.w xr15, xr7, xr6 //2 3 6 7
+ xvilvl.w xr16, xr9, xr8 //8 9 12 13
+ xvilvh.w xr17, xr9, xr8 //10 11 14 15 end
+.LOOP_HORI_16_LASX_\w:
+ vld vr13, a2, 0 //the 8th load
+ add.d a2, a2, a3
+ vshuf4i.w vr13, vr13, 0xd8
+ vbsrl.v vr12, vr13, 8
+ xvpermi.q xr13, xr12, 0x02
+ xvextrins.b xr14, xr13, 0x70 //insert the 8th load
+ xvextrins.b xr14, xr13, 0xf1
+ xvextrins.b xr15, xr13, 0x72
+ xvextrins.b xr15, xr13, 0xf3
+ xvextrins.b xr16, xr13, 0x74
+ xvextrins.b xr16, xr13, 0xf5
+ xvextrins.b xr17, xr13, 0x76
+ xvextrins.b xr17, xr13, 0xf7
+ xvdp2.h.bu.b xr6, xr14, xr5 //QPEL_FILTER(src, stride)
+ xvdp2.h.bu.b xr7, xr15, xr5
+ xvdp2.h.bu.b xr8, xr16, xr5
+ xvdp2.h.bu.b xr9, xr17, xr5
+ xvhaddw.d.h xr6
+ xvhaddw.d.h xr7
+ xvhaddw.d.h xr8
+ xvhaddw.d.h xr9
+ xvbsrl.v xr14, xr14, 1 //Back up previous 7 loaded data,
+ xvbsrl.v xr15, xr15, 1 //so just need to insert the 8th
+ xvbsrl.v xr16, xr16, 1 //load in next loop.
+ xvbsrl.v xr17, xr17, 1
+ xvpickev.w xr6, xr7, xr6 //0 1 2 3 4 5 6 7
+ xvpickev.w xr7, xr9, xr8 //8 9 10 11 12 13 14 15
+ xvmulwev.w.h xr6, xr6, xr1 //QPEL_FILTER(src, stride) * wx
+ xvmulwev.w.h xr7, xr7, xr1
+ xvadd.w xr6, xr6, xr2
+ xvadd.w xr7, xr7, xr2
+ xvsra.w xr6, xr6, xr3
+ xvsra.w xr7, xr7, xr3
+ xvadd.w xr6, xr6, xr4
+ xvadd.w xr7, xr7, xr4
+ xvssrani.h.w xr7, xr6, 0 //0 1 2 3 8 9 10 11 4 5 6 7 12 13 14 15
+ xvpermi.q xr6, xr7, 0x01
+ vssrani.bu.h vr6, vr7, 0
+ vshuf4i.w vr6, vr6, 0xd8
+.if \w == 12
+ fst.d f6, a0, 0
+ vstelm.w vr6, a0, 8, 2
+.else
+ vst vr6, a0, 0
+.endif
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_HORI_16_LASX_\w
+.endm
+
+function ff_hevc_put_hevc_qpel_uni_w_v16_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ xvreplve0.q xr5, xr5
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ add.d t2, t1, a3 //stride * 4
+ sub.d a2, a2, t1 //src -= stride*3
+ PUT_HEVC_QPEL_UNI_W_V16_LASX 16
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_v12_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ add.d t2, t1, a3 //stride * 4
+ sub.d a2, a2, t1 //src -= stride*3
+ PUT_HEVC_QPEL_UNI_W_V16_LSX 12
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_v12_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ xvreplve0.q xr5, xr5
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ add.d t2, t1, a3 //stride * 4
+ sub.d a2, a2, t1 //src -= stride*3
+ PUT_HEVC_QPEL_UNI_W_V16_LASX 12
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_v24_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ add.d t2, t1, a3 //stride * 4
+ sub.d a2, a2, t1 //src -= stride*3
+ addi.d t4, a0, 0 //save dst
+ addi.d t5, a2, 0 //save src
+ addi.d t6, a4, 0
+ PUT_HEVC_QPEL_UNI_W_V16_LSX 24
+ addi.d a0, t4, 16
+ addi.d a2, t5, 16
+ addi.d a4, t6, 0
+ PUT_HEVC_QPEL_UNI_W_V16_LSX 8
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_v24_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ xvreplve0.q xr5, xr5
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ add.d t2, t1, a3 //stride * 4
+ sub.d a2, a2, t1 //src -= stride*3
+ addi.d t4, a0, 0 //save dst
+ addi.d t5, a2, 0 //save src
+ addi.d t6, a4, 0
+ PUT_HEVC_QPEL_UNI_W_V16_LASX 24
+ addi.d a0, t4, 16
+ addi.d a2, t5, 16
+ addi.d a4, t6, 0
+ PUT_HEVC_UNI_W_V8_LASX 24
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_v32_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ add.d t2, t1, a3 //stride * 4
+ sub.d a2, a2, t1 //src -= stride*3
+ addi.d t3, zero, 2
+ addi.d t4, a0, 0 //save dst
+ addi.d t5, a2, 0 //save src
+ addi.d t6, a4, 0
+.LOOP_V32:
+ PUT_HEVC_QPEL_UNI_W_V16_LSX 32
+ addi.d t3, t3, -1
+ addi.d a0, t4, 16
+ addi.d a2, t5, 16
+ addi.d a4, t6, 0
+ bnez t3, .LOOP_V32
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_v32_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ xvreplve0.q xr5, xr5
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ add.d t2, t1, a3 //stride * 4
+ sub.d a2, a2, t1 //src -= stride*3
+ addi.d t3, zero, 2
+ addi.d t4, a0, 0 //save dst
+ addi.d t5, a2, 0 //save src
+ addi.d t6, a4, 0
+.LOOP_V32_LASX:
+ PUT_HEVC_QPEL_UNI_W_V16_LASX 32
+ addi.d t3, t3, -1
+ addi.d a0, t4, 16
+ addi.d a2, t5, 16
+ addi.d a4, t6, 0
+ bnez t3, .LOOP_V32_LASX
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_v48_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ add.d t2, t1, a3 //stride * 4
+ sub.d a2, a2, t1 //src -= stride*3
+ addi.d t3, zero, 3
+ addi.d t4, a0, 0 //save dst
+ addi.d t5, a2, 0 //save src
+ addi.d t6, a4, 0
+.LOOP_V48:
+ PUT_HEVC_QPEL_UNI_W_V16_LSX 48
+ addi.d t3, t3, -1
+ addi.d a0, t4, 16
+ addi.d t4, t4, 16
+ addi.d a2, t5, 16
+ addi.d t5, t5, 16
+ addi.d a4, t6, 0
+ bnez t3, .LOOP_V48
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_v48_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ xvreplve0.q xr5, xr5
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ add.d t2, t1, a3 //stride * 4
+ sub.d a2, a2, t1 //src -= stride*3
+ addi.d t3, zero, 3
+ addi.d t4, a0, 0 //save dst
+ addi.d t5, a2, 0 //save src
+ addi.d t6, a4, 0
+.LOOP_V48_LASX:
+ PUT_HEVC_QPEL_UNI_W_V16_LASX 48
+ addi.d t3, t3, -1
+ addi.d a0, t4, 16
+ addi.d t4, t4, 16
+ addi.d a2, t5, 16
+ addi.d t5, t5, 16
+ addi.d a4, t6, 0
+ bnez t3, .LOOP_V48_LASX
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_v64_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ add.d t2, t1, a3 //stride * 4
+ sub.d a2, a2, t1 //src -= stride*3
+ addi.d t3, zero, 4
+ addi.d t4, a0, 0 //save dst
+ addi.d t5, a2, 0 //save src
+ addi.d t6, a4, 0
+.LOOP_V64:
+ PUT_HEVC_QPEL_UNI_W_V16_LSX 64
+ addi.d t3, t3, -1
+ addi.d a0, t4, 16
+ addi.d t4, t4, 16
+ addi.d a2, t5, 16
+ addi.d t5, t5, 16
+ addi.d a4, t6, 0
+ bnez t3, .LOOP_V64
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_v64_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ xvreplve0.q xr5, xr5
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ add.d t2, t1, a3 //stride * 4
+ sub.d a2, a2, t1 //src -= stride*3
+ addi.d t3, zero, 4
+ addi.d t4, a0, 0 //save dst
+ addi.d t5, a2, 0 //save src
+ addi.d t6, a4, 0
+.LOOP_V64_LASX:
+ PUT_HEVC_QPEL_UNI_W_V16_LASX 64
+ addi.d t3, t3, -1
+ addi.d a0, t4, 16
+ addi.d t4, t4, 16
+ addi.d a2, t5, 16
+ addi.d t5, t5, 16
+ addi.d a4, t6, 0
+ bnez t3, .LOOP_V64_LASX
+endfunc
+
+.macro PUT_HEVC_QPEL_UNI_W_H8_LSX in0, out0, out1
+ vbsrl.v vr7, \in0, 1
+ vbsrl.v vr8, \in0, 2
+ vbsrl.v vr9, \in0, 3
+ vbsrl.v vr10, \in0, 4
+ vbsrl.v vr11, \in0, 5
+ vbsrl.v vr12, \in0, 6
+ vbsrl.v vr13, \in0, 7
+ vilvl.d vr6, vr7, \in0
+ vilvl.d vr7, vr9, vr8
+ vilvl.d vr8, vr11, vr10
+ vilvl.d vr9, vr13, vr12
+ vdp2.h.bu.b vr10, vr6, vr5
+ vdp2.h.bu.b vr11, vr7, vr5
+ vdp2.h.bu.b vr12, vr8, vr5
+ vdp2.h.bu.b vr13, vr9, vr5
+ vhaddw.d.h vr10
+ vhaddw.d.h vr11
+ vhaddw.d.h vr12
+ vhaddw.d.h vr13
+ vpickev.w vr10, vr11, vr10
+ vpickev.w vr11, vr13, vr12
+ vmulwev.w.h vr10, vr10, vr1
+ vmulwev.w.h vr11, vr11, vr1
+ vadd.w vr10, vr10, vr2
+ vadd.w vr11, vr11, vr2
+ vsra.w vr10, vr10, vr3
+ vsra.w vr11, vr11, vr3
+ vadd.w \out0, vr10, vr4
+ vadd.w \out1, vr11, vr4
+.endm
+
+.macro PUT_HEVC_QPEL_UNI_W_H8_LASX in0, out0
+ xvbsrl.v xr7, \in0, 4
+ xvpermi.q xr7, \in0, 0x20
+ xvbsrl.v xr8, xr7, 1
+ xvbsrl.v xr9, xr7, 2
+ xvbsrl.v xr10, xr7, 3
+ xvpackev.d xr7, xr8, xr7
+ xvpackev.d xr8, xr10, xr9
+ xvdp2.h.bu.b xr10, xr7, xr5
+ xvdp2.h.bu.b xr11, xr8, xr5
+ xvhaddw.d.h xr10
+ xvhaddw.d.h xr11
+ xvpickev.w xr10, xr11, xr10
+ xvmulwev.w.h xr10, xr10, xr1
+ xvadd.w xr10, xr10, xr2
+ xvsra.w xr10, xr10, xr3
+ xvadd.w \out0, xr10, xr4
+.endm
+
+.macro PUT_HEVC_QPEL_UNI_W_H16_LASX in0, out0
+ xvpermi.d xr6, \in0, 0x94
+ xvbsrl.v xr7, xr6, 1
+ xvbsrl.v xr8, xr6, 2
+ xvbsrl.v xr9, xr6, 3
+ xvbsrl.v xr10, xr6, 4
+ xvbsrl.v xr11, xr6, 5
+ xvbsrl.v xr12, xr6, 6
+ xvbsrl.v xr13, xr6, 7
+ xvpackev.d xr6, xr7, xr6
+ xvpackev.d xr7, xr9, xr8
+ xvpackev.d xr8, xr11, xr10
+ xvpackev.d xr9, xr13, xr12
+ xvdp2.h.bu.b xr10, xr6, xr5
+ xvdp2.h.bu.b xr11, xr7, xr5
+ xvdp2.h.bu.b xr12, xr8, xr5
+ xvdp2.h.bu.b xr13, xr9, xr5
+ xvhaddw.d.h xr10
+ xvhaddw.d.h xr11
+ xvhaddw.d.h xr12
+ xvhaddw.d.h xr13
+ xvpickev.w xr10, xr11, xr10
+ xvpickev.w xr11, xr13, xr12
+ xvmulwev.w.h xr10, xr10, xr1
+ xvmulwev.w.h xr11, xr11, xr1
+ xvadd.w xr10, xr10, xr2
+ xvadd.w xr11, xr11, xr2
+ xvsra.w xr10, xr10, xr3
+ xvsra.w xr11, xr11, xr3
+ xvadd.w xr10, xr10, xr4
+ xvadd.w xr11, xr11, xr4
+ xvssrani.h.w xr11, xr10, 0
+ xvpermi.q \out0, xr11, 0x01
+ xvssrani.bu.h \out0, xr11, 0
+.endm
+
+function ff_hevc_put_hevc_qpel_uni_w_h4_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ addi.d a2, a2, -3 //src -= 3
+.LOOP_H4:
+ vld vr18, a2, 0
+ vldx vr19, a2, a3
+ alsl.d a2, a3, a2, 1
+ vbsrl.v vr6, vr18, 1
+ vbsrl.v vr7, vr18, 2
+ vbsrl.v vr8, vr18, 3
+ vbsrl.v vr9, vr19, 1
+ vbsrl.v vr10, vr19, 2
+ vbsrl.v vr11, vr19, 3
+ vilvl.d vr6, vr6, vr18
+ vilvl.d vr7, vr8, vr7
+ vilvl.d vr8, vr9, vr19
+ vilvl.d vr9, vr11, vr10
+ vdp2.h.bu.b vr10, vr6, vr5
+ vdp2.h.bu.b vr11, vr7, vr5
+ vdp2.h.bu.b vr12, vr8, vr5
+ vdp2.h.bu.b vr13, vr9, vr5
+ vhaddw.d.h vr10
+ vhaddw.d.h vr11
+ vhaddw.d.h vr12
+ vhaddw.d.h vr13
+ vpickev.w vr10, vr11, vr10
+ vpickev.w vr11, vr13, vr12
+ vmulwev.w.h vr10, vr10, vr1
+ vmulwev.w.h vr11, vr11, vr1
+ vadd.w vr10, vr10, vr2
+ vadd.w vr11, vr11, vr2
+ vsra.w vr10, vr10, vr3
+ vsra.w vr11, vr11, vr3
+ vadd.w vr10, vr10, vr4
+ vadd.w vr11, vr11, vr4
+ vssrani.h.w vr11, vr10, 0
+ vssrani.bu.h vr11, vr11, 0
+ fst.s f11, a0, 0
+ vbsrl.v vr11, vr11, 4
+ fstx.s f11, a0, a1
+ alsl.d a0, a1, a0, 1
+ addi.d a4, a4, -2
+ bnez a4, .LOOP_H4
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_h4_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ xvreplve0.q xr5, xr5
+ addi.d a2, a2, -3 //src -= 3
+.LOOP_H4_LASX:
+ vld vr18, a2, 0
+ vldx vr19, a2, a3
+ alsl.d a2, a3, a2, 1
+ xvpermi.q xr18, xr19, 0x02
+ xvbsrl.v xr6, xr18, 1
+ xvbsrl.v xr7, xr18, 2
+ xvbsrl.v xr8, xr18, 3
+ xvpackev.d xr6, xr6, xr18
+ xvpackev.d xr7, xr8, xr7
+ xvdp2.h.bu.b xr10, xr6, xr5
+ xvdp2.h.bu.b xr11, xr7, xr5
+ xvhaddw.d.h xr10
+ xvhaddw.d.h xr11
+ xvpickev.w xr10, xr11, xr10
+ xvmulwev.w.h xr10, xr10, xr1
+ xvadd.w xr10, xr10, xr2
+ xvsra.w xr10, xr10, xr3
+ xvadd.w xr10, xr10, xr4
+ xvpermi.q xr11, xr10, 0x01
+ vssrani.h.w vr11, vr10, 0
+ vssrani.bu.h vr11, vr11, 0
+ fst.s f11, a0, 0
+ vbsrl.v vr11, vr11, 4
+ fstx.s f11, a0, a1
+ alsl.d a0, a1, a0, 1
+ addi.d a4, a4, -2
+ bnez a4, .LOOP_H4_LASX
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_h6_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ addi.d a2, a2, -3 //src -= 3
+.LOOP_H6:
+ vld vr6, a2, 0
+ add.d a2, a2, a3
+ PUT_HEVC_QPEL_UNI_W_H8_LSX vr6, vr10, vr11
+ vssrani.h.w vr11, vr10, 0
+ vssrani.bu.h vr11, vr11, 0
+ fst.s f11, a0, 0
+ vstelm.h vr11, a0, 4, 2
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_H6
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_h6_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ xvreplve0.q xr5, xr5
+ addi.d a2, a2, -3 //src -= 3
+.LOOP_H6_LASX:
+ vld vr6, a2, 0
+ add.d a2, a2, a3
+ PUT_HEVC_QPEL_UNI_W_H8_LASX xr6, xr10
+ xvpermi.q xr11, xr10, 0x01
+ vssrani.h.w vr11, vr10, 0
+ vssrani.bu.h vr11, vr11, 0
+ fst.s f11, a0, 0
+ vstelm.h vr11, a0, 4, 2
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_H6_LASX
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_h8_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ addi.d a2, a2, -3 //src -= 3
+.LOOP_H8:
+ vld vr6, a2, 0
+ add.d a2, a2, a3
+ PUT_HEVC_QPEL_UNI_W_H8_LSX vr6, vr10, vr11
+ vssrani.h.w vr11, vr10, 0
+ vssrani.bu.h vr11, vr11, 0
+ fst.d f11, a0, 0
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_H8
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_h8_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ xvreplve0.q xr5, xr5
+ addi.d a2, a2, -3 //src -= 3
+.LOOP_H8_LASX:
+ vld vr6, a2, 0
+ add.d a2, a2, a3
+ PUT_HEVC_QPEL_UNI_W_H8_LASX xr6, xr10
+ xvpermi.q xr11, xr10, 0x01
+ vssrani.h.w vr11, vr10, 0
+ vssrani.bu.h vr11, vr11, 0
+ fst.d f11, a0, 0
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_H8_LASX
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_h12_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ addi.d a2, a2, -3 //src -= 3
+.LOOP_H12:
+ vld vr6, a2, 0
+ PUT_HEVC_QPEL_UNI_W_H8_LSX vr6, vr14, vr15
+ vld vr6, a2, 8
+ PUT_HEVC_QPEL_UNI_W_H8_LSX vr6, vr16, vr17
+ add.d a2, a2, a3
+ vssrani.h.w vr15, vr14, 0
+ vssrani.h.w vr17, vr16, 0
+ vssrani.bu.h vr17, vr15, 0
+ fst.d f17, a0, 0
+ vbsrl.v vr17, vr17, 8
+ fst.s f17, a0, 8
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_H12
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_h12_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ xvreplve0.q xr5, xr5
+ addi.d a2, a2, -3 //src -= 3
+.LOOP_H12_LASX:
+ xvld xr6, a2, 0
+ add.d a2, a2, a3
+ PUT_HEVC_QPEL_UNI_W_H16_LASX xr6, xr14
+ fst.d f14, a0, 0
+ vstelm.w vr14, a0, 8, 2
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_H12_LASX
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_h16_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ addi.d a2, a2, -3 //src -= 3
+.LOOP_H16:
+ vld vr6, a2, 0
+ PUT_HEVC_QPEL_UNI_W_H8_LSX vr6, vr14, vr15
+ vld vr6, a2, 8
+ PUT_HEVC_QPEL_UNI_W_H8_LSX vr6, vr16, vr17
+ add.d a2, a2, a3
+ vssrani.h.w vr15, vr14, 0
+ vssrani.h.w vr17, vr16, 0
+ vssrani.bu.h vr17, vr15, 0
+ vst vr17, a0, 0
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_H16
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_h16_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ xvreplve0.q xr5, xr5
+ addi.d a2, a2, -3 //src -= 3
+.LOOP_H16_LASX:
+ xvld xr6, a2, 0
+ add.d a2, a2, a3
+ PUT_HEVC_QPEL_UNI_W_H16_LASX xr6, xr10
+ vst vr10, a0, 0
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_H16_LASX
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_h24_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ addi.d a2, a2, -3 //src -= 3
+.LOOP_H24:
+ vld vr18, a2, 0
+ vld vr19, a2, 16
+ add.d a2, a2, a3
+ PUT_HEVC_QPEL_UNI_W_H8_LSX vr18, vr14, vr15
+ vshuf4i.d vr18, vr19, 0x09
+ PUT_HEVC_QPEL_UNI_W_H8_LSX vr18, vr16, vr17
+ vssrani.h.w vr15, vr14, 0
+ vssrani.h.w vr17, vr16, 0
+ vssrani.bu.h vr17, vr15, 0
+ vst vr17, a0, 0
+ PUT_HEVC_QPEL_UNI_W_H8_LSX vr19, vr14, vr15
+ vssrani.h.w vr15, vr14, 0
+ vssrani.bu.h vr15, vr15, 0
+ fst.d f15, a0, 16
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_H24
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_h24_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ xvreplve0.q xr5, xr5
+ addi.d a2, a2, -3 //src -= 3
+.LOOP_H24_LASX:
+ xvld xr18, a2, 0
+ add.d a2, a2, a3
+ PUT_HEVC_QPEL_UNI_W_H16_LASX xr18, xr20
+ xvpermi.q xr19, xr18, 0x01
+ vst vr20, a0, 0
+ PUT_HEVC_QPEL_UNI_W_H8_LASX xr19, xr20
+ xvpermi.q xr21, xr20, 0x01
+ vssrani.h.w vr21, vr20, 0
+ vssrani.bu.h vr21, vr21, 0
+ fst.d f21, a0, 16
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_H24_LASX
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_h32_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ addi.d a2, a2, -3 //src -= 3
+.LOOP_H32:
+ vld vr18, a2, 0
+ vld vr19, a2, 16
+ vld vr20, a2, 32
+ add.d a2, a2, a3
+ PUT_HEVC_QPEL_UNI_W_H8_LSX vr18, vr14, vr15
+ vshuf4i.d vr18, vr19, 0x09
+ PUT_HEVC_QPEL_UNI_W_H8_LSX vr18, vr16, vr17
+ vssrani.h.w vr15, vr14, 0
+ vssrani.h.w vr17, vr16, 0
+ vssrani.bu.h vr17, vr15, 0
+ vst vr17, a0, 0
+ PUT_HEVC_QPEL_UNI_W_H8_LSX vr19, vr14, vr15
+ vshuf4i.d vr19, vr20, 0x09
+ PUT_HEVC_QPEL_UNI_W_H8_LSX vr19, vr16, vr17
+ vssrani.h.w vr15, vr14, 0
+ vssrani.h.w vr17, vr16, 0
+ vssrani.bu.h vr17, vr15, 0
+ vst vr17, a0, 16
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_H32
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_h32_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ xvreplve0.q xr5, xr5
+ addi.d a2, a2, -3 //src -= 3
+.LOOP_H32_LASX:
+ xvld xr18, a2, 0
+ xvld xr19, a2, 16
+ add.d a2, a2, a3
+ PUT_HEVC_QPEL_UNI_W_H16_LASX xr18, xr20
+ PUT_HEVC_QPEL_UNI_W_H16_LASX xr19, xr21
+ xvpermi.q xr20, xr21, 0x02
+ xvst xr20, a0, 0
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_H32_LASX
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_h48_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ addi.d a2, a2, -3 //src -= 3
+.LOOP_H48:
+ vld vr18, a2, 0
+ vld vr19, a2, 16
+ vld vr20, a2, 32
+ vld vr21, a2, 48
+ add.d a2, a2, a3
+ PUT_HEVC_QPEL_UNI_W_H8_LSX vr18, vr14, vr15
+ vshuf4i.d vr18, vr19, 0x09
+ PUT_HEVC_QPEL_UNI_W_H8_LSX vr18, vr16, vr17
+ vssrani.h.w vr15, vr14, 0
+ vssrani.h.w vr17, vr16, 0
+ vssrani.bu.h vr17, vr15, 0
+ vst vr17, a0, 0
+ PUT_HEVC_QPEL_UNI_W_H8_LSX vr19, vr14, vr15
+ vshuf4i.d vr19, vr20, 0x09
+ PUT_HEVC_QPEL_UNI_W_H8_LSX vr19, vr16, vr17
+ vssrani.h.w vr15, vr14, 0
+ vssrani.h.w vr17, vr16, 0
+ vssrani.bu.h vr17, vr15, 0
+ vst vr17, a0, 16
+ PUT_HEVC_QPEL_UNI_W_H8_LSX vr20, vr14, vr15
+ vshuf4i.d vr20, vr21, 0x09
+ PUT_HEVC_QPEL_UNI_W_H8_LSX vr20, vr16, vr17
+ vssrani.h.w vr15, vr14, 0
+ vssrani.h.w vr17, vr16, 0
+ vssrani.bu.h vr17, vr15, 0
+ vst vr17, a0, 32
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_H48
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_h48_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ xvreplve0.q xr5, xr5
+ addi.d a2, a2, -3 //src -= 3
+.LOOP_H48_LASX:
+ xvld xr18, a2, 0
+ xvld xr19, a2, 32
+ add.d a2, a2, a3
+ PUT_HEVC_QPEL_UNI_W_H16_LASX xr18, xr20
+ xvpermi.q xr18, xr19, 0x03
+ PUT_HEVC_QPEL_UNI_W_H16_LASX xr18, xr21
+ xvpermi.q xr20, xr21, 0x02
+ xvst xr20, a0, 0
+ PUT_HEVC_QPEL_UNI_W_H16_LASX xr19, xr20
+ vst vr20, a0, 32
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_H48_LASX
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_h64_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ addi.d a2, a2, -3 //src -= 3
+.LOOP_H64:
+ vld vr18, a2, 0
+ vld vr19, a2, 16
+ vld vr20, a2, 32
+ vld vr21, a2, 48
+ vld vr22, a2, 64
+ add.d a2, a2, a3
+ PUT_HEVC_QPEL_UNI_W_H8_LSX vr18, vr14, vr15
+ vshuf4i.d vr18, vr19, 0x09
+ PUT_HEVC_QPEL_UNI_W_H8_LSX vr18, vr16, vr17
+ vssrani.h.w vr15, vr14, 0
+ vssrani.h.w vr17, vr16, 0
+ vssrani.bu.h vr17, vr15, 0
+ vst vr17, a0, 0
+ PUT_HEVC_QPEL_UNI_W_H8_LSX vr19, vr14, vr15
+ vshuf4i.d vr19, vr20, 0x09
+ PUT_HEVC_QPEL_UNI_W_H8_LSX vr19, vr16, vr17
+ vssrani.h.w vr15, vr14, 0
+ vssrani.h.w vr17, vr16, 0
+ vssrani.bu.h vr17, vr15, 0
+ vst vr17, a0, 16
+ PUT_HEVC_QPEL_UNI_W_H8_LSX vr20, vr14, vr15
+ vshuf4i.d vr20, vr21, 0x09
+ PUT_HEVC_QPEL_UNI_W_H8_LSX vr20, vr16, vr17
+ vssrani.h.w vr15, vr14, 0
+ vssrani.h.w vr17, vr16, 0
+ vssrani.bu.h vr17, vr15, 0
+ vst vr17, a0, 32
+ PUT_HEVC_QPEL_UNI_W_H8_LSX vr21, vr14, vr15
+ vshuf4i.d vr21, vr22, 0x09
+ PUT_HEVC_QPEL_UNI_W_H8_LSX vr21, vr16, vr17
+ vssrani.h.w vr15, vr14, 0
+ vssrani.h.w vr17, vr16, 0
+ vssrani.bu.h vr17, vr15, 0
+ vst vr17, a0, 48
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_H64
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_h64_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ xvreplve0.q xr5, xr5
+ addi.d a2, a2, -3 //src -= 3
+.LOOP_H64_LASX:
+ xvld xr18, a2, 0
+ xvld xr19, a2, 32
+ xvld xr20, a2, 64
+ add.d a2, a2, a3
+ PUT_HEVC_QPEL_UNI_W_H16_LASX xr18, xr21
+ xvpermi.q xr18, xr19, 0x03
+ PUT_HEVC_QPEL_UNI_W_H16_LASX xr18, xr22
+ xvpermi.q xr21, xr22, 0x02
+ xvst xr21, a0, 0
+ PUT_HEVC_QPEL_UNI_W_H16_LASX xr19, xr21
+ xvpermi.q xr19, xr20, 0x03
+ PUT_HEVC_QPEL_UNI_W_H16_LASX xr19, xr22
+ xvpermi.q xr21, xr22, 0x02
+ xvst xr21, a0, 32
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_H64_LASX
+endfunc
diff --git a/libavcodec/loongarch/hevcdsp_init_loongarch.c b/libavcodec/loongarch/hevcdsp_init_loongarch.c
index d0ee99d6b5..3cdb3fb2d7 100644
--- a/libavcodec/loongarch/hevcdsp_init_loongarch.c
+++ b/libavcodec/loongarch/hevcdsp_init_loongarch.c
@@ -188,6 +188,26 @@ void ff_hevc_dsp_init_loongarch(HEVCDSPContext *c, const int bit_depth)
c->put_hevc_qpel_uni_w[8][1][1] = ff_hevc_put_hevc_uni_w_qpel_hv48_8_lsx;
c->put_hevc_qpel_uni_w[9][1][1] = ff_hevc_put_hevc_uni_w_qpel_hv64_8_lsx;
+ c->put_hevc_qpel_uni_w[1][1][0] = ff_hevc_put_hevc_qpel_uni_w_v4_8_lsx;
+ c->put_hevc_qpel_uni_w[2][1][0] = ff_hevc_put_hevc_qpel_uni_w_v6_8_lsx;
+ c->put_hevc_qpel_uni_w[3][1][0] = ff_hevc_put_hevc_qpel_uni_w_v8_8_lsx;
+ c->put_hevc_qpel_uni_w[4][1][0] = ff_hevc_put_hevc_qpel_uni_w_v12_8_lsx;
+ c->put_hevc_qpel_uni_w[5][1][0] = ff_hevc_put_hevc_qpel_uni_w_v16_8_lsx;
+ c->put_hevc_qpel_uni_w[6][1][0] = ff_hevc_put_hevc_qpel_uni_w_v24_8_lsx;
+ c->put_hevc_qpel_uni_w[7][1][0] = ff_hevc_put_hevc_qpel_uni_w_v32_8_lsx;
+ c->put_hevc_qpel_uni_w[8][1][0] = ff_hevc_put_hevc_qpel_uni_w_v48_8_lsx;
+ c->put_hevc_qpel_uni_w[9][1][0] = ff_hevc_put_hevc_qpel_uni_w_v64_8_lsx;
+
+ c->put_hevc_qpel_uni_w[1][0][1] = ff_hevc_put_hevc_qpel_uni_w_h4_8_lsx;
+ c->put_hevc_qpel_uni_w[2][0][1] = ff_hevc_put_hevc_qpel_uni_w_h6_8_lsx;
+ c->put_hevc_qpel_uni_w[3][0][1] = ff_hevc_put_hevc_qpel_uni_w_h8_8_lsx;
+ c->put_hevc_qpel_uni_w[4][0][1] = ff_hevc_put_hevc_qpel_uni_w_h12_8_lsx;
+ c->put_hevc_qpel_uni_w[5][0][1] = ff_hevc_put_hevc_qpel_uni_w_h16_8_lsx;
+ c->put_hevc_qpel_uni_w[6][0][1] = ff_hevc_put_hevc_qpel_uni_w_h24_8_lsx;
+ c->put_hevc_qpel_uni_w[7][0][1] = ff_hevc_put_hevc_qpel_uni_w_h32_8_lsx;
+ c->put_hevc_qpel_uni_w[8][0][1] = ff_hevc_put_hevc_qpel_uni_w_h48_8_lsx;
+ c->put_hevc_qpel_uni_w[9][0][1] = ff_hevc_put_hevc_qpel_uni_w_h64_8_lsx;
+
c->sao_edge_filter[0] = ff_hevc_sao_edge_filter_8_lsx;
c->sao_edge_filter[1] = ff_hevc_sao_edge_filter_8_lsx;
c->sao_edge_filter[2] = ff_hevc_sao_edge_filter_8_lsx;
@@ -237,6 +257,24 @@ void ff_hevc_dsp_init_loongarch(HEVCDSPContext *c, const int bit_depth)
c->put_hevc_epel_uni_w[7][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels32_8_lasx;
c->put_hevc_epel_uni_w[8][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels48_8_lasx;
c->put_hevc_epel_uni_w[9][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels64_8_lasx;
+
+ c->put_hevc_qpel_uni_w[3][1][0] = ff_hevc_put_hevc_qpel_uni_w_v8_8_lasx;
+ c->put_hevc_qpel_uni_w[4][1][0] = ff_hevc_put_hevc_qpel_uni_w_v12_8_lasx;
+ c->put_hevc_qpel_uni_w[5][1][0] = ff_hevc_put_hevc_qpel_uni_w_v16_8_lasx;
+ c->put_hevc_qpel_uni_w[6][1][0] = ff_hevc_put_hevc_qpel_uni_w_v24_8_lasx;
+ c->put_hevc_qpel_uni_w[7][1][0] = ff_hevc_put_hevc_qpel_uni_w_v32_8_lasx;
+ c->put_hevc_qpel_uni_w[8][1][0] = ff_hevc_put_hevc_qpel_uni_w_v48_8_lasx;
+ c->put_hevc_qpel_uni_w[9][1][0] = ff_hevc_put_hevc_qpel_uni_w_v64_8_lasx;
+
+ c->put_hevc_qpel_uni_w[1][0][1] = ff_hevc_put_hevc_qpel_uni_w_h4_8_lasx;
+ c->put_hevc_qpel_uni_w[2][0][1] = ff_hevc_put_hevc_qpel_uni_w_h6_8_lasx;
+ c->put_hevc_qpel_uni_w[3][0][1] = ff_hevc_put_hevc_qpel_uni_w_h8_8_lasx;
+ c->put_hevc_qpel_uni_w[4][0][1] = ff_hevc_put_hevc_qpel_uni_w_h12_8_lasx;
+ c->put_hevc_qpel_uni_w[5][0][1] = ff_hevc_put_hevc_qpel_uni_w_h16_8_lasx;
+ c->put_hevc_qpel_uni_w[6][0][1] = ff_hevc_put_hevc_qpel_uni_w_h24_8_lasx;
+ c->put_hevc_qpel_uni_w[7][0][1] = ff_hevc_put_hevc_qpel_uni_w_h32_8_lasx;
+ c->put_hevc_qpel_uni_w[8][0][1] = ff_hevc_put_hevc_qpel_uni_w_h48_8_lasx;
+ c->put_hevc_qpel_uni_w[9][0][1] = ff_hevc_put_hevc_qpel_uni_w_h64_8_lasx;
}
}
}
diff --git a/libavcodec/loongarch/hevcdsp_lasx.h b/libavcodec/loongarch/hevcdsp_lasx.h
index 819c3c3ecf..8a9266d375 100644
--- a/libavcodec/loongarch/hevcdsp_lasx.h
+++ b/libavcodec/loongarch/hevcdsp_lasx.h
@@ -48,6 +48,24 @@ PEL_UNI_W(pel, pixels, 32);
PEL_UNI_W(pel, pixels, 48);
PEL_UNI_W(pel, pixels, 64);
+PEL_UNI_W(qpel, v, 8);
+PEL_UNI_W(qpel, v, 12);
+PEL_UNI_W(qpel, v, 16);
+PEL_UNI_W(qpel, v, 24);
+PEL_UNI_W(qpel, v, 32);
+PEL_UNI_W(qpel, v, 48);
+PEL_UNI_W(qpel, v, 64);
+
+PEL_UNI_W(qpel, h, 4);
+PEL_UNI_W(qpel, h, 6);
+PEL_UNI_W(qpel, h, 8);
+PEL_UNI_W(qpel, h, 12);
+PEL_UNI_W(qpel, h, 16);
+PEL_UNI_W(qpel, h, 24);
+PEL_UNI_W(qpel, h, 32);
+PEL_UNI_W(qpel, h, 48);
+PEL_UNI_W(qpel, h, 64);
+
#undef PEL_UNI_W
#endif // #ifndef AVCODEC_LOONGARCH_HEVCDSP_LASX_H
diff --git a/libavcodec/loongarch/hevcdsp_lsx.h b/libavcodec/loongarch/hevcdsp_lsx.h
index 0d724a90ef..3291294ed9 100644
--- a/libavcodec/loongarch/hevcdsp_lsx.h
+++ b/libavcodec/loongarch/hevcdsp_lsx.h
@@ -257,6 +257,26 @@ PEL_UNI_W(pel, pixels, 32);
PEL_UNI_W(pel, pixels, 48);
PEL_UNI_W(pel, pixels, 64);
+PEL_UNI_W(qpel, v, 4);
+PEL_UNI_W(qpel, v, 6);
+PEL_UNI_W(qpel, v, 8);
+PEL_UNI_W(qpel, v, 12);
+PEL_UNI_W(qpel, v, 16);
+PEL_UNI_W(qpel, v, 24);
+PEL_UNI_W(qpel, v, 32);
+PEL_UNI_W(qpel, v, 48);
+PEL_UNI_W(qpel, v, 64);
+
+PEL_UNI_W(qpel, h, 4);
+PEL_UNI_W(qpel, h, 6);
+PEL_UNI_W(qpel, h, 8);
+PEL_UNI_W(qpel, h, 12);
+PEL_UNI_W(qpel, h, 16);
+PEL_UNI_W(qpel, h, 24);
+PEL_UNI_W(qpel, h, 32);
+PEL_UNI_W(qpel, h, 48);
+PEL_UNI_W(qpel, h, 64);
+
#undef PEL_UNI_W
#endif // #ifndef AVCODEC_LOONGARCH_HEVCDSP_LSX_H
--
2.20.1
* [FFmpeg-devel] [PATCH v2 5/7] avcodec/hevc: Add epel_uni_w_hv4/6/8/12/16/24/32/48/64 asm opt
2023-12-27 4:50 [FFmpeg-devel] [PATCH v2] [loongarch] Add hevc 128-bit & 256-bit asm optimizatons jinbo
` (3 preceding siblings ...)
2023-12-27 4:50 ` [FFmpeg-devel] [PATCH v2 4/7] avcodec/hevc: Add qpel_uni_w_v|h4/6/8/12/16/24/32/48/64 " jinbo
@ 2023-12-27 4:50 ` jinbo
2023-12-27 4:50 ` [FFmpeg-devel] [PATCH v2 6/7] avcodec/hevc: Add asm opt for the following functions jinbo
2023-12-27 4:50 ` [FFmpeg-devel] [PATCH v2 7/7] avcodec/hevc: Add ff_hevc_idct_32x32_lasx asm opt jinbo
6 siblings, 0 replies; 10+ messages in thread
From: jinbo @ 2023-12-27 4:50 UTC (permalink / raw)
To: ffmpeg-devel; +Cc: jinbo
tests/checkasm/checkasm: C LSX LASX
put_hevc_epel_uni_w_hv4_8_c: 9.5 2.2
put_hevc_epel_uni_w_hv6_8_c: 18.5 5.0 3.7
put_hevc_epel_uni_w_hv8_8_c: 30.7 6.0 4.5
put_hevc_epel_uni_w_hv12_8_c: 63.7 14.0 10.7
put_hevc_epel_uni_w_hv16_8_c: 107.5 22.7 17.0
put_hevc_epel_uni_w_hv24_8_c: 236.7 50.2 31.7
put_hevc_epel_uni_w_hv32_8_c: 414.5 88.0 53.0
put_hevc_epel_uni_w_hv48_8_c: 917.5 197.7 118.5
put_hevc_epel_uni_w_hv64_8_c: 1617.0 349.5 203.0
After this patch, the performance of decoding H265 4K 30FPS 30Mbps
on 3A6000 with 8 threads improves by 3fps (52fps-->55fps).
Change-Id: If067e394cec4685c62193e7adb829ac93ba4804d
---
libavcodec/loongarch/hevc_mc.S | 821 ++++++++++++++++++
libavcodec/loongarch/hevcdsp_init_loongarch.c | 19 +
libavcodec/loongarch/hevcdsp_lasx.h | 9 +
libavcodec/loongarch/hevcdsp_lsx.h | 10 +
4 files changed, 859 insertions(+)
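
For readers following the patch, the scalar arithmetic that these epel_uni_w_hv
routines vectorize is the chain annotated in the asm comments below: a horizontal
4-tap EPEL filter into a 16-bit tmp buffer, a vertical 4-tap filter over that
buffer, then the weighted-prediction scaling ">> 6, * wx, + offset, >> shift,
+ ox, clip". The following is only an illustrative 8-bit C sketch of that chain,
not the FFmpeg reference; epel_filters[] and the helper name are placeholders.

    #include <stddef.h>
    #include <stdint.h>

    #define MAX_PB_SIZE 64

    static uint8_t clip_u8(int v) { return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v; }

    static void epel_uni_w_hv_ref(uint8_t *dst, ptrdiff_t dststride,
                                  const uint8_t *src, ptrdiff_t srcstride,
                                  int height, int denom, int wx, int ox,
                                  int mx, int my, int width,
                                  const int8_t epel_filters[7][4])
    {
        int16_t tmp[(MAX_PB_SIZE + 3) * MAX_PB_SIZE];
        int16_t *t = tmp;
        const int8_t *fh = epel_filters[mx - 1];
        const int8_t *fv = epel_filters[my - 1];
        int shift  = denom + 6;          /* denom + 14 - 8 for 8-bit */
        int offset = 1 << (shift - 1);

        src -= srcstride + 1;            /* one row above, one column left (EPEL_EXTRA) */
        for (int y = 0; y < height + 3; y++, src += srcstride, t += MAX_PB_SIZE)
            for (int x = 0; x < width; x++)          /* horizontal 4-tap pass */
                t[x] = fh[0] * src[x]     + fh[1] * src[x + 1] +
                       fh[2] * src[x + 2] + fh[3] * src[x + 3];

        t = tmp + MAX_PB_SIZE;           /* first output row */
        for (int y = 0; y < height; y++, t += MAX_PB_SIZE, dst += dststride)
            for (int x = 0; x < width; x++) {        /* vertical 4-tap pass + weighting */
                int v = fv[0] * t[x - MAX_PB_SIZE] + fv[1] * t[x] +
                        fv[2] * t[x + MAX_PB_SIZE] + fv[3] * t[x + 2 * MAX_PB_SIZE];
                dst[x] = clip_u8((((v >> 6) * wx + offset) >> shift) + ox);
            }
    }
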
diff --git a/libavcodec/loongarch/hevc_mc.S b/libavcodec/loongarch/hevc_mc.S
index 2ee338fb8e..0b0647546b 100644
--- a/libavcodec/loongarch/hevc_mc.S
+++ b/libavcodec/loongarch/hevc_mc.S
@@ -22,6 +22,7 @@
#include "loongson_asm.S"
.extern ff_hevc_qpel_filters
+.extern ff_hevc_epel_filters
.macro LOAD_VAR bit
addi.w t1, a5, 6 //shift
@@ -206,6 +207,12 @@
.endif
.endm
+/*
+ * void FUNC(put_hevc_pel_uni_w_pixels)(uint8_t *_dst, ptrdiff_t _dststride,
+ * const uint8_t *_src, ptrdiff_t _srcstride,
+ * int height, int denom, int wx, int ox,
+ * intptr_t mx, intptr_t my, int width)
+ */
function ff_hevc_put_hevc_pel_uni_w_pixels4_8_lsx
LOAD_VAR 128
srli.w t0, a4, 1
@@ -482,6 +489,12 @@ endfunc
xvhaddw.d.w \in0, \in0, \in0
.endm
+/*
+ * void FUNC(put_hevc_qpel_uni_w_v)(uint8_t *_dst, ptrdiff_t _dststride,
+ * const uint8_t *_src, ptrdiff_t _srcstride,
+ * int height, int denom, int wx, int ox,
+ * intptr_t mx, intptr_t my, int width)
+ */
function ff_hevc_put_hevc_qpel_uni_w_v4_8_lsx
LOAD_VAR 128
ld.d t0, sp, 8 //my
@@ -1253,6 +1266,12 @@ endfunc
xvssrani.bu.h \out0, xr11, 0
.endm
+/*
+ * void FUNC(put_hevc_qpel_uni_w_h)(uint8_t *_dst, ptrdiff_t _dststride,
+ * const uint8_t *_src, ptrdiff_t _srcstride,
+ * int height, int denom, int wx, int ox,
+ * intptr_t mx, intptr_t my, int width)
+ */
function ff_hevc_put_hevc_qpel_uni_w_h4_8_lsx
LOAD_VAR 128
ld.d t0, sp, 0 //mx
@@ -1763,3 +1782,805 @@ function ff_hevc_put_hevc_qpel_uni_w_h64_8_lasx
addi.d a4, a4, -1
bnez a4, .LOOP_H64_LASX
endfunc
+
+const shufb
+ .byte 0,1,2,3, 1,2,3,4 ,2,3,4,5, 3,4,5,6
+ .byte 4,5,6,7, 5,6,7,8 ,6,7,8,9, 7,8,9,10
+endconst
+
+.macro PUT_HEVC_EPEL_UNI_W_HV4_LSX w
+ fld.d f7, a2, 0 // start to load src
+ fldx.d f8, a2, a3
+ alsl.d a2, a3, a2, 1
+ fld.d f9, a2, 0
+ vshuf.b vr7, vr7, vr7, vr0 // 0123 1234 2345 3456
+ vshuf.b vr8, vr8, vr8, vr0
+ vshuf.b vr9, vr9, vr9, vr0
+ vdp2.h.bu.b vr10, vr7, vr5 // EPEL_FILTER(src, 1)
+ vdp2.h.bu.b vr11, vr8, vr5
+ vdp2.h.bu.b vr12, vr9, vr5
+ vhaddw.w.h vr10, vr10, vr10 // tmp[0/1/2/3]
+ vhaddw.w.h vr11, vr11, vr11 // vr10,vr11,vr12 corresponding to EPEL_EXTRA
+ vhaddw.w.h vr12, vr12, vr12
+.LOOP_HV4_\w:
+ add.d a2, a2, a3
+ fld.d f14, a2, 0 // height loop begin
+ vshuf.b vr14, vr14, vr14, vr0
+ vdp2.h.bu.b vr13, vr14, vr5
+ vhaddw.w.h vr13, vr13, vr13
+ vmul.w vr14, vr10, vr16 // EPEL_FILTER(tmp, MAX_PB_SIZE)
+ vmadd.w vr14, vr11, vr17
+ vmadd.w vr14, vr12, vr18
+ vmadd.w vr14, vr13, vr19
+ vaddi.wu vr10, vr11, 0 //back up previous value
+ vaddi.wu vr11, vr12, 0
+ vaddi.wu vr12, vr13, 0
+ vsrai.w vr14, vr14, 6 // >> 6
+ vmul.w vr14, vr14, vr1 // * wx
+ vadd.w vr14, vr14, vr2 // + offset
+ vsra.w vr14, vr14, vr3 // >> shift
+ vadd.w vr14, vr14, vr4 // + ox
+ vssrani.h.w vr14, vr14, 0
+ vssrani.bu.h vr14, vr14, 0 // clip
+ fst.s f14, a0, 0
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_HV4_\w
+.endm
+
+/*
+ * void FUNC(put_hevc_epel_uni_w_hv)(uint8_t *_dst, ptrdiff_t _dststride,
+ * const uint8_t *_src, ptrdiff_t _srcstride,
+ * int height, int denom, int wx, int ox,
+ * intptr_t mx, intptr_t my, int width)
+ */
+function ff_hevc_put_hevc_epel_uni_w_hv4_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 0 // mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr5, t1, t0 // ff_hevc_epel_filters[mx - 1];
+ vreplvei.w vr5, vr5, 0
+ ld.d t0, sp, 8 // my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ vldx vr6, t1, t0 // ff_hevc_epel_filters[my - 1];
+ vsllwil.h.b vr6, vr6, 0
+ vsllwil.w.h vr6, vr6, 0
+ vreplvei.w vr16, vr6, 0
+ vreplvei.w vr17, vr6, 1
+ vreplvei.w vr18, vr6, 2
+ vreplvei.w vr19, vr6, 3
+ la.local t1, shufb
+ vld vr0, t1, 0
+ sub.d a2, a2, a3 // src -= srcstride
+ addi.d a2, a2, -1
+ PUT_HEVC_EPEL_UNI_W_HV4_LSX 4
+endfunc
+
+.macro PUT_HEVC_EPEL_UNI_W_HV8_LSX w
+ vld vr7, a2, 0 // start to load src
+ vldx vr8, a2, a3
+ alsl.d a2, a3, a2, 1
+ vld vr9, a2, 0
+ vshuf.b vr10, vr7, vr7, vr0 // 0123 1234 2345 3456
+ vshuf.b vr11, vr8, vr8, vr0
+ vshuf.b vr12, vr9, vr9, vr0
+ vshuf.b vr7, vr7, vr7, vr22 // 4567 5678 6789 78910
+ vshuf.b vr8, vr8, vr8, vr22
+ vshuf.b vr9, vr9, vr9, vr22
+ vdp2.h.bu.b vr13, vr10, vr5 // EPEL_FILTER(src, 1)
+ vdp2.h.bu.b vr14, vr11, vr5
+ vdp2.h.bu.b vr15, vr12, vr5
+ vdp2.h.bu.b vr23, vr7, vr5
+ vdp2.h.bu.b vr20, vr8, vr5
+ vdp2.h.bu.b vr21, vr9, vr5
+ vhaddw.w.h vr7, vr13, vr13
+ vhaddw.w.h vr8, vr14, vr14
+ vhaddw.w.h vr9, vr15, vr15
+ vhaddw.w.h vr10, vr23, vr23
+ vhaddw.w.h vr11, vr20, vr20
+ vhaddw.w.h vr12, vr21, vr21
+.LOOP_HV8_HORI_\w:
+ add.d a2, a2, a3
+ vld vr15, a2, 0
+ vshuf.b vr23, vr15, vr15, vr0
+ vshuf.b vr15, vr15, vr15, vr22
+ vdp2.h.bu.b vr13, vr23, vr5
+ vdp2.h.bu.b vr14, vr15, vr5
+ vhaddw.w.h vr13, vr13, vr13 //789--13
+ vhaddw.w.h vr14, vr14, vr14 //101112--14
+ vmul.w vr15, vr7, vr16 //EPEL_FILTER(tmp, MAX_PB_SIZE)
+ vmadd.w vr15, vr8, vr17
+ vmadd.w vr15, vr9, vr18
+ vmadd.w vr15, vr13, vr19
+ vmul.w vr20, vr10, vr16
+ vmadd.w vr20, vr11, vr17
+ vmadd.w vr20, vr12, vr18
+ vmadd.w vr20, vr14, vr19
+ vaddi.wu vr7, vr8, 0 //back up previous value
+ vaddi.wu vr8, vr9, 0
+ vaddi.wu vr9, vr13, 0
+ vaddi.wu vr10, vr11, 0
+ vaddi.wu vr11, vr12, 0
+ vaddi.wu vr12, vr14, 0
+ vsrai.w vr15, vr15, 6 // >> 6
+ vsrai.w vr20, vr20, 6
+ vmul.w vr15, vr15, vr1 // * wx
+ vmul.w vr20, vr20, vr1
+ vadd.w vr15, vr15, vr2 // + offset
+ vadd.w vr20, vr20, vr2
+ vsra.w vr15, vr15, vr3 // >> shift
+ vsra.w vr20, vr20, vr3
+ vadd.w vr15, vr15, vr4 // + ox
+ vadd.w vr20, vr20, vr4
+ vssrani.h.w vr20, vr15, 0
+ vssrani.bu.h vr20, vr20, 0
+.if \w > 6
+ fst.d f20, a0, 0
+.else
+ fst.s f20, a0, 0
+ vstelm.h vr20, a0, 4, 2
+.endif
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_HV8_HORI_\w
+.endm
+
+.macro PUT_HEVC_EPEL_UNI_W_HV8_LASX w
+ vld vr7, a2, 0 // start to load src
+ vldx vr8, a2, a3
+ alsl.d a2, a3, a2, 1
+ vld vr9, a2, 0
+ xvreplve0.q xr7, xr7
+ xvreplve0.q xr8, xr8
+ xvreplve0.q xr9, xr9
+ xvshuf.b xr10, xr7, xr7, xr0 // 0123 1234 2345 3456
+ xvshuf.b xr11, xr8, xr8, xr0
+ xvshuf.b xr12, xr9, xr9, xr0
+ xvdp2.h.bu.b xr13, xr10, xr5 // EPEL_FILTER(src, 1)
+ xvdp2.h.bu.b xr14, xr11, xr5
+ xvdp2.h.bu.b xr15, xr12, xr5
+ xvhaddw.w.h xr7, xr13, xr13
+ xvhaddw.w.h xr8, xr14, xr14
+ xvhaddw.w.h xr9, xr15, xr15
+.LOOP_HV8_HORI_LASX_\w:
+ add.d a2, a2, a3
+ vld vr15, a2, 0
+ xvreplve0.q xr15, xr15
+ xvshuf.b xr23, xr15, xr15, xr0
+ xvdp2.h.bu.b xr10, xr23, xr5
+ xvhaddw.w.h xr10, xr10, xr10
+ xvmul.w xr15, xr7, xr16 //EPEL_FILTER(tmp, MAX_PB_SIZE)
+ xvmadd.w xr15, xr8, xr17
+ xvmadd.w xr15, xr9, xr18
+ xvmadd.w xr15, xr10, xr19
+ xvaddi.wu xr7, xr8, 0 //back up previous value
+ xvaddi.wu xr8, xr9, 0
+ xvaddi.wu xr9, xr10, 0
+ xvsrai.w xr15, xr15, 6 // >> 6
+ xvmul.w xr15, xr15, xr1 // * wx
+ xvadd.w xr15, xr15, xr2 // + offset
+ xvsra.w xr15, xr15, xr3 // >> shift
+ xvadd.w xr15, xr15, xr4 // + ox
+ xvpermi.q xr20, xr15, 0x01
+ vssrani.h.w vr20, vr15, 0
+ vssrani.bu.h vr20, vr20, 0
+.if \w > 6
+ fst.d f20, a0, 0
+.else
+ fst.s f20, a0, 0
+ vstelm.h vr20, a0, 4, 2
+.endif
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_HV8_HORI_LASX_\w
+.endm
+
+.macro PUT_HEVC_EPEL_UNI_W_HV16_LASX w
+ xvld xr7, a2, 0 // start to load src
+ xvldx xr8, a2, a3
+ alsl.d a2, a3, a2, 1
+ xvld xr9, a2, 0
+ xvpermi.d xr10, xr7, 0x09 //8..18
+ xvpermi.d xr11, xr8, 0x09
+ xvpermi.d xr12, xr9, 0x09
+ xvreplve0.q xr7, xr7
+ xvreplve0.q xr8, xr8
+ xvreplve0.q xr9, xr9
+ xvshuf.b xr13, xr7, xr7, xr0 // 0123 1234 2345 3456
+ xvshuf.b xr14, xr8, xr8, xr0
+ xvshuf.b xr15, xr9, xr9, xr0
+ xvdp2.h.bu.b xr20, xr13, xr5 // EPEL_FILTER(src, 1)
+ xvdp2.h.bu.b xr21, xr14, xr5
+ xvdp2.h.bu.b xr22, xr15, xr5
+ xvhaddw.w.h xr7, xr20, xr20
+ xvhaddw.w.h xr8, xr21, xr21
+ xvhaddw.w.h xr9, xr22, xr22
+ xvreplve0.q xr10, xr10
+ xvreplve0.q xr11, xr11
+ xvreplve0.q xr12, xr12
+ xvshuf.b xr13, xr10, xr10, xr0
+ xvshuf.b xr14, xr11, xr11, xr0
+ xvshuf.b xr15, xr12, xr12, xr0
+ xvdp2.h.bu.b xr20, xr13, xr5
+ xvdp2.h.bu.b xr21, xr14, xr5
+ xvdp2.h.bu.b xr22, xr15, xr5
+ xvhaddw.w.h xr10, xr20, xr20
+ xvhaddw.w.h xr11, xr21, xr21
+ xvhaddw.w.h xr12, xr22, xr22
+.LOOP_HV16_HORI_LASX_\w:
+ add.d a2, a2, a3
+ xvld xr15, a2, 0
+ xvpermi.d xr20, xr15, 0x09 //8...18
+ xvreplve0.q xr15, xr15
+ xvreplve0.q xr20, xr20
+ xvshuf.b xr21, xr15, xr15, xr0
+ xvshuf.b xr22, xr20, xr20, xr0
+ xvdp2.h.bu.b xr13, xr21, xr5
+ xvdp2.h.bu.b xr14, xr22, xr5
+ xvhaddw.w.h xr13, xr13, xr13
+ xvhaddw.w.h xr14, xr14, xr14
+ xvmul.w xr15, xr7, xr16 //EPEL_FILTER(tmp, MAX_PB_SIZE)
+ xvmadd.w xr15, xr8, xr17
+ xvmadd.w xr15, xr9, xr18
+ xvmadd.w xr15, xr13, xr19
+ xvmul.w xr20, xr10, xr16
+ xvmadd.w xr20, xr11, xr17
+ xvmadd.w xr20, xr12, xr18
+ xvmadd.w xr20, xr14, xr19
+ xvaddi.wu xr7, xr8, 0 //back up previous value
+ xvaddi.wu xr8, xr9, 0
+ xvaddi.wu xr9, xr13, 0
+ xvaddi.wu xr10, xr11, 0
+ xvaddi.wu xr11, xr12, 0
+ xvaddi.wu xr12, xr14, 0
+ xvsrai.w xr15, xr15, 6 // >> 6
+ xvsrai.w xr20, xr20, 6 // >> 6
+ xvmul.w xr15, xr15, xr1 // * wx
+ xvmul.w xr20, xr20, xr1 // * wx
+ xvadd.w xr15, xr15, xr2 // + offset
+ xvadd.w xr20, xr20, xr2 // + offset
+ xvsra.w xr15, xr15, xr3 // >> shift
+ xvsra.w xr20, xr20, xr3 // >> shift
+ xvadd.w xr15, xr15, xr4 // + ox
+ xvadd.w xr20, xr20, xr4 // + ox
+ xvssrani.h.w xr20, xr15, 0
+ xvpermi.q xr21, xr20, 0x01
+ vssrani.bu.h vr21, vr20, 0
+ vpermi.w vr21, vr21, 0xd8
+.if \w < 16
+ fst.d f21, a0, 0
+ vstelm.w vr21, a0, 8, 2
+.else
+ vst vr21, a0, 0
+.endif
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_HV16_HORI_LASX_\w
+.endm
+
+function ff_hevc_put_hevc_epel_uni_w_hv6_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 0 // mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr5, t1, t0 // ff_hevc_epel_filters[mx - 1];
+ vreplvei.w vr5, vr5, 0
+ ld.d t0, sp, 8 // my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ vldx vr6, t1, t0 // ff_hevc_epel_filters[my - 1];
+ vsllwil.h.b vr6, vr6, 0
+ vsllwil.w.h vr6, vr6, 0
+ vreplvei.w vr16, vr6, 0
+ vreplvei.w vr17, vr6, 1
+ vreplvei.w vr18, vr6, 2
+ vreplvei.w vr19, vr6, 3
+ la.local t1, shufb
+ vld vr0, t1, 0
+ vaddi.bu vr22, vr0, 4 // update shufb to get high part
+ sub.d a2, a2, a3 // src -= srcstride
+ addi.d a2, a2, -1
+ PUT_HEVC_EPEL_UNI_W_HV8_LSX 6
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_hv6_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 0 // mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr5, t1, t0 // ff_hevc_epel_filters[mx - 1];
+ xvreplve0.w xr5, xr5
+ ld.d t0, sp, 8 // my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ vldx vr6, t1, t0 // ff_hevc_epel_filters[my - 1];
+ vsllwil.h.b vr6, vr6, 0
+ vsllwil.w.h vr6, vr6, 0
+ xvreplve0.q xr6, xr6
+ xvrepl128vei.w xr16, xr6, 0
+ xvrepl128vei.w xr17, xr6, 1
+ xvrepl128vei.w xr18, xr6, 2
+ xvrepl128vei.w xr19, xr6, 3
+ la.local t1, shufb
+ xvld xr0, t1, 0
+ sub.d a2, a2, a3 // src -= srcstride
+ addi.d a2, a2, -1
+ PUT_HEVC_EPEL_UNI_W_HV8_LASX 6
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_hv8_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 0 // mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr5, t1, t0 // ff_hevc_epel_filters[mx - 1];
+ vreplvei.w vr5, vr5, 0
+ ld.d t0, sp, 8 // my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ vldx vr6, t1, t0 // ff_hevc_epel_filters[my - 1];
+ vsllwil.h.b vr6, vr6, 0
+ vsllwil.w.h vr6, vr6, 0
+ vreplvei.w vr16, vr6, 0
+ vreplvei.w vr17, vr6, 1
+ vreplvei.w vr18, vr6, 2
+ vreplvei.w vr19, vr6, 3
+ la.local t1, shufb
+ vld vr0, t1, 0
+ vaddi.bu vr22, vr0, 4 // update shufb to get high part
+ sub.d a2, a2, a3 // src -= srcstride
+ addi.d a2, a2, -1
+ PUT_HEVC_EPEL_UNI_W_HV8_LSX 8
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_hv8_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 0 // mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr5, t1, t0 // ff_hevc_epel_filters[mx - 1];
+ xvreplve0.w xr5, xr5
+ ld.d t0, sp, 8 // my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ vldx vr6, t1, t0 // ff_hevc_epel_filters[my - 1];
+ vsllwil.h.b vr6, vr6, 0
+ vsllwil.w.h vr6, vr6, 0
+ xvreplve0.q xr6, xr6
+ xvrepl128vei.w xr16, xr6, 0
+ xvrepl128vei.w xr17, xr6, 1
+ xvrepl128vei.w xr18, xr6, 2
+ xvrepl128vei.w xr19, xr6, 3
+ la.local t1, shufb
+ xvld xr0, t1, 0
+ sub.d a2, a2, a3 // src -= srcstride
+ addi.d a2, a2, -1
+ PUT_HEVC_EPEL_UNI_W_HV8_LASX 8
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_hv12_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 0 // mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr5, t1, t0 // ff_hevc_epel_filters[mx - 1];
+ vreplvei.w vr5, vr5, 0
+ ld.d t0, sp, 8 // my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ vldx vr6, t1, t0 // ff_hevc_epel_filters[my - 1];
+ vsllwil.h.b vr6, vr6, 0
+ vsllwil.w.h vr6, vr6, 0
+ vreplvei.w vr16, vr6, 0
+ vreplvei.w vr17, vr6, 1
+ vreplvei.w vr18, vr6, 2
+ vreplvei.w vr19, vr6, 3
+ la.local t1, shufb
+ vld vr0, t1, 0
+ vaddi.bu vr22, vr0, 4 // update shufb to get high part
+ sub.d a2, a2, a3 // src -= srcstride
+ addi.d a2, a2, -1
+ addi.d t2, a0, 0
+ addi.d t3, a2, 0
+ addi.d t4, a4, 0
+ PUT_HEVC_EPEL_UNI_W_HV8_LSX 12
+ addi.d a0, t2, 8
+ addi.d a2, t3, 8
+ addi.d a4, t4, 0
+ PUT_HEVC_EPEL_UNI_W_HV4_LSX 12
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_hv12_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 0 // mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr5, t1, t0 // ff_hevc_epel_filters[mx - 1];
+ xvreplve0.w xr5, xr5
+ ld.d t0, sp, 8 // my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ vldx vr6, t1, t0 // ff_hevc_epel_filters[my - 1];
+ vsllwil.h.b vr6, vr6, 0
+ vsllwil.w.h vr6, vr6, 0
+ xvreplve0.q xr6, xr6
+ xvrepl128vei.w xr16, xr6, 0
+ xvrepl128vei.w xr17, xr6, 1
+ xvrepl128vei.w xr18, xr6, 2
+ xvrepl128vei.w xr19, xr6, 3
+ la.local t1, shufb
+ xvld xr0, t1, 0
+ sub.d a2, a2, a3 // src -= srcstride
+ addi.d a2, a2, -1
+ PUT_HEVC_EPEL_UNI_W_HV16_LASX 12
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_hv16_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 0 // mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr5, t1, t0 // ff_hevc_epel_filters[mx - 1];
+ vreplvei.w vr5, vr5, 0
+ ld.d t0, sp, 8 // my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ vldx vr6, t1, t0 // ff_hevc_epel_filters[my - 1];
+ vsllwil.h.b vr6, vr6, 0
+ vsllwil.w.h vr6, vr6, 0
+ vreplvei.w vr16, vr6, 0
+ vreplvei.w vr17, vr6, 1
+ vreplvei.w vr18, vr6, 2
+ vreplvei.w vr19, vr6, 3
+ la.local t1, shufb
+ vld vr0, t1, 0
+ vaddi.bu vr22, vr0, 4 // update shufb to get high part
+ sub.d a2, a2, a3 // src -= srcstride
+ addi.d a2, a2, -1
+ addi.d t2, a0, 0
+ addi.d t3, a2, 0
+ addi.d t4, a4, 0
+ addi.d t5, zero, 2
+.LOOP_HV16:
+ PUT_HEVC_EPEL_UNI_W_HV8_LSX 16
+ addi.d a0, t2, 8
+ addi.d a2, t3, 8
+ addi.d a4, t4, 0
+ addi.d t5, t5, -1
+ bnez t5, .LOOP_HV16
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_hv16_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 0 // mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr5, t1, t0 // ff_hevc_epel_filters[mx - 1];
+ xvreplve0.w xr5, xr5
+ ld.d t0, sp, 8 // my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ vldx vr6, t1, t0 // ff_hevc_epel_filters[my - 1];
+ vsllwil.h.b vr6, vr6, 0
+ vsllwil.w.h vr6, vr6, 0
+ xvreplve0.q xr6, xr6
+ xvrepl128vei.w xr16, xr6, 0
+ xvrepl128vei.w xr17, xr6, 1
+ xvrepl128vei.w xr18, xr6, 2
+ xvrepl128vei.w xr19, xr6, 3
+ la.local t1, shufb
+ xvld xr0, t1, 0
+ sub.d a2, a2, a3 // src -= srcstride
+ addi.d a2, a2, -1
+ PUT_HEVC_EPEL_UNI_W_HV16_LASX 16
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_hv24_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 0 // mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr5, t1, t0 // ff_hevc_epel_filters[mx - 1];
+ vreplvei.w vr5, vr5, 0
+ ld.d t0, sp, 8 // my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ vldx vr6, t1, t0 // ff_hevc_epel_filters[my - 1];
+ vsllwil.h.b vr6, vr6, 0
+ vsllwil.w.h vr6, vr6, 0
+ vreplvei.w vr16, vr6, 0
+ vreplvei.w vr17, vr6, 1
+ vreplvei.w vr18, vr6, 2
+ vreplvei.w vr19, vr6, 3
+ la.local t1, shufb
+ vld vr0, t1, 0
+ vaddi.bu vr22, vr0, 4 // update shufb to get high part
+ sub.d a2, a2, a3 // src -= srcstride
+ addi.d a2, a2, -1
+ addi.d t2, a0, 0
+ addi.d t3, a2, 0
+ addi.d t4, a4, 0
+ addi.d t5, zero, 3
+.LOOP_HV24:
+ PUT_HEVC_EPEL_UNI_W_HV8_LSX 24
+ addi.d a0, t2, 8
+ addi.d t2, t2, 8
+ addi.d a2, t3, 8
+ addi.d t3, t3, 8
+ addi.d a4, t4, 0
+ addi.d t5, t5, -1
+ bnez t5, .LOOP_HV24
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_hv24_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 0 // mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr5, t1, t0 // ff_hevc_epel_filters[mx - 1];
+ xvreplve0.w xr5, xr5
+ ld.d t0, sp, 8 // my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ vldx vr6, t1, t0 // ff_hevc_epel_filters[my - 1];
+ vsllwil.h.b vr6, vr6, 0
+ vsllwil.w.h vr6, vr6, 0
+ xvreplve0.q xr6, xr6
+ xvrepl128vei.w xr16, xr6, 0
+ xvrepl128vei.w xr17, xr6, 1
+ xvrepl128vei.w xr18, xr6, 2
+ xvrepl128vei.w xr19, xr6, 3
+ la.local t1, shufb
+ xvld xr0, t1, 0
+ sub.d a2, a2, a3 // src -= srcstride
+ addi.d a2, a2, -1
+ addi.d t2, a0, 0
+ addi.d t3, a2, 0
+ addi.d t4, a4, 0
+ PUT_HEVC_EPEL_UNI_W_HV16_LASX 24
+ addi.d a0, t2, 16
+ addi.d a2, t3, 16
+ addi.d a4, t4, 0
+ PUT_HEVC_EPEL_UNI_W_HV8_LASX 24
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_hv32_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 0 // mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr5, t1, t0 // ff_hevc_epel_filters[mx - 1];
+ vreplvei.w vr5, vr5, 0
+ ld.d t0, sp, 8 // my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ vldx vr6, t1, t0 // ff_hevc_epel_filters[my - 1];
+ vsllwil.h.b vr6, vr6, 0
+ vsllwil.w.h vr6, vr6, 0
+ vreplvei.w vr16, vr6, 0
+ vreplvei.w vr17, vr6, 1
+ vreplvei.w vr18, vr6, 2
+ vreplvei.w vr19, vr6, 3
+ la.local t1, shufb
+ vld vr0, t1, 0
+ vaddi.bu vr22, vr0, 4 // update shufb to get high part
+ sub.d a2, a2, a3 // src -= srcstride
+ addi.d a2, a2, -1
+ addi.d t2, a0, 0
+ addi.d t3, a2, 0
+ addi.d t4, a4, 0
+ addi.d t5, zero, 4
+.LOOP_HV32:
+ PUT_HEVC_EPEL_UNI_W_HV8_LSX 32
+ addi.d a0, t2, 8
+ addi.d t2, t2, 8
+ addi.d a2, t3, 8
+ addi.d t3, t3, 8
+ addi.d a4, t4, 0
+ addi.d t5, t5, -1
+ bnez t5, .LOOP_HV32
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_hv32_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 0 // mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr5, t1, t0 // ff_hevc_epel_filters[mx - 1];
+ xvreplve0.w xr5, xr5
+ ld.d t0, sp, 8 // my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ vldx vr6, t1, t0 // ff_hevc_epel_filters[my - 1];
+ vsllwil.h.b vr6, vr6, 0
+ vsllwil.w.h vr6, vr6, 0
+ xvreplve0.q xr6, xr6
+ xvrepl128vei.w xr16, xr6, 0
+ xvrepl128vei.w xr17, xr6, 1
+ xvrepl128vei.w xr18, xr6, 2
+ xvrepl128vei.w xr19, xr6, 3
+ la.local t1, shufb
+ xvld xr0, t1, 0
+ sub.d a2, a2, a3 // src -= srcstride
+ addi.d a2, a2, -1
+ addi.d t2, a0, 0
+ addi.d t3, a2, 0
+ addi.d t4, a4, 0
+ addi.d t5, zero, 2
+.LOOP_HV32_LASX:
+ PUT_HEVC_EPEL_UNI_W_HV16_LASX 32
+ addi.d a0, t2, 16
+ addi.d t2, t2, 16
+ addi.d a2, t3, 16
+ addi.d t3, t3, 16
+ addi.d a4, t4, 0
+ addi.d t5, t5, -1
+ bnez t5, .LOOP_HV32_LASX
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_hv48_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 0 // mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr5, t1, t0 // ff_hevc_epel_filters[mx - 1];
+ vreplvei.w vr5, vr5, 0
+ ld.d t0, sp, 8 // my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ vldx vr6, t1, t0 // ff_hevc_epel_filters[my - 1];
+ vsllwil.h.b vr6, vr6, 0
+ vsllwil.w.h vr6, vr6, 0
+ vreplvei.w vr16, vr6, 0
+ vreplvei.w vr17, vr6, 1
+ vreplvei.w vr18, vr6, 2
+ vreplvei.w vr19, vr6, 3
+ la.local t1, shufb
+ vld vr0, t1, 0
+ vaddi.bu vr22, vr0, 4 // update shufb to get high part
+ sub.d a2, a2, a3 // src -= srcstride
+ addi.d a2, a2, -1
+ addi.d t2, a0, 0
+ addi.d t3, a2, 0
+ addi.d t4, a4, 0
+ addi.d t5, zero, 6
+.LOOP_HV48:
+ PUT_HEVC_EPEL_UNI_W_HV8_LSX 48
+ addi.d a0, t2, 8
+ addi.d t2, t2, 8
+ addi.d a2, t3, 8
+ addi.d t3, t3, 8
+ addi.d a4, t4, 0
+ addi.d t5, t5, -1
+ bnez t5, .LOOP_HV48
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_hv48_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 0 // mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr5, t1, t0 // ff_hevc_epel_filters[mx - 1];
+ xvreplve0.w xr5, xr5
+ ld.d t0, sp, 8 // my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ vldx vr6, t1, t0 // ff_hevc_epel_filters[my - 1];
+ vsllwil.h.b vr6, vr6, 0
+ vsllwil.w.h vr6, vr6, 0
+ xvreplve0.q xr6, xr6
+ xvrepl128vei.w xr16, xr6, 0
+ xvrepl128vei.w xr17, xr6, 1
+ xvrepl128vei.w xr18, xr6, 2
+ xvrepl128vei.w xr19, xr6, 3
+ la.local t1, shufb
+ xvld xr0, t1, 0
+ sub.d a2, a2, a3 // src -= srcstride
+ addi.d a2, a2, -1
+ addi.d t2, a0, 0
+ addi.d t3, a2, 0
+ addi.d t4, a4, 0
+ addi.d t5, zero, 3
+.LOOP_HV48_LASX:
+ PUT_HEVC_EPEL_UNI_W_HV16_LASX 48
+ addi.d a0, t2, 16
+ addi.d t2, t2, 16
+ addi.d a2, t3, 16
+ addi.d t3, t3, 16
+ addi.d a4, t4, 0
+ addi.d t5, t5, -1
+ bnez t5, .LOOP_HV48_LASX
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_hv64_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 0 // mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr5, t1, t0 // ff_hevc_epel_filters[mx - 1];
+ vreplvei.w vr5, vr5, 0
+ ld.d t0, sp, 8 // my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ vldx vr6, t1, t0 // ff_hevc_epel_filters[my - 1];
+ vsllwil.h.b vr6, vr6, 0
+ vsllwil.w.h vr6, vr6, 0
+ vreplvei.w vr16, vr6, 0
+ vreplvei.w vr17, vr6, 1
+ vreplvei.w vr18, vr6, 2
+ vreplvei.w vr19, vr6, 3
+ la.local t1, shufb
+ vld vr0, t1, 0
+ vaddi.bu vr22, vr0, 4 // update shufb to get high part
+ sub.d a2, a2, a3 // src -= srcstride
+ addi.d a2, a2, -1
+ addi.d t2, a0, 0
+ addi.d t3, a2, 0
+ addi.d t4, a4, 0
+ addi.d t5, zero, 8
+.LOOP_HV64:
+ PUT_HEVC_EPEL_UNI_W_HV8_LSX 64
+ addi.d a0, t2, 8
+ addi.d t2, t2, 8
+ addi.d a2, t3, 8
+ addi.d t3, t3, 8
+ addi.d a4, t4, 0
+ addi.d t5, t5, -1
+ bnez t5, .LOOP_HV64
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_hv64_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 0 // mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr5, t1, t0 // ff_hevc_epel_filters[mx - 1];
+ xvreplve0.w xr5, xr5
+ ld.d t0, sp, 8 // my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ vldx vr6, t1, t0 // ff_hevc_epel_filters[my - 1];
+ vsllwil.h.b vr6, vr6, 0
+ vsllwil.w.h vr6, vr6, 0
+ xvreplve0.q xr6, xr6
+ xvrepl128vei.w xr16, xr6, 0
+ xvrepl128vei.w xr17, xr6, 1
+ xvrepl128vei.w xr18, xr6, 2
+ xvrepl128vei.w xr19, xr6, 3
+ la.local t1, shufb
+ xvld xr0, t1, 0
+ sub.d a2, a2, a3 // src -= srcstride
+ addi.d a2, a2, -1
+ addi.d t2, a0, 0
+ addi.d t3, a2, 0
+ addi.d t4, a4, 0
+ addi.d t5, zero, 4
+.LOOP_HV64_LASX:
+ PUT_HEVC_EPEL_UNI_W_HV16_LASX 64
+ addi.d a0, t2, 16
+ addi.d t2, t2, 16
+ addi.d a2, t3, 16
+ addi.d t3, t3, 16
+ addi.d a4, t4, 0
+ addi.d t5, t5, -1
+ bnez t5, .LOOP_HV64_LASX
+endfunc
diff --git a/libavcodec/loongarch/hevcdsp_init_loongarch.c b/libavcodec/loongarch/hevcdsp_init_loongarch.c
index 3cdb3fb2d7..245a833947 100644
--- a/libavcodec/loongarch/hevcdsp_init_loongarch.c
+++ b/libavcodec/loongarch/hevcdsp_init_loongarch.c
@@ -171,6 +171,16 @@ void ff_hevc_dsp_init_loongarch(HEVCDSPContext *c, const int bit_depth)
c->put_hevc_qpel_uni_w[8][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels48_8_lsx;
c->put_hevc_qpel_uni_w[9][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels64_8_lsx;
+ c->put_hevc_epel_uni_w[1][1][1] = ff_hevc_put_hevc_epel_uni_w_hv4_8_lsx;
+ c->put_hevc_epel_uni_w[2][1][1] = ff_hevc_put_hevc_epel_uni_w_hv6_8_lsx;
+ c->put_hevc_epel_uni_w[3][1][1] = ff_hevc_put_hevc_epel_uni_w_hv8_8_lsx;
+ c->put_hevc_epel_uni_w[4][1][1] = ff_hevc_put_hevc_epel_uni_w_hv12_8_lsx;
+ c->put_hevc_epel_uni_w[5][1][1] = ff_hevc_put_hevc_epel_uni_w_hv16_8_lsx;
+ c->put_hevc_epel_uni_w[6][1][1] = ff_hevc_put_hevc_epel_uni_w_hv24_8_lsx;
+ c->put_hevc_epel_uni_w[7][1][1] = ff_hevc_put_hevc_epel_uni_w_hv32_8_lsx;
+ c->put_hevc_epel_uni_w[8][1][1] = ff_hevc_put_hevc_epel_uni_w_hv48_8_lsx;
+ c->put_hevc_epel_uni_w[9][1][1] = ff_hevc_put_hevc_epel_uni_w_hv64_8_lsx;
+
c->put_hevc_epel_uni_w[1][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels4_8_lsx;
c->put_hevc_epel_uni_w[2][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels6_8_lsx;
c->put_hevc_epel_uni_w[3][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels8_8_lsx;
@@ -258,6 +268,15 @@ void ff_hevc_dsp_init_loongarch(HEVCDSPContext *c, const int bit_depth)
c->put_hevc_epel_uni_w[8][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels48_8_lasx;
c->put_hevc_epel_uni_w[9][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels64_8_lasx;
+ c->put_hevc_epel_uni_w[2][1][1] = ff_hevc_put_hevc_epel_uni_w_hv6_8_lasx;
+ c->put_hevc_epel_uni_w[3][1][1] = ff_hevc_put_hevc_epel_uni_w_hv8_8_lasx;
+ c->put_hevc_epel_uni_w[4][1][1] = ff_hevc_put_hevc_epel_uni_w_hv12_8_lasx;
+ c->put_hevc_epel_uni_w[5][1][1] = ff_hevc_put_hevc_epel_uni_w_hv16_8_lasx;
+ c->put_hevc_epel_uni_w[6][1][1] = ff_hevc_put_hevc_epel_uni_w_hv24_8_lasx;
+ c->put_hevc_epel_uni_w[7][1][1] = ff_hevc_put_hevc_epel_uni_w_hv32_8_lasx;
+ c->put_hevc_epel_uni_w[8][1][1] = ff_hevc_put_hevc_epel_uni_w_hv48_8_lasx;
+ c->put_hevc_epel_uni_w[9][1][1] = ff_hevc_put_hevc_epel_uni_w_hv64_8_lasx;
+
c->put_hevc_qpel_uni_w[3][1][0] = ff_hevc_put_hevc_qpel_uni_w_v8_8_lasx;
c->put_hevc_qpel_uni_w[4][1][0] = ff_hevc_put_hevc_qpel_uni_w_v12_8_lasx;
c->put_hevc_qpel_uni_w[5][1][0] = ff_hevc_put_hevc_qpel_uni_w_v16_8_lasx;
diff --git a/libavcodec/loongarch/hevcdsp_lasx.h b/libavcodec/loongarch/hevcdsp_lasx.h
index 8a9266d375..7f09d0943a 100644
--- a/libavcodec/loongarch/hevcdsp_lasx.h
+++ b/libavcodec/loongarch/hevcdsp_lasx.h
@@ -66,6 +66,15 @@ PEL_UNI_W(qpel, h, 32);
PEL_UNI_W(qpel, h, 48);
PEL_UNI_W(qpel, h, 64);
+PEL_UNI_W(epel, hv, 6);
+PEL_UNI_W(epel, hv, 8);
+PEL_UNI_W(epel, hv, 12);
+PEL_UNI_W(epel, hv, 16);
+PEL_UNI_W(epel, hv, 24);
+PEL_UNI_W(epel, hv, 32);
+PEL_UNI_W(epel, hv, 48);
+PEL_UNI_W(epel, hv, 64);
+
#undef PEL_UNI_W
#endif // #ifndef AVCODEC_LOONGARCH_HEVCDSP_LASX_H
diff --git a/libavcodec/loongarch/hevcdsp_lsx.h b/libavcodec/loongarch/hevcdsp_lsx.h
index 3291294ed9..7769cf25ae 100644
--- a/libavcodec/loongarch/hevcdsp_lsx.h
+++ b/libavcodec/loongarch/hevcdsp_lsx.h
@@ -277,6 +277,16 @@ PEL_UNI_W(qpel, h, 32);
PEL_UNI_W(qpel, h, 48);
PEL_UNI_W(qpel, h, 64);
+PEL_UNI_W(epel, hv, 4);
+PEL_UNI_W(epel, hv, 6);
+PEL_UNI_W(epel, hv, 8);
+PEL_UNI_W(epel, hv, 12);
+PEL_UNI_W(epel, hv, 16);
+PEL_UNI_W(epel, hv, 24);
+PEL_UNI_W(epel, hv, 32);
+PEL_UNI_W(epel, hv, 48);
+PEL_UNI_W(epel, hv, 64);
+
#undef PEL_UNI_W
#endif // #ifndef AVCODEC_LOONGARCH_HEVCDSP_LSX_H
--
2.20.1
* [FFmpeg-devel] [PATCH v2 6/7] avcodec/hevc: Add asm opt for the following functions
2023-12-27 4:50 [FFmpeg-devel] [PATCH v2] [loongarch] Add hevc 128-bit & 256-bit asm optimizatons jinbo
` (4 preceding siblings ...)
2023-12-27 4:50 ` [FFmpeg-devel] [PATCH v2 5/7] avcodec/hevc: Add epel_uni_w_hv4/6/8/12/16/24/32/48/64 " jinbo
@ 2023-12-27 4:50 ` jinbo
2023-12-27 4:50 ` [FFmpeg-devel] [PATCH v2 7/7] avcodec/hevc: Add ff_hevc_idct_32x32_lasx asm opt jinbo
6 siblings, 0 replies; 10+ messages in thread
From: jinbo @ 2023-12-27 4:50 UTC (permalink / raw)
To: ffmpeg-devel; +Cc: jinbo
tests/checkasm/checkasm: C LSX LASX
put_hevc_qpel_uni_h4_8_c: 5.7 1.2
put_hevc_qpel_uni_h6_8_c: 12.2 2.7
put_hevc_qpel_uni_h8_8_c: 21.5 3.2
put_hevc_qpel_uni_h12_8_c: 47.2 9.2 7.2
put_hevc_qpel_uni_h16_8_c: 87.0 11.7 9.0
put_hevc_qpel_uni_h24_8_c: 188.2 27.5 21.0
put_hevc_qpel_uni_h32_8_c: 335.2 46.7 28.5
put_hevc_qpel_uni_h48_8_c: 772.5 104.5 65.2
put_hevc_qpel_uni_h64_8_c: 1383.2 142.2 109.0
put_hevc_epel_uni_w_v4_8_c: 5.0 1.5
put_hevc_epel_uni_w_v6_8_c: 10.7 3.5 2.5
put_hevc_epel_uni_w_v8_8_c: 18.2 3.7 3.0
put_hevc_epel_uni_w_v12_8_c: 40.2 10.7 7.5
put_hevc_epel_uni_w_v16_8_c: 70.2 13.0 9.2
put_hevc_epel_uni_w_v24_8_c: 158.2 30.2 22.5
put_hevc_epel_uni_w_v32_8_c: 281.0 52.0 36.5
put_hevc_epel_uni_w_v48_8_c: 631.7 116.7 82.7
put_hevc_epel_uni_w_v64_8_c: 1108.2 207.5 142.2
put_hevc_epel_uni_w_h4_8_c: 4.7 1.2
put_hevc_epel_uni_w_h6_8_c: 9.7 3.5 2.7
put_hevc_epel_uni_w_h8_8_c: 17.2 4.2 3.5
put_hevc_epel_uni_w_h12_8_c: 38.0 11.5 7.2
put_hevc_epel_uni_w_h16_8_c: 69.2 14.5 9.2
put_hevc_epel_uni_w_h24_8_c: 152.0 34.7 22.5
put_hevc_epel_uni_w_h32_8_c: 271.0 58.0 40.0
put_hevc_epel_uni_w_h48_8_c: 597.5 136.7 95.0
put_hevc_epel_uni_w_h64_8_c: 1074.0 252.2 168.0
put_hevc_epel_bi_h4_8_c: 4.5 0.7
put_hevc_epel_bi_h6_8_c: 9.0 1.5
put_hevc_epel_bi_h8_8_c: 15.2 1.7
put_hevc_epel_bi_h12_8_c: 33.5 4.2 3.7
put_hevc_epel_bi_h16_8_c: 59.7 5.2 4.7
put_hevc_epel_bi_h24_8_c: 132.2 11.0
put_hevc_epel_bi_h32_8_c: 232.7 20.2 13.2
put_hevc_epel_bi_h48_8_c: 521.7 45.2 31.2
put_hevc_epel_bi_h64_8_c: 949.0 71.5 51.0
After this patch, the performance of decoding H265 4K 30FPS
30Mbps on 3A6000 with 8 threads improves by 1fps (55fps-->56fps).
Change-Id: I8cc1e41daa63ca478039bc55d1ee8934a7423f51
---
libavcodec/loongarch/hevc_mc.S | 1991 ++++++++++++++++-
libavcodec/loongarch/hevcdsp_init_loongarch.c | 66 +
libavcodec/loongarch/hevcdsp_lasx.h | 54 +
libavcodec/loongarch/hevcdsp_lsx.h | 36 +-
4 files changed, 2144 insertions(+), 3 deletions(-)
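
Among the functions listed above, the qpel_uni_h family is the simplest: an 8-tap
horizontal filter followed by the unweighted "+ 32, >> 6, clip" rounding that the
asm comments show. A minimal 8-bit C sketch under the same caveat as before
(qpel_filters[] and the helper name are placeholders, not the FFmpeg reference):

    #include <stddef.h>
    #include <stdint.h>

    static void qpel_uni_h_ref(uint8_t *dst, ptrdiff_t dststride,
                               const uint8_t *src, ptrdiff_t srcstride,
                               int height, int mx, int width,
                               const int8_t qpel_filters[3][8])
    {
        const int8_t *f = qpel_filters[mx - 1];

        src -= 3;                                /* same as "addi.d a2, a2, -3" */
        for (int y = 0; y < height; y++, src += srcstride, dst += dststride)
            for (int x = 0; x < width; x++) {
                int v = 0;
                for (int k = 0; k < 8; k++)      /* 8-tap QPEL_FILTER(src, 1) */
                    v += f[k] * src[x + k];
                v = (v + 32) >> 6;               /* round and shift, as in the asm */
                dst[x] = v < 0 ? 0 : v > 255 ? 255 : v;
            }
    }
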
diff --git a/libavcodec/loongarch/hevc_mc.S b/libavcodec/loongarch/hevc_mc.S
index 0b0647546b..a0e5938fbd 100644
--- a/libavcodec/loongarch/hevc_mc.S
+++ b/libavcodec/loongarch/hevc_mc.S
@@ -1784,8 +1784,12 @@ function ff_hevc_put_hevc_qpel_uni_w_h64_8_lasx
endfunc
const shufb
- .byte 0,1,2,3, 1,2,3,4 ,2,3,4,5, 3,4,5,6
- .byte 4,5,6,7, 5,6,7,8 ,6,7,8,9, 7,8,9,10
+ .byte 0,1,2,3, 1,2,3,4 ,2,3,4,5, 3,4,5,6 //mask for epel_uni_w(128-bit)
+ .byte 4,5,6,7, 5,6,7,8 ,6,7,8,9, 7,8,9,10 //mask for epel_uni_w(256-bit)
+ .byte 0,1,2,3, 4,5,6,7 ,1,2,3,4, 5,6,7,8 //mask for qpel_uni_h4
+ .byte 0,1,1,2, 2,3,3,4 ,4,5,5,6, 6,7,7,8 //mask for qpel_uni_h/v6/8...
+ .byte 0,1,2,3, 1,2,3,4 ,2,3,4,5, 3,4,5,6, 4,5,6,7, 5,6,7,8, 6,7,8,9, 7,8,9,10 //epel_uni_w_h16/24/32/48/64
+ .byte 0,1,1,2, 2,3,3,4 ,4,5,5,6, 6,7,7,8, 0,1,1,2, 2,3,3,4 ,4,5,5,6, 6,7,7,8 //mask for bi_epel_h16/24/32/48/64
endconst
.macro PUT_HEVC_EPEL_UNI_W_HV4_LSX w
@@ -2584,3 +2588,1986 @@ function ff_hevc_put_hevc_epel_uni_w_hv64_8_lasx
addi.d t5, t5, -1
bnez t5, .LOOP_HV64_LASX
endfunc
+
+/*
+ * void FUNC(put_hevc_qpel_uni_h)(uint8_t *_dst, ptrdiff_t _dststride,
+ * const uint8_t *_src, ptrdiff_t _srcstride,
+ * int height, intptr_t mx, intptr_t my,
+ * int width)
+ */
+function ff_hevc_put_hevc_uni_qpel_h4_8_lsx
+ addi.d t0, a5, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ addi.d a2, a2, -3 //src -= 3
+ addi.w t1, zero, 32
+ vreplgr2vr.h vr1, t1
+ la.local t1, shufb
+ vld vr2, t1, 32 //mask0 0 1
+ vaddi.bu vr3, vr2, 2 //mask1 2 3
+.LOOP_UNI_H4:
+ vld vr18, a2, 0
+ vldx vr19, a2, a3
+ alsl.d a2, a3, a2, 1
+ vshuf.b vr6, vr18, vr18, vr2
+ vshuf.b vr7, vr18, vr18, vr3
+ vshuf.b vr8, vr19, vr19, vr2
+ vshuf.b vr9, vr19, vr19, vr3
+ vdp2.h.bu.b vr10, vr6, vr5
+ vdp2.h.bu.b vr11, vr7, vr5
+ vdp2.h.bu.b vr12, vr8, vr5
+ vdp2.h.bu.b vr13, vr9, vr5
+ vhaddw.d.h vr10
+ vhaddw.d.h vr11
+ vhaddw.d.h vr12
+ vhaddw.d.h vr13
+ vpickev.w vr10, vr11, vr10
+ vpickev.w vr11, vr13, vr12
+ vpickev.h vr10, vr11, vr10
+ vadd.h vr10, vr10, vr1
+ vsrai.h vr10, vr10, 6
+ vssrani.bu.h vr10, vr10, 0
+ fst.s f10, a0, 0
+ vbsrl.v vr10, vr10, 4
+ fstx.s f10, a0, a1
+ alsl.d a0, a1, a0, 1
+ addi.d a4, a4, -2
+ bnez a4, .LOOP_UNI_H4
+endfunc
+
+.macro HEVC_UNI_QPEL_H8_LSX in0, out0
+ vshuf.b vr10, \in0, \in0, vr5
+ vshuf.b vr11, \in0, \in0, vr6
+ vshuf.b vr12, \in0, \in0, vr7
+ vshuf.b vr13, \in0, \in0, vr8
+ vdp2.h.bu.b \out0, vr10, vr0 // QPEL_FILTER(src, 1)
+ vdp2add.h.bu.b \out0, vr11, vr1
+ vdp2add.h.bu.b \out0, vr12, vr2
+ vdp2add.h.bu.b \out0, vr13, vr3
+ vadd.h \out0, \out0, vr4
+ vsrai.h \out0, \out0, 6
+.endm
+
+.macro HEVC_UNI_QPEL_H16_LASX in0, out0
+ xvshuf.b xr10, \in0, \in0, xr5
+ xvshuf.b xr11, \in0, \in0, xr6
+ xvshuf.b xr12, \in0, \in0, xr7
+ xvshuf.b xr13, \in0, \in0, xr8
+ xvdp2.h.bu.b \out0, xr10, xr0 // QPEL_FILTER(src, 1)
+ xvdp2add.h.bu.b \out0, xr11, xr1
+ xvdp2add.h.bu.b \out0, xr12, xr2
+ xvdp2add.h.bu.b \out0, xr13, xr3
+ xvadd.h \out0, \out0, xr4
+ xvsrai.h \out0, \out0, 6
+.endm
+
+function ff_hevc_put_hevc_uni_qpel_h6_8_lsx
+ addi.d t0, a5, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr0, t1, t0 //filter abcdefgh
+ vreplvei.h vr1, vr0, 1 //cd...
+ vreplvei.h vr2, vr0, 2 //ef...
+ vreplvei.h vr3, vr0, 3 //gh...
+ vreplvei.h vr0, vr0, 0 //ab...
+ addi.d a2, a2, -3 //src -= 3
+ addi.w t1, zero, 32
+ vreplgr2vr.h vr4, t1
+ la.local t1, shufb
+ vld vr5, t1, 48
+ vaddi.bu vr6, vr5, 2
+ vaddi.bu vr7, vr5, 4
+ vaddi.bu vr8, vr5, 6
+.LOOP_UNI_H6:
+ vld vr9, a2, 0
+ add.d a2, a2, a3
+ HEVC_UNI_QPEL_H8_LSX vr9, vr14
+ vssrani.bu.h vr14, vr14, 0
+ fst.s f14, a0, 0
+ vstelm.h vr14, a0, 4, 2
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_H6
+endfunc
+
+function ff_hevc_put_hevc_uni_qpel_h8_8_lsx
+ addi.d t0, a5, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr0, t1, t0 //filter abcdefgh
+ vreplvei.h vr1, vr0, 1 //cd...
+ vreplvei.h vr2, vr0, 2 //ef...
+ vreplvei.h vr3, vr0, 3 //gh...
+ vreplvei.h vr0, vr0, 0 //ab...
+ addi.d a2, a2, -3 //src -= 3
+ addi.w t1, zero, 32
+ vreplgr2vr.h vr4, t1
+ la.local t1, shufb
+ vld vr5, t1, 48
+ vaddi.bu vr6, vr5, 2
+ vaddi.bu vr7, vr5, 4
+ vaddi.bu vr8, vr5, 6
+.LOOP_UNI_H8:
+ vld vr9, a2, 0
+ add.d a2, a2, a3
+ HEVC_UNI_QPEL_H8_LSX vr9, vr14
+ vssrani.bu.h vr14, vr14, 0
+ fst.d f14, a0, 0
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_H8
+endfunc
+
+function ff_hevc_put_hevc_uni_qpel_h12_8_lsx
+ addi.d t0, a5, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr0, t1, t0 //filter abcdefgh
+ vreplvei.h vr1, vr0, 1 //cd...
+ vreplvei.h vr2, vr0, 2 //ef...
+ vreplvei.h vr3, vr0, 3 //gh...
+ vreplvei.h vr0, vr0, 0 //ab...
+ addi.d a2, a2, -3 //src -= 3
+ addi.w t1, zero, 32
+ vreplgr2vr.h vr4, t1
+ la.local t1, shufb
+ vld vr5, t1, 48
+ vaddi.bu vr6, vr5, 2
+ vaddi.bu vr7, vr5, 4
+ vaddi.bu vr8, vr5, 6
+.LOOP_UNI_H12:
+ vld vr9, a2, 0
+ HEVC_UNI_QPEL_H8_LSX vr9, vr14
+ vld vr9, a2, 8
+ add.d a2, a2, a3
+ HEVC_UNI_QPEL_H8_LSX vr9, vr15
+ vssrani.bu.h vr15, vr14, 0
+ fst.d f15, a0, 0
+ vstelm.w vr15, a0, 8, 2
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_H12
+endfunc
+
+function ff_hevc_put_hevc_uni_qpel_h12_8_lasx
+ addi.d t0, a5, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr0, t1, t0 //filter abcdefgh
+ xvreplve0.q xr0, xr0
+ xvrepl128vei.h xr1, xr0, 1 //cd...
+ xvrepl128vei.h xr2, xr0, 2 //ef...
+ xvrepl128vei.h xr3, xr0, 3 //gh...
+ xvrepl128vei.h xr0, xr0, 0 //ab...
+ addi.d a2, a2, -3 //src -= 3
+ addi.w t1, zero, 32
+ xvreplgr2vr.h xr4, t1
+ la.local t1, shufb
+ vld vr5, t1, 48
+ xvreplve0.q xr5, xr5
+ xvaddi.bu xr6, xr5, 2
+ xvaddi.bu xr7, xr5, 4
+ xvaddi.bu xr8, xr5, 6
+.LOOP_UNI_H12_LASX:
+ xvld xr9, a2, 0
+ add.d a2, a2, a3
+ xvpermi.d xr9, xr9, 0x94 //rearrange data
+ HEVC_UNI_QPEL_H16_LASX xr9, xr14
+ xvpermi.q xr15, xr14, 0x01
+ vssrani.bu.h vr15, vr14, 0
+ fst.d f15, a0, 0
+ vstelm.w vr15, a0, 8, 2
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_H12_LASX
+endfunc
+
+function ff_hevc_put_hevc_uni_qpel_h16_8_lsx
+ addi.d t0, a5, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr0, t1, t0 //filter abcdefgh
+ vreplvei.h vr1, vr0, 1 //cd...
+ vreplvei.h vr2, vr0, 2 //ef...
+ vreplvei.h vr3, vr0, 3 //gh...
+ vreplvei.h vr0, vr0, 0 //ab...
+ addi.d a2, a2, -3 //src -= 3
+ addi.w t1, zero, 32
+ vreplgr2vr.h vr4, t1
+ la.local t1, shufb
+ vld vr5, t1, 48
+ vaddi.bu vr6, vr5, 2
+ vaddi.bu vr7, vr5, 4
+ vaddi.bu vr8, vr5, 6
+.LOOP_UNI_H16:
+ vld vr9, a2, 0
+ HEVC_UNI_QPEL_H8_LSX vr9, vr14
+ vld vr9, a2, 8
+ add.d a2, a2, a3
+ HEVC_UNI_QPEL_H8_LSX vr9, vr15
+ vssrani.bu.h vr15, vr14, 0
+ vst vr15, a0, 0
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_H16
+endfunc
+
+function ff_hevc_put_hevc_uni_qpel_h16_8_lasx
+ addi.d t0, a5, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr0, t1, t0 //filter abcdefgh
+ xvreplve0.q xr0, xr0
+ xvrepl128vei.h xr1, xr0, 1 //cd...
+ xvrepl128vei.h xr2, xr0, 2 //ef...
+ xvrepl128vei.h xr3, xr0, 3 //gh...
+ xvrepl128vei.h xr0, xr0, 0 //ab...
+ addi.d a2, a2, -3 //src -= 3
+ addi.w t1, zero, 32
+ xvreplgr2vr.h xr4, t1
+ la.local t1, shufb
+ vld vr5, t1, 48
+ xvreplve0.q xr5, xr5
+ xvaddi.bu xr6, xr5, 2
+ xvaddi.bu xr7, xr5, 4
+ xvaddi.bu xr8, xr5, 6
+.LOOP_UNI_H16_LASX:
+ xvld xr9, a2, 0
+ add.d a2, a2, a3
+ xvpermi.d xr9, xr9, 0x94 //rearrange data
+ HEVC_UNI_QPEL_H16_LASX xr9, xr14
+ xvpermi.q xr15, xr14, 0x01
+ vssrani.bu.h vr15, vr14, 0
+ vst vr15, a0, 0
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_H16_LASX
+endfunc
+
+function ff_hevc_put_hevc_uni_qpel_h24_8_lsx
+ addi.d t0, a5, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr0, t1, t0 //filter abcdefgh
+ vreplvei.h vr1, vr0, 1 //cd...
+ vreplvei.h vr2, vr0, 2 //ef...
+ vreplvei.h vr3, vr0, 3 //gh...
+ vreplvei.h vr0, vr0, 0 //ab...
+ addi.d a2, a2, -3 //src -= 3
+ addi.w t1, zero, 32
+ vreplgr2vr.h vr4, t1
+ la.local t1, shufb
+ vld vr5, t1, 48
+ vaddi.bu vr6, vr5, 2
+ vaddi.bu vr7, vr5, 4
+ vaddi.bu vr8, vr5, 6
+.LOOP_UNI_H24:
+ vld vr9, a2, 0
+ HEVC_UNI_QPEL_H8_LSX vr9, vr14
+ vld vr9, a2, 8
+ HEVC_UNI_QPEL_H8_LSX vr9, vr15
+ vld vr9, a2, 16
+ add.d a2, a2, a3
+ HEVC_UNI_QPEL_H8_LSX vr9, vr16
+ vssrani.bu.h vr15, vr14, 0
+ vssrani.bu.h vr16, vr16, 0
+ vst vr15, a0, 0
+ fst.d f16, a0, 16
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_H24
+endfunc
+
+function ff_hevc_put_hevc_uni_qpel_h24_8_lasx
+ addi.d t0, a5, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr0, t1, t0 //filter abcdefgh
+ xvreplve0.q xr0, xr0
+ xvrepl128vei.h xr1, xr0, 1 //cd...
+ xvrepl128vei.h xr2, xr0, 2 //ef...
+ xvrepl128vei.h xr3, xr0, 3 //gh...
+ xvrepl128vei.h xr0, xr0, 0 //ab...
+ addi.d a2, a2, -3 //src -= 3
+ addi.w t1, zero, 32
+ xvreplgr2vr.h xr4, t1
+ la.local t1, shufb
+ vld vr5, t1, 48
+ xvreplve0.q xr5, xr5
+ xvaddi.bu xr6, xr5, 2
+ xvaddi.bu xr7, xr5, 4
+ xvaddi.bu xr8, xr5, 6
+.LOOP_UNI_H24_LASX:
+ xvld xr9, a2, 0
+ xvpermi.q xr19, xr9, 0x01 //16...23
+ add.d a2, a2, a3
+ xvpermi.d xr9, xr9, 0x94 //rearrange data
+ HEVC_UNI_QPEL_H16_LASX xr9, xr14
+ xvpermi.q xr15, xr14, 0x01
+ vssrani.bu.h vr15, vr14, 0
+ vst vr15, a0, 0
+ HEVC_UNI_QPEL_H8_LSX vr19, vr16
+ vssrani.bu.h vr16, vr16, 0
+ fst.d f16, a0, 16
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_H24_LASX
+endfunc
+
+function ff_hevc_put_hevc_uni_qpel_h32_8_lsx
+ addi.d t0, a5, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr0, t1, t0 //filter abcdefgh
+ vreplvei.h vr1, vr0, 1 //cd...
+ vreplvei.h vr2, vr0, 2 //ef...
+ vreplvei.h vr3, vr0, 3 //gh...
+ vreplvei.h vr0, vr0, 0 //ab...
+ addi.d a2, a2, -3 //src -= 3
+ addi.w t1, zero, 32
+ vreplgr2vr.h vr4, t1
+ la.local t1, shufb
+ vld vr5, t1, 48
+ vaddi.bu vr6, vr5, 2
+ vaddi.bu vr7, vr5, 4
+ vaddi.bu vr8, vr5, 6
+.LOOP_UNI_H32:
+ vld vr9, a2, 0
+ HEVC_UNI_QPEL_H8_LSX vr9, vr14
+ vld vr9, a2, 8
+ HEVC_UNI_QPEL_H8_LSX vr9, vr15
+ vld vr9, a2, 16
+ HEVC_UNI_QPEL_H8_LSX vr9, vr16
+ vld vr9, a2, 24
+ add.d a2, a2, a3
+ HEVC_UNI_QPEL_H8_LSX vr9, vr17
+ vssrani.bu.h vr15, vr14, 0
+ vssrani.bu.h vr17, vr16, 0
+ vst vr15, a0, 0
+ vst vr17, a0, 16
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_H32
+endfunc
+
+function ff_hevc_put_hevc_uni_qpel_h32_8_lasx
+ addi.d t0, a5, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr0, t1, t0 //filter abcdefgh
+ xvreplve0.q xr0, xr0
+ xvrepl128vei.h xr1, xr0, 1 //cd...
+ xvrepl128vei.h xr2, xr0, 2 //ef...
+ xvrepl128vei.h xr3, xr0, 3 //gh...
+ xvrepl128vei.h xr0, xr0, 0 //ab...
+ addi.d a2, a2, -3 //src -= 3
+ addi.w t1, zero, 32
+ xvreplgr2vr.h xr4, t1
+ la.local t1, shufb
+ vld vr5, t1, 48
+ xvreplve0.q xr5, xr5
+ xvaddi.bu xr6, xr5, 2
+ xvaddi.bu xr7, xr5, 4
+ xvaddi.bu xr8, xr5, 6
+.LOOP_UNI_H32_LASX:
+ xvld xr9, a2, 0
+ xvpermi.d xr9, xr9, 0x94
+ HEVC_UNI_QPEL_H16_LASX xr9, xr14
+ xvld xr9, a2, 16
+ xvpermi.d xr9, xr9, 0x94
+ HEVC_UNI_QPEL_H16_LASX xr9, xr15
+ add.d a2, a2, a3
+ xvssrani.bu.h xr15, xr14, 0
+ xvpermi.d xr15, xr15, 0xd8
+ xvst xr15, a0, 0
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_H32_LASX
+endfunc
+
+function ff_hevc_put_hevc_uni_qpel_h48_8_lsx
+ addi.d t0, a5, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr0, t1, t0 //filter abcdefgh
+ vreplvei.h vr1, vr0, 1 //cd...
+ vreplvei.h vr2, vr0, 2 //ef...
+ vreplvei.h vr3, vr0, 3 //gh...
+ vreplvei.h vr0, vr0, 0 //ab...
+ addi.d a2, a2, -3 //src -= 3
+ addi.w t1, zero, 32
+ vreplgr2vr.h vr4, t1
+ la.local t1, shufb
+ vld vr5, t1, 48
+ vaddi.bu vr6, vr5, 2
+ vaddi.bu vr7, vr5, 4
+ vaddi.bu vr8, vr5, 6
+.LOOP_UNI_H48:
+ vld vr9, a2, 0
+ HEVC_UNI_QPEL_H8_LSX vr9, vr14
+ vld vr9, a2, 8
+ HEVC_UNI_QPEL_H8_LSX vr9, vr15
+ vld vr9, a2, 16
+ HEVC_UNI_QPEL_H8_LSX vr9, vr16
+ vld vr9, a2, 24
+ HEVC_UNI_QPEL_H8_LSX vr9, vr17
+ vld vr9, a2, 32
+ HEVC_UNI_QPEL_H8_LSX vr9, vr18
+ vld vr9, a2, 40
+ add.d a2, a2, a3
+ HEVC_UNI_QPEL_H8_LSX vr9, vr19
+ vssrani.bu.h vr15, vr14, 0
+ vssrani.bu.h vr17, vr16, 0
+ vssrani.bu.h vr19, vr18, 0
+ vst vr15, a0, 0
+ vst vr17, a0, 16
+ vst vr19, a0, 32
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_H48
+endfunc
+
+function ff_hevc_put_hevc_uni_qpel_h48_8_lasx
+ addi.d t0, a5, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr0, t1, t0 //filter abcdefgh
+ xvreplve0.q xr0, xr0
+ xvrepl128vei.h xr1, xr0, 1 //cd...
+ xvrepl128vei.h xr2, xr0, 2 //ef...
+ xvrepl128vei.h xr3, xr0, 3 //gh...
+ xvrepl128vei.h xr0, xr0, 0 //ab...
+ addi.d a2, a2, -3 //src -= 3
+ addi.w t1, zero, 32
+ xvreplgr2vr.h xr4, t1
+ la.local t1, shufb
+ vld vr5, t1, 48
+ xvreplve0.q xr5, xr5
+ xvaddi.bu xr6, xr5, 2
+ xvaddi.bu xr7, xr5, 4
+ xvaddi.bu xr8, xr5, 6
+.LOOP_UNI_H48_LASX:
+ xvld xr9, a2, 0
+ xvpermi.d xr9, xr9, 0x94
+ HEVC_UNI_QPEL_H16_LASX xr9, xr14
+ xvld xr9, a2, 16
+ xvpermi.d xr9, xr9, 0x94
+ HEVC_UNI_QPEL_H16_LASX xr9, xr15
+ xvld xr9, a2, 32
+ xvpermi.d xr9, xr9, 0x94
+ HEVC_UNI_QPEL_H16_LASX xr9, xr16
+ add.d a2, a2, a3
+ xvssrani.bu.h xr15, xr14, 0
+ xvpermi.d xr15, xr15, 0xd8
+ xvst xr15, a0, 0
+ xvpermi.q xr17, xr16, 0x01
+ vssrani.bu.h vr17, vr16, 0
+ vst vr17, a0, 32
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_H48_LASX
+endfunc
+
+function ff_hevc_put_hevc_uni_qpel_h64_8_lasx
+ addi.d t0, a5, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr0, t1, t0 //filter abcdefgh
+ xvreplve0.q xr0, xr0
+ xvrepl128vei.h xr1, xr0, 1 //cd...
+ xvrepl128vei.h xr2, xr0, 2 //ef...
+ xvrepl128vei.h xr3, xr0, 3 //gh...
+ xvrepl128vei.h xr0, xr0, 0 //ab...
+ addi.d a2, a2, -3 //src -= 3
+ addi.w t1, zero, 32
+ xvreplgr2vr.h xr4, t1
+ la.local t1, shufb
+ vld vr5, t1, 48
+ xvreplve0.q xr5, xr5
+ xvaddi.bu xr6, xr5, 2
+ xvaddi.bu xr7, xr5, 4
+ xvaddi.bu xr8, xr5, 6
+.LOOP_UNI_H64_LASX:
+ xvld xr9, a2, 0
+ xvpermi.d xr9, xr9, 0x94
+ HEVC_UNI_QPEL_H16_LASX xr9, xr14
+ xvld xr9, a2, 16
+ xvpermi.d xr9, xr9, 0x94
+ HEVC_UNI_QPEL_H16_LASX xr9, xr15
+ xvld xr9, a2, 32
+ xvpermi.d xr9, xr9, 0x94
+ HEVC_UNI_QPEL_H16_LASX xr9, xr16
+ xvld xr9, a2, 48
+ xvpermi.d xr9, xr9, 0x94
+ HEVC_UNI_QPEL_H16_LASX xr9, xr17
+ add.d a2, a2, a3
+ xvssrani.bu.h xr15, xr14, 0
+ xvpermi.d xr15, xr15, 0xd8
+ xvst xr15, a0, 0
+ xvssrani.bu.h xr17, xr16, 0
+ xvpermi.d xr17, xr17, 0xd8
+ xvst xr17, a0, 32
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_H64_LASX
+endfunc
+
+/*
+ * void FUNC(put_hevc_epel_uni_w_v)(uint8_t *_dst, ptrdiff_t _dststride,
+ * const uint8_t *_src, ptrdiff_t _srcstride,
+ * int height, int denom, int wx, int ox,
+ * intptr_t mx, intptr_t my, int width)
+ */
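+/*
+ * Rough scalar sketch of the 8-bit path implemented below, following the
+ * inline comments; EPEL_FILTER(), offset and shift are assumed to match the
+ * generic C template in libavcodec/hevcdsp_template.c:
+ *
+ *     for (y = 0; y < height; y++, dst += dststride, src += srcstride)
+ *         for (x = 0; x < width; x++)
+ *             dst[x] = av_clip_uint8((((EPEL_FILTER(src, srcstride) * wx)
+ *                                       + offset) >> shift) + ox);
+ */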
+function ff_hevc_put_hevc_epel_uni_w_v4_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ sub.d a2, a2, a3 //src -= stride
+ fld.s f6, a2, 0 //0
+ fldx.s f7, a2, a3 //1
+ fldx.s f8, a2, t0 //2
+ add.d a2, a2, t1
+ vilvl.b vr6, vr7, vr6
+ vilvl.b vr7, vr8, vr8
+ vilvl.h vr6, vr7, vr6
+ vreplvei.w vr0, vr0, 0
+.LOOP_UNI_V4:
+ fld.s f9, a2, 0 //3
+ fldx.s f10, a2, a3 //4
+ add.d a2, a2, t0
+ vextrins.b vr6, vr9, 0x30 //insert the 3rd load
+ vextrins.b vr6, vr9, 0x71
+ vextrins.b vr6, vr9, 0xb2
+ vextrins.b vr6, vr9, 0xf3
+ vbsrl.v vr7, vr6, 1
+ vextrins.b vr7, vr10, 0x30 //insert the 4th load
+ vextrins.b vr7, vr10, 0x71
+ vextrins.b vr7, vr10, 0xb2
+ vextrins.b vr7, vr10, 0xf3
+ vdp2.h.bu.b vr8, vr6, vr0 //EPEL_FILTER(src, stride)
+ vdp2.h.bu.b vr9, vr7, vr0
+ vhaddw.w.h vr10, vr8, vr8
+ vhaddw.w.h vr11, vr9, vr9
+ vmulwev.w.h vr10, vr10, vr1 //EPEL_FILTER(src, stride) * wx
+ vmulwev.w.h vr11, vr11, vr1
+ vadd.w vr10, vr10, vr2 // + offset
+ vadd.w vr11, vr11, vr2
+ vsra.w vr10, vr10, vr3 // >> shift
+ vsra.w vr11, vr11, vr3
+ vadd.w vr10, vr10, vr4 // + ox
+ vadd.w vr11, vr11, vr4
+ vssrani.h.w vr11, vr10, 0
+ vssrani.bu.h vr10, vr11, 0
+ vbsrl.v vr6, vr7, 1
+ fst.s f10, a0, 0
+ vbsrl.v vr10, vr10, 4
+ fstx.s f10, a0, a1
+ alsl.d a0, a1, a0, 1
+ addi.d a4, a4, -2
+ bnez a4, .LOOP_UNI_V4
+endfunc
+
+.macro CALC_EPEL_FILTER_LSX out0, out1
+ vdp2.h.bu.b vr12, vr10, vr0 //EPEL_FILTER(src, stride)
+ vdp2add.h.bu.b vr12, vr11, vr5
+ vexth.w.h vr13, vr12
+ vsllwil.w.h vr12, vr12, 0
+ vmulwev.w.h vr12, vr12, vr1 //EPEL_FILTER(src, stride) * wx
+ vmulwev.w.h vr13, vr13, vr1 //EPEL_FILTER(src, stride) * wx
+ vadd.w vr12, vr12, vr2 // + offset
+ vadd.w vr13, vr13, vr2
+ vsra.w vr12, vr12, vr3 // >> shift
+ vsra.w vr13, vr13, vr3
+ vadd.w \out0, vr12, vr4 // + ox
+ vadd.w \out1, vr13, vr4
+.endm
+
+.macro CALC_EPEL_FILTER_LASX out0
+ xvdp2.h.bu.b xr11, xr12, xr0 //EPEL_FILTER(src, stride)
+ xvhaddw.w.h xr12, xr11, xr11
+ xvmulwev.w.h xr12, xr12, xr1 //EPEL_FILTER(src, stride) * wx
+ xvadd.w xr12, xr12, xr2 // + offset
+ xvsra.w xr12, xr12, xr3 // >> shift
+ xvadd.w \out0, xr12, xr4 // + ox
+.endm
+
+//w is used both as a label suffix and as the condition of the ".if" directive below.
+.macro PUT_HEVC_EPEL_UNI_W_V8_LSX w
+ fld.d f6, a2, 0 //0
+ fldx.d f7, a2, a3 //1
+ fldx.d f8, a2, t0 //2
+ add.d a2, a2, t1
+.LOOP_UNI_V8_\w:
+ fld.d f9, a2, 0 // 3
+ add.d a2, a2, a3
+ vilvl.b vr10, vr7, vr6
+ vilvl.b vr11, vr9, vr8
+ vaddi.bu vr6, vr7, 0 //back up previous value
+ vaddi.bu vr7, vr8, 0
+ vaddi.bu vr8, vr9, 0
+ CALC_EPEL_FILTER_LSX vr12, vr13
+ vssrani.h.w vr13, vr12, 0
+ vssrani.bu.h vr13, vr13, 0
+.if \w < 8
+ fst.s f13, a0, 0
+ vstelm.h vr13, a0, 4, 2
+.else
+ fst.d f13, a0, 0
+.endif
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_V8_\w
+.endm
+
+//w is used both as a label suffix and as the condition of the ".if" directive below.
+.macro PUT_HEVC_EPEL_UNI_W_V8_LASX w
+ fld.d f6, a2, 0 //0
+ fldx.d f7, a2, a3 //1
+ fldx.d f8, a2, t0 //2
+ add.d a2, a2, t1
+.LOOP_UNI_V8_LASX_\w:
+ fld.d f9, a2, 0 // 3
+ add.d a2, a2, a3
+ vilvl.b vr10, vr7, vr6
+ vilvl.b vr11, vr9, vr8
+ xvilvl.h xr12, xr11, xr10
+ xvilvh.h xr13, xr11, xr10
+ xvpermi.q xr12, xr13, 0x02
+ vaddi.bu vr6, vr7, 0 //back up previous value
+ vaddi.bu vr7, vr8, 0
+ vaddi.bu vr8, vr9, 0
+ CALC_EPEL_FILTER_LASX xr12
+ xvpermi.q xr13, xr12, 0x01
+ vssrani.h.w vr13, vr12, 0
+ vssrani.bu.h vr13, vr13, 0
+.if \w < 8
+ fst.s f13, a0, 0
+ vstelm.h vr13, a0, 4, 2
+.else
+ fst.d f13, a0, 0
+.endif
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_V8_LASX_\w
+.endm
+
+function ff_hevc_put_hevc_epel_uni_w_v6_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ sub.d a2, a2, a3 //src -= stride
+ vreplvei.h vr5, vr0, 1
+ vreplvei.h vr0, vr0, 0
+ PUT_HEVC_EPEL_UNI_W_V8_LSX 6
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_v6_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ xvreplve0.w xr0, xr0
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ sub.d a2, a2, a3 //src -= stride
+ PUT_HEVC_EPEL_UNI_W_V8_LASX 6
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_v8_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ sub.d a2, a2, a3 //src -= stride
+ vreplvei.h vr5, vr0, 1
+ vreplvei.h vr0, vr0, 0
+ PUT_HEVC_EPEL_UNI_W_V8_LSX 8
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_v8_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ xvreplve0.w xr0, xr0
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ sub.d a2, a2, a3 //src -= stride
+ PUT_HEVC_EPEL_UNI_W_V8_LASX 8
+endfunc
+
+//w is used both as a label suffix and as the condition of the ".if" directive below.
+.macro PUT_HEVC_EPEL_UNI_W_V16_LSX w
+ vld vr6, a2, 0 //0
+ vldx vr7, a2, a3 //1
+ vldx vr8, a2, t0 //2
+ add.d a2, a2, t1
+.LOOP_UNI_V16_\w:
+ vld vr9, a2, 0 //3
+ add.d a2, a2, a3
+ vilvl.b vr10, vr7, vr6
+ vilvl.b vr11, vr9, vr8
+ CALC_EPEL_FILTER_LSX vr14, vr15
+ vilvh.b vr10, vr7, vr6
+ vilvh.b vr11, vr9, vr8
+ CALC_EPEL_FILTER_LSX vr16, vr17
+ vssrani.h.w vr15, vr14, 0
+ vssrani.h.w vr17, vr16, 0
+ vssrani.bu.h vr17, vr15, 0
+ vaddi.bu vr6, vr7, 0 //back up previous value
+ vaddi.bu vr7, vr8, 0
+ vaddi.bu vr8, vr9, 0
+.if \w < 16
+ fst.d f17, a0, 0
+ vstelm.w vr17, a0, 8, 2
+.else
+ vst vr17, a0, 0
+.endif
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_V16_\w
+.endm
+
+//w is used both as a label suffix and as the condition of the ".if" directive below.
+.macro PUT_HEVC_EPEL_UNI_W_V16_LASX w
+ vld vr6, a2, 0 //0
+ vldx vr7, a2, a3 //1
+ vldx vr8, a2, t0 //2
+ add.d a2, a2, t1
+.LOOP_UNI_V16_LASX_\w:
+ vld vr9, a2, 0 //3
+ add.d a2, a2, a3
+ xvilvl.b xr10, xr7, xr6
+ xvilvh.b xr11, xr7, xr6
+ xvpermi.q xr11, xr10, 0x20
+ xvilvl.b xr12, xr9, xr8
+ xvilvh.b xr13, xr9, xr8
+ xvpermi.q xr13, xr12, 0x20
+ xvdp2.h.bu.b xr10, xr11, xr0 //EPEL_FILTER(src, stride)
+ xvdp2add.h.bu.b xr10, xr13, xr5
+ xvexth.w.h xr11, xr10
+ xvsllwil.w.h xr10, xr10, 0
+ xvmulwev.w.h xr10, xr10, xr1 //EPEL_FILTER(src, stride) * wx
+ xvmulwev.w.h xr11, xr11, xr1
+ xvadd.w xr10, xr10, xr2 // + offset
+ xvadd.w xr11, xr11, xr2
+ xvsra.w xr10, xr10, xr3 // >> shift
+ xvsra.w xr11, xr11, xr3
+ xvadd.w xr10, xr10, xr4 // + ox
+ xvadd.w xr11, xr11, xr4
+ xvssrani.h.w xr11, xr10, 0
+ xvpermi.q xr10, xr11, 0x01
+ vssrani.bu.h vr10, vr11, 0
+ vaddi.bu vr6, vr7, 0 //back up previous value
+ vaddi.bu vr7, vr8, 0
+ vaddi.bu vr8, vr9, 0
+.if \w < 16
+ fst.d f10, a0, 0
+ vstelm.w vr10, a0, 8, 2
+.else
+ vst vr10, a0, 0
+.endif
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_V16_LASX_\w
+.endm
+
+function ff_hevc_put_hevc_epel_uni_w_v12_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ sub.d a2, a2, a3 //src -= stride
+ vreplvei.h vr5, vr0, 1
+ vreplvei.h vr0, vr0, 0
+ PUT_HEVC_EPEL_UNI_W_V16_LSX 12
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_v12_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ xvreplve0.q xr0, xr0
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ sub.d a2, a2, a3 //src -= stride
+ xvrepl128vei.h xr5, xr0, 1
+ xvrepl128vei.h xr0, xr0, 0
+ PUT_HEVC_EPEL_UNI_W_V16_LASX 12
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_v16_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ sub.d a2, a2, a3 //src -= stride
+ vreplvei.h vr5, vr0, 1
+ vreplvei.h vr0, vr0, 0
+ PUT_HEVC_EPEL_UNI_W_V16_LSX 16
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_v16_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ xvreplve0.q xr0, xr0
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ sub.d a2, a2, a3 //src -= stride
+ xvrepl128vei.h xr5, xr0, 1
+ xvrepl128vei.h xr0, xr0, 0
+ PUT_HEVC_EPEL_UNI_W_V16_LASX 16
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_v24_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ sub.d a2, a2, a3 //src -= stride
+ vreplvei.h vr5, vr0, 1
+ vreplvei.h vr0, vr0, 0
+ addi.d t2, a0, 0 //save init
+ addi.d t3, a2, 0
+ addi.d t4, a4, 0
+ PUT_HEVC_EPEL_UNI_W_V16_LSX 24
+ addi.d a0, t2, 16 //increase step
+ addi.d a2, t3, 16
+ addi.d a4, t4, 0
+ PUT_HEVC_EPEL_UNI_W_V8_LSX 24
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_v24_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ xvreplve0.w xr20, xr0 //save xr0
+ xvreplve0.q xr0, xr0
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ sub.d a2, a2, a3 //src -= stride
+ xvrepl128vei.h xr5, xr0, 1
+ xvrepl128vei.h xr0, xr0, 0
+ addi.d t2, a0, 0 //save init
+ addi.d t3, a2, 0
+ addi.d t4, a4, 0
+ PUT_HEVC_EPEL_UNI_W_V16_LASX 24
+ addi.d a0, t2, 16 //increase step
+ addi.d a2, t3, 16
+ addi.d a4, t4, 0
+ xvaddi.bu xr0, xr20, 0
+ PUT_HEVC_EPEL_UNI_W_V8_LASX 24
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_v32_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ sub.d a2, a2, a3 //src -= stride
+ vreplvei.h vr5, vr0, 1
+ vreplvei.h vr0, vr0, 0
+ addi.d t2, a0, 0
+ addi.d t3, a2, 0
+ addi.d t4, a4, 0
+ PUT_HEVC_EPEL_UNI_W_V16_LSX 32
+ addi.d a0, t2, 16
+ addi.d a2, t3, 16
+ addi.d a4, t4, 0
+ PUT_HEVC_EPEL_UNI_W_V16_LSX 33
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_v32_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ xvreplve0.q xr0, xr0
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ sub.d a2, a2, a3 //src -= stride
+ xvrepl128vei.h xr5, xr0, 1
+ xvrepl128vei.h xr0, xr0, 0
+ addi.d t2, a0, 0
+ addi.d t3, a2, 0
+ addi.d t4, a4, 0
+ PUT_HEVC_EPEL_UNI_W_V16_LASX 32
+ addi.d a0, t2, 16
+ addi.d a2, t3, 16
+ addi.d a4, t4, 0
+ PUT_HEVC_EPEL_UNI_W_V16_LASX 33
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_v48_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ sub.d a2, a2, a3 //src -= stride
+ vreplvei.h vr5, vr0, 1
+ vreplvei.h vr0, vr0, 0
+ addi.d t2, a0, 0
+ addi.d t3, a2, 0
+ addi.d t4, a4, 0
+ PUT_HEVC_EPEL_UNI_W_V16_LSX 48
+ addi.d a0, t2, 16
+ addi.d a2, t3, 16
+ addi.d a4, t4, 0
+ PUT_HEVC_EPEL_UNI_W_V16_LSX 49
+ addi.d a0, t2, 32
+ addi.d a2, t3, 32
+ addi.d a4, t4, 0
+ PUT_HEVC_EPEL_UNI_W_V16_LSX 50
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_v48_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ xvreplve0.q xr0, xr0
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ sub.d a2, a2, a3 //src -= stride
+ xvrepl128vei.h xr5, xr0, 1
+ xvrepl128vei.h xr0, xr0, 0
+ addi.d t2, a0, 0
+ addi.d t3, a2, 0
+ addi.d t4, a4, 0
+ PUT_HEVC_EPEL_UNI_W_V16_LASX 48
+ addi.d a0, t2, 16
+ addi.d a2, t3, 16
+ addi.d a4, t4, 0
+ PUT_HEVC_EPEL_UNI_W_V16_LASX 49
+ addi.d a0, t2, 32
+ addi.d a2, t3, 32
+ addi.d a4, t4, 0
+ PUT_HEVC_EPEL_UNI_W_V16_LASX 50
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_v64_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ sub.d a2, a2, a3 //src -= stride
+ vreplvei.h vr5, vr0, 1
+ vreplvei.h vr0, vr0, 0
+ addi.d t2, a0, 0
+ addi.d t3, a2, 0
+ addi.d t4, a4, 0
+ PUT_HEVC_EPEL_UNI_W_V16_LSX 64
+ addi.d a0, t2, 16
+ addi.d a2, t3, 16
+ addi.d a4, t4, 0
+ PUT_HEVC_EPEL_UNI_W_V16_LSX 65
+ addi.d a0, t2, 32
+ addi.d a2, t3, 32
+ addi.d a4, t4, 0
+ PUT_HEVC_EPEL_UNI_W_V16_LSX 66
+ addi.d a0, t2, 48
+ addi.d a2, t3, 48
+ addi.d a4, t4, 0
+ PUT_HEVC_EPEL_UNI_W_V16_LSX 67
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_v64_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ xvreplve0.q xr0, xr0
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ sub.d a2, a2, a3 //src -= stride
+ xvrepl128vei.h xr5, xr0, 1
+ xvrepl128vei.h xr0, xr0, 0
+ addi.d t2, a0, 0
+ addi.d t3, a2, 0
+ addi.d t4, a4, 0
+ PUT_HEVC_EPEL_UNI_W_V16_LASX 64
+ addi.d a0, t2, 16
+ addi.d a2, t3, 16
+ addi.d a4, t4, 0
+ PUT_HEVC_EPEL_UNI_W_V16_LASX 65
+ addi.d a0, t2, 32
+ addi.d a2, t3, 32
+ addi.d a4, t4, 0
+ PUT_HEVC_EPEL_UNI_W_V16_LASX 66
+ addi.d a0, t2, 48
+ addi.d a2, t3, 48
+ addi.d a4, t4, 0
+ PUT_HEVC_EPEL_UNI_W_V16_LASX 67
+endfunc
+
+/*
+ * void FUNC(put_hevc_epel_uni_w_h)(uint8_t *_dst, ptrdiff_t _dststride,
+ * const uint8_t *_src, ptrdiff_t _srcstride,
+ * int height, int denom, int wx, int ox,
+ * intptr_t mx, intptr_t my, int width)
+ */
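+/*
+ * Rough scalar sketch of the 8-bit path implemented below (horizontal taps,
+ * hence EPEL_FILTER(src, 1)); names follow the generic C template in
+ * libavcodec/hevcdsp_template.c and are assumptions here:
+ *
+ *     for (y = 0; y < height; y++, dst += dststride, src += srcstride)
+ *         for (x = 0; x < width; x++)
+ *             dst[x] = av_clip_uint8((((EPEL_FILTER(src, 1) * wx)
+ *                                       + offset) >> shift) + ox);
+ */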
+function ff_hevc_put_hevc_epel_uni_w_h4_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ vreplvei.w vr0, vr0, 0
+ la.local t1, shufb
+ vld vr5, t1, 0
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ addi.d a2, a2, -1 //src -= 1
+.LOOP_UNI_W_H4:
+ fld.d f6, a2, 0
+ add.d a2, a2, a3
+ vshuf.b vr6, vr6, vr6, vr5
+ vdp2.h.bu.b vr7, vr6, vr0
+ vhaddw.w.h vr7, vr7, vr7
+ vmulwev.w.h vr7, vr7, vr1
+ vadd.w vr7, vr7, vr2
+ vsra.w vr7, vr7, vr3
+ vadd.w vr7, vr7, vr4
+ vssrani.h.w vr7, vr7, 0
+ vssrani.bu.h vr7, vr7, 0
+ fst.s f7, a0, 0
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_W_H4
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_h6_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ vreplvei.w vr0, vr0, 0
+ la.local t1, shufb
+ vld vr6, t1, 48
+ vaddi.bu vr7, vr6, 2
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ addi.d a2, a2, -1 //src -= 1
+ vreplvei.h vr5, vr0, 1
+ vreplvei.h vr0, vr0, 0
+.LOOP_UNI_W_H6:
+ vld vr8, a2, 0
+ add.d a2, a2, a3
+ vshuf.b vr10, vr8, vr8, vr6
+ vshuf.b vr11, vr8, vr8, vr7
+ CALC_EPEL_FILTER_LSX vr14, vr15
+ vssrani.h.w vr15, vr14, 0
+ vssrani.bu.h vr15, vr15, 0
+ fst.s f15, a0, 0
+ vstelm.h vr15, a0, 4, 2
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_W_H6
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_h6_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ xvreplve0.w xr0, xr0
+ la.local t1, shufb
+ xvld xr6, t1, 64
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ addi.d a2, a2, -1 //src -= 1
+.LOOP_UNI_W_H6_LASX:
+ vld vr8, a2, 0
+ xvreplve0.q xr8, xr8
+ add.d a2, a2, a3
+ xvshuf.b xr12, xr8, xr8, xr6
+ CALC_EPEL_FILTER_LASX xr14
+ xvpermi.q xr15, xr14, 0x01
+ vssrani.h.w vr15, vr14, 0
+ vssrani.bu.h vr15, vr15, 0
+ fst.s f15, a0, 0
+ vstelm.h vr15, a0, 4, 2
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_W_H6_LASX
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_h8_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ vreplvei.w vr0, vr0, 0
+ la.local t1, shufb
+ vld vr6, t1, 48
+ vaddi.bu vr7, vr6, 2
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ addi.d a2, a2, -1 //src -= 1
+ vreplvei.h vr5, vr0, 1
+ vreplvei.h vr0, vr0, 0
+.LOOP_UNI_W_H8:
+ vld vr8, a2, 0
+ add.d a2, a2, a3
+ vshuf.b vr10, vr8, vr8, vr6
+ vshuf.b vr11, vr8, vr8, vr7
+ CALC_EPEL_FILTER_LSX vr14, vr15
+ vssrani.h.w vr15, vr14, 0
+ vssrani.bu.h vr15, vr15, 0
+ fst.d f15, a0, 0
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_W_H8
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_h8_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ xvreplve0.w xr0, xr0
+ la.local t1, shufb
+ xvld xr6, t1, 64
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ addi.d a2, a2, -1 //src -= 1
+.LOOP_UNI_W_H8_LASX:
+ vld vr8, a2, 0
+ xvreplve0.q xr8, xr8
+ add.d a2, a2, a3
+ xvshuf.b xr12, xr8, xr8, xr6
+ CALC_EPEL_FILTER_LASX xr14
+ xvpermi.q xr15, xr14, 0x01
+ vssrani.h.w vr15, vr14, 0
+ vssrani.bu.h vr15, vr15, 0
+ fst.d f15, a0, 0
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_W_H8_LASX
+endfunc
+
+.macro EPEL_UNI_W_H16_LOOP_LSX idx0, idx1, idx2
+ vld vr8, a2, \idx0
+ vshuf.b vr10, vr8, vr8, vr6
+ vshuf.b vr11, vr8, vr8, vr7
+ CALC_EPEL_FILTER_LSX vr14, vr15
+ vld vr8, a2, \idx1
+ vshuf.b vr10, vr8, vr8, vr6
+ vshuf.b vr11, vr8, vr8, vr7
+ CALC_EPEL_FILTER_LSX vr16, vr17
+ vssrani.h.w vr15, vr14, 0
+ vssrani.h.w vr17, vr16, 0
+ vssrani.bu.h vr17, vr15, 0
+ vst vr17, a0, \idx2
+.endm
+
+.macro EPEL_UNI_W_H16_LOOP_LASX idx0, idx2, w
+ xvld xr8, a2, \idx0
+ xvpermi.d xr9, xr8, 0x09
+ xvreplve0.q xr8, xr8
+ xvshuf.b xr12, xr8, xr8, xr6
+ CALC_EPEL_FILTER_LASX xr14
+ xvreplve0.q xr8, xr9
+ xvshuf.b xr12, xr8, xr8, xr6
+ CALC_EPEL_FILTER_LASX xr16
+ xvssrani.h.w xr16, xr14, 0
+ xvpermi.q xr17, xr16, 0x01
+ vssrani.bu.h vr17, vr16, 0
+ vpermi.w vr17, vr17, 0xd8
+.if \w == 12
+ fst.d f17, a0, 0
+ vstelm.w vr17, a0, 8, 2
+.else
+ vst vr17, a0, \idx2
+.endif
+.endm
+
+function ff_hevc_put_hevc_epel_uni_w_h12_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ vreplvei.w vr0, vr0, 0
+ la.local t1, shufb
+ vld vr6, t1, 48
+ vaddi.bu vr7, vr6, 2
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ addi.d a2, a2, -1 //src -= 1
+ vreplvei.h vr5, vr0, 1
+ vreplvei.h vr0, vr0, 0
+.LOOP_UNI_W_H12:
+ vld vr8, a2, 0
+ vshuf.b vr10, vr8, vr8, vr6
+ vshuf.b vr11, vr8, vr8, vr7
+ CALC_EPEL_FILTER_LSX vr14, vr15
+ vld vr8, a2, 8
+ vshuf.b vr10, vr8, vr8, vr6
+ vshuf.b vr11, vr8, vr8, vr7
+ CALC_EPEL_FILTER_LSX vr16, vr17
+ vssrani.h.w vr15, vr14, 0
+ vssrani.h.w vr17, vr16, 0
+ vssrani.bu.h vr17, vr15, 0
+ fst.d f17, a0, 0
+ vstelm.w vr17, a0, 8, 2
+ add.d a2, a2, a3
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_W_H12
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_h12_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ xvreplve0.w xr0, xr0
+ la.local t1, shufb
+ xvld xr6, t1, 64
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ addi.d a2, a2, -1 //src -= 1
+.LOOP_UNI_W_H12_LASX:
+ EPEL_UNI_W_H16_LOOP_LASX 0, 0, 12
+ add.d a2, a2, a3
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_W_H12_LASX
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_h16_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ vreplvei.w vr0, vr0, 0
+ la.local t1, shufb
+ vld vr6, t1, 48
+ vaddi.bu vr7, vr6, 2
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ addi.d a2, a2, -1 //src -= 1
+ vreplvei.h vr5, vr0, 1
+ vreplvei.h vr0, vr0, 0
+.LOOP_UNI_W_H16:
+ EPEL_UNI_W_H16_LOOP_LSX 0, 8, 0
+ add.d a2, a2, a3
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_W_H16
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_h16_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ xvreplve0.w xr0, xr0
+ la.local t1, shufb
+ xvld xr6, t1, 64
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ addi.d a2, a2, -1 //src -= 1
+.LOOP_UNI_W_H16_LASX:
+ EPEL_UNI_W_H16_LOOP_LASX 0, 0, 16
+ add.d a2, a2, a3
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_W_H16_LASX
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_h24_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ vreplvei.w vr0, vr0, 0
+ la.local t1, shufb
+ vld vr6, t1, 48
+ vaddi.bu vr7, vr6, 2
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ addi.d a2, a2, -1 //src -= 1
+ vreplvei.h vr5, vr0, 1
+ vreplvei.h vr0, vr0, 0
+.LOOP_UNI_W_H24:
+ EPEL_UNI_W_H16_LOOP_LSX 0, 8, 0
+ vld vr8, a2, 16
+ add.d a2, a2, a3
+ vshuf.b vr10, vr8, vr8, vr6
+ vshuf.b vr11, vr8, vr8, vr7
+ CALC_EPEL_FILTER_LSX vr18, vr19
+ vssrani.h.w vr19, vr18, 0
+ vssrani.bu.h vr19, vr19, 0
+ fst.d f19, a0, 16
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_W_H24
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_h24_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ xvreplve0.w xr0, xr0
+ la.local t1, shufb
+ xvld xr6, t1, 64
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ addi.d a2, a2, -1 //src -= 1
+.LOOP_UNI_W_H24_LASX:
+ EPEL_UNI_W_H16_LOOP_LASX 0, 0, 24
+ vld vr8, a2, 16
+ add.d a2, a2, a3
+ xvreplve0.q xr8, xr8
+ xvshuf.b xr12, xr8, xr8, xr6
+ CALC_EPEL_FILTER_LASX xr14
+ xvpermi.q xr15, xr14, 0x01
+ vssrani.h.w vr15, vr14, 0
+ vssrani.bu.h vr15, vr15, 0
+ fst.d f15, a0, 16
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_W_H24_LASX
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_h32_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ vreplvei.w vr0, vr0, 0
+ la.local t1, shufb
+ vld vr6, t1, 48
+ vaddi.bu vr7, vr6, 2
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ addi.d a2, a2, -1 //src -= 1
+ vreplvei.h vr5, vr0, 1
+ vreplvei.h vr0, vr0, 0
+.LOOP_UNI_W_H32:
+ EPEL_UNI_W_H16_LOOP_LSX 0, 8, 0
+ EPEL_UNI_W_H16_LOOP_LSX 16, 24, 16
+ add.d a2, a2, a3
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_W_H32
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_h32_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ xvreplve0.w xr0, xr0
+ la.local t1, shufb
+ xvld xr6, t1, 64
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ addi.d a2, a2, -1 //src -= 1
+.LOOP_UNI_W_H32_LASX:
+ EPEL_UNI_W_H16_LOOP_LASX 0, 0, 32
+ EPEL_UNI_W_H16_LOOP_LASX 16, 16, 32
+ add.d a2, a2, a3
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_W_H32_LASX
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_h48_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ vreplvei.w vr0, vr0, 0
+ la.local t1, shufb
+ vld vr6, t1, 48
+ vaddi.bu vr7, vr6, 2
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ addi.d a2, a2, -1 //src -= 1
+ vreplvei.h vr5, vr0, 1
+ vreplvei.h vr0, vr0, 0
+.LOOP_UNI_W_H48:
+ EPEL_UNI_W_H16_LOOP_LSX 0, 8, 0
+ EPEL_UNI_W_H16_LOOP_LSX 16, 24, 16
+ EPEL_UNI_W_H16_LOOP_LSX 32, 40, 32
+ add.d a2, a2, a3
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_W_H48
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_h48_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ xvreplve0.w xr0, xr0
+ la.local t1, shufb
+ xvld xr6, t1, 64
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ addi.d a2, a2, -1 //src -= 1
+.LOOP_UNI_W_H48_LASX:
+ EPEL_UNI_W_H16_LOOP_LASX 0, 0, 48
+ EPEL_UNI_W_H16_LOOP_LASX 16, 16, 48
+ EPEL_UNI_W_H16_LOOP_LASX 32, 32, 48
+ add.d a2, a2, a3
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_W_H48_LASX
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_h64_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ vreplvei.w vr0, vr0, 0
+ la.local t1, shufb
+ vld vr6, t1, 48
+ vaddi.bu vr7, vr6, 2
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ addi.d a2, a2, -1 //src -= 1
+ vreplvei.h vr5, vr0, 1
+ vreplvei.h vr0, vr0, 0
+.LOOP_UNI_W_H64:
+ EPEL_UNI_W_H16_LOOP_LSX 0, 8, 0
+ EPEL_UNI_W_H16_LOOP_LSX 16, 24, 16
+ EPEL_UNI_W_H16_LOOP_LSX 32, 40, 32
+ EPEL_UNI_W_H16_LOOP_LSX 48, 56, 48
+ add.d a2, a2, a3
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_W_H64
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_h64_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ xvreplve0.w xr0, xr0
+ la.local t1, shufb
+ xvld xr6, t1, 64
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ addi.d a2, a2, -1 //src -= 1
+.LOOP_UNI_W_H64_LASX:
+ EPEL_UNI_W_H16_LOOP_LASX 0, 0, 64
+ EPEL_UNI_W_H16_LOOP_LASX 16, 16, 64
+ EPEL_UNI_W_H16_LOOP_LASX 32, 32, 64
+ EPEL_UNI_W_H16_LOOP_LASX 48, 48, 64
+ add.d a2, a2, a3
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_W_H64_LASX
+endfunc
+
+/*
+ * void FUNC(put_hevc_epel_bi_h)(uint8_t *_dst, ptrdiff_t _dststride,
+ * const uint8_t *_src, ptrdiff_t _srcstride,
+ * const int16_t *src2, int height, intptr_t mx,
+ * intptr_t my, int width)
+ */
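+/*
+ * Rough scalar sketch of the 8-bit path implemented below; src2 is the 16-bit
+ * intermediate buffer with a fixed stride of MAX_PB_SIZE elements (hence the
+ * "addi.d a4, a4, 128" per row), and the final vssrarni by 7 is assumed to
+ * correspond to the rounding shift of the generic C template:
+ *
+ *     for (y = 0; y < height; y++, dst += dststride,
+ *          src += srcstride, src2 += MAX_PB_SIZE)
+ *         for (x = 0; x < width; x++)
+ *             dst[x] = av_clip_uint8((EPEL_FILTER(src, 1) + src2[x] + 64) >> 7);
+ */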
+function ff_hevc_put_hevc_bi_epel_h4_8_lsx
+ addi.d a6, a6, -1
+ slli.w a6, a6, 2
+ la.local t0, ff_hevc_epel_filters
+ vldx vr0, t0, a6 // filter
+ vreplvei.w vr0, vr0, 0
+ la.local t0, shufb
+ vld vr1, t0, 0 // mask
+ addi.d a2, a2, -1 // src -= 1
+.LOOP_BI_EPEL_H4:
+ vld vr4, a4, 0 // src2
+ vld vr5, a2, 0
+ add.d a2, a2, a3
+ addi.d a4, a4, 128
+ vshuf.b vr5, vr5, vr5, vr1
+ vdp2.h.bu.b vr6, vr5, vr0 // EPEL_FILTER(src, 1)
+ vsllwil.w.h vr4, vr4, 0
+ vhaddw.w.h vr6, vr6, vr6
+ vadd.w vr6, vr6, vr4 // src2[x]
+ vssrani.h.w vr6, vr6, 0
+ vssrarni.bu.h vr6, vr6, 7
+ fst.s f6, a0, 0
+ add.d a0, a0, a1
+ addi.d a5, a5, -1
+ bnez a5, .LOOP_BI_EPEL_H4
+endfunc
+
+.macro PUT_HEVC_BI_EPEL_H8_LSX in0, in1, in2, in3, out0
+ vshuf.b vr6, \in1, \in0, \in2
+ vshuf.b vr7, \in1, \in0, \in3
+ vdp2.h.bu.b vr8, vr6, vr0 // EPEL_FILTER(src, 1)
+ vdp2add.h.bu.b vr8, vr7, vr1 // EPEL_FILTER(src, 1)
+ vsadd.h \out0, vr8, vr4 // src2[x]
+.endm
+
+.macro PUT_HEVC_BI_EPEL_H16_LASX in0, in1, in2, in3, out0
+ xvshuf.b xr6, \in1, \in0, \in2
+ xvshuf.b xr7, \in1, \in0, \in3
+ xvdp2.h.bu.b xr8, xr6, xr0 // EPEL_FILTER(src, 1)
+ xvdp2add.h.bu.b xr8, xr7, xr1 // EPEL_FILTER(src, 1)
+ xvsadd.h \out0, xr8, xr4 // src2[x]
+.endm
+
+function ff_hevc_put_hevc_bi_epel_h6_8_lsx
+ addi.d a6, a6, -1
+ slli.w a6, a6, 2
+ la.local t0, ff_hevc_epel_filters
+ vldx vr0, t0, a6 // filter
+ vreplvei.h vr1, vr0, 1
+ vreplvei.h vr0, vr0, 0
+ la.local t0, shufb
+ vld vr2, t0, 48 // mask
+ vaddi.bu vr3, vr2, 2
+ addi.d a2, a2, -1 // src -= 1
+.LOOP_BI_EPEL_H6:
+ vld vr4, a4, 0 // src2
+ vld vr5, a2, 0
+ add.d a2, a2, a3
+ addi.d a4, a4, 128
+ PUT_HEVC_BI_EPEL_H8_LSX vr5, vr5, vr2, vr3, vr7
+ vssrarni.bu.h vr7, vr7, 7
+ fst.s f7, a0, 0
+ vstelm.h vr7, a0, 4, 2
+ add.d a0, a0, a1
+ addi.d a5, a5, -1
+ bnez a5, .LOOP_BI_EPEL_H6
+endfunc
+
+function ff_hevc_put_hevc_bi_epel_h8_8_lsx
+ addi.d a6, a6, -1
+ slli.w a6, a6, 2
+ la.local t0, ff_hevc_epel_filters
+ vldx vr0, t0, a6 // filter
+ vreplvei.h vr1, vr0, 1
+ vreplvei.h vr0, vr0, 0
+ la.local t0, shufb
+ vld vr2, t0, 48 // mask
+ vaddi.bu vr3, vr2, 2
+ addi.d a2, a2, -1 // src -= 1
+.LOOP_BI_EPEL_H8:
+ vld vr4, a4, 0 // src2
+ vld vr5, a2, 0
+ add.d a2, a2, a3
+ addi.d a4, a4, 128
+ PUT_HEVC_BI_EPEL_H8_LSX vr5, vr5, vr2, vr3, vr7
+ vssrarni.bu.h vr7, vr7, 7
+ fst.d f7, a0, 0
+ add.d a0, a0, a1
+ addi.d a5, a5, -1
+ bnez a5, .LOOP_BI_EPEL_H8
+endfunc
+
+function ff_hevc_put_hevc_bi_epel_h12_8_lsx
+ addi.d a6, a6, -1
+ slli.w a6, a6, 2
+ la.local t0, ff_hevc_epel_filters
+ vldx vr0, t0, a6 // filter
+ vreplvei.h vr1, vr0, 1
+ vreplvei.h vr0, vr0, 0
+ la.local t0, shufb
+ vld vr2, t0, 48 // mask
+ vaddi.bu vr3, vr2, 2
+ addi.d a2, a2, -1 // src -= 1
+.LOOP_BI_EPEL_H12:
+ vld vr4, a4, 0 // src2
+ vld vr5, a2, 0
+ PUT_HEVC_BI_EPEL_H8_LSX vr5, vr5, vr2, vr3, vr11
+ vld vr5, a2, 8
+ vld vr4, a4, 16
+ PUT_HEVC_BI_EPEL_H8_LSX vr5, vr5, vr2, vr3, vr12
+ vssrarni.bu.h vr12, vr11, 7
+ fst.d f12, a0, 0
+ vstelm.w vr12, a0, 8, 2
+ add.d a2, a2, a3
+ addi.d a4, a4, 128
+ add.d a0, a0, a1
+ addi.d a5, a5, -1
+ bnez a5, .LOOP_BI_EPEL_H12
+endfunc
+
+function ff_hevc_put_hevc_bi_epel_h12_8_lasx
+ addi.d a6, a6, -1
+ slli.w a6, a6, 2
+ la.local t0, ff_hevc_epel_filters
+ vldx vr0, t0, a6 // filter
+ xvreplve0.q xr0, xr0
+ xvrepl128vei.h xr1, xr0, 1
+ xvrepl128vei.h xr0, xr0, 0
+ la.local t0, shufb
+ xvld xr2, t0, 96 // mask
+ xvaddi.bu xr3, xr2, 2
+ addi.d a2, a2, -1 // src -= 1
+.LOOP_BI_EPEL_H12_LASX:
+ xvld xr4, a4, 0 // src2
+ xvld xr5, a2, 0
+ xvpermi.d xr5, xr5, 0x94
+ PUT_HEVC_BI_EPEL_H16_LASX xr5, xr5, xr2, xr3, xr9
+ xvpermi.q xr10, xr9, 0x01
+ vssrarni.bu.h vr10, vr9, 7
+ fst.d f10, a0, 0
+ vstelm.w vr10, a0, 8, 2
+ add.d a2, a2, a3
+ addi.d a4, a4, 128
+ add.d a0, a0, a1
+ addi.d a5, a5, -1
+ bnez a5, .LOOP_BI_EPEL_H12_LASX
+endfunc
+
+function ff_hevc_put_hevc_bi_epel_h16_8_lsx
+ addi.d a6, a6, -1
+ slli.w a6, a6, 2
+ la.local t0, ff_hevc_epel_filters
+ vldx vr0, t0, a6 // filter
+ vreplvei.h vr1, vr0, 1
+ vreplvei.h vr0, vr0, 0
+ la.local t0, shufb
+ vld vr2, t0, 48 // mask
+ vaddi.bu vr3, vr2, 2
+ addi.d a2, a2, -1 // src -= 1
+.LOOP_BI_EPEL_H16:
+ vld vr4, a4, 0 // src2
+ vld vr5, a2, 0
+ PUT_HEVC_BI_EPEL_H8_LSX vr5, vr5, vr2, vr3, vr11
+ vld vr5, a2, 8
+ vld vr4, a4, 16
+ PUT_HEVC_BI_EPEL_H8_LSX vr5, vr5, vr2, vr3, vr12
+ vssrarni.bu.h vr12, vr11, 7
+ vst vr12, a0, 0
+ add.d a2, a2, a3
+ addi.d a4, a4, 128
+ add.d a0, a0, a1
+ addi.d a5, a5, -1
+ bnez a5, .LOOP_BI_EPEL_H16
+endfunc
+
+function ff_hevc_put_hevc_bi_epel_h16_8_lasx
+ addi.d a6, a6, -1
+ slli.w a6, a6, 2
+ la.local t0, ff_hevc_epel_filters
+ vldx vr0, t0, a6 // filter
+ xvreplve0.q xr0, xr0
+ xvrepl128vei.h xr1, xr0, 1
+ xvrepl128vei.h xr0, xr0, 0
+ la.local t0, shufb
+ xvld xr2, t0, 96 // mask
+ xvaddi.bu xr3, xr2, 2
+ addi.d a2, a2, -1 // src -= 1
+.LOOP_BI_EPEL_H16_LASX:
+ xvld xr4, a4, 0 // src2
+ xvld xr5, a2, 0
+ xvpermi.d xr5, xr5, 0x94
+ PUT_HEVC_BI_EPEL_H16_LASX xr5, xr5, xr2, xr3, xr9
+ xvpermi.q xr10, xr9, 0x01
+ vssrarni.bu.h vr10, vr9, 7
+ vst vr10, a0, 0
+ add.d a2, a2, a3
+ addi.d a4, a4, 128
+ add.d a0, a0, a1
+ addi.d a5, a5, -1
+ bnez a5, .LOOP_BI_EPEL_H16_LASX
+endfunc
+
+function ff_hevc_put_hevc_bi_epel_h32_8_lasx
+ addi.d a6, a6, -1
+ slli.w a6, a6, 2
+ la.local t0, ff_hevc_epel_filters
+ vldx vr0, t0, a6 // filter
+ xvreplve0.q xr0, xr0
+ xvrepl128vei.h xr1, xr0, 1
+ xvrepl128vei.h xr0, xr0, 0
+ la.local t0, shufb
+ xvld xr2, t0, 96 // mask
+ xvaddi.bu xr3, xr2, 2
+ addi.d a2, a2, -1 // src -= 1
+.LOOP_BI_EPEL_H32_LASX:
+ xvld xr4, a4, 0 // src2
+ xvld xr5, a2, 0
+ xvpermi.q xr15, xr5, 0x01
+ xvpermi.d xr5, xr5, 0x94
+ PUT_HEVC_BI_EPEL_H16_LASX xr5, xr5, xr2, xr3, xr9
+ xvld xr4, a4, 32
+ xvld xr15, a2, 16
+ xvpermi.d xr15, xr15, 0x94
+ PUT_HEVC_BI_EPEL_H16_LASX xr15, xr15, xr2, xr3, xr11
+ xvssrarni.bu.h xr11, xr9, 7
+ xvpermi.d xr11, xr11, 0xd8
+ xvst xr11, a0, 0
+ add.d a2, a2, a3
+ addi.d a4, a4, 128
+ add.d a0, a0, a1
+ addi.d a5, a5, -1
+ bnez a5, .LOOP_BI_EPEL_H32_LASX
+endfunc
+
+function ff_hevc_put_hevc_bi_epel_h48_8_lsx
+ addi.d a6, a6, -1
+ slli.w a6, a6, 2
+ la.local t0, ff_hevc_epel_filters
+ vldx vr0, t0, a6 // filter
+ vreplvei.h vr1, vr0, 1
+ vreplvei.h vr0, vr0, 0
+ la.local t0, shufb
+ vld vr2, t0, 48 // mask
+ vaddi.bu vr3, vr2, 2
+ vaddi.bu vr21, vr2, 8
+ vaddi.bu vr22, vr2, 10
+ addi.d a2, a2, -1 // src -= 1
+.LOOP_BI_EPEL_H48:
+ vld vr4, a4, 0 // src2
+ vld vr5, a2, 0
+ vld vr9, a2, 16
+ vld vr10, a2, 32
+ vld vr11, a2, 48
+ PUT_HEVC_BI_EPEL_H8_LSX vr5, vr5, vr2, vr3, vr12
+ vld vr4, a4, 16
+ PUT_HEVC_BI_EPEL_H8_LSX vr5, vr9, vr21, vr22, vr13
+ vld vr4, a4, 32
+ PUT_HEVC_BI_EPEL_H8_LSX vr9, vr9, vr2, vr3, vr14
+ vld vr4, a4, 48
+ PUT_HEVC_BI_EPEL_H8_LSX vr9, vr10, vr21, vr22, vr15
+ vld vr4, a4, 64
+ PUT_HEVC_BI_EPEL_H8_LSX vr10, vr10, vr2, vr3, vr16
+ vld vr4, a4, 80
+ PUT_HEVC_BI_EPEL_H8_LSX vr10, vr11, vr21, vr22, vr17
+ vssrarni.bu.h vr13, vr12, 7
+ vssrarni.bu.h vr15, vr14, 7
+ vssrarni.bu.h vr17, vr16, 7
+ vst vr13, a0, 0
+ vst vr15, a0, 16
+ vst vr17, a0, 32
+ add.d a2, a2, a3
+ addi.d a4, a4, 128
+ add.d a0, a0, a1
+ addi.d a5, a5, -1
+ bnez a5, .LOOP_BI_EPEL_H48
+endfunc
+
+function ff_hevc_put_hevc_bi_epel_h48_8_lasx
+ addi.d a6, a6, -1
+ slli.w a6, a6, 2
+ la.local t0, ff_hevc_epel_filters
+ vldx vr0, t0, a6 // filter
+ xvreplve0.q xr0, xr0
+ xvrepl128vei.h xr1, xr0, 1
+ xvrepl128vei.h xr0, xr0, 0
+ la.local t0, shufb
+ xvld xr2, t0, 96 // mask
+ xvaddi.bu xr3, xr2, 2
+ addi.d a2, a2, -1 // src -= 1
+.LOOP_BI_EPEL_H48_LASX:
+ xvld xr4, a4, 0 // src2
+ xvld xr5, a2, 0
+ xvld xr9, a2, 32
+ xvpermi.d xr10, xr9, 0x94
+ xvpermi.q xr9, xr5, 0x21
+ xvpermi.d xr9, xr9, 0x94
+ xvpermi.d xr5, xr5, 0x94
+ PUT_HEVC_BI_EPEL_H16_LASX xr5, xr5, xr2, xr3, xr11
+ xvld xr4, a4, 32
+ PUT_HEVC_BI_EPEL_H16_LASX xr9, xr9, xr2, xr3, xr12
+ xvld xr4, a4, 64
+ PUT_HEVC_BI_EPEL_H16_LASX xr10, xr10, xr2, xr3, xr13
+ xvssrarni.bu.h xr12, xr11, 7
+ xvpermi.d xr12, xr12, 0xd8
+ xvpermi.q xr14, xr13, 0x01
+ vssrarni.bu.h vr14, vr13, 7
+ xvst xr12, a0, 0
+ vst vr14, a0, 32
+ add.d a2, a2, a3
+ addi.d a4, a4, 128
+ add.d a0, a0, a1
+ addi.d a5, a5, -1
+ bnez a5, .LOOP_BI_EPEL_H48_LASX
+endfunc
+
+function ff_hevc_put_hevc_bi_epel_h64_8_lsx
+ addi.d a6, a6, -1
+ slli.w a6, a6, 2
+ la.local t0, ff_hevc_epel_filters
+ vldx vr0, t0, a6 // filter
+ vreplvei.h vr1, vr0, 1
+ vreplvei.h vr0, vr0, 0
+ la.local t0, shufb
+ vld vr2, t0, 48 // mask
+ vaddi.bu vr3, vr2, 2
+ vaddi.bu vr21, vr2, 8
+ vaddi.bu vr22, vr2, 10
+ addi.d a2, a2, -1 // src -= 1
+.LOOP_BI_EPEL_H64:
+ vld vr4, a4, 0 // src2
+ vld vr5, a2, 0
+ vld vr9, a2, 16
+ vld vr10, a2, 32
+ vld vr11, a2, 48
+ vld vr12, a2, 64
+ PUT_HEVC_BI_EPEL_H8_LSX vr5, vr5, vr2, vr3, vr13
+ vld vr4, a4, 16
+ PUT_HEVC_BI_EPEL_H8_LSX vr5, vr9, vr21, vr22, vr14
+ vld vr4, a4, 32
+ PUT_HEVC_BI_EPEL_H8_LSX vr9, vr9, vr2, vr3, vr15
+ vld vr4, a4, 48
+ PUT_HEVC_BI_EPEL_H8_LSX vr9, vr10, vr21, vr22, vr16
+ vld vr4, a4, 64
+ PUT_HEVC_BI_EPEL_H8_LSX vr10, vr10, vr2, vr3, vr17
+ vld vr4, a4, 80
+ PUT_HEVC_BI_EPEL_H8_LSX vr10, vr11, vr21, vr22, vr18
+ vld vr4, a4, 96
+ PUT_HEVC_BI_EPEL_H8_LSX vr11, vr11, vr2, vr3, vr19
+ vld vr4, a4, 112
+ PUT_HEVC_BI_EPEL_H8_LSX vr11, vr12, vr21, vr22, vr20
+ vssrarni.bu.h vr14, vr13, 7
+ vssrarni.bu.h vr16, vr15, 7
+ vssrarni.bu.h vr18, vr17, 7
+ vssrarni.bu.h vr20, vr19, 7
+ vst vr14, a0, 0
+ vst vr16, a0, 16
+ vst vr18, a0, 32
+ vst vr20, a0, 48
+ add.d a2, a2, a3
+ addi.d a4, a4, 128
+ add.d a0, a0, a1
+ addi.d a5, a5, -1
+ bnez a5, .LOOP_BI_EPEL_H64
+endfunc
+
+function ff_hevc_put_hevc_bi_epel_h64_8_lasx
+ addi.d a6, a6, -1
+ slli.w a6, a6, 2
+ la.local t0, ff_hevc_epel_filters
+ vldx vr0, t0, a6 // filter
+ xvreplve0.q xr0, xr0
+ xvrepl128vei.h xr1, xr0, 1
+ xvrepl128vei.h xr0, xr0, 0
+ la.local t0, shufb
+ xvld xr2, t0, 96 // mask
+ xvaddi.bu xr3, xr2, 2
+ addi.d a2, a2, -1 // src -= 1
+.LOOP_BI_EPEL_H64_LASX:
+ xvld xr4, a4, 0 // src2
+ xvld xr5, a2, 0
+ xvld xr9, a2, 32
+ xvld xr11, a2, 48
+ xvpermi.d xr11, xr11, 0x94
+ xvpermi.d xr10, xr9, 0x94
+ xvpermi.q xr9, xr5, 0x21
+ xvpermi.d xr9, xr9, 0x94
+ xvpermi.d xr5, xr5, 0x94
+ PUT_HEVC_BI_EPEL_H16_LASX xr5, xr5, xr2, xr3, xr12
+ xvld xr4, a4, 32
+ PUT_HEVC_BI_EPEL_H16_LASX xr9, xr9, xr2, xr3, xr13
+ xvld xr4, a4, 64
+ PUT_HEVC_BI_EPEL_H16_LASX xr10, xr10, xr2, xr3, xr14
+ xvld xr4, a4, 96
+ PUT_HEVC_BI_EPEL_H16_LASX xr11, xr11, xr2, xr3, xr15
+ xvssrarni.bu.h xr13, xr12, 7
+ xvssrarni.bu.h xr15, xr14, 7
+ xvpermi.d xr13, xr13, 0xd8
+ xvpermi.d xr15, xr15, 0xd8
+ xvst xr13, a0, 0
+ xvst xr15, a0, 32
+ add.d a2, a2, a3
+ addi.d a4, a4, 128
+ add.d a0, a0, a1
+ addi.d a5, a5, -1
+ bnez a5, .LOOP_BI_EPEL_H64_LASX
+endfunc
diff --git a/libavcodec/loongarch/hevcdsp_init_loongarch.c b/libavcodec/loongarch/hevcdsp_init_loongarch.c
index 245a833947..2756755733 100644
--- a/libavcodec/loongarch/hevcdsp_init_loongarch.c
+++ b/libavcodec/loongarch/hevcdsp_init_loongarch.c
@@ -124,8 +124,15 @@ void ff_hevc_dsp_init_loongarch(HEVCDSPContext *c, const int bit_depth)
c->put_hevc_qpel_bi[8][0][1] = ff_hevc_put_hevc_bi_qpel_h48_8_lsx;
c->put_hevc_qpel_bi[9][0][1] = ff_hevc_put_hevc_bi_qpel_h64_8_lsx;
+ c->put_hevc_epel_bi[1][0][1] = ff_hevc_put_hevc_bi_epel_h4_8_lsx;
+ c->put_hevc_epel_bi[2][0][1] = ff_hevc_put_hevc_bi_epel_h6_8_lsx;
+ c->put_hevc_epel_bi[3][0][1] = ff_hevc_put_hevc_bi_epel_h8_8_lsx;
+ c->put_hevc_epel_bi[4][0][1] = ff_hevc_put_hevc_bi_epel_h12_8_lsx;
+ c->put_hevc_epel_bi[5][0][1] = ff_hevc_put_hevc_bi_epel_h16_8_lsx;
c->put_hevc_epel_bi[6][0][1] = ff_hevc_put_hevc_bi_epel_h24_8_lsx;
c->put_hevc_epel_bi[7][0][1] = ff_hevc_put_hevc_bi_epel_h32_8_lsx;
+ c->put_hevc_epel_bi[8][0][1] = ff_hevc_put_hevc_bi_epel_h48_8_lsx;
+ c->put_hevc_epel_bi[9][0][1] = ff_hevc_put_hevc_bi_epel_h64_8_lsx;
c->put_hevc_epel_bi[4][1][0] = ff_hevc_put_hevc_bi_epel_v12_8_lsx;
c->put_hevc_epel_bi[5][1][0] = ff_hevc_put_hevc_bi_epel_v16_8_lsx;
@@ -138,6 +145,14 @@ void ff_hevc_dsp_init_loongarch(HEVCDSPContext *c, const int bit_depth)
c->put_hevc_epel_bi[6][1][1] = ff_hevc_put_hevc_bi_epel_hv24_8_lsx;
c->put_hevc_epel_bi[7][1][1] = ff_hevc_put_hevc_bi_epel_hv32_8_lsx;
+ c->put_hevc_qpel_uni[1][0][1] = ff_hevc_put_hevc_uni_qpel_h4_8_lsx;
+ c->put_hevc_qpel_uni[2][0][1] = ff_hevc_put_hevc_uni_qpel_h6_8_lsx;
+ c->put_hevc_qpel_uni[3][0][1] = ff_hevc_put_hevc_uni_qpel_h8_8_lsx;
+ c->put_hevc_qpel_uni[4][0][1] = ff_hevc_put_hevc_uni_qpel_h12_8_lsx;
+ c->put_hevc_qpel_uni[5][0][1] = ff_hevc_put_hevc_uni_qpel_h16_8_lsx;
+ c->put_hevc_qpel_uni[6][0][1] = ff_hevc_put_hevc_uni_qpel_h24_8_lsx;
+ c->put_hevc_qpel_uni[7][0][1] = ff_hevc_put_hevc_uni_qpel_h32_8_lsx;
+ c->put_hevc_qpel_uni[8][0][1] = ff_hevc_put_hevc_uni_qpel_h48_8_lsx;
c->put_hevc_qpel_uni[9][0][1] = ff_hevc_put_hevc_uni_qpel_h64_8_lsx;
c->put_hevc_qpel_uni[6][1][0] = ff_hevc_put_hevc_uni_qpel_v24_8_lsx;
@@ -191,6 +206,26 @@ void ff_hevc_dsp_init_loongarch(HEVCDSPContext *c, const int bit_depth)
c->put_hevc_epel_uni_w[8][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels48_8_lsx;
c->put_hevc_epel_uni_w[9][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels64_8_lsx;
+ c->put_hevc_epel_uni_w[1][0][1] = ff_hevc_put_hevc_epel_uni_w_h4_8_lsx;
+ c->put_hevc_epel_uni_w[2][0][1] = ff_hevc_put_hevc_epel_uni_w_h6_8_lsx;
+ c->put_hevc_epel_uni_w[3][0][1] = ff_hevc_put_hevc_epel_uni_w_h8_8_lsx;
+ c->put_hevc_epel_uni_w[4][0][1] = ff_hevc_put_hevc_epel_uni_w_h12_8_lsx;
+ c->put_hevc_epel_uni_w[5][0][1] = ff_hevc_put_hevc_epel_uni_w_h16_8_lsx;
+ c->put_hevc_epel_uni_w[6][0][1] = ff_hevc_put_hevc_epel_uni_w_h24_8_lsx;
+ c->put_hevc_epel_uni_w[7][0][1] = ff_hevc_put_hevc_epel_uni_w_h32_8_lsx;
+ c->put_hevc_epel_uni_w[8][0][1] = ff_hevc_put_hevc_epel_uni_w_h48_8_lsx;
+ c->put_hevc_epel_uni_w[9][0][1] = ff_hevc_put_hevc_epel_uni_w_h64_8_lsx;
+
+ c->put_hevc_epel_uni_w[1][1][0] = ff_hevc_put_hevc_epel_uni_w_v4_8_lsx;
+ c->put_hevc_epel_uni_w[2][1][0] = ff_hevc_put_hevc_epel_uni_w_v6_8_lsx;
+ c->put_hevc_epel_uni_w[3][1][0] = ff_hevc_put_hevc_epel_uni_w_v8_8_lsx;
+ c->put_hevc_epel_uni_w[4][1][0] = ff_hevc_put_hevc_epel_uni_w_v12_8_lsx;
+ c->put_hevc_epel_uni_w[5][1][0] = ff_hevc_put_hevc_epel_uni_w_v16_8_lsx;
+ c->put_hevc_epel_uni_w[6][1][0] = ff_hevc_put_hevc_epel_uni_w_v24_8_lsx;
+ c->put_hevc_epel_uni_w[7][1][0] = ff_hevc_put_hevc_epel_uni_w_v32_8_lsx;
+ c->put_hevc_epel_uni_w[8][1][0] = ff_hevc_put_hevc_epel_uni_w_v48_8_lsx;
+ c->put_hevc_epel_uni_w[9][1][0] = ff_hevc_put_hevc_epel_uni_w_v64_8_lsx;
+
c->put_hevc_qpel_uni_w[3][1][1] = ff_hevc_put_hevc_uni_w_qpel_hv8_8_lsx;
c->put_hevc_qpel_uni_w[5][1][1] = ff_hevc_put_hevc_uni_w_qpel_hv16_8_lsx;
c->put_hevc_qpel_uni_w[6][1][1] = ff_hevc_put_hevc_uni_w_qpel_hv24_8_lsx;
@@ -277,6 +312,15 @@ void ff_hevc_dsp_init_loongarch(HEVCDSPContext *c, const int bit_depth)
c->put_hevc_epel_uni_w[8][1][1] = ff_hevc_put_hevc_epel_uni_w_hv48_8_lasx;
c->put_hevc_epel_uni_w[9][1][1] = ff_hevc_put_hevc_epel_uni_w_hv64_8_lasx;
+ c->put_hevc_epel_uni_w[2][0][1] = ff_hevc_put_hevc_epel_uni_w_h6_8_lasx;
+ c->put_hevc_epel_uni_w[3][0][1] = ff_hevc_put_hevc_epel_uni_w_h8_8_lasx;
+ c->put_hevc_epel_uni_w[4][0][1] = ff_hevc_put_hevc_epel_uni_w_h12_8_lasx;
+ c->put_hevc_epel_uni_w[5][0][1] = ff_hevc_put_hevc_epel_uni_w_h16_8_lasx;
+ c->put_hevc_epel_uni_w[6][0][1] = ff_hevc_put_hevc_epel_uni_w_h24_8_lasx;
+ c->put_hevc_epel_uni_w[7][0][1] = ff_hevc_put_hevc_epel_uni_w_h32_8_lasx;
+ c->put_hevc_epel_uni_w[8][0][1] = ff_hevc_put_hevc_epel_uni_w_h48_8_lasx;
+ c->put_hevc_epel_uni_w[9][0][1] = ff_hevc_put_hevc_epel_uni_w_h64_8_lasx;
+
c->put_hevc_qpel_uni_w[3][1][0] = ff_hevc_put_hevc_qpel_uni_w_v8_8_lasx;
c->put_hevc_qpel_uni_w[4][1][0] = ff_hevc_put_hevc_qpel_uni_w_v12_8_lasx;
c->put_hevc_qpel_uni_w[5][1][0] = ff_hevc_put_hevc_qpel_uni_w_v16_8_lasx;
@@ -285,6 +329,15 @@ void ff_hevc_dsp_init_loongarch(HEVCDSPContext *c, const int bit_depth)
c->put_hevc_qpel_uni_w[8][1][0] = ff_hevc_put_hevc_qpel_uni_w_v48_8_lasx;
c->put_hevc_qpel_uni_w[9][1][0] = ff_hevc_put_hevc_qpel_uni_w_v64_8_lasx;
+ c->put_hevc_epel_uni_w[2][1][0] = ff_hevc_put_hevc_epel_uni_w_v6_8_lasx;
+ c->put_hevc_epel_uni_w[3][1][0] = ff_hevc_put_hevc_epel_uni_w_v8_8_lasx;
+ c->put_hevc_epel_uni_w[4][1][0] = ff_hevc_put_hevc_epel_uni_w_v12_8_lasx;
+ c->put_hevc_epel_uni_w[5][1][0] = ff_hevc_put_hevc_epel_uni_w_v16_8_lasx;
+ c->put_hevc_epel_uni_w[6][1][0] = ff_hevc_put_hevc_epel_uni_w_v24_8_lasx;
+ c->put_hevc_epel_uni_w[7][1][0] = ff_hevc_put_hevc_epel_uni_w_v32_8_lasx;
+ c->put_hevc_epel_uni_w[8][1][0] = ff_hevc_put_hevc_epel_uni_w_v48_8_lasx;
+ c->put_hevc_epel_uni_w[9][1][0] = ff_hevc_put_hevc_epel_uni_w_v64_8_lasx;
+
c->put_hevc_qpel_uni_w[1][0][1] = ff_hevc_put_hevc_qpel_uni_w_h4_8_lasx;
c->put_hevc_qpel_uni_w[2][0][1] = ff_hevc_put_hevc_qpel_uni_w_h6_8_lasx;
c->put_hevc_qpel_uni_w[3][0][1] = ff_hevc_put_hevc_qpel_uni_w_h8_8_lasx;
@@ -294,6 +347,19 @@ void ff_hevc_dsp_init_loongarch(HEVCDSPContext *c, const int bit_depth)
c->put_hevc_qpel_uni_w[7][0][1] = ff_hevc_put_hevc_qpel_uni_w_h32_8_lasx;
c->put_hevc_qpel_uni_w[8][0][1] = ff_hevc_put_hevc_qpel_uni_w_h48_8_lasx;
c->put_hevc_qpel_uni_w[9][0][1] = ff_hevc_put_hevc_qpel_uni_w_h64_8_lasx;
+
+ c->put_hevc_qpel_uni[4][0][1] = ff_hevc_put_hevc_uni_qpel_h12_8_lasx;
+ c->put_hevc_qpel_uni[5][0][1] = ff_hevc_put_hevc_uni_qpel_h16_8_lasx;
+ c->put_hevc_qpel_uni[6][0][1] = ff_hevc_put_hevc_uni_qpel_h24_8_lasx;
+ c->put_hevc_qpel_uni[7][0][1] = ff_hevc_put_hevc_uni_qpel_h32_8_lasx;
+ c->put_hevc_qpel_uni[8][0][1] = ff_hevc_put_hevc_uni_qpel_h48_8_lasx;
+ c->put_hevc_qpel_uni[9][0][1] = ff_hevc_put_hevc_uni_qpel_h64_8_lasx;
+
+ c->put_hevc_epel_bi[4][0][1] = ff_hevc_put_hevc_bi_epel_h12_8_lasx;
+ c->put_hevc_epel_bi[5][0][1] = ff_hevc_put_hevc_bi_epel_h16_8_lasx;
+ c->put_hevc_epel_bi[7][0][1] = ff_hevc_put_hevc_bi_epel_h32_8_lasx;
+ c->put_hevc_epel_bi[8][0][1] = ff_hevc_put_hevc_bi_epel_h48_8_lasx;
+ c->put_hevc_epel_bi[9][0][1] = ff_hevc_put_hevc_bi_epel_h64_8_lasx;
}
}
}
diff --git a/libavcodec/loongarch/hevcdsp_lasx.h b/libavcodec/loongarch/hevcdsp_lasx.h
index 7f09d0943a..5db35eed47 100644
--- a/libavcodec/loongarch/hevcdsp_lasx.h
+++ b/libavcodec/loongarch/hevcdsp_lasx.h
@@ -75,6 +75,60 @@ PEL_UNI_W(epel, hv, 32);
PEL_UNI_W(epel, hv, 48);
PEL_UNI_W(epel, hv, 64);
+PEL_UNI_W(epel, v, 6);
+PEL_UNI_W(epel, v, 8);
+PEL_UNI_W(epel, v, 12);
+PEL_UNI_W(epel, v, 16);
+PEL_UNI_W(epel, v, 24);
+PEL_UNI_W(epel, v, 32);
+PEL_UNI_W(epel, v, 48);
+PEL_UNI_W(epel, v, 64);
+
+PEL_UNI_W(epel, h, 6);
+PEL_UNI_W(epel, h, 8);
+PEL_UNI_W(epel, h, 12);
+PEL_UNI_W(epel, h, 16);
+PEL_UNI_W(epel, h, 24);
+PEL_UNI_W(epel, h, 32);
+PEL_UNI_W(epel, h, 48);
+PEL_UNI_W(epel, h, 64);
+
#undef PEL_UNI_W
+#define UNI_MC(PEL, DIR, WIDTH) \
+void ff_hevc_put_hevc_uni_##PEL##_##DIR##WIDTH##_8_lasx(uint8_t *dst, \
+ ptrdiff_t dst_stride, \
+ const uint8_t *src, \
+ ptrdiff_t src_stride, \
+ int height, \
+ intptr_t mx, \
+ intptr_t my, \
+ int width)
+UNI_MC(qpel, h, 12);
+UNI_MC(qpel, h, 16);
+UNI_MC(qpel, h, 24);
+UNI_MC(qpel, h, 32);
+UNI_MC(qpel, h, 48);
+UNI_MC(qpel, h, 64);
+
+#undef UNI_MC
+
+#define BI_MC(PEL, DIR, WIDTH) \
+void ff_hevc_put_hevc_bi_##PEL##_##DIR##WIDTH##_8_lasx(uint8_t *dst, \
+ ptrdiff_t dst_stride, \
+ const uint8_t *src, \
+ ptrdiff_t src_stride, \
+ const int16_t *src_16bit, \
+ int height, \
+ intptr_t mx, \
+ intptr_t my, \
+ int width)
+BI_MC(epel, h, 12);
+BI_MC(epel, h, 16);
+BI_MC(epel, h, 32);
+BI_MC(epel, h, 48);
+BI_MC(epel, h, 64);
+
+#undef BI_MC
+
#endif // #ifndef AVCODEC_LOONGARCH_HEVCDSP_LASX_H
diff --git a/libavcodec/loongarch/hevcdsp_lsx.h b/libavcodec/loongarch/hevcdsp_lsx.h
index 7769cf25ae..a5ef237b5d 100644
--- a/libavcodec/loongarch/hevcdsp_lsx.h
+++ b/libavcodec/loongarch/hevcdsp_lsx.h
@@ -126,8 +126,15 @@ BI_MC(qpel, hv, 32);
BI_MC(qpel, hv, 48);
BI_MC(qpel, hv, 64);
+BI_MC(epel, h, 4);
+BI_MC(epel, h, 6);
+BI_MC(epel, h, 8);
+BI_MC(epel, h, 12);
+BI_MC(epel, h, 16);
BI_MC(epel, h, 24);
BI_MC(epel, h, 32);
+BI_MC(epel, h, 48);
+BI_MC(epel, h, 64);
BI_MC(epel, v, 12);
BI_MC(epel, v, 16);
@@ -151,7 +158,14 @@ void ff_hevc_put_hevc_uni_##PEL##_##DIR##WIDTH##_8_lsx(uint8_t *dst, \
intptr_t mx, \
intptr_t my, \
int width)
-
+UNI_MC(qpel, h, 4);
+UNI_MC(qpel, h, 6);
+UNI_MC(qpel, h, 8);
+UNI_MC(qpel, h, 12);
+UNI_MC(qpel, h, 16);
+UNI_MC(qpel, h, 24);
+UNI_MC(qpel, h, 32);
+UNI_MC(qpel, h, 48);
UNI_MC(qpel, h, 64);
UNI_MC(qpel, v, 24);
@@ -287,6 +301,26 @@ PEL_UNI_W(epel, hv, 32);
PEL_UNI_W(epel, hv, 48);
PEL_UNI_W(epel, hv, 64);
+PEL_UNI_W(epel, h, 4);
+PEL_UNI_W(epel, h, 6);
+PEL_UNI_W(epel, h, 8);
+PEL_UNI_W(epel, h, 12);
+PEL_UNI_W(epel, h, 16);
+PEL_UNI_W(epel, h, 24);
+PEL_UNI_W(epel, h, 32);
+PEL_UNI_W(epel, h, 48);
+PEL_UNI_W(epel, h, 64);
+
+PEL_UNI_W(epel, v, 4);
+PEL_UNI_W(epel, v, 6);
+PEL_UNI_W(epel, v, 8);
+PEL_UNI_W(epel, v, 12);
+PEL_UNI_W(epel, v, 16);
+PEL_UNI_W(epel, v, 24);
+PEL_UNI_W(epel, v, 32);
+PEL_UNI_W(epel, v, 48);
+PEL_UNI_W(epel, v, 64);
+
#undef PEL_UNI_W
#endif // #ifndef AVCODEC_LOONGARCH_HEVCDSP_LSX_H
--
2.20.1
* [FFmpeg-devel] [PATCH v2 7/7] avcodec/hevc: Add ff_hevc_idct_32x32_lasx asm opt
2023-12-27 4:50 [FFmpeg-devel] [PATCH v2] [loongarch] Add hevc 128-bit & 256-bit asm optimizatons jinbo
` (5 preceding siblings ...)
2023-12-27 4:50 ` [FFmpeg-devel] [PATCH v2 6/7] avcodec/hevc: Add asm opt for the following functions jinbo
@ 2023-12-27 4:50 ` jinbo
2023-12-28 7:29 ` yinshiyou-hf
6 siblings, 1 reply; 10+ messages in thread
From: jinbo @ 2023-12-27 4:50 UTC (permalink / raw)
To: ffmpeg-devel; +Cc: yuanhecai
From: yuanhecai <yuanhecai@loongson.cn>
tests/checkasm/checkasm:
                        C        LSX      LASX
hevc_idct_32x32_8_c:    1243.0   211.7    101.7
The speedup of decoding H265 4K 30FPS 30Mbps on 3A6000
with 8 threads is 1 fps (56fps --> 57fps).
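(For reference, the checkasm numbers above are presumably gathered with
something like "make checkasm && tests/checkasm/checkasm --bench
--test=hevc_idct"; the exact option spelling is an assumption here.)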
---
libavcodec/loongarch/Makefile | 3 +-
libavcodec/loongarch/hevc_idct.S | 863 ++++++++++++++++++
libavcodec/loongarch/hevc_idct_lsx.c | 10 +-
libavcodec/loongarch/hevcdsp_init_loongarch.c | 2 +
libavcodec/loongarch/hevcdsp_lasx.h | 2 +
5 files changed, 874 insertions(+), 6 deletions(-)
create mode 100644 libavcodec/loongarch/hevc_idct.S
diff --git a/libavcodec/loongarch/Makefile b/libavcodec/loongarch/Makefile
index ad98cd4054..07da2964e4 100644
--- a/libavcodec/loongarch/Makefile
+++ b/libavcodec/loongarch/Makefile
@@ -29,7 +29,8 @@ LSX-OBJS-$(CONFIG_HEVC_DECODER) += loongarch/hevcdsp_lsx.o \
loongarch/hevc_mc_uni_lsx.o \
loongarch/hevc_mc_uniw_lsx.o \
loongarch/hevc_add_res.o \
- loongarch/hevc_mc.o
+ loongarch/hevc_mc.o \
+ loongarch/hevc_idct.o
LSX-OBJS-$(CONFIG_H264DSP) += loongarch/h264idct.o \
loongarch/h264idct_loongarch.o \
loongarch/h264dsp.o
diff --git a/libavcodec/loongarch/hevc_idct.S b/libavcodec/loongarch/hevc_idct.S
new file mode 100644
index 0000000000..5593e5fd73
--- /dev/null
+++ b/libavcodec/loongarch/hevc_idct.S
@@ -0,0 +1,863 @@
+/*
+ * Copyright (c) 2023 Loongson Technology Corporation Limited
+ * Contributed by Hecai Yuan <yuanhecai@loongson.cn>
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "loongson_asm.S"
+
+.macro fr_store
+ addi.d sp, sp, -64
+ fst.d f24, sp, 0
+ fst.d f25, sp, 8
+ fst.d f26, sp, 16
+ fst.d f27, sp, 24
+ fst.d f28, sp, 32
+ fst.d f29, sp, 40
+ fst.d f30, sp, 48
+ fst.d f31, sp, 56
+.endm
+
+.macro fr_recover
+ fld.d f24, sp, 0
+ fld.d f25, sp, 8
+ fld.d f26, sp, 16
+ fld.d f27, sp, 24
+ fld.d f28, sp, 32
+ fld.d f29, sp, 40
+ fld.d f30, sp, 48
+ fld.d f31, sp, 56
+ addi.d sp, sp, 64
+.endm
+
+.macro malloc_space number
+ li.w t0, \number
+ sub.d sp, sp, t0
+ fr_store
+.endm
+
+.macro free_space number
+ fr_recover
+ li.w t0, \number
+ add.d sp, sp, t0
+.endm
+
+.extern gt32x32_cnst1
+
+.extern gt32x32_cnst2
+
+.extern gt8x8_cnst
+
+.extern gt32x32_cnst0
+
+.macro idct_16x32_step1_lasx
+ xvldrepl.w xr20, t1, 0
+ xvldrepl.w xr21, t1, 4
+ xvldrepl.w xr22, t1, 8
+ xvldrepl.w xr23, t1, 12
+
+ xvmulwev.w.h xr16, xr8, xr20
+ xvmaddwod.w.h xr16, xr8, xr20
+ xvmulwev.w.h xr17, xr9, xr20
+ xvmaddwod.w.h xr17, xr9, xr20
+
+ xvmaddwev.w.h xr16, xr10, xr21
+ xvmaddwod.w.h xr16, xr10, xr21
+ xvmaddwev.w.h xr17, xr11, xr21
+ xvmaddwod.w.h xr17, xr11, xr21
+
+ xvmaddwev.w.h xr16, xr12, xr22
+ xvmaddwod.w.h xr16, xr12, xr22
+ xvmaddwev.w.h xr17, xr13, xr22
+ xvmaddwod.w.h xr17, xr13, xr22
+
+ xvmaddwev.w.h xr16, xr14, xr23
+ xvmaddwod.w.h xr16, xr14, xr23
+ xvmaddwev.w.h xr17, xr15, xr23
+ xvmaddwod.w.h xr17, xr15, xr23
+
+ xvld xr0, t2, 0
+ xvld xr1, t2, 32
+
+ xvadd.w xr18, xr0, xr16
+ xvadd.w xr19, xr1, xr17
+ xvsub.w xr0, xr0, xr16
+ xvsub.w xr1, xr1, xr17
+
+ xvst xr18, t2, 0
+ xvst xr19, t2, 32
+ xvst xr0, t3, 0
+ xvst xr1, t3, 32
+.endm
+
+.macro idct_16x32_step2_lasx in0, in1, in2, in3, in4, in5, in6, in7, out0, out1
+
+ xvldrepl.w xr20, t1, 0
+ xvldrepl.w xr21, t1, 4
+ xvldrepl.w xr22, t1, 8
+ xvldrepl.w xr23, t1, 12
+
+ xvmulwev.w.h \out0, \in0, xr20
+ xvmaddwod.w.h \out0, \in0, xr20
+ xvmulwev.w.h \out1, \in1, xr20
+ xvmaddwod.w.h \out1, \in1, xr20
+ xvmaddwev.w.h \out0, \in2, xr21
+ xvmaddwod.w.h \out0, \in2, xr21
+ xvmaddwev.w.h \out1, \in3, xr21
+ xvmaddwod.w.h \out1, \in3, xr21
+ xvmaddwev.w.h \out0, \in4, xr22
+ xvmaddwod.w.h \out0, \in4, xr22
+ xvmaddwev.w.h \out1, \in5, xr22
+ xvmaddwod.w.h \out1, \in5, xr22
+ xvmaddwev.w.h \out0, \in6, xr23
+ xvmaddwod.w.h \out0, \in6, xr23
+ xvmaddwev.w.h \out1, \in7, xr23 // sum0_r
+ xvmaddwod.w.h \out1, \in7, xr23 // sum0_l
+.endm
+
+ /* loop for all columns of filter constants */
+.macro idct_16x32_step3_lasx round
+ xvadd.w xr16, xr16, xr30
+ xvadd.w xr17, xr17, xr31
+
+ xvld xr0, t2, 0
+ xvld xr1, t2, 32
+
+ xvadd.w xr30, xr0, xr16
+ xvadd.w xr31, xr1, xr17
+ xvsub.w xr16, xr0, xr16
+ xvsub.w xr17, xr1, xr17
+ xvssrarni.h.w xr31, xr30, \round
+ xvssrarni.h.w xr17, xr16, \round
+ xvst xr31, t4, 0
+ xvst xr17, t5, 0
+.endm
+
+.macro idct_16x32_lasx buf_pitch, round
+ addi.d t2, sp, 64
+
+ addi.d t0, a0, \buf_pitch*4*2
+
+ // 4 12 20 28
+ xvld xr0, t0, 0
+ xvld xr1, t0, \buf_pitch*8*2
+ xvld xr2, t0, \buf_pitch*16*2
+ xvld xr3, t0, \buf_pitch*24*2
+
+ xvilvl.h xr10, xr1, xr0
+ xvilvh.h xr11, xr1, xr0
+ xvilvl.h xr12, xr3, xr2
+ xvilvh.h xr13, xr3, xr2
+
+ la.local t1, gt32x32_cnst2
+
+ xvldrepl.w xr20, t1, 0
+ xvldrepl.w xr21, t1, 4
+ xvmulwev.w.h xr14, xr10, xr20
+ xvmaddwod.w.h xr14, xr10, xr20
+ xvmulwev.w.h xr15, xr11, xr20
+ xvmaddwod.w.h xr15, xr11, xr20
+ xvmaddwev.w.h xr14, xr12, xr21
+ xvmaddwod.w.h xr14, xr12, xr21
+ xvmaddwev.w.h xr15, xr13, xr21
+ xvmaddwod.w.h xr15, xr13, xr21
+
+ xvldrepl.w xr20, t1, 8
+ xvldrepl.w xr21, t1, 12
+ xvmulwev.w.h xr16, xr10, xr20
+ xvmaddwod.w.h xr16, xr10, xr20
+ xvmulwev.w.h xr17, xr11, xr20
+ xvmaddwod.w.h xr17, xr11, xr20
+ xvmaddwev.w.h xr16, xr12, xr21
+ xvmaddwod.w.h xr16, xr12, xr21
+ xvmaddwev.w.h xr17, xr13, xr21
+ xvmaddwod.w.h xr17, xr13, xr21
+
+ xvldrepl.w xr20, t1, 16
+ xvldrepl.w xr21, t1, 20
+ xvmulwev.w.h xr18, xr10, xr20
+ xvmaddwod.w.h xr18, xr10, xr20
+ xvmulwev.w.h xr19, xr11, xr20
+ xvmaddwod.w.h xr19, xr11, xr20
+ xvmaddwev.w.h xr18, xr12, xr21
+ xvmaddwod.w.h xr18, xr12, xr21
+ xvmaddwev.w.h xr19, xr13, xr21
+ xvmaddwod.w.h xr19, xr13, xr21
+
+ xvldrepl.w xr20, t1, 24
+ xvldrepl.w xr21, t1, 28
+ xvmulwev.w.h xr22, xr10, xr20
+ xvmaddwod.w.h xr22, xr10, xr20
+ xvmulwev.w.h xr23, xr11, xr20
+ xvmaddwod.w.h xr23, xr11, xr20
+ xvmaddwev.w.h xr22, xr12, xr21
+ xvmaddwod.w.h xr22, xr12, xr21
+ xvmaddwev.w.h xr23, xr13, xr21
+ xvmaddwod.w.h xr23, xr13, xr21
+
+ /* process coeff 0, 8, 16, 24 */
+ la.local t1, gt8x8_cnst
+
+ xvld xr0, a0, 0
+ xvld xr1, a0, \buf_pitch*8*2
+ xvld xr2, a0, \buf_pitch*16*2
+ xvld xr3, a0, \buf_pitch*24*2
+
+ xvldrepl.w xr20, t1, 0
+ xvldrepl.w xr21, t1, 4
+
+ xvilvl.h xr10, xr2, xr0
+ xvilvh.h xr11, xr2, xr0
+ xvilvl.h xr12, xr3, xr1
+ xvilvh.h xr13, xr3, xr1
+
+ xvmulwev.w.h xr4, xr10, xr20
+ xvmaddwod.w.h xr4, xr10, xr20 // sum0_r
+ xvmulwev.w.h xr5, xr11, xr20
+ xvmaddwod.w.h xr5, xr11, xr20 // sum0_l
+ xvmulwev.w.h xr6, xr12, xr21
+ xvmaddwod.w.h xr6, xr12, xr21 // tmp1_r
+ xvmulwev.w.h xr7, xr13, xr21
+ xvmaddwod.w.h xr7, xr13, xr21 // tmp1_l
+
+ xvsub.w xr0, xr4, xr6 // sum1_r
+ xvadd.w xr1, xr4, xr6 // sum0_r
+ xvsub.w xr2, xr5, xr7 // sum1_l
+ xvadd.w xr3, xr5, xr7 // sum0_l
+
+ // HEVC_EVEN16_CALC
+ xvsub.w xr24, xr1, xr14 // 7
+ xvsub.w xr25, xr3, xr15
+ xvadd.w xr14, xr1, xr14 // 0
+ xvadd.w xr15, xr3, xr15
+ xvst xr24, t2, 7*16*4 // 448=16*28=7*16*4
+ xvst xr25, t2, 7*16*4+32 // 480
+ xvst xr14, t2, 0
+ xvst xr15, t2, 32
+
+ xvsub.w xr26, xr0, xr22 // 4
+ xvsub.w xr27, xr2, xr23
+ xvadd.w xr22, xr0, xr22 // 3
+ xvadd.w xr23, xr2, xr23
+ xvst xr26, t2, 4*16*4 // 256=4*16*4
+ xvst xr27, t2, 4*16*4+32 // 288
+ xvst xr22, t2, 3*16*4 // 192=3*16*4
+ xvst xr23, t2, 3*16*4+32 // 224
+
+ xvldrepl.w xr20, t1, 16
+ xvldrepl.w xr21, t1, 20
+
+ xvmulwev.w.h xr4, xr10, xr20
+ xvmaddwod.w.h xr4, xr10, xr20
+ xvmulwev.w.h xr5, xr11, xr20
+ xvmaddwod.w.h xr5, xr11, xr20
+ xvmulwev.w.h xr6, xr12, xr21
+ xvmaddwod.w.h xr6, xr12, xr21
+ xvmulwev.w.h xr7, xr13, xr21
+ xvmaddwod.w.h xr7, xr13, xr21
+
+ xvsub.w xr0, xr4, xr6 // sum1_r
+ xvadd.w xr1, xr4, xr6 // sum0_r
+ xvsub.w xr2, xr5, xr7 // sum1_l
+ xvadd.w xr3, xr5, xr7 // sum0_l
+
+ // HEVC_EVEN16_CALC
+ xvsub.w xr24, xr1, xr16 // 6
+ xvsub.w xr25, xr3, xr17
+ xvadd.w xr16, xr1, xr16 // 1
+ xvadd.w xr17, xr3, xr17
+ xvst xr24, t2, 6*16*4 // 384=6*16*4
+ xvst xr25, t2, 6*16*4+32 // 416
+ xvst xr16, t2, 1*16*4 // 64=1*16*4
+ xvst xr17, t2, 1*16*4+32 // 96
+
+ xvsub.w xr26, xr0, xr18 // 5
+ xvsub.w xr27, xr2, xr19
+ xvadd.w xr18, xr0, xr18 // 2
+ xvadd.w xr19, xr2, xr19
+ xvst xr26, t2, 5*16*4 // 320=5*16*4
+ xvst xr27, t2, 5*16*4+32 // 352
+ xvst xr18, t2, 2*16*4 // 128=2*16*4
+ xvst xr19, t2, 2*16*4+32 // 160
+
+ /* process coeff 2 6 10 14 18 22 26 30 */
+ addi.d t0, a0, \buf_pitch*2*2
+
+ xvld xr0, t0, 0
+ xvld xr1, t0, \buf_pitch*4*2
+ xvld xr2, t0, \buf_pitch*8*2
+ xvld xr3, t0, \buf_pitch*12*2
+
+ xvld xr4, t0, \buf_pitch*16*2
+ xvld xr5, t0, \buf_pitch*20*2
+ xvld xr6, t0, \buf_pitch*24*2
+ xvld xr7, t0, \buf_pitch*28*2
+
+ xvilvl.h xr8, xr1, xr0
+ xvilvh.h xr9, xr1, xr0
+ xvilvl.h xr10, xr3, xr2
+ xvilvh.h xr11, xr3, xr2
+ xvilvl.h xr12, xr5, xr4
+ xvilvh.h xr13, xr5, xr4
+ xvilvl.h xr14, xr7, xr6
+ xvilvh.h xr15, xr7, xr6
+
+ la.local t1, gt32x32_cnst1
+
+ addi.d t2, sp, 64
+ addi.d t3, sp, 64+960 // 30*32
+
+ idct_16x32_step1_lasx
+
+.rept 7
+ addi.d t1, t1, 16
+ addi.d t2, t2, 64
+ addi.d t3, t3, -64
+ idct_16x32_step1_lasx
+.endr
+
+ addi.d t0, a0, \buf_pitch*2
+
+ xvld xr0, t0, 0
+ xvld xr1, t0, \buf_pitch*2*2
+ xvld xr2, t0, \buf_pitch*4*2
+ xvld xr3, t0, \buf_pitch*6*2
+ xvld xr4, t0, \buf_pitch*8*2
+ xvld xr5, t0, \buf_pitch*10*2
+ xvld xr6, t0, \buf_pitch*12*2
+ xvld xr7, t0, \buf_pitch*14*2
+
+ xvilvl.h xr8, xr1, xr0
+ xvilvh.h xr9, xr1, xr0
+ xvilvl.h xr10, xr3, xr2
+ xvilvh.h xr11, xr3, xr2
+ xvilvl.h xr12, xr5, xr4
+ xvilvh.h xr13, xr5, xr4
+ xvilvl.h xr14, xr7, xr6
+ xvilvh.h xr15, xr7, xr6
+
+ la.local t1, gt32x32_cnst0
+
+ idct_16x32_step2_lasx xr8, xr9, xr10, xr11, xr12, xr13, \
+ xr14, xr15, xr16, xr17
+
+ addi.d t0, a0, \buf_pitch*16*2+\buf_pitch*2
+
+ xvld xr0, t0, 0
+ xvld xr1, t0, \buf_pitch*2*2
+ xvld xr2, t0, \buf_pitch*4*2
+ xvld xr3, t0, \buf_pitch*6*2
+ xvld xr4, t0, \buf_pitch*8*2
+ xvld xr5, t0, \buf_pitch*10*2
+ xvld xr6, t0, \buf_pitch*12*2
+ xvld xr7, t0, \buf_pitch*14*2
+
+ xvilvl.h xr18, xr1, xr0
+ xvilvh.h xr19, xr1, xr0
+ xvilvl.h xr24, xr3, xr2
+ xvilvh.h xr25, xr3, xr2
+ xvilvl.h xr26, xr5, xr4
+ xvilvh.h xr27, xr5, xr4
+ xvilvl.h xr28, xr7, xr6
+ xvilvh.h xr29, xr7, xr6
+
+ addi.d t1, t1, 16
+ idct_16x32_step2_lasx xr18, xr19, xr24, xr25, xr26, xr27, \
+ xr28, xr29, xr30, xr31
+
+ addi.d t4, a0, 0
+ addi.d t5, a0, \buf_pitch*31*2
+ addi.d t2, sp, 64
+
+ idct_16x32_step3_lasx \round
+
+.rept 15
+
+ addi.d t1, t1, 16
+ idct_16x32_step2_lasx xr8, xr9, xr10, xr11, xr12, xr13, \
+ xr14, xr15, xr16, xr17
+
+ addi.d t1, t1, 16
+ idct_16x32_step2_lasx xr18, xr19, xr24, xr25, xr26, xr27, \
+ xr28, xr29, xr30, xr31
+
+ addi.d t2, t2, 64
+ addi.d t4, t4, \buf_pitch*2
+ addi.d t5, t5, -\buf_pitch*2
+
+ idct_16x32_step3_lasx \round
+.endr
+
+.endm
+
+function hevc_idct_16x32_column_step1_lasx
+ malloc_space 512+512+512
+
+ idct_16x32_lasx 32, 7
+
+ free_space 512+512+512
+endfunc
+
+function hevc_idct_16x32_column_step2_lasx
+ malloc_space 512+512+512
+
+ idct_16x32_lasx 16, 12
+
+ free_space 512+512+512
+endfunc
+
+function hevc_idct_transpose_32x16_to_16x32_lasx
+ fr_store
+
+ xvld xr0, a0, 0
+ xvld xr1, a0, 64
+ xvld xr2, a0, 128
+ xvld xr3, a0, 192
+ xvld xr4, a0, 256
+ xvld xr5, a0, 320
+ xvld xr6, a0, 384
+ xvld xr7, a0, 448
+
+ xvpermi.q xr8, xr0, 0x01
+ xvpermi.q xr9, xr1, 0x01
+ xvpermi.q xr10, xr2, 0x01
+ xvpermi.q xr11, xr3, 0x01
+ xvpermi.q xr12, xr4, 0x01
+ xvpermi.q xr13, xr5, 0x01
+ xvpermi.q xr14, xr6, 0x01
+ xvpermi.q xr15, xr7, 0x01
+
+ LSX_TRANSPOSE8x8_H vr0, vr1, vr2, vr3, vr4, vr5, vr6, vr7, \
+ vr0, vr1, vr2, vr3, vr4, vr5, vr6, vr7, \
+ vr16, vr17, vr18, vr19, vr20, vr21, vr22, vr23
+
+ LSX_TRANSPOSE8x8_H vr8, vr9, vr10, vr11, vr12, vr13, vr14, vr15, \
+ vr8, vr9, vr10, vr11, vr12, vr13, vr14, vr15, \
+ vr16, vr17, vr18, vr19, vr20, vr21, vr22, vr23
+
+ addi.d a0, a0, 512
+
+ vld vr24, a0, 0
+ vld vr25, a0, 64
+ vld vr26, a0, 128
+ vld vr27, a0, 192
+ vld vr28, a0, 256
+ vld vr29, a0, 320
+ vld vr30, a0, 384
+ vld vr31, a0, 448
+
+ LSX_TRANSPOSE8x8_H vr24, vr25, vr26, vr27, vr28, vr29, vr30, vr31, \
+ vr24, vr25, vr26, vr27, vr28, vr29, vr30, vr31, \
+ vr16, vr17, vr18, vr19, vr20, vr21, vr22, vr23
+
+ xvpermi.q xr0, xr24, 0x02
+ xvpermi.q xr1, xr25, 0x02
+ xvpermi.q xr2, xr26, 0x02
+ xvpermi.q xr3, xr27, 0x02
+ xvpermi.q xr4, xr28, 0x02
+ xvpermi.q xr5, xr29, 0x02
+ xvpermi.q xr6, xr30, 0x02
+ xvpermi.q xr7, xr31, 0x02
+
+ xvst xr0, a1, 0
+ xvst xr1, a1, 32
+ xvst xr2, a1, 64
+ xvst xr3, a1, 96
+ xvst xr4, a1, 128
+ xvst xr5, a1, 160
+ xvst xr6, a1, 192
+ xvst xr7, a1, 224
+
+ addi.d a1, a1, 256
+ addi.d a0, a0, 16
+
+ vld vr24, a0, 0
+ vld vr25, a0, 64
+ vld vr26, a0, 128
+ vld vr27, a0, 192
+ vld vr28, a0, 256
+ vld vr29, a0, 320
+ vld vr30, a0, 384
+ vld vr31, a0, 448
+
+ LSX_TRANSPOSE8x8_H vr24, vr25, vr26, vr27, vr28, vr29, vr30, vr31, \
+ vr24, vr25, vr26, vr27, vr28, vr29, vr30, vr31, \
+ vr16, vr17, vr18, vr19, vr20, vr21, vr22, vr23
+
+ xvpermi.q xr8, xr24, 0x02
+ xvpermi.q xr9, xr25, 0x02
+ xvpermi.q xr10, xr26, 0x02
+ xvpermi.q xr11, xr27, 0x02
+ xvpermi.q xr12, xr28, 0x02
+ xvpermi.q xr13, xr29, 0x02
+ xvpermi.q xr14, xr30, 0x02
+ xvpermi.q xr15, xr31, 0x02
+
+ xvst xr8, a1, 0
+ xvst xr9, a1, 32
+ xvst xr10, a1, 64
+ xvst xr11, a1, 96
+ xvst xr12, a1, 128
+ xvst xr13, a1, 160
+ xvst xr14, a1, 192
+ xvst xr15, a1, 224
+
+ // second
+ addi.d a0, a0, 32-512-16
+
+ xvld xr0, a0, 0
+ xvld xr1, a0, 64
+ xvld xr2, a0, 128
+ xvld xr3, a0, 192
+ xvld xr4, a0, 256
+ xvld xr5, a0, 320
+ xvld xr6, a0, 384
+ xvld xr7, a0, 448
+
+ xvpermi.q xr8, xr0, 0x01
+ xvpermi.q xr9, xr1, 0x01
+ xvpermi.q xr10, xr2, 0x01
+ xvpermi.q xr11, xr3, 0x01
+ xvpermi.q xr12, xr4, 0x01
+ xvpermi.q xr13, xr5, 0x01
+ xvpermi.q xr14, xr6, 0x01
+ xvpermi.q xr15, xr7, 0x01
+
+ LSX_TRANSPOSE8x8_H vr0, vr1, vr2, vr3, vr4, vr5, vr6, vr7, \
+ vr0, vr1, vr2, vr3, vr4, vr5, vr6, vr7, \
+ vr16, vr17, vr18, vr19, vr20, vr21, vr22, vr23
+
+ LSX_TRANSPOSE8x8_H vr8, vr9, vr10, vr11, vr12, vr13, vr14, vr15, \
+ vr8, vr9, vr10, vr11, vr12, vr13, vr14, vr15, \
+ vr16, vr17, vr18, vr19, vr20, vr21, vr22, vr23
+
+ addi.d a0, a0, 512
+
+ vld vr24, a0, 0
+ vld vr25, a0, 64
+ vld vr26, a0, 128
+ vld vr27, a0, 192
+ vld vr28, a0, 256
+ vld vr29, a0, 320
+ vld vr30, a0, 384
+ vld vr31, a0, 448
+
+ LSX_TRANSPOSE8x8_H vr24, vr25, vr26, vr27, vr28, vr29, vr30, vr31, \
+ vr24, vr25, vr26, vr27, vr28, vr29, vr30, vr31, \
+ vr16, vr17, vr18, vr19, vr20, vr21, vr22, vr23
+
+ xvpermi.q xr0, xr24, 0x02
+ xvpermi.q xr1, xr25, 0x02
+ xvpermi.q xr2, xr26, 0x02
+ xvpermi.q xr3, xr27, 0x02
+ xvpermi.q xr4, xr28, 0x02
+ xvpermi.q xr5, xr29, 0x02
+ xvpermi.q xr6, xr30, 0x02
+ xvpermi.q xr7, xr31, 0x02
+
+ addi.d a1, a1, 256
+ xvst xr0, a1, 0
+ xvst xr1, a1, 32
+ xvst xr2, a1, 64
+ xvst xr3, a1, 96
+ xvst xr4, a1, 128
+ xvst xr5, a1, 160
+ xvst xr6, a1, 192
+ xvst xr7, a1, 224
+
+ addi.d a1, a1, 256
+ addi.d a0, a0, 16
+
+ vld vr24, a0, 0
+ vld vr25, a0, 64
+ vld vr26, a0, 128
+ vld vr27, a0, 192
+ vld vr28, a0, 256
+ vld vr29, a0, 320
+ vld vr30, a0, 384
+ vld vr31, a0, 448
+
+ LSX_TRANSPOSE8x8_H vr24, vr25, vr26, vr27, vr28, vr29, vr30, vr31, \
+ vr24, vr25, vr26, vr27, vr28, vr29, vr30, vr31, \
+ vr16, vr17, vr18, vr19, vr20, vr21, vr22, vr23
+
+ xvpermi.q xr8, xr24, 0x02
+ xvpermi.q xr9, xr25, 0x02
+ xvpermi.q xr10, xr26, 0x02
+ xvpermi.q xr11, xr27, 0x02
+ xvpermi.q xr12, xr28, 0x02
+ xvpermi.q xr13, xr29, 0x02
+ xvpermi.q xr14, xr30, 0x02
+ xvpermi.q xr15, xr31, 0x02
+
+ xvst xr8, a1, 0
+ xvst xr9, a1, 32
+ xvst xr10, a1, 64
+ xvst xr11, a1, 96
+ xvst xr12, a1, 128
+ xvst xr13, a1, 160
+ xvst xr14, a1, 192
+ xvst xr15, a1, 224
+
+ fr_recover
+endfunc
+
+function hevc_idct_transpose_16x32_to_32x16_lasx
+ fr_store
+
+ xvld xr0, a0, 0
+ xvld xr1, a0, 32
+ xvld xr2, a0, 64
+ xvld xr3, a0, 96
+ xvld xr4, a0, 128
+ xvld xr5, a0, 160
+ xvld xr6, a0, 192
+ xvld xr7, a0, 224
+
+ xvpermi.q xr8, xr0, 0x01
+ xvpermi.q xr9, xr1, 0x01
+ xvpermi.q xr10, xr2, 0x01
+ xvpermi.q xr11, xr3, 0x01
+ xvpermi.q xr12, xr4, 0x01
+ xvpermi.q xr13, xr5, 0x01
+ xvpermi.q xr14, xr6, 0x01
+ xvpermi.q xr15, xr7, 0x01
+
+ LSX_TRANSPOSE8x8_H vr0, vr1, vr2, vr3, vr4, vr5, vr6, vr7, \
+ vr0, vr1, vr2, vr3, vr4, vr5, vr6, vr7, \
+ vr16, vr17, vr18, vr19, vr20, vr21, vr22, vr23
+
+ LSX_TRANSPOSE8x8_H vr8, vr9, vr10, vr11, vr12, vr13, vr14, vr15, \
+ vr8, vr9, vr10, vr11, vr12, vr13, vr14, vr15, \
+ vr16, vr17, vr18, vr19, vr20, vr21, vr22, vr23
+
+ addi.d a0, a0, 256
+
+ vld vr24, a0, 0
+ vld vr25, a0, 32
+ vld vr26, a0, 64
+ vld vr27, a0, 96
+ vld vr28, a0, 128
+ vld vr29, a0, 160
+ vld vr30, a0, 192
+ vld vr31, a0, 224
+
+ LSX_TRANSPOSE8x8_H vr24, vr25, vr26, vr27, vr28, vr29, vr30, vr31, \
+ vr24, vr25, vr26, vr27, vr28, vr29, vr30, vr31, \
+ vr16, vr17, vr18, vr19, vr20, vr21, vr22, vr23
+
+ xvpermi.q xr0, xr24, 0x02
+ xvpermi.q xr1, xr25, 0x02
+ xvpermi.q xr2, xr26, 0x02
+ xvpermi.q xr3, xr27, 0x02
+ xvpermi.q xr4, xr28, 0x02
+ xvpermi.q xr5, xr29, 0x02
+ xvpermi.q xr6, xr30, 0x02
+ xvpermi.q xr7, xr31, 0x02
+
+ xvst xr0, a1, 0
+ xvst xr1, a1, 64
+ xvst xr2, a1, 128
+ xvst xr3, a1, 192
+ xvst xr4, a1, 256
+ xvst xr5, a1, 320
+ xvst xr6, a1, 384
+ xvst xr7, a1, 448
+
+ addi.d a1, a1, 512
+ addi.d a0, a0, 16
+
+ vld vr24, a0, 0
+ vld vr25, a0, 32
+ vld vr26, a0, 64
+ vld vr27, a0, 96
+ vld vr28, a0, 128
+ vld vr29, a0, 160
+ vld vr30, a0, 192
+ vld vr31, a0, 224
+
+ LSX_TRANSPOSE8x8_H vr24, vr25, vr26, vr27, vr28, vr29, vr30, vr31, \
+ vr24, vr25, vr26, vr27, vr28, vr29, vr30, vr31, \
+ vr16, vr17, vr18, vr19, vr20, vr21, vr22, vr23
+
+ xvpermi.q xr8, xr24, 0x02
+ xvpermi.q xr9, xr25, 0x02
+ xvpermi.q xr10, xr26, 0x02
+ xvpermi.q xr11, xr27, 0x02
+ xvpermi.q xr12, xr28, 0x02
+ xvpermi.q xr13, xr29, 0x02
+ xvpermi.q xr14, xr30, 0x02
+ xvpermi.q xr15, xr31, 0x02
+
+ xvst xr8, a1, 0
+ xvst xr9, a1, 64
+ xvst xr10, a1, 128
+ xvst xr11, a1, 192
+ xvst xr12, a1, 256
+ xvst xr13, a1, 320
+ xvst xr14, a1, 384
+ xvst xr15, a1, 448
+
+ // second
+ addi.d a0, a0, 256-16
+
+ xvld xr0, a0, 0
+ xvld xr1, a0, 32
+ xvld xr2, a0, 64
+ xvld xr3, a0, 96
+ xvld xr4, a0, 128
+ xvld xr5, a0, 160
+ xvld xr6, a0, 192
+ xvld xr7, a0, 224
+
+ xvpermi.q xr8, xr0, 0x01
+ xvpermi.q xr9, xr1, 0x01
+ xvpermi.q xr10, xr2, 0x01
+ xvpermi.q xr11, xr3, 0x01
+ xvpermi.q xr12, xr4, 0x01
+ xvpermi.q xr13, xr5, 0x01
+ xvpermi.q xr14, xr6, 0x01
+ xvpermi.q xr15, xr7, 0x01
+
+ LSX_TRANSPOSE8x8_H vr0, vr1, vr2, vr3, vr4, vr5, vr6, vr7, \
+ vr0, vr1, vr2, vr3, vr4, vr5, vr6, vr7, \
+ vr16, vr17, vr18, vr19, vr20, vr21, vr22, vr23
+
+ LSX_TRANSPOSE8x8_H vr8, vr9, vr10, vr11, vr12, vr13, vr14, vr15, \
+ vr8, vr9, vr10, vr11, vr12, vr13, vr14, vr15, \
+ vr16, vr17, vr18, vr19, vr20, vr21, vr22, vr23
+
+ addi.d a0, a0, 256
+
+ vld vr24, a0, 0
+ vld vr25, a0, 32
+ vld vr26, a0, 64
+ vld vr27, a0, 96
+ vld vr28, a0, 128
+ vld vr29, a0, 160
+ vld vr30, a0, 192
+ vld vr31, a0, 224
+
+ LSX_TRANSPOSE8x8_H vr24, vr25, vr26, vr27, vr28, vr29, vr30, vr31, \
+ vr24, vr25, vr26, vr27, vr28, vr29, vr30, vr31, \
+ vr16, vr17, vr18, vr19, vr20, vr21, vr22, vr23
+
+ xvpermi.q xr0, xr24, 0x02
+ xvpermi.q xr1, xr25, 0x02
+ xvpermi.q xr2, xr26, 0x02
+ xvpermi.q xr3, xr27, 0x02
+ xvpermi.q xr4, xr28, 0x02
+ xvpermi.q xr5, xr29, 0x02
+ xvpermi.q xr6, xr30, 0x02
+ xvpermi.q xr7, xr31, 0x02
+
+ addi.d a1, a1, -512+32
+
+ xvst xr0, a1, 0
+ xvst xr1, a1, 64
+ xvst xr2, a1, 128
+ xvst xr3, a1, 192
+ xvst xr4, a1, 256
+ xvst xr5, a1, 320
+ xvst xr6, a1, 384
+ xvst xr7, a1, 448
+
+ addi.d a1, a1, 512
+ addi.d a0, a0, 16
+
+ vld vr24, a0, 0
+ vld vr25, a0, 32
+ vld vr26, a0, 64
+ vld vr27, a0, 96
+ vld vr28, a0, 128
+ vld vr29, a0, 160
+ vld vr30, a0, 192
+ vld vr31, a0, 224
+
+ LSX_TRANSPOSE8x8_H vr24, vr25, vr26, vr27, vr28, vr29, vr30, vr31, \
+ vr24, vr25, vr26, vr27, vr28, vr29, vr30, vr31, \
+ vr16, vr17, vr18, vr19, vr20, vr21, vr22, vr23
+
+ xvpermi.q xr8, xr24, 0x02
+ xvpermi.q xr9, xr25, 0x02
+ xvpermi.q xr10, xr26, 0x02
+ xvpermi.q xr11, xr27, 0x02
+ xvpermi.q xr12, xr28, 0x02
+ xvpermi.q xr13, xr29, 0x02
+ xvpermi.q xr14, xr30, 0x02
+ xvpermi.q xr15, xr31, 0x02
+
+ xvst xr8, a1, 0
+ xvst xr9, a1, 64
+ xvst xr10, a1, 128
+ xvst xr11, a1, 192
+ xvst xr12, a1, 256
+ xvst xr13, a1, 320
+ xvst xr14, a1, 384
+ xvst xr15, a1, 448
+
+ fr_recover
+endfunc
+
+function ff_hevc_idct_32x32_lasx
+
+ addi.d t7, a0, 0
+ addi.d t6, a1, 0
+
+ addi.d sp, sp, -8
+ st.d ra, sp, 0
+
+ bl hevc_idct_16x32_column_step1_lasx
+
+ addi.d a0, a0, 32
+
+ bl hevc_idct_16x32_column_step1_lasx
+
+ malloc_space (16*32+31)*2
+
+ addi.d t8, sp, 64+31*2 // tmp_buf_ptr
+
+ addi.d a0, t7, 0
+ addi.d a1, t8, 0
+ bl hevc_idct_transpose_32x16_to_16x32_lasx
+
+ addi.d a0, t8, 0
+ bl hevc_idct_16x32_column_step2_lasx
+
+ addi.d a0, t8, 0
+ addi.d a1, t7, 0
+ bl hevc_idct_transpose_16x32_to_32x16_lasx
+
+ // second
+ addi.d a0, t7, 32*8*2*2
+ addi.d a1, t8, 0
+ bl hevc_idct_transpose_32x16_to_16x32_lasx
+
+ addi.d a0, t8, 0
+ bl hevc_idct_16x32_column_step2_lasx
+
+ addi.d a0, t8, 0
+ addi.d a1, t7, 32*8*2*2
+ bl hevc_idct_transpose_16x32_to_32x16_lasx
+
+ free_space (16*32+31)*2
+
+ ld.d ra, sp, 0
+ addi.d sp, sp, 8
+
+endfunc
diff --git a/libavcodec/loongarch/hevc_idct_lsx.c b/libavcodec/loongarch/hevc_idct_lsx.c
index 2193b27546..527279d85d 100644
--- a/libavcodec/loongarch/hevc_idct_lsx.c
+++ b/libavcodec/loongarch/hevc_idct_lsx.c
@@ -23,18 +23,18 @@
#include "libavutil/loongarch/loongson_intrinsics.h"
#include "hevcdsp_lsx.h"
-static const int16_t gt8x8_cnst[16] __attribute__ ((aligned (64))) = {
+const int16_t gt8x8_cnst[16] __attribute__ ((aligned (64))) = {
64, 64, 83, 36, 89, 50, 18, 75, 64, -64, 36, -83, 75, -89, -50, -18
};
-static const int16_t gt16x16_cnst[64] __attribute__ ((aligned (64))) = {
+const int16_t gt16x16_cnst[64] __attribute__ ((aligned (64))) = {
64, 83, 64, 36, 89, 75, 50, 18, 90, 80, 57, 25, 70, 87, 9, 43,
64, 36, -64, -83, 75, -18, -89, -50, 87, 9, -80, -70, -43, 57, -25, -90,
64, -36, -64, 83, 50, -89, 18, 75, 80, -70, -25, 90, -87, 9, 43, 57,
64, -83, 64, -36, 18, -50, 75, -89, 70, -87, 90, -80, 9, -43, -57, 25
};
-static const int16_t gt32x32_cnst0[256] __attribute__ ((aligned (64))) = {
+const int16_t gt32x32_cnst0[256] __attribute__ ((aligned (64))) = {
90, 90, 88, 85, 82, 78, 73, 67, 61, 54, 46, 38, 31, 22, 13, 4,
90, 82, 67, 46, 22, -4, -31, -54, -73, -85, -90, -88, -78, -61, -38, -13,
88, 67, 31, -13, -54, -82, -90, -78, -46, -4, 38, 73, 90, 85, 61, 22,
@@ -53,14 +53,14 @@ static const int16_t gt32x32_cnst0[256] __attribute__ ((aligned (64))) = {
4, -13, 22, -31, 38, -46, 54, -61, 67, -73, 78, -82, 85, -88, 90, -90
};
-static const int16_t gt32x32_cnst1[64] __attribute__ ((aligned (64))) = {
+const int16_t gt32x32_cnst1[64] __attribute__ ((aligned (64))) = {
90, 87, 80, 70, 57, 43, 25, 9, 87, 57, 9, -43, -80, -90, -70, -25,
80, 9, -70, -87, -25, 57, 90, 43, 70, -43, -87, 9, 90, 25, -80, -57,
57, -80, -25, 90, -9, -87, 43, 70, 43, -90, 57, 25, -87, 70, 9, -80,
25, -70, 90, -80, 43, 9, -57, 87, 9, -25, 43, -57, 70, -80, 87, -90
};
-static const int16_t gt32x32_cnst2[16] __attribute__ ((aligned (64))) = {
+const int16_t gt32x32_cnst2[16] __attribute__ ((aligned (64))) = {
89, 75, 50, 18, 75, -18, -89, -50, 50, -89, 18, 75, 18, -50, 75, -89
};
diff --git a/libavcodec/loongarch/hevcdsp_init_loongarch.c b/libavcodec/loongarch/hevcdsp_init_loongarch.c
index 2756755733..1585bda276 100644
--- a/libavcodec/loongarch/hevcdsp_init_loongarch.c
+++ b/libavcodec/loongarch/hevcdsp_init_loongarch.c
@@ -360,6 +360,8 @@ void ff_hevc_dsp_init_loongarch(HEVCDSPContext *c, const int bit_depth)
c->put_hevc_epel_bi[7][0][1] = ff_hevc_put_hevc_bi_epel_h32_8_lasx;
c->put_hevc_epel_bi[8][0][1] = ff_hevc_put_hevc_bi_epel_h48_8_lasx;
c->put_hevc_epel_bi[9][0][1] = ff_hevc_put_hevc_bi_epel_h64_8_lasx;
+
+ c->idct[3] = ff_hevc_idct_32x32_lasx;
}
}
}
diff --git a/libavcodec/loongarch/hevcdsp_lasx.h b/libavcodec/loongarch/hevcdsp_lasx.h
index 5db35eed47..714cbf5880 100644
--- a/libavcodec/loongarch/hevcdsp_lasx.h
+++ b/libavcodec/loongarch/hevcdsp_lasx.h
@@ -131,4 +131,6 @@ BI_MC(epel, h, 64);
#undef BI_MC
+void ff_hevc_idct_32x32_lasx(int16_t *coeffs, int col_limit);
+
#endif // #ifndef AVCODEC_LOONGARCH_HEVCDSP_LASX_H
--
2.20.1
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [FFmpeg-devel] [PATCH v2 7/7] avcodec/hevc: Add ff_hevc_idct_32x32_lasx asm opt
2023-12-27 4:50 ` [FFmpeg-devel] [PATCH v2 7/7] avcodec/hevc: Add ff_hevc_idct_32x32_lasx asm opt jinbo
@ 2023-12-28 7:29 ` yinshiyou-hf
0 siblings, 0 replies; 10+ messages in thread
From: yinshiyou-hf @ 2023-12-28 7:29 UTC (permalink / raw)
To: FFmpeg development discussions and patches; +Cc: yuanhecai
> -----Original Message-----
> From: jinbo <jinbo@loongson.cn>
> Sent: 2023-12-27 12:50:19 (Wednesday)
> To: ffmpeg-devel@ffmpeg.org
> Cc: yuanhecai <yuanhecai@loongson.cn>
> Subject: [FFmpeg-devel] [PATCH v2 7/7] avcodec/hevc: Add ff_hevc_idct_32x32_lasx asm opt
> +
> +.macro malloc_space number
> + li.w t0, \number
> + sub.d sp, sp, t0
> + fr_store
> +.endm
> +
> +.macro free_space number
> + fr_recover
> + li.w t0, \number
> + add.d sp, sp, t0
> +.endm
> +
Use subi and addi directly; these two macros are not needed.
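A minimal sketch of what that could look like (my reading of the suggestion,
not part of the submitted patch): the frame sizes used here, 512+512+512 = 1536
and (16*32+31)*2 = 1086, both fit in the signed 12-bit immediate of addi.d
(LoongArch has no subi.d, so the subtraction is an addi.d with a negative
immediate), so the li.w/sub.d and li.w/add.d pairs can each be folded into a
single instruction, e.g. for the 1536-byte frame:

        // prologue: allocate the frame, then save callee-saved FP regs
        addi.d    sp, sp, -1536
        fr_store
        ...
        // epilogue: restore FP regs, then release the frame
        fr_recover
        addi.d    sp, sp, 1536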
This email and its attachments contain confidential information from Loongson Technology , which is intended only for the person or entity whose address is listed above. Any use of the information contained herein in any way (including, but not limited to, total or partial disclosure, reproduction or dissemination) by persons other than the intended recipient(s) is prohibited. If you receive this email in error, please notify the sender by phone or email immediately and delete it.
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
^ permalink raw reply [flat|nested] 10+ messages in thread