* [FFmpeg-devel] [PATCH v1] [loongarch] Add hevc 128-bit & 256-bit asm optimizations
From: jinbo @ 2023-12-22 10:52 UTC (permalink / raw)
To: ffmpeg-devel
Hello, everyone! This series adds HEVC asm optimizations for loongarch; here
is a brief introduction. With all 6 patches applied, the speedup of decoding
H265 4K 30FPS 30Mbps on 3A6000 with 8 threads is about 33% (42fps-->56fps).
Reviews are welcome, thanks in advance.
[PATCH v1 1/6] avcodec/hevc: Add init for sao_edge_filter
[PATCH v1 2/6] avcodec/hevc: Add add_residual_4/8/16/32 asm opt
[PATCH v1 3/6] avcodec/hevc: Add pel_uni_w_pixels4/6/8/12/16/24/32/48/64 asm opt
[PATCH v1 4/6] avcodec/hevc: Add qpel_uni_w_v|h4/6/8/12/16/24/32/48/64 asm opt
[PATCH v1 5/6] avcodec/hevc: Add epel_uni_w_hv4/6/8/12/16/24/32/48/64 asm opt
[PATCH v1 6/6] avcodec/hevc: Add qpel_uni_h,epel_uni_w_h|v,epel_bi_h asm opt
* [FFmpeg-devel] [PATCH v1 1/6] avcodec/hevc: Add init for sao_edge_filter
From: jinbo @ 2023-12-22 10:52 UTC (permalink / raw)
To: ffmpeg-devel; +Cc: jinbo
The init of c->sao_edge_filter[idx] was missing for idx=0/1/2/3.
After this patch, the speedup of decoding H265 4K 30FPS
30Mbps on 3A6000 is about 7% (42fps-->45fps).
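For reference, with the four added assignments all five entries of the
sao_edge_filter table end up pointing at the same 8-bit LSX kernel, which is
equivalent to this loop sketch (illustration only, not part of the patch):

    for (int i = 0; i < 5; i++)
        c->sao_edge_filter[i] = ff_hevc_sao_edge_filter_8_lsx;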
Change-Id: I521999b397fa72b931a23c165cf45f276440cdfb
---
libavcodec/loongarch/hevcdsp_init_loongarch.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/libavcodec/loongarch/hevcdsp_init_loongarch.c b/libavcodec/loongarch/hevcdsp_init_loongarch.c
index 22739c6f5b..5a96f3a4c9 100644
--- a/libavcodec/loongarch/hevcdsp_init_loongarch.c
+++ b/libavcodec/loongarch/hevcdsp_init_loongarch.c
@@ -167,6 +167,10 @@ void ff_hevc_dsp_init_loongarch(HEVCDSPContext *c, const int bit_depth)
c->put_hevc_qpel_uni_w[8][1][1] = ff_hevc_put_hevc_uni_w_qpel_hv48_8_lsx;
c->put_hevc_qpel_uni_w[9][1][1] = ff_hevc_put_hevc_uni_w_qpel_hv64_8_lsx;
+ c->sao_edge_filter[0] = ff_hevc_sao_edge_filter_8_lsx;
+ c->sao_edge_filter[1] = ff_hevc_sao_edge_filter_8_lsx;
+ c->sao_edge_filter[2] = ff_hevc_sao_edge_filter_8_lsx;
+ c->sao_edge_filter[3] = ff_hevc_sao_edge_filter_8_lsx;
c->sao_edge_filter[4] = ff_hevc_sao_edge_filter_8_lsx;
c->hevc_h_loop_filter_luma = ff_hevc_loop_filter_luma_h_8_lsx;
--
2.20.1
* [FFmpeg-devel] [PATCH v1 2/6] avcodec/hevc: Add add_residual_4/8/16/32 asm opt
From: jinbo @ 2023-12-22 10:52 UTC (permalink / raw)
To: ffmpeg-devel; +Cc: jinbo
After this patch, the performance of decoding H265 4K 30FPS 30Mbps
on 3A6000 with 8 threads improves by 2fps (45fps-->47fps).
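For context, a minimal C sketch of the scalar operation that each of the new
add_residual_4/8/16/32 LSX kernels vectorizes, assuming the generic 8-bit
add_residual semantics (illustration only, not code from the patch):

    /* Add the int16_t residual to the uint8_t destination block and
     * clip the result to [0, 255]; 'size' is 4, 8, 16 or 32. */
    static void add_residual_8_c(uint8_t *dst, const int16_t *res,
                                 ptrdiff_t stride, int size)
    {
        for (int y = 0; y < size; y++) {
            for (int x = 0; x < size; x++) {
                int v = dst[x] + res[x];
                dst[x] = v < 0 ? 0 : (v > 255 ? 255 : v);
            }
            dst += stride;
            res += size;
        }
    }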
---
libavcodec/loongarch/Makefile | 3 +-
libavcodec/loongarch/hevc_add_res.S | 162 ++++++++++++++++++
libavcodec/loongarch/hevcdsp_init_loongarch.c | 5 +
libavcodec/loongarch/hevcdsp_lsx.h | 5 +
4 files changed, 174 insertions(+), 1 deletion(-)
create mode 100644 libavcodec/loongarch/hevc_add_res.S
diff --git a/libavcodec/loongarch/Makefile b/libavcodec/loongarch/Makefile
index 06cfab5c20..07ea97f803 100644
--- a/libavcodec/loongarch/Makefile
+++ b/libavcodec/loongarch/Makefile
@@ -27,7 +27,8 @@ LSX-OBJS-$(CONFIG_HEVC_DECODER) += loongarch/hevcdsp_lsx.o \
loongarch/hevc_lpf_sao_lsx.o \
loongarch/hevc_mc_bi_lsx.o \
loongarch/hevc_mc_uni_lsx.o \
- loongarch/hevc_mc_uniw_lsx.o
+ loongarch/hevc_mc_uniw_lsx.o \
+ loongarch/hevc_add_res.o
LSX-OBJS-$(CONFIG_H264DSP) += loongarch/h264idct.o \
loongarch/h264idct_loongarch.o \
loongarch/h264dsp.o
diff --git a/libavcodec/loongarch/hevc_add_res.S b/libavcodec/loongarch/hevc_add_res.S
new file mode 100644
index 0000000000..dd2d820af8
--- /dev/null
+++ b/libavcodec/loongarch/hevc_add_res.S
@@ -0,0 +1,162 @@
+/*
+ * Loongson LSX optimized add_residual functions for HEVC decoding
+ *
+ * Copyright (c) 2023 Loongson Technology Corporation Limited
+ * Contributed by jinbo <jinbo@loongson.cn>
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "loongson_asm.S"
+
+/*
+ * void ff_hevc_add_residual4x4_8_lsx(uint8_t *dst, const int16_t *res, ptrdiff_t stride)
+ */
+.macro ADD_RES_LSX_4x4_8
+ vldrepl.w vr0, a0, 0
+ add.d t0, a0, a2
+ vldrepl.w vr1, t0, 0
+ vld vr2, a1, 0
+
+ vilvl.w vr1, vr1, vr0
+ vsllwil.hu.bu vr1, vr1, 0
+ vadd.h vr1, vr1, vr2
+ vssrani.bu.h vr1, vr1, 0
+
+ vstelm.w vr1, a0, 0, 0
+ vstelm.w vr1, t0, 0, 1
+.endm
+
+function ff_hevc_add_residual4x4_8_lsx
+ ADD_RES_LSX_4x4_8
+ alsl.d a0, a2, a0, 1
+ addi.d a1, a1, 16
+ ADD_RES_LSX_4x4_8
+endfunc
+
+/*
+ * void ff_hevc_add_residual8x8_8_lsx(uint8_t *dst, const int16_t *res, ptrdiff_t stride)
+ */
+.macro ADD_RES_LSX_8x8_8
+ vldrepl.d vr0, a0, 0
+ add.d t0, a0, a2
+ vldrepl.d vr1, t0, 0
+ add.d t1, t0, a2
+ vldrepl.d vr2, t1, 0
+ add.d t2, t1, a2
+ vldrepl.d vr3, t2, 0
+
+ vld vr4, a1, 0
+ addi.d t3, zero, 16
+ vldx vr5, a1, t3
+ addi.d t4, a1, 32
+ vld vr6, t4, 0
+ vldx vr7, t4, t3
+
+ vsllwil.hu.bu vr0, vr0, 0
+ vsllwil.hu.bu vr1, vr1, 0
+ vsllwil.hu.bu vr2, vr2, 0
+ vsllwil.hu.bu vr3, vr3, 0
+ vadd.h vr0, vr0, vr4
+ vadd.h vr1, vr1, vr5
+ vadd.h vr2, vr2, vr6
+ vadd.h vr3, vr3, vr7
+ vssrani.bu.h vr1, vr0, 0
+ vssrani.bu.h vr3, vr2, 0
+
+ vstelm.d vr1, a0, 0, 0
+ vstelm.d vr1, t0, 0, 1
+ vstelm.d vr3, t1, 0, 0
+ vstelm.d vr3, t2, 0, 1
+.endm
+
+function ff_hevc_add_residual8x8_8_lsx
+ ADD_RES_LSX_8x8_8
+ alsl.d a0, a2, a0, 2
+ addi.d a1, a1, 64
+ ADD_RES_LSX_8x8_8
+endfunc
+
+/*
+ * void ff_hevc_add_residual16x16_8_lsx(uint8_t *dst, const int16_t *res, ptrdiff_t stride)
+ */
+function ff_hevc_add_residual16x16_8_lsx
+.rept 8
+ vld vr0, a0, 0
+ vldx vr2, a0, a2
+
+ vld vr4, a1, 0
+ addi.d t0, zero, 16
+ vldx vr5, a1, t0
+ addi.d t1, a1, 32
+ vld vr6, t1, 0
+ vldx vr7, t1, t0
+
+ vexth.hu.bu vr1, vr0
+ vsllwil.hu.bu vr0, vr0, 0
+ vexth.hu.bu vr3, vr2
+ vsllwil.hu.bu vr2, vr2, 0
+ vadd.h vr0, vr0, vr4
+ vadd.h vr1, vr1, vr5
+ vadd.h vr2, vr2, vr6
+ vadd.h vr3, vr3, vr7
+
+ vssrani.bu.h vr1, vr0, 0
+ vssrani.bu.h vr3, vr2, 0
+
+ vst vr1, a0, 0
+ vstx vr3, a0, a2
+
+ alsl.d a0, a2, a0, 1
+ addi.d a1, a1, 64
+.endr
+endfunc
+
+/*
+ * void ff_hevc_add_residual32x32_8_lsx(uint8_t *dst, const int16_t *res, ptrdiff_t stride)
+ */
+function ff_hevc_add_residual32x32_8_lsx
+.rept 32
+ vld vr0, a0, 0
+ addi.w t0, zero, 16
+ vldx vr2, a0, t0
+
+ vld vr4, a1, 0
+ vldx vr5, a1, t0
+ addi.d t1, a1, 32
+ vld vr6, t1, 0
+ vldx vr7, t1, t0
+
+ vexth.hu.bu vr1, vr0
+ vsllwil.hu.bu vr0, vr0, 0
+ vexth.hu.bu vr3, vr2
+ vsllwil.hu.bu vr2, vr2, 0
+ vadd.h vr0, vr0, vr4
+ vadd.h vr1, vr1, vr5
+ vadd.h vr2, vr2, vr6
+ vadd.h vr3, vr3, vr7
+
+ vssrani.bu.h vr1, vr0, 0
+ vssrani.bu.h vr3, vr2, 0
+
+ vst vr1, a0, 0
+ vstx vr3, a0, t0
+
+ add.d a0, a0, a2
+ addi.d a1, a1, 64
+.endr
+endfunc
diff --git a/libavcodec/loongarch/hevcdsp_init_loongarch.c b/libavcodec/loongarch/hevcdsp_init_loongarch.c
index 5a96f3a4c9..a8f753dc86 100644
--- a/libavcodec/loongarch/hevcdsp_init_loongarch.c
+++ b/libavcodec/loongarch/hevcdsp_init_loongarch.c
@@ -189,6 +189,11 @@ void ff_hevc_dsp_init_loongarch(HEVCDSPContext *c, const int bit_depth)
c->idct[1] = ff_hevc_idct_8x8_lsx;
c->idct[2] = ff_hevc_idct_16x16_lsx;
c->idct[3] = ff_hevc_idct_32x32_lsx;
+
+ c->add_residual[0] = ff_hevc_add_residual4x4_8_lsx;
+ c->add_residual[1] = ff_hevc_add_residual8x8_8_lsx;
+ c->add_residual[2] = ff_hevc_add_residual16x16_8_lsx;
+ c->add_residual[3] = ff_hevc_add_residual32x32_8_lsx;
}
}
}
diff --git a/libavcodec/loongarch/hevcdsp_lsx.h b/libavcodec/loongarch/hevcdsp_lsx.h
index 0d54196caf..ac509984fd 100644
--- a/libavcodec/loongarch/hevcdsp_lsx.h
+++ b/libavcodec/loongarch/hevcdsp_lsx.h
@@ -227,4 +227,9 @@ void ff_hevc_idct_8x8_lsx(int16_t *coeffs, int col_limit);
void ff_hevc_idct_16x16_lsx(int16_t *coeffs, int col_limit);
void ff_hevc_idct_32x32_lsx(int16_t *coeffs, int col_limit);
+void ff_hevc_add_residual4x4_8_lsx(uint8_t *dst, const int16_t *res, ptrdiff_t stride);
+void ff_hevc_add_residual8x8_8_lsx(uint8_t *dst, const int16_t *res, ptrdiff_t stride);
+void ff_hevc_add_residual16x16_8_lsx(uint8_t *dst, const int16_t *res, ptrdiff_t stride);
+void ff_hevc_add_residual32x32_8_lsx(uint8_t *dst, const int16_t *res, ptrdiff_t stride);
+
#endif // #ifndef AVCODEC_LOONGARCH_HEVCDSP_LSX_H
--
2.20.1
* [FFmpeg-devel] [PATCH v1 3/6] avcodec/hevc: Add pel_uni_w_pixels4/6/8/12/16/24/32/48/64 asm opt
From: jinbo @ 2023-12-22 10:52 UTC (permalink / raw)
To: ffmpeg-devel; +Cc: jinbo
tests/checkasm/checkasm:              C     LSX    LASX
put_hevc_pel_uni_w_pixels4_8_c:     2.7     1.0
put_hevc_pel_uni_w_pixels6_8_c:     6.2     2.0     1.5
put_hevc_pel_uni_w_pixels8_8_c:    10.7     2.5     1.7
put_hevc_pel_uni_w_pixels12_8_c:   23.0     5.5     5.0
put_hevc_pel_uni_w_pixels16_8_c:   41.0     8.2     5.0
put_hevc_pel_uni_w_pixels24_8_c:   91.0    19.7    13.2
put_hevc_pel_uni_w_pixels32_8_c:  161.7    32.5    16.2
put_hevc_pel_uni_w_pixels48_8_c:  354.5    73.7    43.0
put_hevc_pel_uni_w_pixels64_8_c:  641.5   130.0    64.2
Speedup of decoding H265 4K 30FPS 30Mbps on 3A6000 with
8 threads is 1fps (47fps-->48fps).
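For background, a minimal C sketch of the per-pixel weighted uni-prediction
these pel_uni_w_pixels kernels compute for 8-bit input, following the
shift/offset setup in the LOAD_VAR macro (a sketch of the assumed semantics,
not code from the patch):

    /* shift = denom + 6 and offset = 1 << (shift - 1) for 8-bit depth */
    static uint8_t pel_uni_w_pixel_8(uint8_t src, int denom, int wx, int ox)
    {
        int shift  = denom + 6;
        int offset = 1 << (shift - 1);
        int v = ((((int)src << 6) * wx + offset) >> shift) + ox;
        return v < 0 ? 0 : (v > 255 ? 255 : v);
    }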
---
libavcodec/loongarch/Makefile | 3 +-
libavcodec/loongarch/hevc_mc.S | 471 ++++++++++++++++++
libavcodec/loongarch/hevcdsp_init_loongarch.c | 43 ++
libavcodec/loongarch/hevcdsp_lasx.h | 53 ++
libavcodec/loongarch/hevcdsp_lsx.h | 27 +
5 files changed, 596 insertions(+), 1 deletion(-)
create mode 100644 libavcodec/loongarch/hevc_mc.S
create mode 100644 libavcodec/loongarch/hevcdsp_lasx.h
diff --git a/libavcodec/loongarch/Makefile b/libavcodec/loongarch/Makefile
index 07ea97f803..ad98cd4054 100644
--- a/libavcodec/loongarch/Makefile
+++ b/libavcodec/loongarch/Makefile
@@ -28,7 +28,8 @@ LSX-OBJS-$(CONFIG_HEVC_DECODER) += loongarch/hevcdsp_lsx.o \
loongarch/hevc_mc_bi_lsx.o \
loongarch/hevc_mc_uni_lsx.o \
loongarch/hevc_mc_uniw_lsx.o \
- loongarch/hevc_add_res.o
+ loongarch/hevc_add_res.o \
+ loongarch/hevc_mc.o
LSX-OBJS-$(CONFIG_H264DSP) += loongarch/h264idct.o \
loongarch/h264idct_loongarch.o \
loongarch/h264dsp.o
diff --git a/libavcodec/loongarch/hevc_mc.S b/libavcodec/loongarch/hevc_mc.S
new file mode 100644
index 0000000000..c5d553effe
--- /dev/null
+++ b/libavcodec/loongarch/hevc_mc.S
@@ -0,0 +1,471 @@
+/*
+ * Copyright (c) 2023 Loongson Technology Corporation Limited
+ * Contributed by jinbo <jinbo@loongson.cn>
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "loongson_asm.S"
+
+.macro LOAD_VAR bit
+ addi.w t1, a5, 6 //shift
+ addi.w t3, zero, 1 //one
+ sub.w t4, t1, t3
+ sll.w t3, t3, t4 //offset
+.if \bit == 128
+ vreplgr2vr.w vr1, a6 //wx
+ vreplgr2vr.w vr2, t3 //offset
+ vreplgr2vr.w vr3, t1 //shift
+ vreplgr2vr.w vr4, a7 //ox
+.else
+ xvreplgr2vr.w xr1, a6
+ xvreplgr2vr.w xr2, t3
+ xvreplgr2vr.w xr3, t1
+ xvreplgr2vr.w xr4, a7
+.endif
+.endm
+
+.macro HEVC_PEL_UNI_W_PIXELS8_LSX src0, dst0, w
+ vldrepl.d vr0, \src0, 0
+ vsllwil.hu.bu vr0, vr0, 0
+ vexth.wu.hu vr5, vr0
+ vsllwil.wu.hu vr0, vr0, 0
+ vslli.w vr0, vr0, 6
+ vslli.w vr5, vr5, 6
+ vmul.w vr0, vr0, vr1
+ vmul.w vr5, vr5, vr1
+ vadd.w vr0, vr0, vr2
+ vadd.w vr5, vr5, vr2
+ vsra.w vr0, vr0, vr3
+ vsra.w vr5, vr5, vr3
+ vadd.w vr0, vr0, vr4
+ vadd.w vr5, vr5, vr4
+ vssrani.h.w vr5, vr0, 0
+ vssrani.bu.h vr5, vr5, 0
+.if \w == 6
+ fst.s f5, \dst0, 0
+ vstelm.h vr5, \dst0, 4, 2
+.else
+ fst.d f5, \dst0, 0
+.endif
+.endm
+
+.macro HEVC_PEL_UNI_W_PIXELS8x2_LASX src0, dst0, w
+ vldrepl.d vr0, \src0, 0
+ add.d t2, \src0, a3
+ vldrepl.d vr5, t2, 0
+ xvpermi.q xr0, xr5, 0x02
+ xvsllwil.hu.bu xr0, xr0, 0
+ xvexth.wu.hu xr5, xr0
+ xvsllwil.wu.hu xr0, xr0, 0
+ xvslli.w xr0, xr0, 6
+ xvslli.w xr5, xr5, 6
+ xvmul.w xr0, xr0, xr1
+ xvmul.w xr5, xr5, xr1
+ xvadd.w xr0, xr0, xr2
+ xvadd.w xr5, xr5, xr2
+ xvsra.w xr0, xr0, xr3
+ xvsra.w xr5, xr5, xr3
+ xvadd.w xr0, xr0, xr4
+ xvadd.w xr5, xr5, xr4
+ xvssrani.h.w xr5, xr0, 0
+ xvpermi.q xr0, xr5, 0x01
+ xvssrani.bu.h xr0, xr5, 0
+ add.d t3, \dst0, a1
+.if \w == 6
+ vstelm.w vr0, \dst0, 0, 0
+ vstelm.h vr0, \dst0, 4, 2
+ vstelm.w vr0, t3, 0, 2
+ vstelm.h vr0, t3, 4, 6
+.else
+ vstelm.d vr0, \dst0, 0, 0
+ vstelm.d vr0, t3, 0, 1
+.endif
+.endm
+
+.macro HEVC_PEL_UNI_W_PIXELS16_LSX src0, dst0
+ vld vr0, \src0, 0
+ vexth.hu.bu vr7, vr0
+ vexth.wu.hu vr8, vr7
+ vsllwil.wu.hu vr7, vr7, 0
+ vsllwil.hu.bu vr5, vr0, 0
+ vexth.wu.hu vr6, vr5
+ vsllwil.wu.hu vr5, vr5, 0
+ vslli.w vr5, vr5, 6
+ vslli.w vr6, vr6, 6
+ vslli.w vr7, vr7, 6
+ vslli.w vr8, vr8, 6
+ vmul.w vr5, vr5, vr1
+ vmul.w vr6, vr6, vr1
+ vmul.w vr7, vr7, vr1
+ vmul.w vr8, vr8, vr1
+ vadd.w vr5, vr5, vr2
+ vadd.w vr6, vr6, vr2
+ vadd.w vr7, vr7, vr2
+ vadd.w vr8, vr8, vr2
+ vsra.w vr5, vr5, vr3
+ vsra.w vr6, vr6, vr3
+ vsra.w vr7, vr7, vr3
+ vsra.w vr8, vr8, vr3
+ vadd.w vr5, vr5, vr4
+ vadd.w vr6, vr6, vr4
+ vadd.w vr7, vr7, vr4
+ vadd.w vr8, vr8, vr4
+ vssrani.h.w vr6, vr5, 0
+ vssrani.h.w vr8, vr7, 0
+ vssrani.bu.h vr8, vr6, 0
+ vst vr8, \dst0, 0
+.endm
+
+.macro HEVC_PEL_UNI_W_PIXELS16_LASX src0, dst0
+ vld vr0, \src0, 0
+ xvpermi.d xr0, xr0, 0xd8
+ xvsllwil.hu.bu xr0, xr0, 0
+ xvexth.wu.hu xr6, xr0
+ xvsllwil.wu.hu xr5, xr0, 0
+ xvslli.w xr5, xr5, 6
+ xvslli.w xr6, xr6, 6
+ xvmul.w xr5, xr5, xr1
+ xvmul.w xr6, xr6, xr1
+ xvadd.w xr5, xr5, xr2
+ xvadd.w xr6, xr6, xr2
+ xvsra.w xr5, xr5, xr3
+ xvsra.w xr6, xr6, xr3
+ xvadd.w xr5, xr5, xr4
+ xvadd.w xr6, xr6, xr4
+ xvssrani.h.w xr6, xr5, 0
+ xvpermi.q xr7, xr6, 0x01
+ xvssrani.bu.h xr7, xr6, 0
+ vst vr7, \dst0, 0
+.endm
+
+.macro HEVC_PEL_UNI_W_PIXELS32_LASX src0, dst0, w
+.if \w == 16
+ vld vr0, \src0, 0
+ add.d t2, \src0, a3
+ vld vr5, t2, 0
+ xvpermi.q xr0, xr5, 0x02
+.else //w=24/32
+ xvld xr0, \src0, 0
+.endif
+ xvexth.hu.bu xr7, xr0
+ xvexth.wu.hu xr8, xr7
+ xvsllwil.wu.hu xr7, xr7, 0
+ xvsllwil.hu.bu xr5, xr0, 0
+ xvexth.wu.hu xr6, xr5
+ xvsllwil.wu.hu xr5, xr5, 0
+ xvslli.w xr5, xr5, 6
+ xvslli.w xr6, xr6, 6
+ xvslli.w xr7, xr7, 6
+ xvslli.w xr8, xr8, 6
+ xvmul.w xr5, xr5, xr1
+ xvmul.w xr6, xr6, xr1
+ xvmul.w xr7, xr7, xr1
+ xvmul.w xr8, xr8, xr1
+ xvadd.w xr5, xr5, xr2
+ xvadd.w xr6, xr6, xr2
+ xvadd.w xr7, xr7, xr2
+ xvadd.w xr8, xr8, xr2
+ xvsra.w xr5, xr5, xr3
+ xvsra.w xr6, xr6, xr3
+ xvsra.w xr7, xr7, xr3
+ xvsra.w xr8, xr8, xr3
+ xvadd.w xr5, xr5, xr4
+ xvadd.w xr6, xr6, xr4
+ xvadd.w xr7, xr7, xr4
+ xvadd.w xr8, xr8, xr4
+ xvssrani.h.w xr6, xr5, 0
+ xvssrani.h.w xr8, xr7, 0
+ xvssrani.bu.h xr8, xr6, 0
+.if \w == 16
+ vst vr8, \dst0, 0
+ add.d t2, \dst0, a1
+ xvpermi.q xr8, xr8, 0x01
+ vst vr8, t2, 0
+.elseif \w == 24
+ vst vr8, \dst0, 0
+ xvstelm.d xr8, \dst0, 16, 2
+.else
+ xvst xr8, \dst0, 0
+.endif
+.endm
+
+function ff_hevc_put_hevc_pel_uni_w_pixels4_8_lsx
+ LOAD_VAR 128
+ srli.w t0, a4, 1
+.LOOP_PIXELS4:
+ vldrepl.w vr0, a2, 0
+ add.d t1, a2, a3
+ vldrepl.w vr5, t1, 0
+ vsllwil.hu.bu vr0, vr0, 0
+ vsllwil.wu.hu vr0, vr0, 0
+ vsllwil.hu.bu vr5, vr5, 0
+ vsllwil.wu.hu vr5, vr5, 0
+ vslli.w vr0, vr0, 6
+ vslli.w vr5, vr5, 6
+ vmul.w vr0, vr0, vr1
+ vmul.w vr5, vr5, vr1
+ vadd.w vr0, vr0, vr2
+ vadd.w vr5, vr5, vr2
+ vsra.w vr0, vr0, vr3
+ vsra.w vr5, vr5, vr3
+ vadd.w vr0, vr0, vr4
+ vadd.w vr5, vr5, vr4
+ vssrani.h.w vr5, vr0, 0
+ vssrani.bu.h vr5, vr5, 0
+ fst.s f5, a0, 0
+ add.d t2, a0, a1
+ vstelm.w vr5, t2, 0, 1
+ alsl.d a2, a3, a2, 1
+ alsl.d a0, a1, a0, 1
+ addi.w t0, t0, -1
+ bnez t0, .LOOP_PIXELS4
+endfunc
+
+function ff_hevc_put_hevc_pel_uni_w_pixels6_8_lsx
+ LOAD_VAR 128
+.LOOP_PIXELS6:
+ HEVC_PEL_UNI_W_PIXELS8_LSX a2, a0, 6
+ add.d a2, a2, a3
+ add.d a0, a0, a1
+ addi.w a4, a4, -1
+ bnez a4, .LOOP_PIXELS6
+endfunc
+
+function ff_hevc_put_hevc_pel_uni_w_pixels6_8_lasx
+ LOAD_VAR 256
+ srli.w t0, a4, 1
+.LOOP_PIXELS6_LASX:
+ HEVC_PEL_UNI_W_PIXELS8x2_LASX a2, a0, 6
+ alsl.d a2, a3, a2, 1
+ alsl.d a0, a1, a0, 1
+ addi.w t0, t0, -1
+ bnez t0, .LOOP_PIXELS6_LASX
+endfunc
+
+function ff_hevc_put_hevc_pel_uni_w_pixels8_8_lsx
+ LOAD_VAR 128
+.LOOP_PIXELS8:
+ HEVC_PEL_UNI_W_PIXELS8_LSX a2, a0, 8
+ add.d a2, a2, a3
+ add.d a0, a0, a1
+ addi.w a4, a4, -1
+ bnez a4, .LOOP_PIXELS8
+endfunc
+
+function ff_hevc_put_hevc_pel_uni_w_pixels8_8_lasx
+ LOAD_VAR 256
+ srli.w t0, a4, 1
+.LOOP_PIXELS8_LASX:
+ HEVC_PEL_UNI_W_PIXELS8x2_LASX a2, a0, 8
+ alsl.d a2, a3, a2, 1
+ alsl.d a0, a1, a0, 1
+ addi.w t0, t0, -1
+ bnez t0, .LOOP_PIXELS8_LASX
+endfunc
+
+function ff_hevc_put_hevc_pel_uni_w_pixels12_8_lsx
+ LOAD_VAR 128
+.LOOP_PIXELS12:
+ vld vr0, a2, 0
+ vexth.hu.bu vr7, vr0
+ vsllwil.wu.hu vr7, vr7, 0
+ vsllwil.hu.bu vr5, vr0, 0
+ vexth.wu.hu vr6, vr5
+ vsllwil.wu.hu vr5, vr5, 0
+ vslli.w vr5, vr5, 6
+ vslli.w vr6, vr6, 6
+ vslli.w vr7, vr7, 6
+ vmul.w vr5, vr5, vr1
+ vmul.w vr6, vr6, vr1
+ vmul.w vr7, vr7, vr1
+ vadd.w vr5, vr5, vr2
+ vadd.w vr6, vr6, vr2
+ vadd.w vr7, vr7, vr2
+ vsra.w vr5, vr5, vr3
+ vsra.w vr6, vr6, vr3
+ vsra.w vr7, vr7, vr3
+ vadd.w vr5, vr5, vr4
+ vadd.w vr6, vr6, vr4
+ vadd.w vr7, vr7, vr4
+ vssrani.h.w vr6, vr5, 0
+ vssrani.h.w vr7, vr7, 0
+ vssrani.bu.h vr7, vr6, 0
+ fst.d f7, a0, 0
+ vstelm.w vr7, a0, 8, 2
+ add.d a2, a2, a3
+ add.d a0, a0, a1
+ addi.w a4, a4, -1
+ bnez a4, .LOOP_PIXELS12
+endfunc
+
+function ff_hevc_put_hevc_pel_uni_w_pixels12_8_lasx
+ LOAD_VAR 256
+.LOOP_PIXELS12_LASX:
+ vld vr0, a2, 0
+ xvpermi.d xr0, xr0, 0xd8
+ xvsllwil.hu.bu xr0, xr0, 0
+ xvexth.wu.hu xr6, xr0
+ xvsllwil.wu.hu xr5, xr0, 0
+ xvslli.w xr5, xr5, 6
+ xvslli.w xr6, xr6, 6
+ xvmul.w xr5, xr5, xr1
+ xvmul.w xr6, xr6, xr1
+ xvadd.w xr5, xr5, xr2
+ xvadd.w xr6, xr6, xr2
+ xvsra.w xr5, xr5, xr3
+ xvsra.w xr6, xr6, xr3
+ xvadd.w xr5, xr5, xr4
+ xvadd.w xr6, xr6, xr4
+ xvssrani.h.w xr6, xr5, 0
+ xvpermi.q xr7, xr6, 0x01
+ xvssrani.bu.h xr7, xr6, 0
+ fst.d f7, a0, 0
+ vstelm.w vr7, a0, 8, 2
+ add.d a2, a2, a3
+ add.d a0, a0, a1
+ addi.w a4, a4, -1
+ bnez a4, .LOOP_PIXELS12_LASX
+endfunc
+
+function ff_hevc_put_hevc_pel_uni_w_pixels16_8_lsx
+ LOAD_VAR 128
+.LOOP_PIXELS16:
+ HEVC_PEL_UNI_W_PIXELS16_LSX a2, a0
+ add.d a2, a2, a3
+ add.d a0, a0, a1
+ addi.w a4, a4, -1
+ bnez a4, .LOOP_PIXELS16
+endfunc
+
+function ff_hevc_put_hevc_pel_uni_w_pixels16_8_lasx
+ LOAD_VAR 256
+ srli.w t0, a4, 1
+.LOOP_PIXELS16_LASX:
+ HEVC_PEL_UNI_W_PIXELS32_LASX a2, a0, 16
+ alsl.d a2, a3, a2, 1
+ alsl.d a0, a1, a0, 1
+ addi.w t0, t0, -1
+ bnez t0, .LOOP_PIXELS16_LASX
+endfunc
+
+function ff_hevc_put_hevc_pel_uni_w_pixels24_8_lsx
+ LOAD_VAR 128
+.LOOP_PIXELS24:
+ HEVC_PEL_UNI_W_PIXELS16_LSX a2, a0
+ addi.d t0, a2, 16
+ addi.d t1, a0, 16
+ HEVC_PEL_UNI_W_PIXELS8_LSX t0, t1, 8
+ add.d a2, a2, a3
+ add.d a0, a0, a1
+ addi.w a4, a4, -1
+ bnez a4, .LOOP_PIXELS24
+endfunc
+
+function ff_hevc_put_hevc_pel_uni_w_pixels24_8_lasx
+ LOAD_VAR 256
+.LOOP_PIXELS24_LASX:
+ HEVC_PEL_UNI_W_PIXELS32_LASX a2, a0, 24
+ add.d a2, a2, a3
+ add.d a0, a0, a1
+ addi.w a4, a4, -1
+ bnez a4, .LOOP_PIXELS24_LASX
+endfunc
+
+function ff_hevc_put_hevc_pel_uni_w_pixels32_8_lsx
+ LOAD_VAR 128
+.LOOP_PIXELS32:
+ HEVC_PEL_UNI_W_PIXELS16_LSX a2, a0
+ addi.d t0, a2, 16
+ addi.d t1, a0, 16
+ HEVC_PEL_UNI_W_PIXELS16_LSX t0, t1
+ add.d a2, a2, a3
+ add.d a0, a0, a1
+ addi.w a4, a4, -1
+ bnez a4, .LOOP_PIXELS32
+endfunc
+
+function ff_hevc_put_hevc_pel_uni_w_pixels32_8_lasx
+ LOAD_VAR 256
+.LOOP_PIXELS32_LASX:
+ HEVC_PEL_UNI_W_PIXELS32_LASX a2, a0, 32
+ add.d a2, a2, a3
+ add.d a0, a0, a1
+ addi.w a4, a4, -1
+ bnez a4, .LOOP_PIXELS32_LASX
+endfunc
+
+function ff_hevc_put_hevc_pel_uni_w_pixels48_8_lsx
+ LOAD_VAR 128
+.LOOP_PIXELS48:
+ HEVC_PEL_UNI_W_PIXELS16_LSX a2, a0
+ addi.d t0, a2, 16
+ addi.d t1, a0, 16
+ HEVC_PEL_UNI_W_PIXELS16_LSX t0, t1
+ addi.d t0, a2, 32
+ addi.d t1, a0, 32
+ HEVC_PEL_UNI_W_PIXELS16_LSX t0, t1
+ add.d a2, a2, a3
+ add.d a0, a0, a1
+ addi.w a4, a4, -1
+ bnez a4, .LOOP_PIXELS48
+endfunc
+
+function ff_hevc_put_hevc_pel_uni_w_pixels48_8_lasx
+ LOAD_VAR 256
+.LOOP_PIXELS48_LASX:
+ HEVC_PEL_UNI_W_PIXELS32_LASX a2, a0, 32
+ addi.d t0, a2, 32
+ addi.d t1, a0, 32
+ HEVC_PEL_UNI_W_PIXELS16_LASX t0, t1
+ add.d a2, a2, a3
+ add.d a0, a0, a1
+ addi.w a4, a4, -1
+ bnez a4, .LOOP_PIXELS48_LASX
+endfunc
+
+function ff_hevc_put_hevc_pel_uni_w_pixels64_8_lsx
+ LOAD_VAR 128
+.LOOP_PIXELS64:
+ HEVC_PEL_UNI_W_PIXELS16_LSX a2, a0
+ addi.d t0, a2, 16
+ addi.d t1, a0, 16
+ HEVC_PEL_UNI_W_PIXELS16_LSX t0, t1
+ addi.d t0, a2, 32
+ addi.d t1, a0, 32
+ HEVC_PEL_UNI_W_PIXELS16_LSX t0, t1
+ addi.d t0, a2, 48
+ addi.d t1, a0, 48
+ HEVC_PEL_UNI_W_PIXELS16_LSX t0, t1
+ add.d a2, a2, a3
+ add.d a0, a0, a1
+ addi.w a4, a4, -1
+ bnez a4, .LOOP_PIXELS64
+endfunc
+
+function ff_hevc_put_hevc_pel_uni_w_pixels64_8_lasx
+ LOAD_VAR 256
+.LOOP_PIXELS64_LASX:
+ HEVC_PEL_UNI_W_PIXELS32_LASX a2, a0, 32
+ addi.d t0, a2, 32
+ addi.d t1, a0, 32
+ HEVC_PEL_UNI_W_PIXELS32_LASX t0, t1, 32
+ add.d a2, a2, a3
+ add.d a0, a0, a1
+ addi.w a4, a4, -1
+ bnez a4, .LOOP_PIXELS64_LASX
+endfunc
diff --git a/libavcodec/loongarch/hevcdsp_init_loongarch.c b/libavcodec/loongarch/hevcdsp_init_loongarch.c
index a8f753dc86..d0ee99d6b5 100644
--- a/libavcodec/loongarch/hevcdsp_init_loongarch.c
+++ b/libavcodec/loongarch/hevcdsp_init_loongarch.c
@@ -22,6 +22,7 @@
#include "libavutil/loongarch/cpu.h"
#include "hevcdsp_lsx.h"
+#include "hevcdsp_lasx.h"
void ff_hevc_dsp_init_loongarch(HEVCDSPContext *c, const int bit_depth)
{
@@ -160,6 +161,26 @@ void ff_hevc_dsp_init_loongarch(HEVCDSPContext *c, const int bit_depth)
c->put_hevc_epel_uni[6][1][1] = ff_hevc_put_hevc_uni_epel_hv24_8_lsx;
c->put_hevc_epel_uni[7][1][1] = ff_hevc_put_hevc_uni_epel_hv32_8_lsx;
+ c->put_hevc_qpel_uni_w[1][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels4_8_lsx;
+ c->put_hevc_qpel_uni_w[2][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels6_8_lsx;
+ c->put_hevc_qpel_uni_w[3][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels8_8_lsx;
+ c->put_hevc_qpel_uni_w[4][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels12_8_lsx;
+ c->put_hevc_qpel_uni_w[5][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels16_8_lsx;
+ c->put_hevc_qpel_uni_w[6][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels24_8_lsx;
+ c->put_hevc_qpel_uni_w[7][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels32_8_lsx;
+ c->put_hevc_qpel_uni_w[8][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels48_8_lsx;
+ c->put_hevc_qpel_uni_w[9][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels64_8_lsx;
+
+ c->put_hevc_epel_uni_w[1][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels4_8_lsx;
+ c->put_hevc_epel_uni_w[2][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels6_8_lsx;
+ c->put_hevc_epel_uni_w[3][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels8_8_lsx;
+ c->put_hevc_epel_uni_w[4][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels12_8_lsx;
+ c->put_hevc_epel_uni_w[5][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels16_8_lsx;
+ c->put_hevc_epel_uni_w[6][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels24_8_lsx;
+ c->put_hevc_epel_uni_w[7][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels32_8_lsx;
+ c->put_hevc_epel_uni_w[8][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels48_8_lsx;
+ c->put_hevc_epel_uni_w[9][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels64_8_lsx;
+
c->put_hevc_qpel_uni_w[3][1][1] = ff_hevc_put_hevc_uni_w_qpel_hv8_8_lsx;
c->put_hevc_qpel_uni_w[5][1][1] = ff_hevc_put_hevc_uni_w_qpel_hv16_8_lsx;
c->put_hevc_qpel_uni_w[6][1][1] = ff_hevc_put_hevc_uni_w_qpel_hv24_8_lsx;
@@ -196,4 +217,26 @@ void ff_hevc_dsp_init_loongarch(HEVCDSPContext *c, const int bit_depth)
c->add_residual[3] = ff_hevc_add_residual32x32_8_lsx;
}
}
+
+ if (have_lasx(cpu_flags)) {
+ if (bit_depth == 8) {
+ c->put_hevc_qpel_uni_w[2][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels6_8_lasx;
+ c->put_hevc_qpel_uni_w[3][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels8_8_lasx;
+ c->put_hevc_qpel_uni_w[4][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels12_8_lasx;
+ c->put_hevc_qpel_uni_w[5][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels16_8_lasx;
+ c->put_hevc_qpel_uni_w[6][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels24_8_lasx;
+ c->put_hevc_qpel_uni_w[7][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels32_8_lasx;
+ c->put_hevc_qpel_uni_w[8][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels48_8_lasx;
+ c->put_hevc_qpel_uni_w[9][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels64_8_lasx;
+
+ c->put_hevc_epel_uni_w[2][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels6_8_lasx;
+ c->put_hevc_epel_uni_w[3][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels8_8_lasx;
+ c->put_hevc_epel_uni_w[4][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels12_8_lasx;
+ c->put_hevc_epel_uni_w[5][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels16_8_lasx;
+ c->put_hevc_epel_uni_w[6][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels24_8_lasx;
+ c->put_hevc_epel_uni_w[7][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels32_8_lasx;
+ c->put_hevc_epel_uni_w[8][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels48_8_lasx;
+ c->put_hevc_epel_uni_w[9][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels64_8_lasx;
+ }
+ }
}
diff --git a/libavcodec/loongarch/hevcdsp_lasx.h b/libavcodec/loongarch/hevcdsp_lasx.h
new file mode 100644
index 0000000000..819c3c3ecf
--- /dev/null
+++ b/libavcodec/loongarch/hevcdsp_lasx.h
@@ -0,0 +1,53 @@
+/*
+ * Copyright (c) 2023 Loongson Technology Corporation Limited
+ * Contributed by jinbo <jinbo@loongson.cn>
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#ifndef AVCODEC_LOONGARCH_HEVCDSP_LASX_H
+#define AVCODEC_LOONGARCH_HEVCDSP_LASX_H
+
+#include "libavcodec/hevcdsp.h"
+
+#define PEL_UNI_W(PEL, DIR, WIDTH) \
+void ff_hevc_put_hevc_##PEL##_uni_w_##DIR##WIDTH##_8_lasx(uint8_t *dst, \
+ ptrdiff_t \
+ dst_stride, \
+ const uint8_t *src, \
+ ptrdiff_t \
+ src_stride, \
+ int height, \
+ int denom, \
+ int wx, \
+ int ox, \
+ intptr_t mx, \
+ intptr_t my, \
+ int width)
+
+PEL_UNI_W(pel, pixels, 6);
+PEL_UNI_W(pel, pixels, 8);
+PEL_UNI_W(pel, pixels, 12);
+PEL_UNI_W(pel, pixels, 16);
+PEL_UNI_W(pel, pixels, 24);
+PEL_UNI_W(pel, pixels, 32);
+PEL_UNI_W(pel, pixels, 48);
+PEL_UNI_W(pel, pixels, 64);
+
+#undef PEL_UNI_W
+
+#endif // #ifndef AVCODEC_LOONGARCH_HEVCDSP_LASX_H
diff --git a/libavcodec/loongarch/hevcdsp_lsx.h b/libavcodec/loongarch/hevcdsp_lsx.h
index ac509984fd..0d724a90ef 100644
--- a/libavcodec/loongarch/hevcdsp_lsx.h
+++ b/libavcodec/loongarch/hevcdsp_lsx.h
@@ -232,4 +232,31 @@ void ff_hevc_add_residual8x8_8_lsx(uint8_t *dst, const int16_t *res, ptrdiff_t s
void ff_hevc_add_residual16x16_8_lsx(uint8_t *dst, const int16_t *res, ptrdiff_t stride);
void ff_hevc_add_residual32x32_8_lsx(uint8_t *dst, const int16_t *res, ptrdiff_t stride);
+#define PEL_UNI_W(PEL, DIR, WIDTH) \
+void ff_hevc_put_hevc_##PEL##_uni_w_##DIR##WIDTH##_8_lsx(uint8_t *dst, \
+ ptrdiff_t \
+ dst_stride, \
+ const uint8_t *src, \
+ ptrdiff_t \
+ src_stride, \
+ int height, \
+ int denom, \
+ int wx, \
+ int ox, \
+ intptr_t mx, \
+ intptr_t my, \
+ int width)
+
+PEL_UNI_W(pel, pixels, 4);
+PEL_UNI_W(pel, pixels, 6);
+PEL_UNI_W(pel, pixels, 8);
+PEL_UNI_W(pel, pixels, 12);
+PEL_UNI_W(pel, pixels, 16);
+PEL_UNI_W(pel, pixels, 24);
+PEL_UNI_W(pel, pixels, 32);
+PEL_UNI_W(pel, pixels, 48);
+PEL_UNI_W(pel, pixels, 64);
+
+#undef PEL_UNI_W
+
#endif // #ifndef AVCODEC_LOONGARCH_HEVCDSP_LSX_H
--
2.20.1
* [FFmpeg-devel] [PATCH v1 4/6] avcodec/hevc: Add qpel_uni_w_v|h4/6/8/12/16/24/32/48/64 asm opt
From: jinbo @ 2023-12-22 10:52 UTC (permalink / raw)
To: ffmpeg-devel; +Cc: jinbo
tests/checkasm/checkasm:             C      LSX     LASX
put_hevc_qpel_uni_w_h4_8_c:        6.5      1.7      1.2
put_hevc_qpel_uni_w_h6_8_c:       14.5      4.5      3.7
put_hevc_qpel_uni_w_h8_8_c:       24.5      5.7      4.5
put_hevc_qpel_uni_w_h12_8_c:      54.7     17.5     12.0
put_hevc_qpel_uni_w_h16_8_c:      96.5     22.7     13.2
put_hevc_qpel_uni_w_h24_8_c:     216.0     51.2     33.2
put_hevc_qpel_uni_w_h32_8_c:     385.7     87.0     53.2
put_hevc_qpel_uni_w_h48_8_c:     860.5    192.0    113.2
put_hevc_qpel_uni_w_h64_8_c:    1531.0    334.2    200.0
put_hevc_qpel_uni_w_v4_8_c:        8.0      1.7
put_hevc_qpel_uni_w_v6_8_c:       17.2      4.5
put_hevc_qpel_uni_w_v8_8_c:       29.5      6.0      5.2
put_hevc_qpel_uni_w_v12_8_c:      65.2     16.0     11.7
put_hevc_qpel_uni_w_v16_8_c:     116.5     20.5     14.0
put_hevc_qpel_uni_w_v24_8_c:     259.2     48.5     37.2
put_hevc_qpel_uni_w_v32_8_c:     459.5     80.5     56.0
put_hevc_qpel_uni_w_v48_8_c:    1028.5    180.2    126.5
put_hevc_qpel_uni_w_v64_8_c:    1831.2    319.2    224.2
Speedup of decoding H265 4K 30FPS 30Mbps on
3A6000 with 8 threads is 4fps (48fps-->52fps).
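As background, a minimal C sketch of the per-pixel computation behind these
qpel_uni_w kernels for 8-bit input: an 8-tap FIR in one direction followed by
the same weight/offset/shift step as the pel_uni_w case (a sketch of the
assumed semantics, not code from the patch; 'filter' stands for the 8-entry
row of ff_hevc_qpel_filters selected by mx or my):

    static uint8_t qpel_uni_w_pixel_8(const uint8_t *src, ptrdiff_t pstride,
                                      const int8_t *filter,
                                      int denom, int wx, int ox)
    {
        int shift  = denom + 6;
        int offset = 1 << (shift - 1);
        int sum    = 0;
        /* pstride is 1 for the _h variants, srcstride for the _v variants */
        for (int i = 0; i < 8; i++)
            sum += src[(i - 3) * pstride] * filter[i];
        int v = ((sum * wx + offset) >> shift) + ox;
        return v < 0 ? 0 : (v > 255 ? 255 : v);
    }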
Change-Id: I1178848541d90083869225ba98a02e6aa8bb8c5a
---
libavcodec/loongarch/hevc_mc.S | 1294 +++++++++++++++++
libavcodec/loongarch/hevcdsp_init_loongarch.c | 38 +
libavcodec/loongarch/hevcdsp_lasx.h | 18 +
libavcodec/loongarch/hevcdsp_lsx.h | 20 +
4 files changed, 1370 insertions(+)
diff --git a/libavcodec/loongarch/hevc_mc.S b/libavcodec/loongarch/hevc_mc.S
index c5d553effe..2ee338fb8e 100644
--- a/libavcodec/loongarch/hevc_mc.S
+++ b/libavcodec/loongarch/hevc_mc.S
@@ -21,6 +21,8 @@
#include "loongson_asm.S"
+.extern ff_hevc_qpel_filters
+
.macro LOAD_VAR bit
addi.w t1, a5, 6 //shift
addi.w t3, zero, 1 //one
@@ -469,3 +471,1295 @@ function ff_hevc_put_hevc_pel_uni_w_pixels64_8_lasx
addi.w a4, a4, -1
bnez a4, .LOOP_PIXELS64_LASX
endfunc
+
+.macro vhaddw.d.h in0
+ vhaddw.w.h \in0, \in0, \in0
+ vhaddw.d.w \in0, \in0, \in0
+.endm
+
+.macro xvhaddw.d.h in0
+ xvhaddw.w.h \in0, \in0, \in0
+ xvhaddw.d.w \in0, \in0, \in0
+.endm
+
+function ff_hevc_put_hevc_qpel_uni_w_v4_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ add.d t2, t1, a3 //stride * 4
+ sub.d a2, a2, t1 //src -= stride*3
+ fld.s f6, a2, 0 //0
+ fldx.s f7, a2, a3 //1
+ fldx.s f8, a2, t0 //2
+ add.d a2, a2, t1
+ fld.s f9, a2, 0 //3
+ fldx.s f10, a2, a3 //4
+ fldx.s f11, a2, t0 //5
+ fldx.s f12, a2, t1 //6
+ add.d a2, a2, t2
+ vilvl.b vr6, vr7, vr6
+ vilvl.b vr7, vr9, vr8
+ vilvl.b vr8, vr11, vr10
+ vilvl.b vr9, vr13, vr12
+ vilvl.h vr6, vr7, vr6
+ vilvl.h vr7, vr9, vr8
+ vilvl.w vr8, vr7, vr6
+ vilvh.w vr9, vr7, vr6
+.LOOP_V4:
+ fld.s f13, a2, 0 //7
+ fldx.s f14, a2, a3 //8 next loop
+ add.d a2, a2, t0
+ vextrins.b vr8, vr13, 0x70
+ vextrins.b vr8, vr13, 0xf1
+ vextrins.b vr9, vr13, 0x72
+ vextrins.b vr9, vr13, 0xf3
+ vbsrl.v vr10, vr8, 1
+ vbsrl.v vr11, vr9, 1
+ vextrins.b vr10, vr14, 0x70
+ vextrins.b vr10, vr14, 0xf1
+ vextrins.b vr11, vr14, 0x72
+ vextrins.b vr11, vr14, 0xf3
+ vdp2.h.bu.b vr6, vr8, vr5 //QPEL_FILTER(src, stride)
+ vdp2.h.bu.b vr7, vr9, vr5
+ vdp2.h.bu.b vr12, vr10, vr5
+ vdp2.h.bu.b vr13, vr11, vr5
+ vbsrl.v vr8, vr10, 1
+ vbsrl.v vr9, vr11, 1
+ vhaddw.d.h vr6
+ vhaddw.d.h vr7
+ vhaddw.d.h vr12
+ vhaddw.d.h vr13
+ vpickev.w vr6, vr7, vr6
+ vpickev.w vr12, vr13, vr12
+ vmulwev.w.h vr6, vr6, vr1 //QPEL_FILTER(src, stride) * wx
+ vmulwev.w.h vr12, vr12, vr1
+ vadd.w vr6, vr6, vr2
+ vsra.w vr6, vr6, vr3
+ vadd.w vr6, vr6, vr4
+ vadd.w vr12, vr12, vr2
+ vsra.w vr12, vr12, vr3
+ vadd.w vr12, vr12, vr4
+ vssrani.h.w vr12, vr6, 0
+ vssrani.bu.h vr12, vr12, 0
+ fst.s f12, a0, 0
+ add.d a0, a0, a1
+ vstelm.w vr12, a0, 0, 1
+ add.d a0, a0, a1
+ addi.d a4, a4, -2
+ bnez a4, .LOOP_V4
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_v6_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ add.d t2, t1, a3 //stride * 4
+ sub.d a2, a2, t1 //src -= stride*3
+ fld.d f6, a2, 0
+ fldx.d f7, a2, a3
+ fldx.d f8, a2, t0
+ add.d a2, a2, t1
+ fld.d f9, a2, 0
+ fldx.d f10, a2, a3
+ fldx.d f11, a2, t0
+ fldx.d f12, a2, t1
+ add.d a2, a2, t2
+ vilvl.b vr6, vr7, vr6 //transpose 8x6 to 3x16
+ vilvl.b vr7, vr9, vr8
+ vilvl.b vr8, vr11, vr10
+ vilvl.b vr9, vr13, vr12
+ vilvl.h vr10, vr7, vr6
+ vilvh.h vr11, vr7, vr6
+ vilvl.h vr12, vr9, vr8
+ vilvh.h vr13, vr9, vr8
+ vilvl.w vr6, vr12, vr10
+ vilvh.w vr7, vr12, vr10
+ vilvl.w vr8, vr13, vr11
+.LOOP_V6:
+ fld.d f13, a2, 0
+ add.d a2, a2, a3
+ vextrins.b vr6, vr13, 0x70
+ vextrins.b vr6, vr13, 0xf1
+ vextrins.b vr7, vr13, 0x72
+ vextrins.b vr7, vr13, 0xf3
+ vextrins.b vr8, vr13, 0x74
+ vextrins.b vr8, vr13, 0xf5
+ vdp2.h.bu.b vr10, vr6, vr5 //QPEL_FILTER(src, stride)
+ vdp2.h.bu.b vr11, vr7, vr5
+ vdp2.h.bu.b vr12, vr8, vr5
+ vbsrl.v vr6, vr6, 1
+ vbsrl.v vr7, vr7, 1
+ vbsrl.v vr8, vr8, 1
+ vhaddw.d.h vr10
+ vhaddw.d.h vr11
+ vhaddw.d.h vr12
+ vpickev.w vr10, vr11, vr10
+ vpickev.w vr11, vr13, vr12
+ vmulwev.w.h vr10, vr10, vr1 //QPEL_FILTER(src, stride) * wx
+ vmulwev.w.h vr11, vr11, vr1
+ vadd.w vr10, vr10, vr2
+ vadd.w vr11, vr11, vr2
+ vsra.w vr10, vr10, vr3
+ vsra.w vr11, vr11, vr3
+ vadd.w vr10, vr10, vr4
+ vadd.w vr11, vr11, vr4
+ vssrani.h.w vr11, vr10, 0
+ vssrani.bu.h vr11, vr11, 0
+ fst.s f11, a0, 0
+ vstelm.h vr11, a0, 4, 2
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_V6
+endfunc
+
+// transpose 8x8b to 4x16b
+.macro TRANSPOSE8X8B_LSX in0, in1, in2, in3, in4, in5, in6, in7, \
+ out0, out1, out2, out3
+ vilvl.b \in0, \in1, \in0
+ vilvl.b \in1, \in3, \in2
+ vilvl.b \in2, \in5, \in4
+ vilvl.b \in3, \in7, \in6
+ vilvl.h \in4, \in1, \in0
+ vilvh.h \in5, \in1, \in0
+ vilvl.h \in6, \in3, \in2
+ vilvh.h \in7, \in3, \in2
+ vilvl.w \out0, \in6, \in4
+ vilvh.w \out1, \in6, \in4
+ vilvl.w \out2, \in7, \in5
+ vilvh.w \out3, \in7, \in5
+.endm
+
+.macro PUT_HEVC_QPEL_UNI_W_V8_LSX in0, in1, in2, in3, out0, out1, pos
+.if \pos == 0
+ vextrins.b \in0, vr13, 0x70 //insert the 8th load
+ vextrins.b \in0, vr13, 0xf1
+ vextrins.b \in1, vr13, 0x72
+ vextrins.b \in1, vr13, 0xf3
+ vextrins.b \in2, vr13, 0x74
+ vextrins.b \in2, vr13, 0xf5
+ vextrins.b \in3, vr13, 0x76
+ vextrins.b \in3, vr13, 0xf7
+.else// \pos == 8
+ vextrins.b \in0, vr13, 0x78
+ vextrins.b \in0, vr13, 0xf9
+ vextrins.b \in1, vr13, 0x7a
+ vextrins.b \in1, vr13, 0xfb
+ vextrins.b \in2, vr13, 0x7c
+ vextrins.b \in2, vr13, 0xfd
+ vextrins.b \in3, vr13, 0x7e
+ vextrins.b \in3, vr13, 0xff
+.endif
+ vdp2.h.bu.b \out0, \in0, vr5 //QPEL_FILTER(src, stride)
+ vdp2.h.bu.b \out1, \in1, vr5
+ vdp2.h.bu.b vr12, \in2, vr5
+ vdp2.h.bu.b vr20, \in3, vr5
+ vbsrl.v \in0, \in0, 1 //Back up previous 7 loaded datas,
+ vbsrl.v \in1, \in1, 1 //so just need to insert the 8th
+ vbsrl.v \in2, \in2, 1 //load in the next loop.
+ vbsrl.v \in3, \in3, 1
+ vhaddw.d.h \out0
+ vhaddw.d.h \out1
+ vhaddw.d.h vr12
+ vhaddw.d.h vr20
+ vpickev.w \out0, \out1, \out0
+ vpickev.w \out1, vr20, vr12
+ vmulwev.w.h \out0, \out0, vr1 //QPEL_FILTER(src, stride) * wx
+ vmulwev.w.h \out1, \out1, vr1
+ vadd.w \out0, \out0, vr2
+ vadd.w \out1, \out1, vr2
+ vsra.w \out0, \out0, vr3
+ vsra.w \out1, \out1, vr3
+ vadd.w \out0, \out0, vr4
+ vadd.w \out1, \out1, vr4
+.endm
+
+function ff_hevc_put_hevc_qpel_uni_w_v8_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ add.d t2, t1, a3 //stride * 4
+ sub.d a2, a2, t1 //src -= stride*3
+ fld.d f6, a2, 0
+ fldx.d f7, a2, a3
+ fldx.d f8, a2, t0
+ add.d a2, a2, t1
+ fld.d f9, a2, 0
+ fldx.d f10, a2, a3
+ fldx.d f11, a2, t0
+ fldx.d f12, a2, t1
+ add.d a2, a2, t2
+ TRANSPOSE8X8B_LSX vr6, vr7, vr8, vr9, vr10, vr11, vr12, vr13, \
+ vr6, vr7, vr8, vr9
+.LOOP_V8:
+ fld.d f13, a2, 0 //the 8th load
+ add.d a2, a2, a3
+ PUT_HEVC_QPEL_UNI_W_V8_LSX vr6, vr7, vr8, vr9, vr10, vr11, 0
+ vssrani.h.w vr11, vr10, 0
+ vssrani.bu.h vr11, vr11, 0
+ fst.d f11, a0, 0
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_V8
+endfunc
+
+.macro PUT_HEVC_UNI_W_V8_LASX w
+ fld.d f6, a2, 0
+ fldx.d f7, a2, a3
+ fldx.d f8, a2, t0
+ add.d a2, a2, t1
+ fld.d f9, a2, 0
+ fldx.d f10, a2, a3
+ fldx.d f11, a2, t0
+ fldx.d f12, a2, t1
+ add.d a2, a2, t2
+ TRANSPOSE8X8B_LSX vr6, vr7, vr8, vr9, vr10, vr11, vr12, vr13, \
+ vr6, vr7, vr8, vr9
+ xvpermi.q xr6, xr7, 0x02
+ xvpermi.q xr8, xr9, 0x02
+.LOOP_V8_LASX_\w:
+ fld.d f13, a2, 0 // 0 1 2 3 4 5 6 7 the 8th load
+ add.d a2, a2, a3
+ vshuf4i.h vr13, vr13, 0xd8
+ vbsrl.v vr14, vr13, 4
+ xvpermi.q xr13, xr14, 0x02 //0 1 4 5 * * * * 2 3 6 7 * * * *
+ xvextrins.b xr6, xr13, 0x70 //begin to insert the 8th load
+ xvextrins.b xr6, xr13, 0xf1
+ xvextrins.b xr8, xr13, 0x72
+ xvextrins.b xr8, xr13, 0xf3
+ xvdp2.h.bu.b xr20, xr6, xr5 //QPEL_FILTER(src, stride)
+ xvdp2.h.bu.b xr21, xr8, xr5
+ xvbsrl.v xr6, xr6, 1
+ xvbsrl.v xr8, xr8, 1
+ xvhaddw.d.h xr20
+ xvhaddw.d.h xr21
+ xvpickev.w xr20, xr21, xr20
+ xvpermi.d xr20, xr20, 0xd8
+ xvmulwev.w.h xr20, xr20, xr1 //QPEL_FILTER(src, stride) * wx
+ xvadd.w xr20, xr20, xr2
+ xvsra.w xr20, xr20, xr3
+ xvadd.w xr10, xr20, xr4
+ xvpermi.q xr11, xr10, 0x01
+ vssrani.h.w vr11, vr10, 0
+ vssrani.bu.h vr11, vr11, 0
+ fst.d f11, a0, 0
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_V8_LASX_\w
+.endm
+
+function ff_hevc_put_hevc_qpel_uni_w_v8_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ xvreplve0.q xr5, xr5
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ add.d t2, t1, a3 //stride * 4
+ sub.d a2, a2, t1 //src -= stride*3
+ PUT_HEVC_UNI_W_V8_LASX 8
+endfunc
+
+.macro PUT_HEVC_QPEL_UNI_W_V16_LSX w
+ vld vr6, a2, 0
+ vldx vr7, a2, a3
+ vldx vr8, a2, t0
+ add.d a2, a2, t1
+ vld vr9, a2, 0
+ vldx vr10, a2, a3
+ vldx vr11, a2, t0
+ vldx vr12, a2, t1
+ add.d a2, a2, t2
+.if \w > 8
+ vilvh.d vr14, vr14, vr6
+ vilvh.d vr15, vr15, vr7
+ vilvh.d vr16, vr16, vr8
+ vilvh.d vr17, vr17, vr9
+ vilvh.d vr18, vr18, vr10
+ vilvh.d vr19, vr19, vr11
+ vilvh.d vr20, vr20, vr12
+.endif
+ TRANSPOSE8X8B_LSX vr6, vr7, vr8, vr9, vr10, vr11, vr12, vr13, \
+ vr6, vr7, vr8, vr9
+.if \w > 8
+ TRANSPOSE8X8B_LSX vr14, vr15, vr16, vr17, vr18, vr19, vr20, vr21, \
+ vr14, vr15, vr16, vr17
+.endif
+.LOOP_HORI_16_\w:
+ vld vr13, a2, 0
+ add.d a2, a2, a3
+ PUT_HEVC_QPEL_UNI_W_V8_LSX vr6, vr7, vr8, vr9, vr10, vr11, 0
+.if \w > 8
+ PUT_HEVC_QPEL_UNI_W_V8_LSX vr14, vr15, vr16, vr17, vr18, vr19, 8
+.endif
+ vssrani.h.w vr11, vr10, 0
+.if \w > 8
+ vssrani.h.w vr19, vr18, 0
+ vssrani.bu.h vr19, vr11, 0
+.else
+ vssrani.bu.h vr11, vr11, 0
+.endif
+.if \w == 8
+ fst.d f11, a0, 0
+.elseif \w == 12
+ fst.d f19, a0, 0
+ vstelm.w vr19, a0, 8, 2
+.else
+ vst vr19, a0, 0
+.endif
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_HORI_16_\w
+.endm
+
+function ff_hevc_put_hevc_qpel_uni_w_v16_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ add.d t2, t1, a3 //stride * 4
+ sub.d a2, a2, t1 //src -= stride*3
+ PUT_HEVC_QPEL_UNI_W_V16_LSX 16
+endfunc
+
+.macro PUT_HEVC_QPEL_UNI_W_V16_LASX w
+ vld vr6, a2, 0
+ vldx vr7, a2, a3
+ vldx vr8, a2, t0
+ add.d a2, a2, t1
+ vld vr9, a2, 0
+ vldx vr10, a2, a3
+ vldx vr11, a2, t0
+ vldx vr12, a2, t1
+ add.d a2, a2, t2
+ xvpermi.q xr6, xr10, 0x02 //pack and transpose the 8x16 to 4x32 begin
+ xvpermi.q xr7, xr11, 0x02
+ xvpermi.q xr8, xr12, 0x02
+ xvpermi.q xr9, xr13, 0x02
+ xvilvl.b xr14, xr7, xr6 //0 2
+ xvilvh.b xr15, xr7, xr6 //1 3
+ xvilvl.b xr16, xr9, xr8 //0 2
+ xvilvh.b xr17, xr9, xr8 //1 3
+ xvpermi.d xr14, xr14, 0xd8
+ xvpermi.d xr15, xr15, 0xd8
+ xvpermi.d xr16, xr16, 0xd8
+ xvpermi.d xr17, xr17, 0xd8
+ xvilvl.h xr6, xr16, xr14
+ xvilvh.h xr7, xr16, xr14
+ xvilvl.h xr8, xr17, xr15
+ xvilvh.h xr9, xr17, xr15
+ xvilvl.w xr14, xr7, xr6 //0 1 4 5
+ xvilvh.w xr15, xr7, xr6 //2 3 6 7
+ xvilvl.w xr16, xr9, xr8 //8 9 12 13
+ xvilvh.w xr17, xr9, xr8 //10 11 14 15 end
+.LOOP_HORI_16_LASX_\w:
+ vld vr13, a2, 0 //the 8th load
+ add.d a2, a2, a3
+ vshuf4i.w vr13, vr13, 0xd8
+ vbsrl.v vr12, vr13, 8
+ xvpermi.q xr13, xr12, 0x02
+ xvextrins.b xr14, xr13, 0x70 //inset the 8th load
+ xvextrins.b xr14, xr13, 0xf1
+ xvextrins.b xr15, xr13, 0x72
+ xvextrins.b xr15, xr13, 0xf3
+ xvextrins.b xr16, xr13, 0x74
+ xvextrins.b xr16, xr13, 0xf5
+ xvextrins.b xr17, xr13, 0x76
+ xvextrins.b xr17, xr13, 0xf7
+ xvdp2.h.bu.b xr6, xr14, xr5 //QPEL_FILTER(src, stride)
+ xvdp2.h.bu.b xr7, xr15, xr5
+ xvdp2.h.bu.b xr8, xr16, xr5
+ xvdp2.h.bu.b xr9, xr17, xr5
+ xvhaddw.d.h xr6
+ xvhaddw.d.h xr7
+ xvhaddw.d.h xr8
+ xvhaddw.d.h xr9
+ xvbsrl.v xr14, xr14, 1 //Back up previous 7 loaded datas,
+ xvbsrl.v xr15, xr15, 1 //so just need to insert the 8th
+ xvbsrl.v xr16, xr16, 1 //load in next loop.
+ xvbsrl.v xr17, xr17, 1
+ xvpickev.w xr6, xr7, xr6 //0 1 2 3 4 5 6 7
+ xvpickev.w xr7, xr9, xr8 //8 9 10 11 12 13 14 15
+ xvmulwev.w.h xr6, xr6, xr1 //QPEL_FILTER(src, stride) * wx
+ xvmulwev.w.h xr7, xr7, xr1
+ xvadd.w xr6, xr6, xr2
+ xvadd.w xr7, xr7, xr2
+ xvsra.w xr6, xr6, xr3
+ xvsra.w xr7, xr7, xr3
+ xvadd.w xr6, xr6, xr4
+ xvadd.w xr7, xr7, xr4
+ xvssrani.h.w xr7, xr6, 0 //0 1 2 3 8 9 10 11 4 5 6 7 12 13 14 15
+ xvpermi.q xr6, xr7, 0x01
+ vssrani.bu.h vr6, vr7, 0
+ vshuf4i.w vr6, vr6, 0xd8
+.if \w == 12
+ fst.d f6, a0, 0
+ vstelm.w vr6, a0, 8, 2
+.else
+ vst vr6, a0, 0
+.endif
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_HORI_16_LASX_\w
+.endm
+
+function ff_hevc_put_hevc_qpel_uni_w_v16_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ xvreplve0.q xr5, xr5
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ add.d t2, t1, a3 //stride * 4
+ sub.d a2, a2, t1 //src -= stride*3
+ PUT_HEVC_QPEL_UNI_W_V16_LASX 16
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_v12_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ add.d t2, t1, a3 //stride * 4
+ sub.d a2, a2, t1 //src -= stride*3
+ PUT_HEVC_QPEL_UNI_W_V16_LSX 12
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_v12_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ xvreplve0.q xr5, xr5
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ add.d t2, t1, a3 //stride * 4
+ sub.d a2, a2, t1 //src -= stride*3
+ PUT_HEVC_QPEL_UNI_W_V16_LASX 12
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_v24_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ add.d t2, t1, a3 //stride * 4
+ sub.d a2, a2, t1 //src -= stride*3
+ addi.d t4, a0, 0 //save dst
+ addi.d t5, a2, 0 //save src
+ addi.d t6, a4, 0
+ PUT_HEVC_QPEL_UNI_W_V16_LSX 24
+ addi.d a0, t4, 16
+ addi.d a2, t5, 16
+ addi.d a4, t6, 0
+ PUT_HEVC_QPEL_UNI_W_V16_LSX 8
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_v24_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ xvreplve0.q xr5, xr5
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ add.d t2, t1, a3 //stride * 4
+ sub.d a2, a2, t1 //src -= stride*3
+ addi.d t4, a0, 0 //save dst
+ addi.d t5, a2, 0 //save src
+ addi.d t6, a4, 0
+ PUT_HEVC_QPEL_UNI_W_V16_LASX 24
+ addi.d a0, t4, 16
+ addi.d a2, t5, 16
+ addi.d a4, t6, 0
+ PUT_HEVC_UNI_W_V8_LASX 24
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_v32_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ add.d t2, t1, a3 //stride * 4
+ sub.d a2, a2, t1 //src -= stride*3
+ addi.d t3, zero, 2
+ addi.d t4, a0, 0 //save dst
+ addi.d t5, a2, 0 //save src
+ addi.d t6, a4, 0
+.LOOP_V32:
+ PUT_HEVC_QPEL_UNI_W_V16_LSX 32
+ addi.d t3, t3, -1
+ addi.d a0, t4, 16
+ addi.d a2, t5, 16
+ addi.d a4, t6, 0
+ bnez t3, .LOOP_V32
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_v32_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ xvreplve0.q xr5, xr5
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ add.d t2, t1, a3 //stride * 4
+ sub.d a2, a2, t1 //src -= stride*3
+ addi.d t3, zero, 2
+ addi.d t4, a0, 0 //save dst
+ addi.d t5, a2, 0 //save src
+ addi.d t6, a4, 0
+.LOOP_V32_LASX:
+ PUT_HEVC_QPEL_UNI_W_V16_LASX 32
+ addi.d t3, t3, -1
+ addi.d a0, t4, 16
+ addi.d a2, t5, 16
+ addi.d a4, t6, 0
+ bnez t3, .LOOP_V32_LASX
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_v48_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ add.d t2, t1, a3 //stride * 4
+ sub.d a2, a2, t1 //src -= stride*3
+ addi.d t3, zero, 3
+ addi.d t4, a0, 0 //save dst
+ addi.d t5, a2, 0 //save src
+ addi.d t6, a4, 0
+.LOOP_V48:
+ PUT_HEVC_QPEL_UNI_W_V16_LSX 48
+ addi.d t3, t3, -1
+ addi.d a0, t4, 16
+ addi.d t4, t4, 16
+ addi.d a2, t5, 16
+ addi.d t5, t5, 16
+ addi.d a4, t6, 0
+ bnez t3, .LOOP_V48
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_v48_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ xvreplve0.q xr5, xr5
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ add.d t2, t1, a3 //stride * 4
+ sub.d a2, a2, t1 //src -= stride*3
+ addi.d t3, zero, 3
+ addi.d t4, a0, 0 //save dst
+ addi.d t5, a2, 0 //save src
+ addi.d t6, a4, 0
+.LOOP_V48_LASX:
+ PUT_HEVC_QPEL_UNI_W_V16_LASX 48
+ addi.d t3, t3, -1
+ addi.d a0, t4, 16
+ addi.d t4, t4, 16
+ addi.d a2, t5, 16
+ addi.d t5, t5, 16
+ addi.d a4, t6, 0
+ bnez t3, .LOOP_V48_LASX
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_v64_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ add.d t2, t1, a3 //stride * 4
+ sub.d a2, a2, t1 //src -= stride*3
+ addi.d t3, zero, 4
+ addi.d t4, a0, 0 //save dst
+ addi.d t5, a2, 0 //save src
+ addi.d t6, a4, 0
+.LOOP_V64:
+ PUT_HEVC_QPEL_UNI_W_V16_LSX 64
+ addi.d t3, t3, -1
+ addi.d a0, t4, 16
+ addi.d t4, t4, 16
+ addi.d a2, t5, 16
+ addi.d t5, t5, 16
+ addi.d a4, t6, 0
+ bnez t3, .LOOP_V64
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_v64_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ xvreplve0.q xr5, xr5
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ add.d t2, t1, a3 //stride * 4
+ sub.d a2, a2, t1 //src -= stride*3
+ addi.d t3, zero, 4
+ addi.d t4, a0, 0 //save dst
+ addi.d t5, a2, 0 //save src
+ addi.d t6, a4, 0
+.LOOP_V64_LASX:
+ PUT_HEVC_QPEL_UNI_W_V16_LASX 64
+ addi.d t3, t3, -1
+ addi.d a0, t4, 16
+ addi.d t4, t4, 16
+ addi.d a2, t5, 16
+ addi.d t5, t5, 16
+ addi.d a4, t6, 0
+ bnez t3, .LOOP_V64_LASX
+endfunc
+
+.macro PUT_HEVC_QPEL_UNI_W_H8_LSX in0, out0, out1
+ vbsrl.v vr7, \in0, 1
+ vbsrl.v vr8, \in0, 2
+ vbsrl.v vr9, \in0, 3
+ vbsrl.v vr10, \in0, 4
+ vbsrl.v vr11, \in0, 5
+ vbsrl.v vr12, \in0, 6
+ vbsrl.v vr13, \in0, 7
+ vilvl.d vr6, vr7, \in0
+ vilvl.d vr7, vr9, vr8
+ vilvl.d vr8, vr11, vr10
+ vilvl.d vr9, vr13, vr12
+ vdp2.h.bu.b vr10, vr6, vr5
+ vdp2.h.bu.b vr11, vr7, vr5
+ vdp2.h.bu.b vr12, vr8, vr5
+ vdp2.h.bu.b vr13, vr9, vr5
+ vhaddw.d.h vr10
+ vhaddw.d.h vr11
+ vhaddw.d.h vr12
+ vhaddw.d.h vr13
+ vpickev.w vr10, vr11, vr10
+ vpickev.w vr11, vr13, vr12
+ vmulwev.w.h vr10, vr10, vr1
+ vmulwev.w.h vr11, vr11, vr1
+ vadd.w vr10, vr10, vr2
+ vadd.w vr11, vr11, vr2
+ vsra.w vr10, vr10, vr3
+ vsra.w vr11, vr11, vr3
+ vadd.w \out0, vr10, vr4
+ vadd.w \out1, vr11, vr4
+.endm
+
+.macro PUT_HEVC_QPEL_UNI_W_H8_LASX in0, out0
+ xvbsrl.v xr7, \in0, 4
+ xvpermi.q xr7, \in0, 0x20
+ xvbsrl.v xr8, xr7, 1
+ xvbsrl.v xr9, xr7, 2
+ xvbsrl.v xr10, xr7, 3
+ xvpackev.d xr7, xr8, xr7
+ xvpackev.d xr8, xr10, xr9
+ xvdp2.h.bu.b xr10, xr7, xr5
+ xvdp2.h.bu.b xr11, xr8, xr5
+ xvhaddw.d.h xr10
+ xvhaddw.d.h xr11
+ xvpickev.w xr10, xr11, xr10
+ xvmulwev.w.h xr10, xr10, xr1
+ xvadd.w xr10, xr10, xr2
+ xvsra.w xr10, xr10, xr3
+ xvadd.w \out0, xr10, xr4
+.endm
+
+.macro PUT_HEVC_QPEL_UNI_W_H16_LASX in0, out0
+ xvpermi.d xr6, \in0, 0x94
+ xvbsrl.v xr7, xr6, 1
+ xvbsrl.v xr8, xr6, 2
+ xvbsrl.v xr9, xr6, 3
+ xvbsrl.v xr10, xr6, 4
+ xvbsrl.v xr11, xr6, 5
+ xvbsrl.v xr12, xr6, 6
+ xvbsrl.v xr13, xr6, 7
+ xvpackev.d xr6, xr7, xr6
+ xvpackev.d xr7, xr9, xr8
+ xvpackev.d xr8, xr11, xr10
+ xvpackev.d xr9, xr13, xr12
+ xvdp2.h.bu.b xr10, xr6, xr5
+ xvdp2.h.bu.b xr11, xr7, xr5
+ xvdp2.h.bu.b xr12, xr8, xr5
+ xvdp2.h.bu.b xr13, xr9, xr5
+ xvhaddw.d.h xr10
+ xvhaddw.d.h xr11
+ xvhaddw.d.h xr12
+ xvhaddw.d.h xr13
+ xvpickev.w xr10, xr11, xr10
+ xvpickev.w xr11, xr13, xr12
+ xvmulwev.w.h xr10, xr10, xr1
+ xvmulwev.w.h xr11, xr11, xr1
+ xvadd.w xr10, xr10, xr2
+ xvadd.w xr11, xr11, xr2
+ xvsra.w xr10, xr10, xr3
+ xvsra.w xr11, xr11, xr3
+ xvadd.w xr10, xr10, xr4
+ xvadd.w xr11, xr11, xr4
+ xvssrani.h.w xr11, xr10, 0
+ xvpermi.q \out0, xr11, 0x01
+ xvssrani.bu.h \out0, xr11, 0
+.endm
+
+function ff_hevc_put_hevc_qpel_uni_w_h4_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ addi.d a2, a2, -3 //src -= 3
+.LOOP_H4:
+ vld vr18, a2, 0
+ vldx vr19, a2, a3
+ alsl.d a2, a3, a2, 1
+ vbsrl.v vr6, vr18, 1
+ vbsrl.v vr7, vr18, 2
+ vbsrl.v vr8, vr18, 3
+ vbsrl.v vr9, vr19, 1
+ vbsrl.v vr10, vr19, 2
+ vbsrl.v vr11, vr19, 3
+ vilvl.d vr6, vr6, vr18
+ vilvl.d vr7, vr8, vr7
+ vilvl.d vr8, vr9, vr19
+ vilvl.d vr9, vr11, vr10
+ vdp2.h.bu.b vr10, vr6, vr5
+ vdp2.h.bu.b vr11, vr7, vr5
+ vdp2.h.bu.b vr12, vr8, vr5
+ vdp2.h.bu.b vr13, vr9, vr5
+ vhaddw.d.h vr10
+ vhaddw.d.h vr11
+ vhaddw.d.h vr12
+ vhaddw.d.h vr13
+ vpickev.w vr10, vr11, vr10
+ vpickev.w vr11, vr13, vr12
+ vmulwev.w.h vr10, vr10, vr1
+ vmulwev.w.h vr11, vr11, vr1
+ vadd.w vr10, vr10, vr2
+ vadd.w vr11, vr11, vr2
+ vsra.w vr10, vr10, vr3
+ vsra.w vr11, vr11, vr3
+ vadd.w vr10, vr10, vr4
+ vadd.w vr11, vr11, vr4
+ vssrani.h.w vr11, vr10, 0
+ vssrani.bu.h vr11, vr11, 0
+ fst.s f11, a0, 0
+ vbsrl.v vr11, vr11, 4
+ fstx.s f11, a0, a1
+ alsl.d a0, a1, a0, 1
+ addi.d a4, a4, -2
+ bnez a4, .LOOP_H4
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_h4_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ xvreplve0.q xr5, xr5
+ addi.d a2, a2, -3 //src -= 3
+.LOOP_H4_LASX:
+ vld vr18, a2, 0
+ vldx vr19, a2, a3
+ alsl.d a2, a3, a2, 1
+ xvpermi.q xr18, xr19, 0x02
+ xvbsrl.v xr6, xr18, 1
+ xvbsrl.v xr7, xr18, 2
+ xvbsrl.v xr8, xr18, 3
+ xvpackev.d xr6, xr6, xr18
+ xvpackev.d xr7, xr8, xr7
+ xvdp2.h.bu.b xr10, xr6, xr5
+ xvdp2.h.bu.b xr11, xr7, xr5
+ xvhaddw.d.h xr10
+ xvhaddw.d.h xr11
+ xvpickev.w xr10, xr11, xr10
+ xvmulwev.w.h xr10, xr10, xr1
+ xvadd.w xr10, xr10, xr2
+ xvsra.w xr10, xr10, xr3
+ xvadd.w xr10, xr10, xr4
+ xvpermi.q xr11, xr10, 0x01
+ vssrani.h.w vr11, vr10, 0
+ vssrani.bu.h vr11, vr11, 0
+ fst.s f11, a0, 0
+ vbsrl.v vr11, vr11, 4
+ fstx.s f11, a0, a1
+ alsl.d a0, a1, a0, 1
+ addi.d a4, a4, -2
+ bnez a4, .LOOP_H4_LASX
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_h6_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ addi.d a2, a2, -3 //src -= 3
+.LOOP_H6:
+ vld vr6, a2, 0
+ add.d a2, a2, a3
+ PUT_HEVC_QPEL_UNI_W_H8_LSX vr6, vr10, vr11
+ vssrani.h.w vr11, vr10, 0
+ vssrani.bu.h vr11, vr11, 0
+ fst.s f11, a0, 0
+ vstelm.h vr11, a0, 4, 2
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_H6
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_h6_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ xvreplve0.q xr5, xr5
+ addi.d a2, a2, -3 //src -= 3
+.LOOP_H6_LASX:
+ vld vr6, a2, 0
+ add.d a2, a2, a3
+ PUT_HEVC_QPEL_UNI_W_H8_LASX xr6, xr10
+ xvpermi.q xr11, xr10, 0x01
+ vssrani.h.w vr11, vr10, 0
+ vssrani.bu.h vr11, vr11, 0
+ fst.s f11, a0, 0
+ vstelm.h vr11, a0, 4, 2
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_H6_LASX
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_h8_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ addi.d a2, a2, -3 //src -= 3
+.LOOP_H8:
+ vld vr6, a2, 0
+ add.d a2, a2, a3
+ PUT_HEVC_QPEL_UNI_W_H8_LSX vr6, vr10, vr11
+ vssrani.h.w vr11, vr10, 0
+ vssrani.bu.h vr11, vr11, 0
+ fst.d f11, a0, 0
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_H8
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_h8_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ xvreplve0.q xr5, xr5
+ addi.d a2, a2, -3 //src -= 3
+.LOOP_H8_LASX:
+ vld vr6, a2, 0
+ add.d a2, a2, a3
+ PUT_HEVC_QPEL_UNI_W_H8_LASX xr6, xr10
+ xvpermi.q xr11, xr10, 0x01
+ vssrani.h.w vr11, vr10, 0
+ vssrani.bu.h vr11, vr11, 0
+ fst.d f11, a0, 0
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_H8_LASX
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_h12_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ addi.d a2, a2, -3 //src -= 3
+.LOOP_H12:
+ vld vr6, a2, 0
+ PUT_HEVC_QPEL_UNI_W_H8_LSX vr6, vr14, vr15
+ vld vr6, a2, 8
+ PUT_HEVC_QPEL_UNI_W_H8_LSX vr6, vr16, vr17
+ add.d a2, a2, a3
+ vssrani.h.w vr15, vr14, 0
+ vssrani.h.w vr17, vr16, 0
+ vssrani.bu.h vr17, vr15, 0
+ fst.d f17, a0, 0
+ vbsrl.v vr17, vr17, 8
+ fst.s f17, a0, 8
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_H12
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_h12_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ xvreplve0.q xr5, xr5
+ addi.d a2, a2, -3 //src -= 3
+.LOOP_H12_LASX:
+ xvld xr6, a2, 0
+ add.d a2, a2, a3
+ PUT_HEVC_QPEL_UNI_W_H16_LASX xr6, xr14
+ fst.d f14, a0, 0
+ vstelm.w vr14, a0, 8, 2
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_H12_LASX
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_h16_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ addi.d a2, a2, -3 //src -= 3
+.LOOP_H16:
+ vld vr6, a2, 0
+ PUT_HEVC_QPEL_UNI_W_H8_LSX vr6, vr14, vr15
+ vld vr6, a2, 8
+ PUT_HEVC_QPEL_UNI_W_H8_LSX vr6, vr16, vr17
+ add.d a2, a2, a3
+ vssrani.h.w vr15, vr14, 0
+ vssrani.h.w vr17, vr16, 0
+ vssrani.bu.h vr17, vr15, 0
+ vst vr17, a0, 0
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_H16
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_h16_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ xvreplve0.q xr5, xr5
+ addi.d a2, a2, -3 //src -= 3
+.LOOP_H16_LASX:
+ xvld xr6, a2, 0
+ add.d a2, a2, a3
+ PUT_HEVC_QPEL_UNI_W_H16_LASX xr6, xr10
+ vst vr10, a0, 0
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_H16_LASX
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_h24_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ addi.d a2, a2, -3 //src -= 3
+.LOOP_H24:
+ vld vr18, a2, 0
+ vld vr19, a2, 16
+ add.d a2, a2, a3
+ PUT_HEVC_QPEL_UNI_W_H8_LSX vr18, vr14, vr15
+ vshuf4i.d vr18, vr19, 0x09
+ PUT_HEVC_QPEL_UNI_W_H8_LSX vr18, vr16, vr17
+ vssrani.h.w vr15, vr14, 0
+ vssrani.h.w vr17, vr16, 0
+ vssrani.bu.h vr17, vr15, 0
+ vst vr17, a0, 0
+ PUT_HEVC_QPEL_UNI_W_H8_LSX vr19, vr14, vr15
+ vssrani.h.w vr15, vr14, 0
+ vssrani.bu.h vr15, vr15, 0
+ fst.d f15, a0, 16
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_H24
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_h24_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ xvreplve0.q xr5, xr5
+ addi.d a2, a2, -3 //src -= 3
+.LOOP_H24_LASX:
+ xvld xr18, a2, 0
+ add.d a2, a2, a3
+ PUT_HEVC_QPEL_UNI_W_H16_LASX xr18, xr20
+ xvpermi.q xr19, xr18, 0x01
+ vst vr20, a0, 0
+ PUT_HEVC_QPEL_UNI_W_H8_LASX xr19, xr20
+ xvpermi.q xr21, xr20, 0x01
+ vssrani.h.w vr21, vr20, 0
+ vssrani.bu.h vr21, vr21, 0
+ fst.d f21, a0, 16
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_H24_LASX
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_h32_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ addi.d a2, a2, -3 //src -= 3
+.LOOP_H32:
+ vld vr18, a2, 0
+ vld vr19, a2, 16
+ vld vr20, a2, 32
+ add.d a2, a2, a3
+ PUT_HEVC_QPEL_UNI_W_H8_LSX vr18, vr14, vr15
+ vshuf4i.d vr18, vr19, 0x09
+ PUT_HEVC_QPEL_UNI_W_H8_LSX vr18, vr16, vr17
+ vssrani.h.w vr15, vr14, 0
+ vssrani.h.w vr17, vr16, 0
+ vssrani.bu.h vr17, vr15, 0
+ vst vr17, a0, 0
+ PUT_HEVC_QPEL_UNI_W_H8_LSX vr19, vr14, vr15
+ vshuf4i.d vr19, vr20, 0x09
+ PUT_HEVC_QPEL_UNI_W_H8_LSX vr19, vr16, vr17
+ vssrani.h.w vr15, vr14, 0
+ vssrani.h.w vr17, vr16, 0
+ vssrani.bu.h vr17, vr15, 0
+ vst vr17, a0, 16
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_H32
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_h32_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ xvreplve0.q xr5, xr5
+ addi.d a2, a2, -3 //src -= 3
+.LOOP_H32_LASX:
+ xvld xr18, a2, 0
+ xvld xr19, a2, 16
+ add.d a2, a2, a3
+ PUT_HEVC_QPEL_UNI_W_H16_LASX xr18, xr20
+ PUT_HEVC_QPEL_UNI_W_H16_LASX xr19, xr21
+ xvpermi.q xr20, xr21, 0x02
+ xvst xr20, a0, 0
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_H32_LASX
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_h48_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ addi.d a2, a2, -3 //src -= 3
+.LOOP_H48:
+ vld vr18, a2, 0
+ vld vr19, a2, 16
+ vld vr20, a2, 32
+ vld vr21, a2, 48
+ add.d a2, a2, a3
+ PUT_HEVC_QPEL_UNI_W_H8_LSX vr18, vr14, vr15
+ vshuf4i.d vr18, vr19, 0x09
+ PUT_HEVC_QPEL_UNI_W_H8_LSX vr18, vr16, vr17
+ vssrani.h.w vr15, vr14, 0
+ vssrani.h.w vr17, vr16, 0
+ vssrani.bu.h vr17, vr15, 0
+ vst vr17, a0, 0
+ PUT_HEVC_QPEL_UNI_W_H8_LSX vr19, vr14, vr15
+ vshuf4i.d vr19, vr20, 0x09
+ PUT_HEVC_QPEL_UNI_W_H8_LSX vr19, vr16, vr17
+ vssrani.h.w vr15, vr14, 0
+ vssrani.h.w vr17, vr16, 0
+ vssrani.bu.h vr17, vr15, 0
+ vst vr17, a0, 16
+ PUT_HEVC_QPEL_UNI_W_H8_LSX vr20, vr14, vr15
+ vshuf4i.d vr20, vr21, 0x09
+ PUT_HEVC_QPEL_UNI_W_H8_LSX vr20, vr16, vr17
+ vssrani.h.w vr15, vr14, 0
+ vssrani.h.w vr17, vr16, 0
+ vssrani.bu.h vr17, vr15, 0
+ vst vr17, a0, 32
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_H48
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_h48_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ xvreplve0.q xr5, xr5
+ addi.d a2, a2, -3 //src -= 3
+.LOOP_H48_LASX:
+ xvld xr18, a2, 0
+ xvld xr19, a2, 32
+ add.d a2, a2, a3
+ PUT_HEVC_QPEL_UNI_W_H16_LASX xr18, xr20
+ xvpermi.q xr18, xr19, 0x03
+ PUT_HEVC_QPEL_UNI_W_H16_LASX xr18, xr21
+ xvpermi.q xr20, xr21, 0x02
+ xvst xr20, a0, 0
+ PUT_HEVC_QPEL_UNI_W_H16_LASX xr19, xr20
+ vst vr20, a0, 32
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_H48_LASX
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_h64_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ addi.d a2, a2, -3 //src -= 3
+.LOOP_H64:
+ vld vr18, a2, 0
+ vld vr19, a2, 16
+ vld vr20, a2, 32
+ vld vr21, a2, 48
+ vld vr22, a2, 64
+ add.d a2, a2, a3
+ PUT_HEVC_QPEL_UNI_W_H8_LSX vr18, vr14, vr15
+ vshuf4i.d vr18, vr19, 0x09
+ PUT_HEVC_QPEL_UNI_W_H8_LSX vr18, vr16, vr17
+ vssrani.h.w vr15, vr14, 0
+ vssrani.h.w vr17, vr16, 0
+ vssrani.bu.h vr17, vr15, 0
+ vst vr17, a0, 0
+ PUT_HEVC_QPEL_UNI_W_H8_LSX vr19, vr14, vr15
+ vshuf4i.d vr19, vr20, 0x09
+ PUT_HEVC_QPEL_UNI_W_H8_LSX vr19, vr16, vr17
+ vssrani.h.w vr15, vr14, 0
+ vssrani.h.w vr17, vr16, 0
+ vssrani.bu.h vr17, vr15, 0
+ vst vr17, a0, 16
+ PUT_HEVC_QPEL_UNI_W_H8_LSX vr20, vr14, vr15
+ vshuf4i.d vr20, vr21, 0x09
+ PUT_HEVC_QPEL_UNI_W_H8_LSX vr20, vr16, vr17
+ vssrani.h.w vr15, vr14, 0
+ vssrani.h.w vr17, vr16, 0
+ vssrani.bu.h vr17, vr15, 0
+ vst vr17, a0, 32
+ PUT_HEVC_QPEL_UNI_W_H8_LSX vr21, vr14, vr15
+ vshuf4i.d vr21, vr22, 0x09
+ PUT_HEVC_QPEL_UNI_W_H8_LSX vr21, vr16, vr17
+ vssrani.h.w vr15, vr14, 0
+ vssrani.h.w vr17, vr16, 0
+ vssrani.bu.h vr17, vr15, 0
+ vst vr17, a0, 48
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_H64
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_h64_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ xvreplve0.q xr5, xr5
+ addi.d a2, a2, -3 //src -= 3
+.LOOP_H64_LASX:
+ xvld xr18, a2, 0
+ xvld xr19, a2, 32
+ xvld xr20, a2, 64
+ add.d a2, a2, a3
+ PUT_HEVC_QPEL_UNI_W_H16_LASX xr18, xr21
+ xvpermi.q xr18, xr19, 0x03
+ PUT_HEVC_QPEL_UNI_W_H16_LASX xr18, xr22
+ xvpermi.q xr21, xr22, 0x02
+ xvst xr21, a0, 0
+ PUT_HEVC_QPEL_UNI_W_H16_LASX xr19, xr21
+ xvpermi.q xr19, xr20, 0x03
+ PUT_HEVC_QPEL_UNI_W_H16_LASX xr19, xr22
+ xvpermi.q xr21, xr22, 0x02
+ xvst xr21, a0, 32
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_H64_LASX
+endfunc
diff --git a/libavcodec/loongarch/hevcdsp_init_loongarch.c b/libavcodec/loongarch/hevcdsp_init_loongarch.c
index d0ee99d6b5..3cdb3fb2d7 100644
--- a/libavcodec/loongarch/hevcdsp_init_loongarch.c
+++ b/libavcodec/loongarch/hevcdsp_init_loongarch.c
@@ -188,6 +188,26 @@ void ff_hevc_dsp_init_loongarch(HEVCDSPContext *c, const int bit_depth)
c->put_hevc_qpel_uni_w[8][1][1] = ff_hevc_put_hevc_uni_w_qpel_hv48_8_lsx;
c->put_hevc_qpel_uni_w[9][1][1] = ff_hevc_put_hevc_uni_w_qpel_hv64_8_lsx;
+ c->put_hevc_qpel_uni_w[1][1][0] = ff_hevc_put_hevc_qpel_uni_w_v4_8_lsx;
+ c->put_hevc_qpel_uni_w[2][1][0] = ff_hevc_put_hevc_qpel_uni_w_v6_8_lsx;
+ c->put_hevc_qpel_uni_w[3][1][0] = ff_hevc_put_hevc_qpel_uni_w_v8_8_lsx;
+ c->put_hevc_qpel_uni_w[4][1][0] = ff_hevc_put_hevc_qpel_uni_w_v12_8_lsx;
+ c->put_hevc_qpel_uni_w[5][1][0] = ff_hevc_put_hevc_qpel_uni_w_v16_8_lsx;
+ c->put_hevc_qpel_uni_w[6][1][0] = ff_hevc_put_hevc_qpel_uni_w_v24_8_lsx;
+ c->put_hevc_qpel_uni_w[7][1][0] = ff_hevc_put_hevc_qpel_uni_w_v32_8_lsx;
+ c->put_hevc_qpel_uni_w[8][1][0] = ff_hevc_put_hevc_qpel_uni_w_v48_8_lsx;
+ c->put_hevc_qpel_uni_w[9][1][0] = ff_hevc_put_hevc_qpel_uni_w_v64_8_lsx;
+
+ c->put_hevc_qpel_uni_w[1][0][1] = ff_hevc_put_hevc_qpel_uni_w_h4_8_lsx;
+ c->put_hevc_qpel_uni_w[2][0][1] = ff_hevc_put_hevc_qpel_uni_w_h6_8_lsx;
+ c->put_hevc_qpel_uni_w[3][0][1] = ff_hevc_put_hevc_qpel_uni_w_h8_8_lsx;
+ c->put_hevc_qpel_uni_w[4][0][1] = ff_hevc_put_hevc_qpel_uni_w_h12_8_lsx;
+ c->put_hevc_qpel_uni_w[5][0][1] = ff_hevc_put_hevc_qpel_uni_w_h16_8_lsx;
+ c->put_hevc_qpel_uni_w[6][0][1] = ff_hevc_put_hevc_qpel_uni_w_h24_8_lsx;
+ c->put_hevc_qpel_uni_w[7][0][1] = ff_hevc_put_hevc_qpel_uni_w_h32_8_lsx;
+ c->put_hevc_qpel_uni_w[8][0][1] = ff_hevc_put_hevc_qpel_uni_w_h48_8_lsx;
+ c->put_hevc_qpel_uni_w[9][0][1] = ff_hevc_put_hevc_qpel_uni_w_h64_8_lsx;
+
c->sao_edge_filter[0] = ff_hevc_sao_edge_filter_8_lsx;
c->sao_edge_filter[1] = ff_hevc_sao_edge_filter_8_lsx;
c->sao_edge_filter[2] = ff_hevc_sao_edge_filter_8_lsx;
@@ -237,6 +257,24 @@ void ff_hevc_dsp_init_loongarch(HEVCDSPContext *c, const int bit_depth)
c->put_hevc_epel_uni_w[7][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels32_8_lasx;
c->put_hevc_epel_uni_w[8][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels48_8_lasx;
c->put_hevc_epel_uni_w[9][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels64_8_lasx;
+
+ c->put_hevc_qpel_uni_w[3][1][0] = ff_hevc_put_hevc_qpel_uni_w_v8_8_lasx;
+ c->put_hevc_qpel_uni_w[4][1][0] = ff_hevc_put_hevc_qpel_uni_w_v12_8_lasx;
+ c->put_hevc_qpel_uni_w[5][1][0] = ff_hevc_put_hevc_qpel_uni_w_v16_8_lasx;
+ c->put_hevc_qpel_uni_w[6][1][0] = ff_hevc_put_hevc_qpel_uni_w_v24_8_lasx;
+ c->put_hevc_qpel_uni_w[7][1][0] = ff_hevc_put_hevc_qpel_uni_w_v32_8_lasx;
+ c->put_hevc_qpel_uni_w[8][1][0] = ff_hevc_put_hevc_qpel_uni_w_v48_8_lasx;
+ c->put_hevc_qpel_uni_w[9][1][0] = ff_hevc_put_hevc_qpel_uni_w_v64_8_lasx;
+
+ c->put_hevc_qpel_uni_w[1][0][1] = ff_hevc_put_hevc_qpel_uni_w_h4_8_lasx;
+ c->put_hevc_qpel_uni_w[2][0][1] = ff_hevc_put_hevc_qpel_uni_w_h6_8_lasx;
+ c->put_hevc_qpel_uni_w[3][0][1] = ff_hevc_put_hevc_qpel_uni_w_h8_8_lasx;
+ c->put_hevc_qpel_uni_w[4][0][1] = ff_hevc_put_hevc_qpel_uni_w_h12_8_lasx;
+ c->put_hevc_qpel_uni_w[5][0][1] = ff_hevc_put_hevc_qpel_uni_w_h16_8_lasx;
+ c->put_hevc_qpel_uni_w[6][0][1] = ff_hevc_put_hevc_qpel_uni_w_h24_8_lasx;
+ c->put_hevc_qpel_uni_w[7][0][1] = ff_hevc_put_hevc_qpel_uni_w_h32_8_lasx;
+ c->put_hevc_qpel_uni_w[8][0][1] = ff_hevc_put_hevc_qpel_uni_w_h48_8_lasx;
+ c->put_hevc_qpel_uni_w[9][0][1] = ff_hevc_put_hevc_qpel_uni_w_h64_8_lasx;
}
}
}
diff --git a/libavcodec/loongarch/hevcdsp_lasx.h b/libavcodec/loongarch/hevcdsp_lasx.h
index 819c3c3ecf..8a9266d375 100644
--- a/libavcodec/loongarch/hevcdsp_lasx.h
+++ b/libavcodec/loongarch/hevcdsp_lasx.h
@@ -48,6 +48,24 @@ PEL_UNI_W(pel, pixels, 32);
PEL_UNI_W(pel, pixels, 48);
PEL_UNI_W(pel, pixels, 64);
+PEL_UNI_W(qpel, v, 8);
+PEL_UNI_W(qpel, v, 12);
+PEL_UNI_W(qpel, v, 16);
+PEL_UNI_W(qpel, v, 24);
+PEL_UNI_W(qpel, v, 32);
+PEL_UNI_W(qpel, v, 48);
+PEL_UNI_W(qpel, v, 64);
+
+PEL_UNI_W(qpel, h, 4);
+PEL_UNI_W(qpel, h, 6);
+PEL_UNI_W(qpel, h, 8);
+PEL_UNI_W(qpel, h, 12);
+PEL_UNI_W(qpel, h, 16);
+PEL_UNI_W(qpel, h, 24);
+PEL_UNI_W(qpel, h, 32);
+PEL_UNI_W(qpel, h, 48);
+PEL_UNI_W(qpel, h, 64);
+
#undef PEL_UNI_W
#endif // #ifndef AVCODEC_LOONGARCH_HEVCDSP_LASX_H
diff --git a/libavcodec/loongarch/hevcdsp_lsx.h b/libavcodec/loongarch/hevcdsp_lsx.h
index 0d724a90ef..3291294ed9 100644
--- a/libavcodec/loongarch/hevcdsp_lsx.h
+++ b/libavcodec/loongarch/hevcdsp_lsx.h
@@ -257,6 +257,26 @@ PEL_UNI_W(pel, pixels, 32);
PEL_UNI_W(pel, pixels, 48);
PEL_UNI_W(pel, pixels, 64);
+PEL_UNI_W(qpel, v, 4);
+PEL_UNI_W(qpel, v, 6);
+PEL_UNI_W(qpel, v, 8);
+PEL_UNI_W(qpel, v, 12);
+PEL_UNI_W(qpel, v, 16);
+PEL_UNI_W(qpel, v, 24);
+PEL_UNI_W(qpel, v, 32);
+PEL_UNI_W(qpel, v, 48);
+PEL_UNI_W(qpel, v, 64);
+
+PEL_UNI_W(qpel, h, 4);
+PEL_UNI_W(qpel, h, 6);
+PEL_UNI_W(qpel, h, 8);
+PEL_UNI_W(qpel, h, 12);
+PEL_UNI_W(qpel, h, 16);
+PEL_UNI_W(qpel, h, 24);
+PEL_UNI_W(qpel, h, 32);
+PEL_UNI_W(qpel, h, 48);
+PEL_UNI_W(qpel, h, 64);
+
#undef PEL_UNI_W
#endif // #ifndef AVCODEC_LOONGARCH_HEVCDSP_LSX_H
--
2.20.1
* [FFmpeg-devel] [PATCH v1 5/6] avcodec/hevc: Add epel_uni_w_hv4/6/8/12/16/24/32/48/64 asm opt
2023-12-22 10:52 [FFmpeg-devel] [PATCH v1] [loongarch] Add hevc 128-bit & 256-bit asm optimizations jinbo
` (3 preceding siblings ...)
2023-12-22 10:52 ` [FFmpeg-devel] [PATCH v1 4/6] avcodec/hevc: Add qpel_uni_w_v|h4/6/8/12/16/24/32/48/64 " jinbo
@ 2023-12-22 10:52 ` jinbo
2023-12-22 10:52 ` [FFmpeg-devel] [PATCH v1 6/6] avcodec/hevc: Add asm opt for the following functions jinbo
5 siblings, 0 replies; 7+ messages in thread
From: jinbo @ 2023-12-22 10:52 UTC (permalink / raw)
To: ffmpeg-devel; +Cc: jinbo
tests/checkasm/checkasm: C LSX LASX
put_hevc_epel_uni_w_hv4_8_c: 9.5 2.2
put_hevc_epel_uni_w_hv6_8_c: 18.5 5.0 3.7
put_hevc_epel_uni_w_hv8_8_c: 30.7 6.0 4.5
put_hevc_epel_uni_w_hv12_8_c: 63.7 14.0 10.7
put_hevc_epel_uni_w_hv16_8_c: 107.5 22.7 17.0
put_hevc_epel_uni_w_hv24_8_c: 236.7 50.2 31.7
put_hevc_epel_uni_w_hv32_8_c: 414.5 88.0 53.0
put_hevc_epel_uni_w_hv48_8_c: 917.5 197.7 118.5
put_hevc_epel_uni_w_hv64_8_c: 1617.0 349.5 203.0
After this patch, the performance of decoding H265 4K 30FPS 30Mbps
on 3A6000 with 8 threads improves by about 3 fps (52fps-->55fps).
Change-Id: If067e394cec4685c62193e7adb829ac93ba4804d
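For reference, here is a rough scalar C sketch of what the epel_uni_w_hv
kernels compute, pieced together from the comments in the assembly below
(4-tap EPEL filter horizontally into a 16-bit tmp buffer, 4-tap filter
vertically over tmp, then >> 6, * wx, + offset, >> shift, + ox and clip).
The function name, signature, tmp layout and the offset term are
illustrative assumptions, not FFmpeg's template code:

#include <stdint.h>
#include <stddef.h>

#define MAX_PB_SIZE 64

static uint8_t clip_u8(int v) { return v < 0 ? 0 : v > 255 ? 255 : v; }

/* fh = ff_hevc_epel_filters[mx - 1], fv = ff_hevc_epel_filters[my - 1] */
static void epel_uni_w_hv_sketch(uint8_t *dst, ptrdiff_t dststride,
                                 const uint8_t *src, ptrdiff_t srcstride,
                                 int width, int height, int denom,
                                 int wx, int ox,
                                 const int8_t *fh, const int8_t *fv)
{
    int16_t tmp_buf[(MAX_PB_SIZE + 3) * MAX_PB_SIZE];
    int16_t *tmp = tmp_buf;
    int shift  = denom + 6;          /* LOAD_VAR: denom + 6 for 8-bit */
    int offset = 1 << (shift - 1);   /* assumed rounding term, loaded by LOAD_VAR */

    src -= srcstride + 1;            /* one row up, one column left (EPEL_EXTRA) */
    for (int y = 0; y < height + 3; y++) {   /* horizontal 4-tap pass */
        for (int x = 0; x < width; x++)
            tmp[x] = fh[0] * src[x]     + fh[1] * src[x + 1] +
                     fh[2] * src[x + 2] + fh[3] * src[x + 3];
        src += srcstride;
        tmp += MAX_PB_SIZE;
    }

    tmp = tmp_buf + MAX_PB_SIZE;             /* vertical 4-tap pass + weighting */
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            int v = fv[0] * tmp[x - MAX_PB_SIZE] + fv[1] * tmp[x] +
                    fv[2] * tmp[x + MAX_PB_SIZE] + fv[3] * tmp[x + 2 * MAX_PB_SIZE];
            dst[x] = clip_u8(((((v >> 6) * wx) + offset) >> shift) + ox);
        }
        dst += dststride;
        tmp += MAX_PB_SIZE;
    }
}

The LSX/LASX macros below compute the same thing 4, 8 or 16 columns at a
time, keeping the last three filtered rows live across loop iterations
(the "back up previous value" moves).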
---
libavcodec/loongarch/hevc_mc.S | 821 ++++++++++++++++++
libavcodec/loongarch/hevcdsp_init_loongarch.c | 19 +
libavcodec/loongarch/hevcdsp_lasx.h | 9 +
libavcodec/loongarch/hevcdsp_lsx.h | 10 +
4 files changed, 859 insertions(+)
diff --git a/libavcodec/loongarch/hevc_mc.S b/libavcodec/loongarch/hevc_mc.S
index 2ee338fb8e..0b0647546b 100644
--- a/libavcodec/loongarch/hevc_mc.S
+++ b/libavcodec/loongarch/hevc_mc.S
@@ -22,6 +22,7 @@
#include "loongson_asm.S"
.extern ff_hevc_qpel_filters
+.extern ff_hevc_epel_filters
.macro LOAD_VAR bit
addi.w t1, a5, 6 //shift
@@ -206,6 +207,12 @@
.endif
.endm
+/*
+ * void FUNC(put_hevc_pel_uni_w_pixels)(uint8_t *_dst, ptrdiff_t _dststride,
+ * const uint8_t *_src, ptrdiff_t _srcstride,
+ * int height, int denom, int wx, int ox,
+ * intptr_t mx, intptr_t my, int width)
+ */
function ff_hevc_put_hevc_pel_uni_w_pixels4_8_lsx
LOAD_VAR 128
srli.w t0, a4, 1
@@ -482,6 +489,12 @@ endfunc
xvhaddw.d.w \in0, \in0, \in0
.endm
+/*
+ * void FUNC(put_hevc_qpel_uni_w_v)(uint8_t *_dst, ptrdiff_t _dststride,
+ * const uint8_t *_src, ptrdiff_t _srcstride,
+ * int height, int denom, int wx, int ox,
+ * intptr_t mx, intptr_t my, int width)
+ */
function ff_hevc_put_hevc_qpel_uni_w_v4_8_lsx
LOAD_VAR 128
ld.d t0, sp, 8 //my
@@ -1253,6 +1266,12 @@ endfunc
xvssrani.bu.h \out0, xr11, 0
.endm
+/*
+ * void FUNC(put_hevc_qpel_uni_w_h)(uint8_t *_dst, ptrdiff_t _dststride,
+ * const uint8_t *_src, ptrdiff_t _srcstride,
+ * int height, int denom, int wx, int ox,
+ * intptr_t mx, intptr_t my, int width)
+ */
function ff_hevc_put_hevc_qpel_uni_w_h4_8_lsx
LOAD_VAR 128
ld.d t0, sp, 0 //mx
@@ -1763,3 +1782,805 @@ function ff_hevc_put_hevc_qpel_uni_w_h64_8_lasx
addi.d a4, a4, -1
bnez a4, .LOOP_H64_LASX
endfunc
+
+const shufb
+ .byte 0,1,2,3, 1,2,3,4 ,2,3,4,5, 3,4,5,6
+ .byte 4,5,6,7, 5,6,7,8 ,6,7,8,9, 7,8,9,10
+endconst
+
+.macro PUT_HEVC_EPEL_UNI_W_HV4_LSX w
+ fld.d f7, a2, 0 // start to load src
+ fldx.d f8, a2, a3
+ alsl.d a2, a3, a2, 1
+ fld.d f9, a2, 0
+ vshuf.b vr7, vr7, vr7, vr0 // 0123 1234 2345 3456
+ vshuf.b vr8, vr8, vr8, vr0
+ vshuf.b vr9, vr9, vr9, vr0
+ vdp2.h.bu.b vr10, vr7, vr5 // EPEL_FILTER(src, 1)
+ vdp2.h.bu.b vr11, vr8, vr5
+ vdp2.h.bu.b vr12, vr9, vr5
+ vhaddw.w.h vr10, vr10, vr10 // tmp[0/1/2/3]
+ vhaddw.w.h vr11, vr11, vr11 // vr10,vr11,vr12 corresponding to EPEL_EXTRA
+ vhaddw.w.h vr12, vr12, vr12
+.LOOP_HV4_\w:
+ add.d a2, a2, a3
+ fld.d f14, a2, 0 // height loop begin
+ vshuf.b vr14, vr14, vr14, vr0
+ vdp2.h.bu.b vr13, vr14, vr5
+ vhaddw.w.h vr13, vr13, vr13
+ vmul.w vr14, vr10, vr16 // EPEL_FILTER(tmp, MAX_PB_SIZE)
+ vmadd.w vr14, vr11, vr17
+ vmadd.w vr14, vr12, vr18
+ vmadd.w vr14, vr13, vr19
+ vaddi.wu vr10, vr11, 0 //back up previous value
+ vaddi.wu vr11, vr12, 0
+ vaddi.wu vr12, vr13, 0
+ vsrai.w vr14, vr14, 6 // >> 6
+ vmul.w vr14, vr14, vr1 // * wx
+ vadd.w vr14, vr14, vr2 // + offset
+ vsra.w vr14, vr14, vr3 // >> shift
+ vadd.w vr14, vr14, vr4 // + ox
+ vssrani.h.w vr14, vr14, 0
+ vssrani.bu.h vr14, vr14, 0 // clip
+ fst.s f14, a0, 0
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_HV4_\w
+.endm
+
+/*
+ * void FUNC(put_hevc_epel_uni_w_hv)(uint8_t *_dst, ptrdiff_t _dststride,
+ * const uint8_t *_src, ptrdiff_t _srcstride,
+ * int height, int denom, int wx, int ox,
+ * intptr_t mx, intptr_t my, int width)
+ */
+function ff_hevc_put_hevc_epel_uni_w_hv4_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 0 // mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr5, t1, t0 // ff_hevc_epel_filters[mx - 1];
+ vreplvei.w vr5, vr5, 0
+ ld.d t0, sp, 8 // my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ vldx vr6, t1, t0 // ff_hevc_epel_filters[my - 1];
+ vsllwil.h.b vr6, vr6, 0
+ vsllwil.w.h vr6, vr6, 0
+ vreplvei.w vr16, vr6, 0
+ vreplvei.w vr17, vr6, 1
+ vreplvei.w vr18, vr6, 2
+ vreplvei.w vr19, vr6, 3
+ la.local t1, shufb
+ vld vr0, t1, 0
+ sub.d a2, a2, a3 // src -= srcstride
+ addi.d a2, a2, -1
+ PUT_HEVC_EPEL_UNI_W_HV4_LSX 4
+endfunc
+
+.macro PUT_HEVC_EPEL_UNI_W_HV8_LSX w
+ vld vr7, a2, 0 // start to load src
+ vldx vr8, a2, a3
+ alsl.d a2, a3, a2, 1
+ vld vr9, a2, 0
+ vshuf.b vr10, vr7, vr7, vr0 // 0123 1234 2345 3456
+ vshuf.b vr11, vr8, vr8, vr0
+ vshuf.b vr12, vr9, vr9, vr0
+ vshuf.b vr7, vr7, vr7, vr22// 4567 5678 6789 78910
+ vshuf.b vr8, vr8, vr8, vr22
+ vshuf.b vr9, vr9, vr9, vr22
+ vdp2.h.bu.b vr13, vr10, vr5 // EPEL_FILTER(src, 1)
+ vdp2.h.bu.b vr14, vr11, vr5
+ vdp2.h.bu.b vr15, vr12, vr5
+ vdp2.h.bu.b vr23, vr7, vr5
+ vdp2.h.bu.b vr20, vr8, vr5
+ vdp2.h.bu.b vr21, vr9, vr5
+ vhaddw.w.h vr7, vr13, vr13
+ vhaddw.w.h vr8, vr14, vr14
+ vhaddw.w.h vr9, vr15, vr15
+ vhaddw.w.h vr10, vr23, vr23
+ vhaddw.w.h vr11, vr20, vr20
+ vhaddw.w.h vr12, vr21, vr21
+.LOOP_HV8_HORI_\w:
+ add.d a2, a2, a3
+ vld vr15, a2, 0
+ vshuf.b vr23, vr15, vr15, vr0
+ vshuf.b vr15, vr15, vr15, vr22
+ vdp2.h.bu.b vr13, vr23, vr5
+ vdp2.h.bu.b vr14, vr15, vr5
+ vhaddw.w.h vr13, vr13, vr13 //789--13
+ vhaddw.w.h vr14, vr14, vr14 //101112--14
+ vmul.w vr15, vr7, vr16 //EPEL_FILTER(tmp, MAX_PB_SIZE)
+ vmadd.w vr15, vr8, vr17
+ vmadd.w vr15, vr9, vr18
+ vmadd.w vr15, vr13, vr19
+ vmul.w vr20, vr10, vr16
+ vmadd.w vr20, vr11, vr17
+ vmadd.w vr20, vr12, vr18
+ vmadd.w vr20, vr14, vr19
+ vaddi.wu vr7, vr8, 0 //back up previous value
+ vaddi.wu vr8, vr9, 0
+ vaddi.wu vr9, vr13, 0
+ vaddi.wu vr10, vr11, 0
+ vaddi.wu vr11, vr12, 0
+ vaddi.wu vr12, vr14, 0
+ vsrai.w vr15, vr15, 6 // >> 6
+ vsrai.w vr20, vr20, 6
+ vmul.w vr15, vr15, vr1 // * wx
+ vmul.w vr20, vr20, vr1
+ vadd.w vr15, vr15, vr2 // + offset
+ vadd.w vr20, vr20, vr2
+ vsra.w vr15, vr15, vr3 // >> shift
+ vsra.w vr20, vr20, vr3
+ vadd.w vr15, vr15, vr4 // + ox
+ vadd.w vr20, vr20, vr4
+ vssrani.h.w vr20, vr15, 0
+ vssrani.bu.h vr20, vr20, 0
+.if \w > 6
+ fst.d f20, a0, 0
+.else
+ fst.s f20, a0, 0
+ vstelm.h vr20, a0, 4, 2
+.endif
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_HV8_HORI_\w
+.endm
+
+.macro PUT_HEVC_EPEL_UNI_W_HV8_LASX w
+ vld vr7, a2, 0 // start to load src
+ vldx vr8, a2, a3
+ alsl.d a2, a3, a2, 1
+ vld vr9, a2, 0
+ xvreplve0.q xr7, xr7
+ xvreplve0.q xr8, xr8
+ xvreplve0.q xr9, xr9
+ xvshuf.b xr10, xr7, xr7, xr0 // 0123 1234 2345 3456
+ xvshuf.b xr11, xr8, xr8, xr0
+ xvshuf.b xr12, xr9, xr9, xr0
+ xvdp2.h.bu.b xr13, xr10, xr5 // EPEL_FILTER(src, 1)
+ xvdp2.h.bu.b xr14, xr11, xr5
+ xvdp2.h.bu.b xr15, xr12, xr5
+ xvhaddw.w.h xr7, xr13, xr13
+ xvhaddw.w.h xr8, xr14, xr14
+ xvhaddw.w.h xr9, xr15, xr15
+.LOOP_HV8_HORI_LASX_\w:
+ add.d a2, a2, a3
+ vld vr15, a2, 0
+ xvreplve0.q xr15, xr15
+ xvshuf.b xr23, xr15, xr15, xr0
+ xvdp2.h.bu.b xr10, xr23, xr5
+ xvhaddw.w.h xr10, xr10, xr10
+ xvmul.w xr15, xr7, xr16 //EPEL_FILTER(tmp, MAX_PB_SIZE)
+ xvmadd.w xr15, xr8, xr17
+ xvmadd.w xr15, xr9, xr18
+ xvmadd.w xr15, xr10, xr19
+ xvaddi.wu xr7, xr8, 0 //back up previous value
+ xvaddi.wu xr8, xr9, 0
+ xvaddi.wu xr9, xr10, 0
+ xvsrai.w xr15, xr15, 6 // >> 6
+ xvmul.w xr15, xr15, xr1 // * wx
+ xvadd.w xr15, xr15, xr2 // + offset
+ xvsra.w xr15, xr15, xr3 // >> shift
+ xvadd.w xr15, xr15, xr4 // + ox
+ xvpermi.q xr20, xr15, 0x01
+ vssrani.h.w vr20, vr15, 0
+ vssrani.bu.h vr20, vr20, 0
+.if \w > 6
+ fst.d f20, a0, 0
+.else
+ fst.s f20, a0, 0
+ vstelm.h vr20, a0, 4, 2
+.endif
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_HV8_HORI_LASX_\w
+.endm
+
+.macro PUT_HEVC_EPEL_UNI_W_HV16_LASX w
+ xvld xr7, a2, 0 // start to load src
+ xvldx xr8, a2, a3
+ alsl.d a2, a3, a2, 1
+ xvld xr9, a2, 0
+ xvpermi.d xr10, xr7, 0x09 //8..18
+ xvpermi.d xr11, xr8, 0x09
+ xvpermi.d xr12, xr9, 0x09
+ xvreplve0.q xr7, xr7
+ xvreplve0.q xr8, xr8
+ xvreplve0.q xr9, xr9
+ xvshuf.b xr13, xr7, xr7, xr0 // 0123 1234 2345 3456
+ xvshuf.b xr14, xr8, xr8, xr0
+ xvshuf.b xr15, xr9, xr9, xr0
+ xvdp2.h.bu.b xr20, xr13, xr5 // EPEL_FILTER(src, 1)
+ xvdp2.h.bu.b xr21, xr14, xr5
+ xvdp2.h.bu.b xr22, xr15, xr5
+ xvhaddw.w.h xr7, xr20, xr20
+ xvhaddw.w.h xr8, xr21, xr21
+ xvhaddw.w.h xr9, xr22, xr22
+ xvreplve0.q xr10, xr10
+ xvreplve0.q xr11, xr11
+ xvreplve0.q xr12, xr12
+ xvshuf.b xr13, xr10, xr10, xr0
+ xvshuf.b xr14, xr11, xr11, xr0
+ xvshuf.b xr15, xr12, xr12, xr0
+ xvdp2.h.bu.b xr20, xr13, xr5
+ xvdp2.h.bu.b xr21, xr14, xr5
+ xvdp2.h.bu.b xr22, xr15, xr5
+ xvhaddw.w.h xr10, xr20, xr20
+ xvhaddw.w.h xr11, xr21, xr21
+ xvhaddw.w.h xr12, xr22, xr22
+.LOOP_HV16_HORI_LASX_\w:
+ add.d a2, a2, a3
+ xvld xr15, a2, 0
+ xvpermi.d xr20, xr15, 0x09 //8...18
+ xvreplve0.q xr15, xr15
+ xvreplve0.q xr20, xr20
+ xvshuf.b xr21, xr15, xr15, xr0
+ xvshuf.b xr22, xr20, xr20, xr0
+ xvdp2.h.bu.b xr13, xr21, xr5
+ xvdp2.h.bu.b xr14, xr22, xr5
+ xvhaddw.w.h xr13, xr13, xr13
+ xvhaddw.w.h xr14, xr14, xr14
+ xvmul.w xr15, xr7, xr16 //EPEL_FILTER(tmp, MAX_PB_SIZE)
+ xvmadd.w xr15, xr8, xr17
+ xvmadd.w xr15, xr9, xr18
+ xvmadd.w xr15, xr13, xr19
+ xvmul.w xr20, xr10, xr16
+ xvmadd.w xr20, xr11, xr17
+ xvmadd.w xr20, xr12, xr18
+ xvmadd.w xr20, xr14, xr19
+ xvaddi.wu xr7, xr8, 0 //back up previous value
+ xvaddi.wu xr8, xr9, 0
+ xvaddi.wu xr9, xr13, 0
+ xvaddi.wu xr10, xr11, 0
+ xvaddi.wu xr11, xr12, 0
+ xvaddi.wu xr12, xr14, 0
+ xvsrai.w xr15, xr15, 6 // >> 6
+ xvsrai.w xr20, xr20, 6 // >> 6
+ xvmul.w xr15, xr15, xr1 // * wx
+ xvmul.w xr20, xr20, xr1 // * wx
+ xvadd.w xr15, xr15, xr2 // + offset
+ xvadd.w xr20, xr20, xr2 // + offset
+ xvsra.w xr15, xr15, xr3 // >> shift
+ xvsra.w xr20, xr20, xr3 // >> shift
+ xvadd.w xr15, xr15, xr4 // + ox
+ xvadd.w xr20, xr20, xr4 // + ox
+ xvssrani.h.w xr20, xr15, 0
+ xvpermi.q xr21, xr20, 0x01
+ vssrani.bu.h vr21, vr20, 0
+ vpermi.w vr21, vr21, 0xd8
+.if \w < 16
+ fst.d f21, a0, 0
+ vstelm.w vr21, a0, 8, 2
+.else
+ vst vr21, a0, 0
+.endif
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_HV16_HORI_LASX_\w
+.endm
+
+function ff_hevc_put_hevc_epel_uni_w_hv6_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 0 // mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr5, t1, t0 // ff_hevc_epel_filters[mx - 1];
+ vreplvei.w vr5, vr5, 0
+ ld.d t0, sp, 8 // my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ vldx vr6, t1, t0 // ff_hevc_epel_filters[my - 1];
+ vsllwil.h.b vr6, vr6, 0
+ vsllwil.w.h vr6, vr6, 0
+ vreplvei.w vr16, vr6, 0
+ vreplvei.w vr17, vr6, 1
+ vreplvei.w vr18, vr6, 2
+ vreplvei.w vr19, vr6, 3
+ la.local t1, shufb
+ vld vr0, t1, 0
+ vaddi.bu vr22, vr0, 4 // update shufb to get high part
+ sub.d a2, a2, a3 // src -= srcstride
+ addi.d a2, a2, -1
+ PUT_HEVC_EPEL_UNI_W_HV8_LSX 6
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_hv6_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 0 // mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr5, t1, t0 // ff_hevc_epel_filters[mx - 1];
+ xvreplve0.w xr5, xr5
+ ld.d t0, sp, 8 // my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ vldx vr6, t1, t0 // ff_hevc_epel_filters[my - 1];
+ vsllwil.h.b vr6, vr6, 0
+ vsllwil.w.h vr6, vr6, 0
+ xvreplve0.q xr6, xr6
+ xvrepl128vei.w xr16, xr6, 0
+ xvrepl128vei.w xr17, xr6, 1
+ xvrepl128vei.w xr18, xr6, 2
+ xvrepl128vei.w xr19, xr6, 3
+ la.local t1, shufb
+ xvld xr0, t1, 0
+ sub.d a2, a2, a3 // src -= srcstride
+ addi.d a2, a2, -1
+ PUT_HEVC_EPEL_UNI_W_HV8_LASX 6
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_hv8_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 0 // mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr5, t1, t0 // ff_hevc_epel_filters[mx - 1];
+ vreplvei.w vr5, vr5, 0
+ ld.d t0, sp, 8 // my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ vldx vr6, t1, t0 // ff_hevc_epel_filters[my - 1];
+ vsllwil.h.b vr6, vr6, 0
+ vsllwil.w.h vr6, vr6, 0
+ vreplvei.w vr16, vr6, 0
+ vreplvei.w vr17, vr6, 1
+ vreplvei.w vr18, vr6, 2
+ vreplvei.w vr19, vr6, 3
+ la.local t1, shufb
+ vld vr0, t1, 0
+ vaddi.bu vr22, vr0, 4 // update shufb to get high part
+ sub.d a2, a2, a3 // src -= srcstride
+ addi.d a2, a2, -1
+ PUT_HEVC_EPEL_UNI_W_HV8_LSX 8
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_hv8_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 0 // mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr5, t1, t0 // ff_hevc_epel_filters[mx - 1];
+ xvreplve0.w xr5, xr5
+ ld.d t0, sp, 8 // my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ vldx vr6, t1, t0 // ff_hevc_epel_filters[my - 1];
+ vsllwil.h.b vr6, vr6, 0
+ vsllwil.w.h vr6, vr6, 0
+ xvreplve0.q xr6, xr6
+ xvrepl128vei.w xr16, xr6, 0
+ xvrepl128vei.w xr17, xr6, 1
+ xvrepl128vei.w xr18, xr6, 2
+ xvrepl128vei.w xr19, xr6, 3
+ la.local t1, shufb
+ xvld xr0, t1, 0
+ sub.d a2, a2, a3 // src -= srcstride
+ addi.d a2, a2, -1
+ PUT_HEVC_EPEL_UNI_W_HV8_LASX 8
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_hv12_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 0 // mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr5, t1, t0 // ff_hevc_epel_filters[mx - 1];
+ vreplvei.w vr5, vr5, 0
+ ld.d t0, sp, 8 // my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ vldx vr6, t1, t0 // ff_hevc_epel_filters[my - 1];
+ vsllwil.h.b vr6, vr6, 0
+ vsllwil.w.h vr6, vr6, 0
+ vreplvei.w vr16, vr6, 0
+ vreplvei.w vr17, vr6, 1
+ vreplvei.w vr18, vr6, 2
+ vreplvei.w vr19, vr6, 3
+ la.local t1, shufb
+ vld vr0, t1, 0
+ vaddi.bu vr22, vr0, 4 // update shufb to get high part
+ sub.d a2, a2, a3 // src -= srcstride
+ addi.d a2, a2, -1
+ addi.d t2, a0, 0
+ addi.d t3, a2, 0
+ addi.d t4, a4, 0
+ PUT_HEVC_EPEL_UNI_W_HV8_LSX 12
+ addi.d a0, t2, 8
+ addi.d a2, t3, 8
+ addi.d a4, t4, 0
+ PUT_HEVC_EPEL_UNI_W_HV4_LSX 12
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_hv12_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 0 // mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr5, t1, t0 // ff_hevc_epel_filters[mx - 1];
+ xvreplve0.w xr5, xr5
+ ld.d t0, sp, 8 // my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ vldx vr6, t1, t0 // ff_hevc_epel_filters[my - 1];
+ vsllwil.h.b vr6, vr6, 0
+ vsllwil.w.h vr6, vr6, 0
+ xvreplve0.q xr6, xr6
+ xvrepl128vei.w xr16, xr6, 0
+ xvrepl128vei.w xr17, xr6, 1
+ xvrepl128vei.w xr18, xr6, 2
+ xvrepl128vei.w xr19, xr6, 3
+ la.local t1, shufb
+ xvld xr0, t1, 0
+ sub.d a2, a2, a3 // src -= srcstride
+ addi.d a2, a2, -1
+ PUT_HEVC_EPEL_UNI_W_HV16_LASX 12
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_hv16_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 0 // mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr5, t1, t0 // ff_hevc_epel_filters[mx - 1];
+ vreplvei.w vr5, vr5, 0
+ ld.d t0, sp, 8 // my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ vldx vr6, t1, t0 // ff_hevc_epel_filters[my - 1];
+ vsllwil.h.b vr6, vr6, 0
+ vsllwil.w.h vr6, vr6, 0
+ vreplvei.w vr16, vr6, 0
+ vreplvei.w vr17, vr6, 1
+ vreplvei.w vr18, vr6, 2
+ vreplvei.w vr19, vr6, 3
+ la.local t1, shufb
+ vld vr0, t1, 0
+ vaddi.bu vr22, vr0, 4 // update shufb to get high part
+ sub.d a2, a2, a3 // src -= srcstride
+ addi.d a2, a2, -1
+ addi.d t2, a0, 0
+ addi.d t3, a2, 0
+ addi.d t4, a4, 0
+ addi.d t5, zero, 2
+.LOOP_HV16:
+ PUT_HEVC_EPEL_UNI_W_HV8_LSX 16
+ addi.d a0, t2, 8
+ addi.d a2, t3, 8
+ addi.d a4, t4, 0
+ addi.d t5, t5, -1
+ bnez t5, .LOOP_HV16
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_hv16_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 0 // mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr5, t1, t0 // ff_hevc_epel_filters[mx - 1];
+ xvreplve0.w xr5, xr5
+ ld.d t0, sp, 8 // my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ vldx vr6, t1, t0 // ff_hevc_epel_filters[my - 1];
+ vsllwil.h.b vr6, vr6, 0
+ vsllwil.w.h vr6, vr6, 0
+ xvreplve0.q xr6, xr6
+ xvrepl128vei.w xr16, xr6, 0
+ xvrepl128vei.w xr17, xr6, 1
+ xvrepl128vei.w xr18, xr6, 2
+ xvrepl128vei.w xr19, xr6, 3
+ la.local t1, shufb
+ xvld xr0, t1, 0
+ sub.d a2, a2, a3 // src -= srcstride
+ addi.d a2, a2, -1
+ PUT_HEVC_EPEL_UNI_W_HV16_LASX 16
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_hv24_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 0 // mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr5, t1, t0 // ff_hevc_epel_filters[mx - 1];
+ vreplvei.w vr5, vr5, 0
+ ld.d t0, sp, 8 // my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ vldx vr6, t1, t0 // ff_hevc_epel_filters[my - 1];
+ vsllwil.h.b vr6, vr6, 0
+ vsllwil.w.h vr6, vr6, 0
+ vreplvei.w vr16, vr6, 0
+ vreplvei.w vr17, vr6, 1
+ vreplvei.w vr18, vr6, 2
+ vreplvei.w vr19, vr6, 3
+ la.local t1, shufb
+ vld vr0, t1, 0
+ vaddi.bu vr22, vr0, 4 // update shufb to get high part
+ sub.d a2, a2, a3 // src -= srcstride
+ addi.d a2, a2, -1
+ addi.d t2, a0, 0
+ addi.d t3, a2, 0
+ addi.d t4, a4, 0
+ addi.d t5, zero, 3
+.LOOP_HV24:
+ PUT_HEVC_EPEL_UNI_W_HV8_LSX 24
+ addi.d a0, t2, 8
+ addi.d t2, t2, 8
+ addi.d a2, t3, 8
+ addi.d t3, t3, 8
+ addi.d a4, t4, 0
+ addi.d t5, t5, -1
+ bnez t5, .LOOP_HV24
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_hv24_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 0 // mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr5, t1, t0 // ff_hevc_epel_filters[mx - 1];
+ xvreplve0.w xr5, xr5
+ ld.d t0, sp, 8 // my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ vldx vr6, t1, t0 // ff_hevc_epel_filters[my - 1];
+ vsllwil.h.b vr6, vr6, 0
+ vsllwil.w.h vr6, vr6, 0
+ xvreplve0.q xr6, xr6
+ xvrepl128vei.w xr16, xr6, 0
+ xvrepl128vei.w xr17, xr6, 1
+ xvrepl128vei.w xr18, xr6, 2
+ xvrepl128vei.w xr19, xr6, 3
+ la.local t1, shufb
+ xvld xr0, t1, 0
+ sub.d a2, a2, a3 // src -= srcstride
+ addi.d a2, a2, -1
+ addi.d t2, a0, 0
+ addi.d t3, a2, 0
+ addi.d t4, a4, 0
+ PUT_HEVC_EPEL_UNI_W_HV16_LASX 24
+ addi.d a0, t2, 16
+ addi.d a2, t3, 16
+ addi.d a4, t4, 0
+ PUT_HEVC_EPEL_UNI_W_HV8_LASX 24
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_hv32_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 0 // mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr5, t1, t0 // ff_hevc_epel_filters[mx - 1];
+ vreplvei.w vr5, vr5, 0
+ ld.d t0, sp, 8 // my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ vldx vr6, t1, t0 // ff_hevc_epel_filters[my - 1];
+ vsllwil.h.b vr6, vr6, 0
+ vsllwil.w.h vr6, vr6, 0
+ vreplvei.w vr16, vr6, 0
+ vreplvei.w vr17, vr6, 1
+ vreplvei.w vr18, vr6, 2
+ vreplvei.w vr19, vr6, 3
+ la.local t1, shufb
+ vld vr0, t1, 0
+ vaddi.bu vr22, vr0, 4 // update shufb to get high part
+ sub.d a2, a2, a3 // src -= srcstride
+ addi.d a2, a2, -1
+ addi.d t2, a0, 0
+ addi.d t3, a2, 0
+ addi.d t4, a4, 0
+ addi.d t5, zero, 4
+.LOOP_HV32:
+ PUT_HEVC_EPEL_UNI_W_HV8_LSX 32
+ addi.d a0, t2, 8
+ addi.d t2, t2, 8
+ addi.d a2, t3, 8
+ addi.d t3, t3, 8
+ addi.d a4, t4, 0
+ addi.d t5, t5, -1
+ bnez t5, .LOOP_HV32
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_hv32_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 0 // mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr5, t1, t0 // ff_hevc_epel_filters[mx - 1];
+ xvreplve0.w xr5, xr5
+ ld.d t0, sp, 8 // my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ vldx vr6, t1, t0 // ff_hevc_epel_filters[my - 1];
+ vsllwil.h.b vr6, vr6, 0
+ vsllwil.w.h vr6, vr6, 0
+ xvreplve0.q xr6, xr6
+ xvrepl128vei.w xr16, xr6, 0
+ xvrepl128vei.w xr17, xr6, 1
+ xvrepl128vei.w xr18, xr6, 2
+ xvrepl128vei.w xr19, xr6, 3
+ la.local t1, shufb
+ xvld xr0, t1, 0
+ sub.d a2, a2, a3 // src -= srcstride
+ addi.d a2, a2, -1
+ addi.d t2, a0, 0
+ addi.d t3, a2, 0
+ addi.d t4, a4, 0
+ addi.d t5, zero, 2
+.LOOP_HV32_LASX:
+ PUT_HEVC_EPEL_UNI_W_HV16_LASX 32
+ addi.d a0, t2, 16
+ addi.d t2, t2, 16
+ addi.d a2, t3, 16
+ addi.d t3, t3, 16
+ addi.d a4, t4, 0
+ addi.d t5, t5, -1
+ bnez t5, .LOOP_HV32_LASX
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_hv48_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 0 // mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr5, t1, t0 // ff_hevc_epel_filters[mx - 1];
+ vreplvei.w vr5, vr5, 0
+ ld.d t0, sp, 8 // my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ vldx vr6, t1, t0 // ff_hevc_epel_filters[my - 1];
+ vsllwil.h.b vr6, vr6, 0
+ vsllwil.w.h vr6, vr6, 0
+ vreplvei.w vr16, vr6, 0
+ vreplvei.w vr17, vr6, 1
+ vreplvei.w vr18, vr6, 2
+ vreplvei.w vr19, vr6, 3
+ la.local t1, shufb
+ vld vr0, t1, 0
+ vaddi.bu vr22, vr0, 4 // update shufb to get high part
+ sub.d a2, a2, a3 // src -= srcstride
+ addi.d a2, a2, -1
+ addi.d t2, a0, 0
+ addi.d t3, a2, 0
+ addi.d t4, a4, 0
+ addi.d t5, zero, 6
+.LOOP_HV48:
+ PUT_HEVC_EPEL_UNI_W_HV8_LSX 48
+ addi.d a0, t2, 8
+ addi.d t2, t2, 8
+ addi.d a2, t3, 8
+ addi.d t3, t3, 8
+ addi.d a4, t4, 0
+ addi.d t5, t5, -1
+ bnez t5, .LOOP_HV48
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_hv48_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 0 // mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr5, t1, t0 // ff_hevc_epel_filters[mx - 1];
+ xvreplve0.w xr5, xr5
+ ld.d t0, sp, 8 // my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ vldx vr6, t1, t0 // ff_hevc_epel_filters[my - 1];
+ vsllwil.h.b vr6, vr6, 0
+ vsllwil.w.h vr6, vr6, 0
+ xvreplve0.q xr6, xr6
+ xvrepl128vei.w xr16, xr6, 0
+ xvrepl128vei.w xr17, xr6, 1
+ xvrepl128vei.w xr18, xr6, 2
+ xvrepl128vei.w xr19, xr6, 3
+ la.local t1, shufb
+ xvld xr0, t1, 0
+ sub.d a2, a2, a3 // src -= srcstride
+ addi.d a2, a2, -1
+ addi.d t2, a0, 0
+ addi.d t3, a2, 0
+ addi.d t4, a4, 0
+ addi.d t5, zero, 3
+.LOOP_HV48_LASX:
+ PUT_HEVC_EPEL_UNI_W_HV16_LASX 48
+ addi.d a0, t2, 16
+ addi.d t2, t2, 16
+ addi.d a2, t3, 16
+ addi.d t3, t3, 16
+ addi.d a4, t4, 0
+ addi.d t5, t5, -1
+ bnez t5, .LOOP_HV48_LASX
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_hv64_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 0 // mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr5, t1, t0 // ff_hevc_epel_filters[mx - 1];
+ vreplvei.w vr5, vr5, 0
+ ld.d t0, sp, 8 // my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ vldx vr6, t1, t0 // ff_hevc_epel_filters[my - 1];
+ vsllwil.h.b vr6, vr6, 0
+ vsllwil.w.h vr6, vr6, 0
+ vreplvei.w vr16, vr6, 0
+ vreplvei.w vr17, vr6, 1
+ vreplvei.w vr18, vr6, 2
+ vreplvei.w vr19, vr6, 3
+ la.local t1, shufb
+ vld vr0, t1, 0
+ vaddi.bu vr22, vr0, 4 // update shufb to get high part
+ sub.d a2, a2, a3 // src -= srcstride
+ addi.d a2, a2, -1
+ addi.d t2, a0, 0
+ addi.d t3, a2, 0
+ addi.d t4, a4, 0
+ addi.d t5, zero, 8
+.LOOP_HV64:
+ PUT_HEVC_EPEL_UNI_W_HV8_LSX 64
+ addi.d a0, t2, 8
+ addi.d t2, t2, 8
+ addi.d a2, t3, 8
+ addi.d t3, t3, 8
+ addi.d a4, t4, 0
+ addi.d t5, t5, -1
+ bnez t5, .LOOP_HV64
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_hv64_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 0 // mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr5, t1, t0 // ff_hevc_epel_filters[mx - 1];
+ xvreplve0.w xr5, xr5
+ ld.d t0, sp, 8 // my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ vldx vr6, t1, t0 // ff_hevc_epel_filters[my - 1];
+ vsllwil.h.b vr6, vr6, 0
+ vsllwil.w.h vr6, vr6, 0
+ xvreplve0.q xr6, xr6
+ xvrepl128vei.w xr16, xr6, 0
+ xvrepl128vei.w xr17, xr6, 1
+ xvrepl128vei.w xr18, xr6, 2
+ xvrepl128vei.w xr19, xr6, 3
+ la.local t1, shufb
+ xvld xr0, t1, 0
+ sub.d a2, a2, a3 // src -= srcstride
+ addi.d a2, a2, -1
+ addi.d t2, a0, 0
+ addi.d t3, a2, 0
+ addi.d t4, a4, 0
+ addi.d t5, zero, 4
+.LOOP_HV64_LASX:
+ PUT_HEVC_EPEL_UNI_W_HV16_LASX 64
+ addi.d a0, t2, 16
+ addi.d t2, t2, 16
+ addi.d a2, t3, 16
+ addi.d t3, t3, 16
+ addi.d a4, t4, 0
+ addi.d t5, t5, -1
+ bnez t5, .LOOP_HV64_LASX
+endfunc
diff --git a/libavcodec/loongarch/hevcdsp_init_loongarch.c b/libavcodec/loongarch/hevcdsp_init_loongarch.c
index 3cdb3fb2d7..245a833947 100644
--- a/libavcodec/loongarch/hevcdsp_init_loongarch.c
+++ b/libavcodec/loongarch/hevcdsp_init_loongarch.c
@@ -171,6 +171,16 @@ void ff_hevc_dsp_init_loongarch(HEVCDSPContext *c, const int bit_depth)
c->put_hevc_qpel_uni_w[8][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels48_8_lsx;
c->put_hevc_qpel_uni_w[9][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels64_8_lsx;
+ c->put_hevc_epel_uni_w[1][1][1] = ff_hevc_put_hevc_epel_uni_w_hv4_8_lsx;
+ c->put_hevc_epel_uni_w[2][1][1] = ff_hevc_put_hevc_epel_uni_w_hv6_8_lsx;
+ c->put_hevc_epel_uni_w[3][1][1] = ff_hevc_put_hevc_epel_uni_w_hv8_8_lsx;
+ c->put_hevc_epel_uni_w[4][1][1] = ff_hevc_put_hevc_epel_uni_w_hv12_8_lsx;
+ c->put_hevc_epel_uni_w[5][1][1] = ff_hevc_put_hevc_epel_uni_w_hv16_8_lsx;
+ c->put_hevc_epel_uni_w[6][1][1] = ff_hevc_put_hevc_epel_uni_w_hv24_8_lsx;
+ c->put_hevc_epel_uni_w[7][1][1] = ff_hevc_put_hevc_epel_uni_w_hv32_8_lsx;
+ c->put_hevc_epel_uni_w[8][1][1] = ff_hevc_put_hevc_epel_uni_w_hv48_8_lsx;
+ c->put_hevc_epel_uni_w[9][1][1] = ff_hevc_put_hevc_epel_uni_w_hv64_8_lsx;
+
c->put_hevc_epel_uni_w[1][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels4_8_lsx;
c->put_hevc_epel_uni_w[2][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels6_8_lsx;
c->put_hevc_epel_uni_w[3][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels8_8_lsx;
@@ -258,6 +268,15 @@ void ff_hevc_dsp_init_loongarch(HEVCDSPContext *c, const int bit_depth)
c->put_hevc_epel_uni_w[8][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels48_8_lasx;
c->put_hevc_epel_uni_w[9][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels64_8_lasx;
+ c->put_hevc_epel_uni_w[2][1][1] = ff_hevc_put_hevc_epel_uni_w_hv6_8_lasx;
+ c->put_hevc_epel_uni_w[3][1][1] = ff_hevc_put_hevc_epel_uni_w_hv8_8_lasx;
+ c->put_hevc_epel_uni_w[4][1][1] = ff_hevc_put_hevc_epel_uni_w_hv12_8_lasx;
+ c->put_hevc_epel_uni_w[5][1][1] = ff_hevc_put_hevc_epel_uni_w_hv16_8_lasx;
+ c->put_hevc_epel_uni_w[6][1][1] = ff_hevc_put_hevc_epel_uni_w_hv24_8_lasx;
+ c->put_hevc_epel_uni_w[7][1][1] = ff_hevc_put_hevc_epel_uni_w_hv32_8_lasx;
+ c->put_hevc_epel_uni_w[8][1][1] = ff_hevc_put_hevc_epel_uni_w_hv48_8_lasx;
+ c->put_hevc_epel_uni_w[9][1][1] = ff_hevc_put_hevc_epel_uni_w_hv64_8_lasx;
+
c->put_hevc_qpel_uni_w[3][1][0] = ff_hevc_put_hevc_qpel_uni_w_v8_8_lasx;
c->put_hevc_qpel_uni_w[4][1][0] = ff_hevc_put_hevc_qpel_uni_w_v12_8_lasx;
c->put_hevc_qpel_uni_w[5][1][0] = ff_hevc_put_hevc_qpel_uni_w_v16_8_lasx;
diff --git a/libavcodec/loongarch/hevcdsp_lasx.h b/libavcodec/loongarch/hevcdsp_lasx.h
index 8a9266d375..7f09d0943a 100644
--- a/libavcodec/loongarch/hevcdsp_lasx.h
+++ b/libavcodec/loongarch/hevcdsp_lasx.h
@@ -66,6 +66,15 @@ PEL_UNI_W(qpel, h, 32);
PEL_UNI_W(qpel, h, 48);
PEL_UNI_W(qpel, h, 64);
+PEL_UNI_W(epel, hv, 6);
+PEL_UNI_W(epel, hv, 8);
+PEL_UNI_W(epel, hv, 12);
+PEL_UNI_W(epel, hv, 16);
+PEL_UNI_W(epel, hv, 24);
+PEL_UNI_W(epel, hv, 32);
+PEL_UNI_W(epel, hv, 48);
+PEL_UNI_W(epel, hv, 64);
+
#undef PEL_UNI_W
#endif // #ifndef AVCODEC_LOONGARCH_HEVCDSP_LASX_H
diff --git a/libavcodec/loongarch/hevcdsp_lsx.h b/libavcodec/loongarch/hevcdsp_lsx.h
index 3291294ed9..7769cf25ae 100644
--- a/libavcodec/loongarch/hevcdsp_lsx.h
+++ b/libavcodec/loongarch/hevcdsp_lsx.h
@@ -277,6 +277,16 @@ PEL_UNI_W(qpel, h, 32);
PEL_UNI_W(qpel, h, 48);
PEL_UNI_W(qpel, h, 64);
+PEL_UNI_W(epel, hv, 4);
+PEL_UNI_W(epel, hv, 6);
+PEL_UNI_W(epel, hv, 8);
+PEL_UNI_W(epel, hv, 12);
+PEL_UNI_W(epel, hv, 16);
+PEL_UNI_W(epel, hv, 24);
+PEL_UNI_W(epel, hv, 32);
+PEL_UNI_W(epel, hv, 48);
+PEL_UNI_W(epel, hv, 64);
+
#undef PEL_UNI_W
#endif // #ifndef AVCODEC_LOONGARCH_HEVCDSP_LSX_H
--
2.20.1
* [FFmpeg-devel] [PATCH v1 6/6] avcodec/hevc: Add asm opt for the following functions
2023-12-22 10:52 [FFmpeg-devel] [PATCH v1] [loongarch] Add hevc 128-bit & 256-bit asm optimizations jinbo
` (4 preceding siblings ...)
2023-12-22 10:52 ` [FFmpeg-devel] [PATCH v1 5/6] avcodec/hevc: Add epel_uni_w_hv4/6/8/12/16/24/32/48/64 " jinbo
@ 2023-12-22 10:52 ` jinbo
5 siblings, 0 replies; 7+ messages in thread
From: jinbo @ 2023-12-22 10:52 UTC (permalink / raw)
To: ffmpeg-devel; +Cc: jinbo
tests/checkasm/checkasm: C LSX LASX
put_hevc_qpel_uni_h4_8_c: 5.7 1.2
put_hevc_qpel_uni_h6_8_c: 12.2 2.7
put_hevc_qpel_uni_h8_8_c: 21.5 3.2
put_hevc_qpel_uni_h12_8_c: 47.2 9.2 7.2
put_hevc_qpel_uni_h16_8_c: 87.0 11.7 9.0
put_hevc_qpel_uni_h24_8_c: 188.2 27.5 21.0
put_hevc_qpel_uni_h32_8_c: 335.2 46.7 28.5
put_hevc_qpel_uni_h48_8_c: 772.5 104.5 65.2
put_hevc_qpel_uni_h64_8_c: 1383.2 142.2 109.0
put_hevc_epel_uni_w_v4_8_c: 5.0 1.5
put_hevc_epel_uni_w_v6_8_c: 10.7 3.5 2.5
put_hevc_epel_uni_w_v8_8_c: 18.2 3.7 3.0
put_hevc_epel_uni_w_v12_8_c: 40.2 10.7 7.5
put_hevc_epel_uni_w_v16_8_c: 70.2 13.0 9.2
put_hevc_epel_uni_w_v24_8_c: 158.2 30.2 22.5
put_hevc_epel_uni_w_v32_8_c: 281.0 52.0 36.5
put_hevc_epel_uni_w_v48_8_c: 631.7 116.7 82.7
put_hevc_epel_uni_w_v64_8_c: 1108.2 207.5 142.2
put_hevc_epel_uni_w_h4_8_c: 4.7 1.2
put_hevc_epel_uni_w_h6_8_c: 9.7 3.5 2.7
put_hevc_epel_uni_w_h8_8_c: 17.2 4.2 3.5
put_hevc_epel_uni_w_h12_8_c: 38.0 11.5 7.2
put_hevc_epel_uni_w_h16_8_c: 69.2 14.5 9.2
put_hevc_epel_uni_w_h24_8_c: 152.0 34.7 22.5
put_hevc_epel_uni_w_h32_8_c: 271.0 58.0 40.0
put_hevc_epel_uni_w_h48_8_c: 597.5 136.7 95.0
put_hevc_epel_uni_w_h64_8_c: 1074.0 252.2 168.0
put_hevc_epel_bi_h4_8_c: 4.5 0.7
put_hevc_epel_bi_h6_8_c: 9.0 1.5
put_hevc_epel_bi_h8_8_c: 15.2 1.7
put_hevc_epel_bi_h12_8_c: 33.5 4.2 3.7
put_hevc_epel_bi_h16_8_c: 59.7 5.2 4.7
put_hevc_epel_bi_h24_8_c: 132.2 11.0
put_hevc_epel_bi_h32_8_c: 232.7 20.2 13.2
put_hevc_epel_bi_h48_8_c: 521.7 45.2 31.2
put_hevc_epel_bi_h64_8_c: 949.0 71.5 51.0
After this patch, the performance of decoding H265 4K 30FPS
30Mbps on 3A6000 with 8 threads improves by about 1 fps (55fps-->56fps).
Change-Id: I8cc1e41daa63ca478039bc55d1ee8934a7423f51
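For reference, a rough scalar C sketch of the qpel_uni_h path that the new
kernels vectorize, following the comments in the assembly below (8-tap
QPEL_FILTER, + 32 rounding, >> 6, clip to unsigned 8-bit). The helper name
and signature are illustrative assumptions:

#include <stdint.h>
#include <stddef.h>

static uint8_t clip_u8(int v) { return v < 0 ? 0 : v > 255 ? 255 : v; }

/* f = ff_hevc_qpel_filters[mx - 1] */
static void qpel_uni_h_sketch(uint8_t *dst, ptrdiff_t dststride,
                              const uint8_t *src, ptrdiff_t srcstride,
                              int width, int height, const int8_t *f)
{
    src -= 3;                                   /* 8-tap filter reaches 3 samples left */
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            int sum = 0;
            for (int k = 0; k < 8; k++)         /* QPEL_FILTER(src, 1) */
                sum += f[k] * src[x + k];
            dst[x] = clip_u8((sum + 32) >> 6);  /* + 32, >> 6, clip to u8 */
        }
        src += srcstride;
        dst += dststride;
    }
}

The width-specific entry points (h4/h6/.../h64) only change how many of
these columns are produced per row and how the results are stored.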
---
libavcodec/loongarch/hevc_mc.S | 1991 ++++++++++++++++-
libavcodec/loongarch/hevcdsp_init_loongarch.c | 66 +
libavcodec/loongarch/hevcdsp_lasx.h | 54 +
libavcodec/loongarch/hevcdsp_lsx.h | 36 +-
4 files changed, 2144 insertions(+), 3 deletions(-)
diff --git a/libavcodec/loongarch/hevc_mc.S b/libavcodec/loongarch/hevc_mc.S
index 0b0647546b..a0e5938fbd 100644
--- a/libavcodec/loongarch/hevc_mc.S
+++ b/libavcodec/loongarch/hevc_mc.S
@@ -1784,8 +1784,12 @@ function ff_hevc_put_hevc_qpel_uni_w_h64_8_lasx
endfunc
const shufb
- .byte 0,1,2,3, 1,2,3,4 ,2,3,4,5, 3,4,5,6
- .byte 4,5,6,7, 5,6,7,8 ,6,7,8,9, 7,8,9,10
+ .byte 0,1,2,3, 1,2,3,4 ,2,3,4,5, 3,4,5,6 //mask for epel_uni_w(128-bit)
+ .byte 4,5,6,7, 5,6,7,8 ,6,7,8,9, 7,8,9,10 //mask for epel_uni_w(256-bit)
+ .byte 0,1,2,3, 4,5,6,7 ,1,2,3,4, 5,6,7,8 //mask for qpel_uni_h4
+ .byte 0,1,1,2, 2,3,3,4 ,4,5,5,6, 6,7,7,8 //mask for qpel_uni_h/v6/8...
+ .byte 0,1,2,3, 1,2,3,4 ,2,3,4,5, 3,4,5,6, 4,5,6,7, 5,6,7,8, 6,7,8,9, 7,8,9,10 //epel_uni_w_h16/24/32/48/64
+ .byte 0,1,1,2, 2,3,3,4 ,4,5,5,6, 6,7,7,8, 0,1,1,2, 2,3,3,4 ,4,5,5,6, 6,7,7,8 //mask for bi_epel_h16/24/32/48/64
endconst
.macro PUT_HEVC_EPEL_UNI_W_HV4_LSX w
@@ -2584,3 +2588,1986 @@ function ff_hevc_put_hevc_epel_uni_w_hv64_8_lasx
addi.d t5, t5, -1
bnez t5, .LOOP_HV64_LASX
endfunc
+
+/*
+ * void FUNC(put_hevc_qpel_uni_h)(uint8_t *_dst, ptrdiff_t _dststride,
+ * const uint8_t *_src, ptrdiff_t _srcstride,
+ * int height, intptr_t mx, intptr_t my,
+ * int width)
+ */
+function ff_hevc_put_hevc_uni_qpel_h4_8_lsx
+ addi.d t0, a5, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr5, t1, t0 //filter
+ addi.d a2, a2, -3 //src -= 3
+ addi.w t1, zero, 32
+ vreplgr2vr.h vr1, t1
+ la.local t1, shufb
+ vld vr2, t1, 32 //mask0 0 1
+ vaddi.bu vr3, vr2, 2 //mask1 2 3
+.LOOP_UNI_H4:
+ vld vr18, a2, 0
+ vldx vr19, a2, a3
+ alsl.d a2, a3, a2, 1
+ vshuf.b vr6, vr18, vr18, vr2
+ vshuf.b vr7, vr18, vr18, vr3
+ vshuf.b vr8, vr19, vr19, vr2
+ vshuf.b vr9, vr19, vr19, vr3
+ vdp2.h.bu.b vr10, vr6, vr5
+ vdp2.h.bu.b vr11, vr7, vr5
+ vdp2.h.bu.b vr12, vr8, vr5
+ vdp2.h.bu.b vr13, vr9, vr5
+ vhaddw.d.h vr10
+ vhaddw.d.h vr11
+ vhaddw.d.h vr12
+ vhaddw.d.h vr13
+ vpickev.w vr10, vr11, vr10
+ vpickev.w vr11, vr13, vr12
+ vpickev.h vr10, vr11, vr10
+ vadd.h vr10, vr10, vr1
+ vsrai.h vr10, vr10, 6
+ vssrani.bu.h vr10, vr10, 0
+ fst.s f10, a0, 0
+ vbsrl.v vr10, vr10, 4
+ fstx.s f10, a0, a1
+ alsl.d a0, a1, a0, 1
+ addi.d a4, a4, -2
+ bnez a4, .LOOP_UNI_H4
+endfunc
+
+.macro HEVC_UNI_QPEL_H8_LSX in0, out0
+ vshuf.b vr10, \in0, \in0, vr5
+ vshuf.b vr11, \in0, \in0, vr6
+ vshuf.b vr12, \in0, \in0, vr7
+ vshuf.b vr13, \in0, \in0, vr8
+ vdp2.h.bu.b \out0, vr10, vr0 //(QPEL_FILTER(src, 1)
+ vdp2add.h.bu.b \out0, vr11, vr1
+ vdp2add.h.bu.b \out0, vr12, vr2
+ vdp2add.h.bu.b \out0, vr13, vr3
+ vadd.h \out0, \out0, vr4
+ vsrai.h \out0, \out0, 6
+.endm
+
+.macro HEVC_UNI_QPEL_H16_LASX in0, out0
+ xvshuf.b xr10, \in0, \in0, xr5
+ xvshuf.b xr11, \in0, \in0, xr6
+ xvshuf.b xr12, \in0, \in0, xr7
+ xvshuf.b xr13, \in0, \in0, xr8
+ xvdp2.h.bu.b \out0, xr10, xr0 //(QPEL_FILTER(src, 1)
+ xvdp2add.h.bu.b \out0, xr11, xr1
+ xvdp2add.h.bu.b \out0, xr12, xr2
+ xvdp2add.h.bu.b \out0, xr13, xr3
+ xvadd.h \out0, \out0, xr4
+ xvsrai.h \out0, \out0, 6
+.endm
+
+function ff_hevc_put_hevc_uni_qpel_h6_8_lsx
+ addi.d t0, a5, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr0, t1, t0 //filter abcdefgh
+ vreplvei.h vr1, vr0, 1 //cd...
+ vreplvei.h vr2, vr0, 2 //ef...
+ vreplvei.h vr3, vr0, 3 //gh...
+ vreplvei.h vr0, vr0, 0 //ab...
+ addi.d a2, a2, -3 //src -= 3
+ addi.w t1, zero, 32
+ vreplgr2vr.h vr4, t1
+ la.local t1, shufb
+ vld vr5, t1, 48
+ vaddi.bu vr6, vr5, 2
+ vaddi.bu vr7, vr5, 4
+ vaddi.bu vr8, vr5, 6
+.LOOP_UNI_H6:
+ vld vr9, a2, 0
+ add.d a2, a2, a3
+ HEVC_UNI_QPEL_H8_LSX vr9, vr14
+ vssrani.bu.h vr14, vr14, 0
+ fst.s f14, a0, 0
+ vstelm.h vr14, a0, 4, 2
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_H6
+endfunc
+
+function ff_hevc_put_hevc_uni_qpel_h8_8_lsx
+ addi.d t0, a5, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr0, t1, t0 //filter abcdefgh
+ vreplvei.h vr1, vr0, 1 //cd...
+ vreplvei.h vr2, vr0, 2 //ef...
+ vreplvei.h vr3, vr0, 3 //gh...
+ vreplvei.h vr0, vr0, 0 //ab...
+ addi.d a2, a2, -3 //src -= 3
+ addi.w t1, zero, 32
+ vreplgr2vr.h vr4, t1
+ la.local t1, shufb
+ vld vr5, t1, 48
+ vaddi.bu vr6, vr5, 2
+ vaddi.bu vr7, vr5, 4
+ vaddi.bu vr8, vr5, 6
+.LOOP_UNI_H8:
+ vld vr9, a2, 0
+ add.d a2, a2, a3
+ HEVC_UNI_QPEL_H8_LSX vr9, vr14
+ vssrani.bu.h vr14, vr14, 0
+ fst.d f14, a0, 0
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_H8
+endfunc
+
+function ff_hevc_put_hevc_uni_qpel_h12_8_lsx
+ addi.d t0, a5, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr0, t1, t0 //filter abcdefgh
+ vreplvei.h vr1, vr0, 1 //cd...
+ vreplvei.h vr2, vr0, 2 //ef...
+ vreplvei.h vr3, vr0, 3 //gh...
+ vreplvei.h vr0, vr0, 0 //ab...
+ addi.d a2, a2, -3 //src -= 3
+ addi.w t1, zero, 32
+ vreplgr2vr.h vr4, t1
+ la.local t1, shufb
+ vld vr5, t1, 48
+ vaddi.bu vr6, vr5, 2
+ vaddi.bu vr7, vr5, 4
+ vaddi.bu vr8, vr5, 6
+.LOOP_UNI_H12:
+ vld vr9, a2, 0
+ HEVC_UNI_QPEL_H8_LSX vr9, vr14
+ vld vr9, a2, 8
+ add.d a2, a2, a3
+ HEVC_UNI_QPEL_H8_LSX vr9, vr15
+ vssrani.bu.h vr15, vr14, 0
+ fst.d f15, a0, 0
+ vstelm.w vr15, a0, 8, 2
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_H12
+endfunc
+
+function ff_hevc_put_hevc_uni_qpel_h12_8_lasx
+ addi.d t0, a5, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr0, t1, t0 //filter abcdefgh
+ xvreplve0.q xr0, xr0
+ xvrepl128vei.h xr1, xr0, 1 //cd...
+ xvrepl128vei.h xr2, xr0, 2 //ef...
+ xvrepl128vei.h xr3, xr0, 3 //gh...
+ xvrepl128vei.h xr0, xr0, 0 //ab...
+ addi.d a2, a2, -3 //src -= 3
+ addi.w t1, zero, 32
+ xvreplgr2vr.h xr4, t1
+ la.local t1, shufb
+ vld vr5, t1, 48
+ xvreplve0.q xr5, xr5
+ xvaddi.bu xr6, xr5, 2
+ xvaddi.bu xr7, xr5, 4
+ xvaddi.bu xr8, xr5, 6
+.LOOP_UNI_H12_LASX:
+ xvld xr9, a2, 0
+ add.d a2, a2, a3
+ xvpermi.d xr9, xr9, 0x94 //rearrange data
+ HEVC_UNI_QPEL_H16_LASX xr9, xr14
+ xvpermi.q xr15, xr14, 0x01
+ vssrani.bu.h vr15, vr14, 0
+ fst.d f15, a0, 0
+ vstelm.w vr15, a0, 8, 2
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_H12_LASX
+endfunc
+
+function ff_hevc_put_hevc_uni_qpel_h16_8_lsx
+ addi.d t0, a5, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr0, t1, t0 //filter abcdefgh
+ vreplvei.h vr1, vr0, 1 //cd...
+ vreplvei.h vr2, vr0, 2 //ef...
+ vreplvei.h vr3, vr0, 3 //gh...
+ vreplvei.h vr0, vr0, 0 //ab...
+ addi.d a2, a2, -3 //src -= 3
+ addi.w t1, zero, 32
+ vreplgr2vr.h vr4, t1
+ la.local t1, shufb
+ vld vr5, t1, 48
+ vaddi.bu vr6, vr5, 2
+ vaddi.bu vr7, vr5, 4
+ vaddi.bu vr8, vr5, 6
+.LOOP_UNI_H16:
+ vld vr9, a2, 0
+ HEVC_UNI_QPEL_H8_LSX vr9, vr14
+ vld vr9, a2, 8
+ add.d a2, a2, a3
+ HEVC_UNI_QPEL_H8_LSX vr9, vr15
+ vssrani.bu.h vr15, vr14, 0
+ vst vr15, a0, 0
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_H16
+endfunc
+
+function ff_hevc_put_hevc_uni_qpel_h16_8_lasx
+ addi.d t0, a5, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr0, t1, t0 //filter abcdefgh
+ xvreplve0.q xr0, xr0
+ xvrepl128vei.h xr1, xr0, 1 //cd...
+ xvrepl128vei.h xr2, xr0, 2 //ef...
+ xvrepl128vei.h xr3, xr0, 3 //gh...
+ xvrepl128vei.h xr0, xr0, 0 //ab...
+ addi.d a2, a2, -3 //src -= 3
+ addi.w t1, zero, 32
+ xvreplgr2vr.h xr4, t1
+ la.local t1, shufb
+ vld vr5, t1, 48
+ xvreplve0.q xr5, xr5
+ xvaddi.bu xr6, xr5, 2
+ xvaddi.bu xr7, xr5, 4
+ xvaddi.bu xr8, xr5, 6
+.LOOP_UNI_H16_LASX:
+ xvld xr9, a2, 0
+ add.d a2, a2, a3
+ xvpermi.d xr9, xr9, 0x94 //rearrange data
+ HEVC_UNI_QPEL_H16_LASX xr9, xr14
+ xvpermi.q xr15, xr14, 0x01
+ vssrani.bu.h vr15, vr14, 0
+ vst vr15, a0, 0
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_H16_LASX
+endfunc
+
+function ff_hevc_put_hevc_uni_qpel_h24_8_lsx
+ addi.d t0, a5, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr0, t1, t0 //filter abcdefgh
+ vreplvei.h vr1, vr0, 1 //cd...
+ vreplvei.h vr2, vr0, 2 //ef...
+ vreplvei.h vr3, vr0, 3 //gh...
+ vreplvei.h vr0, vr0, 0 //ab...
+ addi.d a2, a2, -3 //src -= 3
+ addi.w t1, zero, 32
+ vreplgr2vr.h vr4, t1
+ la.local t1, shufb
+ vld vr5, t1, 48
+ vaddi.bu vr6, vr5, 2
+ vaddi.bu vr7, vr5, 4
+ vaddi.bu vr8, vr5, 6
+.LOOP_UNI_H24:
+ vld vr9, a2, 0
+ HEVC_UNI_QPEL_H8_LSX vr9, vr14
+ vld vr9, a2, 8
+ HEVC_UNI_QPEL_H8_LSX vr9, vr15
+ vld vr9, a2, 16
+ add.d a2, a2, a3
+ HEVC_UNI_QPEL_H8_LSX vr9, vr16
+ vssrani.bu.h vr15, vr14, 0
+ vssrani.bu.h vr16, vr16, 0
+ vst vr15, a0, 0
+ fst.d f16, a0, 16
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_H24
+endfunc
+
+function ff_hevc_put_hevc_uni_qpel_h24_8_lasx
+ addi.d t0, a5, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr0, t1, t0 //filter abcdefgh
+ xvreplve0.q xr0, xr0
+ xvrepl128vei.h xr1, xr0, 1 //cd...
+ xvrepl128vei.h xr2, xr0, 2 //ef...
+ xvrepl128vei.h xr3, xr0, 3 //gh...
+ xvrepl128vei.h xr0, xr0, 0 //ab...
+ addi.d a2, a2, -3 //src -= 3
+ addi.w t1, zero, 32
+ xvreplgr2vr.h xr4, t1
+ la.local t1, shufb
+ vld vr5, t1, 48
+ xvreplve0.q xr5, xr5
+ xvaddi.bu xr6, xr5, 2
+ xvaddi.bu xr7, xr5, 4
+ xvaddi.bu xr8, xr5, 6
+.LOOP_UNI_H24_LASX:
+ xvld xr9, a2, 0
+ xvpermi.q xr19, xr9, 0x01 //16...23
+ add.d a2, a2, a3
+ xvpermi.d xr9, xr9, 0x94 //rearrange data
+ HEVC_UNI_QPEL_H16_LASX xr9, xr14
+ xvpermi.q xr15, xr14, 0x01
+ vssrani.bu.h vr15, vr14, 0
+ vst vr15, a0, 0
+ HEVC_UNI_QPEL_H8_LSX vr19, vr16
+ vssrani.bu.h vr16, vr16, 0
+ fst.d f16, a0, 16
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_H24_LASX
+endfunc
+
+function ff_hevc_put_hevc_uni_qpel_h32_8_lsx
+ addi.d t0, a5, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr0, t1, t0 //filter abcdefgh
+ vreplvei.h vr1, vr0, 1 //cd...
+ vreplvei.h vr2, vr0, 2 //ef...
+ vreplvei.h vr3, vr0, 3 //gh...
+ vreplvei.h vr0, vr0, 0 //ab...
+ addi.d a2, a2, -3 //src -= 3
+ addi.w t1, zero, 32
+ vreplgr2vr.h vr4, t1
+ la.local t1, shufb
+ vld vr5, t1, 48
+ vaddi.bu vr6, vr5, 2
+ vaddi.bu vr7, vr5, 4
+ vaddi.bu vr8, vr5, 6
+.LOOP_UNI_H32:
+ vld vr9, a2, 0
+ HEVC_UNI_QPEL_H8_LSX vr9, vr14
+ vld vr9, a2, 8
+ HEVC_UNI_QPEL_H8_LSX vr9, vr15
+ vld vr9, a2, 16
+ HEVC_UNI_QPEL_H8_LSX vr9, vr16
+ vld vr9, a2, 24
+ add.d a2, a2, a3
+ HEVC_UNI_QPEL_H8_LSX vr9, vr17
+ vssrani.bu.h vr15, vr14, 0
+ vssrani.bu.h vr17, vr16, 0
+ vst vr15, a0, 0
+ vst vr17, a0, 16
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_H32
+endfunc
+
+function ff_hevc_put_hevc_uni_qpel_h32_8_lasx
+ addi.d t0, a5, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr0, t1, t0 //filter abcdefgh
+ xvreplve0.q xr0, xr0
+ xvrepl128vei.h xr1, xr0, 1 //cd...
+ xvrepl128vei.h xr2, xr0, 2 //ef...
+ xvrepl128vei.h xr3, xr0, 3 //gh...
+ xvrepl128vei.h xr0, xr0, 0 //ab...
+ addi.d a2, a2, -3 //src -= 3
+ addi.w t1, zero, 32
+ xvreplgr2vr.h xr4, t1
+ la.local t1, shufb
+ vld vr5, t1, 48
+ xvreplve0.q xr5, xr5
+ xvaddi.bu xr6, xr5, 2
+ xvaddi.bu xr7, xr5, 4
+ xvaddi.bu xr8, xr5, 6
+.LOOP_UNI_H32_LASX:
+ xvld xr9, a2, 0
+ xvpermi.d xr9, xr9, 0x94
+ HEVC_UNI_QPEL_H16_LASX xr9, xr14
+ xvld xr9, a2, 16
+ xvpermi.d xr9, xr9, 0x94
+ HEVC_UNI_QPEL_H16_LASX xr9, xr15
+ add.d a2, a2, a3
+ xvssrani.bu.h xr15, xr14, 0
+ xvpermi.d xr15, xr15, 0xd8
+ xvst xr15, a0, 0
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_H32_LASX
+endfunc
+
+function ff_hevc_put_hevc_uni_qpel_h48_8_lsx
+ addi.d t0, a5, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr0, t1, t0 //filter abcdefgh
+ vreplvei.h vr1, vr0, 1 //cd...
+ vreplvei.h vr2, vr0, 2 //ef...
+ vreplvei.h vr3, vr0, 3 //gh...
+ vreplvei.h vr0, vr0, 0 //ab...
+ addi.d a2, a2, -3 //src -= 3
+ addi.w t1, zero, 32
+ vreplgr2vr.h vr4, t1
+ la.local t1, shufb
+ vld vr5, t1, 48
+ vaddi.bu vr6, vr5, 2
+ vaddi.bu vr7, vr5, 4
+ vaddi.bu vr8, vr5, 6
+.LOOP_UNI_H48:
+ vld vr9, a2, 0
+ HEVC_UNI_QPEL_H8_LSX vr9, vr14
+ vld vr9, a2, 8
+ HEVC_UNI_QPEL_H8_LSX vr9, vr15
+ vld vr9, a2, 16
+ HEVC_UNI_QPEL_H8_LSX vr9, vr16
+ vld vr9, a2, 24
+ HEVC_UNI_QPEL_H8_LSX vr9, vr17
+ vld vr9, a2, 32
+ HEVC_UNI_QPEL_H8_LSX vr9, vr18
+ vld vr9, a2, 40
+ add.d a2, a2, a3
+ HEVC_UNI_QPEL_H8_LSX vr9, vr19
+ vssrani.bu.h vr15, vr14, 0
+ vssrani.bu.h vr17, vr16, 0
+ vssrani.bu.h vr19, vr18, 0
+ vst vr15, a0, 0
+ vst vr17, a0, 16
+ vst vr19, a0, 32
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_H48
+endfunc
+
+function ff_hevc_put_hevc_uni_qpel_h48_8_lasx
+ addi.d t0, a5, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr0, t1, t0 //filter abcdefgh
+ xvreplve0.q xr0, xr0
+ xvrepl128vei.h xr1, xr0, 1 //cd...
+ xvrepl128vei.h xr2, xr0, 2 //ef...
+ xvrepl128vei.h xr3, xr0, 3 //gh...
+ xvrepl128vei.h xr0, xr0, 0 //ab...
+ addi.d a2, a2, -3 //src -= 3
+ addi.w t1, zero, 32
+ xvreplgr2vr.h xr4, t1
+ la.local t1, shufb
+ vld vr5, t1, 48
+ xvreplve0.q xr5, xr5
+ xvaddi.bu xr6, xr5, 2
+ xvaddi.bu xr7, xr5, 4
+ xvaddi.bu xr8, xr5, 6
+.LOOP_UNI_H48_LASX:
+ xvld xr9, a2, 0
+ xvpermi.d xr9, xr9, 0x94
+ HEVC_UNI_QPEL_H16_LASX xr9, xr14
+ xvld xr9, a2, 16
+ xvpermi.d xr9, xr9, 0x94
+ HEVC_UNI_QPEL_H16_LASX xr9, xr15
+ xvld xr9, a2, 32
+ xvpermi.d xr9, xr9, 0x94
+ HEVC_UNI_QPEL_H16_LASX xr9, xr16
+ add.d a2, a2, a3
+ xvssrani.bu.h xr15, xr14, 0
+ xvpermi.d xr15, xr15, 0xd8
+ xvst xr15, a0, 0
+ xvpermi.q xr17, xr16, 0x01
+ vssrani.bu.h vr17, vr16, 0
+ vst vr17, a0, 32
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_H48_LASX
+endfunc
+
+function ff_hevc_put_hevc_uni_qpel_h64_8_lasx
+ addi.d t0, a5, -1
+ slli.w t0, t0, 4
+ la.local t1, ff_hevc_qpel_filters
+ vldx vr0, t1, t0 //filter abcdefgh
+ xvreplve0.q xr0, xr0
+ xvrepl128vei.h xr1, xr0, 1 //cd...
+ xvrepl128vei.h xr2, xr0, 2 //ef...
+ xvrepl128vei.h xr3, xr0, 3 //gh...
+ xvrepl128vei.h xr0, xr0, 0 //ab...
+ addi.d a2, a2, -3 //src -= 3
+ addi.w t1, zero, 32
+ xvreplgr2vr.h xr4, t1
+ la.local t1, shufb
+ vld vr5, t1, 48
+ xvreplve0.q xr5, xr5
+ xvaddi.bu xr6, xr5, 2
+ xvaddi.bu xr7, xr5, 4
+ xvaddi.bu xr8, xr5, 6
+.LOOP_UNI_H64_LASX:
+ xvld xr9, a2, 0
+ xvpermi.d xr9, xr9, 0x94
+ HEVC_UNI_QPEL_H16_LASX xr9, xr14
+ xvld xr9, a2, 16
+ xvpermi.d xr9, xr9, 0x94
+ HEVC_UNI_QPEL_H16_LASX xr9, xr15
+ xvld xr9, a2, 32
+ xvpermi.d xr9, xr9, 0x94
+ HEVC_UNI_QPEL_H16_LASX xr9, xr16
+ xvld xr9, a2, 48
+ xvpermi.d xr9, xr9, 0x94
+ HEVC_UNI_QPEL_H16_LASX xr9, xr17
+ add.d a2, a2, a3
+ xvssrani.bu.h xr15, xr14, 0
+ xvpermi.d xr15, xr15, 0xd8
+ xvst xr15, a0, 0
+ xvssrani.bu.h xr17, xr16, 0
+ xvpermi.d xr17, xr17, 0xd8
+ xvst xr17, a0, 32
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_H64_LASX
+endfunc
+
+/*
+ * void FUNC(put_hevc_epel_uni_w_v)(uint8_t *_dst, ptrdiff_t _dststride,
+ * const uint8_t *_src, ptrdiff_t _srcstride,
+ * int height, int denom, int wx, int ox,
+ * intptr_t mx, intptr_t my, int width)
+ */
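+/*
+ * Editor's sketch (an assumption pieced together from the in-line
+ * comments, not the exact FFmpeg C template): per output pixel the
+ * weighted uni-prediction paths below compute roughly
+ *
+ *     pix    = EPEL_FILTER(src, srcstride);             // 4-tap MAC
+ *     dst[x] = clip_u8(((pix * wx + offset) >> shift) + ox);
+ *
+ * where wx/offset/shift/ox sit pre-broadcast in vr1..vr4 (xr1..xr4),
+ * presumably set up by LOAD_VAR, and clip_u8 is an illustrative name
+ * for the saturating narrow done by vssrani.h.w/vssrani.bu.h.
+ */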
+function ff_hevc_put_hevc_epel_uni_w_v4_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ sub.d a2, a2, a3 //src -= stride
+ fld.s f6, a2, 0 //0
+ fldx.s f7, a2, a3 //1
+ fldx.s f8, a2, t0 //2
+ add.d a2, a2, t1
+ vilvl.b vr6, vr7, vr6
+ vilvl.b vr7, vr8, vr8
+ vilvl.h vr6, vr7, vr6
+ vreplvei.w vr0, vr0, 0
+.LOOP_UNI_V4:
+ fld.s f9, a2, 0 //3
+ fldx.s f10, a2, a3 //4
+ add.d a2, a2, t0
+ vextrins.b vr6, vr9, 0x30 //insert the 3rd load
+ vextrins.b vr6, vr9, 0x71
+ vextrins.b vr6, vr9, 0xb2
+ vextrins.b vr6, vr9, 0xf3
+ vbsrl.v vr7, vr6, 1
+ vextrins.b vr7, vr10, 0x30 //insert the 4th load
+ vextrins.b vr7, vr10, 0x71
+ vextrins.b vr7, vr10, 0xb2
+ vextrins.b vr7, vr10, 0xf3
+ vdp2.h.bu.b vr8, vr6, vr0 //EPEL_FILTER(src, stride)
+ vdp2.h.bu.b vr9, vr7, vr0
+ vhaddw.w.h vr10, vr8, vr8
+ vhaddw.w.h vr11, vr9, vr9
+ vmulwev.w.h vr10, vr10, vr1 //EPEL_FILTER(src, stride) * wx
+ vmulwev.w.h vr11, vr11, vr1
+ vadd.w vr10, vr10, vr2 // + offset
+ vadd.w vr11, vr11, vr2
+ vsra.w vr10, vr10, vr3 // >> shift
+ vsra.w vr11, vr11, vr3
+ vadd.w vr10, vr10, vr4 // + ox
+ vadd.w vr11, vr11, vr4
+ vssrani.h.w vr11, vr10, 0
+ vssrani.bu.h vr10, vr11, 0
+ vbsrl.v vr6, vr7, 1
+ fst.s f10, a0, 0
+ vbsrl.v vr10, vr10, 4
+ fstx.s f10, a0, a1
+ alsl.d a0, a1, a0, 1
+ addi.d a4, a4, -2
+ bnez a4, .LOOP_UNI_V4
+endfunc
+
+.macro CALC_EPEL_FILTER_LSX out0, out1
+ vdp2.h.bu.b vr12, vr10, vr0 //EPEL_FILTER(src, stride)
+ vdp2add.h.bu.b vr12, vr11, vr5
+ vexth.w.h vr13, vr12
+ vsllwil.w.h vr12, vr12, 0
+ vmulwev.w.h vr12, vr12, vr1 //EPEL_FILTER(src, stride) * wx
+ vmulwev.w.h vr13, vr13, vr1 //EPEL_FILTER(src, stride) * wx
+ vadd.w vr12, vr12, vr2 // + offset
+ vadd.w vr13, vr13, vr2
+ vsra.w vr12, vr12, vr3 // >> shift
+ vsra.w vr13, vr13, vr3
+ vadd.w \out0, vr12, vr4 // + ox
+ vadd.w \out1, vr13, vr4
+.endm
+
+.macro CALC_EPEL_FILTER_LASX out0
+ xvdp2.h.bu.b xr11, xr12, xr0 //EPEL_FILTER(src, stride)
+ xvhaddw.w.h xr12, xr11, xr11
+ xvmulwev.w.h xr12, xr12, xr1 //EPEL_FILTER(src, stride) * wx
+ xvadd.w xr12, xr12, xr2 // + offset
+ xvsra.w xr12, xr12, xr3 // >> shift
+ xvadd.w \out0, xr12, xr4 // + ox
+.endm
+
+//w serves both as a unique label suffix and as the width tested by the ".if" directives below.
+.macro PUT_HEVC_EPEL_UNI_W_V8_LSX w
+ fld.d f6, a2, 0 //0
+ fldx.d f7, a2, a3 //1
+ fldx.d f8, a2, t0 //2
+ add.d a2, a2, t1
+.LOOP_UNI_V8_\w:
+ fld.d f9, a2, 0 // 3
+ add.d a2, a2, a3
+ vilvl.b vr10, vr7, vr6
+ vilvl.b vr11, vr9, vr8
+ vaddi.bu vr6, vr7, 0 //back up previous value
+ vaddi.bu vr7, vr8, 0
+ vaddi.bu vr8, vr9, 0
+ CALC_EPEL_FILTER_LSX vr12, vr13
+ vssrani.h.w vr13, vr12, 0
+ vssrani.bu.h vr13, vr13, 0
+.if \w < 8
+ fst.s f13, a0, 0
+ vstelm.h vr13, a0, 4, 2
+.else
+ fst.d f13, a0, 0
+.endif
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_V8_\w
+.endm
+
+//w serves both as a unique label suffix and as the width tested by the ".if" directives below.
+.macro PUT_HEVC_EPEL_UNI_W_V8_LASX w
+ fld.d f6, a2, 0 //0
+ fldx.d f7, a2, a3 //1
+ fldx.d f8, a2, t0 //2
+ add.d a2, a2, t1
+.LOOP_UNI_V8_LASX_\w:
+ fld.d f9, a2, 0 // 3
+ add.d a2, a2, a3
+ vilvl.b vr10, vr7, vr6
+ vilvl.b vr11, vr9, vr8
+ xvilvl.h xr12, xr11, xr10
+ xvilvh.h xr13, xr11, xr10
+ xvpermi.q xr12, xr13, 0x02
+ vaddi.bu vr6, vr7, 0 //back up previous value
+ vaddi.bu vr7, vr8, 0
+ vaddi.bu vr8, vr9, 0
+ CALC_EPEL_FILTER_LASX xr12
+ xvpermi.q xr13, xr12, 0x01
+ vssrani.h.w vr13, vr12, 0
+ vssrani.bu.h vr13, vr13, 0
+.if \w < 8
+ fst.s f13, a0, 0
+ vstelm.h vr13, a0, 4, 2
+.else
+ fst.d f13, a0, 0
+.endif
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_V8_LASX_\w
+.endm
+
+function ff_hevc_put_hevc_epel_uni_w_v6_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ sub.d a2, a2, a3 //src -= stride
+ vreplvei.h vr5, vr0, 1
+ vreplvei.h vr0, vr0, 0
+ PUT_HEVC_EPEL_UNI_W_V8_LSX 6
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_v6_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ xvreplve0.w xr0, xr0
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ sub.d a2, a2, a3 //src -= stride
+ PUT_HEVC_EPEL_UNI_W_V8_LASX 6
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_v8_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ sub.d a2, a2, a3 //src -= stride
+ vreplvei.h vr5, vr0, 1
+ vreplvei.h vr0, vr0, 0
+ PUT_HEVC_EPEL_UNI_W_V8_LSX 8
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_v8_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ xvreplve0.w xr0, xr0
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ sub.d a2, a2, a3 //src -= stride
+ PUT_HEVC_EPEL_UNI_W_V8_LASX 8
+endfunc
+
+//w serves both as a unique label suffix and as the width tested by the ".if" directives below.
+.macro PUT_HEVC_EPEL_UNI_W_V16_LSX w
+ vld vr6, a2, 0 //0
+ vldx vr7, a2, a3 //1
+ vldx vr8, a2, t0 //2
+ add.d a2, a2, t1
+.LOOP_UNI_V16_\w:
+ vld vr9, a2, 0 //3
+ add.d a2, a2, a3
+ vilvl.b vr10, vr7, vr6
+ vilvl.b vr11, vr9, vr8
+ CALC_EPEL_FILTER_LSX vr14, vr15
+ vilvh.b vr10, vr7, vr6
+ vilvh.b vr11, vr9, vr8
+ CALC_EPEL_FILTER_LSX vr16, vr17
+ vssrani.h.w vr15, vr14, 0
+ vssrani.h.w vr17, vr16, 0
+ vssrani.bu.h vr17, vr15, 0
+ vaddi.bu vr6, vr7, 0 //back up previous value
+ vaddi.bu vr7, vr8, 0
+ vaddi.bu vr8, vr9, 0
+.if \w < 16
+ fst.d f17, a0, 0
+ vstelm.w vr17, a0, 8, 2
+.else
+ vst vr17, a0, 0
+.endif
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_V16_\w
+.endm
+
+//w serves both as a unique label suffix and as the width tested by the ".if" directives below.
+.macro PUT_HEVC_EPEL_UNI_W_V16_LASX w
+ vld vr6, a2, 0 //0
+ vldx vr7, a2, a3 //1
+ vldx vr8, a2, t0 //2
+ add.d a2, a2, t1
+.LOOP_UNI_V16_LASX_\w:
+ vld vr9, a2, 0 //3
+ add.d a2, a2, a3
+ xvilvl.b xr10, xr7, xr6
+ xvilvh.b xr11, xr7, xr6
+ xvpermi.q xr11, xr10, 0x20
+ xvilvl.b xr12, xr9, xr8
+ xvilvh.b xr13, xr9, xr8
+ xvpermi.q xr13, xr12, 0x20
+ xvdp2.h.bu.b xr10, xr11, xr0 //EPEL_FILTER(src, stride)
+ xvdp2add.h.bu.b xr10, xr13, xr5
+ xvexth.w.h xr11, xr10
+ xvsllwil.w.h xr10, xr10, 0
+ xvmulwev.w.h xr10, xr10, xr1 //EPEL_FILTER(src, stride) * wx
+ xvmulwev.w.h xr11, xr11, xr1
+ xvadd.w xr10, xr10, xr2 // + offset
+ xvadd.w xr11, xr11, xr2
+ xvsra.w xr10, xr10, xr3 // >> shift
+ xvsra.w xr11, xr11, xr3
+ xvadd.w xr10, xr10, xr4 // + ox
+ xvadd.w xr11, xr11, xr4
+ xvssrani.h.w xr11, xr10, 0
+ xvpermi.q xr10, xr11, 0x01
+ vssrani.bu.h vr10, vr11, 0
+ vaddi.bu vr6, vr7, 0 //back up previous value
+ vaddi.bu vr7, vr8, 0
+ vaddi.bu vr8, vr9, 0
+.if \w < 16
+ fst.d f10, a0, 0
+ vstelm.w vr10, a0, 8, 2
+.else
+ vst vr10, a0, 0
+.endif
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_V16_LASX_\w
+.endm
+
+function ff_hevc_put_hevc_epel_uni_w_v12_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ sub.d a2, a2, a3 //src -= stride
+ vreplvei.h vr5, vr0, 1
+ vreplvei.h vr0, vr0, 0
+ PUT_HEVC_EPEL_UNI_W_V16_LSX 12
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_v12_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ xvreplve0.q xr0, xr0
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ sub.d a2, a2, a3 //src -= stride
+ xvrepl128vei.h xr5, xr0, 1
+ xvrepl128vei.h xr0, xr0, 0
+ PUT_HEVC_EPEL_UNI_W_V16_LASX 12
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_v16_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ sub.d a2, a2, a3 //src -= stride
+ vreplvei.h vr5, vr0, 1
+ vreplvei.h vr0, vr0, 0
+ PUT_HEVC_EPEL_UNI_W_V16_LSX 16
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_v16_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ xvreplve0.q xr0, xr0
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ sub.d a2, a2, a3 //src -= stride
+ xvrepl128vei.h xr5, xr0, 1
+ xvrepl128vei.h xr0, xr0, 0
+ PUT_HEVC_EPEL_UNI_W_V16_LASX 16
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_v24_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ sub.d a2, a2, a3 //src -= stride
+ vreplvei.h vr5, vr0, 1
+ vreplvei.h vr0, vr0, 0
+ addi.d t2, a0, 0 //save init
+ addi.d t3, a2, 0
+ addi.d t4, a4, 0
+ PUT_HEVC_EPEL_UNI_W_V16_LSX 24
+ addi.d a0, t2, 16 //increase step
+ addi.d a2, t3, 16
+ addi.d a4, t4, 0
+ PUT_HEVC_EPEL_UNI_W_V8_LSX 24
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_v24_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ xvreplve0.w xr20, xr0 //save xr0
+ xvreplve0.q xr0, xr0
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ sub.d a2, a2, a3 //src -= stride
+ xvrepl128vei.h xr5, xr0, 1
+ xvrepl128vei.h xr0, xr0, 0
+ addi.d t2, a0, 0 //save init
+ addi.d t3, a2, 0
+ addi.d t4, a4, 0
+ PUT_HEVC_EPEL_UNI_W_V16_LASX 24
+ addi.d a0, t2, 16 //increase step
+ addi.d a2, t3, 16
+ addi.d a4, t4, 0
+ xvaddi.bu xr0, xr20, 0
+ PUT_HEVC_EPEL_UNI_W_V8_LASX 24
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_v32_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ sub.d a2, a2, a3 //src -= stride
+ vreplvei.h vr5, vr0, 1
+ vreplvei.h vr0, vr0, 0
+ addi.d t2, a0, 0
+ addi.d t3, a2, 0
+ addi.d t4, a4, 0
+ PUT_HEVC_EPEL_UNI_W_V16_LSX 32
+ addi.d a0, t2, 16
+ addi.d a2, t3, 16
+ addi.d a4, t4, 0
+ PUT_HEVC_EPEL_UNI_W_V16_LSX 33 //33 is just a unique label suffix (same for 49/50 and 65..67 below); width stays >= 16
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_v32_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ xvreplve0.q xr0, xr0
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ sub.d a2, a2, a3 //src -= stride
+ xvrepl128vei.h xr5, xr0, 1
+ xvrepl128vei.h xr0, xr0, 0
+ addi.d t2, a0, 0
+ addi.d t3, a2, 0
+ addi.d t4, a4, 0
+ PUT_HEVC_EPEL_UNI_W_V16_LASX 32
+ addi.d a0, t2, 16
+ addi.d a2, t3, 16
+ addi.d a4, t4, 0
+ PUT_HEVC_EPEL_UNI_W_V16_LASX 33
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_v48_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ sub.d a2, a2, a3 //src -= stride
+ vreplvei.h vr5, vr0, 1
+ vreplvei.h vr0, vr0, 0
+ addi.d t2, a0, 0
+ addi.d t3, a2, 0
+ addi.d t4, a4, 0
+ PUT_HEVC_EPEL_UNI_W_V16_LSX 48
+ addi.d a0, t2, 16
+ addi.d a2, t3, 16
+ addi.d a4, t4, 0
+ PUT_HEVC_EPEL_UNI_W_V16_LSX 49
+ addi.d a0, t2, 32
+ addi.d a2, t3, 32
+ addi.d a4, t4, 0
+ PUT_HEVC_EPEL_UNI_W_V16_LSX 50
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_v48_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ xvreplve0.q xr0, xr0
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ sub.d a2, a2, a3 //src -= stride
+ xvrepl128vei.h xr5, xr0, 1
+ xvrepl128vei.h xr0, xr0, 0
+ addi.d t2, a0, 0
+ addi.d t3, a2, 0
+ addi.d t4, a4, 0
+ PUT_HEVC_EPEL_UNI_W_V16_LASX 48
+ addi.d a0, t2, 16
+ addi.d a2, t3, 16
+ addi.d a4, t4, 0
+ PUT_HEVC_EPEL_UNI_W_V16_LASX 49
+ addi.d a0, t2, 32
+ addi.d a2, t3, 32
+ addi.d a4, t4, 0
+ PUT_HEVC_EPEL_UNI_W_V16_LASX 50
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_v64_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ sub.d a2, a2, a3 //src -= stride
+ vreplvei.h vr5, vr0, 1
+ vreplvei.h vr0, vr0, 0
+ addi.d t2, a0, 0
+ addi.d t3, a2, 0
+ addi.d t4, a4, 0
+ PUT_HEVC_EPEL_UNI_W_V16_LSX 64
+ addi.d a0, t2, 16
+ addi.d a2, t3, 16
+ addi.d a4, t4, 0
+ PUT_HEVC_EPEL_UNI_W_V16_LSX 65
+ addi.d a0, t2, 32
+ addi.d a2, t3, 32
+ addi.d a4, t4, 0
+ PUT_HEVC_EPEL_UNI_W_V16_LSX 66
+ addi.d a0, t2, 48
+ addi.d a2, t3, 48
+ addi.d a4, t4, 0
+ PUT_HEVC_EPEL_UNI_W_V16_LSX 67
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_v64_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 8 //my
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ xvreplve0.q xr0, xr0
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ sub.d a2, a2, a3 //src -= stride
+ xvrepl128vei.h xr5, xr0, 1
+ xvrepl128vei.h xr0, xr0, 0
+ addi.d t2, a0, 0
+ addi.d t3, a2, 0
+ addi.d t4, a4, 0
+ PUT_HEVC_EPEL_UNI_W_V16_LASX 64
+ addi.d a0, t2, 16
+ addi.d a2, t3, 16
+ addi.d a4, t4, 0
+ PUT_HEVC_EPEL_UNI_W_V16_LASX 65
+ addi.d a0, t2, 32
+ addi.d a2, t3, 32
+ addi.d a4, t4, 0
+ PUT_HEVC_EPEL_UNI_W_V16_LASX 66
+ addi.d a0, t2, 48
+ addi.d a2, t3, 48
+ addi.d a4, t4, 0
+ PUT_HEVC_EPEL_UNI_W_V16_LASX 67
+endfunc
+
+/*
+ * void FUNC(put_hevc_epel_uni_w_h)(uint8_t *_dst, ptrdiff_t _dststride,
+ * const uint8_t *_src, ptrdiff_t _srcstride,
+ * int height, int denom, int wx, int ox,
+ * intptr_t mx, intptr_t my, int width)
+ */
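+/*
+ * Editor's note (illustrative, not the exact FFmpeg C template): same
+ * weighted arithmetic as the vertical sketch above, but the 4-tap
+ * filter now runs along the row. With src already moved back by one
+ * byte ("src -= 1"), each output pixel is roughly
+ *
+ *     sum    = f0*s[x] + f1*s[x+1] + f2*s[x+2] + f3*s[x+3];
+ *     dst[x] = clip_u8(((sum * wx + offset) >> shift) + ox);
+ *
+ * The masks loaded from 'shufb' drive vshuf.b/xvshuf.b so the four
+ * neighbouring bytes of every pixel end up adjacent for a single vdp2.
+ */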
+function ff_hevc_put_hevc_epel_uni_w_h4_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ vreplvei.w vr0, vr0, 0
+ la.local t1, shufb
+ vld vr5, t1, 0
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ addi.d a2, a2, -1 //src -= 1
+.LOOP_UNI_W_H4:
+ fld.d f6, a2, 0
+ add.d a2, a2, a3
+ vshuf.b vr6, vr6, vr6, vr5
+ vdp2.h.bu.b vr7, vr6, vr0
+ vhaddw.w.h vr7, vr7, vr7
+ vmulwev.w.h vr7, vr7, vr1
+ vadd.w vr7, vr7, vr2
+ vsra.w vr7, vr7, vr3
+ vadd.w vr7, vr7, vr4
+ vssrani.h.w vr7, vr7, 0
+ vssrani.bu.h vr7, vr7, 0
+ fst.s f7, a0, 0
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_W_H4
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_h6_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ vreplvei.w vr0, vr0, 0
+ la.local t1, shufb
+ vld vr6, t1, 48
+ vaddi.bu vr7, vr6, 2
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ addi.d a2, a2, -1 //src -= 1
+ vreplvei.h vr5, vr0, 1
+ vreplvei.h vr0, vr0, 0
+.LOOP_UNI_W_H6:
+ vld vr8, a2, 0
+ add.d a2, a2, a3
+ vshuf.b vr10, vr8, vr8, vr6
+ vshuf.b vr11, vr8, vr8, vr7
+ CALC_EPEL_FILTER_LSX vr14, vr15
+ vssrani.h.w vr15, vr14, 0
+ vssrani.bu.h vr15, vr15, 0
+ fst.s f15, a0, 0
+ vstelm.h vr15, a0, 4, 2
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_W_H6
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_h6_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ xvreplve0.w xr0, xr0
+ la.local t1, shufb
+ xvld xr6, t1, 64
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ addi.d a2, a2, -1 //src -= 1
+.LOOP_UNI_W_H6_LASX:
+ vld vr8, a2, 0
+ xvreplve0.q xr8, xr8
+ add.d a2, a2, a3
+ xvshuf.b xr12, xr8, xr8, xr6
+ CALC_EPEL_FILTER_LASX xr14
+ xvpermi.q xr15, xr14, 0x01
+ vssrani.h.w vr15, vr14, 0
+ vssrani.bu.h vr15, vr15, 0
+ fst.s f15, a0, 0
+ vstelm.h vr15, a0, 4, 2
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_W_H6_LASX
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_h8_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ vreplvei.w vr0, vr0, 0
+ la.local t1, shufb
+ vld vr6, t1, 48
+ vaddi.bu vr7, vr6, 2
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ addi.d a2, a2, -1 //src -= 1
+ vreplvei.h vr5, vr0, 1
+ vreplvei.h vr0, vr0, 0
+.LOOP_UNI_W_H8:
+ vld vr8, a2, 0
+ add.d a2, a2, a3
+ vshuf.b vr10, vr8, vr8, vr6
+ vshuf.b vr11, vr8, vr8, vr7
+ CALC_EPEL_FILTER_LSX vr14, vr15
+ vssrani.h.w vr15, vr14, 0
+ vssrani.bu.h vr15, vr15, 0
+ fst.d f15, a0, 0
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_W_H8
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_h8_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ xvreplve0.w xr0, xr0
+ la.local t1, shufb
+ xvld xr6, t1, 64
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ addi.d a2, a2, -1 //src -= 1
+.LOOP_UNI_W_H8_LASX:
+ vld vr8, a2, 0
+ xvreplve0.q xr8, xr8
+ add.d a2, a2, a3
+ xvshuf.b xr12, xr8, xr8, xr6
+ CALC_EPEL_FILTER_LASX xr14
+ xvpermi.q xr15, xr14, 0x01
+ vssrani.h.w vr15, vr14, 0
+ vssrani.bu.h vr15, vr15, 0
+ fst.d f15, a0, 0
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_W_H8_LASX
+endfunc
+
+.macro EPEL_UNI_W_H16_LOOP_LSX idx0, idx1, idx2
+ vld vr8, a2, \idx0
+ vshuf.b vr10, vr8, vr8, vr6
+ vshuf.b vr11, vr8, vr8, vr7
+ CALC_EPEL_FILTER_LSX vr14, vr15
+ vld vr8, a2, \idx1
+ vshuf.b vr10, vr8, vr8, vr6
+ vshuf.b vr11, vr8, vr8, vr7
+ CALC_EPEL_FILTER_LSX vr16, vr17
+ vssrani.h.w vr15, vr14, 0
+ vssrani.h.w vr17, vr16, 0
+ vssrani.bu.h vr17, vr15, 0
+ vst vr17, a0, \idx2
+.endm
+
+.macro EPEL_UNI_W_H16_LOOP_LASX idx0, idx2, w
+ xvld xr8, a2, \idx0
+ xvpermi.d xr9, xr8, 0x09
+ xvreplve0.q xr8, xr8
+ xvshuf.b xr12, xr8, xr8, xr6
+ CALC_EPEL_FILTER_LASX xr14
+ xvreplve0.q xr8, xr9
+ xvshuf.b xr12, xr8, xr8, xr6
+ CALC_EPEL_FILTER_LASX xr16
+ xvssrani.h.w xr16, xr14, 0
+ xvpermi.q xr17, xr16, 0x01
+ vssrani.bu.h vr17, vr16, 0
+ vpermi.w vr17, vr17, 0xd8
+.if \w == 12
+ fst.d f17, a0, 0
+ vstelm.w vr17, a0, 8, 2
+.else
+ vst vr17, a0, \idx2
+.endif
+.endm
+
+function ff_hevc_put_hevc_epel_uni_w_h12_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ vreplvei.w vr0, vr0, 0
+ la.local t1, shufb
+ vld vr6, t1, 48
+ vaddi.bu vr7, vr6, 2
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ addi.d a2, a2, -1 //src -= 1
+ vreplvei.h vr5, vr0, 1
+ vreplvei.h vr0, vr0, 0
+.LOOP_UNI_W_H12:
+ vld vr8, a2, 0
+ vshuf.b vr10, vr8, vr8, vr6
+ vshuf.b vr11, vr8, vr8, vr7
+ CALC_EPEL_FILTER_LSX vr14, vr15
+ vld vr8, a2, 8
+ vshuf.b vr10, vr8, vr8, vr6
+ vshuf.b vr11, vr8, vr8, vr7
+ CALC_EPEL_FILTER_LSX vr16, vr17
+ vssrani.h.w vr15, vr14, 0
+ vssrani.h.w vr17, vr16, 0
+ vssrani.bu.h vr17, vr15, 0
+ fst.d f17, a0, 0
+ vstelm.w vr17, a0, 8, 2
+ add.d a2, a2, a3
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_W_H12
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_h12_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ xvreplve0.w xr0, xr0
+ la.local t1, shufb
+ xvld xr6, t1, 64
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ addi.d a2, a2, -1 //src -= 1
+.LOOP_UNI_W_H12_LASX:
+ EPEL_UNI_W_H16_LOOP_LASX 0, 0, 12
+ add.d a2, a2, a3
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_W_H12_LASX
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_h16_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ vreplvei.w vr0, vr0, 0
+ la.local t1, shufb
+ vld vr6, t1, 48
+ vaddi.bu vr7, vr6, 2
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ addi.d a2, a2, -1 //src -= 1
+ vreplvei.h vr5, vr0, 1
+ vreplvei.h vr0, vr0, 0
+.LOOP_UNI_W_H16:
+ EPEL_UNI_W_H16_LOOP_LSX 0, 8, 0
+ add.d a2, a2, a3
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_W_H16
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_h16_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ xvreplve0.w xr0, xr0
+ la.local t1, shufb
+ xvld xr6, t1, 64
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ addi.d a2, a2, -1 //src -= 1
+.LOOP_UNI_W_H16_LASX:
+ EPEL_UNI_W_H16_LOOP_LASX 0, 0, 16
+ add.d a2, a2, a3
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_W_H16_LASX
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_h24_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ vreplvei.w vr0, vr0, 0
+ la.local t1, shufb
+ vld vr6, t1, 48
+ vaddi.bu vr7, vr6, 2
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ addi.d a2, a2, -1 //src -= 1
+ vreplvei.h vr5, vr0, 1
+ vreplvei.h vr0, vr0, 0
+.LOOP_UNI_W_H24:
+ EPEL_UNI_W_H16_LOOP_LSX 0, 8, 0
+ vld vr8, a2, 16
+ add.d a2, a2, a3
+ vshuf.b vr10, vr8, vr8, vr6
+ vshuf.b vr11, vr8, vr8, vr7
+ CALC_EPEL_FILTER_LSX vr18, vr19
+ vssrani.h.w vr19, vr18, 0
+ vssrani.bu.h vr19, vr19, 0
+ fst.d f19, a0, 16
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_W_H24
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_h24_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ xvreplve0.w xr0, xr0
+ la.local t1, shufb
+ xvld xr6, t1, 64
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ addi.d a2, a2, -1 //src -= 1
+.LOOP_UNI_W_H24_LASX:
+ EPEL_UNI_W_H16_LOOP_LASX 0, 0, 24
+ vld vr8, a2, 16
+ add.d a2, a2, a3
+ xvreplve0.q xr8, xr8
+ xvshuf.b xr12, xr8, xr8, xr6
+ CALC_EPEL_FILTER_LASX xr14
+ xvpermi.q xr15, xr14, 0x01
+ vssrani.h.w vr15, vr14, 0
+ vssrani.bu.h vr15, vr15, 0
+ fst.d f15, a0, 16
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_W_H24_LASX
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_h32_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ vreplvei.w vr0, vr0, 0
+ la.local t1, shufb
+ vld vr6, t1, 48
+ vaddi.bu vr7, vr6, 2
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ addi.d a2, a2, -1 //src -= 1
+ vreplvei.h vr5, vr0, 1
+ vreplvei.h vr0, vr0, 0
+.LOOP_UNI_W_H32:
+ EPEL_UNI_W_H16_LOOP_LSX 0, 8, 0
+ EPEL_UNI_W_H16_LOOP_LSX 16, 24, 16
+ add.d a2, a2, a3
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_W_H32
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_h32_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ xvreplve0.w xr0, xr0
+ la.local t1, shufb
+ xvld xr6, t1, 64
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ addi.d a2, a2, -1 //src -= 1
+.LOOP_UNI_W_H32_LASX:
+ EPEL_UNI_W_H16_LOOP_LASX 0, 0, 32
+ EPEL_UNI_W_H16_LOOP_LASX 16, 16, 32
+ add.d a2, a2, a3
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_W_H32_LASX
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_h48_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ vreplvei.w vr0, vr0, 0
+ la.local t1, shufb
+ vld vr6, t1, 48
+ vaddi.bu vr7, vr6, 2
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ addi.d a2, a2, -1 //src -= 1
+ vreplvei.h vr5, vr0, 1
+ vreplvei.h vr0, vr0, 0
+.LOOP_UNI_W_H48:
+ EPEL_UNI_W_H16_LOOP_LSX 0, 8, 0
+ EPEL_UNI_W_H16_LOOP_LSX 16, 24, 16
+ EPEL_UNI_W_H16_LOOP_LSX 32, 40, 32
+ add.d a2, a2, a3
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_W_H48
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_h48_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ xvreplve0.w xr0, xr0
+ la.local t1, shufb
+ xvld xr6, t1, 64
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ addi.d a2, a2, -1 //src -= 1
+.LOOP_UNI_W_H48_LASX:
+ EPEL_UNI_W_H16_LOOP_LASX 0, 0, 48
+ EPEL_UNI_W_H16_LOOP_LASX 16, 16, 48
+ EPEL_UNI_W_H16_LOOP_LASX 32, 32, 48
+ add.d a2, a2, a3
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_W_H48_LASX
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_h64_8_lsx
+ LOAD_VAR 128
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ vreplvei.w vr0, vr0, 0
+ la.local t1, shufb
+ vld vr6, t1, 48
+ vaddi.bu vr7, vr6, 2
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ addi.d a2, a2, -1 //src -= 1
+ vreplvei.h vr5, vr0, 1
+ vreplvei.h vr0, vr0, 0
+.LOOP_UNI_W_H64:
+ EPEL_UNI_W_H16_LOOP_LSX 0, 8, 0
+ EPEL_UNI_W_H16_LOOP_LSX 16, 24, 16
+ EPEL_UNI_W_H16_LOOP_LSX 32, 40, 32
+ EPEL_UNI_W_H16_LOOP_LSX 48, 56, 48
+ add.d a2, a2, a3
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_W_H64
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_h64_8_lasx
+ LOAD_VAR 256
+ ld.d t0, sp, 0 //mx
+ addi.d t0, t0, -1
+ slli.w t0, t0, 2
+ la.local t1, ff_hevc_epel_filters
+ vldx vr0, t1, t0 //filter
+ xvreplve0.w xr0, xr0
+ la.local t1, shufb
+ xvld xr6, t1, 64
+ slli.d t0, a3, 1 //stride * 2
+ add.d t1, t0, a3 //stride * 3
+ addi.d a2, a2, -1 //src -= 1
+.LOOP_UNI_W_H64_LASX:
+ EPEL_UNI_W_H16_LOOP_LASX 0, 0, 64
+ EPEL_UNI_W_H16_LOOP_LASX 16, 16, 64
+ EPEL_UNI_W_H16_LOOP_LASX 32, 32, 64
+ EPEL_UNI_W_H16_LOOP_LASX 48, 48, 64
+ add.d a2, a2, a3
+ add.d a0, a0, a1
+ addi.d a4, a4, -1
+ bnez a4, .LOOP_UNI_W_H64_LASX
+endfunc
+
+/*
+ * void FUNC(put_hevc_epel_bi_h)(uint8_t *_dst, ptrdiff_t _dststride,
+ * const uint8_t *_src, ptrdiff_t _srcstride,
+ * const int16_t *src2, int height, intptr_t mx,
+ * intptr_t my, int width)
+ */
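+/*
+ * Editor's sketch of the bi-prediction combine used below (an
+ * assumption based on the vsadd.h/vssrarni.bu.h pattern, not the
+ * exact FFmpeg C template):
+ *
+ *     dst[x] = clip_u8((EPEL_FILTER(src, 1) + src2[x] + 64) >> 7);
+ *
+ * i.e. the horizontally filtered sample is added to the co-located
+ * 16-bit src2 entry, rounded, shifted down by 7 and clipped to 8 bits;
+ * src2 advances by 128 bytes per row ("addi.d a4, a4, 128").
+ */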
+function ff_hevc_put_hevc_bi_epel_h4_8_lsx
+ addi.d a6, a6, -1
+ slli.w a6, a6, 2
+ la.local t0, ff_hevc_epel_filters
+ vldx vr0, t0, a6 // filter
+ vreplvei.w vr0, vr0, 0
+ la.local t0, shufb
+ vld vr1, t0, 0 // mask
+ addi.d a2, a2, -1 // src -= 1
+.LOOP_BI_EPEL_H4:
+ vld vr4, a4, 0 // src2
+ vld vr5, a2, 0
+ add.d a2, a2, a3
+ addi.d a4, a4, 128
+ vshuf.b vr5, vr5, vr5, vr1
+ vdp2.h.bu.b vr6, vr5, vr0 // EPEL_FILTER(src, 1)
+ vsllwil.w.h vr4, vr4, 0
+ vhaddw.w.h vr6, vr6, vr6
+ vadd.w vr6, vr6, vr4 // src2[x]
+ vssrani.h.w vr6, vr6, 0
+ vssrarni.bu.h vr6, vr6, 7
+ fst.s f6, a0, 0
+ add.d a0, a0, a1
+ addi.d a5, a5, -1
+ bnez a5, .LOOP_BI_EPEL_H4
+endfunc
+
+.macro PUT_HEVC_BI_EPEL_H8_LSX in0, in1, in2, in3, out0
+ vshuf.b vr6, \in1, \in0, \in2
+ vshuf.b vr7, \in1, \in0, \in3
+ vdp2.h.bu.b vr8, vr6, vr0 // EPEL_FILTER(src, 1)
+ vdp2add.h.bu.b vr8, vr7, vr1 // EPEL_FILTER(src, 1)
+ vsadd.h \out0, vr8, vr4 // src2[x]
+.endm
+
+.macro PUT_HEVC_BI_EPEL_H16_LASX in0, in1, in2, in3, out0
+ xvshuf.b xr6, \in1, \in0, \in2
+ xvshuf.b xr7, \in1, \in0, \in3
+ xvdp2.h.bu.b xr8, xr6, xr0 // EPEL_FILTER(src, 1)
+ xvdp2add.h.bu.b xr8, xr7, xr1 // EPEL_FILTER(src, 1)
+ xvsadd.h \out0, xr8, xr4 // src2[x]
+.endm
+
+function ff_hevc_put_hevc_bi_epel_h6_8_lsx
+ addi.d a6, a6, -1
+ slli.w a6, a6, 2
+ la.local t0, ff_hevc_epel_filters
+ vldx vr0, t0, a6 // filter
+ vreplvei.h vr1, vr0, 1
+ vreplvei.h vr0, vr0, 0
+ la.local t0, shufb
+ vld vr2, t0, 48 // mask
+ vaddi.bu vr3, vr2, 2
+ addi.d a2, a2, -1 // src -= 1
+.LOOP_BI_EPEL_H6:
+ vld vr4, a4, 0 // src2
+ vld vr5, a2, 0
+ add.d a2, a2, a3
+ addi.d a4, a4, 128
+ PUT_HEVC_BI_EPEL_H8_LSX vr5, vr5, vr2, vr3, vr7
+ vssrarni.bu.h vr7, vr7, 7
+ fst.s f7, a0, 0
+ vstelm.h vr7, a0, 4, 2
+ add.d a0, a0, a1
+ addi.d a5, a5, -1
+ bnez a5, .LOOP_BI_EPEL_H6
+endfunc
+
+function ff_hevc_put_hevc_bi_epel_h8_8_lsx
+ addi.d a6, a6, -1
+ slli.w a6, a6, 2
+ la.local t0, ff_hevc_epel_filters
+ vldx vr0, t0, a6 // filter
+ vreplvei.h vr1, vr0, 1
+ vreplvei.h vr0, vr0, 0
+ la.local t0, shufb
+ vld vr2, t0, 48 // mask
+ vaddi.bu vr3, vr2, 2
+ addi.d a2, a2, -1 // src -= 1
+.LOOP_BI_EPEL_H8:
+ vld vr4, a4, 0 // src2
+ vld vr5, a2, 0
+ add.d a2, a2, a3
+ addi.d a4, a4, 128
+ PUT_HEVC_BI_EPEL_H8_LSX vr5, vr5, vr2, vr3, vr7
+ vssrarni.bu.h vr7, vr7, 7
+ fst.d f7, a0, 0
+ add.d a0, a0, a1
+ addi.d a5, a5, -1
+ bnez a5, .LOOP_BI_EPEL_H8
+endfunc
+
+function ff_hevc_put_hevc_bi_epel_h12_8_lsx
+ addi.d a6, a6, -1
+ slli.w a6, a6, 2
+ la.local t0, ff_hevc_epel_filters
+ vldx vr0, t0, a6 // filter
+ vreplvei.h vr1, vr0, 1
+ vreplvei.h vr0, vr0, 0
+ la.local t0, shufb
+ vld vr2, t0, 48 // mask
+ vaddi.bu vr3, vr2, 2
+ addi.d a2, a2, -1 // src -= 1
+.LOOP_BI_EPEL_H12:
+ vld vr4, a4, 0 // src2
+ vld vr5, a2, 0
+ PUT_HEVC_BI_EPEL_H8_LSX vr5, vr5, vr2, vr3, vr11
+ vld vr5, a2, 8
+ vld vr4, a4, 16
+ PUT_HEVC_BI_EPEL_H8_LSX vr5, vr5, vr2, vr3, vr12
+ vssrarni.bu.h vr12, vr11, 7
+ fst.d f12, a0, 0
+ vstelm.w vr12, a0, 8, 2
+ add.d a2, a2, a3
+ addi.d a4, a4, 128
+ add.d a0, a0, a1
+ addi.d a5, a5, -1
+ bnez a5, .LOOP_BI_EPEL_H12
+endfunc
+
+function ff_hevc_put_hevc_bi_epel_h12_8_lasx
+ addi.d a6, a6, -1
+ slli.w a6, a6, 2
+ la.local t0, ff_hevc_epel_filters
+ vldx vr0, t0, a6 // filter
+ xvreplve0.q xr0, xr0
+ xvrepl128vei.h xr1, xr0, 1
+ xvrepl128vei.h xr0, xr0, 0
+ la.local t0, shufb
+ xvld xr2, t0, 96 // mask
+ xvaddi.bu xr3, xr2, 2
+ addi.d a2, a2, -1 // src -= 1
+.LOOP_BI_EPEL_H12_LASX:
+ xvld xr4, a4, 0 // src2
+ xvld xr5, a2, 0
+ xvpermi.d xr5, xr5, 0x94
+ PUT_HEVC_BI_EPEL_H16_LASX xr5, xr5, xr2, xr3, xr9
+ xvpermi.q xr10, xr9, 0x01
+ vssrarni.bu.h vr10, vr9, 7
+ fst.d f10, a0, 0
+ vstelm.w vr10, a0, 8, 2
+ add.d a2, a2, a3
+ addi.d a4, a4, 128
+ add.d a0, a0, a1
+ addi.d a5, a5, -1
+ bnez a5, .LOOP_BI_EPEL_H12_LASX
+endfunc
+
+function ff_hevc_put_hevc_bi_epel_h16_8_lsx
+ addi.d a6, a6, -1
+ slli.w a6, a6, 2
+ la.local t0, ff_hevc_epel_filters
+ vldx vr0, t0, a6 // filter
+ vreplvei.h vr1, vr0, 1
+ vreplvei.h vr0, vr0, 0
+ la.local t0, shufb
+ vld vr2, t0, 48 // mask
+ vaddi.bu vr3, vr2, 2
+ addi.d a2, a2, -1 // src -= 1
+.LOOP_BI_EPEL_H16:
+ vld vr4, a4, 0 // src2
+ vld vr5, a2, 0
+ PUT_HEVC_BI_EPEL_H8_LSX vr5, vr5, vr2, vr3, vr11
+ vld vr5, a2, 8
+ vld vr4, a4, 16
+ PUT_HEVC_BI_EPEL_H8_LSX vr5, vr5, vr2, vr3, vr12
+ vssrarni.bu.h vr12, vr11, 7
+ vst vr12, a0, 0
+ add.d a2, a2, a3
+ addi.d a4, a4, 128
+ add.d a0, a0, a1
+ addi.d a5, a5, -1
+ bnez a5, .LOOP_BI_EPEL_H16
+endfunc
+
+function ff_hevc_put_hevc_bi_epel_h16_8_lasx
+ addi.d a6, a6, -1
+ slli.w a6, a6, 2
+ la.local t0, ff_hevc_epel_filters
+ vldx vr0, t0, a6 // filter
+ xvreplve0.q xr0, xr0
+ xvrepl128vei.h xr1, xr0, 1
+ xvrepl128vei.h xr0, xr0, 0
+ la.local t0, shufb
+ xvld xr2, t0, 96 // mask
+ xvaddi.bu xr3, xr2, 2
+ addi.d a2, a2, -1 // src -= 1
+.LOOP_BI_EPEL_H16_LASX:
+ xvld xr4, a4, 0 // src2
+ xvld xr5, a2, 0
+ xvpermi.d xr5, xr5, 0x94
+ PUT_HEVC_BI_EPEL_H16_LASX xr5, xr5, xr2, xr3, xr9
+ xvpermi.q xr10, xr9, 0x01
+ vssrarni.bu.h vr10, vr9, 7
+ vst vr10, a0, 0
+ add.d a2, a2, a3
+ addi.d a4, a4, 128
+ add.d a0, a0, a1
+ addi.d a5, a5, -1
+ bnez a5, .LOOP_BI_EPEL_H16_LASX
+endfunc
+
+function ff_hevc_put_hevc_bi_epel_h32_8_lasx
+ addi.d a6, a6, -1
+ slli.w a6, a6, 2
+ la.local t0, ff_hevc_epel_filters
+ vldx vr0, t0, a6 // filter
+ xvreplve0.q xr0, xr0
+ xvrepl128vei.h xr1, xr0, 1
+ xvrepl128vei.h xr0, xr0, 0
+ la.local t0, shufb
+ xvld xr2, t0, 96 // mask
+ xvaddi.bu xr3, xr2, 2
+ addi.d a2, a2, -1 // src -= 1
+.LOOP_BI_EPEL_H32_LASX:
+ xvld xr4, a4, 0 // src2
+ xvld xr5, a2, 0
+ xvpermi.q xr15, xr5, 0x01
+ xvpermi.d xr5, xr5, 0x94
+ PUT_HEVC_BI_EPEL_H16_LASX xr5, xr5, xr2, xr3, xr9
+ xvld xr4, a4, 32
+ xvld xr15, a2, 16
+ xvpermi.d xr15, xr15, 0x94
+ PUT_HEVC_BI_EPEL_H16_LASX xr15, xr15, xr2, xr3, xr11
+ xvssrarni.bu.h xr11, xr9, 7
+ xvpermi.d xr11, xr11, 0xd8
+ xvst xr11, a0, 0
+ add.d a2, a2, a3
+ addi.d a4, a4, 128
+ add.d a0, a0, a1
+ addi.d a5, a5, -1
+ bnez a5, .LOOP_BI_EPEL_H32_LASX
+endfunc
+
+function ff_hevc_put_hevc_bi_epel_h48_8_lsx
+ addi.d a6, a6, -1
+ slli.w a6, a6, 2
+ la.local t0, ff_hevc_epel_filters
+ vldx vr0, t0, a6 // filter
+ vreplvei.h vr1, vr0, 1
+ vreplvei.h vr0, vr0, 0
+ la.local t0, shufb
+ vld vr2, t0, 48 // mask
+ vaddi.bu vr3, vr2, 2
+ vaddi.bu vr21, vr2, 8
+ vaddi.bu vr22, vr2, 10
+ addi.d a2, a2, -1 // src -= 1
+.LOOP_BI_EPEL_H48:
+ vld vr4, a4, 0 // src2
+ vld vr5, a2, 0
+ vld vr9, a2, 16
+ vld vr10, a2, 32
+ vld vr11, a2, 48
+ PUT_HEVC_BI_EPEL_H8_LSX vr5, vr5, vr2, vr3, vr12
+ vld vr4, a4, 16
+ PUT_HEVC_BI_EPEL_H8_LSX vr5, vr9, vr21, vr22, vr13
+ vld vr4, a4, 32
+ PUT_HEVC_BI_EPEL_H8_LSX vr9, vr9, vr2, vr3, vr14
+ vld vr4, a4, 48
+ PUT_HEVC_BI_EPEL_H8_LSX vr9, vr10, vr21, vr22, vr15
+ vld vr4, a4, 64
+ PUT_HEVC_BI_EPEL_H8_LSX vr10, vr10, vr2, vr3, vr16
+ vld vr4, a4, 80
+ PUT_HEVC_BI_EPEL_H8_LSX vr10, vr11, vr21, vr22, vr17
+ vssrarni.bu.h vr13, vr12, 7
+ vssrarni.bu.h vr15, vr14, 7
+ vssrarni.bu.h vr17, vr16, 7
+ vst vr13, a0, 0
+ vst vr15, a0, 16
+ vst vr17, a0, 32
+ add.d a2, a2, a3
+ addi.d a4, a4, 128
+ add.d a0, a0, a1
+ addi.d a5, a5, -1
+ bnez a5, .LOOP_BI_EPEL_H48
+endfunc
+
+function ff_hevc_put_hevc_bi_epel_h48_8_lasx
+ addi.d a6, a6, -1
+ slli.w a6, a6, 2
+ la.local t0, ff_hevc_epel_filters
+ vldx vr0, t0, a6 // filter
+ xvreplve0.q xr0, xr0
+ xvrepl128vei.h xr1, xr0, 1
+ xvrepl128vei.h xr0, xr0, 0
+ la.local t0, shufb
+ xvld xr2, t0, 96 // mask
+ xvaddi.bu xr3, xr2, 2
+ addi.d a2, a2, -1 // src -= 1
+.LOOP_BI_EPEL_H48_LASX:
+ xvld xr4, a4, 0 // src2
+ xvld xr5, a2, 0
+ xvld xr9, a2, 32
+ xvpermi.d xr10, xr9, 0x94
+ xvpermi.q xr9, xr5, 0x21
+ xvpermi.d xr9, xr9, 0x94
+ xvpermi.d xr5, xr5, 0x94
+ PUT_HEVC_BI_EPEL_H16_LASX xr5, xr5, xr2, xr3, xr11
+ xvld xr4, a4, 32
+ PUT_HEVC_BI_EPEL_H16_LASX xr9, xr9, xr2, xr3, xr12
+ xvld xr4, a4, 64
+ PUT_HEVC_BI_EPEL_H16_LASX xr10, xr10, xr2, xr3, xr13
+ xvssrarni.bu.h xr12, xr11, 7
+ xvpermi.d xr12, xr12, 0xd8
+ xvpermi.q xr14, xr13, 0x01
+ vssrarni.bu.h vr14, vr13, 7
+ xvst xr12, a0, 0
+ vst vr14, a0, 32
+ add.d a2, a2, a3
+ addi.d a4, a4, 128
+ add.d a0, a0, a1
+ addi.d a5, a5, -1
+ bnez a5, .LOOP_BI_EPEL_H48_LASX
+endfunc
+
+function ff_hevc_put_hevc_bi_epel_h64_8_lsx
+ addi.d a6, a6, -1
+ slli.w a6, a6, 2
+ la.local t0, ff_hevc_epel_filters
+ vldx vr0, t0, a6 // filter
+ vreplvei.h vr1, vr0, 1
+ vreplvei.h vr0, vr0, 0
+ la.local t0, shufb
+ vld vr2, t0, 48 // mask
+ vaddi.bu vr3, vr2, 2
+ vaddi.bu vr21, vr2, 8
+ vaddi.bu vr22, vr2, 10
+ addi.d a2, a2, -1 // src -= 1
+.LOOP_BI_EPEL_H64:
+ vld vr4, a4, 0 // src2
+ vld vr5, a2, 0
+ vld vr9, a2, 16
+ vld vr10, a2, 32
+ vld vr11, a2, 48
+ vld vr12, a2, 64
+ PUT_HEVC_BI_EPEL_H8_LSX vr5, vr5, vr2, vr3, vr13
+ vld vr4, a4, 16
+ PUT_HEVC_BI_EPEL_H8_LSX vr5, vr9, vr21, vr22, vr14
+ vld vr4, a4, 32
+ PUT_HEVC_BI_EPEL_H8_LSX vr9, vr9, vr2, vr3, vr15
+ vld vr4, a4, 48
+ PUT_HEVC_BI_EPEL_H8_LSX vr9, vr10, vr21, vr22, vr16
+ vld vr4, a4, 64
+ PUT_HEVC_BI_EPEL_H8_LSX vr10, vr10, vr2, vr3, vr17
+ vld vr4, a4, 80
+ PUT_HEVC_BI_EPEL_H8_LSX vr10, vr11, vr21, vr22, vr18
+ vld vr4, a4, 96
+ PUT_HEVC_BI_EPEL_H8_LSX vr11, vr11, vr2, vr3, vr19
+ vld vr4, a4, 112
+ PUT_HEVC_BI_EPEL_H8_LSX vr11, vr12, vr21, vr22, vr20
+ vssrarni.bu.h vr14, vr13, 7
+ vssrarni.bu.h vr16, vr15, 7
+ vssrarni.bu.h vr18, vr17, 7
+ vssrarni.bu.h vr20, vr19, 7
+ vst vr14, a0, 0
+ vst vr16, a0, 16
+ vst vr18, a0, 32
+ vst vr20, a0, 48
+ add.d a2, a2, a3
+ addi.d a4, a4, 128
+ add.d a0, a0, a1
+ addi.d a5, a5, -1
+ bnez a5, .LOOP_BI_EPEL_H64
+endfunc
+
+function ff_hevc_put_hevc_bi_epel_h64_8_lasx
+ addi.d a6, a6, -1
+ slli.w a6, a6, 2
+ la.local t0, ff_hevc_epel_filters
+ vldx vr0, t0, a6 // filter
+ xvreplve0.q xr0, xr0
+ xvrepl128vei.h xr1, xr0, 1
+ xvrepl128vei.h xr0, xr0, 0
+ la.local t0, shufb
+ xvld xr2, t0, 96 // mask
+ xvaddi.bu xr3, xr2, 2
+ addi.d a2, a2, -1 // src -= 1
+.LOOP_BI_EPEL_H64_LASX:
+ xvld xr4, a4, 0 // src2
+ xvld xr5, a2, 0
+ xvld xr9, a2, 32
+ xvld xr11, a2, 48
+ xvpermi.d xr11, xr11, 0x94
+ xvpermi.d xr10, xr9, 0x94
+ xvpermi.q xr9, xr5, 0x21
+ xvpermi.d xr9, xr9, 0x94
+ xvpermi.d xr5, xr5, 0x94
+ PUT_HEVC_BI_EPEL_H16_LASX xr5, xr5, xr2, xr3, xr12
+ xvld xr4, a4, 32
+ PUT_HEVC_BI_EPEL_H16_LASX xr9, xr9, xr2, xr3, xr13
+ xvld xr4, a4, 64
+ PUT_HEVC_BI_EPEL_H16_LASX xr10, xr10, xr2, xr3, xr14
+ xvld xr4, a4, 96
+ PUT_HEVC_BI_EPEL_H16_LASX xr11, xr11, xr2, xr3, xr15
+ xvssrarni.bu.h xr13, xr12, 7
+ xvssrarni.bu.h xr15, xr14, 7
+ xvpermi.d xr13, xr13, 0xd8
+ xvpermi.d xr15, xr15, 0xd8
+ xvst xr13, a0, 0
+ xvst xr15, a0, 32
+ add.d a2, a2, a3
+ addi.d a4, a4, 128
+ add.d a0, a0, a1
+ addi.d a5, a5, -1
+ bnez a5, .LOOP_BI_EPEL_H64_LASX
+endfunc
diff --git a/libavcodec/loongarch/hevcdsp_init_loongarch.c b/libavcodec/loongarch/hevcdsp_init_loongarch.c
index 245a833947..2756755733 100644
--- a/libavcodec/loongarch/hevcdsp_init_loongarch.c
+++ b/libavcodec/loongarch/hevcdsp_init_loongarch.c
@@ -124,8 +124,15 @@ void ff_hevc_dsp_init_loongarch(HEVCDSPContext *c, const int bit_depth)
c->put_hevc_qpel_bi[8][0][1] = ff_hevc_put_hevc_bi_qpel_h48_8_lsx;
c->put_hevc_qpel_bi[9][0][1] = ff_hevc_put_hevc_bi_qpel_h64_8_lsx;
+ c->put_hevc_epel_bi[1][0][1] = ff_hevc_put_hevc_bi_epel_h4_8_lsx;
+ c->put_hevc_epel_bi[2][0][1] = ff_hevc_put_hevc_bi_epel_h6_8_lsx;
+ c->put_hevc_epel_bi[3][0][1] = ff_hevc_put_hevc_bi_epel_h8_8_lsx;
+ c->put_hevc_epel_bi[4][0][1] = ff_hevc_put_hevc_bi_epel_h12_8_lsx;
+ c->put_hevc_epel_bi[5][0][1] = ff_hevc_put_hevc_bi_epel_h16_8_lsx;
c->put_hevc_epel_bi[6][0][1] = ff_hevc_put_hevc_bi_epel_h24_8_lsx;
c->put_hevc_epel_bi[7][0][1] = ff_hevc_put_hevc_bi_epel_h32_8_lsx;
+ c->put_hevc_epel_bi[8][0][1] = ff_hevc_put_hevc_bi_epel_h48_8_lsx;
+ c->put_hevc_epel_bi[9][0][1] = ff_hevc_put_hevc_bi_epel_h64_8_lsx;
c->put_hevc_epel_bi[4][1][0] = ff_hevc_put_hevc_bi_epel_v12_8_lsx;
c->put_hevc_epel_bi[5][1][0] = ff_hevc_put_hevc_bi_epel_v16_8_lsx;
@@ -138,6 +145,14 @@ void ff_hevc_dsp_init_loongarch(HEVCDSPContext *c, const int bit_depth)
c->put_hevc_epel_bi[6][1][1] = ff_hevc_put_hevc_bi_epel_hv24_8_lsx;
c->put_hevc_epel_bi[7][1][1] = ff_hevc_put_hevc_bi_epel_hv32_8_lsx;
+ c->put_hevc_qpel_uni[1][0][1] = ff_hevc_put_hevc_uni_qpel_h4_8_lsx;
+ c->put_hevc_qpel_uni[2][0][1] = ff_hevc_put_hevc_uni_qpel_h6_8_lsx;
+ c->put_hevc_qpel_uni[3][0][1] = ff_hevc_put_hevc_uni_qpel_h8_8_lsx;
+ c->put_hevc_qpel_uni[4][0][1] = ff_hevc_put_hevc_uni_qpel_h12_8_lsx;
+ c->put_hevc_qpel_uni[5][0][1] = ff_hevc_put_hevc_uni_qpel_h16_8_lsx;
+ c->put_hevc_qpel_uni[6][0][1] = ff_hevc_put_hevc_uni_qpel_h24_8_lsx;
+ c->put_hevc_qpel_uni[7][0][1] = ff_hevc_put_hevc_uni_qpel_h32_8_lsx;
+ c->put_hevc_qpel_uni[8][0][1] = ff_hevc_put_hevc_uni_qpel_h48_8_lsx;
c->put_hevc_qpel_uni[9][0][1] = ff_hevc_put_hevc_uni_qpel_h64_8_lsx;
c->put_hevc_qpel_uni[6][1][0] = ff_hevc_put_hevc_uni_qpel_v24_8_lsx;
@@ -191,6 +206,26 @@ void ff_hevc_dsp_init_loongarch(HEVCDSPContext *c, const int bit_depth)
c->put_hevc_epel_uni_w[8][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels48_8_lsx;
c->put_hevc_epel_uni_w[9][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels64_8_lsx;
+ c->put_hevc_epel_uni_w[1][0][1] = ff_hevc_put_hevc_epel_uni_w_h4_8_lsx;
+ c->put_hevc_epel_uni_w[2][0][1] = ff_hevc_put_hevc_epel_uni_w_h6_8_lsx;
+ c->put_hevc_epel_uni_w[3][0][1] = ff_hevc_put_hevc_epel_uni_w_h8_8_lsx;
+ c->put_hevc_epel_uni_w[4][0][1] = ff_hevc_put_hevc_epel_uni_w_h12_8_lsx;
+ c->put_hevc_epel_uni_w[5][0][1] = ff_hevc_put_hevc_epel_uni_w_h16_8_lsx;
+ c->put_hevc_epel_uni_w[6][0][1] = ff_hevc_put_hevc_epel_uni_w_h24_8_lsx;
+ c->put_hevc_epel_uni_w[7][0][1] = ff_hevc_put_hevc_epel_uni_w_h32_8_lsx;
+ c->put_hevc_epel_uni_w[8][0][1] = ff_hevc_put_hevc_epel_uni_w_h48_8_lsx;
+ c->put_hevc_epel_uni_w[9][0][1] = ff_hevc_put_hevc_epel_uni_w_h64_8_lsx;
+
+ c->put_hevc_epel_uni_w[1][1][0] = ff_hevc_put_hevc_epel_uni_w_v4_8_lsx;
+ c->put_hevc_epel_uni_w[2][1][0] = ff_hevc_put_hevc_epel_uni_w_v6_8_lsx;
+ c->put_hevc_epel_uni_w[3][1][0] = ff_hevc_put_hevc_epel_uni_w_v8_8_lsx;
+ c->put_hevc_epel_uni_w[4][1][0] = ff_hevc_put_hevc_epel_uni_w_v12_8_lsx;
+ c->put_hevc_epel_uni_w[5][1][0] = ff_hevc_put_hevc_epel_uni_w_v16_8_lsx;
+ c->put_hevc_epel_uni_w[6][1][0] = ff_hevc_put_hevc_epel_uni_w_v24_8_lsx;
+ c->put_hevc_epel_uni_w[7][1][0] = ff_hevc_put_hevc_epel_uni_w_v32_8_lsx;
+ c->put_hevc_epel_uni_w[8][1][0] = ff_hevc_put_hevc_epel_uni_w_v48_8_lsx;
+ c->put_hevc_epel_uni_w[9][1][0] = ff_hevc_put_hevc_epel_uni_w_v64_8_lsx;
+
c->put_hevc_qpel_uni_w[3][1][1] = ff_hevc_put_hevc_uni_w_qpel_hv8_8_lsx;
c->put_hevc_qpel_uni_w[5][1][1] = ff_hevc_put_hevc_uni_w_qpel_hv16_8_lsx;
c->put_hevc_qpel_uni_w[6][1][1] = ff_hevc_put_hevc_uni_w_qpel_hv24_8_lsx;
@@ -277,6 +312,15 @@ void ff_hevc_dsp_init_loongarch(HEVCDSPContext *c, const int bit_depth)
c->put_hevc_epel_uni_w[8][1][1] = ff_hevc_put_hevc_epel_uni_w_hv48_8_lasx;
c->put_hevc_epel_uni_w[9][1][1] = ff_hevc_put_hevc_epel_uni_w_hv64_8_lasx;
+ c->put_hevc_epel_uni_w[2][0][1] = ff_hevc_put_hevc_epel_uni_w_h6_8_lasx;
+ c->put_hevc_epel_uni_w[3][0][1] = ff_hevc_put_hevc_epel_uni_w_h8_8_lasx;
+ c->put_hevc_epel_uni_w[4][0][1] = ff_hevc_put_hevc_epel_uni_w_h12_8_lasx;
+ c->put_hevc_epel_uni_w[5][0][1] = ff_hevc_put_hevc_epel_uni_w_h16_8_lasx;
+ c->put_hevc_epel_uni_w[6][0][1] = ff_hevc_put_hevc_epel_uni_w_h24_8_lasx;
+ c->put_hevc_epel_uni_w[7][0][1] = ff_hevc_put_hevc_epel_uni_w_h32_8_lasx;
+ c->put_hevc_epel_uni_w[8][0][1] = ff_hevc_put_hevc_epel_uni_w_h48_8_lasx;
+ c->put_hevc_epel_uni_w[9][0][1] = ff_hevc_put_hevc_epel_uni_w_h64_8_lasx;
+
c->put_hevc_qpel_uni_w[3][1][0] = ff_hevc_put_hevc_qpel_uni_w_v8_8_lasx;
c->put_hevc_qpel_uni_w[4][1][0] = ff_hevc_put_hevc_qpel_uni_w_v12_8_lasx;
c->put_hevc_qpel_uni_w[5][1][0] = ff_hevc_put_hevc_qpel_uni_w_v16_8_lasx;
@@ -285,6 +329,15 @@ void ff_hevc_dsp_init_loongarch(HEVCDSPContext *c, const int bit_depth)
c->put_hevc_qpel_uni_w[8][1][0] = ff_hevc_put_hevc_qpel_uni_w_v48_8_lasx;
c->put_hevc_qpel_uni_w[9][1][0] = ff_hevc_put_hevc_qpel_uni_w_v64_8_lasx;
+ c->put_hevc_epel_uni_w[2][1][0] = ff_hevc_put_hevc_epel_uni_w_v6_8_lasx;
+ c->put_hevc_epel_uni_w[3][1][0] = ff_hevc_put_hevc_epel_uni_w_v8_8_lasx;
+ c->put_hevc_epel_uni_w[4][1][0] = ff_hevc_put_hevc_epel_uni_w_v12_8_lasx;
+ c->put_hevc_epel_uni_w[5][1][0] = ff_hevc_put_hevc_epel_uni_w_v16_8_lasx;
+ c->put_hevc_epel_uni_w[6][1][0] = ff_hevc_put_hevc_epel_uni_w_v24_8_lasx;
+ c->put_hevc_epel_uni_w[7][1][0] = ff_hevc_put_hevc_epel_uni_w_v32_8_lasx;
+ c->put_hevc_epel_uni_w[8][1][0] = ff_hevc_put_hevc_epel_uni_w_v48_8_lasx;
+ c->put_hevc_epel_uni_w[9][1][0] = ff_hevc_put_hevc_epel_uni_w_v64_8_lasx;
+
c->put_hevc_qpel_uni_w[1][0][1] = ff_hevc_put_hevc_qpel_uni_w_h4_8_lasx;
c->put_hevc_qpel_uni_w[2][0][1] = ff_hevc_put_hevc_qpel_uni_w_h6_8_lasx;
c->put_hevc_qpel_uni_w[3][0][1] = ff_hevc_put_hevc_qpel_uni_w_h8_8_lasx;
@@ -294,6 +347,19 @@ void ff_hevc_dsp_init_loongarch(HEVCDSPContext *c, const int bit_depth)
c->put_hevc_qpel_uni_w[7][0][1] = ff_hevc_put_hevc_qpel_uni_w_h32_8_lasx;
c->put_hevc_qpel_uni_w[8][0][1] = ff_hevc_put_hevc_qpel_uni_w_h48_8_lasx;
c->put_hevc_qpel_uni_w[9][0][1] = ff_hevc_put_hevc_qpel_uni_w_h64_8_lasx;
+
+ c->put_hevc_qpel_uni[4][0][1] = ff_hevc_put_hevc_uni_qpel_h12_8_lasx;
+ c->put_hevc_qpel_uni[5][0][1] = ff_hevc_put_hevc_uni_qpel_h16_8_lasx;
+ c->put_hevc_qpel_uni[6][0][1] = ff_hevc_put_hevc_uni_qpel_h24_8_lasx;
+ c->put_hevc_qpel_uni[7][0][1] = ff_hevc_put_hevc_uni_qpel_h32_8_lasx;
+ c->put_hevc_qpel_uni[8][0][1] = ff_hevc_put_hevc_uni_qpel_h48_8_lasx;
+ c->put_hevc_qpel_uni[9][0][1] = ff_hevc_put_hevc_uni_qpel_h64_8_lasx;
+
+ c->put_hevc_epel_bi[4][0][1] = ff_hevc_put_hevc_bi_epel_h12_8_lasx;
+ c->put_hevc_epel_bi[5][0][1] = ff_hevc_put_hevc_bi_epel_h16_8_lasx;
+ c->put_hevc_epel_bi[7][0][1] = ff_hevc_put_hevc_bi_epel_h32_8_lasx;
+ c->put_hevc_epel_bi[8][0][1] = ff_hevc_put_hevc_bi_epel_h48_8_lasx;
+ c->put_hevc_epel_bi[9][0][1] = ff_hevc_put_hevc_bi_epel_h64_8_lasx;
}
}
}
diff --git a/libavcodec/loongarch/hevcdsp_lasx.h b/libavcodec/loongarch/hevcdsp_lasx.h
index 7f09d0943a..5db35eed47 100644
--- a/libavcodec/loongarch/hevcdsp_lasx.h
+++ b/libavcodec/loongarch/hevcdsp_lasx.h
@@ -75,6 +75,60 @@ PEL_UNI_W(epel, hv, 32);
PEL_UNI_W(epel, hv, 48);
PEL_UNI_W(epel, hv, 64);
+PEL_UNI_W(epel, v, 6);
+PEL_UNI_W(epel, v, 8);
+PEL_UNI_W(epel, v, 12);
+PEL_UNI_W(epel, v, 16);
+PEL_UNI_W(epel, v, 24);
+PEL_UNI_W(epel, v, 32);
+PEL_UNI_W(epel, v, 48);
+PEL_UNI_W(epel, v, 64);
+
+PEL_UNI_W(epel, h, 6);
+PEL_UNI_W(epel, h, 8);
+PEL_UNI_W(epel, h, 12);
+PEL_UNI_W(epel, h, 16);
+PEL_UNI_W(epel, h, 24);
+PEL_UNI_W(epel, h, 32);
+PEL_UNI_W(epel, h, 48);
+PEL_UNI_W(epel, h, 64);
+
#undef PEL_UNI_W
+#define UNI_MC(PEL, DIR, WIDTH) \
+void ff_hevc_put_hevc_uni_##PEL##_##DIR##WIDTH##_8_lasx(uint8_t *dst, \
+ ptrdiff_t dst_stride, \
+ const uint8_t *src, \
+ ptrdiff_t src_stride, \
+ int height, \
+ intptr_t mx, \
+ intptr_t my, \
+ int width)
+UNI_MC(qpel, h, 12);
+UNI_MC(qpel, h, 16);
+UNI_MC(qpel, h, 24);
+UNI_MC(qpel, h, 32);
+UNI_MC(qpel, h, 48);
+UNI_MC(qpel, h, 64);
+
+#undef UNI_MC
+
+#define BI_MC(PEL, DIR, WIDTH) \
+void ff_hevc_put_hevc_bi_##PEL##_##DIR##WIDTH##_8_lasx(uint8_t *dst, \
+ ptrdiff_t dst_stride, \
+ const uint8_t *src, \
+ ptrdiff_t src_stride, \
+ const int16_t *src_16bit, \
+ int height, \
+ intptr_t mx, \
+ intptr_t my, \
+ int width)
+BI_MC(epel, h, 12);
+BI_MC(epel, h, 16);
+BI_MC(epel, h, 32);
+BI_MC(epel, h, 48);
+BI_MC(epel, h, 64);
+
+#undef BI_MC
+
#endif // #ifndef AVCODEC_LOONGARCH_HEVCDSP_LASX_H
diff --git a/libavcodec/loongarch/hevcdsp_lsx.h b/libavcodec/loongarch/hevcdsp_lsx.h
index 7769cf25ae..a5ef237b5d 100644
--- a/libavcodec/loongarch/hevcdsp_lsx.h
+++ b/libavcodec/loongarch/hevcdsp_lsx.h
@@ -126,8 +126,15 @@ BI_MC(qpel, hv, 32);
BI_MC(qpel, hv, 48);
BI_MC(qpel, hv, 64);
+BI_MC(epel, h, 4);
+BI_MC(epel, h, 6);
+BI_MC(epel, h, 8);
+BI_MC(epel, h, 12);
+BI_MC(epel, h, 16);
BI_MC(epel, h, 24);
BI_MC(epel, h, 32);
+BI_MC(epel, h, 48);
+BI_MC(epel, h, 64);
BI_MC(epel, v, 12);
BI_MC(epel, v, 16);
@@ -151,7 +158,14 @@ void ff_hevc_put_hevc_uni_##PEL##_##DIR##WIDTH##_8_lsx(uint8_t *dst, \
intptr_t mx, \
intptr_t my, \
int width)
-
+UNI_MC(qpel, h, 4);
+UNI_MC(qpel, h, 6);
+UNI_MC(qpel, h, 8);
+UNI_MC(qpel, h, 12);
+UNI_MC(qpel, h, 16);
+UNI_MC(qpel, h, 24);
+UNI_MC(qpel, h, 32);
+UNI_MC(qpel, h, 48);
UNI_MC(qpel, h, 64);
UNI_MC(qpel, v, 24);
@@ -287,6 +301,26 @@ PEL_UNI_W(epel, hv, 32);
PEL_UNI_W(epel, hv, 48);
PEL_UNI_W(epel, hv, 64);
+PEL_UNI_W(epel, h, 4);
+PEL_UNI_W(epel, h, 6);
+PEL_UNI_W(epel, h, 8);
+PEL_UNI_W(epel, h, 12);
+PEL_UNI_W(epel, h, 16);
+PEL_UNI_W(epel, h, 24);
+PEL_UNI_W(epel, h, 32);
+PEL_UNI_W(epel, h, 48);
+PEL_UNI_W(epel, h, 64);
+
+PEL_UNI_W(epel, v, 4);
+PEL_UNI_W(epel, v, 6);
+PEL_UNI_W(epel, v, 8);
+PEL_UNI_W(epel, v, 12);
+PEL_UNI_W(epel, v, 16);
+PEL_UNI_W(epel, v, 24);
+PEL_UNI_W(epel, v, 32);
+PEL_UNI_W(epel, v, 48);
+PEL_UNI_W(epel, v, 64);
+
#undef PEL_UNI_W
#endif // #ifndef AVCODEC_LOONGARCH_HEVCDSP_LSX_H
--
2.20.1