* [FFmpeg-devel] Optimize Mpeg4 decoding for loongarch
@ 2021-12-29 10:18 Hao Chen
2021-12-29 10:18 ` [FFmpeg-devel] [PATCH v3 1/3] avcodec: [loongarch] Optimize hpeldsp with LASX Hao Chen
` (3 more replies)
0 siblings, 4 replies; 7+ messages in thread
From: Hao Chen @ 2021-12-29 10:18 UTC (permalink / raw)
To: ffmpeg-devel
./ffmpeg -i 8_mpeg4_1080p_24fps_12Mbps.avi -f rawvideo -y /dev/null -an
before: 376fps
after:  552fps
V2: Revised PATCH 1/3 according to the review comments.
V3: Resubmitted the series because PATCH v2 1/3 was missing.
[PATCH v3 1/3] avcodec: [loongarch] Optimize hpeldsp with LASX.
[PATCH v3 2/3] avcodec: [loongarch] Optimize idctdstp with LASX.
[PATCH v3 3/3] avcodec: [loongarch] Optimize prefetch with loongarch.
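In scalar form, the half-pel `put_pixels*_x2`/`_y2` operations that PATCH 1/3 vectorizes reduce to a per-byte average of two 8-bit predictions: the rounding variant adds 1 before the shift (what the `vavgr.bu` instruction computes per lane), while the `no_rnd` variants omit it. A minimal C sketch of both, with generic `width`/`h` parameters rather than FFmpeg's exact prototypes:

```c
#include <stdint.h>
#include <stddef.h>

/* Rounding half-pel average: dst = (a + b + 1) >> 1 per byte. */
static void put_pixels_l2_c(uint8_t *dst, const uint8_t *src1,
                            const uint8_t *src2, ptrdiff_t stride,
                            int width, int h)
{
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < width; x++)
            dst[x] = (src1[x] + src2[x] + 1) >> 1;
        dst  += stride;
        src1 += stride;
        src2 += stride;
    }
}

/* No-rounding variant used by put_no_rnd_pixels*: dst = (a + b) >> 1. */
static void put_no_rnd_pixels_l2_c(uint8_t *dst, const uint8_t *src1,
                                   const uint8_t *src2, ptrdiff_t stride,
                                   int width, int h)
{
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < width; x++)
            dst[x] = (src1[x] + src2[x]) >> 1;
        dst  += stride;
        src1 += stride;
        src2 += stride;
    }
}
```

For the x2 case the second prediction is `src1 + 1` (horizontal neighbour); for y2 it is `src1 + stride` (vertical neighbour), which is exactly how the wrappers in the patch call the shared l2 helpers.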
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
* [FFmpeg-devel] [PATCH v3 1/3] avcodec: [loongarch] Optimize hpeldsp with LASX.
2021-12-29 10:18 [FFmpeg-devel] Optimize Mpeg4 decoding for loongarch Hao Chen
@ 2021-12-29 10:18 ` Hao Chen
2021-12-29 10:18 ` [FFmpeg-devel] [PATCH v3 2/3] avcodec: [loongarch] Optimize idctdstp " Hao Chen
` (2 subsequent siblings)
3 siblings, 0 replies; 7+ messages in thread
From: Hao Chen @ 2021-12-29 10:18 UTC (permalink / raw)
To: ffmpeg-devel; +Cc: Shiyou Yin
From: Shiyou Yin <yinshiyou-hf@loongson.cn>
./ffmpeg -i 8_mpeg4_1080p_24fps_12Mbps.avi -f rawvideo -y /dev/null -an
before: 376fps
after:  433fps
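In scalar form, the xy2 (diagonal half-pel) case that the `common_hv_bil_no_rnd_*_lasx` kernels below implement is a 2D bilinear average over a 2x2 neighbourhood, using the reduced rounding constant 1 (the rounding variant would add 2 before the shift). A minimal C sketch, with generic `width`/`h` parameters rather than FFmpeg's exact prototype:

```c
#include <stdint.h>
#include <stddef.h>

/* No-rounding diagonal half-pel: each output byte is the average of a
 * 2x2 source neighbourhood, (a + b + c + d + 1) >> 2. */
static void put_no_rnd_pixels_xy2_c(uint8_t *dst, const uint8_t *src,
                                    ptrdiff_t stride, int width, int h)
{
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < width; x++)
            dst[x] = (src[x] + src[x + 1] +
                      src[x + stride] + src[x + stride + 1] + 1) >> 2;
        dst += stride;
        src += stride;
    }
}
```

The LASX version reaches the same result by pairing rows, widening bytes to halfwords with horizontal adds, adding the +1 bias, and narrowing back with a shift-right-and-pack by 2.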
---
libavcodec/hpeldsp.c | 2 +
libavcodec/hpeldsp.h | 1 +
libavcodec/loongarch/Makefile | 2 +
libavcodec/loongarch/hpeldsp_init_loongarch.c | 50 +
libavcodec/loongarch/hpeldsp_lasx.c | 1287 +++++++++++++++++
libavcodec/loongarch/hpeldsp_lasx.h | 58 +
6 files changed, 1400 insertions(+)
create mode 100644 libavcodec/loongarch/hpeldsp_init_loongarch.c
create mode 100644 libavcodec/loongarch/hpeldsp_lasx.c
create mode 100644 libavcodec/loongarch/hpeldsp_lasx.h
diff --git a/libavcodec/hpeldsp.c b/libavcodec/hpeldsp.c
index 8e2fd8fcf5..843ba399c5 100644
--- a/libavcodec/hpeldsp.c
+++ b/libavcodec/hpeldsp.c
@@ -367,4 +367,6 @@ av_cold void ff_hpeldsp_init(HpelDSPContext *c, int flags)
ff_hpeldsp_init_x86(c, flags);
if (ARCH_MIPS)
ff_hpeldsp_init_mips(c, flags);
+ if (ARCH_LOONGARCH64)
+ ff_hpeldsp_init_loongarch(c, flags);
}
diff --git a/libavcodec/hpeldsp.h b/libavcodec/hpeldsp.h
index 768139bfc9..45e81b10a5 100644
--- a/libavcodec/hpeldsp.h
+++ b/libavcodec/hpeldsp.h
@@ -102,5 +102,6 @@ void ff_hpeldsp_init_arm(HpelDSPContext *c, int flags);
void ff_hpeldsp_init_ppc(HpelDSPContext *c, int flags);
void ff_hpeldsp_init_x86(HpelDSPContext *c, int flags);
void ff_hpeldsp_init_mips(HpelDSPContext *c, int flags);
+void ff_hpeldsp_init_loongarch(HpelDSPContext *c, int flags);
#endif /* AVCODEC_HPELDSP_H */
diff --git a/libavcodec/loongarch/Makefile b/libavcodec/loongarch/Makefile
index baf5f92e84..07a401d883 100644
--- a/libavcodec/loongarch/Makefile
+++ b/libavcodec/loongarch/Makefile
@@ -5,6 +5,7 @@ OBJS-$(CONFIG_H264PRED) += loongarch/h264_intrapred_init_loongarch
OBJS-$(CONFIG_VP8_DECODER) += loongarch/vp8dsp_init_loongarch.o
OBJS-$(CONFIG_VP9_DECODER) += loongarch/vp9dsp_init_loongarch.o
OBJS-$(CONFIG_VC1DSP) += loongarch/vc1dsp_init_loongarch.o
+OBJS-$(CONFIG_HPELDSP) += loongarch/hpeldsp_init_loongarch.o
LASX-OBJS-$(CONFIG_H264CHROMA) += loongarch/h264chroma_lasx.o
LASX-OBJS-$(CONFIG_H264QPEL) += loongarch/h264qpel_lasx.o
LASX-OBJS-$(CONFIG_H264DSP) += loongarch/h264dsp_lasx.o \
@@ -12,6 +13,7 @@ LASX-OBJS-$(CONFIG_H264DSP) += loongarch/h264dsp_lasx.o \
loongarch/h264_deblock_lasx.o
LASX-OBJS-$(CONFIG_H264PRED) += loongarch/h264_intrapred_lasx.o
LASX-OBJS-$(CONFIG_VC1_DECODER) += loongarch/vc1dsp_lasx.o
+LASX-OBJS-$(CONFIG_HPELDSP) += loongarch/hpeldsp_lasx.o
LSX-OBJS-$(CONFIG_VP8_DECODER) += loongarch/vp8_mc_lsx.o \
loongarch/vp8_lpf_lsx.o
LSX-OBJS-$(CONFIG_VP9_DECODER) += loongarch/vp9_mc_lsx.o \
diff --git a/libavcodec/loongarch/hpeldsp_init_loongarch.c b/libavcodec/loongarch/hpeldsp_init_loongarch.c
new file mode 100644
index 0000000000..1690be5438
--- /dev/null
+++ b/libavcodec/loongarch/hpeldsp_init_loongarch.c
@@ -0,0 +1,50 @@
+/*
+ * Copyright (c) 2021 Loongson Technology Corporation Limited
+ * Contributed by Shiyou Yin <yinshiyou-hf@loongson.cn>
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "libavutil/loongarch/cpu.h"
+#include "libavcodec/hpeldsp.h"
+#include "libavcodec/loongarch/hpeldsp_lasx.h"
+
+void ff_hpeldsp_init_loongarch(HpelDSPContext *c, int flags)
+{
+ int cpu_flags = av_get_cpu_flags();
+
+ if (have_lasx(cpu_flags)) {
+ c->put_pixels_tab[0][0] = ff_put_pixels16_8_lsx;
+ c->put_pixels_tab[0][1] = ff_put_pixels16_x2_8_lasx;
+ c->put_pixels_tab[0][2] = ff_put_pixels16_y2_8_lasx;
+ c->put_pixels_tab[0][3] = ff_put_pixels16_xy2_8_lasx;
+
+ c->put_pixels_tab[1][0] = ff_put_pixels8_8_lasx;
+ c->put_pixels_tab[1][1] = ff_put_pixels8_x2_8_lasx;
+ c->put_pixels_tab[1][2] = ff_put_pixels8_y2_8_lasx;
+ c->put_pixels_tab[1][3] = ff_put_pixels8_xy2_8_lasx;
+ c->put_no_rnd_pixels_tab[0][0] = ff_put_pixels16_8_lsx;
+ c->put_no_rnd_pixels_tab[0][1] = ff_put_no_rnd_pixels16_x2_8_lasx;
+ c->put_no_rnd_pixels_tab[0][2] = ff_put_no_rnd_pixels16_y2_8_lasx;
+ c->put_no_rnd_pixels_tab[0][3] = ff_put_no_rnd_pixels16_xy2_8_lasx;
+
+ c->put_no_rnd_pixels_tab[1][0] = ff_put_pixels8_8_lasx;
+ c->put_no_rnd_pixels_tab[1][1] = ff_put_no_rnd_pixels8_x2_8_lasx;
+ c->put_no_rnd_pixels_tab[1][2] = ff_put_no_rnd_pixels8_y2_8_lasx;
+ c->put_no_rnd_pixels_tab[1][3] = ff_put_no_rnd_pixels8_xy2_8_lasx;
+ }
+}
diff --git a/libavcodec/loongarch/hpeldsp_lasx.c b/libavcodec/loongarch/hpeldsp_lasx.c
new file mode 100644
index 0000000000..dd2ae173da
--- /dev/null
+++ b/libavcodec/loongarch/hpeldsp_lasx.c
@@ -0,0 +1,1287 @@
+/*
+ * Copyright (c) 2021 Loongson Technology Corporation Limited
+ * Contributed by Shiyou Yin <yinshiyou-hf@loongson.cn>
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "libavutil/loongarch/loongson_intrinsics.h"
+#include "hpeldsp_lasx.h"
+
+static av_always_inline void
+put_pixels8_l2_8_lsx(uint8_t *dst, const uint8_t *src1, const uint8_t *src2,
+ int dst_stride, int src_stride1, int src_stride2, int h)
+{
+ int stride1_2, stride1_3, stride1_4;
+ int stride2_2, stride2_3, stride2_4;
+ __asm__ volatile (
+ "slli.d %[stride1_2], %[srcStride1], 1 \n\t"
+ "slli.d %[stride2_2], %[srcStride2], 1 \n\t"
+ "add.d %[stride1_3], %[stride1_2], %[srcStride1] \n\t"
+ "add.d %[stride2_3], %[stride2_2], %[srcStride2] \n\t"
+ "slli.d %[stride1_4], %[stride1_2], 1 \n\t"
+ "slli.d %[stride2_4], %[stride2_2], 1 \n\t"
+ "1: \n\t"
+ "vld $vr0, %[src1], 0 \n\t"
+ "vldx $vr1, %[src1], %[srcStride1] \n\t"
+ "vldx $vr2, %[src1], %[stride1_2] \n\t"
+ "vldx $vr3, %[src1], %[stride1_3] \n\t"
+ "add.d %[src1], %[src1], %[stride1_4] \n\t"
+
+ "vld $vr4, %[src2], 0 \n\t"
+ "vldx $vr5, %[src2], %[srcStride2] \n\t"
+ "vldx $vr6, %[src2], %[stride2_2] \n\t"
+ "vldx $vr7, %[src2], %[stride2_3] \n\t"
+ "add.d %[src2], %[src2], %[stride2_4] \n\t"
+
+ "addi.d %[h], %[h], -4 \n\t"
+
+ "vavgr.bu $vr0, $vr4, $vr0 \n\t"
+ "vavgr.bu $vr1, $vr5, $vr1 \n\t"
+ "vavgr.bu $vr2, $vr6, $vr2 \n\t"
+ "vavgr.bu $vr3, $vr7, $vr3 \n\t"
+ "vstelm.d $vr0, %[dst], 0, 0 \n\t"
+ "add.d %[dst], %[dst], %[dstStride] \n\t"
+ "vstelm.d $vr1, %[dst], 0, 0 \n\t"
+ "add.d %[dst], %[dst], %[dstStride] \n\t"
+ "vstelm.d $vr2, %[dst], 0, 0 \n\t"
+ "add.d %[dst], %[dst], %[dstStride] \n\t"
+ "vstelm.d $vr3, %[dst], 0, 0 \n\t"
+ "add.d %[dst], %[dst], %[dstStride] \n\t"
+ "bnez %[h], 1b \n\t"
+
+ : [dst]"+&r"(dst), [src2]"+&r"(src2), [src1]"+&r"(src1),
+ [h]"+&r"(h), [stride1_2]"=&r"(stride1_2),
+ [stride1_3]"=&r"(stride1_3), [stride1_4]"=&r"(stride1_4),
+ [stride2_2]"=&r"(stride2_2), [stride2_3]"=&r"(stride2_3),
+ [stride2_4]"=&r"(stride2_4)
+ : [dstStride]"r"(dst_stride), [srcStride1]"r"(src_stride1),
+ [srcStride2]"r"(src_stride2)
+ : "memory"
+ );
+}
+
+static av_always_inline void
+put_pixels16_l2_8_lsx(uint8_t *dst, const uint8_t *src1, const uint8_t *src2,
+ int dst_stride, int src_stride1, int src_stride2, int h)
+{
+ int stride1_2, stride1_3, stride1_4;
+ int stride2_2, stride2_3, stride2_4;
+ int dststride2, dststride3, dststride4;
+ __asm__ volatile (
+ "slli.d %[stride1_2], %[srcStride1], 1 \n\t"
+ "slli.d %[stride2_2], %[srcStride2], 1 \n\t"
+ "slli.d %[dststride2], %[dstStride], 1 \n\t"
+ "add.d %[stride1_3], %[stride1_2], %[srcStride1] \n\t"
+ "add.d %[stride2_3], %[stride2_2], %[srcStride2] \n\t"
+ "add.d %[dststride3], %[dststride2], %[dstStride] \n\t"
+ "slli.d %[stride1_4], %[stride1_2], 1 \n\t"
+ "slli.d %[stride2_4], %[stride2_2], 1 \n\t"
+ "slli.d %[dststride4], %[dststride2], 1 \n\t"
+ "1: \n\t"
+ "vld $vr0, %[src1], 0 \n\t"
+ "vldx $vr1, %[src1], %[srcStride1] \n\t"
+ "vldx $vr2, %[src1], %[stride1_2] \n\t"
+ "vldx $vr3, %[src1], %[stride1_3] \n\t"
+ "add.d %[src1], %[src1], %[stride1_4] \n\t"
+
+ "vld $vr4, %[src2], 0 \n\t"
+ "vldx $vr5, %[src2], %[srcStride2] \n\t"
+ "vldx $vr6, %[src2], %[stride2_2] \n\t"
+ "vldx $vr7, %[src2], %[stride2_3] \n\t"
+ "add.d %[src2], %[src2], %[stride2_4] \n\t"
+
+ "addi.d %[h], %[h], -4 \n\t"
+
+ "vavgr.bu $vr0, $vr4, $vr0 \n\t"
+ "vavgr.bu $vr1, $vr5, $vr1 \n\t"
+ "vavgr.bu $vr2, $vr6, $vr2 \n\t"
+ "vavgr.bu $vr3, $vr7, $vr3 \n\t"
+ "vst $vr0, %[dst], 0 \n\t"
+ "vstx $vr1, %[dst], %[dstStride] \n\t"
+ "vstx $vr2, %[dst], %[dststride2] \n\t"
+ "vstx $vr3, %[dst], %[dststride3] \n\t"
+ "add.d %[dst], %[dst], %[dststride4] \n\t"
+ "bnez %[h], 1b \n\t"
+
+ : [dst]"+&r"(dst), [src2]"+&r"(src2), [src1]"+&r"(src1),
+ [h]"+&r"(h), [stride1_2]"=&r"(stride1_2),
+ [stride1_3]"=&r"(stride1_3), [stride1_4]"=&r"(stride1_4),
+ [stride2_2]"=&r"(stride2_2), [stride2_3]"=&r"(stride2_3),
+ [stride2_4]"=&r"(stride2_4), [dststride2]"=&r"(dststride2),
+ [dststride3]"=&r"(dststride3), [dststride4]"=&r"(dststride4)
+ : [dstStride]"r"(dst_stride), [srcStride1]"r"(src_stride1),
+ [srcStride2]"r"(src_stride2)
+ : "memory"
+ );
+}
+
+void ff_put_pixels8_8_lasx(uint8_t *block, const uint8_t *pixels,
+ ptrdiff_t line_size, int h)
+{
+ uint64_t tmp[8];
+ int h_8 = h >> 3;
+ int res = h & 7;
+ ptrdiff_t stride2, stride3, stride4;
+
+ __asm__ volatile (
+ "beqz %[h_8], 2f \n\t"
+ "slli.d %[stride2], %[stride], 1 \n\t"
+ "add.d %[stride3], %[stride2], %[stride] \n\t"
+ "slli.d %[stride4], %[stride2], 1 \n\t"
+ "1: \n\t"
+ "ld.d %[tmp0], %[src], 0x0 \n\t"
+ "ldx.d %[tmp1], %[src], %[stride] \n\t"
+ "ldx.d %[tmp2], %[src], %[stride2] \n\t"
+ "ldx.d %[tmp3], %[src], %[stride3] \n\t"
+ "add.d %[src], %[src], %[stride4] \n\t"
+ "ld.d %[tmp4], %[src], 0x0 \n\t"
+ "ldx.d %[tmp5], %[src], %[stride] \n\t"
+ "ldx.d %[tmp6], %[src], %[stride2] \n\t"
+ "ldx.d %[tmp7], %[src], %[stride3] \n\t"
+ "add.d %[src], %[src], %[stride4] \n\t"
+
+ "addi.d %[h_8], %[h_8], -1 \n\t"
+
+ "st.d %[tmp0], %[dst], 0x0 \n\t"
+ "stx.d %[tmp1], %[dst], %[stride] \n\t"
+ "stx.d %[tmp2], %[dst], %[stride2] \n\t"
+ "stx.d %[tmp3], %[dst], %[stride3] \n\t"
+ "add.d %[dst], %[dst], %[stride4] \n\t"
+ "st.d %[tmp4], %[dst], 0x0 \n\t"
+ "stx.d %[tmp5], %[dst], %[stride] \n\t"
+ "stx.d %[tmp6], %[dst], %[stride2] \n\t"
+ "stx.d %[tmp7], %[dst], %[stride3] \n\t"
+ "add.d %[dst], %[dst], %[stride4] \n\t"
+ "bnez %[h_8], 1b \n\t"
+
+ "2: \n\t"
+ "beqz %[res], 4f \n\t"
+ "3: \n\t"
+ "ld.d %[tmp0], %[src], 0x0 \n\t"
+ "add.d %[src], %[src], %[stride] \n\t"
+ "addi.d %[res], %[res], -1 \n\t"
+ "st.d %[tmp0], %[dst], 0x0 \n\t"
+ "add.d %[dst], %[dst], %[stride] \n\t"
+ "bnez %[res], 3b \n\t"
+ "4: \n\t"
+ : [tmp0]"=&r"(tmp[0]), [tmp1]"=&r"(tmp[1]),
+ [tmp2]"=&r"(tmp[2]), [tmp3]"=&r"(tmp[3]),
+ [tmp4]"=&r"(tmp[4]), [tmp5]"=&r"(tmp[5]),
+ [tmp6]"=&r"(tmp[6]), [tmp7]"=&r"(tmp[7]),
+ [dst]"+&r"(block), [src]"+&r"(pixels),
+ [h_8]"+&r"(h_8), [res]"+&r"(res),
+ [stride2]"=&r"(stride2), [stride3]"=&r"(stride3),
+ [stride4]"=&r"(stride4)
+ : [stride]"r"(line_size)
+ : "memory"
+ );
+}
+
+void ff_put_pixels16_8_lsx(uint8_t *block, const uint8_t *pixels,
+ ptrdiff_t line_size, int h)
+{
+ int h_8 = h >> 3;
+ int res = h & 7;
+ ptrdiff_t stride2, stride3, stride4;
+
+ __asm__ volatile (
+ "beqz %[h_8], 2f \n\t"
+ "slli.d %[stride2], %[stride], 1 \n\t"
+ "add.d %[stride3], %[stride2], %[stride] \n\t"
+ "slli.d %[stride4], %[stride2], 1 \n\t"
+ "1: \n\t"
+ "vld $vr0, %[src], 0x0 \n\t"
+ "vldx $vr1, %[src], %[stride] \n\t"
+ "vldx $vr2, %[src], %[stride2] \n\t"
+ "vldx $vr3, %[src], %[stride3] \n\t"
+ "add.d %[src], %[src], %[stride4] \n\t"
+ "vld $vr4, %[src], 0x0 \n\t"
+ "vldx $vr5, %[src], %[stride] \n\t"
+ "vldx $vr6, %[src], %[stride2] \n\t"
+ "vldx $vr7, %[src], %[stride3] \n\t"
+ "add.d %[src], %[src], %[stride4] \n\t"
+
+ "addi.d %[h_8], %[h_8], -1 \n\t"
+
+ "vst $vr0, %[dst], 0x0 \n\t"
+ "vstx $vr1, %[dst], %[stride] \n\t"
+ "vstx $vr2, %[dst], %[stride2] \n\t"
+ "vstx $vr3, %[dst], %[stride3] \n\t"
+ "add.d %[dst], %[dst], %[stride4] \n\t"
+ "vst $vr4, %[dst], 0x0 \n\t"
+ "vstx $vr5, %[dst], %[stride] \n\t"
+ "vstx $vr6, %[dst], %[stride2] \n\t"
+ "vstx $vr7, %[dst], %[stride3] \n\t"
+ "add.d %[dst], %[dst], %[stride4] \n\t"
+ "bnez %[h_8], 1b \n\t"
+
+ "2: \n\t"
+ "beqz %[res], 4f \n\t"
+ "3: \n\t"
+ "vld $vr0, %[src], 0x0 \n\t"
+ "add.d %[src], %[src], %[stride] \n\t"
+ "addi.d %[res], %[res], -1 \n\t"
+ "vst $vr0, %[dst], 0x0 \n\t"
+ "add.d %[dst], %[dst], %[stride] \n\t"
+ "bnez %[res], 3b \n\t"
+ "4: \n\t"
+ : [dst]"+&r"(block), [src]"+&r"(pixels),
+ [h_8]"+&r"(h_8), [res]"+&r"(res),
+ [stride2]"=&r"(stride2), [stride3]"=&r"(stride3),
+ [stride4]"=&r"(stride4)
+ : [stride]"r"(line_size)
+ : "memory"
+ );
+}
+
+void ff_put_pixels8_x2_8_lasx(uint8_t *block, const uint8_t *pixels,
+ ptrdiff_t line_size, int h)
+{
+ put_pixels8_l2_8_lsx(block, pixels, pixels + 1, line_size, line_size,
+ line_size, h);
+}
+
+void ff_put_pixels8_y2_8_lasx(uint8_t *block, const uint8_t *pixels,
+ ptrdiff_t line_size, int h)
+{
+ put_pixels8_l2_8_lsx(block, pixels, pixels + line_size, line_size,
+ line_size, line_size, h);
+}
+
+void ff_put_pixels16_x2_8_lasx(uint8_t *block, const uint8_t *pixels,
+ ptrdiff_t line_size, int h)
+{
+ put_pixels16_l2_8_lsx(block, pixels, pixels + 1, line_size, line_size,
+ line_size, h);
+}
+
+void ff_put_pixels16_y2_8_lasx(uint8_t *block, const uint8_t *pixels,
+ ptrdiff_t line_size, int h)
+{
+ put_pixels16_l2_8_lsx(block, pixels, pixels + line_size, line_size,
+ line_size, line_size, h);
+}
+
+static void common_hz_bil_no_rnd_16x16_lasx(const uint8_t *src,
+ int32_t src_stride,
+ uint8_t *dst, int32_t dst_stride)
+{
+ __m256i src0, src1, src2, src3, src4, src5, src6, src7;
+ int32_t src_stride_2x = src_stride << 1;
+ int32_t src_stride_4x = src_stride << 2;
+ int32_t src_stride_3x = src_stride_2x + src_stride;
+ uint8_t *_src = (uint8_t*)src;
+
+ src0 = __lasx_xvld(_src, 0);
+ DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x, src1, src2);
+ src3 = __lasx_xvldx(_src, src_stride_3x);
+ _src += 1;
+ src4 = __lasx_xvld(_src, 0);
+ DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x, src5, src6);
+ src7 = __lasx_xvldx(_src, src_stride_3x);
+ _src += (src_stride_4x - 1);
+ DUP4_ARG3(__lasx_xvpermi_q, src1, src0, 0x20, src3, src2, 0x20, src5,
+ src4, 0x20, src7, src6, 0x20, src0, src1, src2, src3);
+ src0 = __lasx_xvavg_bu(src0, src2);
+ src1 = __lasx_xvavg_bu(src1, src3);
+ __lasx_xvstelm_d(src0, dst, 0, 0);
+ __lasx_xvstelm_d(src0, dst, 8, 1);
+ dst += dst_stride;
+ __lasx_xvstelm_d(src0, dst, 0, 2);
+ __lasx_xvstelm_d(src0, dst, 8, 3);
+ dst += dst_stride;
+ __lasx_xvstelm_d(src1, dst, 0, 0);
+ __lasx_xvstelm_d(src1, dst, 8, 1);
+ dst += dst_stride;
+ __lasx_xvstelm_d(src1, dst, 0, 2);
+ __lasx_xvstelm_d(src1, dst, 8, 3);
+ dst += dst_stride;
+
+ src0 = __lasx_xvld(_src, 0);
+ DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x, src1, src2);
+ src3 = __lasx_xvldx(_src, src_stride_3x);
+ _src += 1;
+ src4 = __lasx_xvld(_src, 0);
+ DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x, src5, src6);
+ src7 = __lasx_xvldx(_src, src_stride_3x);
+ _src += (src_stride_4x - 1);
+ DUP4_ARG3(__lasx_xvpermi_q, src1, src0, 0x20, src3, src2, 0x20, src5, src4,
+ 0x20, src7, src6, 0x20, src0, src1, src2, src3);
+ src0 = __lasx_xvavg_bu(src0, src2);
+ src1 = __lasx_xvavg_bu(src1, src3);
+ __lasx_xvstelm_d(src0, dst, 0, 0);
+ __lasx_xvstelm_d(src0, dst, 8, 1);
+ dst += dst_stride;
+ __lasx_xvstelm_d(src0, dst, 0, 2);
+ __lasx_xvstelm_d(src0, dst, 8, 3);
+ dst += dst_stride;
+ __lasx_xvstelm_d(src1, dst, 0, 0);
+ __lasx_xvstelm_d(src1, dst, 8, 1);
+ dst += dst_stride;
+ __lasx_xvstelm_d(src1, dst, 0, 2);
+ __lasx_xvstelm_d(src1, dst, 8, 3);
+ dst += dst_stride;
+
+ src0 = __lasx_xvld(_src, 0);
+ DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x, src1, src2);
+ src3 = __lasx_xvldx(_src, src_stride_3x);
+ _src += 1;
+ src4 = __lasx_xvld(_src, 0);
+ DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x, src5, src6);
+ src7 = __lasx_xvldx(_src, src_stride_3x);
+ _src += (src_stride_4x - 1);
+ DUP4_ARG3(__lasx_xvpermi_q, src1, src0, 0x20, src3, src2, 0x20, src5, src4,
+ 0x20, src7, src6, 0x20, src0, src1, src2, src3);
+ src0 = __lasx_xvavg_bu(src0, src2);
+ src1 = __lasx_xvavg_bu(src1, src3);
+ __lasx_xvstelm_d(src0, dst, 0, 0);
+ __lasx_xvstelm_d(src0, dst, 8, 1);
+ dst += dst_stride;
+ __lasx_xvstelm_d(src0, dst, 0, 2);
+ __lasx_xvstelm_d(src0, dst, 8, 3);
+ dst += dst_stride;
+ __lasx_xvstelm_d(src1, dst, 0, 0);
+ __lasx_xvstelm_d(src1, dst, 8, 1);
+ dst += dst_stride;
+ __lasx_xvstelm_d(src1, dst, 0, 2);
+ __lasx_xvstelm_d(src1, dst, 8, 3);
+ dst += dst_stride;
+
+ src0 = __lasx_xvld(_src, 0);
+ DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x, src1, src2);
+ src3 = __lasx_xvldx(_src, src_stride_3x);
+ _src += 1;
+ src4 = __lasx_xvld(_src, 0);
+ DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x, src5, src6);
+ src7 = __lasx_xvldx(_src, src_stride_3x);
+ DUP4_ARG3(__lasx_xvpermi_q, src1, src0, 0x20, src3, src2, 0x20, src5, src4,
+ 0x20, src7, src6, 0x20, src0, src1, src2, src3);
+ src0 = __lasx_xvavg_bu(src0, src2);
+ src1 = __lasx_xvavg_bu(src1, src3);
+ __lasx_xvstelm_d(src0, dst, 0, 0);
+ __lasx_xvstelm_d(src0, dst, 8, 1);
+ dst += dst_stride;
+ __lasx_xvstelm_d(src0, dst, 0, 2);
+ __lasx_xvstelm_d(src0, dst, 8, 3);
+ dst += dst_stride;
+ __lasx_xvstelm_d(src1, dst, 0, 0);
+ __lasx_xvstelm_d(src1, dst, 8, 1);
+ dst += dst_stride;
+ __lasx_xvstelm_d(src1, dst, 0, 2);
+ __lasx_xvstelm_d(src1, dst, 8, 3);
+}
+
+static void common_hz_bil_no_rnd_8x16_lasx(const uint8_t *src,
+ int32_t src_stride,
+ uint8_t *dst, int32_t dst_stride)
+{
+ __m256i src0, src1, src2, src3, src4, src5, src6, src7;
+ int32_t src_stride_2x = src_stride << 1;
+ int32_t src_stride_4x = src_stride << 2;
+ int32_t src_stride_3x = src_stride_2x + src_stride;
+ uint8_t* _src = (uint8_t*)src;
+
+ src0 = __lasx_xvld(_src, 0);
+ DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x, src1, src2);
+ src3 = __lasx_xvldx(_src, src_stride_3x);
+ _src += 1;
+ src4 = __lasx_xvld(_src, 0);
+ DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x, src5, src6);
+ src7 = __lasx_xvldx(_src, src_stride_3x);
+ _src += (src_stride_4x - 1);
+ DUP4_ARG3(__lasx_xvpermi_q, src1, src0, 0x20, src3, src2, 0x20, src5, src4,
+ 0x20, src7, src6, 0x20, src0, src1, src2, src3);
+ src0 = __lasx_xvavg_bu(src0, src2);
+ src1 = __lasx_xvavg_bu(src1, src3);
+ __lasx_xvstelm_d(src0, dst, 0, 0);
+ __lasx_xvstelm_d(src0, dst, 8, 1);
+ dst += dst_stride;
+ __lasx_xvstelm_d(src0, dst, 0, 2);
+ __lasx_xvstelm_d(src0, dst, 8, 3);
+ dst += dst_stride;
+ __lasx_xvstelm_d(src1, dst, 0, 0);
+ __lasx_xvstelm_d(src1, dst, 8, 1);
+ dst += dst_stride;
+ __lasx_xvstelm_d(src1, dst, 0, 2);
+ __lasx_xvstelm_d(src1, dst, 8, 3);
+ dst += dst_stride;
+
+ src0 = __lasx_xvld(_src, 0);
+ DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x, src1, src2);
+ src3 = __lasx_xvldx(_src, src_stride_3x);
+ _src += 1;
+ src4 = __lasx_xvld(_src, 0);
+ DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x, src5, src6);
+ src7 = __lasx_xvldx(_src, src_stride_3x);
+ DUP4_ARG3(__lasx_xvpermi_q, src1, src0, 0x20, src3, src2, 0x20, src5, src4,
+ 0x20, src7, src6, 0x20, src0, src1, src2, src3);
+ src0 = __lasx_xvavg_bu(src0, src2);
+ src1 = __lasx_xvavg_bu(src1, src3);
+ __lasx_xvstelm_d(src0, dst, 0, 0);
+ __lasx_xvstelm_d(src0, dst, 8, 1);
+ dst += dst_stride;
+ __lasx_xvstelm_d(src0, dst, 0, 2);
+ __lasx_xvstelm_d(src0, dst, 8, 3);
+ dst += dst_stride;
+ __lasx_xvstelm_d(src1, dst, 0, 0);
+ __lasx_xvstelm_d(src1, dst, 8, 1);
+ dst += dst_stride;
+ __lasx_xvstelm_d(src1, dst, 0, 2);
+ __lasx_xvstelm_d(src1, dst, 8, 3);
+}
+
+void ff_put_no_rnd_pixels16_x2_8_lasx(uint8_t *block, const uint8_t *pixels,
+ ptrdiff_t line_size, int h)
+{
+ if (h == 16) {
+ common_hz_bil_no_rnd_16x16_lasx(pixels, line_size, block, line_size);
+ } else if (h == 8) {
+ common_hz_bil_no_rnd_8x16_lasx(pixels, line_size, block, line_size);
+ }
+}
+
+static void common_vt_bil_no_rnd_16x16_lasx(const uint8_t *src,
+ int32_t src_stride,
+ uint8_t *dst, int32_t dst_stride)
+{
+ __m256i src0, src1, src2, src3, src4, src5, src6, src7, src8;
+ __m256i src9, src10, src11, src12, src13, src14, src15, src16;
+ int32_t src_stride_2x = src_stride << 1;
+ int32_t src_stride_4x = src_stride << 2;
+ int32_t src_stride_3x = src_stride_2x + src_stride;
+ uint8_t* _src = (uint8_t*)src;
+
+ src0 = __lasx_xvld(_src, 0);
+ DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x, src1, src2);
+ src3 = __lasx_xvldx(_src, src_stride_3x);
+ _src += src_stride_4x;
+ src4 = __lasx_xvld(_src, 0);
+ DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x, src5, src6);
+ src7 = __lasx_xvldx(_src, src_stride_3x);
+ _src += src_stride_4x;
+ src8 = __lasx_xvld(_src, 0);
+ DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x, src9, src10);
+ src11 = __lasx_xvldx(_src, src_stride_3x);
+ _src += src_stride_4x;
+ src12 = __lasx_xvld(_src, 0);
+ DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x,
+ src13, src14);
+ src15 = __lasx_xvldx(_src, src_stride_3x);
+ _src += src_stride_4x;
+ src16 = __lasx_xvld(_src, 0);
+
+ DUP4_ARG3(__lasx_xvpermi_q, src1, src0, 0x20, src2, src1, 0x20, src3, src2,
+ 0x20, src4, src3, 0x20, src0, src1, src2, src3);
+ DUP4_ARG3(__lasx_xvpermi_q, src5, src4, 0x20, src6, src5, 0x20, src7, src6,
+ 0x20, src8, src7, 0x20, src4, src5, src6, src7);
+ DUP4_ARG3(__lasx_xvpermi_q, src9, src8, 0x20, src10, src9, 0x20, src11,
+ src10, 0x20, src12, src11, 0x20, src8, src9, src10, src11);
+ DUP4_ARG3(__lasx_xvpermi_q, src13, src12, 0x20, src14, src13, 0x20, src15,
+ src14, 0x20, src16, src15, 0x20, src12, src13, src14, src15);
+ DUP4_ARG2(__lasx_xvavg_bu, src0, src1, src2, src3, src4, src5, src6, src7,
+ src0, src2, src4, src6);
+ DUP4_ARG2(__lasx_xvavg_bu, src8, src9, src10, src11, src12, src13, src14,
+ src15, src8, src10, src12, src14);
+
+ __lasx_xvstelm_d(src0, dst, 0, 0);
+ __lasx_xvstelm_d(src0, dst, 8, 1);
+ dst += dst_stride;
+ __lasx_xvstelm_d(src0, dst, 0, 2);
+ __lasx_xvstelm_d(src0, dst, 8, 3);
+ dst += dst_stride;
+ __lasx_xvstelm_d(src2, dst, 0, 0);
+ __lasx_xvstelm_d(src2, dst, 8, 1);
+ dst += dst_stride;
+ __lasx_xvstelm_d(src2, dst, 0, 2);
+ __lasx_xvstelm_d(src2, dst, 8, 3);
+ dst += dst_stride;
+ __lasx_xvstelm_d(src4, dst, 0, 0);
+ __lasx_xvstelm_d(src4, dst, 8, 1);
+ dst += dst_stride;
+ __lasx_xvstelm_d(src4, dst, 0, 2);
+ __lasx_xvstelm_d(src4, dst, 8, 3);
+ dst += dst_stride;
+ __lasx_xvstelm_d(src6, dst, 0, 0);
+ __lasx_xvstelm_d(src6, dst, 8, 1);
+ dst += dst_stride;
+ __lasx_xvstelm_d(src6, dst, 0, 2);
+ __lasx_xvstelm_d(src6, dst, 8, 3);
+ dst += dst_stride;
+ __lasx_xvstelm_d(src8, dst, 0, 0);
+ __lasx_xvstelm_d(src8, dst, 8, 1);
+ dst += dst_stride;
+ __lasx_xvstelm_d(src8, dst, 0, 2);
+ __lasx_xvstelm_d(src8, dst, 8, 3);
+ dst += dst_stride;
+ __lasx_xvstelm_d(src10, dst, 0, 0);
+ __lasx_xvstelm_d(src10, dst, 8, 1);
+ dst += dst_stride;
+ __lasx_xvstelm_d(src10, dst, 0, 2);
+ __lasx_xvstelm_d(src10, dst, 8, 3);
+ dst += dst_stride;
+ __lasx_xvstelm_d(src12, dst, 0, 0);
+ __lasx_xvstelm_d(src12, dst, 8, 1);
+ dst += dst_stride;
+ __lasx_xvstelm_d(src12, dst, 0, 2);
+ __lasx_xvstelm_d(src12, dst, 8, 3);
+ dst += dst_stride;
+ __lasx_xvstelm_d(src14, dst, 0, 0);
+ __lasx_xvstelm_d(src14, dst, 8, 1);
+ dst += dst_stride;
+ __lasx_xvstelm_d(src14, dst, 0, 2);
+ __lasx_xvstelm_d(src14, dst, 8, 3);
+}
+
+static void common_vt_bil_no_rnd_8x16_lasx(const uint8_t *src,
+ int32_t src_stride,
+ uint8_t *dst, int32_t dst_stride)
+{
+ __m256i src0, src1, src2, src3, src4, src5, src6, src7, src8;
+ int32_t src_stride_2x = src_stride << 1;
+ int32_t src_stride_4x = src_stride << 2;
+ int32_t src_stride_3x = src_stride_2x + src_stride;
+ uint8_t* _src = (uint8_t*)src;
+
+ src0 = __lasx_xvld(_src, 0);
+ DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x, src1, src2);
+ src3 = __lasx_xvldx(_src, src_stride_3x);
+ _src += src_stride_4x;
+ src4 = __lasx_xvld(_src, 0);
+ DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x, src5, src6);
+ src7 = __lasx_xvldx(_src, src_stride_3x);
+ _src += src_stride_4x;
+ src8 = __lasx_xvld(_src, 0);
+
+ DUP4_ARG3(__lasx_xvpermi_q, src1, src0, 0x20, src2, src1, 0x20, src3, src2,
+ 0x20, src4, src3, 0x20, src0, src1, src2, src3);
+ DUP4_ARG3(__lasx_xvpermi_q, src5, src4, 0x20, src6, src5, 0x20, src7, src6,
+ 0x20, src8, src7, 0x20, src4, src5, src6, src7);
+ DUP4_ARG2(__lasx_xvavg_bu, src0, src1, src2, src3, src4, src5, src6, src7,
+ src0, src2, src4, src6);
+
+ __lasx_xvstelm_d(src0, dst, 0, 0);
+ __lasx_xvstelm_d(src0, dst, 8, 1);
+ dst += dst_stride;
+ __lasx_xvstelm_d(src0, dst, 0, 2);
+ __lasx_xvstelm_d(src0, dst, 8, 3);
+ dst += dst_stride;
+ __lasx_xvstelm_d(src2, dst, 0, 0);
+ __lasx_xvstelm_d(src2, dst, 8, 1);
+ dst += dst_stride;
+ __lasx_xvstelm_d(src2, dst, 0, 2);
+ __lasx_xvstelm_d(src2, dst, 8, 3);
+ dst += dst_stride;
+ __lasx_xvstelm_d(src4, dst, 0, 0);
+ __lasx_xvstelm_d(src4, dst, 8, 1);
+ dst += dst_stride;
+ __lasx_xvstelm_d(src4, dst, 0, 2);
+ __lasx_xvstelm_d(src4, dst, 8, 3);
+ dst += dst_stride;
+ __lasx_xvstelm_d(src6, dst, 0, 0);
+ __lasx_xvstelm_d(src6, dst, 8, 1);
+ dst += dst_stride;
+ __lasx_xvstelm_d(src6, dst, 0, 2);
+ __lasx_xvstelm_d(src6, dst, 8, 3);
+}
+
+void ff_put_no_rnd_pixels16_y2_8_lasx(uint8_t *block, const uint8_t *pixels,
+ ptrdiff_t line_size, int h)
+{
+ if (h == 16) {
+ common_vt_bil_no_rnd_16x16_lasx(pixels, line_size, block, line_size);
+ } else if (h == 8) {
+ common_vt_bil_no_rnd_8x16_lasx(pixels, line_size, block, line_size);
+ }
+}
+
+static void common_hv_bil_no_rnd_16x16_lasx(const uint8_t *src,
+ int32_t src_stride,
+ uint8_t *dst, int32_t dst_stride)
+{
+ __m256i src0, src1, src2, src3, src4, src5, src6, src7, src8, src9;
+ __m256i src10, src11, src12, src13, src14, src15, src16, src17;
+ __m256i sum0, sum1, sum2, sum3, sum4, sum5, sum6, sum7;
+ int32_t src_stride_2x = src_stride << 1;
+ int32_t src_stride_4x = src_stride << 2;
+ int32_t src_stride_3x = src_stride_2x + src_stride;
+ uint8_t* _src = (uint8_t*)src;
+
+ src0 = __lasx_xvld(_src, 0);
+ DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x, src1, src2);
+ src3 = __lasx_xvldx(_src, src_stride_3x);
+ _src += src_stride_4x;
+ src4 = __lasx_xvld(_src, 0);
+ DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x, src5, src6);
+ src7 = __lasx_xvldx(_src, src_stride_3x);
+ _src += (1 - src_stride_4x);
+ src9 = __lasx_xvld(_src, 0);
+ DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x,
+ src10, src11);
+ src12 = __lasx_xvldx(_src, src_stride_3x);
+ _src += src_stride_4x;
+ src13 = __lasx_xvld(_src, 0);
+ DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x,
+ src14, src15);
+ src16 = __lasx_xvldx(_src, src_stride_3x);
+ _src += (src_stride_4x - 1);
+ DUP2_ARG2(__lasx_xvld, _src, 0, _src, 1, src8, src17);
+
+ DUP4_ARG3(__lasx_xvpermi_q, src0, src4, 0x02, src1, src5, 0x02, src2,
+ src6, 0x02, src3, src7, 0x02, src0, src1, src2, src3);
+ DUP4_ARG3(__lasx_xvpermi_q, src4, src8, 0x02, src9, src13, 0x02, src10,
+ src14, 0x02, src11, src15, 0x02, src4, src5, src6, src7);
+ DUP2_ARG3(__lasx_xvpermi_q, src12, src16, 0x02, src13, src17, 0x02,
+ src8, src9);
+ DUP4_ARG2(__lasx_xvilvl_h, src5, src0, src6, src1, src7, src2, src8, src3,
+ sum0, sum2, sum4, sum6);
+ DUP4_ARG2(__lasx_xvilvh_h, src5, src0, src6, src1, src7, src2, src8, src3,
+ sum1, sum3, sum5, sum7);
+ src8 = __lasx_xvilvl_h(src9, src4);
+ src9 = __lasx_xvilvh_h(src9, src4);
+
+ DUP4_ARG2(__lasx_xvhaddw_hu_bu, sum0, sum0, sum1, sum1, sum2, sum2,
+ sum3, sum3, src0, src1, src2, src3);
+ DUP4_ARG2(__lasx_xvhaddw_hu_bu, sum4, sum4, sum5, sum5, sum6, sum6,
+ sum7, sum7, src4, src5, src6, src7);
+ DUP2_ARG2(__lasx_xvhaddw_hu_bu, src8, src8, src9, src9, src8, src9);
+
+ DUP4_ARG2(__lasx_xvadd_h, src0, src2, src1, src3, src2, src4, src3, src5,
+ sum0, sum1, sum2, sum3);
+ DUP4_ARG2(__lasx_xvadd_h, src4, src6, src5, src7, src6, src8, src7, src9,
+ sum4, sum5, sum6, sum7);
+ DUP4_ARG2(__lasx_xvaddi_hu, sum0, 1, sum1, 1, sum2, 1, sum3, 1,
+ sum0, sum1, sum2, sum3);
+ DUP4_ARG2(__lasx_xvaddi_hu, sum4, 1, sum5, 1, sum6, 1, sum7, 1,
+ sum4, sum5, sum6, sum7);
+ DUP4_ARG3(__lasx_xvsrani_b_h, sum1, sum0, 2, sum3, sum2, 2, sum5, sum4, 2,
+ sum7, sum6, 2, sum0, sum1, sum2, sum3);
+ __lasx_xvstelm_d(sum0, dst, 0, 0);
+ __lasx_xvstelm_d(sum0, dst, 8, 1);
+ dst += dst_stride;
+ __lasx_xvstelm_d(sum1, dst, 0, 0);
+ __lasx_xvstelm_d(sum1, dst, 8, 1);
+ dst += dst_stride;
+ __lasx_xvstelm_d(sum2, dst, 0, 0);
+ __lasx_xvstelm_d(sum2, dst, 8, 1);
+ dst += dst_stride;
+ __lasx_xvstelm_d(sum3, dst, 0, 0);
+ __lasx_xvstelm_d(sum3, dst, 8, 1);
+ dst += dst_stride;
+ __lasx_xvstelm_d(sum0, dst, 0, 2);
+ __lasx_xvstelm_d(sum0, dst, 8, 3);
+ dst += dst_stride;
+ __lasx_xvstelm_d(sum1, dst, 0, 2);
+ __lasx_xvstelm_d(sum1, dst, 8, 3);
+ dst += dst_stride;
+ __lasx_xvstelm_d(sum2, dst, 0, 2);
+ __lasx_xvstelm_d(sum2, dst, 8, 3);
+ dst += dst_stride;
+ __lasx_xvstelm_d(sum3, dst, 0, 2);
+ __lasx_xvstelm_d(sum3, dst, 8, 3);
+ dst += dst_stride;
+
+ src0 = __lasx_xvld(_src, 0);
+ DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x, src1, src2);
+ src3 = __lasx_xvldx(_src, src_stride_3x);
+ _src += src_stride_4x;
+ src4 = __lasx_xvld(_src, 0);
+ DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x, src5, src6);
+ src7 = __lasx_xvldx(_src, src_stride_3x);
+ _src += (1 - src_stride_4x);
+ src9 = __lasx_xvld(_src, 0);
+ DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x,
+ src10, src11);
+ src12 = __lasx_xvldx(_src, src_stride_3x);
+ _src += src_stride_4x;
+ src13 = __lasx_xvld(_src, 0);
+ DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x,
+ src14, src15);
+ src16 = __lasx_xvldx(_src, src_stride_3x);
+ _src += (src_stride_4x - 1);
+ DUP2_ARG2(__lasx_xvld, _src, 0, _src, 1, src8, src17);
+
+ DUP4_ARG3(__lasx_xvpermi_q, src0, src4, 0x02, src1, src5, 0x02, src2, src6, 0x02,
+ src3, src7, 0x02, src0, src1, src2, src3);
+ DUP4_ARG3(__lasx_xvpermi_q, src4, src8, 0x02, src9, src13, 0x02, src10, src14, 0x02,
+ src11, src15, 0x02, src4, src5, src6, src7);
+ DUP2_ARG3(__lasx_xvpermi_q, src12, src16, 0x02, src13, src17, 0x02, src8, src9);
+
+ DUP4_ARG2(__lasx_xvilvl_h, src5, src0, src6, src1, src7, src2, src8, src3,
+ sum0, sum2, sum4, sum6);
+ DUP4_ARG2(__lasx_xvilvh_h, src5, src0, src6, src1, src7, src2, src8, src3,
+ sum1, sum3, sum5, sum7);
+ src8 = __lasx_xvilvl_h(src9, src4);
+ src9 = __lasx_xvilvh_h(src9, src4);
+
+ DUP4_ARG2(__lasx_xvhaddw_hu_bu, sum0, sum0, sum1, sum1, sum2, sum2,
+ sum3, sum3, src0, src1, src2, src3);
+ DUP4_ARG2(__lasx_xvhaddw_hu_bu, sum4, sum4, sum5, sum5, sum6, sum6,
+ sum7, sum7, src4, src5, src6, src7);
+ DUP2_ARG2(__lasx_xvhaddw_hu_bu, src8, src8, src9, src9, src8, src9);
+
+ DUP4_ARG2(__lasx_xvadd_h, src0, src2, src1, src3, src2, src4, src3, src5,
+ sum0, sum1, sum2, sum3);
+ DUP4_ARG2(__lasx_xvadd_h, src4, src6, src5, src7, src6, src8, src7, src9,
+ sum4, sum5, sum6, sum7);
+ DUP4_ARG2(__lasx_xvaddi_hu, sum0, 1, sum1, 1, sum2, 1, sum3, 1,
+ sum0, sum1, sum2, sum3);
+ DUP4_ARG2(__lasx_xvaddi_hu, sum4, 1, sum5, 1, sum6, 1, sum7, 1,
+ sum4, sum5, sum6, sum7);
+ DUP4_ARG3(__lasx_xvsrani_b_h, sum1, sum0, 2, sum3, sum2, 2, sum5, sum4, 2,
+ sum7, sum6, 2, sum0, sum1, sum2, sum3);
+ __lasx_xvstelm_d(sum0, dst, 0, 0);
+ __lasx_xvstelm_d(sum0, dst, 8, 1);
+ dst += dst_stride;
+ __lasx_xvstelm_d(sum1, dst, 0, 0);
+ __lasx_xvstelm_d(sum1, dst, 8, 1);
+ dst += dst_stride;
+ __lasx_xvstelm_d(sum2, dst, 0, 0);
+ __lasx_xvstelm_d(sum2, dst, 8, 1);
+ dst += dst_stride;
+ __lasx_xvstelm_d(sum3, dst, 0, 0);
+ __lasx_xvstelm_d(sum3, dst, 8, 1);
+ dst += dst_stride;
+ __lasx_xvstelm_d(sum0, dst, 0, 2);
+ __lasx_xvstelm_d(sum0, dst, 8, 3);
+ dst += dst_stride;
+ __lasx_xvstelm_d(sum1, dst, 0, 2);
+ __lasx_xvstelm_d(sum1, dst, 8, 3);
+ dst += dst_stride;
+ __lasx_xvstelm_d(sum2, dst, 0, 2);
+ __lasx_xvstelm_d(sum2, dst, 8, 3);
+ dst += dst_stride;
+ __lasx_xvstelm_d(sum3, dst, 0, 2);
+ __lasx_xvstelm_d(sum3, dst, 8, 3);
+}
+
+static void common_hv_bil_no_rnd_8x16_lasx(const uint8_t *src,
+ int32_t src_stride,
+ uint8_t *dst, int32_t dst_stride)
+{
+ __m256i src0, src1, src2, src3, src4, src5, src6, src7, src8, src9;
+ __m256i src10, src11, src12, src13, src14, src15, src16, src17;
+ __m256i sum0, sum1, sum2, sum3, sum4, sum5, sum6, sum7;
+ int32_t src_stride_2x = src_stride << 1;
+ int32_t src_stride_4x = src_stride << 2;
+ int32_t src_stride_3x = src_stride_2x + src_stride;
+ uint8_t* _src = (uint8_t*)src;
+
+ src0 = __lasx_xvld(_src, 0);
+ DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x, src1, src2);
+ src3 = __lasx_xvldx(_src, src_stride_3x);
+ _src += src_stride_4x;
+ src4 = __lasx_xvld(_src, 0);
+ DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x, src5, src6);
+ src7 = __lasx_xvldx(_src, src_stride_3x);
+ _src += (1 - src_stride_4x);
+ src9 = __lasx_xvld(_src, 0);
+ DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x,
+ src10, src11);
+ src12 = __lasx_xvldx(_src, src_stride_3x);
+ _src += src_stride_4x;
+ src13 = __lasx_xvld(_src, 0);
+ DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x,
+ src14, src15);
+ src16 = __lasx_xvldx(_src, src_stride_3x);
+ _src += (src_stride_4x - 1);
+ DUP2_ARG2(__lasx_xvld, _src, 0, _src, 1, src8, src17);
+
+ DUP4_ARG3(__lasx_xvpermi_q, src0, src4, 0x02, src1, src5, 0x02, src2,
+ src6, 0x02, src3, src7, 0x02, src0, src1, src2, src3);
+ DUP4_ARG3(__lasx_xvpermi_q, src4, src8, 0x02, src9, src13, 0x02, src10,
+ src14, 0x02, src11, src15, 0x02, src4, src5, src6, src7);
+ DUP2_ARG3(__lasx_xvpermi_q, src12, src16, 0x02, src13, src17, 0x02, src8, src9);
+
+ DUP4_ARG2(__lasx_xvilvl_h, src5, src0, src6, src1, src7, src2, src8, src3,
+ sum0, sum2, sum4, sum6);
+ DUP4_ARG2(__lasx_xvilvh_h, src5, src0, src6, src1, src7, src2, src8, src3,
+ sum1, sum3, sum5, sum7);
+ src8 = __lasx_xvilvl_h(src9, src4);
+ src9 = __lasx_xvilvh_h(src9, src4);
+
+ DUP4_ARG2(__lasx_xvhaddw_hu_bu, sum0, sum0, sum1, sum1, sum2, sum2,
+ sum3, sum3, src0, src1, src2, src3);
+ DUP4_ARG2(__lasx_xvhaddw_hu_bu, sum4, sum4, sum5, sum5, sum6, sum6,
+ sum7, sum7, src4, src5, src6, src7);
+ DUP2_ARG2(__lasx_xvhaddw_hu_bu, src8, src8, src9, src9, src8, src9);
+
+ DUP4_ARG2(__lasx_xvadd_h, src0, src2, src1, src3, src2, src4, src3, src5,
+ sum0, sum1, sum2, sum3);
+ DUP4_ARG2(__lasx_xvadd_h, src4, src6, src5, src7, src6, src8, src7, src9,
+ sum4, sum5, sum6, sum7);
+ DUP4_ARG2(__lasx_xvaddi_hu, sum0, 1, sum1, 1, sum2, 1, sum3, 1,
+ sum0, sum1, sum2, sum3);
+ DUP4_ARG2(__lasx_xvaddi_hu, sum4, 1, sum5, 1, sum6, 1, sum7, 1,
+ sum4, sum5, sum6, sum7);
+ DUP4_ARG3(__lasx_xvsrani_b_h, sum1, sum0, 2, sum3, sum2, 2, sum5, sum4, 2,
+ sum7, sum6, 2, sum0, sum1, sum2, sum3);
+ __lasx_xvstelm_d(sum0, dst, 0, 0);
+ __lasx_xvstelm_d(sum0, dst, 8, 1);
+ dst += dst_stride;
+ __lasx_xvstelm_d(sum1, dst, 0, 0);
+ __lasx_xvstelm_d(sum1, dst, 8, 1);
+ dst += dst_stride;
+ __lasx_xvstelm_d(sum2, dst, 0, 0);
+ __lasx_xvstelm_d(sum2, dst, 8, 1);
+ dst += dst_stride;
+ __lasx_xvstelm_d(sum3, dst, 0, 0);
+ __lasx_xvstelm_d(sum3, dst, 8, 1);
+ dst += dst_stride;
+ __lasx_xvstelm_d(sum0, dst, 0, 2);
+ __lasx_xvstelm_d(sum0, dst, 8, 3);
+ dst += dst_stride;
+ __lasx_xvstelm_d(sum1, dst, 0, 2);
+ __lasx_xvstelm_d(sum1, dst, 8, 3);
+ dst += dst_stride;
+ __lasx_xvstelm_d(sum2, dst, 0, 2);
+ __lasx_xvstelm_d(sum2, dst, 8, 3);
+ dst += dst_stride;
+ __lasx_xvstelm_d(sum3, dst, 0, 2);
+ __lasx_xvstelm_d(sum3, dst, 8, 3);
+}
+
+void ff_put_no_rnd_pixels16_xy2_8_lasx(uint8_t *block,
+ const uint8_t *pixels,
+ ptrdiff_t line_size, int h)
+{
+ if (h == 16) {
+ common_hv_bil_no_rnd_16x16_lasx(pixels, line_size, block, line_size);
+ } else if (h == 8) {
+ common_hv_bil_no_rnd_8x16_lasx(pixels, line_size, block, line_size);
+ }
+}
+
+static void common_hz_bil_no_rnd_8x8_lasx(const uint8_t *src,
+ int32_t src_stride,
+ uint8_t *dst, int32_t dst_stride)
+{
+ __m256i src0, src1, src2, src3, src4, src5, src6, src7;
+ __m256i src8, src9, src10, src11, src12, src13, src14, src15;
+ int32_t src_stride_2x = src_stride << 1;
+ int32_t src_stride_4x = src_stride << 2;
+ int32_t dst_stride_2x = dst_stride << 1;
+ int32_t dst_stride_4x = dst_stride << 2;
+ int32_t dst_stride_3x = dst_stride_2x + dst_stride;
+ int32_t src_stride_3x = src_stride_2x + src_stride;
+ uint8_t* _src = (uint8_t*)src;
+
+ src0 = __lasx_xvld(_src, 0);
+ DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x, src1, src2);
+ src3 = __lasx_xvldx(_src, src_stride_3x);
+ _src += src_stride_4x;
+ src4 = __lasx_xvld(_src, 0);
+ DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x, src5, src6);
+ src7 = __lasx_xvldx(_src, src_stride_3x);
+ _src += (1 - src_stride_4x);
+ src8 = __lasx_xvld(_src, 0);
+ DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x, src9, src10);
+ src11 = __lasx_xvldx(_src, src_stride_3x);
+ _src += src_stride_4x;
+ src12 = __lasx_xvld(_src, 0);
+ DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x,
+ src13, src14);
+ src15 = __lasx_xvldx(_src, src_stride_3x);
+
+ DUP4_ARG2(__lasx_xvpickev_d, src1, src0, src3, src2, src5, src4, src7,
+ src6, src0, src1, src2, src3);
+ DUP4_ARG2(__lasx_xvpickev_d, src9, src8, src11, src10, src13, src12, src15,
+ src14, src4, src5, src6, src7);
+ DUP4_ARG3(__lasx_xvpermi_q, src1, src0, 0x20, src3, src2, 0x20, src5, src4,
+ 0x20, src7, src6, 0x20, src0, src1, src2, src3);
+ src0 = __lasx_xvavg_bu(src0, src2);
+ src1 = __lasx_xvavg_bu(src1, src3);
+ __lasx_xvstelm_d(src0, dst, 0, 0);
+ __lasx_xvstelm_d(src0, dst + dst_stride, 0, 1);
+ __lasx_xvstelm_d(src0, dst + dst_stride_2x, 0, 2);
+ __lasx_xvstelm_d(src0, dst + dst_stride_3x, 0, 3);
+ dst += dst_stride_4x;
+ __lasx_xvstelm_d(src1, dst, 0, 0);
+ __lasx_xvstelm_d(src1, dst + dst_stride, 0, 1);
+ __lasx_xvstelm_d(src1, dst + dst_stride_2x, 0, 2);
+ __lasx_xvstelm_d(src1, dst + dst_stride_3x, 0, 3);
+}
+
+static void common_hz_bil_no_rnd_4x8_lasx(const uint8_t *src,
+ int32_t src_stride,
+ uint8_t *dst, int32_t dst_stride)
+{
+ __m256i src0, src1, src2, src3, src4, src5, src6, src7;
+ int32_t src_stride_2x = src_stride << 1;
+ int32_t src_stride_3x = src_stride_2x + src_stride;
+ int32_t dst_stride_2x = dst_stride << 1;
+ int32_t dst_stride_3x = dst_stride_2x + dst_stride;
+ uint8_t *_src = (uint8_t*)src;
+
+ src0 = __lasx_xvld(_src, 0);
+ DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x, src1, src2);
+ src3 = __lasx_xvldx(_src, src_stride_3x);
+ _src += 1;
+ src4 = __lasx_xvld(_src, 0);
+ DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x, src5, src6);
+ src7 = __lasx_xvldx(_src, src_stride_3x);
+ DUP4_ARG2(__lasx_xvpickev_d, src1, src0, src3, src2, src5, src4, src7, src6,
+ src0, src1, src2, src3);
+ DUP2_ARG3(__lasx_xvpermi_q, src1, src0, 0x20, src3, src2, 0x20, src0, src1);
+ src0 = __lasx_xvavg_bu(src0, src1);
+ __lasx_xvstelm_d(src0, dst, 0, 0);
+ __lasx_xvstelm_d(src0, dst + dst_stride, 0, 1);
+ __lasx_xvstelm_d(src0, dst + dst_stride_2x, 0, 2);
+ __lasx_xvstelm_d(src0, dst + dst_stride_3x, 0, 3);
+}
+
+void ff_put_no_rnd_pixels8_x2_8_lasx(uint8_t *block, const uint8_t *pixels,
+ ptrdiff_t line_size, int h)
+{
+ if (h == 8) {
+ common_hz_bil_no_rnd_8x8_lasx(pixels, line_size, block, line_size);
+ } else if (h == 4) {
+ common_hz_bil_no_rnd_4x8_lasx(pixels, line_size, block, line_size);
+ }
+}
+
+static void common_vt_bil_no_rnd_8x8_lasx(const uint8_t *src, int32_t src_stride,
+ uint8_t *dst, int32_t dst_stride)
+{
+ __m256i src0, src1, src2, src3, src4, src5, src6, src7, src8;
+ int32_t src_stride_2x = src_stride << 1;
+ int32_t src_stride_4x = src_stride << 2;
+ int32_t dst_stride_2x = dst_stride << 1;
+ int32_t dst_stride_4x = dst_stride << 2;
+ int32_t dst_stride_3x = dst_stride_2x + dst_stride;
+ int32_t src_stride_3x = src_stride_2x + src_stride;
+ uint8_t* _src = (uint8_t*)src;
+
+ src0 = __lasx_xvld(_src, 0);
+ DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x, src1, src2);
+ src3 = __lasx_xvldx(_src, src_stride_3x);
+ _src += src_stride_4x;
+ src4 = __lasx_xvld(_src, 0);
+ DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x, src5, src6);
+ src7 = __lasx_xvldx(_src, src_stride_3x);
+ _src += src_stride_4x;
+ src8 = __lasx_xvld(_src, 0);
+
+ DUP4_ARG2(__lasx_xvpickev_d, src1, src0, src2, src1, src3, src2, src4, src3,
+ src0, src1, src2, src3);
+ DUP4_ARG2(__lasx_xvpickev_d, src5, src4, src6, src5, src7, src6, src8, src7,
+ src4, src5, src6, src7);
+ DUP4_ARG3(__lasx_xvpermi_q, src2, src0, 0x20, src3, src1, 0x20, src6, src4,
+ 0x20, src7, src5, 0x20, src0, src1, src2, src3);
+ src0 = __lasx_xvavg_bu(src0, src1);
+ src1 = __lasx_xvavg_bu(src2, src3);
+ __lasx_xvstelm_d(src0, dst, 0, 0);
+ __lasx_xvstelm_d(src0, dst + dst_stride, 0, 1);
+ __lasx_xvstelm_d(src0, dst + dst_stride_2x, 0, 2);
+ __lasx_xvstelm_d(src0, dst + dst_stride_3x, 0, 3);
+ dst += dst_stride_4x;
+ __lasx_xvstelm_d(src1, dst, 0, 0);
+ __lasx_xvstelm_d(src1, dst + dst_stride, 0, 1);
+ __lasx_xvstelm_d(src1, dst + dst_stride_2x, 0, 2);
+ __lasx_xvstelm_d(src1, dst + dst_stride_3x, 0, 3);
+}
+
+static void common_vt_bil_no_rnd_4x8_lasx(const uint8_t *src, int32_t src_stride,
+ uint8_t *dst, int32_t dst_stride)
+{
+ __m256i src0, src1, src2, src3, src4;
+ int32_t src_stride_2x = src_stride << 1;
+ int32_t src_stride_4x = src_stride << 2;
+ int32_t dst_stride_2x = dst_stride << 1;
+ int32_t dst_stride_3x = dst_stride_2x + dst_stride;
+ int32_t src_stride_3x = src_stride_2x + src_stride;
+ uint8_t* _src = (uint8_t*)src;
+
+ src0 = __lasx_xvld(_src, 0);
+ DUP4_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x, _src,
+ src_stride_3x, _src, src_stride_4x, src1, src2, src3, src4);
+ DUP4_ARG2(__lasx_xvpickev_d, src1, src0, src2, src1, src3, src2, src4, src3,
+ src0, src1, src2, src3);
+ DUP2_ARG3(__lasx_xvpermi_q, src2, src0, 0x20, src3, src1, 0x20, src0, src1);
+ src0 = __lasx_xvavg_bu(src0, src1);
+ __lasx_xvstelm_d(src0, dst, 0, 0);
+ __lasx_xvstelm_d(src0, dst + dst_stride, 0, 1);
+ __lasx_xvstelm_d(src0, dst + dst_stride_2x, 0, 2);
+ __lasx_xvstelm_d(src0, dst + dst_stride_3x, 0, 3);
+}
+
+void ff_put_no_rnd_pixels8_y2_8_lasx(uint8_t *block, const uint8_t *pixels,
+ ptrdiff_t line_size, int h)
+{
+ if (h == 8) {
+ common_vt_bil_no_rnd_8x8_lasx(pixels, line_size, block, line_size);
+ } else if (h == 4) {
+ common_vt_bil_no_rnd_4x8_lasx(pixels, line_size, block, line_size);
+ }
+}
+
+static void common_hv_bil_no_rnd_8x8_lasx(const uint8_t *src, int32_t src_stride,
+ uint8_t *dst, int32_t dst_stride)
+{
+ __m256i src0, src1, src2, src3, src4, src5, src6, src7;
+ __m256i src8, src9, src10, src11, src12, src13, src14, src15, src16, src17;
+ __m256i sum0, sum1, sum2, sum3;
+ int32_t src_stride_2x = src_stride << 1;
+ int32_t src_stride_4x = src_stride << 2;
+ int32_t dst_stride_2x = dst_stride << 1;
+ int32_t dst_stride_4x = dst_stride << 2;
+ int32_t dst_stride_3x = dst_stride_2x + dst_stride;
+ int32_t src_stride_3x = src_stride_2x + src_stride;
+ uint8_t* _src = (uint8_t*)src;
+
+ src0 = __lasx_xvld(_src, 0);
+ DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x, src1, src2);
+ src3 = __lasx_xvldx(_src, src_stride_3x);
+ _src += src_stride_4x;
+ src4 = __lasx_xvld(_src, 0);
+ DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x, src5, src6);
+ src7 = __lasx_xvldx(_src, src_stride_3x);
+ _src += (1 - src_stride_4x);
+ src9 = __lasx_xvld(_src, 0);
+ DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x,
+ src10, src11);
+ src12 = __lasx_xvldx(_src, src_stride_3x);
+ _src += src_stride_4x;
+ src13 = __lasx_xvld(_src, 0);
+ DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x,
+ src14, src15);
+ src16 = __lasx_xvldx(_src, src_stride_3x);
+ _src += (src_stride_4x - 1);
+ DUP2_ARG2(__lasx_xvld, _src, 0, _src, 1, src8, src17);
+
+ DUP4_ARG2(__lasx_xvilvl_b, src9, src0, src10, src1, src11, src2, src12, src3,
+ src0, src1, src2, src3);
+ DUP4_ARG2(__lasx_xvilvl_b, src13, src4, src14, src5, src15, src6, src16, src7,
+ src4, src5, src6, src7);
+ src8 = __lasx_xvilvl_b(src17, src8);
+ DUP4_ARG3(__lasx_xvpermi_q, src1, src0, 0x20, src2, src1, 0x20, src3, src2,
+ 0x20, src4, src3, 0x20, src0, src1, src2, src3);
+ DUP4_ARG3(__lasx_xvpermi_q, src5, src4, 0x20, src6, src5, 0x20, src7, src6,
+ 0x20, src8, src7, 0x20, src4, src5, src6, src7);
+ DUP4_ARG2(__lasx_xvhaddw_hu_bu, src0, src0, src1, src1, src2, src2,
+ src3, src3, src0, src1, src2, src3);
+ DUP4_ARG2(__lasx_xvhaddw_hu_bu, src4, src4, src5, src5, src6, src6,
+ src7, src7, src4, src5, src6, src7);
+ DUP4_ARG2(__lasx_xvadd_h, src0, src1, src2, src3, src4, src5, src6, src7,
+ sum0, sum1, sum2, sum3);
+ DUP4_ARG2(__lasx_xvaddi_hu, sum0, 1, sum1, 1, sum2, 1, sum3, 1,
+ sum0, sum1, sum2, sum3);
+ DUP2_ARG3(__lasx_xvsrani_b_h, sum1, sum0, 2, sum3, sum2, 2, sum0, sum1);
+ __lasx_xvstelm_d(sum0, dst, 0, 0);
+ __lasx_xvstelm_d(sum0, dst + dst_stride, 0, 2);
+ __lasx_xvstelm_d(sum0, dst + dst_stride_2x, 0, 1);
+ __lasx_xvstelm_d(sum0, dst + dst_stride_3x, 0, 3);
+ dst += dst_stride_4x;
+ __lasx_xvstelm_d(sum1, dst, 0, 0);
+ __lasx_xvstelm_d(sum1, dst + dst_stride, 0, 2);
+ __lasx_xvstelm_d(sum1, dst + dst_stride_2x, 0, 1);
+ __lasx_xvstelm_d(sum1, dst + dst_stride_3x, 0, 3);
+}
+
+static void common_hv_bil_no_rnd_4x8_lasx(const uint8_t *src, int32_t src_stride,
+ uint8_t *dst, int32_t dst_stride)
+{
+ __m256i src0, src1, src2, src3, src4, src5, src6, src7;
+ __m256i src8, src9, sum0, sum1;
+ int32_t src_stride_2x = src_stride << 1;
+ int32_t src_stride_4x = src_stride << 2;
+ int32_t dst_stride_2x = dst_stride << 1;
+ int32_t dst_stride_3x = dst_stride_2x + dst_stride;
+ int32_t src_stride_3x = src_stride_2x + src_stride;
+ uint8_t *_src = (uint8_t*)src;
+
+ src0 = __lasx_xvld(_src, 0);
+ DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x, src1, src2);
+ src3 = __lasx_xvldx(_src, src_stride_3x);
+ _src += 1;
+ src5 = __lasx_xvld(_src, 0);
+ DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x, src6, src7);
+ src8 = __lasx_xvldx(_src, src_stride_3x);
+ _src += (src_stride_4x - 1);
+ DUP2_ARG2(__lasx_xvld, _src, 0, _src, 1, src4, src9);
+
+ DUP4_ARG2(__lasx_xvilvl_b, src5, src0, src6, src1, src7, src2, src8, src3,
+ src0, src1, src2, src3);
+ src4 = __lasx_xvilvl_b(src9, src4);
+ DUP4_ARG3(__lasx_xvpermi_q, src1, src0, 0x20, src2, src1, 0x20, src3, src2,
+ 0x20, src4, src3, 0x20, src0, src1, src2, src3);
+ DUP4_ARG2(__lasx_xvhaddw_hu_bu, src0, src0, src1, src1, src2, src2,
+ src3, src3, src0, src1, src2, src3);
+ DUP2_ARG2(__lasx_xvadd_h, src0, src1, src2, src3, sum0, sum1);
+ sum0 = __lasx_xvaddi_hu(sum0, 1);
+ sum1 = __lasx_xvaddi_hu(sum1, 1);
+ sum0 = __lasx_xvsrani_b_h(sum1, sum0, 2);
+ __lasx_xvstelm_d(sum0, dst, 0, 0);
+ __lasx_xvstelm_d(sum0, dst + dst_stride, 0, 2);
+ __lasx_xvstelm_d(sum0, dst + dst_stride_2x, 0, 1);
+ __lasx_xvstelm_d(sum0, dst + dst_stride_3x, 0, 3);
+}
+
+void ff_put_no_rnd_pixels8_xy2_8_lasx(uint8_t *block, const uint8_t *pixels,
+ ptrdiff_t line_size, int h)
+{
+ if (h == 8) {
+ common_hv_bil_no_rnd_8x8_lasx(pixels, line_size, block, line_size);
+ } else if (h == 4) {
+ common_hv_bil_no_rnd_4x8_lasx(pixels, line_size, block, line_size);
+ }
+}
+
+static void common_hv_bil_16w_lasx(const uint8_t *src, int32_t src_stride,
+ uint8_t *dst, int32_t dst_stride,
+ uint8_t height)
+{
+ __m256i src0, src1, src2, src3, src4, src5, src6, src7, src8, src9;
+ __m256i src10, src11, src12, src13, src14, src15, src16, src17;
+ __m256i sum0, sum1, sum2, sum3, sum4, sum5, sum6, sum7;
+ uint8_t loop_cnt;
+ int32_t src_stride_2x = src_stride << 1;
+ int32_t src_stride_4x = src_stride << 2;
+ int32_t src_stride_3x = src_stride_2x + src_stride;
+ uint8_t* _src = (uint8_t*)src;
+
+ for (loop_cnt = (height >> 3); loop_cnt--;) {
+ src0 = __lasx_xvld(_src, 0);
+ DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x, src1, src2);
+ src3 = __lasx_xvldx(_src, src_stride_3x);
+ _src += src_stride_4x;
+ src4 = __lasx_xvld(_src, 0);
+ DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x, src5, src6);
+ src7 = __lasx_xvldx(_src, src_stride_3x);
+ _src += (1 - src_stride_4x);
+ src9 = __lasx_xvld(_src, 0);
+ DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x,
+ src10, src11);
+ src12 = __lasx_xvldx(_src, src_stride_3x);
+ _src += src_stride_4x;
+ src13 = __lasx_xvld(_src, 0);
+ DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x,
+ src14, src15);
+ src16 = __lasx_xvldx(_src, src_stride_3x);
+ _src += (src_stride_4x - 1);
+ DUP2_ARG2(__lasx_xvld, _src, 0, _src, 1, src8, src17);
+
+ DUP4_ARG3(__lasx_xvpermi_q, src0, src4, 0x02, src1, src5, 0x02, src2,
+ src6, 0x02, src3, src7, 0x02, src0, src1, src2, src3);
+ DUP4_ARG3(__lasx_xvpermi_q, src4, src8, 0x02, src9, src13, 0x02, src10,
+ src14, 0x02, src11, src15, 0x02, src4, src5, src6, src7);
+ DUP2_ARG3(__lasx_xvpermi_q, src12, src16, 0x02, src13, src17, 0x02,
+ src8, src9);
+
+ DUP4_ARG2(__lasx_xvilvl_h, src5, src0, src6, src1, src7, src2, src8,
+ src3, sum0, sum2, sum4, sum6);
+ DUP4_ARG2(__lasx_xvilvh_h, src5, src0, src6, src1, src7, src2, src8,
+ src3, sum1, sum3, sum5, sum7);
+ src8 = __lasx_xvilvl_h(src9, src4);
+ src9 = __lasx_xvilvh_h(src9, src4);
+
+ DUP4_ARG2(__lasx_xvhaddw_hu_bu, sum0, sum0, sum1, sum1, sum2, sum2,
+ sum3, sum3, src0, src1, src2, src3);
+ DUP4_ARG2(__lasx_xvhaddw_hu_bu, sum4, sum4, sum5, sum5, sum6, sum6,
+ sum7, sum7, src4, src5, src6, src7);
+ DUP2_ARG2(__lasx_xvhaddw_hu_bu, src8, src8, src9, src9, src8, src9);
+
+ DUP4_ARG2(__lasx_xvadd_h, src0, src2, src1, src3, src2, src4, src3,
+ src5, sum0, sum1, sum2, sum3);
+ DUP4_ARG2(__lasx_xvadd_h, src4, src6, src5, src7, src6, src8, src7,
+ src9, sum4, sum5, sum6, sum7);
+ DUP4_ARG3(__lasx_xvsrarni_b_h, sum1, sum0, 2, sum3, sum2, 2, sum5,
+ sum4, 2, sum7, sum6, 2, sum0, sum1, sum2, sum3);
+ __lasx_xvstelm_d(sum0, dst, 0, 0);
+ __lasx_xvstelm_d(sum0, dst, 8, 1);
+ dst += dst_stride;
+ __lasx_xvstelm_d(sum1, dst, 0, 0);
+ __lasx_xvstelm_d(sum1, dst, 8, 1);
+ dst += dst_stride;
+ __lasx_xvstelm_d(sum2, dst, 0, 0);
+ __lasx_xvstelm_d(sum2, dst, 8, 1);
+ dst += dst_stride;
+ __lasx_xvstelm_d(sum3, dst, 0, 0);
+ __lasx_xvstelm_d(sum3, dst, 8, 1);
+ dst += dst_stride;
+ __lasx_xvstelm_d(sum0, dst, 0, 2);
+ __lasx_xvstelm_d(sum0, dst, 8, 3);
+ dst += dst_stride;
+ __lasx_xvstelm_d(sum1, dst, 0, 2);
+ __lasx_xvstelm_d(sum1, dst, 8, 3);
+ dst += dst_stride;
+ __lasx_xvstelm_d(sum2, dst, 0, 2);
+ __lasx_xvstelm_d(sum2, dst, 8, 3);
+ dst += dst_stride;
+ __lasx_xvstelm_d(sum3, dst, 0, 2);
+ __lasx_xvstelm_d(sum3, dst, 8, 3);
+ dst += dst_stride;
+ }
+}
+
+void ff_put_pixels16_xy2_8_lasx(uint8_t *block, const uint8_t *pixels,
+ ptrdiff_t line_size, int h)
+{
+ common_hv_bil_16w_lasx(pixels, line_size, block, line_size, h);
+}
+
+static void common_hv_bil_8w_lasx(const uint8_t *src, int32_t src_stride,
+ uint8_t *dst, int32_t dst_stride,
+ uint8_t height)
+{
+ __m256i src0, src1, src2, src3, src4, src5, src6, src7;
+ __m256i src8, src9, sum0, sum1;
+ uint8_t loop_cnt;
+ int32_t src_stride_2x = src_stride << 1;
+ int32_t src_stride_4x = src_stride << 2;
+ int32_t dst_stride_2x = dst_stride << 1;
+ int32_t dst_stride_4x = dst_stride << 2;
+ int32_t dst_stride_3x = dst_stride_2x + dst_stride;
+ int32_t src_stride_3x = src_stride_2x + src_stride;
+ uint8_t* _src = (uint8_t*)src;
+
+ DUP2_ARG2(__lasx_xvld, _src, 0, _src, 1, src0, src5);
+ _src += src_stride;
+
+ for (loop_cnt = (height >> 2); loop_cnt--;) {
+ src1 = __lasx_xvld(_src, 0);
+ DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x, src2, src3);
+ src4 = __lasx_xvldx(_src, src_stride_3x);
+ _src += 1;
+ src6 = __lasx_xvld(_src, 0);
+ DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x, src7, src8);
+ src9 = __lasx_xvldx(_src, src_stride_3x);
+ _src += (src_stride_4x - 1);
+ DUP4_ARG2(__lasx_xvilvl_b, src5, src0, src6, src1, src7, src2, src8, src3,
+ src0, src1, src2, src3);
+ src5 = __lasx_xvilvl_b(src9, src4);
+ DUP4_ARG3(__lasx_xvpermi_q, src1, src0, 0x20, src2, src1, 0x20, src3, src2,
+ 0x20, src5, src3, 0x20, src0, src1, src2, src3);
+ DUP4_ARG2(__lasx_xvhaddw_hu_bu, src0, src0, src1, src1, src2, src2,
+ src3, src3, src0, src1, src2, src3);
+ DUP2_ARG2(__lasx_xvadd_h, src0, src1, src2, src3, sum0, sum1);
+ sum0 = __lasx_xvsrarni_b_h(sum1, sum0, 2);
+ __lasx_xvstelm_d(sum0, dst, 0, 0);
+ __lasx_xvstelm_d(sum0, dst + dst_stride, 0, 2);
+ __lasx_xvstelm_d(sum0, dst + dst_stride_2x, 0, 1);
+ __lasx_xvstelm_d(sum0, dst + dst_stride_3x, 0, 3);
+ dst += dst_stride_4x;
+ src0 = src4;
+ src5 = src9;
+ }
+}
+
+void ff_put_pixels8_xy2_8_lasx(uint8_t *block, const uint8_t *pixels,
+ ptrdiff_t line_size, int h)
+{
+ common_hv_bil_8w_lasx(pixels, line_size, block, line_size, h);
+}
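[Editorial aside, not part of the patch: the xy2 kernels above all compute the half-pel (x+1/2, y+1/2) bilinear average over a 2x2 source neighbourhood; the only arithmetic difference between the `ff_put_pixels*_xy2` and `ff_put_no_rnd_pixels*_xy2` variants is the bias added before the final shift — the rounding shift `__lasx_xvsrarni_b_h(.., 2)` implies a bias of 2, while the no-rounding path uses an explicit `__lasx_xvaddi_hu(.., 1)` followed by the plain narrowing shift `__lasx_xvsrani_b_h(.., 2)`. A scalar sketch of both, with a hypothetical function name and ignoring the vector lane ordering, is:]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of the xy2 half-pel kernels: each output pixel is
 * (a + b + c + d + bias) >> 2 over a 2x2 neighbourhood.
 * bias == 2 models the rounding variant (xvsrarni_b_h);
 * bias == 1 models the no-rounding variant (xvaddi_hu + xvsrani_b_h).
 * The name put_pixels_xy2_scalar is ours, for illustration only. */
static void put_pixels_xy2_scalar(uint8_t *dst, const uint8_t *src,
                                  ptrdiff_t stride, int w, int h, int bias)
{
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++)
            dst[x] = (src[x] + src[x + 1] +
                      src[x + stride] + src[x + stride + 1] + bias) >> 2;
        src += stride;
        dst += stride;
    }
}
```

The two biases only differ when the 2x2 sum is congruent to 2 or 3 modulo 4, which is why the "no_rnd" flavour exists as a distinct set of entry points.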
diff --git a/libavcodec/loongarch/hpeldsp_lasx.h b/libavcodec/loongarch/hpeldsp_lasx.h
new file mode 100644
index 0000000000..2e035eade8
--- /dev/null
+++ b/libavcodec/loongarch/hpeldsp_lasx.h
@@ -0,0 +1,58 @@
+/*
+ * Copyright (c) 2021 Loongson Technology Corporation Limited
+ * Contributed by Shiyou Yin <yinshiyou-hf@loongson.cn>
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#ifndef AVCODEC_LOONGARCH_HPELDSP_LASX_H
+#define AVCODEC_LOONGARCH_HPELDSP_LASX_H
+
+#include <stdint.h>
+#include <stddef.h>
+#include "libavutil/attributes.h"
+
+void ff_put_pixels8_8_lasx(uint8_t *block, const uint8_t *pixels,
+ ptrdiff_t line_size, int h);
+void ff_put_pixels8_x2_8_lasx(uint8_t *block, const uint8_t *pixels,
+ ptrdiff_t line_size, int32_t h);
+void ff_put_pixels8_y2_8_lasx(uint8_t *block, const uint8_t *pixels,
+ ptrdiff_t line_size, int32_t h);
void ff_put_pixels16_8_lasx(uint8_t *block, const uint8_t *pixels,
ptrdiff_t line_size, int h);
+void ff_put_pixels16_x2_8_lasx(uint8_t *block, const uint8_t *pixels,
+ ptrdiff_t line_size, int32_t h);
+void ff_put_pixels16_y2_8_lasx(uint8_t *block, const uint8_t *pixels,
+ ptrdiff_t line_size, int32_t h);
+void ff_put_no_rnd_pixels16_x2_8_lasx(uint8_t *block, const uint8_t *pixels,
+ ptrdiff_t line_size, int h);
+void ff_put_no_rnd_pixels16_y2_8_lasx(uint8_t *block, const uint8_t *pixels,
+ ptrdiff_t line_size, int h);
+void ff_put_no_rnd_pixels16_xy2_8_lasx(uint8_t *block,
+ const uint8_t *pixels,
+ ptrdiff_t line_size, int h);
+void ff_put_no_rnd_pixels8_x2_8_lasx(uint8_t *block, const uint8_t *pixels,
+ ptrdiff_t line_size, int h);
+void ff_put_no_rnd_pixels8_y2_8_lasx(uint8_t *block, const uint8_t *pixels,
+ ptrdiff_t line_size, int h);
+void ff_put_no_rnd_pixels8_xy2_8_lasx(uint8_t *block, const uint8_t *pixels,
+ ptrdiff_t line_size, int h);
+void ff_put_pixels8_xy2_8_lasx(uint8_t *block, const uint8_t *pixels,
+ ptrdiff_t line_size, int h);
+void ff_put_pixels16_xy2_8_lasx(uint8_t *block, const uint8_t *pixels,
+ ptrdiff_t line_size, int h);
+#endif
--
2.20.1
* [FFmpeg-devel] [PATCH v3 2/3] avcodec: [loongarch] Optimize idctdstp with LASX.
2021-12-29 10:18 [FFmpeg-devel] Optimize Mpeg4 decoding for loongarch Hao Chen
2021-12-29 10:18 ` [FFmpeg-devel] [PATCH v3 1/3] avcodec: [loongarch] Optimize hpeldsp with LASX Hao Chen
@ 2021-12-29 10:18 ` Hao Chen
2021-12-29 10:18 ` [FFmpeg-devel] [PATCH v3 3/3] avcodec: [loongarch] Optimize prefetch with loongarch Hao Chen
2022-01-03 11:24 ` [FFmpeg-devel] Optimize Mpeg4 decoding for loongarch 殷时友
3 siblings, 0 replies; 7+ messages in thread
From: Hao Chen @ 2021-12-29 10:18 UTC (permalink / raw)
To: ffmpeg-devel
./ffmpeg -i 8_mpeg4_1080p_24fps_12Mbps.avi -f rawvideo -y /dev/null -an
before: 433fps
after : 552fps
---
libavcodec/idctdsp.c | 2 +
libavcodec/idctdsp.h | 2 +
libavcodec/loongarch/Makefile | 3 +
libavcodec/loongarch/idctdsp_init_loongarch.c | 45 +++
libavcodec/loongarch/idctdsp_lasx.c | 124 ++++++++
libavcodec/loongarch/idctdsp_loongarch.h | 41 +++
libavcodec/loongarch/simple_idct_lasx.c | 297 ++++++++++++++++++
7 files changed, 514 insertions(+)
create mode 100644 libavcodec/loongarch/idctdsp_init_loongarch.c
create mode 100644 libavcodec/loongarch/idctdsp_lasx.c
create mode 100644 libavcodec/loongarch/idctdsp_loongarch.h
create mode 100644 libavcodec/loongarch/simple_idct_lasx.c
diff --git a/libavcodec/idctdsp.c b/libavcodec/idctdsp.c
index 846ed0b0f8..71bd03c606 100644
--- a/libavcodec/idctdsp.c
+++ b/libavcodec/idctdsp.c
@@ -315,6 +315,8 @@ av_cold void ff_idctdsp_init(IDCTDSPContext *c, AVCodecContext *avctx)
ff_idctdsp_init_x86(c, avctx, high_bit_depth);
if (ARCH_MIPS)
ff_idctdsp_init_mips(c, avctx, high_bit_depth);
+ if (ARCH_LOONGARCH)
+ ff_idctdsp_init_loongarch(c, avctx, high_bit_depth);
ff_init_scantable_permutation(c->idct_permutation,
c->perm_type);
diff --git a/libavcodec/idctdsp.h b/libavcodec/idctdsp.h
index ca21a31a02..014488aec3 100644
--- a/libavcodec/idctdsp.h
+++ b/libavcodec/idctdsp.h
@@ -118,5 +118,7 @@ void ff_idctdsp_init_x86(IDCTDSPContext *c, AVCodecContext *avctx,
unsigned high_bit_depth);
void ff_idctdsp_init_mips(IDCTDSPContext *c, AVCodecContext *avctx,
unsigned high_bit_depth);
+void ff_idctdsp_init_loongarch(IDCTDSPContext *c, AVCodecContext *avctx,
+ unsigned high_bit_depth);
#endif /* AVCODEC_IDCTDSP_H */
diff --git a/libavcodec/loongarch/Makefile b/libavcodec/loongarch/Makefile
index 07a401d883..c4d71e801b 100644
--- a/libavcodec/loongarch/Makefile
+++ b/libavcodec/loongarch/Makefile
@@ -6,6 +6,7 @@ OBJS-$(CONFIG_VP8_DECODER) += loongarch/vp8dsp_init_loongarch.o
OBJS-$(CONFIG_VP9_DECODER) += loongarch/vp9dsp_init_loongarch.o
OBJS-$(CONFIG_VC1DSP) += loongarch/vc1dsp_init_loongarch.o
OBJS-$(CONFIG_HPELDSP) += loongarch/hpeldsp_init_loongarch.o
+OBJS-$(CONFIG_IDCTDSP) += loongarch/idctdsp_init_loongarch.o
LASX-OBJS-$(CONFIG_H264CHROMA) += loongarch/h264chroma_lasx.o
LASX-OBJS-$(CONFIG_H264QPEL) += loongarch/h264qpel_lasx.o
LASX-OBJS-$(CONFIG_H264DSP) += loongarch/h264dsp_lasx.o \
@@ -14,6 +15,8 @@ LASX-OBJS-$(CONFIG_H264DSP) += loongarch/h264dsp_lasx.o \
LASX-OBJS-$(CONFIG_H264PRED) += loongarch/h264_intrapred_lasx.o
LASX-OBJS-$(CONFIG_VC1_DECODER) += loongarch/vc1dsp_lasx.o
LASX-OBJS-$(CONFIG_HPELDSP) += loongarch/hpeldsp_lasx.o
+LASX-OBJS-$(CONFIG_IDCTDSP) += loongarch/simple_idct_lasx.o \
+ loongarch/idctdsp_lasx.o
LSX-OBJS-$(CONFIG_VP8_DECODER) += loongarch/vp8_mc_lsx.o \
loongarch/vp8_lpf_lsx.o
LSX-OBJS-$(CONFIG_VP9_DECODER) += loongarch/vp9_mc_lsx.o \
diff --git a/libavcodec/loongarch/idctdsp_init_loongarch.c b/libavcodec/loongarch/idctdsp_init_loongarch.c
new file mode 100644
index 0000000000..9d1d21cc18
--- /dev/null
+++ b/libavcodec/loongarch/idctdsp_init_loongarch.c
@@ -0,0 +1,45 @@
+/*
+ * Copyright (c) 2021 Loongson Technology Corporation Limited
+ * Contributed by Hao Chen <chenhao@loongson.cn>
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "libavutil/loongarch/cpu.h"
+#include "idctdsp_loongarch.h"
+#include "libavcodec/xvididct.h"
+
+av_cold void ff_idctdsp_init_loongarch(IDCTDSPContext *c, AVCodecContext *avctx,
+ unsigned high_bit_depth)
+{
+ int cpu_flags = av_get_cpu_flags();
+
+ if (have_lasx(cpu_flags)) {
+ if ((avctx->lowres != 1) && (avctx->lowres != 2) && (avctx->lowres != 3) &&
+ (avctx->bits_per_raw_sample != 10) &&
+ (avctx->bits_per_raw_sample != 12) &&
+ (avctx->idct_algo == FF_IDCT_AUTO)) {
+ c->idct_put = ff_simple_idct_put_lasx;
+ c->idct_add = ff_simple_idct_add_lasx;
+ c->idct = ff_simple_idct_lasx;
+ c->perm_type = FF_IDCT_PERM_NONE;
+ }
+ c->put_pixels_clamped = ff_put_pixels_clamped_lasx;
+ c->put_signed_pixels_clamped = ff_put_signed_pixels_clamped_lasx;
+ c->add_pixels_clamped = ff_add_pixels_clamped_lasx;
+ }
+}
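[Editorial aside, not part of the patch: the init function above follows FFmpeg's usual runtime-dispatch pattern — probe CPU capabilities once, then install SIMD function pointers into the DSP context only when the feature bit is present. A minimal stand-alone sketch of that pattern, with illustrative names that are not FFmpeg API, is:]

```c
#include <assert.h>

/* Minimal model of runtime dispatch: a context of function pointers is
 * filled with generic C fallbacks, then selectively overridden when the
 * CPU reports the SIMD feature.  idct_c/idct_lasx here are dummies that
 * just tag which path ran; all names are ours, for illustration only. */
typedef void (*idct_fn)(short *block);

static void idct_c(short *block)    { block[0] = 1; } /* generic path */
static void idct_lasx(short *block) { block[0] = 2; } /* SIMD path    */

struct idct_ctx { idct_fn idct; };

static void idct_init(struct idct_ctx *c, int have_lasx)
{
    c->idct = idct_c;         /* safe default, always valid */
    if (have_lasx)
        c->idct = idct_lasx;  /* override when LASX is available */
}
```

In the real code the probe is `have_lasx(av_get_cpu_flags())`, and the override is further gated on bit depth, lowres, and `idct_algo == FF_IDCT_AUTO`, since the LASX simple IDCT only matches the 8-bit full-resolution path.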
diff --git a/libavcodec/loongarch/idctdsp_lasx.c b/libavcodec/loongarch/idctdsp_lasx.c
new file mode 100644
index 0000000000..1cfab0e028
--- /dev/null
+++ b/libavcodec/loongarch/idctdsp_lasx.c
@@ -0,0 +1,124 @@
+/*
+ * Copyright (c) 2021 Loongson Technology Corporation Limited
+ * Contributed by Hao Chen <chenhao@loongson.cn>
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "idctdsp_loongarch.h"
+#include "libavutil/loongarch/loongson_intrinsics.h"
+
+void ff_put_pixels_clamped_lasx(const int16_t *block,
+ uint8_t *av_restrict pixels,
+ ptrdiff_t stride)
+{
+ __m256i b0, b1, b2, b3;
+ __m256i temp0, temp1;
+ ptrdiff_t stride_2x = stride << 1;
+ ptrdiff_t stride_4x = stride << 2;
+ ptrdiff_t stride_3x = stride_2x + stride;
+
+ DUP4_ARG2(__lasx_xvld, block, 0, block, 32, block, 64, block, 96,
+ b0, b1, b2, b3);
+ DUP4_ARG1(__lasx_xvclip255_h, b0, b1, b2, b3, b0, b1, b2, b3);
+ DUP2_ARG2(__lasx_xvpickev_b, b1, b0, b3, b2, temp0, temp1);
+ __lasx_xvstelm_d(temp0, pixels, 0, 0);
+ __lasx_xvstelm_d(temp0, pixels + stride, 0, 2);
+ __lasx_xvstelm_d(temp0, pixels + stride_2x, 0, 1);
+ __lasx_xvstelm_d(temp0, pixels + stride_3x, 0, 3);
+ pixels += stride_4x;
+ __lasx_xvstelm_d(temp1, pixels, 0, 0);
+ __lasx_xvstelm_d(temp1, pixels + stride, 0, 2);
+ __lasx_xvstelm_d(temp1, pixels + stride_2x, 0, 1);
+ __lasx_xvstelm_d(temp1, pixels + stride_3x, 0, 3);
+}
+
+void ff_put_signed_pixels_clamped_lasx(const int16_t *block,
+ uint8_t *av_restrict pixels,
+ ptrdiff_t stride)
+{
+ __m256i b0, b1, b2, b3;
+ __m256i temp0, temp1;
+ __m256i const_128 = {0x0080008000800080, 0x0080008000800080,
+ 0x0080008000800080, 0x0080008000800080};
+ ptrdiff_t stride_2x = stride << 1;
+ ptrdiff_t stride_4x = stride << 2;
+ ptrdiff_t stride_3x = stride_2x + stride;
+
+ DUP4_ARG2(__lasx_xvld, block, 0, block, 32, block, 64, block, 96,
+ b0, b1, b2, b3);
+ DUP4_ARG2(__lasx_xvadd_h, b0, const_128, b1, const_128, b2, const_128,
+ b3, const_128, b0, b1, b2, b3);
+ DUP4_ARG1(__lasx_xvclip255_h, b0, b1, b2, b3, b0, b1, b2, b3);
+ DUP2_ARG2(__lasx_xvpickev_b, b1, b0, b3, b2, temp0, temp1);
+ __lasx_xvstelm_d(temp0, pixels, 0, 0);
+ __lasx_xvstelm_d(temp0, pixels + stride, 0, 2);
+ __lasx_xvstelm_d(temp0, pixels + stride_2x, 0, 1);
+ __lasx_xvstelm_d(temp0, pixels + stride_3x, 0, 3);
+ pixels += stride_4x;
+ __lasx_xvstelm_d(temp1, pixels, 0, 0);
+ __lasx_xvstelm_d(temp1, pixels + stride, 0, 2);
+ __lasx_xvstelm_d(temp1, pixels + stride_2x, 0, 1);
+ __lasx_xvstelm_d(temp1, pixels + stride_3x, 0, 3);
+}
+
+void ff_add_pixels_clamped_lasx(const int16_t *block,
+ uint8_t *av_restrict pixels,
+ ptrdiff_t stride)
+{
+ __m256i b0, b1, b2, b3;
+ __m256i p0, p1, p2, p3, p4, p5, p6, p7;
+ __m256i temp0, temp1, temp2, temp3;
+ uint8_t *pix = pixels;
+ ptrdiff_t stride_2x = stride << 1;
+ ptrdiff_t stride_4x = stride << 2;
+ ptrdiff_t stride_3x = stride_2x + stride;
+
+ DUP4_ARG2(__lasx_xvld, block, 0, block, 32, block, 64, block, 96,
+ b0, b1, b2, b3);
+ p0 = __lasx_xvldrepl_d(pix, 0);
+ pix += stride;
+ p1 = __lasx_xvldrepl_d(pix, 0);
+ pix += stride;
+ p2 = __lasx_xvldrepl_d(pix, 0);
+ pix += stride;
+ p3 = __lasx_xvldrepl_d(pix, 0);
+ pix += stride;
+ p4 = __lasx_xvldrepl_d(pix, 0);
+ pix += stride;
+ p5 = __lasx_xvldrepl_d(pix, 0);
+ pix += stride;
+ p6 = __lasx_xvldrepl_d(pix, 0);
+ pix += stride;
+ p7 = __lasx_xvldrepl_d(pix, 0);
+ DUP4_ARG3(__lasx_xvpermi_q, p1, p0, 0x20, p3, p2, 0x20, p5, p4, 0x20,
+ p7, p6, 0x20, temp0, temp1, temp2, temp3);
+ DUP4_ARG2(__lasx_xvaddw_h_h_bu, b0, temp0, b1, temp1, b2, temp2, b3, temp3,
+ temp0, temp1, temp2, temp3);
+ DUP4_ARG1(__lasx_xvclip255_h, temp0, temp1, temp2, temp3,
+ temp0, temp1, temp2, temp3);
+ DUP2_ARG2(__lasx_xvpickev_b, temp1, temp0, temp3, temp2, temp0, temp1);
+ __lasx_xvstelm_d(temp0, pixels, 0, 0);
+ __lasx_xvstelm_d(temp0, pixels + stride, 0, 2);
+ __lasx_xvstelm_d(temp0, pixels + stride_2x, 0, 1);
+ __lasx_xvstelm_d(temp0, pixels + stride_3x, 0, 3);
+ pixels += stride_4x;
+ __lasx_xvstelm_d(temp1, pixels, 0, 0);
+ __lasx_xvstelm_d(temp1, pixels + stride, 0, 2);
+ __lasx_xvstelm_d(temp1, pixels + stride_2x, 0, 1);
+ __lasx_xvstelm_d(temp1, pixels + stride_3x, 0, 3);
+}
diff --git a/libavcodec/loongarch/idctdsp_loongarch.h b/libavcodec/loongarch/idctdsp_loongarch.h
new file mode 100644
index 0000000000..cae8e7af58
--- /dev/null
+++ b/libavcodec/loongarch/idctdsp_loongarch.h
@@ -0,0 +1,41 @@
+/*
+ * Copyright (c) 2021 Loongson Technology Corporation Limited
+ * Contributed by Hao Chen <chenhao@loongson.cn>
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#ifndef AVCODEC_LOONGARCH_IDCTDSP_LOONGARCH_H
+#define AVCODEC_LOONGARCH_IDCTDSP_LOONGARCH_H
+
+#include <stdint.h>
+#include "libavcodec/mpegvideo.h"
+
+void ff_simple_idct_lasx(int16_t *block);
+void ff_simple_idct_put_lasx(uint8_t *dest, ptrdiff_t stride_dst, int16_t *block);
+void ff_simple_idct_add_lasx(uint8_t *dest, ptrdiff_t stride_dst, int16_t *block);
+void ff_put_pixels_clamped_lasx(const int16_t *block,
+ uint8_t *av_restrict pixels,
+ ptrdiff_t line_size);
+void ff_put_signed_pixels_clamped_lasx(const int16_t *block,
+ uint8_t *av_restrict pixels,
+ ptrdiff_t line_size);
+void ff_add_pixels_clamped_lasx(const int16_t *block,
+ uint8_t *av_restrict pixels,
+ ptrdiff_t line_size);
+
+#endif /* AVCODEC_LOONGARCH_IDCTDSP_LOONGARCH_H */
diff --git a/libavcodec/loongarch/simple_idct_lasx.c b/libavcodec/loongarch/simple_idct_lasx.c
new file mode 100644
index 0000000000..a0d936b666
--- /dev/null
+++ b/libavcodec/loongarch/simple_idct_lasx.c
@@ -0,0 +1,297 @@
+/*
+ * Copyright (c) 2021 Loongson Technology Corporation Limited
+ * Contributed by Hao Chen <chenhao@loongson.cn>
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "libavutil/loongarch/loongson_intrinsics.h"
+#include "idctdsp_loongarch.h"
+
+#define LASX_TRANSPOSE4x16(in_0, in_1, in_2, in_3, out_0, out_1, out_2, out_3) \
+{ \
+ __m256i temp_0, temp_1, temp_2, temp_3; \
+ __m256i temp_4, temp_5, temp_6, temp_7; \
+ DUP4_ARG3(__lasx_xvpermi_q, in_2, in_0, 0x20, in_2, in_0, 0x31, in_3, in_1,\
+ 0x20, in_3, in_1, 0x31, temp_0, temp_1, temp_2, temp_3); \
+ DUP2_ARG2(__lasx_xvilvl_h, temp_1, temp_0, temp_3, temp_2, temp_4, temp_6);\
+ DUP2_ARG2(__lasx_xvilvh_h, temp_1, temp_0, temp_3, temp_2, temp_5, temp_7);\
+ DUP2_ARG2(__lasx_xvilvl_w, temp_6, temp_4, temp_7, temp_5, out_0, out_2); \
+ DUP2_ARG2(__lasx_xvilvh_w, temp_6, temp_4, temp_7, temp_5, out_1, out_3); \
+}
+
+#define LASX_IDCTROWCONDDC \
+ const_val = 16383 * ((1 << 19) / 16383); \
+ const_val1 = __lasx_xvreplgr2vr_w(const_val); \
+ DUP4_ARG2(__lasx_xvld, block, 0, block, 32, block, 64, block, 96, \
+ in0, in1, in2, in3); \
+ LASX_TRANSPOSE4x16(in0, in1, in2, in3, in0, in1, in2, in3); \
+ a0 = __lasx_xvpermi_d(in0, 0xD8); \
+ a0 = __lasx_vext2xv_w_h(a0); \
+ temp = __lasx_xvslli_w(a0, 3); \
+ a1 = __lasx_xvpermi_d(in0, 0x8D); \
+ a1 = __lasx_vext2xv_w_h(a1); \
+ a2 = __lasx_xvpermi_d(in1, 0xD8); \
+ a2 = __lasx_vext2xv_w_h(a2); \
+ a3 = __lasx_xvpermi_d(in1, 0x8D); \
+ a3 = __lasx_vext2xv_w_h(a3); \
+ b0 = __lasx_xvpermi_d(in2, 0xD8); \
+ b0 = __lasx_vext2xv_w_h(b0); \
+ b1 = __lasx_xvpermi_d(in2, 0x8D); \
+ b1 = __lasx_vext2xv_w_h(b1); \
+ b2 = __lasx_xvpermi_d(in3, 0xD8); \
+ b2 = __lasx_vext2xv_w_h(b2); \
+ b3 = __lasx_xvpermi_d(in3, 0x8D); \
+ b3 = __lasx_vext2xv_w_h(b3); \
+ select_vec = a0 | a1 | a2 | a3 | b0 | b1 | b2 | b3; \
+ select_vec = __lasx_xvslti_wu(select_vec, 1); \
+ \
+ DUP4_ARG2(__lasx_xvrepl128vei_h, w1, 2, w1, 3, w1, 4, w1, 5, \
+ w2, w3, w4, w5); \
+ DUP2_ARG2(__lasx_xvrepl128vei_h, w1, 6, w1, 7, w6, w7); \
+ w1 = __lasx_xvrepl128vei_h(w1, 1); \
+ \
+ /* part of FUNC6(idctRowCondDC) */ \
+ temp0 = __lasx_xvmaddwl_w_h(const_val0, in0, w4); \
+ DUP2_ARG2(__lasx_xvmulwl_w_h, in1, w2, in1, w6, temp1, temp2); \
+ a0 = __lasx_xvadd_w(temp0, temp1); \
+ a1 = __lasx_xvadd_w(temp0, temp2); \
+ a2 = __lasx_xvsub_w(temp0, temp2); \
+ a3 = __lasx_xvsub_w(temp0, temp1); \
+ \
+ DUP2_ARG2(__lasx_xvilvh_h, in1, in0, w3, w1, temp0, temp1); \
+ b0 = __lasx_xvdp2_w_h(temp0, temp1); \
+ temp1 = __lasx_xvneg_h(w7); \
+ temp2 = __lasx_xvilvl_h(temp1, w3); \
+ b1 = __lasx_xvdp2_w_h(temp0, temp2); \
+ temp1 = __lasx_xvneg_h(w1); \
+ temp2 = __lasx_xvilvl_h(temp1, w5); \
+ b2 = __lasx_xvdp2_w_h(temp0, temp2); \
+ temp1 = __lasx_xvneg_h(w5); \
+ temp2 = __lasx_xvilvl_h(temp1, w7); \
+ b3 = __lasx_xvdp2_w_h(temp0, temp2); \
+ \
+    /* if (AV_RN64A(row + 4)) */                                              \
+ DUP2_ARG2(__lasx_xvilvl_h, in3, in2, w6, w4, temp0, temp1); \
+ a0 = __lasx_xvdp2add_w_h(a0, temp0, temp1); \
+ temp1 = __lasx_xvilvl_h(w2, w4); \
+ a1 = __lasx_xvdp2sub_w_h(a1, temp0, temp1); \
+ temp1 = __lasx_xvneg_h(w4); \
+ temp2 = __lasx_xvilvl_h(w2, temp1); \
+ a2 = __lasx_xvdp2add_w_h(a2, temp0, temp2); \
+ temp1 = __lasx_xvneg_h(w6); \
+ temp2 = __lasx_xvilvl_h(temp1, w4); \
+ a3 = __lasx_xvdp2add_w_h(a3, temp0, temp2); \
+ \
+ DUP2_ARG2(__lasx_xvilvh_h, in3, in2, w7, w5, temp0, temp1); \
+ b0 = __lasx_xvdp2add_w_h(b0, temp0, temp1); \
+ DUP2_ARG2(__lasx_xvilvl_h, w5, w1, w3, w7, temp1, temp2); \
+ b1 = __lasx_xvdp2sub_w_h(b1, temp0, temp1); \
+ b2 = __lasx_xvdp2add_w_h(b2, temp0, temp2); \
+ temp1 = __lasx_xvneg_h(w1); \
+ temp2 = __lasx_xvilvl_h(temp1, w3); \
+ b3 = __lasx_xvdp2add_w_h(b3, temp0, temp2); \
+ \
+ DUP4_ARG2(__lasx_xvadd_w, a0, b0, a1, b1, a2, b2, a3, b3, \
+ temp0, temp1, temp2, temp3); \
+ DUP4_ARG2(__lasx_xvsub_w, a0, b0, a1, b1, a2, b2, a3, b3, \
+ a0, a1, a2, a3); \
+ DUP4_ARG2(__lasx_xvsrai_w, temp0, 11, temp1, 11, temp2, 11, temp3, 11, \
+ temp0, temp1, temp2, temp3); \
+ DUP4_ARG2(__lasx_xvsrai_w, a0, 11, a1, 11, a2, 11, a3, 11, a0, a1, a2, a3);\
+ DUP4_ARG3(__lasx_xvbitsel_v, temp0, temp, select_vec, temp1, temp, \
+ select_vec, temp2, temp, select_vec, temp3, temp, select_vec, \
+ in0, in1, in2, in3); \
+ DUP4_ARG3(__lasx_xvbitsel_v, a0, temp, select_vec, a1, temp, \
+ select_vec, a2, temp, select_vec, a3, temp, select_vec, \
+ a0, a1, a2, a3); \
+ DUP4_ARG2(__lasx_xvpickev_h, in1, in0, in3, in2, a2, a3, a0, a1, \
+ in0, in1, in2, in3); \
+ DUP4_ARG2(__lasx_xvpermi_d, in0, 0xD8, in1, 0xD8, in2, 0xD8, in3, 0xD8, \
+ in0, in1, in2, in3); \
+
+#define LASX_IDCTCOLS \
+    /* part of FUNC6(idctSparseCol) */                                        \
+ LASX_TRANSPOSE4x16(in0, in1, in2, in3, in0, in1, in2, in3); \
+ temp0 = __lasx_xvmaddwl_w_h(const_val1, in0, w4); \
+ DUP2_ARG2(__lasx_xvmulwl_w_h, in1, w2, in1, w6, temp1, temp2); \
+ a0 = __lasx_xvadd_w(temp0, temp1); \
+ a1 = __lasx_xvadd_w(temp0, temp2); \
+ a2 = __lasx_xvsub_w(temp0, temp2); \
+ a3 = __lasx_xvsub_w(temp0, temp1); \
+ \
+ DUP2_ARG2(__lasx_xvilvh_h, in1, in0, w3, w1, temp0, temp1); \
+ b0 = __lasx_xvdp2_w_h(temp0, temp1); \
+ temp1 = __lasx_xvneg_h(w7); \
+ temp2 = __lasx_xvilvl_h(temp1, w3); \
+ b1 = __lasx_xvdp2_w_h(temp0, temp2); \
+ temp1 = __lasx_xvneg_h(w1); \
+ temp2 = __lasx_xvilvl_h(temp1, w5); \
+ b2 = __lasx_xvdp2_w_h(temp0, temp2); \
+ temp1 = __lasx_xvneg_h(w5); \
+ temp2 = __lasx_xvilvl_h(temp1, w7); \
+ b3 = __lasx_xvdp2_w_h(temp0, temp2); \
+ \
+    /* if (AV_RN64A(row + 4)) */                                              \
+ DUP2_ARG2(__lasx_xvilvl_h, in3, in2, w6, w4, temp0, temp1); \
+ a0 = __lasx_xvdp2add_w_h(a0, temp0, temp1); \
+ temp1 = __lasx_xvilvl_h(w2, w4); \
+ a1 = __lasx_xvdp2sub_w_h(a1, temp0, temp1); \
+ temp1 = __lasx_xvneg_h(w4); \
+ temp2 = __lasx_xvilvl_h(w2, temp1); \
+ a2 = __lasx_xvdp2add_w_h(a2, temp0, temp2); \
+ temp1 = __lasx_xvneg_h(w6); \
+ temp2 = __lasx_xvilvl_h(temp1, w4); \
+ a3 = __lasx_xvdp2add_w_h(a3, temp0, temp2); \
+ \
+ DUP2_ARG2(__lasx_xvilvh_h, in3, in2, w7, w5, temp0, temp1); \
+ b0 = __lasx_xvdp2add_w_h(b0, temp0, temp1); \
+ DUP2_ARG2(__lasx_xvilvl_h, w5, w1, w3, w7, temp1, temp2); \
+ b1 = __lasx_xvdp2sub_w_h(b1, temp0, temp1); \
+ b2 = __lasx_xvdp2add_w_h(b2, temp0, temp2); \
+ temp1 = __lasx_xvneg_h(w1); \
+ temp2 = __lasx_xvilvl_h(temp1, w3); \
+ b3 = __lasx_xvdp2add_w_h(b3, temp0, temp2); \
+ \
+ DUP4_ARG2(__lasx_xvadd_w, a0, b0, a1, b1, a2, b2, a3, b3, \
+ temp0, temp1, temp2, temp3); \
+ DUP4_ARG2(__lasx_xvsub_w, a3, b3, a2, b2, a1, b1, a0, b0, \
+ a3, a2, a1, a0); \
+ DUP4_ARG3(__lasx_xvsrani_h_w, temp1, temp0, 20, temp3, temp2, 20, a2, a3, \
+ 20, a0, a1, 20, in0, in1, in2, in3); \
+
+void ff_simple_idct_lasx(int16_t *block)
+{
+ int32_t const_val = 1 << 10;
+ __m256i w1 = {0x4B42539F58C50000, 0x11A822A332493FFF,
+ 0x4B42539F58C50000, 0x11A822A332493FFF};
+ __m256i in0, in1, in2, in3;
+ __m256i w2, w3, w4, w5, w6, w7;
+ __m256i a0, a1, a2, a3;
+ __m256i b0, b1, b2, b3;
+ __m256i temp0, temp1, temp2, temp3;
+ __m256i const_val0 = __lasx_xvreplgr2vr_w(const_val);
+ __m256i const_val1, select_vec, temp;
+
+ LASX_IDCTROWCONDDC
+ LASX_IDCTCOLS
+ DUP4_ARG2(__lasx_xvpermi_d, in0, 0xD8, in1, 0xD8, in2, 0xD8, in3, 0xD8,
+ in0, in1, in2, in3);
+ __lasx_xvst(in0, block, 0);
+ __lasx_xvst(in1, block, 32);
+ __lasx_xvst(in2, block, 64);
+ __lasx_xvst(in3, block, 96);
+}
+
+void ff_simple_idct_put_lasx(uint8_t *dst, ptrdiff_t dst_stride,
+ int16_t *block)
+{
+ int32_t const_val = 1 << 10;
+ ptrdiff_t dst_stride_2x = dst_stride << 1;
+ ptrdiff_t dst_stride_4x = dst_stride << 2;
+ ptrdiff_t dst_stride_3x = dst_stride_2x + dst_stride;
+ __m256i w1 = {0x4B42539F58C50000, 0x11A822A332493FFF,
+ 0x4B42539F58C50000, 0x11A822A332493FFF};
+ __m256i in0, in1, in2, in3;
+ __m256i w2, w3, w4, w5, w6, w7;
+ __m256i a0, a1, a2, a3;
+ __m256i b0, b1, b2, b3;
+ __m256i temp0, temp1, temp2, temp3;
+ __m256i const_val0 = __lasx_xvreplgr2vr_w(const_val);
+ __m256i const_val1, select_vec, temp;
+
+ LASX_IDCTROWCONDDC
+ LASX_IDCTCOLS
+ DUP4_ARG2(__lasx_xvpermi_d, in0, 0xD8, in1, 0xD8, in2, 0xD8, in3, 0xD8,
+ in0, in1, in2, in3);
+ DUP4_ARG1(__lasx_xvclip255_h, in0, in1, in2, in3, in0, in1, in2, in3);
+ DUP2_ARG2(__lasx_xvpickev_b, in1, in0, in3, in2, in0, in1);
+ __lasx_xvstelm_d(in0, dst, 0, 0);
+ __lasx_xvstelm_d(in0, dst + dst_stride, 0, 2);
+ __lasx_xvstelm_d(in0, dst + dst_stride_2x, 0, 1);
+ __lasx_xvstelm_d(in0, dst + dst_stride_3x, 0, 3);
+ dst += dst_stride_4x;
+ __lasx_xvstelm_d(in1, dst, 0, 0);
+ __lasx_xvstelm_d(in1, dst + dst_stride, 0, 2);
+ __lasx_xvstelm_d(in1, dst + dst_stride_2x, 0, 1);
+ __lasx_xvstelm_d(in1, dst + dst_stride_3x, 0, 3);
+}
+
+void ff_simple_idct_add_lasx(uint8_t *dst, ptrdiff_t dst_stride,
+ int16_t *block)
+{
+ int32_t const_val = 1 << 10;
+ uint8_t *dst1 = dst;
+ ptrdiff_t dst_stride_2x = dst_stride << 1;
+ ptrdiff_t dst_stride_4x = dst_stride << 2;
+ ptrdiff_t dst_stride_3x = dst_stride_2x + dst_stride;
+
+ __m256i w1 = {0x4B42539F58C50000, 0x11A822A332493FFF,
+ 0x4B42539F58C50000, 0x11A822A332493FFF};
+ __m256i sh = {0x0003000200010000, 0x000B000A00090008,
+ 0x0007000600050004, 0x000F000E000D000C};
+ __m256i in0, in1, in2, in3;
+ __m256i w2, w3, w4, w5, w6, w7;
+ __m256i a0, a1, a2, a3;
+ __m256i b0, b1, b2, b3;
+ __m256i temp0, temp1, temp2, temp3;
+ __m256i const_val0 = __lasx_xvreplgr2vr_w(const_val);
+ __m256i const_val1, select_vec, temp;
+
+ LASX_IDCTROWCONDDC
+ LASX_IDCTCOLS
+ a0 = __lasx_xvldrepl_d(dst1, 0);
+ a0 = __lasx_vext2xv_hu_bu(a0);
+ dst1 += dst_stride;
+ a1 = __lasx_xvldrepl_d(dst1, 0);
+ a1 = __lasx_vext2xv_hu_bu(a1);
+ dst1 += dst_stride;
+ a2 = __lasx_xvldrepl_d(dst1, 0);
+ a2 = __lasx_vext2xv_hu_bu(a2);
+ dst1 += dst_stride;
+ a3 = __lasx_xvldrepl_d(dst1, 0);
+ a3 = __lasx_vext2xv_hu_bu(a3);
+ dst1 += dst_stride;
+ b0 = __lasx_xvldrepl_d(dst1, 0);
+ b0 = __lasx_vext2xv_hu_bu(b0);
+ dst1 += dst_stride;
+ b1 = __lasx_xvldrepl_d(dst1, 0);
+ b1 = __lasx_vext2xv_hu_bu(b1);
+ dst1 += dst_stride;
+ b2 = __lasx_xvldrepl_d(dst1, 0);
+ b2 = __lasx_vext2xv_hu_bu(b2);
+ dst1 += dst_stride;
+ b3 = __lasx_xvldrepl_d(dst1, 0);
+ b3 = __lasx_vext2xv_hu_bu(b3);
+ DUP4_ARG3(__lasx_xvshuf_h, sh, a1, a0, sh, a3, a2, sh, b1, b0, sh, b3, b2,
+ temp0, temp1, temp2, temp3);
+ DUP4_ARG2(__lasx_xvadd_h, temp0, in0, temp1, in1, temp2, in2, temp3, in3,
+ in0, in1, in2, in3);
+ DUP4_ARG2(__lasx_xvpermi_d, in0, 0xD8, in1, 0xD8, in2, 0xD8, in3, 0xD8,
+ in0, in1, in2, in3);
+ DUP4_ARG1(__lasx_xvclip255_h, in0, in1, in2, in3, in0, in1, in2, in3);
+ DUP2_ARG2(__lasx_xvpickev_b, in1, in0, in3, in2, in0, in1);
+ __lasx_xvstelm_d(in0, dst, 0, 0);
+ __lasx_xvstelm_d(in0, dst + dst_stride, 0, 2);
+ __lasx_xvstelm_d(in0, dst + dst_stride_2x, 0, 1);
+ __lasx_xvstelm_d(in0, dst + dst_stride_3x, 0, 3);
+ dst += dst_stride_4x;
+ __lasx_xvstelm_d(in1, dst, 0, 0);
+ __lasx_xvstelm_d(in1, dst + dst_stride, 0, 2);
+ __lasx_xvstelm_d(in1, dst + dst_stride_2x, 0, 1);
+ __lasx_xvstelm_d(in1, dst + dst_stride_3x, 0, 3);
+}
--
2.20.1
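
[Editorial note: for readers less familiar with the LASX intrinsics in the patch above, the operation that ff_put_pixels_clamped_lasx vectorizes can be sketched in scalar C as follows. This is an illustrative reference, not FFmpeg's code; the function name put_pixels_clamped_ref is made up here.]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar sketch of what ff_put_pixels_clamped does: clamp each of the
 * 64 int16_t IDCT coefficients to the 0..255 range and store them as
 * an 8x8 block of bytes, advancing one `stride` per row. */
static void put_pixels_clamped_ref(const int16_t *block, uint8_t *pixels,
                                   ptrdiff_t stride)
{
    for (int i = 0; i < 8; i++) {
        for (int j = 0; j < 8; j++) {
            int v = block[i * 8 + j];
            pixels[j] = v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
        }
        pixels += stride;
    }
}
```

The LASX version does the same clamp 16 coefficients at a time with __lasx_xvclip255_h, then packs pairs of clamped vectors into bytes with __lasx_xvpickev_b before storing eight bytes per row with __lasx_xvstelm_d.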
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
^ permalink raw reply [flat|nested] 7+ messages in thread
* [FFmpeg-devel] [PATCH v3 3/3] avcodec: [loongarch] Optimize prefetch with loongarch.
2021-12-29 10:18 [FFmpeg-devel] Optimize Mpeg4 decoding for loongarch Hao Chen
2021-12-29 10:18 ` [FFmpeg-devel] [PATCH v3 1/3] avcodec: [loongarch] Optimize hpeldsp with LASX Hao Chen
2021-12-29 10:18 ` [FFmpeg-devel] [PATCH v3 2/3] avcodec: [loongarch] Optimize idctdstp " Hao Chen
@ 2021-12-29 10:18 ` Hao Chen
2022-01-03 11:24 ` [FFmpeg-devel] Optimize Mpeg4 decoding for loongarch 殷时友
3 siblings, 0 replies; 7+ messages in thread
From: Hao Chen @ 2021-12-29 10:18 UTC (permalink / raw)
To: ffmpeg-devel; +Cc: gxw
From: gxw <guxiwei-hf@loongson.cn>
./ffmpeg -i ../1_h264_1080p_30fps_3Mbps.mp4 -f rawvideo -y /dev/null -an
before:296fps
after :308fps
---
libavcodec/loongarch/Makefile | 1 +
libavcodec/loongarch/videodsp_init.c | 45 ++++++++++++++++++++++++++++
libavcodec/videodsp.c | 2 ++
libavcodec/videodsp.h | 1 +
4 files changed, 49 insertions(+)
create mode 100644 libavcodec/loongarch/videodsp_init.c
diff --git a/libavcodec/loongarch/Makefile b/libavcodec/loongarch/Makefile
index c4d71e801b..3c15c2edeb 100644
--- a/libavcodec/loongarch/Makefile
+++ b/libavcodec/loongarch/Makefile
@@ -7,6 +7,7 @@ OBJS-$(CONFIG_VP9_DECODER) += loongarch/vp9dsp_init_loongarch.o
OBJS-$(CONFIG_VC1DSP) += loongarch/vc1dsp_init_loongarch.o
OBJS-$(CONFIG_HPELDSP) += loongarch/hpeldsp_init_loongarch.o
OBJS-$(CONFIG_IDCTDSP) += loongarch/idctdsp_init_loongarch.o
+OBJS-$(CONFIG_VIDEODSP) += loongarch/videodsp_init.o
LASX-OBJS-$(CONFIG_H264CHROMA) += loongarch/h264chroma_lasx.o
LASX-OBJS-$(CONFIG_H264QPEL) += loongarch/h264qpel_lasx.o
LASX-OBJS-$(CONFIG_H264DSP) += loongarch/h264dsp_lasx.o \
diff --git a/libavcodec/loongarch/videodsp_init.c b/libavcodec/loongarch/videodsp_init.c
new file mode 100644
index 0000000000..6cbb7763ff
--- /dev/null
+++ b/libavcodec/loongarch/videodsp_init.c
@@ -0,0 +1,45 @@
+/*
+ * Copyright (c) 2021 Loongson Technology Corporation Limited
+ * Contributed by Xiwei Gu <guxiwei-hf@loongson.cn>
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "libavcodec/videodsp.h"
+#include "libavutil/attributes.h"
+
+static void prefetch_loongarch(uint8_t *mem, ptrdiff_t stride, int h)
+{
+ register const uint8_t *p = mem;
+
+ __asm__ volatile (
+ "1: \n\t"
+ "preld 0, %[p], 0 \n\t"
+ "preld 0, %[p], 32 \n\t"
+ "addi.d %[h], %[h], -1 \n\t"
+ "add.d %[p], %[p], %[stride] \n\t"
+
+ "blt $r0, %[h], 1b \n\t"
+ : [p] "+r" (p), [h] "+r" (h)
+ : [stride] "r" (stride)
+ );
+}
+
+av_cold void ff_videodsp_init_loongarch(VideoDSPContext *ctx, int bpc)
+{
+ ctx->prefetch = prefetch_loongarch;
+}
diff --git a/libavcodec/videodsp.c b/libavcodec/videodsp.c
index ce9e9eb143..212147984f 100644
--- a/libavcodec/videodsp.c
+++ b/libavcodec/videodsp.c
@@ -54,4 +54,6 @@ av_cold void ff_videodsp_init(VideoDSPContext *ctx, int bpc)
ff_videodsp_init_x86(ctx, bpc);
if (ARCH_MIPS)
ff_videodsp_init_mips(ctx, bpc);
+ if (ARCH_LOONGARCH64)
+ ff_videodsp_init_loongarch(ctx, bpc);
}
diff --git a/libavcodec/videodsp.h b/libavcodec/videodsp.h
index c0545f22b0..ac971dc57f 100644
--- a/libavcodec/videodsp.h
+++ b/libavcodec/videodsp.h
@@ -84,5 +84,6 @@ void ff_videodsp_init_arm(VideoDSPContext *ctx, int bpc);
void ff_videodsp_init_ppc(VideoDSPContext *ctx, int bpc);
void ff_videodsp_init_x86(VideoDSPContext *ctx, int bpc);
void ff_videodsp_init_mips(VideoDSPContext *ctx, int bpc);
+void ff_videodsp_init_loongarch(VideoDSPContext *ctx, int bpc);
#endif /* AVCODEC_VIDEODSP_H */
--
2.20.1
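
[Editorial note: the inline assembly above issues two `preld` (prefetch-load) hints per row, 32 bytes apart, and walks `h` rows of `stride` bytes. The same access pattern can be sketched portably with GCC's __builtin_prefetch; this is an illustration only, not FFmpeg's generic fallback, and the name prefetch_ref is hypothetical.]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Portable sketch of the prefetch loop: hint the next `h` rows of the
 * frame into cache, with two 32-byte-spaced touches per row, mirroring
 * the two `preld` instructions in the LoongArch version. */
static void prefetch_ref(uint8_t *mem, ptrdiff_t stride, int h)
{
    const uint8_t *p = mem;
    while (h-- > 0) {
        __builtin_prefetch(p, 0, 0);      /* read prefetch, low temporal locality */
        __builtin_prefetch(p + 32, 0, 0);
        p += stride;
    }
}
```

Prefetching is only a hint, so correctness never depends on it; the speedup in the commit message presumably comes from reduced cache-miss latency when the decoder later reads these rows.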
* Re: [FFmpeg-devel] Optimize Mpeg4 decoding for loongarch
2021-12-29 10:18 [FFmpeg-devel] Optimize Mpeg4 decoding for loongarch Hao Chen
` (2 preceding siblings ...)
2021-12-29 10:18 ` [FFmpeg-devel] [PATCH v3 3/3] avcodec: [loongarch] Optimize prefetch with loongarch Hao Chen
@ 2022-01-03 11:24 ` 殷时友
2022-01-04 14:54 ` Michael Niedermayer
3 siblings, 1 reply; 7+ messages in thread
From: 殷时友 @ 2022-01-03 11:24 UTC (permalink / raw)
To: FFmpeg development discussions and patches
> On Dec 29, 2021, at 6:18 PM, Hao Chen <chenhao@loongson.cn> wrote:
>
> ./ffmpeg -i 8_mpeg4_1080p_24fps_12Mbps.avi -f rawvideo -y /dev/null -an
> before:376fps
> after :552fps
>
> V2: Revised PATCH 1/3 according to the comments.
> V3: Resubmit these patches due to miss PATCH v2 1/3.
>
> [PATCH v3 1/3] avcodec: [loongarch] Optimize hpeldsp with LASX.
> [PATCH v3 2/3] avcodec: [loongarch] Optimize idctdstp with LASX.
> [PATCH v3 3/3] avcodec: [loongarch] Optimize prefetch with loongarch.
>
LGTM.
* Re: [FFmpeg-devel] Optimize Mpeg4 decoding for loongarch
2022-01-03 11:24 ` [FFmpeg-devel] Optimize Mpeg4 decoding for loongarch 殷时友
@ 2022-01-04 14:54 ` Michael Niedermayer
0 siblings, 0 replies; 7+ messages in thread
From: Michael Niedermayer @ 2022-01-04 14:54 UTC (permalink / raw)
To: FFmpeg development discussions and patches
On Mon, Jan 03, 2022 at 07:24:32PM +0800, 殷时友 wrote:
>
> > On Dec 29, 2021, at 6:18 PM, Hao Chen <chenhao@loongson.cn> wrote:
> >
> > ./ffmpeg -i 8_mpeg4_1080p_24fps_12Mbps.avi -f rawvideo -y /dev/null -an
> > before:376fps
> > after :552fps
> >
> > V2: Revised PATCH 1/3 according to the comments.
> > V3: Resubmit these patches due to miss PATCH v2 1/3.
> >
> > [PATCH v3 1/3] avcodec: [loongarch] Optimize hpeldsp with LASX.
> > [PATCH v3 2/3] avcodec: [loongarch] Optimize idctdstp with LASX.
> > [PATCH v3 3/3] avcodec: [loongarch] Optimize prefetch with loongarch.
> >
>
> LGTM.
will apply
thx
[...]
--
Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB
The smallest minority on earth is the individual. Those who deny
individual rights cannot claim to be defenders of minorities. - Ayn Rand
* [FFmpeg-devel] Optimize Mpeg4 decoding for loongarch.
@ 2021-12-24 9:49 Hao Chen
0 siblings, 0 replies; 7+ messages in thread
From: Hao Chen @ 2021-12-24 9:49 UTC (permalink / raw)
To: ffmpeg-devel
./ffmpeg -i 8_mpeg4_1080p_24fps_12Mbps.avi -f rawvideo -y /dev/null -an
before:376fps
after :552fps
avcodec: [loongarch] Optimize hpeldsp with LASX.
avcodec: [loongarch] Optimize idctdstp with LASX.
avcodec: [loongarch] Optimize prefetch with loongarch.