From mboxrd@z Thu Jan 1 00:00:00 1970
From: Arnie Chang
To: ffmpeg-devel@ffmpeg.org
Cc: Arnie Chang
Date: Tue, 9 May 2023 17:50:27 +0800
Message-Id: <20230509095030.25506-3-arnie.chang@sifive.com>
X-Mailer: git-send-email 2.17.1
In-Reply-To: <20230509095030.25506-1-arnie.chang@sifive.com>
References: <20230509095030.25506-1-arnie.chang@sifive.com>
Subject: [FFmpeg-devel] [PATCH 2/5] lavc/h264chroma: Add vectorized implementation of chroma MC for RISC-V
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit

Optimize chroma motion compensation using RISC-V vector intrinsics,
resulting in an average 13% FPS improvement on 720p videos.

Signed-off-by: Arnie Chang
---
 libavcodec/h264chroma.c                   |   2 +
 libavcodec/h264chroma.h                   |   1 +
 libavcodec/riscv/Makefile                 |   3 +
 libavcodec/riscv/h264_chroma_init_riscv.c |  45 ++
 libavcodec/riscv/h264_mc_chroma.c         | 821 ++++++++++++++++++++++
 libavcodec/riscv/h264_mc_chroma.h         |  40 ++
 6 files changed, 912 insertions(+)
 create mode 100644 libavcodec/riscv/h264_chroma_init_riscv.c
 create mode 100644 libavcodec/riscv/h264_mc_chroma.c
 create mode 100644 libavcodec/riscv/h264_mc_chroma.h

diff --git a/libavcodec/h264chroma.c b/libavcodec/h264chroma.c
index 60b86b6fba..1eeab7bc40 100644
--- a/libavcodec/h264chroma.c
+++ b/libavcodec/h264chroma.c
@@ -58,5 +58,7 @@ av_cold void ff_h264chroma_init(H264ChromaContext *c, int bit_depth)
     ff_h264chroma_init_mips(c, bit_depth);
 #elif ARCH_LOONGARCH64
     ff_h264chroma_init_loongarch(c, bit_depth);
+#elif ARCH_RISCV
+    ff_h264chroma_init_riscv(c, bit_depth);
 #endif
 }
diff --git a/libavcodec/h264chroma.h b/libavcodec/h264chroma.h
index b8f9c8f4fc..9c81c18a76 100644
--- a/libavcodec/h264chroma.h
+++ b/libavcodec/h264chroma.h
@@ -37,5 +37,6 @@ void ff_h264chroma_init_ppc(H264ChromaContext *c, int bit_depth);
 void ff_h264chroma_init_x86(H264ChromaContext *c, int bit_depth);
 void ff_h264chroma_init_mips(H264ChromaContext *c, int bit_depth);
 void ff_h264chroma_init_loongarch(H264ChromaContext *c, int bit_depth);
+void ff_h264chroma_init_riscv(H264ChromaContext *c, int bit_depth);
 
 #endif /* AVCODEC_H264CHROMA_H */
diff --git a/libavcodec/riscv/Makefile b/libavcodec/riscv/Makefile
index 965942f4df..08b76c93cb 100644
--- a/libavcodec/riscv/Makefile
+++ b/libavcodec/riscv/Makefile
@@ -19,3 +19,6 @@ OBJS-$(CONFIG_PIXBLOCKDSP) += riscv/pixblockdsp_init.o \
 RVV-OBJS-$(CONFIG_PIXBLOCKDSP) += riscv/pixblockdsp_rvv.o
 OBJS-$(CONFIG_VORBIS_DECODER) += riscv/vorbisdsp_init.o
 RVV-OBJS-$(CONFIG_VORBIS_DECODER) += riscv/vorbisdsp_rvv.o
+
+OBJS-$(CONFIG_H264CHROMA) += riscv/h264_chroma_init_riscv.o
+RVV-OBJS-$(CONFIG_H264CHROMA) += riscv/h264_mc_chroma.o
diff --git a/libavcodec/riscv/h264_chroma_init_riscv.c b/libavcodec/riscv/h264_chroma_init_riscv.c
new file mode 100644
index 0000000000..daeca01fa2
--- /dev/null
+++ b/libavcodec/riscv/h264_chroma_init_riscv.c
@@ -0,0 +1,45 @@
+/*
+ * Copyright (c) 2023 SiFive, Inc. All rights reserved.
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include <stdint.h>
+
+#include "libavutil/attributes.h"
+#include "libavutil/cpu.h"
+#include "libavcodec/h264chroma.h"
+#include "config.h"
+#include "h264_mc_chroma.h"
+
+av_cold void ff_h264chroma_init_riscv(H264ChromaContext *c, int bit_depth)
+{
+#if HAVE_INTRINSICS_RVV
+    const int high_bit_depth = bit_depth > 8;
+
+    if (!high_bit_depth) {
+        c->put_h264_chroma_pixels_tab[0] = h264_put_chroma_mc8_rvv;
+        c->avg_h264_chroma_pixels_tab[0] = h264_avg_chroma_mc8_rvv;
+
+        c->put_h264_chroma_pixels_tab[1] = h264_put_chroma_mc4_rvv;
+        c->avg_h264_chroma_pixels_tab[1] = h264_avg_chroma_mc4_rvv;
+
+        c->put_h264_chroma_pixels_tab[2] = h264_put_chroma_mc2_rvv;
+        c->avg_h264_chroma_pixels_tab[2] = h264_avg_chroma_mc2_rvv;
+    }
+#endif
+}
\ No newline at end of file
diff --git a/libavcodec/riscv/h264_mc_chroma.c b/libavcodec/riscv/h264_mc_chroma.c
new file mode 100644
index 0000000000..64b13ec3b8
--- /dev/null
+++ b/libavcodec/riscv/h264_mc_chroma.c
@@ -0,0 +1,821 @@
+/*
+ * Copyright (c) 2023 SiFive, Inc. All rights reserved.
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * Lesser General Public License for more details.
+ * + * You should have received a copy of the GNU Lesser General Public + * License along with FFmpeg; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + */ + +#include "h264_mc_chroma.h" +#if HAVE_INTRINSICS_RVV +#include +typedef unsigned char pixel; + +__attribute__((always_inline)) static void h264_put_chroma_unroll4(uint8_t *p_dst, const uint8_t *p_src, ptrdiff_t stride, int w, int h, int x, int y) +{ + uint8_t *p_dst_iter = p_dst; + uint8_t *p_src_iter = p_src; + + const int xy = x * y; + const int x8 = x << 3; + const int y8 = y << 3; + const int a = 64 - x8 - y8 + xy; + const int b = x8 - xy; + const int c = y8 -xy; + const int d = xy; + + int vl = __riscv_vsetvl_e8m1(w); + + if (d != 0) + { + for (int j = 0; j < h; j += 4) + { + // dst 1st row + vuint8m1_t row00 = __riscv_vle8_v_u8m1(p_src_iter, vl + 1); + + vuint8m1_t row01; + row01 = __riscv_vslidedown_vx_u8m1(row00, 1, vl + 1); + + vuint16m2_t dst0 = __riscv_vwmulu_vx_u16m2(row00, a, vl); + dst0 = __riscv_vwmaccu_vx_u16m2(dst0, b, row01, vl); + + vuint8m1_t row10 = __riscv_vle8_v_u8m1(p_src_iter + stride, vl + 1); + dst0 = __riscv_vwmaccu_vx_u16m2(dst0, c, row10, vl); + + vuint8m1_t row11; + row11 = __riscv_vslidedown_vx_u8m1(row10, 1, vl + 1); + dst0 = __riscv_vwmaccu_vx_u16m2(dst0, d, row11, vl); + + // dst 2nd row + p_src_iter += (stride << 1); + + vuint16m2_t dst1 = __riscv_vwmulu_vx_u16m2(row10, a, vl); + dst1 = __riscv_vwmaccu_vx_u16m2(dst1, b, row11, vl); + + vuint8m1_t row20 = __riscv_vle8_v_u8m1(p_src_iter, vl + 1); + dst1 = __riscv_vwmaccu_vx_u16m2(dst1, c, row20, vl); + + vuint8m1_t row21; + row21 = __riscv_vslidedown_vx_u8m1(row20, 1, vl + 1); + dst1 = __riscv_vwmaccu_vx_u16m2(dst1, d, row21, vl); + + // dst 3rd row + p_src_iter += stride; + + vuint16m2_t dst2 = __riscv_vwmulu_vx_u16m2(row20, a, vl); + dst2 = __riscv_vwmaccu_vx_u16m2(dst2, b, row21, vl); + + vuint8m1_t row30 = __riscv_vle8_v_u8m1(p_src_iter, vl + 1); + dst2 = __riscv_vwmaccu_vx_u16m2(dst2, c, row30, vl); + + vuint8m1_t row31; + row31 = __riscv_vslidedown_vx_u8m1(row30, 1, vl + 1); + dst2 = __riscv_vwmaccu_vx_u16m2(dst2, d, row31, vl); + + // dst 4rd row + p_src_iter += stride; + + vuint16m2_t dst3 = __riscv_vwmulu_vx_u16m2(row30, a, vl); + dst3 = __riscv_vwmaccu_vx_u16m2(dst3, b, row31, vl); + + vuint8m1_t row40 = __riscv_vle8_v_u8m1(p_src_iter, vl + 1); + dst3 = __riscv_vwmaccu_vx_u16m2(dst3, c, row40, vl); + + vuint8m1_t row41; + row41 = __riscv_vslidedown_vx_u8m1(row40, 1, vl + 1); + dst3 = __riscv_vwmaccu_vx_u16m2(dst3, d, row41, vl); + + __riscv_vse8_v_u8m1(p_dst_iter, __riscv_vnclipu_wx_u8m1(dst0, 6, vl), vl); + p_dst_iter += stride; + + __riscv_vse8_v_u8m1(p_dst_iter, __riscv_vnclipu_wx_u8m1(dst1, 6, vl), vl); + p_dst_iter += stride; + + __riscv_vse8_v_u8m1(p_dst_iter, __riscv_vnclipu_wx_u8m1(dst2, 6, vl), vl); + p_dst_iter += stride; + + __riscv_vse8_v_u8m1(p_dst_iter, __riscv_vnclipu_wx_u8m1(dst3, 6, vl), vl); + p_dst_iter += stride; + } + } + else if (b == 0 && c != 0) + { + const unsigned short e = b + c; + + for (int j = 0; j < h; j += 4) + { + vuint8m1_t row0 = __riscv_vle8_v_u8m1(p_src_iter, vl); + vuint16m2_t dst0 = __riscv_vwmulu_vx_u16m2(row0, a, vl); + p_src_iter += stride; + + vuint8m1_t row1 = __riscv_vle8_v_u8m1(p_src_iter, vl); + dst0 = __riscv_vwmaccu_vx_u16m2(dst0, e, row1, vl); + p_src_iter += stride; + + vuint8m1_t row2 = __riscv_vle8_v_u8m1(p_src_iter, vl); + vuint16m2_t dst1 = __riscv_vwmulu_vx_u16m2(row1, a, vl); + dst1 = __riscv_vwmaccu_vx_u16m2(dst1, 
e, row2, vl); + p_src_iter += stride; + + vuint8m1_t row3 = __riscv_vle8_v_u8m1(p_src_iter, vl); + vuint16m2_t dst2 = __riscv_vwmulu_vx_u16m2(row2, a, vl); + dst2 = __riscv_vwmaccu_vx_u16m2(dst2, e, row3, vl); + p_src_iter += stride; + + vuint8m1_t row4 = __riscv_vle8_v_u8m1(p_src_iter, vl); + vuint16m2_t dst3 = __riscv_vwmulu_vx_u16m2(row3, a, vl); + dst3 = __riscv_vwmaccu_vx_u16m2(dst3, e, row4, vl); + + __riscv_vse8_v_u8m1(p_dst_iter, __riscv_vnclipu_wx_u8m1(dst0, 6, vl), vl); + p_dst_iter += stride; + + __riscv_vse8_v_u8m1(p_dst_iter, __riscv_vnclipu_wx_u8m1(dst1, 6, vl), vl); + p_dst_iter += stride; + + __riscv_vse8_v_u8m1(p_dst_iter, __riscv_vnclipu_wx_u8m1(dst2, 6, vl), vl); + p_dst_iter += stride; + + __riscv_vse8_v_u8m1(p_dst_iter, __riscv_vnclipu_wx_u8m1(dst3, 6, vl), vl); + p_dst_iter += stride; + } + } + else if (b !=0 && c == 0) + { + const unsigned short e = b + c; + + for (int j = 0; j < h; j += 4) + { + // 1st + vuint8m1_t row00 = __riscv_vle8_v_u8m1(p_src_iter, vl + 1); + p_src_iter += stride; + + vuint8m1_t row01; + row01 = __riscv_vslidedown_vx_u8m1(row00, 1, vl + 1); + + vuint16m2_t dst0 = __riscv_vwmulu_vx_u16m2(row00, a, vl); + dst0 = __riscv_vwmaccu_vx_u16m2(dst0, e, row01, vl); + + // 2nd + vuint8m1_t row10 = __riscv_vle8_v_u8m1(p_src_iter, vl + 1); + p_src_iter += stride; + + vuint8m1_t row11; + row11 = __riscv_vslidedown_vx_u8m1(row10, 1, vl + 1); + + vuint16m2_t dst1 = __riscv_vwmulu_vx_u16m2(row10, a, vl); + dst1 = __riscv_vwmaccu_vx_u16m2(dst1, e, row11, vl); + + // 3rd + vuint8m1_t row20 = __riscv_vle8_v_u8m1(p_src_iter, vl + 1); + p_src_iter += stride; + + vuint8m1_t row21; + row21 = __riscv_vslidedown_vx_u8m1(row20, 1, vl + 1); + + vuint16m2_t dst2 = __riscv_vwmulu_vx_u16m2(row20, a, vl); + dst2 = __riscv_vwmaccu_vx_u16m2(dst2, e, row21, vl); + + // 3rd + vuint8m1_t row30 = __riscv_vle8_v_u8m1(p_src_iter, vl + 1); + p_src_iter += stride; + + vuint8m1_t row31; + row31 = __riscv_vslidedown_vx_u8m1(row30, 1, vl + 1); + + vuint16m2_t dst3 = __riscv_vwmulu_vx_u16m2(row30, a, vl); + dst3 = __riscv_vwmaccu_vx_u16m2(dst3, e, row31, vl); + + __riscv_vse8_v_u8m1(p_dst_iter, __riscv_vnclipu_wx_u8m1(dst0, 6, vl), vl); + p_dst_iter += stride; + + __riscv_vse8_v_u8m1(p_dst_iter, __riscv_vnclipu_wx_u8m1(dst1, 6, vl), vl); + p_dst_iter += stride; + + __riscv_vse8_v_u8m1(p_dst_iter, __riscv_vnclipu_wx_u8m1(dst2, 6, vl), vl); + p_dst_iter += stride; + + __riscv_vse8_v_u8m1(p_dst_iter, __riscv_vnclipu_wx_u8m1(dst3, 6, vl), vl); + p_dst_iter += stride; + } + } + else + { + for (int j = 0; j < h; j += 4) + { + vuint8m1_t row0 = __riscv_vle8_v_u8m1(p_src_iter, vl); + p_src_iter += stride; + vuint16m2_t dst0 = __riscv_vwmulu_vx_u16m2(row0, a, vl); + + vuint8m1_t row1 = __riscv_vle8_v_u8m1(p_src_iter, vl); + p_src_iter += stride; + vuint16m2_t dst1 = __riscv_vwmulu_vx_u16m2(row1, a, vl); + + vuint8m1_t row2 = __riscv_vle8_v_u8m1(p_src_iter, vl); + p_src_iter += stride; + vuint16m2_t dst2 = __riscv_vwmulu_vx_u16m2(row2, a, vl); + + vuint8m1_t row3 = __riscv_vle8_v_u8m1(p_src_iter, vl); + p_src_iter += stride; + vuint16m2_t dst3 = __riscv_vwmulu_vx_u16m2(row3, a, vl); + + __riscv_vse8_v_u8m1(p_dst_iter, __riscv_vnclipu_wx_u8m1(dst0, 6, vl), vl); + p_dst_iter += stride; + + __riscv_vse8_v_u8m1(p_dst_iter, __riscv_vnclipu_wx_u8m1(dst1, 6, vl), vl); + p_dst_iter += stride; + + __riscv_vse8_v_u8m1(p_dst_iter, __riscv_vnclipu_wx_u8m1(dst2, 6, vl), vl); + p_dst_iter += stride; + + __riscv_vse8_v_u8m1(p_dst_iter, __riscv_vnclipu_wx_u8m1(dst3, 6, vl), vl); + p_dst_iter += stride; + } + } +} 
+ +__attribute__((always_inline)) static void h264_put_chroma_unroll2(uint8_t *p_dst, const uint8_t *p_src, ptrdiff_t stride, int w, int h, int x, int y) +{ + uint8_t *p_dst_iter = p_dst; + uint8_t *p_src_iter = p_src; + + const int xy = x * y; + const int x8 = x << 3; + const int y8 = y << 3; + const int a = 64 - x8 - y8 + xy; + const int b = x8 - xy; + const int c = y8 -xy; + const int d = xy; + + int vl = __riscv_vsetvl_e8m1(w); + + if (d != 0) + { + for (int j = 0; j < h; j += 2) + { + // dst 1st row + vuint8m1_t row00 = __riscv_vle8_v_u8m1(p_src_iter, vl + 1); + + vuint8m1_t row01; + row01 = __riscv_vslidedown_vx_u8m1(row00, 1, vl + 1); + + vuint16m2_t dst0 = __riscv_vwmulu_vx_u16m2(row00, a, vl); + dst0 = __riscv_vwmaccu_vx_u16m2(dst0, b, row01, vl); + + vuint8m1_t row10 = __riscv_vle8_v_u8m1(p_src_iter + stride, vl + 1); + dst0 = __riscv_vwmaccu_vx_u16m2(dst0, c, row10, vl); + + vuint8m1_t row11; + row11 = __riscv_vslidedown_vx_u8m1(row10, 1, vl + 1); + dst0 = __riscv_vwmaccu_vx_u16m2(dst0, d, row11, vl); + + // dst 2nd row + p_src_iter += (stride << 1); + + vuint16m2_t dst1 = __riscv_vwmulu_vx_u16m2(row10, a, vl); + dst1 = __riscv_vwmaccu_vx_u16m2(dst1, b, row11, vl); + + vuint8m1_t row20 = __riscv_vle8_v_u8m1(p_src_iter, vl + 1); + dst1 = __riscv_vwmaccu_vx_u16m2(dst1, c, row20, vl); + + vuint8m1_t row21; + row21 = __riscv_vslidedown_vx_u8m1(row20, 1, vl + 1); + dst1 = __riscv_vwmaccu_vx_u16m2(dst1, d, row21, vl); + + __riscv_vse8_v_u8m1(p_dst_iter, __riscv_vnclipu_wx_u8m1(dst0, 6, vl), vl); + p_dst_iter += stride; + + __riscv_vse8_v_u8m1(p_dst_iter, __riscv_vnclipu_wx_u8m1(dst1, 6, vl), vl); + p_dst_iter += stride; + } + } + else if (b == 0 && c != 0) + { + const unsigned short e = b + c; + + for (int j = 0; j < h; j += 2) + { + vuint8m1_t row0 = __riscv_vle8_v_u8m1(p_src_iter, vl); + vuint16m2_t dst0 = __riscv_vwmulu_vx_u16m2(row0, a, vl); + p_src_iter += stride; + + vuint8m1_t row1 = __riscv_vle8_v_u8m1(p_src_iter, vl); + dst0 = __riscv_vwmaccu_vx_u16m2(dst0, e, row1, vl); + p_src_iter += stride; + + vuint8m1_t row2 = __riscv_vle8_v_u8m1(p_src_iter, vl); + vuint16m2_t dst1 = __riscv_vwmulu_vx_u16m2(row1, a, vl); + dst1 = __riscv_vwmaccu_vx_u16m2(dst1, e, row2, vl); + p_src_iter += stride; + + __riscv_vse8_v_u8m1(p_dst_iter, __riscv_vnclipu_wx_u8m1(dst0, 6, vl), vl); + p_dst_iter += stride; + + __riscv_vse8_v_u8m1(p_dst_iter, __riscv_vnclipu_wx_u8m1(dst1, 6, vl), vl); + p_dst_iter += stride; + } + } + else if (b !=0 && c == 0) + { + const unsigned short e = b + c; + + for (int j = 0; j < h; j += 2) + { + // 1st + vuint8m1_t row00 = __riscv_vle8_v_u8m1(p_src_iter, vl + 1); + p_src_iter += stride; + + vuint8m1_t row01; + row01 = __riscv_vslidedown_vx_u8m1(row00, 1, vl + 1); + + vuint16m2_t dst0 = __riscv_vwmulu_vx_u16m2(row00, a, vl); + dst0 = __riscv_vwmaccu_vx_u16m2(dst0, e, row01, vl); + + // 2nd + vuint8m1_t row10 = __riscv_vle8_v_u8m1(p_src_iter, vl + 1); + p_src_iter += stride; + + vuint8m1_t row11; + row11 = __riscv_vslidedown_vx_u8m1(row10, 1, vl + 1); + + vuint16m2_t dst1 = __riscv_vwmulu_vx_u16m2(row10, a, vl); + dst1 = __riscv_vwmaccu_vx_u16m2(dst1, e, row11, vl); + + __riscv_vse8_v_u8m1(p_dst_iter, __riscv_vnclipu_wx_u8m1(dst0, 6, vl), vl); + p_dst_iter += stride; + + __riscv_vse8_v_u8m1(p_dst_iter, __riscv_vnclipu_wx_u8m1(dst1, 6, vl), vl); + p_dst_iter += stride; + } + } + else + { + for (int j = 0; j < h; j += 2) + { + vuint8m1_t row0 = __riscv_vle8_v_u8m1(p_src_iter, vl); + p_src_iter += stride; + vuint16m2_t dst0 = __riscv_vwmulu_vx_u16m2(row0, a, vl); + + 
vuint8m1_t row1 = __riscv_vle8_v_u8m1(p_src_iter, vl); + p_src_iter += stride; + vuint16m2_t dst1 = __riscv_vwmulu_vx_u16m2(row1, a, vl); + + __riscv_vse8_v_u8m1(p_dst_iter, __riscv_vnclipu_wx_u8m1(dst0, 6, vl), vl); + p_dst_iter += stride; + + __riscv_vse8_v_u8m1(p_dst_iter, __riscv_vnclipu_wx_u8m1(dst1, 6, vl), vl); + p_dst_iter += stride; + } + } +} + +__attribute__((always_inline)) static void h264_avg_chroma_unroll4(uint8_t *p_dst, const uint8_t *p_src, ptrdiff_t stride, int w, int h, int x, int y) +{ + uint8_t *p_dst_iter = p_dst; + uint8_t *p_src_iter = p_src; + + const int xy = x * y; + const int x8 = x << 3; + const int y8 = y << 3; + const int a = 64 - x8 - y8 + xy; + const int b = x8 - xy; + const int c = y8 - xy; + const int d = xy; + + int vl = __riscv_vsetvl_e8m1(w); + + if (d != 0) + { + for (int j = 0; j < h; j += 4) + { + // dst 1st row + vuint8m1_t row00 = __riscv_vle8_v_u8m1(p_src_iter, vl + 1); + + vuint8m1_t row01; + row01 = __riscv_vslidedown_vx_u8m1(row00, 1, vl + 1); + + vuint16m2_t dst0 = __riscv_vwmulu_vx_u16m2(row00, a, vl); + dst0 = __riscv_vwmaccu_vx_u16m2(dst0, b, row01, vl); + + vuint8m1_t row10 = __riscv_vle8_v_u8m1(p_src_iter + stride, vl + 1); + dst0 = __riscv_vwmaccu_vx_u16m2(dst0, c, row10, vl); + + vuint8m1_t row11; + row11 = __riscv_vslidedown_vx_u8m1(row10, 1, vl + 1); + dst0 = __riscv_vwmaccu_vx_u16m2(dst0, d, row11, vl); + + // dst 2nd row + p_src_iter += (stride << 1); + + vuint16m2_t dst1 = __riscv_vwmulu_vx_u16m2(row10, a, vl); + dst1 = __riscv_vwmaccu_vx_u16m2(dst1, b, row11, vl); + + vuint8m1_t row20 = __riscv_vle8_v_u8m1(p_src_iter, vl + 1); + dst1 = __riscv_vwmaccu_vx_u16m2(dst1, c, row20, vl); + + vuint8m1_t row21; + row21 = __riscv_vslidedown_vx_u8m1(row20, 1, vl + 1); + dst1 = __riscv_vwmaccu_vx_u16m2(dst1, d, row21, vl); + + // dst 3rd row + p_src_iter += stride; + + vuint16m2_t dst2 = __riscv_vwmulu_vx_u16m2(row20, a, vl); + dst2 = __riscv_vwmaccu_vx_u16m2(dst2, b, row21, vl); + + vuint8m1_t row30 = __riscv_vle8_v_u8m1(p_src_iter, vl + 1); + dst2 = __riscv_vwmaccu_vx_u16m2(dst2, c, row30, vl); + + vuint8m1_t row31; + row31 = __riscv_vslidedown_vx_u8m1(row30, 1, vl + 1); + dst2 = __riscv_vwmaccu_vx_u16m2(dst2, d, row31, vl); + + // dst 4rd row + p_src_iter += stride; + + vuint16m2_t dst3 = __riscv_vwmulu_vx_u16m2(row30, a, vl); + dst3 = __riscv_vwmaccu_vx_u16m2(dst3, b, row31, vl); + + vuint8m1_t row40 = __riscv_vle8_v_u8m1(p_src_iter, vl + 1); + dst3 = __riscv_vwmaccu_vx_u16m2(dst3, c, row40, vl); + + vuint8m1_t row41; + row41 = __riscv_vslidedown_vx_u8m1(row40, 1, vl + 1); + dst3 = __riscv_vwmaccu_vx_u16m2(dst3, d, row41, vl); + + vuint8m1_t avg0 = __riscv_vaaddu_vv_u8m1(__riscv_vnclipu_wx_u8m1(dst0, 6, vl), __riscv_vle8_v_u8m1(p_dst_iter, vl), vl); + __riscv_vse8_v_u8m1(p_dst_iter, avg0, vl); + p_dst_iter += stride; + + vuint8m1_t avg1 = __riscv_vaaddu_vv_u8m1(__riscv_vnclipu_wx_u8m1(dst1, 6, vl), __riscv_vle8_v_u8m1(p_dst_iter, vl), vl); + __riscv_vse8_v_u8m1(p_dst_iter, avg1, vl); + p_dst_iter += stride; + + vuint8m1_t avg2 = __riscv_vaaddu_vv_u8m1(__riscv_vnclipu_wx_u8m1(dst2, 6, vl), __riscv_vle8_v_u8m1(p_dst_iter, vl), vl); + __riscv_vse8_v_u8m1(p_dst_iter, avg2, vl); + p_dst_iter += stride; + + vuint8m1_t avg3 = __riscv_vaaddu_vv_u8m1(__riscv_vnclipu_wx_u8m1(dst3, 6, vl), __riscv_vle8_v_u8m1(p_dst_iter, vl), vl); + __riscv_vse8_v_u8m1(p_dst_iter, avg3, vl); + p_dst_iter += stride; + } + } + else if (b == 0 && c != 0) + { + const unsigned short e = b + c; + + for (int j = 0; j < h; j += 4) + { + vuint8m1_t row0 = 
__riscv_vle8_v_u8m1(p_src_iter, vl); + vuint16m2_t dst0 = __riscv_vwmulu_vx_u16m2(row0, a, vl); + p_src_iter += stride; + + vuint8m1_t row1 = __riscv_vle8_v_u8m1(p_src_iter, vl); + dst0 = __riscv_vwmaccu_vx_u16m2(dst0, e, row1, vl); + p_src_iter += stride; + + vuint8m1_t row2 = __riscv_vle8_v_u8m1(p_src_iter, vl); + vuint16m2_t dst1 = __riscv_vwmulu_vx_u16m2(row1, a, vl); + dst1 = __riscv_vwmaccu_vx_u16m2(dst1, e, row2, vl); + p_src_iter += stride; + + vuint8m1_t row3 = __riscv_vle8_v_u8m1(p_src_iter, vl); + vuint16m2_t dst2 = __riscv_vwmulu_vx_u16m2(row2, a, vl); + dst2 = __riscv_vwmaccu_vx_u16m2(dst2, e, row3, vl); + p_src_iter += stride; + + vuint8m1_t row4 = __riscv_vle8_v_u8m1(p_src_iter, vl); + vuint16m2_t dst3 = __riscv_vwmulu_vx_u16m2(row3, a, vl); + dst3 = __riscv_vwmaccu_vx_u16m2(dst3, e, row4, vl); + + vuint8m1_t avg0 = __riscv_vaaddu_vv_u8m1(__riscv_vnclipu_wx_u8m1(dst0, 6, vl), __riscv_vle8_v_u8m1(p_dst_iter, vl), vl); + __riscv_vse8_v_u8m1(p_dst_iter, avg0, vl); + p_dst_iter += stride; + + vuint8m1_t avg1 = __riscv_vaaddu_vv_u8m1(__riscv_vnclipu_wx_u8m1(dst1, 6, vl), __riscv_vle8_v_u8m1(p_dst_iter, vl), vl); + __riscv_vse8_v_u8m1(p_dst_iter, avg1, vl); + p_dst_iter += stride; + + vuint8m1_t avg2 = __riscv_vaaddu_vv_u8m1(__riscv_vnclipu_wx_u8m1(dst2, 6, vl), __riscv_vle8_v_u8m1(p_dst_iter, vl), vl); + __riscv_vse8_v_u8m1(p_dst_iter, avg2, vl); + p_dst_iter += stride; + + vuint8m1_t avg3 = __riscv_vaaddu_vv_u8m1(__riscv_vnclipu_wx_u8m1(dst3, 6, vl), __riscv_vle8_v_u8m1(p_dst_iter, vl), vl); + __riscv_vse8_v_u8m1(p_dst_iter, avg3, vl); + p_dst_iter += stride; + } + } + else if (b != 0 && c == 0) + { + const unsigned short e = b + c; + + for (int j = 0; j < h; j += 4) + { + // 1st + vuint8m1_t row00 = __riscv_vle8_v_u8m1(p_src_iter, vl + 1); + p_src_iter += stride; + + vuint8m1_t row01; + row01 = __riscv_vslidedown_vx_u8m1(row00, 1, vl + 1); + + vuint16m2_t dst0 = __riscv_vwmulu_vx_u16m2(row00, a, vl); + dst0 = __riscv_vwmaccu_vx_u16m2(dst0, e, row01, vl); + + // 2nd + vuint8m1_t row10 = __riscv_vle8_v_u8m1(p_src_iter, vl + 1); + p_src_iter += stride; + + vuint8m1_t row11; + row11 = __riscv_vslidedown_vx_u8m1(row10, 1, vl + 1); + + vuint16m2_t dst1 = __riscv_vwmulu_vx_u16m2(row10, a, vl); + dst1 = __riscv_vwmaccu_vx_u16m2(dst1, e, row11, vl); + + // 3rd + vuint8m1_t row20 = __riscv_vle8_v_u8m1(p_src_iter, vl + 1); + p_src_iter += stride; + + vuint8m1_t row21; + row21 = __riscv_vslidedown_vx_u8m1(row20, 1, vl + 1); + + vuint16m2_t dst2 = __riscv_vwmulu_vx_u16m2(row20, a, vl); + dst2 = __riscv_vwmaccu_vx_u16m2(dst2, e, row21, vl); + + // 4th + vuint8m1_t row30 = __riscv_vle8_v_u8m1(p_src_iter, vl + 1); + p_src_iter += stride; + + vuint8m1_t row31; + row31 = __riscv_vslidedown_vx_u8m1(row30, 1, vl + 1); + + vuint16m2_t dst3 = __riscv_vwmulu_vx_u16m2(row30, a, vl); + dst3 = __riscv_vwmaccu_vx_u16m2(dst3, e, row31, vl); + + vuint8m1_t avg0 = __riscv_vaaddu_vv_u8m1(__riscv_vnclipu_wx_u8m1(dst0, 6, vl), __riscv_vle8_v_u8m1(p_dst_iter, vl), vl); + __riscv_vse8_v_u8m1(p_dst_iter, avg0, vl); + p_dst_iter += stride; + + vuint8m1_t avg1 = __riscv_vaaddu_vv_u8m1(__riscv_vnclipu_wx_u8m1(dst1, 6, vl), __riscv_vle8_v_u8m1(p_dst_iter, vl), vl); + __riscv_vse8_v_u8m1(p_dst_iter, avg1, vl); + p_dst_iter += stride; + + vuint8m1_t avg2 = __riscv_vaaddu_vv_u8m1(__riscv_vnclipu_wx_u8m1(dst2, 6, vl), __riscv_vle8_v_u8m1(p_dst_iter, vl), vl); + __riscv_vse8_v_u8m1(p_dst_iter, avg2, vl); + p_dst_iter += stride; + + vuint8m1_t avg3 = __riscv_vaaddu_vv_u8m1(__riscv_vnclipu_wx_u8m1(dst3, 6, vl), 
__riscv_vle8_v_u8m1(p_dst_iter, vl), vl); + __riscv_vse8_v_u8m1(p_dst_iter, avg3, vl); + p_dst_iter += stride; + } + } + else + { + for (int j = 0; j < h; j += 4) + { + vuint8m1_t row0 = __riscv_vle8_v_u8m1(p_src_iter, vl); + p_src_iter += stride; + vuint16m2_t dst0 = __riscv_vwmulu_vx_u16m2(row0, a, vl); + + vuint8m1_t row1 = __riscv_vle8_v_u8m1(p_src_iter, vl); + p_src_iter += stride; + vuint16m2_t dst1 = __riscv_vwmulu_vx_u16m2(row1, a, vl); + + vuint8m1_t row2 = __riscv_vle8_v_u8m1(p_src_iter, vl); + p_src_iter += stride; + vuint16m2_t dst2 = __riscv_vwmulu_vx_u16m2(row2, a, vl); + + vuint8m1_t row3 = __riscv_vle8_v_u8m1(p_src_iter, vl); + p_src_iter += stride; + vuint16m2_t dst3 = __riscv_vwmulu_vx_u16m2(row3, a, vl); + + vuint8m1_t avg0 = __riscv_vaaddu_vv_u8m1(__riscv_vnclipu_wx_u8m1(dst0, 6, vl), __riscv_vle8_v_u8m1(p_dst_iter, vl), vl); + __riscv_vse8_v_u8m1(p_dst_iter, avg0, vl); + p_dst_iter += stride; + + vuint8m1_t avg1 = __riscv_vaaddu_vv_u8m1(__riscv_vnclipu_wx_u8m1(dst1, 6, vl), __riscv_vle8_v_u8m1(p_dst_iter, vl), vl); + __riscv_vse8_v_u8m1(p_dst_iter, avg1, vl); + p_dst_iter += stride; + + vuint8m1_t avg2 = __riscv_vaaddu_vv_u8m1(__riscv_vnclipu_wx_u8m1(dst2, 6, vl), __riscv_vle8_v_u8m1(p_dst_iter, vl), vl); + __riscv_vse8_v_u8m1(p_dst_iter, avg2, vl); + p_dst_iter += stride; + + vuint8m1_t avg3 = __riscv_vaaddu_vv_u8m1(__riscv_vnclipu_wx_u8m1(dst3, 6, vl), __riscv_vle8_v_u8m1(p_dst_iter, vl), vl); + __riscv_vse8_v_u8m1(p_dst_iter, avg3, vl); + p_dst_iter += stride; + } + } +} + +__attribute__((always_inline)) static void h264_avg_chroma_unroll2(uint8_t *p_dst, const uint8_t *p_src, ptrdiff_t stride, int w, int h, int x, int y) +{ + uint8_t *p_dst_iter = p_dst; + uint8_t *p_src_iter = p_src; + + const int xy = x * y; + const int x8 = x << 3; + const int y8 = y << 3; + const int a = 64 - x8 - y8 + xy; + const int b = x8 - xy; + const int c = y8 - xy; + const int d = xy; + + int vl = __riscv_vsetvl_e8m1(w); + + if (d != 0) + { + for (int j = 0; j < h; j += 2) + { + // dst 1st row + vuint8m1_t row00 = __riscv_vle8_v_u8m1(p_src_iter, vl + 1); + + vuint8m1_t row01; + row01 = __riscv_vslidedown_vx_u8m1(row00, 1, vl + 1); + + vuint16m2_t dst0 = __riscv_vwmulu_vx_u16m2(row00, a, vl); + dst0 = __riscv_vwmaccu_vx_u16m2(dst0, b, row01, vl); + + vuint8m1_t row10 = __riscv_vle8_v_u8m1(p_src_iter + stride, vl + 1); + dst0 = __riscv_vwmaccu_vx_u16m2(dst0, c, row10, vl); + + vuint8m1_t row11; + row11 = __riscv_vslidedown_vx_u8m1(row10, 1, vl + 1); + dst0 = __riscv_vwmaccu_vx_u16m2(dst0, d, row11, vl); + + // dst 2nd row + p_src_iter += (stride << 1); + + vuint16m2_t dst1 = __riscv_vwmulu_vx_u16m2(row10, a, vl); + dst1 = __riscv_vwmaccu_vx_u16m2(dst1, b, row11, vl); + + vuint8m1_t row20 = __riscv_vle8_v_u8m1(p_src_iter, vl + 1); + dst1 = __riscv_vwmaccu_vx_u16m2(dst1, c, row20, vl); + + vuint8m1_t row21; + row21 = __riscv_vslidedown_vx_u8m1(row20, 1, vl + 1); + dst1 = __riscv_vwmaccu_vx_u16m2(dst1, d, row21, vl); + + vuint8m1_t avg0 = __riscv_vaaddu_vv_u8m1(__riscv_vnclipu_wx_u8m1(dst0, 6, vl), __riscv_vle8_v_u8m1(p_dst_iter, vl), vl); + __riscv_vse8_v_u8m1(p_dst_iter, avg0, vl); + p_dst_iter += stride; + + vuint8m1_t avg1 = __riscv_vaaddu_vv_u8m1(__riscv_vnclipu_wx_u8m1(dst1, 6, vl), __riscv_vle8_v_u8m1(p_dst_iter, vl), vl); + __riscv_vse8_v_u8m1(p_dst_iter, avg1, vl); + p_dst_iter += stride; + } + } + else if (b == 0 && c != 0) + { + const unsigned short e = b + c; + + for (int j = 0; j < h; j += 2) + { + vuint8m1_t row0 = __riscv_vle8_v_u8m1(p_src_iter, vl); + vuint16m2_t dst0 = 
__riscv_vwmulu_vx_u16m2(row0, a, vl); + p_src_iter += stride; + + vuint8m1_t row1 = __riscv_vle8_v_u8m1(p_src_iter, vl); + dst0 = __riscv_vwmaccu_vx_u16m2(dst0, e, row1, vl); + p_src_iter += stride; + + vuint8m1_t row2 = __riscv_vle8_v_u8m1(p_src_iter, vl); + vuint16m2_t dst1 =__riscv_vwmulu_vx_u16m2(row1, a, vl); + dst1 = __riscv_vwmaccu_vx_u16m2(dst1, e, row2, vl); + p_src_iter += stride; + + vuint8m1_t avg0 = __riscv_vaaddu_vv_u8m1(__riscv_vnclipu_wx_u8m1(dst0, 6, vl), __riscv_vle8_v_u8m1(p_dst_iter, vl), vl); + __riscv_vse8_v_u8m1(p_dst_iter, avg0, vl); + p_dst_iter += stride; + + vuint8m1_t avg1 = __riscv_vaaddu_vv_u8m1(__riscv_vnclipu_wx_u8m1(dst1, 6, vl), __riscv_vle8_v_u8m1(p_dst_iter, vl), vl); + __riscv_vse8_v_u8m1(p_dst_iter, avg1, vl); + p_dst_iter += stride; + } + } + else if (b != 0 && c == 0) + { + const unsigned short e = b + c; + + for (int j = 0; j < h; j += 2) + { + // 1st + vuint8m1_t row00 = __riscv_vle8_v_u8m1(p_src_iter, vl + 1); + p_src_iter += stride; + + vuint8m1_t row01; + row01 = __riscv_vslidedown_vx_u8m1(row00, 1, vl + 1); + + vuint16m2_t dst0 = __riscv_vwmulu_vx_u16m2(row00, a, vl); + dst0 = __riscv_vwmaccu_vx_u16m2(dst0, e, row01, vl); + + // 2nd + vuint8m1_t row10 = __riscv_vle8_v_u8m1(p_src_iter, vl + 1); + p_src_iter += stride; + + vuint8m1_t row11; + row11 = __riscv_vslidedown_vx_u8m1(row10, 1, vl + 1); + + vuint16m2_t dst1 = __riscv_vwmulu_vx_u16m2(row10, a, vl); + dst1 = __riscv_vwmaccu_vx_u16m2(dst1, e, row11, vl); + + vuint8m1_t avg0 = __riscv_vaaddu_vv_u8m1(__riscv_vnclipu_wx_u8m1(dst0, 6, vl), __riscv_vle8_v_u8m1(p_dst_iter, vl), vl); + __riscv_vse8_v_u8m1(p_dst_iter, avg0, vl); + p_dst_iter += stride; + + vuint8m1_t avg1 = __riscv_vaaddu_vv_u8m1(__riscv_vnclipu_wx_u8m1(dst1, 6, vl), __riscv_vle8_v_u8m1(p_dst_iter, vl), vl); + __riscv_vse8_v_u8m1(p_dst_iter, avg1, vl); + p_dst_iter += stride; + } + } + else + { + for (int j = 0; j < h; j += 2) + { + vuint8m1_t row0 = __riscv_vle8_v_u8m1(p_src_iter, vl); + p_src_iter += stride; + vuint16m2_t dst0 = __riscv_vwmulu_vx_u16m2(row0, a, vl); + + vuint8m1_t row1 = __riscv_vle8_v_u8m1(p_src_iter, vl); + p_src_iter += stride; + vuint16m2_t dst1 = __riscv_vwmulu_vx_u16m2(row1, a, vl); + + vuint8m1_t avg0 = __riscv_vaaddu_vv_u8m1(__riscv_vnclipu_wx_u8m1(dst0, 6, vl), __riscv_vle8_v_u8m1(p_dst_iter, vl), vl); + __riscv_vse8_v_u8m1(p_dst_iter, avg0, vl); + p_dst_iter += stride; + + vuint8m1_t avg1 = __riscv_vaaddu_vv_u8m1(__riscv_vnclipu_wx_u8m1(dst1, 6, vl), __riscv_vle8_v_u8m1(p_dst_iter, vl), vl); + __riscv_vse8_v_u8m1(p_dst_iter, avg1, vl); + p_dst_iter += stride; + } + } +} + +void h264_put_chroma_mc8_rvv(uint8_t *p_dst, const uint8_t *p_src, ptrdiff_t stride, int h, int x, int y) +{ + h264_put_chroma_unroll4(p_dst, p_src, stride, 8, h, x, y); +} + +void h264_avg_chroma_mc8_rvv(uint8_t *p_dst, const uint8_t *p_src, ptrdiff_t stride, int h, int x, int y) +{ + h264_avg_chroma_unroll4(p_dst, p_src, stride, 8, h, x, y); +} + +void h264_put_chroma_mc4_rvv(uint8_t *p_dst, const uint8_t *p_src, ptrdiff_t stride, int h, int x, int y) +{ + if (h >= 4) + { + h264_put_chroma_unroll4(p_dst, p_src, stride, 4, h, x, y); + } + else + { + h264_put_chroma_unroll2(p_dst, p_src, stride, 4, h, x, y); + } +} + +void h264_avg_chroma_mc4_rvv(uint8_t *p_dst, const uint8_t *p_src, ptrdiff_t stride, int h, int x, int y) +{ + if (h >= 4) + { + h264_avg_chroma_unroll4(p_dst, p_src, stride, 4, h, x, y); + } + else + { + h264_avg_chroma_unroll2(p_dst, p_src, stride, 4, h, x, y); + } +} + +void h264_put_chroma_mc2_rvv(uint8_t *p_dst, 
const uint8_t *p_src, ptrdiff_t stride, int h, int x, int y)
+{
+    if (h >= 4)
+    {
+        h264_put_chroma_unroll4(p_dst, p_src, stride, 2, h, x, y);
+    }
+    else
+    {
+        h264_put_chroma_unroll2(p_dst, p_src, stride, 2, h, x, y);
+    }
+}
+
+void h264_avg_chroma_mc2_rvv(uint8_t *p_dst, const uint8_t *p_src, ptrdiff_t stride, int h, int x, int y)
+{
+    if (h >= 4)
+    {
+        h264_avg_chroma_unroll4(p_dst, p_src, stride, 2, h, x, y);
+    }
+    else
+    {
+        h264_avg_chroma_unroll2(p_dst, p_src, stride, 2, h, x, y);
+    }
+}
+#endif
diff --git a/libavcodec/riscv/h264_mc_chroma.h b/libavcodec/riscv/h264_mc_chroma.h
new file mode 100644
index 0000000000..ec9fef6672
--- /dev/null
+++ b/libavcodec/riscv/h264_mc_chroma.h
@@ -0,0 +1,40 @@
+/*
+ * Copyright (c) 2023 SiFive, Inc. All rights reserved.
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#ifndef AVCODEC_RISCV_H264_MC_CHROMA_H
+#define AVCODEC_RISCV_H264_MC_CHROMA_H
+#include
+#include
+#include
+#include
+#include
+#include "config.h"
+
+#if HAVE_INTRINSICS_RVV
+typedef unsigned char pixel;
+
+void h264_put_chroma_mc8_rvv(uint8_t *p_dst, const uint8_t *p_src, ptrdiff_t stride, int h, int x, int y);
+void h264_avg_chroma_mc8_rvv(uint8_t *p_dst, const uint8_t *p_src, ptrdiff_t stride, int h, int x, int y);
+void h264_put_chroma_mc4_rvv(uint8_t *p_dst, const uint8_t *p_src, ptrdiff_t stride, int h, int x, int y);
+void h264_avg_chroma_mc4_rvv(uint8_t *p_dst, const uint8_t *p_src, ptrdiff_t stride, int h, int x, int y);
+void h264_put_chroma_mc2_rvv(uint8_t *p_dst, const uint8_t *p_src, ptrdiff_t stride, int h, int x, int y);
+void h264_avg_chroma_mc2_rvv(uint8_t *p_dst, const uint8_t *p_src, ptrdiff_t stride, int h, int x, int y);
+#endif
+#endif
\ No newline at end of file
-- 
2.17.1
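
The weights computed at the top of every routine in this patch are the standard H.264 bilinear chroma coefficients: a = 64 - 8x - 8y + xy = (8 - x)(8 - y), b = x(8 - y), c = (8 - x)y and d = xy, which always sum to 64, with x and y being the 1/8-pel fractional offsets. As an editor's illustration (not part of the patch; the function name is invented), the scalar operation the vector code implements is a per-pixel weighted average of the 2x2 source neighbourhood, rounded and shifted down by 6:

#include <stddef.h>
#include <stdint.h>

/* Editor's sketch: plain-C bilinear chroma interpolation for one block of
 * width w and height h; x and y are the 1/8-pel fractional offsets (0..7).
 * Uses the same a/b/c/d weights as the RVV routines above. */
static void put_chroma_scalar(uint8_t *dst, const uint8_t *src,
                              ptrdiff_t stride, int w, int h, int x, int y)
{
    const int a = (8 - x) * (8 - y);
    const int b = x * (8 - y);
    const int c = (8 - x) * y;
    const int d = x * y;

    for (int j = 0; j < h; j++) {
        for (int i = 0; i < w; i++) {
            dst[i] = (a * src[i]          + b * src[i + 1] +
                      c * src[i + stride] + d * src[i + stride + 1] +
                      32) >> 6;   /* round to nearest, divide by 64 */
        }
        dst += stride;
        src += stride;
    }
}

The d != 0, b == 0 and c == 0 branches in the patch exist so that taps with a zero weight are neither loaded nor multiplied; the sketch above ignores that and always reads the full 2x2 neighbourhood.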
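
The avg_* variants differ from the put_* variants only in the final store: the interpolated prediction is combined with the pixels already present in the destination by a rounding average, which is what __riscv_vaaddu_vv_u8m1 computes assuming the fixed-point rounding mode (vxrm) is round-to-nearest-up; the intrinsics as written rely on the ambient vxrm setting rather than selecting one explicitly. A scalar equivalent of that last step (editor's sketch, hypothetical helper name):

#include <stdint.h>

/* Editor's sketch: the only difference between the put and avg paths. */
static inline uint8_t avg_store(uint8_t pred, uint8_t dst_old)
{
    return (uint8_t)((pred + dst_old + 1) >> 1);   /* rounding average */
}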
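
Distilled to one output row, the vectorisation pattern repeated throughout h264_mc_chroma.c is: pick the vector length from the block width, load w + 1 source bytes so that a vslidedown by one element provides the right-hand neighbours, accumulate the four weighted taps into 16-bit lanes with widening multiply-accumulate, then narrow back to 8 bits with a rounding right shift of 6. A minimal sketch of that row kernel, using the same intrinsics as the patch (editor's illustration; the function name is invented):

#include <riscv_vector.h>
#include <stddef.h>
#include <stdint.h>

/* Editor's sketch: one output row of the full bilinear case (d != 0). */
static void chroma_row_rvv(uint8_t *dst, const uint8_t *src, ptrdiff_t stride,
                           int w, int a, int b, int c, int d)
{
    size_t vl = __riscv_vsetvl_e8m1(w);

    vuint8m1_t r0  = __riscv_vle8_v_u8m1(src, vl + 1);           /* top row           */
    vuint8m1_t r0r = __riscv_vslidedown_vx_u8m1(r0, 1, vl + 1);  /* top, one col right */
    vuint8m1_t r1  = __riscv_vle8_v_u8m1(src + stride, vl + 1);  /* bottom row        */
    vuint8m1_t r1r = __riscv_vslidedown_vx_u8m1(r1, 1, vl + 1);

    vuint16m2_t acc = __riscv_vwmulu_vx_u16m2(r0, a, vl);        /* a * top           */
    acc = __riscv_vwmaccu_vx_u16m2(acc, b, r0r, vl);             /* + b * top-right   */
    acc = __riscv_vwmaccu_vx_u16m2(acc, c, r1, vl);              /* + c * bottom      */
    acc = __riscv_vwmaccu_vx_u16m2(acc, d, r1r, vl);             /* + d * bottom-right */

    /* Narrow (acc >> 6) back to 8 bits with rounding from vxrm, then store. */
    __riscv_vse8_v_u8m1(dst, __riscv_vnclipu_wx_u8m1(acc, 6, vl), vl);
}

The unroll2 and unroll4 routines above are this kernel applied to two or four rows per loop iteration, reusing each loaded source row both as the bottom taps of one output row and as the top taps of the next.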
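
On the dispatch side, ff_h264chroma_init_riscv follows the existing H264ChromaContext convention: index 0 of each table serves 8-pixel-wide blocks, index 1 serves 4-pixel-wide blocks and index 2 serves 2-pixel-wide blocks, with the put table writing the prediction and the avg table averaging it into the destination. A hedged usage sketch of that table (editor's illustration; this is FFmpeg-internal API and the surrounding names and buffers are invented):

#include <stddef.h>
#include <stdint.h>
#include "libavcodec/h264chroma.h"

/* Editor's sketch: running an 8-wide chroma block through the context. */
static void demo_chroma_mc(uint8_t *dst, const uint8_t *ref, ptrdiff_t stride,
                           int mvx, int mvy)
{
    H264ChromaContext c;
    ff_h264chroma_init(&c, 8);   /* 8-bit depth; the RVV table entries may be installed */

    /* x and y are the 1/8-pel fractional parts of the chroma motion vector. */
    int x = mvx & 7, y = mvy & 7;
    const uint8_t *src = ref + (mvy >> 3) * stride + (mvx >> 3);

    c.put_h264_chroma_pixels_tab[0](dst, src, stride, 8, x, y);   /* 8-wide blocks */
}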