From mboxrd@z Thu Jan 1 00:00:00 1970
From: Rémi Denis-Courmont
To: ffmpeg-devel@ffmpeg.org
Date: Sat, 16 Dec 2023 12:44:53 +0200
Message-ID: <20231216104453.13092-1-remi@remlab.net>
X-Mailer: git-send-email 2.43.0
MIME-Version: 1.0
Subject: [FFmpeg-devel] [PATCH] lavc/opusdsp: simplify R-V V postfilter
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit

This skips the round-trip through scalar registers for the sliding 'x'
coefficients, improving performance by about 5%. The trick here is that
the vector slide-up instruction preserves the elements of the
destination vector below the slide offset.

The switch from vfslide1up.vf to vslideup.vi also allows the elimination
of data dependencies on consecutive slides.
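
For illustration only (this is not part of the patch), the slide-up
behaviour relied upon above can be modelled in scalar C. This is a
minimal sketch assuming an unmasked operation with vstart = 0; the names
model_vslideup and model_vfslide1up are hypothetical, and the assertions
reflect the rule that the destination register group must not overlap
the source.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical scalar model of vslideup.vi/vslideup.vx (unmasked, vstart = 0):
 * elements vd[offset..vl-1] are written from vs2, while vd[0..offset-1]
 * keep whatever the destination held before.
 */
static void model_vslideup(float *vd, const float *vs2, size_t offset, size_t vl)
{
    assert(vd != vs2); /* destination must not overlap the source */
    for (size_t i = offset; i < vl; i++)
        vd[i] = vs2[i - offset];
}

/*
 * Hypothetical scalar model of vfslide1up.vf (unmasked, vstart = 0):
 * same shift by one element, but vd[0] is taken from a scalar register.
 */
static void model_vfslide1up(float *vd, const float *vs2, float f, size_t vl)
{
    assert(vd != vs2); /* destination must not overlap the source */
    if (vl == 0)
        return;
    for (size_t i = vl - 1; i > 0; i--)
        vd[i] = vs2[i - 1];
    vd[0] = f;
}
```

In this model, if element 0 of the destination already holds the
carried-over sample, a slide-up by one gives the same result as
vfslide1up.vf fed with that sample from a scalar register, which is why
the scalar copy can be dropped.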

Since the specifications recommend sticking to power-of-two offsets, we
could slide as follows:

    vslideup.vi v8, v0, 2
    vslideup.vi v4, v0, 1
    vslideup.vi v12, v8, 1
    vslideup.vi v16, v8, 2

However, on the device under test, this seems to make performance
slightly worse, so it is left for (in)validation on future, better
hardware.
---
 libavcodec/riscv/opusdsp_rvv.S | 30 ++++++++++++------------------
 1 file changed, 12 insertions(+), 18 deletions(-)

diff --git a/libavcodec/riscv/opusdsp_rvv.S b/libavcodec/riscv/opusdsp_rvv.S
index 79ae86c30e..9a8914c78d 100644
--- a/libavcodec/riscv/opusdsp_rvv.S
+++ b/libavcodec/riscv/opusdsp_rvv.S
@@ -26,40 +26,34 @@ func ff_opus_postfilter_rvv, zve32f
         flw     fa1, 4(a2) // g1
         sub     t0, a0, t1
         flw     fa2, 8(a2) // g2
+        addi    t1, t0, -2 * 4 // data - (period + 2) = initial &x4
+        vsetivli zero, 4, e32, m4, ta, ma
         addi    t0, t0, 2 * 4 // data - (period - 2) = initial &x0
-
-        flw     ft4, -16(t0)
+        vle32.v v16, (t1)
         addi    t3, a1, -2 // maximum parallelism w/o stepping our tail
-        flw     ft3, -12(t0)
-        flw     ft2, -8(t0)
-        flw     ft1, -4(t0)
 1:
+        vslidedown.vi v8, v16, 2
         min     t1, a3, t3
+        vslide1down.vx v12, v16, zero
         vsetvli t1, t1, e32, m4, ta, ma
         vle32.v v0, (t0) // x0
         sub     a3, a3, t1
-        vle32.v v28, (a0)
+        vslide1down.vx v4, v8, zero
         sh2add  t0, t1, t0
-        vfslide1up.vf v4, v0, ft1
+        vle32.v v28, (a0)
         addi    t2, t1, -4
-        vfslide1up.vf v8, v4, ft2
-        vfslide1up.vf v12, v8, ft3
-        vfslide1up.vf v16, v12, ft4
+        vslideup.vi v4, v0, 1
+        vslideup.vi v8, v4, 1
+        vslideup.vi v12, v8, 1
+        vslideup.vi v16, v12, 1
         vfadd.vv v20, v4, v12
         vfadd.vv v24, v0, v16
-        vslidedown.vx v12, v0, t2
+        vslidedown.vx v16, v0, t2
         vfmacc.vf v28, fa0, v8
-        vslidedown.vi v4, v12, 2
         vfmacc.vf v28, fa1, v20
-        vslide1down.vx v8, v12, zero
         vfmacc.vf v28, fa2, v24
-        vslide1down.vx v0, v4, zero
         vse32.v v28, (a0)
-        vfmv.f.s ft4, v12
         sh2add  a0, t1, a0
-        vfmv.f.s ft2, v4
-        vfmv.f.s ft3, v8
-        vfmv.f.s ft1, v0
         bnez    a3, 1b
 
         ret
-- 
2.43.0

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".