From: Stone Chen
To: ffmpeg-devel@ffmpeg.org
Date: Wed, 1 May 2024 18:39:58 -0400
Message-ID: <20240501224031.109294-4-chen.stonechen@gmail.com>
In-Reply-To: <20240501224031.109294-2-chen.stonechen@gmail.com>
Subject: [FFmpeg-devel] [PATCH 2/3][GSoC 2024] libavcodec/x86/vvc: Add AVX2 DMVR SAD functions for VVC

Implements AVX2 DMVR (decoder-side motion vector refinement) SAD functions.

DMVR SAD is only calculated if w >= 8, h >= 8, and w * h >= 128. To reduce
complexity, SAD is only calculated on even rows.
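For reference, the quantity the three AVX2 kernels compute can be written as a short scalar C function. This is a sketch reconstructed from the INIT_OFFSET macro and the loops in vvc_sad.asm, not the decoder's own C code. `dmvr_sad_ref` is a hypothetical name, and dx/dy are assumed to lie in [0, 4], with (2, 2) giving both buffers the same offset (no refinement). The two predictions are shifted in opposite directions, which is why INIT_OFFSET adds dx/dy for one source and subtracts them for the other.

```c
#include <stdint.h>
#include <stdlib.h>

#define MAX_PB_SIZE 128 /* row stride in samples, as in vvc_sad.asm */

/* Hypothetical scalar model of ff_vvc_sad_*_avx2 (sketch).
 * Assumes dx, dy in [0, 4], so (2, 2) is the unrefined position, and that
 * both buffers carry at least 2 samples of padding on every side. */
static int dmvr_sad_ref(const int16_t *src0, const int16_t *src1,
                        int dx, int dy, int block_w, int block_h)
{
    int sad = 0;

    dx -= 2;
    dy -= 2;
    /* Same arithmetic as INIT_OFFSET: mirrored offsets for the two lists. */
    src0 += (2 + dy) * MAX_PB_SIZE + 2 + dx;
    src1 += (2 - dy) * MAX_PB_SIZE + 2 - dx;

    for (int y = 0; y < block_h; y += 2) {   /* even rows only (ROWS = 2) */
        for (int x = 0; x < block_w; x++)
            sad += abs(src0[x] - src1[x]);
        src0 += 2 * MAX_PB_SIZE;
        src1 += 2 * MAX_PB_SIZE;
    }
    return sad;
}
```

With a constant difference of 7 between the buffers, an 8x8 block yields 4 even rows x 8 columns x 7 = 224, which is a quick sanity check when comparing against the SIMD output.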
The SAD is calculated for all bit depths (8, 10 and 12), but the samples
passed to the function are always 16-bit (even when the video bit depth
is 8). The AVX2 implementation computes each absolute difference as
max(a, b) - min(a, b) on unsigned 16-bit lanes (vpmaxuw, vpminuw,
vpsubusw) and zero-extends the results to 32 bits (vpmovzxwd) before
accumulating.

                                             Before   After
BQTerrace_1920x1080_60_10_420_22_RA.vvc        80.7    82.7
Chimera_8bit_1080P_1000_frames.vvc            158.0   167.0
NovosobornayaSquare_1920x1080.bin             159.7   166.3
RitualDance_1920x1080_60_10_420_37_RA.266     146.3   154.0
---
 libavcodec/x86/vvc/Makefile      |   3 +-
 libavcodec/x86/vvc/vvc_sad.asm   | 193 +++++++++++++++++++++++++++++++
 libavcodec/x86/vvc/vvcdsp_init.c |  15 +++
 3 files changed, 210 insertions(+), 1 deletion(-)
 create mode 100644 libavcodec/x86/vvc/vvc_sad.asm

diff --git a/libavcodec/x86/vvc/Makefile b/libavcodec/x86/vvc/Makefile
index d1623bd46a..9eb5f65c7c 100644
--- a/libavcodec/x86/vvc/Makefile
+++ b/libavcodec/x86/vvc/Makefile
@@ -4,4 +4,5 @@ clean::
 OBJS-$(CONFIG_VVC_DECODER)             += x86/vvc/vvcdsp_init.o \
                                           x86/h26x/h2656dsp.o
 X86ASM-OBJS-$(CONFIG_VVC_DECODER)      += x86/vvc/vvc_mc.o \
-                                          x86/h26x/h2656_inter.o
+                                          x86/h26x/h2656_inter.o \
+                                          x86/vvc/vvc_sad.o
diff --git a/libavcodec/x86/vvc/vvc_sad.asm b/libavcodec/x86/vvc/vvc_sad.asm
new file mode 100644
index 0000000000..06be3a936a
--- /dev/null
+++ b/libavcodec/x86/vvc/vvc_sad.asm
@@ -0,0 +1,193 @@
+; /*
+; * Provide SIMD DMVR SAD functions for VVC decoding
+; *
+; * Copyright (c) 2024 Stone Chen
+; *
+; * This file is part of FFmpeg.
+; *
+; * FFmpeg is free software; you can redistribute it and/or
+; * modify it under the terms of the GNU Lesser General Public
+; * License as published by the Free Software Foundation; either
+; * version 2.1 of the License, or (at your option) any later version.
+; *
+; * FFmpeg is distributed in the hope that it will be useful,
+; * but WITHOUT ANY WARRANTY; without even the implied warranty of
+; * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+; * Lesser General Public License for more details.
+; *
+; * You should have received a copy of the GNU Lesser General Public
+; * License along with FFmpeg; if not, write to the Free Software
+; * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+; */
+
+%include "libavutil/x86/x86util.asm"
+
+%define MAX_PB_SIZE 128
+%define ROWS 2 ; DMVR SAD is only calculated on even rows to reduce complexity
+
+SECTION .text
+
+%macro MIN_MAX_SAD 3 ; min_tmp, a, b (on exit b = |a - b| per unsigned word)
+    vpminuw %1, %2, %3
+    vpmaxuw %3, %2, %3
+    vpsubusw %3, %3, %1
+%endmacro
+
+%macro HORIZ_ADD 3 ; dst xmm, low half of acc (clobbered), acc ymm
+    vextracti128 %1, %3, q0001 ; 7 6 5 4 (high lane of acc)
+    vpaddd %1, %2              ; (7 + 3) (6 + 2) (5 + 1) (4 + 0)
+    vpshufd %2, %1, q0032      ; _ _ (7 + 3) (6 + 2)
+    vpaddd %1, %1, %2          ; _ _ (5 1 7 3) (4 0 6 2)
+    vpshufd %2, %1, q0001      ; _ _ (4 0 6 2) (5 1 7 3)
+    vpaddd %1, %1, %2          ; dword 0 = (0 1 2 3 4 5 6 7)
+%endmacro
+
+%macro INIT_OFFSET 6 ; src1, src2, dxq, dyq, off1, off2
+    sub %3, 2
+    sub %4, 2
+
+    mov %5, 2
+    mov %6, 2
+
+    add %5, %4
+    sub %6, %4
+
+    imul %5, 128
+    imul %6, 128
+
+    add %5, 2
+    add %6, 2
+
+    add %5, %3
+    sub %6, %3
+
+    lea %1, [%1 + %5 * 2] ; src1 += dy * MAX_PB_SIZE + dx samples (input dx, dy)
+    lea %2, [%2 + %6 * 2] ; src2 += (4 - dy) * MAX_PB_SIZE + 4 - dx samples
+%endmacro
+
+%if ARCH_X86_64
+%if HAVE_AVX2_EXTERNAL
+
+INIT_YMM avx2
+
+cglobal vvc_sad_8, 6, 8, 14, src1, src2, dx, dy, block_w, block_h, off1, off2
+
+    INIT_OFFSET src1q, src2q, dxq, dyq, off1q, off2q
+    pxor m3, m3
+
+    .loop_height:
+    movu xm0, [src1q]
+    movu xm1, [src2q]
+    MIN_MAX_SAD xm2, xm0, xm1
+    vpmovzxwd m1, xm1
+    vpaddd m3, m1
+
+    movu xm5, [src1q + MAX_PB_SIZE * ROWS * 2]
+    movu xm6, [src2q + MAX_PB_SIZE * ROWS * 2]
+    MIN_MAX_SAD xm7, xm5, xm6
+    vpmovzxwd m6, xm6
+    vpaddd m3, m6
+
+    movu xm8, [src1q + MAX_PB_SIZE * 2 * ROWS * 2]
+    movu xm9, [src2q + MAX_PB_SIZE * 2 * ROWS * 2]
+    MIN_MAX_SAD xm10, xm8, xm9
+    vpmovzxwd m9, xm9
+    vpaddd m3, m9
+
+    movu xm11, [src1q + MAX_PB_SIZE * 3 * ROWS * 2]
+    movu xm12, [src2q + MAX_PB_SIZE * 3 * ROWS * 2]
+    MIN_MAX_SAD xm13, xm11, xm12
+    vpmovzxwd m12, xm12
+
+    vpaddd m3, m12
+
+    add src1q, MAX_PB_SIZE * 4 * ROWS * 2
+    add src2q, MAX_PB_SIZE * 4 * ROWS * 2
+
+    sub block_hd, 8 ; 8 rows (4 of them even) per iteration
+    jg .loop_height
+
+    HORIZ_ADD xm0, xm3, m3
+    movd eax, xm0
+    RET
+
+cglobal vvc_sad_16, 6, 8, 13, src1, src2, dx, dy, block_w, block_h, off1, off2
+    INIT_OFFSET src1q, src2q, dxq, dyq, off1q, off2q
+    pxor m8, m8
+.load_pixels:
+    movu xm0, [src1q]
+    movu xm1, [src2q]
+    MIN_MAX_SAD xm2, xm0, xm1
+    vpmovzxwd m1, xm1
+    vpaddd m8, m1
+
+    movu xm5, [src1q + 16]
+    movu xm6, [src2q + 16]
+    MIN_MAX_SAD xm7, xm5, xm6
+    vpmovzxwd m6, xm6
+    vpaddd m8, m6
+
+    add src1q, ROWS * MAX_PB_SIZE * 2
+    add src2q, ROWS * MAX_PB_SIZE * 2
+
+    sub block_hd, 2
+    jg .load_pixels
+
+    HORIZ_ADD xm0, xm8, m8
+    movd eax, xm0
+
+    RET
+
+cglobal vvc_sad_32_128, 6, 9, 14, src1, src2, dx, dy, block_w, block_h, off1, off2, row_idx
+    INIT_OFFSET src1q, src2q, dxq, dyq, off1q, off2q
+    pxor m3, m3
+
+.loop_height:
+    mov off1q, src1q ; save the start of the current row
+    mov off2q, src2q
+    mov row_idxd, block_wd
+    sar row_idxd, 5 ; number of 32-sample chunks per row
+
+    .loop_width:
+    movu xm0, [src1q]
+    movu xm1, [src2q]
+    MIN_MAX_SAD xm2, xm0, xm1
+    vpmovzxwd m1, xm1
+    vpaddd m3, m1
+
+    movu xm5, [src1q + 16]
+    movu xm6, [src2q + 16]
+    MIN_MAX_SAD xm7, xm5, xm6
+    vpmovzxwd m6, xm6
+    vpaddd m3, m6
+
+    movu xm8, [src1q + 32]
+    movu xm9, [src2q + 32]
+    MIN_MAX_SAD xm10, xm8, xm9
+    vpmovzxwd m9, xm9
+    vpaddd m3, m9
+
+    movu xm11, [src1q + 48]
+    movu xm12, [src2q + 48]
+    MIN_MAX_SAD xm13, xm11, xm12
+    vpmovzxwd m12, xm12
+    vpaddd m3, m12
+
+    add src1q, 64
+    add src2q, 64
+    dec row_idxd
+    jg .loop_width
+
+    lea src1q, [off1q + ROWS * MAX_PB_SIZE * 2]
+    lea src2q, [off2q + ROWS * MAX_PB_SIZE * 2]
+
+    sub block_hd, 2
+    jg .loop_height
+
+    HORIZ_ADD xm0, xm3, m3
+    movd eax, xm0
+
+    RET
+
+%endif
+%endif
diff --git a/libavcodec/x86/vvc/vvcdsp_init.c b/libavcodec/x86/vvc/vvcdsp_init.c
index 985d750472..7760372176 100644
--- a/libavcodec/x86/vvc/vvcdsp_init.c
+++ b/libavcodec/x86/vvc/vvcdsp_init.c
@@ -252,6 +252,18 @@ AVG_FUNCS(16, 12, avx2)
     c->inter.avg    = bf(ff_vvc_avg, bd, opt);                           \
     c->inter.w_avg  = bf(ff_vvc_w_avg, bd, opt);                         \
 } while (0)
+
+int ff_vvc_sad_8_avx2(const int16_t *src0, const int16_t *src1, int dx, int dy, int block_w, int block_h);
+int ff_vvc_sad_16_avx2(const int16_t *src0, const int16_t *src1, int dx, int dy, int block_w, int block_h);
+int ff_vvc_sad_32_128_avx2(const int16_t *src0, const int16_t *src1, int dx, int dy, int block_w, int block_h);
+
+#define SAD_INIT() do {                                                  \
+    c->inter.sad[1] = ff_vvc_sad_8_avx2;                                 \
+    c->inter.sad[2] = ff_vvc_sad_16_avx2;                                \
+    c->inter.sad[3] = ff_vvc_sad_32_128_avx2;                            \
+    c->inter.sad[4] = ff_vvc_sad_32_128_avx2;                            \
+    c->inter.sad[5] = ff_vvc_sad_32_128_avx2;                            \
+} while (0)
 #endif
 
 void ff_vvc_dsp_init_x86(VVCDSPContext *const c, const int bd)
@@ -265,6 +277,7 @@ void ff_vvc_dsp_init_x86(VVCDSPContext *const c, const int bd)
         }
         if (EXTERNAL_AVX2_FAST(cpu_flags)) {
             MC_LINKS_AVX2(8);
+            SAD_INIT();
         }
     } else if (bd == 10) {
         if (EXTERNAL_SSE4(cpu_flags)) {
@@ -273,6 +286,7 @@ void ff_vvc_dsp_init_x86(VVCDSPContext *const c, const int bd)
         if (EXTERNAL_AVX2_FAST(cpu_flags)) {
             MC_LINKS_AVX2(10);
             MC_LINKS_16BPC_AVX2(10);
+            SAD_INIT();
         }
     } else if (bd == 12) {
         if (EXTERNAL_SSE4(cpu_flags)) {
@@ -281,6 +295,7 @@ void ff_vvc_dsp_init_x86(VVCDSPContext *const c, const int bd)
         if (EXTERNAL_AVX2_FAST(cpu_flags)) {
             MC_LINKS_AVX2(12);
             MC_LINKS_16BPC_AVX2(12);
+            SAD_INIT();
         }
     }
 
-- 
2.44.0
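The two helper macros in vvc_sad.asm can be modelled lane by lane in C, which may help when checking them against a scalar definition. This is an illustrative sketch only: `absdiff_u16` and `horiz_add8` are hypothetical names. `absdiff_u16` mirrors MIN_MAX_SAD: because max(a, b) - min(a, b) can never go negative, the saturating vpsubusw behaves like a plain subtraction. Treating the lanes as unsigned gives the true |a - b| only when the int16_t samples are non-negative, which this sketch assumes. `horiz_add8` mirrors the folding order of HORIZ_ADD on the eight dword lanes of the accumulator.

```c
#include <stdint.h>

/* Per-lane model of MIN_MAX_SAD: vpminuw, vpmaxuw, then vpsubusw.
 * max - min never underflows, so the saturating subtract is exact. */
static uint16_t absdiff_u16(uint16_t a, uint16_t b)
{
    uint16_t lo = a < b ? a : b;  /* vpminuw */
    uint16_t hi = a < b ? b : a;  /* vpmaxuw */
    return (uint16_t)(hi - lo);   /* vpsubusw */
}

/* Model of HORIZ_ADD on the 8 dword lanes of a ymm accumulator. */
static uint32_t horiz_add8(const uint32_t v[8])
{
    uint32_t x[4];

    /* vextracti128 + vpaddd: fold the high 128-bit lane onto the low one */
    for (int i = 0; i < 4; i++)
        x[i] = v[i] + v[i + 4];

    /* vpshufd q0032 + vpaddd: fold dwords 2, 3 onto dwords 0, 1 */
    uint32_t y0 = x[0] + x[2];
    uint32_t y1 = x[1] + x[3];

    /* vpshufd q0001 + vpaddd: fold dword 1 onto dword 0 */
    return y0 + y1;
}
```

The min/max form avoids widening to 32 bits before the subtraction; the widening (vpmovzxwd) is deferred until the differences are accumulated, where dword lanes prevent overflow for the largest blocks.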