MIME-Version: 1.0
To: ffmpeg-devel@ffmpeg.org
Date: Sun, 12 Oct 2025 07:10:44 -0000
Message-ID: <176025304521.52.4016490393094737049@bf249f23a2c8>
X-MailFrom: code@ffmpeg.org
X-Mailman-Version: 3.3.10
Precedence: list
Reply-To: FFmpeg development discussions and patches
Subject: [FFmpeg-devel] [PATCH] avcodec/x86/mpegvideoencdsp_init: Use xmm registers in SSSE3 functions (PR #20692)
List-Id: FFmpeg development discussions and patches
From: mkver via ffmpeg-devel
Cc: mkver
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit

PR #20692 opened by mkver
URL: https://code.ffmpeg.org/FFmpeg/FFmpeg/pulls/20692
Patch URL: https://code.ffmpeg.org/FFmpeg/FFmpeg/pulls/20692.patch

From eb12812e4c6a0a9dd781ff1f721e512e7702f3f1 Mon Sep 17 00:00:00 2001
From: Andreas Rheinhardt
Date: Sun, 12 Oct 2025 07:18:24 +0200
Subject: [PATCH 1/3] tests/checkasm/mpegvideoencdsp: Add test for add_8x8basis

Signed-off-by: Andreas Rheinhardt
---
 tests/checkasm/mpegvideoencdsp.c | 40 ++++++++++++++++++++++++++++----
 1 file changed, 35 insertions(+), 5 deletions(-)

diff --git a/tests/checkasm/mpegvideoencdsp.c b/tests/checkasm/mpegvideoencdsp.c
index 24791d113d..281195cd5f 100644
--- a/tests/checkasm/mpegvideoencdsp.c
+++ b/tests/checkasm/mpegvideoencdsp.c
@@ -16,20 +16,48 @@
  * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
  */

+#include "libavutil/common.h"
 #include "libavutil/intreadwrite.h"
-#include "libavutil/mem.h"
 #include "libavutil/mem_internal.h"

+#include "libavcodec/mathops.h"
 #include "libavcodec/mpegvideoencdsp.h"

 #include "checkasm.h"

-#define randomize_buffers(buf, size) \
-    do { \
-        for (int j = 0; j < size; j += 4) \
-            AV_WN32(buf + j, rnd()); \
+#define randomize_buffers(buf, size) \
+    do { \
+        for (int j = 0; j < size; j += 4) \
+            AV_WN32((char*)buf + j, rnd()); \
     } while (0)

+#define randomize_buffer_clipped(buf, min, max) \
+    do { \
+        for (size_t j = 0; j < FF_ARRAY_ELEMS(buf); ++j) \
+            buf[j] = rnd() % (max - min + 1) + min; \
+    } while (0)
+
+static void check_add_8x8basis(MpegvideoEncDSPContext *c)
+{
+    declare_func_emms(AV_CPU_FLAG_SSSE3, void, int16_t rem[64], const int16_t basis[64], int scale);
+    if (check_func(c->add_8x8basis, "add_8x8basis")) {
+        // FIXME: What are the actual ranges for these values?
+        int scale = sign_extend(rnd(), 12);
+        int16_t rem1[64];
+        int16_t rem2[64];
+        int16_t basis[64];
+
+        randomize_buffer_clipped(basis, -15760, 15760);
+        randomize_buffers(rem1, sizeof(rem1));
+        memcpy(rem2, rem1, sizeof(rem2));
+        call_ref(rem1, basis, scale);
+        call_new(rem2, basis, scale);
+        if (memcmp(rem1, rem2, sizeof(rem1)))
+            fail();
+        bench_new(rem1, basis, scale);
+    }
+}
+
 static void check_pix_sum(MpegvideoEncDSPContext *c)
 {
     LOCAL_ALIGNED_16(uint8_t, src, [16 * 16]);
@@ -144,4 +172,6 @@ void checkasm_check_mpegvideoencdsp(void)
     report("pix_norm1");
     check_draw_edges(&c);
     report("draw_edges");
+    check_add_8x8basis(&c);
+    report("add_8x8basis");
 }
-- 
2.49.1


From 77e38557e5d27c2cc698d1c07b0e6311a89cf113 Mon Sep 17 00:00:00 2001
From: Andreas Rheinhardt
Date: Sun, 12 Oct 2025 07:29:04 +0200
Subject: [PATCH 2/3] avcodec/x86/mpegvideoencdsp_init: Don't use slow path unnecessarily

The only requirement of this code (and essentially the pmulhrsw
instruction) is that the scaled scale fits into an int16_t.

Signed-off-by: Andreas Rheinhardt
---
 libavcodec/x86/mpegvideoencdsp_init.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/libavcodec/x86/mpegvideoencdsp_init.c b/libavcodec/x86/mpegvideoencdsp_init.c
index 78c2ef87b8..dc8fcd8833 100644
--- a/libavcodec/x86/mpegvideoencdsp_init.c
+++ b/libavcodec/x86/mpegvideoencdsp_init.c
@@ -90,7 +90,7 @@ static void add_8x8basis_ssse3(int16_t rem[64], const int16_t basis[64], int sca
 {
     x86_reg i=0;

-    if (FFABS(scale) < MAX_ABS) {
+    if (FFABS(scale) < 1024) {
         scale <<= 16 + SCALE_OFFSET - BASIS_SHIFT + RECON_SHIFT;
         __asm__ volatile(
             "movd %3, %%mm5 \n\t"
-- 
2.49.1


From 803493e80ce2a887f1e1b67e51f15f8b548dbe0b Mon Sep 17 00:00:00 2001
From: Andreas Rheinhardt
Date: Sun, 12 Oct 2025 08:04:11 +0200
Subject: [PATCH 3/3] avcodec/x86/mpegvideoencdsp_init: Use xmm registers in SSSE3 functions

Improves performance and no longer breaks the ABI (by forgetting
to call emms).

Old benchmarks:
add_8x8basis_c:         43.6 ( 1.00x)
add_8x8basis_ssse3:     12.3 ( 3.55x)

New benchmarks:
add_8x8basis_c:         43.0 ( 1.00x)
add_8x8basis_ssse3:      6.3 ( 6.79x)

Notice that the output of try_8x8basis_ssse3 changes a bit:
Before this commit, it computes certain values and adds the values
for i,i+1,i+4 and i+5 before right shifting them; now it adds
the values for i,i+1,i+8,i+9. The second pair in these lists
could be avoided (by shifting xmm0 and xmm1 before adding both together
instead of only shifting xmm0 after adding them), but the former
i,i+1 is inherent in using pmaddwd. This is the reason that this
function is not bitexact.

Signed-off-by: Andreas Rheinhardt
---
 libavcodec/mpegvideo_enc.c            |  6 +-
 libavcodec/x86/mpegvideoencdsp_init.c | 99 +++++++++++++--------------
 tests/checkasm/mpegvideoencdsp.c      |  8 +--
 3 files changed, 55 insertions(+), 58 deletions(-)

diff --git a/libavcodec/mpegvideo_enc.c b/libavcodec/mpegvideo_enc.c
index dbf4d25136..9f5da254bf 100644
--- a/libavcodec/mpegvideo_enc.c
+++ b/libavcodec/mpegvideo_enc.c
@@ -2296,7 +2296,7 @@ static av_always_inline void encode_mb_internal(MPVEncContext *const s,
      * and neither of these encoders currently supports 444. */
 #define INTERLACED_DCT(s) ((chroma_format == CHROMA_420 || chroma_format == CHROMA_422) && \
                            (s)->c.avctx->flags & AV_CODEC_FLAG_INTERLACED_DCT)
-    int16_t weight[12][64];
+    DECLARE_ALIGNED(16, int16_t, weight)[12][64];
     int16_t orig[12][64];
     const int mb_x = s->c.mb_x;
     const int mb_y = s->c.mb_y;
@@ -4293,7 +4293,7 @@ static int dct_quantize_trellis_c(MPVEncContext *const s,
     return last_non_zero;
 }

-static int16_t basis[64][64];
+static DECLARE_ALIGNED(16, int16_t, basis)[64][64];

 static void build_basis(uint8_t *perm){
     int i, j, x, y;
@@ -4317,7 +4317,7 @@ static void build_basis(uint8_t *perm){
 static int dct_quantize_refine(MPVEncContext *const s, //FIXME breaks denoise?
                         int16_t *block, int16_t *weight, int16_t *orig,
                         int n, int qscale){
-    int16_t rem[64];
+    DECLARE_ALIGNED(16, int16_t, rem)[64];
     LOCAL_ALIGNED_16(int16_t, d1, [64]);
     const uint8_t *scantable;
     const uint8_t *perm_scantable;
diff --git a/libavcodec/x86/mpegvideoencdsp_init.c b/libavcodec/x86/mpegvideoencdsp_init.c
index dc8fcd8833..3cd16fefbf 100644
--- a/libavcodec/x86/mpegvideoencdsp_init.c
+++ b/libavcodec/x86/mpegvideoencdsp_init.c
@@ -35,13 +35,6 @@ int ff_pix_norm1_sse2(const uint8_t *pix, ptrdiff_t line_size);
 #if HAVE_SSSE3_INLINE
 #define SCALE_OFFSET -1

-/*
- * pmulhrsw: dst[0 - 15] = (src[0 - 15] * dst[0 - 15] + 0x4000)[15 - 30]
- */
-#define PMULHRW(x, y, s, o) \
-    "pmulhrsw " #s ", " #x " \n\t" \
-    "pmulhrsw " #s ", " #y " \n\t"
-
 #define MAX_ABS 512

 static int try_8x8basis_ssse3(const int16_t rem[64], const int16_t weight[64], const int16_t basis[64], int scale)
@@ -52,36 +45,39 @@ static int try_8x8basis_ssse3(const int16_t rem[64], const int16_t weight[64], c
     scale <<= 16 + SCALE_OFFSET - BASIS_SHIFT + RECON_SHIFT;

     __asm__ volatile(
-        "pxor %%mm7, %%mm7 \n\t"
-        "movd %4, %%mm5 \n\t"
-        "punpcklwd %%mm5, %%mm5 \n\t"
-        "punpcklwd %%mm5, %%mm5 \n\t"
-        ".p2align 4 \n\t"
-        "1: \n\t"
-        "movq (%1, %0), %%mm0 \n\t"
-        "movq 8(%1, %0), %%mm1 \n\t"
-        PMULHRW(%%mm0, %%mm1, %%mm5, %%mm6)
-        "paddw (%2, %0), %%mm0 \n\t"
-        "paddw 8(%2, %0), %%mm1 \n\t"
-        "psraw $6, %%mm0 \n\t"
-        "psraw $6, %%mm1 \n\t"
-        "pmullw (%3, %0), %%mm0 \n\t"
-        "pmullw 8(%3, %0), %%mm1 \n\t"
-        "pmaddwd %%mm0, %%mm0 \n\t"
-        "pmaddwd %%mm1, %%mm1 \n\t"
-        "paddd %%mm1, %%mm0 \n\t"
-        "psrld $4, %%mm0 \n\t"
-        "paddd %%mm0, %%mm7 \n\t"
-        "add $16, %0 \n\t"
-        "cmp $128, %0 \n\t" //FIXME optimize & bench
-        " jb 1b \n\t"
-        "pshufw $0x0E, %%mm7, %%mm6 \n\t"
-        "paddd %%mm6, %%mm7 \n\t" // faster than phaddd on core2
-        "psrld $2, %%mm7 \n\t"
-        "movd %%mm7, %0 \n\t"
-
+        "pxor %%xmm2, %%xmm2 \n\t"
+        "movd %4, %%xmm3 \n\t"
+        "punpcklwd %%xmm3, %%xmm3 \n\t"
+        "pshufd $0, %%xmm3, %%xmm3 \n\t"
+        ".p2align 4 \n\t"
+        "1: \n\t"
+        "movdqa (%1, %0), %%xmm0 \n\t"
+        "movdqa 16(%1, %0), %%xmm1 \n\t"
+        "pmulhrsw %%xmm3, %%xmm0 \n\t"
+        "pmulhrsw %%xmm3, %%xmm1 \n\t"
+        "paddw (%2, %0), %%xmm0 \n\t"
+        "paddw 16(%2, %0), %%xmm1 \n\t"
+        "psraw $6, %%xmm0 \n\t"
+        "psraw $6, %%xmm1 \n\t"
+        "pmullw (%3, %0), %%xmm0 \n\t"
+        "pmullw 16(%3, %0), %%xmm1 \n\t"
+        "pmaddwd %%xmm0, %%xmm0 \n\t"
+        "pmaddwd %%xmm1, %%xmm1 \n\t"
+        "paddd %%xmm1, %%xmm0 \n\t"
+        "psrld $4, %%xmm0 \n\t"
+        "paddd %%xmm0, %%xmm2 \n\t"
+        "add $32, %0 \n\t"
+        "cmp $128, %0 \n\t" //FIXME optimize & bench
+        " jb 1b \n\t"
+        "pshufd $0x0E, %%xmm2, %%xmm0 \n\t"
+        "paddd %%xmm0, %%xmm2 \n\t"
+        "pshufd $0x01, %%xmm2, %%xmm0 \n\t"
+        "paddd %%xmm0, %%xmm2 \n\t"
+        "psrld $2, %%xmm2 \n\t"
+        "movd %%xmm2, %0 \n\t"
         : "+r" (i)
         : "r"(basis), "r"(rem), "r"(weight), "g"(scale)
+          XMM_CLOBBERS_ONLY("%xmm0", "%xmm1", "%xmm2", "%xmm3")
     );
     return i;
 }
@@ -93,24 +89,25 @@ static void add_8x8basis_ssse3(int16_t rem[64], const int16_t basis[64], int sca
     if (FFABS(scale) < 1024) {
         scale <<= 16 + SCALE_OFFSET - BASIS_SHIFT + RECON_SHIFT;
         __asm__ volatile(
-            "movd %3, %%mm5 \n\t"
-            "punpcklwd %%mm5, %%mm5 \n\t"
-            "punpcklwd %%mm5, %%mm5 \n\t"
-            ".p2align 4 \n\t"
-            "1: \n\t"
-            "movq (%1, %0), %%mm0 \n\t"
-            "movq 8(%1, %0), %%mm1 \n\t"
-            PMULHRW(%%mm0, %%mm1, %%mm5, %%mm6)
-            "paddw (%2, %0), %%mm0 \n\t"
-            "paddw 8(%2, %0), %%mm1 \n\t"
-            "movq %%mm0, (%2, %0) \n\t"
-            "movq %%mm1, 8(%2, %0) \n\t"
-            "add $16, %0 \n\t"
-            "cmp $128, %0 \n\t" // FIXME optimize & bench
-            " jb 1b \n\t"
-
+            "movd %3, %%xmm2 \n\t"
+            "punpcklwd %%xmm2, %%xmm2 \n\t"
+            "pshufd $0, %%xmm2, %%xmm2 \n\t"
+            ".p2align 4 \n\t"
+            "1: \n\t"
+            "movdqa (%1, %0), %%xmm0 \n\t"
+            "movdqa 16(%1, %0), %%xmm1 \n\t"
+            "pmulhrsw %%xmm2, %%xmm0 \n\t"
+            "pmulhrsw %%xmm2, %%xmm1 \n\t"
+            "paddw (%2, %0), %%xmm0 \n\t"
+            "paddw 16(%2, %0), %%xmm1 \n\t"
+            "movdqa %%xmm0, (%2, %0) \n\t"
+            "movdqa %%xmm1, 16(%2, %0) \n\t"
+            "add $32, %0 \n\t"
+            "cmp $128, %0 \n\t" // FIXME optimize & bench
+            " jb 1b \n\t"
             : "+r" (i)
             : "r"(basis), "r"(rem), "g"(scale)
+              XMM_CLOBBERS_ONLY("%xmm0", "%xmm1", "%xmm2")
         );
     } else {
         for (i=0; i<8*8; i++) {
diff --git a/tests/checkasm/mpegvideoencdsp.c b/tests/checkasm/mpegvideoencdsp.c
index 281195cd5f..a4a4fa6f5c 100644
--- a/tests/checkasm/mpegvideoencdsp.c
+++ b/tests/checkasm/mpegvideoencdsp.c
@@ -39,13 +39,13 @@

 static void check_add_8x8basis(MpegvideoEncDSPContext *c)
 {
-    declare_func_emms(AV_CPU_FLAG_SSSE3, void, int16_t rem[64], const int16_t basis[64], int scale);
+    declare_func(void, int16_t rem[64], const int16_t basis[64], int scale);
     if (check_func(c->add_8x8basis, "add_8x8basis")) {
         // FIXME: What are the actual ranges for these values?
         int scale = sign_extend(rnd(), 12);
-        int16_t rem1[64];
-        int16_t rem2[64];
-        int16_t basis[64];
+        DECLARE_ALIGNED(16, int16_t, rem1)[64];
+        DECLARE_ALIGNED(16, int16_t, rem2)[64];
+        DECLARE_ALIGNED(16, int16_t, basis)[64];

         randomize_buffer_clipped(basis, -15760, 15760);
         randomize_buffers(rem1, sizeof(rem1));
-- 
2.49.1

_______________________________________________
ffmpeg-devel mailing list -- ffmpeg-devel@ffmpeg.org
To unsubscribe send an email to ffmpeg-devel-leave@ffmpeg.org