From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <ffmpeg-devel-bounces@ffmpeg.org>
Received: from ffbox0-bg.ffmpeg.org (ffbox0-bg.ffmpeg.org [79.124.17.100])
	by master.gitmailbox.com (Postfix) with ESMTPS id F26F2410B1
	for <ffmpegdev@gitmailbox.com>; Sun, 12 Oct 2025 07:11:23 +0000 (UTC)
Authentication-Results: ffbox; dkim=fail (body hash mismatch (got 
   b'XhpfIOxE4wE+hYdszfZkG4g1Pj/jb48VsakFceCWyag=', expected 
   b'07B07vx+Ri0Hk4hXsAMU/AeNssSo0BAMDiSbu/znJ7Q=')) header.d=ffmpeg.org 
   header.i=@ffmpeg.org header.a=rsa-sha256
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=ffmpeg.org;
 i=@ffmpeg.org; q=dns/txt; s=mail; t=1760253057; h=mime-version : to :
 date : message-id : reply-to : subject : list-id : list-archive :
 list-archive : list-help : list-owner : list-post : list-subscribe :
 list-unsubscribe : from : cc : content-type :
 content-transfer-encoding : from;
 bh=XhpfIOxE4wE+hYdszfZkG4g1Pj/jb48VsakFceCWyag=;
 b=iW0uBkVszFsYgOObYlOFrgv0CMmMUcRkaNGLAItIvss7JIBs5ZbU4xv2sBu96I3ReWnTW
 zlJMBA0l/vrDXNCGzBBrxQDeqS7KsxXuEnyNQWfIfuD956Y3bWvIFi6c9X3QqfspwqRJ4so
 Lwx7KdPp4HROS1TJU1HyjLd1rw3450pdLnZJwo/bYOohjYfzWg/Z7HEOfYd1ngI2DTdhxrT
 E+eX+z9+UxxKuDH122K4EeNtn+1RMzLonn/1pbNRDEwlpjnFfap6KcbpwpwqeI29fyyGPwv
 gXxCyt5Suk+ZSlswYILkcc8QvAgRA66Wh5OVBHP6mmn/5ZUabjIc5zA+Dmyg==
Received: from [172.19.0.2] (unknown [172.19.0.2])
	by ffbox0-bg.ffmpeg.org (Postfix) with ESMTP id E975B68F180;
	Sun, 12 Oct 2025 10:10:57 +0300 (EEST)
ARC-Seal: i=1; cv=none; a=rsa-sha256; d=ffmpeg.org; s=arc; t=1760253053;
 b=RirVrBmApEguybRoxYAMjs+cQGuWA+wSArlZRGI/wK+sJ4V0fPgAk+QZg93VvoM0WG+U6
 gL+BckgJDo0sbkUsQdhZSc6YJmm+nPQHXWu9ZuloJTV/pvGNkUyG6UbK8tVo7heFkyGwkHy
 T6R6raAnQj7X52PAKjvxLRbkZt7wPTt3Sx+Xt6jbVNDAIqrF2S1J7yfw2qrlNFAIlABD0ld
 wpIbrmty/rCurOEDSiFT/Q8RJNXQF1CLa1MSuYtdG369AwkqvZiWshUm3Ev0tn+tEDNsNAX
 dWmQJv91ghGA4tz7a8iYs4xTEuKSj/fRU5qiVfPgbuUFsY/64VTf3zEAUcoQ==
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed;
 d=ffmpeg.org; s=arc; t=1760253053; h=from : sender : reply-to :
 subject : date : message-id : to : cc : mime-version : content-type :
 content-transfer-encoding : content-id : content-description :
 resent-date : resent-from : resent-sender : resent-to : resent-cc :
 resent-message-id : in-reply-to : references : list-id : list-help :
 list-unsubscribe : list-subscribe : list-post : list-owner :
 list-archive; bh=kOVJVkxXqGfr1ycZxDJrf3UJcQ3fpsIuaFIPIA/F4rU=;
 b=dl07EjLPAmOGcuSCUs0DCPJqx6hECDfdMxbE1bjCYAaLRtpam3BPwu0D0TF3vEe+E2Nrm
 6P6d0TgfUXs0g/mvE23UF97A9GisRiKnssIDianlcFYwG3LBlT7bkJLgW+mVIyyZ4fI6Axi
 g0hcdVHC/vRWQaJg3nG/6WUFB8Qfc8kpfdnRskiU2gZnZMFAjSh/G4JY+JhpPc3Yqnf9XZY
 GB5gDdrjZ1yoK9bTFpxMsbbPx3aUNju0DsZLRhucOHFCjNR8YBHhs3cLtuZOH7Eji7j4RZU
 4/GLHw4kH3yh5bCC4IhIDXKQKVQSLAePWCkRp8FqK4Cs+lE1GNsoYv6dbyHw==
ARC-Authentication-Results: i=1; ffmpeg.org;
 dkim=pass header.d=ffmpeg.org header.i=@ffmpeg.org;
 arc=none;
 dmarc=pass header.from=ffmpeg.org policy.dmarc=quarantine
Authentication-Results: ffmpeg.org;
 dkim=pass header.d=ffmpeg.org header.i=@ffmpeg.org;
 arc=none (Message is not ARC signed);
 dmarc=pass (Used From Domain Record) header.from=ffmpeg.org
 policy.dmarc=quarantine
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=ffmpeg.org;
 i=@ffmpeg.org; q=dns/txt; s=mail; t=1760253045; h=content-type :
 mime-version : content-transfer-encoding : from : to : reply-to :
 subject : date : from;
 bh=07B07vx+Ri0Hk4hXsAMU/AeNssSo0BAMDiSbu/znJ7Q=;
 b=LcrNQXSuDrfyHFfsIqGCGliar+KHQSteuxzjdWKmWSkDtf1Kg2sDAKtRwhcsDYTg6Ac1j
 k4CsWgkeq0IG6PPb1+tPq9PUFW/ZiVDmQi7GKkZpRm5yXHyN+D7veE1YCZovN/TKOY8FhD1
 5Ql4jJdKuQdoKs9H2JjniYt5Bi6J58RWkuFHDLcUCDyXHQwuzBoNQt8nRRfxBY7xGBKhJfl
 zcRQp7PHxGz9uLSt3CPHoCC6wJ/FSasHz6cvlsOw8e9F6UNrlY3oW51Uy6zKrknCwTOz7PZ
 qhf2qhOD91P8DWUDCNtTYtRV1v5gRIvL9cZZdeXWWd8g85aD+9Cy6LgFhpPg==
Received: from be50bb5a3685 (code.ffmpeg.org [188.245.149.3])
	by ffbox0-bg.ffmpeg.org (Postfix) with ESMTPS id 057BE68F078
	for <ffmpeg-devel@ffmpeg.org>; Sun, 12 Oct 2025 10:10:44 +0300 (EEST)
MIME-Version: 1.0
To: ffmpeg-devel@ffmpeg.org
Date: Sun, 12 Oct 2025 07:10:44 -0000
Message-ID: <176025304521.52.4016490393094737049@bf249f23a2c8>
Message-ID-Hash: RKBIXXUEYP77LLTEQWGCONAS6FWSCDFY
X-Message-ID-Hash: RKBIXXUEYP77LLTEQWGCONAS6FWSCDFY
X-MailFrom: code@ffmpeg.org
X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; loop;
 banned-address; header-match-ffmpeg-devel.ffmpeg.org-0;
 header-match-ffmpeg-devel.ffmpeg.org-1;
 header-match-ffmpeg-devel.ffmpeg.org-2;
 header-match-ffmpeg-devel.ffmpeg.org-3; emergency; member-moderation;
 nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size;
 news-moderation; no-subject; digests; suspicious-header
X-Mailman-Version: 3.3.10
Precedence: list
Reply-To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Subject: [FFmpeg-devel] [PATCH] avcodec/x86/mpegvideoencdsp_init: Use xmm registers in
 SSSE3 functions (PR #20692)
List-Id: FFmpeg development discussions and patches <ffmpeg-devel.ffmpeg.org>
Archived-At: 
 <https://lists.ffmpeg.org/archives/list/ffmpeg-devel@ffmpeg.org/message/RKBIXXUEYP77LLTEQWGCONAS6FWSCDFY/>
Archived-At: 
 <https://lists.ffmpeg.org/lore/ffmpeg-devel/176025304521.52.4016490393094737049@bf249f23a2c8/>
List-Archive: 
 <https://lists.ffmpeg.org/archives/list/ffmpeg-devel@ffmpeg.org/>
List-Archive: <https://lists.ffmpeg.org/lore/ffmpeg-devel/>
List-Help: <mailto:ffmpeg-devel-request@ffmpeg.org?subject=help>
List-Owner: <mailto:ffmpeg-devel-owner@ffmpeg.org>
List-Post: <mailto:ffmpeg-devel@ffmpeg.org>
List-Subscribe: <mailto:ffmpeg-devel-join@ffmpeg.org>
List-Unsubscribe: <mailto:ffmpeg-devel-leave@ffmpeg.org>
From: mkver via ffmpeg-devel <ffmpeg-devel@ffmpeg.org>
Cc: mkver <code@ffmpeg.org>
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Archived-At: <http://master.gitmailbox.com/ffmpegdev/176025304521.52.4016490393094737049@bf249f23a2c8/>
List-Archive: <http://master.gitmailbox.com/ffmpegdev/>
List-Post: <mailto:ffmpegdev@gitmailbox.com>

PR #20692 opened by mkver
URL: https://code.ffmpeg.org/FFmpeg/FFmpeg/pulls/20692
Patch URL: https://code.ffmpeg.org/FFmpeg/FFmpeg/pulls/20692.patch


>>From eb12812e4c6a0a9dd781ff1f721e512e7702f3f1 Mon Sep 17 00:00:00 2001
From: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Date: Sun, 12 Oct 2025 07:18:24 +0200
Subject: [PATCH 1/3] tests/checkasm/mpegvideoencdsp: Add test for add_8x8basis

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
---
 tests/checkasm/mpegvideoencdsp.c | 40 ++++++++++++++++++++++++++++----
 1 file changed, 35 insertions(+), 5 deletions(-)

diff --git a/tests/checkasm/mpegvideoencdsp.c b/tests/checkasm/mpegvideoencdsp.c
index 24791d113d..281195cd5f 100644
--- a/tests/checkasm/mpegvideoencdsp.c
+++ b/tests/checkasm/mpegvideoencdsp.c
@@ -16,20 +16,48 @@
  * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
  */
 
+#include "libavutil/common.h"
 #include "libavutil/intreadwrite.h"
-#include "libavutil/mem.h"
 #include "libavutil/mem_internal.h"
 
+#include "libavcodec/mathops.h"
 #include "libavcodec/mpegvideoencdsp.h"
 
 #include "checkasm.h"
 
-#define randomize_buffers(buf, size)      \
-    do {                                  \
-        for (int j = 0; j < size; j += 4) \
-            AV_WN32(buf + j, rnd());      \
+#define randomize_buffers(buf, size)        \
+    do {                                    \
+        for (int j = 0; j < size; j += 4)   \
+            AV_WN32((char*)buf + j, rnd()); \
     } while (0)
 
+#define randomize_buffer_clipped(buf, min, max)          \
+    do {                                                 \
+        for (size_t j = 0; j < FF_ARRAY_ELEMS(buf); ++j) \
+            buf[j] = rnd() % (max - min + 1) + min;      \
+    } while (0)
+
+static void check_add_8x8basis(MpegvideoEncDSPContext *c)
+{
+    declare_func_emms(AV_CPU_FLAG_SSSE3, void, int16_t rem[64], const int16_t basis[64], int scale);
+    if (check_func(c->add_8x8basis, "add_8x8basis")) {
+        // FIXME: What are the actual ranges for these values?
+        int scale = sign_extend(rnd(), 12);
+        int16_t rem1[64];
+        int16_t rem2[64];
+        int16_t basis[64];
+
+        randomize_buffer_clipped(basis, -15760, 15760);
+        randomize_buffers(rem1, sizeof(rem1));
+        memcpy(rem2, rem1, sizeof(rem2));
+        call_ref(rem1, basis, scale);
+        call_new(rem2, basis, scale);
+        if (memcmp(rem1, rem2, sizeof(rem1)))
+            fail();
+        bench_new(rem1, basis, scale);
+    }
+}
+
 static void check_pix_sum(MpegvideoEncDSPContext *c)
 {
     LOCAL_ALIGNED_16(uint8_t, src, [16 * 16]);
@@ -144,4 +172,6 @@ void checkasm_check_mpegvideoencdsp(void)
     report("pix_norm1");
     check_draw_edges(&c);
     report("draw_edges");
+    check_add_8x8basis(&c);
+    report("add_8x8basis");
 }
-- 
2.49.1


>>From 77e38557e5d27c2cc698d1c07b0e6311a89cf113 Mon Sep 17 00:00:00 2001
From: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Date: Sun, 12 Oct 2025 07:29:04 +0200
Subject: [PATCH 2/3] avcodec/x86/mpegvideoencdsp_init: Don't use slow path
 unnecessarily

The only requirement of this code (essentially, of the pmulhrsw
instruction) is that the scale still fits into an int16_t after
being shifted left by 16 + SCALE_OFFSET - BASIS_SHIFT + RECON_SHIFT.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
---
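Not part of the patch: a minimal scalar sketch of why the new bound is
safe. It assumes BASIS_SHIFT is 16 and RECON_SHIFT is 6 (the patch only
shows the names, so check mpegvideoencdsp.h); all helper names below are
made up for illustration. pmulhrsw_lane() models one 16-bit lane as
(a * b + 0x4000) >> 15.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values; the patch only uses the names. */
#define BASIS_SHIFT  16
#define RECON_SHIFT   6
#define SCALE_OFFSET -1 /* as defined in mpegvideoencdsp_init.c */
#define SCALE_SHIFT  (16 + SCALE_OFFSET - BASIS_SHIFT + RECON_SHIFT) /* 5 */

/* One 16-bit lane of pmulhrsw: (a * b + 0x4000) >> 15.
 * Relies on >> being an arithmetic shift for negative values. */
static int16_t pmulhrsw_lane(int16_t a, int16_t b)
{
    return (int16_t)(((int32_t)a * b + 0x4000) >> 15);
}

/* Returns 1 if scale * 2^SCALE_SHIFT is representable as int16_t. */
static int shifted_scale_fits(int scale)
{
    const int32_t shifted = (int32_t)scale * (1 << SCALE_SHIFT);
    return shifted >= INT16_MIN && shifted <= INT16_MAX;
}

/* Rounded product computed without the pre-shift, for comparison. */
static int32_t scaled_term(int16_t basis, int scale)
{
    const int shift = BASIS_SHIFT - RECON_SHIFT; /* 10 */
    return ((int32_t)basis * scale + (1 << (shift - 1))) >> shift;
}

/* Every scale accepted by FFABS(scale) < 1024 fits after the shift,
 * and the pmulhrsw model agrees with the plain rounded product. */
static void check_fast_path_range(void)
{
    static const int16_t samples[] = { -15760, -1, 0, 1, 15760 };

    for (int scale = -1023; scale <= 1023; scale++) {
        const int16_t s16 = (int16_t)(scale * (1 << SCALE_SHIFT));
        assert(shifted_scale_fits(scale));
        for (int k = 0; k < 5; k++)
            assert(pmulhrsw_lane(samples[k], s16) ==
                   scaled_term(samples[k], scale));
    }
    assert(!shifted_scale_fits(1024)); /* 1024 * 32 == 32768 overflows */
}
```

Under these assumptions the shift is 5, so the largest accepted scale,
1023, becomes 32736, which still fits into an int16_t.
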
 libavcodec/x86/mpegvideoencdsp_init.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/libavcodec/x86/mpegvideoencdsp_init.c b/libavcodec/x86/mpegvideoencdsp_init.c
index 78c2ef87b8..dc8fcd8833 100644
--- a/libavcodec/x86/mpegvideoencdsp_init.c
+++ b/libavcodec/x86/mpegvideoencdsp_init.c
@@ -90,7 +90,7 @@ static void add_8x8basis_ssse3(int16_t rem[64], const int16_t basis[64], int sca
 {
     x86_reg i=0;
 
-    if (FFABS(scale) < MAX_ABS) {
+    if (FFABS(scale) < 1024) {
         scale <<= 16 + SCALE_OFFSET - BASIS_SHIFT + RECON_SHIFT;
         __asm__ volatile(
                 "movd  %3, %%mm5        \n\t"
-- 
2.49.1


>>From 803493e80ce2a887f1e1b67e51f15f8b548dbe0b Mon Sep 17 00:00:00 2001
From: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Date: Sun, 12 Oct 2025 08:04:11 +0200
Subject: [PATCH 3/3] avcodec/x86/mpegvideoencdsp_init: Use xmm registers in
 SSSE3 functions

Improves performance and no longer breaks the ABI: the MMX versions
returned without calling emms.

Old benchmarks:
add_8x8basis_c:                                         43.6 ( 1.00x)
add_8x8basis_ssse3:                                     12.3 ( 3.55x)

New benchmarks:
add_8x8basis_c:                                         43.0 ( 1.00x)
add_8x8basis_ssse3:                                      6.3 ( 6.79x)

Note that the output of try_8x8basis_ssse3 changes slightly:
Before this commit, it summed the values for i, i+1, i+4 and i+5
before right-shifting them; now it sums the values for i, i+1, i+8
and i+9. The second pair in each list could be avoided (by shifting
xmm0 and xmm1 before adding them instead of shifting only xmm0 after
the addition), but the i, i+1 pairing is inherent in using pmaddwd.
This is why the function is not bitexact.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
---
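Not part of the patch: a rough scalar model of the grouping effect
described in the commit message. It only models the final reduction
(square, add four terms, >> 4, accumulate, >> 2), not the weighting
steps before it. The function names and the input pattern are
hypothetical; stride 4 stands for the old MMX loop and stride 8 for the
new SSE loop.

```c
#include <assert.h>
#include <stdint.h>

/* Square of one 16-bit term, as pmaddwd would compute it. */
static uint32_t sq(int16_t x)
{
    return (uint32_t)((int32_t)x * x);
}

/* Sum 64 squared terms. The terms i, i+1, i+stride and i+stride+1 are
 * added together before the truncating >> 4; the final >> 2 is applied
 * to the accumulated total. */
static uint32_t grouped_sum(const int16_t v[64], int stride)
{
    uint32_t total = 0;

    for (int b = 0; b < 64; b += 2 * stride) {
        for (int k = 0; k < stride; k += 2) {
            const int i = b + k;
            const uint32_t group = sq(v[i]) + sq(v[i + 1]) +
                                   sq(v[i + stride]) + sq(v[i + stride + 1]);
            total += group >> 4;
        }
    }
    return total >> 2;
}

/* Hypothetical input: 3 at offsets 0 and 4 of every 16 terms. */
static void fill_example(int16_t v[64])
{
    for (int i = 0; i < 64; i++)
        v[i] = (i % 16 == 0 || i % 16 == 4) ? 3 : 0;
}
```

With this input, stride 4 puts both nonzero squares of each 16-term
block into one group (9 + 9 = 18, which survives the >> 4), while
stride 8 separates them (9 each, which does not), so the two groupings
return different totals for the same data.
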
 libavcodec/mpegvideo_enc.c            |  6 +-
 libavcodec/x86/mpegvideoencdsp_init.c | 99 +++++++++++++--------------
 tests/checkasm/mpegvideoencdsp.c      |  8 +--
 3 files changed, 55 insertions(+), 58 deletions(-)

diff --git a/libavcodec/mpegvideo_enc.c b/libavcodec/mpegvideo_enc.c
index dbf4d25136..9f5da254bf 100644
--- a/libavcodec/mpegvideo_enc.c
+++ b/libavcodec/mpegvideo_enc.c
@@ -2296,7 +2296,7 @@ static av_always_inline void encode_mb_internal(MPVEncContext *const s,
  * and neither of these encoders currently supports 444. */
 #define INTERLACED_DCT(s) ((chroma_format == CHROMA_420 || chroma_format == CHROMA_422) && \
                            (s)->c.avctx->flags & AV_CODEC_FLAG_INTERLACED_DCT)
-    int16_t weight[12][64];
+    DECLARE_ALIGNED(16, int16_t, weight)[12][64];
     int16_t orig[12][64];
     const int mb_x = s->c.mb_x;
     const int mb_y = s->c.mb_y;
@@ -4293,7 +4293,7 @@ static int dct_quantize_trellis_c(MPVEncContext *const s,
     return last_non_zero;
 }
 
-static int16_t basis[64][64];
+static DECLARE_ALIGNED(16, int16_t, basis)[64][64];
 
 static void build_basis(uint8_t *perm){
     int i, j, x, y;
@@ -4317,7 +4317,7 @@ static void build_basis(uint8_t *perm){
 static int dct_quantize_refine(MPVEncContext *const s, //FIXME breaks denoise?
                         int16_t *block, int16_t *weight, int16_t *orig,
                         int n, int qscale){
-    int16_t rem[64];
+    DECLARE_ALIGNED(16, int16_t, rem)[64];
     LOCAL_ALIGNED_16(int16_t, d1, [64]);
     const uint8_t *scantable;
     const uint8_t *perm_scantable;
diff --git a/libavcodec/x86/mpegvideoencdsp_init.c b/libavcodec/x86/mpegvideoencdsp_init.c
index dc8fcd8833..3cd16fefbf 100644
--- a/libavcodec/x86/mpegvideoencdsp_init.c
+++ b/libavcodec/x86/mpegvideoencdsp_init.c
@@ -35,13 +35,6 @@ int ff_pix_norm1_sse2(const uint8_t *pix, ptrdiff_t line_size);
 #if HAVE_SSSE3_INLINE
 #define SCALE_OFFSET -1
 
-/*
- * pmulhrsw: dst[0 - 15] = (src[0 - 15] * dst[0 - 15] + 0x4000)[15 - 30]
- */
-#define PMULHRW(x, y, s, o)                     \
-    "pmulhrsw " #s ", " #x "            \n\t"   \
-    "pmulhrsw " #s ", " #y "            \n\t"
-
 #define MAX_ABS 512
 
 static int try_8x8basis_ssse3(const int16_t rem[64], const int16_t weight[64], const int16_t basis[64], int scale)
@@ -52,36 +45,39 @@ static int try_8x8basis_ssse3(const int16_t rem[64], const int16_t weight[64], c
     scale <<= 16 + SCALE_OFFSET - BASIS_SHIFT + RECON_SHIFT;
 
     __asm__ volatile(
-        "pxor %%mm7, %%mm7              \n\t"
-        "movd  %4, %%mm5                \n\t"
-        "punpcklwd %%mm5, %%mm5         \n\t"
-        "punpcklwd %%mm5, %%mm5         \n\t"
-        ".p2align 4                     \n\t"
-        "1:                             \n\t"
-        "movq  (%1, %0), %%mm0          \n\t"
-        "movq  8(%1, %0), %%mm1         \n\t"
-        PMULHRW(%%mm0, %%mm1, %%mm5, %%mm6)
-        "paddw (%2, %0), %%mm0          \n\t"
-        "paddw 8(%2, %0), %%mm1         \n\t"
-        "psraw $6, %%mm0                \n\t"
-        "psraw $6, %%mm1                \n\t"
-        "pmullw (%3, %0), %%mm0         \n\t"
-        "pmullw 8(%3, %0), %%mm1        \n\t"
-        "pmaddwd %%mm0, %%mm0           \n\t"
-        "pmaddwd %%mm1, %%mm1           \n\t"
-        "paddd %%mm1, %%mm0             \n\t"
-        "psrld $4, %%mm0                \n\t"
-        "paddd %%mm0, %%mm7             \n\t"
-        "add $16, %0                    \n\t"
-        "cmp $128, %0                   \n\t" //FIXME optimize & bench
-        " jb 1b                         \n\t"
-        "pshufw $0x0E, %%mm7, %%mm6     \n\t"
-        "paddd %%mm6, %%mm7             \n\t" // faster than phaddd on core2
-        "psrld $2, %%mm7                \n\t"
-        "movd %%mm7, %0                 \n\t"
-
+        "pxor            %%xmm2, %%xmm2     \n\t"
+        "movd                %4, %%xmm3     \n\t"
+        "punpcklwd       %%xmm3, %%xmm3     \n\t"
+        "pshufd      $0, %%xmm3, %%xmm3     \n\t"
+        ".p2align 4                         \n\t"
+        "1:                                 \n\t"
+        "movdqa        (%1, %0), %%xmm0     \n\t"
+        "movdqa      16(%1, %0), %%xmm1     \n\t"
+        "pmulhrsw        %%xmm3, %%xmm0     \n\t"
+        "pmulhrsw        %%xmm3, %%xmm1     \n\t"
+        "paddw         (%2, %0), %%xmm0     \n\t"
+        "paddw       16(%2, %0), %%xmm1     \n\t"
+        "psraw               $6, %%xmm0     \n\t"
+        "psraw               $6, %%xmm1     \n\t"
+        "pmullw        (%3, %0), %%xmm0     \n\t"
+        "pmullw      16(%3, %0), %%xmm1     \n\t"
+        "pmaddwd         %%xmm0, %%xmm0     \n\t"
+        "pmaddwd         %%xmm1, %%xmm1     \n\t"
+        "paddd           %%xmm1, %%xmm0     \n\t"
+        "psrld               $4, %%xmm0     \n\t"
+        "paddd           %%xmm0, %%xmm2     \n\t"
+        "add                $32, %0         \n\t"
+        "cmp               $128, %0         \n\t" //FIXME optimize & bench
+        " jb                 1b             \n\t"
+        "pshufd   $0x0E, %%xmm2, %%xmm0     \n\t"
+        "paddd           %%xmm0, %%xmm2     \n\t"
+        "pshufd   $0x01, %%xmm2, %%xmm0     \n\t"
+        "paddd           %%xmm0, %%xmm2     \n\t"
+        "psrld               $2, %%xmm2     \n\t"
+        "movd            %%xmm2, %0         \n\t"
         : "+r" (i)
         : "r"(basis), "r"(rem), "r"(weight), "g"(scale)
+        XMM_CLOBBERS_ONLY("%xmm0", "%xmm1", "%xmm2", "%xmm3")
     );
     return i;
 }
@@ -93,24 +89,25 @@ static void add_8x8basis_ssse3(int16_t rem[64], const int16_t basis[64], int sca
     if (FFABS(scale) < 1024) {
         scale <<= 16 + SCALE_OFFSET - BASIS_SHIFT + RECON_SHIFT;
         __asm__ volatile(
-                "movd  %3, %%mm5        \n\t"
-                "punpcklwd %%mm5, %%mm5 \n\t"
-                "punpcklwd %%mm5, %%mm5 \n\t"
-                ".p2align 4             \n\t"
-                "1:                     \n\t"
-                "movq  (%1, %0), %%mm0  \n\t"
-                "movq  8(%1, %0), %%mm1 \n\t"
-                PMULHRW(%%mm0, %%mm1, %%mm5, %%mm6)
-                "paddw (%2, %0), %%mm0  \n\t"
-                "paddw 8(%2, %0), %%mm1 \n\t"
-                "movq %%mm0, (%2, %0)   \n\t"
-                "movq %%mm1, 8(%2, %0)  \n\t"
-                "add $16, %0            \n\t"
-                "cmp $128, %0           \n\t" // FIXME optimize & bench
-                " jb 1b                 \n\t"
-
+                "movd                %3, %%xmm2     \n\t"
+                "punpcklwd       %%xmm2, %%xmm2     \n\t"
+                "pshufd      $0, %%xmm2, %%xmm2     \n\t"
+                ".p2align 4                         \n\t"
+                "1:                                 \n\t"
+                "movdqa        (%1, %0), %%xmm0     \n\t"
+                "movdqa      16(%1, %0), %%xmm1     \n\t"
+                "pmulhrsw        %%xmm2, %%xmm0     \n\t"
+                "pmulhrsw        %%xmm2, %%xmm1     \n\t"
+                "paddw         (%2, %0), %%xmm0     \n\t"
+                "paddw       16(%2, %0), %%xmm1     \n\t"
+                "movdqa          %%xmm0, (%2, %0)   \n\t"
+                "movdqa          %%xmm1, 16(%2, %0) \n\t"
+                "add                $32, %0         \n\t"
+                "cmp               $128, %0         \n\t" // FIXME optimize & bench
+                " jb                 1b             \n\t"
                 : "+r" (i)
                 : "r"(basis), "r"(rem), "g"(scale)
+                XMM_CLOBBERS_ONLY("%xmm0", "%xmm1", "%xmm2")
         );
     } else {
         for (i=0; i<8*8; i++) {
diff --git a/tests/checkasm/mpegvideoencdsp.c b/tests/checkasm/mpegvideoencdsp.c
index 281195cd5f..a4a4fa6f5c 100644
--- a/tests/checkasm/mpegvideoencdsp.c
+++ b/tests/checkasm/mpegvideoencdsp.c
@@ -39,13 +39,13 @@
 
 static void check_add_8x8basis(MpegvideoEncDSPContext *c)
 {
-    declare_func_emms(AV_CPU_FLAG_SSSE3, void, int16_t rem[64], const int16_t basis[64], int scale);
+    declare_func(void, int16_t rem[64], const int16_t basis[64], int scale);
     if (check_func(c->add_8x8basis, "add_8x8basis")) {
         // FIXME: What are the actual ranges for these values?
         int scale = sign_extend(rnd(), 12);
-        int16_t rem1[64];
-        int16_t rem2[64];
-        int16_t basis[64];
+        DECLARE_ALIGNED(16, int16_t, rem1)[64];
+        DECLARE_ALIGNED(16, int16_t, rem2)[64];
+        DECLARE_ALIGNED(16, int16_t, basis)[64];
 
         randomize_buffer_clipped(basis, -15760, 15760);
         randomize_buffers(rem1, sizeof(rem1));
-- 
2.49.1
