From mboxrd@z Thu Jan  1 00:00:00 1970
MIME-Version: 1.0
To: ffmpeg-devel@ffmpeg.org
Date: Sun, 02 Nov 2025 22:06:51 -0000
Message-ID: <176212121236.25.5438835603124650670@2cb04c0e5124>
Precedence: list
Reply-To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Subject: [FFmpeg-devel] [PATCH] avcodec/x86/me_cmp: Avoid MMX in (n)sse (PR #20822)
List-Id: FFmpeg development discussions and patches <ffmpeg-devel.ffmpeg.org>
Archived-At: 
 <https://lists.ffmpeg.org/archives/list/ffmpeg-devel@ffmpeg.org/message/WF6DW47I5ADWVVWOJZW2BUCHMZ2T6C7N/>
Archived-At: 
 <https://lists.ffmpeg.org/lore/ffmpeg-devel/176212121236.25.5438835603124650670@2cb04c0e5124/>
List-Archive: 
 <https://lists.ffmpeg.org/archives/list/ffmpeg-devel@ffmpeg.org/>
List-Archive: <https://lists.ffmpeg.org/lore/ffmpeg-devel/>
List-Help: <mailto:ffmpeg-devel-request@ffmpeg.org?subject=help>
List-Owner: <mailto:ffmpeg-devel-owner@ffmpeg.org>
List-Post: <mailto:ffmpeg-devel@ffmpeg.org>
List-Subscribe: <mailto:ffmpeg-devel-join@ffmpeg.org>
List-Unsubscribe: <mailto:ffmpeg-devel-leave@ffmpeg.org>
From: mkver via ffmpeg-devel <ffmpeg-devel@ffmpeg.org>
Cc: mkver <code@ffmpeg.org>
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Archived-At: <https://master.gitmailbox.com/ffmpegdev/176212121236.25.5438835603124650670@2cb04c0e5124/>
List-Archive: <https://master.gitmailbox.com/ffmpegdev/>
List-Post: <mailto:ffmpegdev@gitmailbox.com>

PR #20822 opened by mkver
URL: https://code.ffmpeg.org/FFmpeg/FFmpeg/pulls/20822
Patch URL: https://code.ffmpeg.org/FFmpeg/FFmpeg/pulls/20822.patch


>>From 8c9f4f695859f018109294b6712b9f97eb777ed6 Mon Sep 17 00:00:00 2001
From: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Date: Sun, 2 Nov 2025 17:43:10 +0100
Subject: [PATCH 1/5] avcodec/x86/me_cmp: Avoid unnecessary instruction

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
---
 libavcodec/x86/me_cmp.asm | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)
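
Note on the loop counter (illustration only, not part of the patch): a
minimal C sketch of why dropping the up-front "shr hd, 1" and subtracting
2 per iteration gives the same trip count. It assumes h is positive and
even, which the two-rows-per-iteration loop needs anyway. The function
names trips_old and trips_new are hypothetical.

```c
/* Hypothetical model of the loop counter only; not FFmpeg code. */

/* Old scheme for the two-rows-per-iteration case: halve h up front,
 * then decrement once per iteration. */
static int trips_old(int h)
{
    int n = 0;
    h >>= 1;             /* shr hd, 1 */
    do {
        n++;             /* loop body handles two rows */
    } while (--h != 0);  /* dec hd; jnz */
    return n;
}

/* New scheme: leave h untouched at entry and subtract 2 per iteration. */
static int trips_new(int h)
{
    int n = 0;
    do {
        n++;             /* loop body handles two rows */
        h -= 2;          /* sub hd, 2 */
    } while (h != 0);    /* jnz */
    return n;
}
```

Both variants run h/2 iterations for even h; the new one simply saves the
shift at function entry.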

diff --git a/libavcodec/x86/me_cmp.asm b/libavcodec/x86/me_cmp.asm
index 7825c8ef71..cf4ee941f7 100644
--- a/libavcodec/x86/me_cmp.asm
+++ b/libavcodec/x86/me_cmp.asm
@@ -282,9 +282,6 @@ HADAMARD8_DIFF 9
 
 %macro SUM_SQUARED_ERRORS 1
 cglobal sse%1, 5,5,8, v, pix1, pix2, lsize, h
-%if %1 == mmsize
-    shr       hd, 1
-%endif
     pxor      m0, m0         ; mm0 = 0
     pxor      m7, m7         ; mm7 holds the sum
 
@@ -334,11 +331,12 @@ cglobal sse%1, 5,5,8, v, pix1, pix2, lsize, h
 %if %1 == mmsize
     lea    pix1q, [pix1q + 2*lsizeq]
     lea    pix2q, [pix2q + 2*lsizeq]
+    sub       hd, 2
 %else
     add    pix1q, lsizeq
     add    pix2q, lsizeq
-%endif
     dec       hd
+%endif
     jnz .next2lines
 
     HADDD     m7, m1
-- 
2.49.1


>>From 205a0e1a49d168fd07c55d152b6fc5fc0706aeb8 Mon Sep 17 00:00:00 2001
From: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Date: Sun, 2 Nov 2025 17:50:15 +0100
Subject: [PATCH 2/5] avcodec/x86/me_cmp: Rename registers

This will avoid using xmm registers that are nonvolatile (callee-saved)
on Win64, i.e. xmm6 and above, in the function added by the next commit.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
---
 libavcodec/x86/me_cmp.asm | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)
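
For orientation while reading the renamed registers (illustration only):
the copies that now live in m6/m7 exist because psubusb overwrites its
destination, and the loop needs both saturating differences to form the
absolute difference. A minimal scalar sketch of that trick follows; the
helper names are hypothetical.

```c
#include <stdint.h>

/* Unsigned saturating byte subtract, the per-lane behaviour of psubusb. */
static uint8_t sub_sat_u8(uint8_t a, uint8_t b)
{
    return a > b ? (uint8_t)(a - b) : 0;
}

/* |a - b| without widening: one of the two saturating differences is
 * always zero, so OR-ing them yields the absolute difference. */
static uint8_t abs_diff_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(sub_sat_u8(a, b) | sub_sat_u8(b, a));
}
```

Each lane of the asm does the same thing on 8 or 16 bytes at once before
the result is widened and squared.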

diff --git a/libavcodec/x86/me_cmp.asm b/libavcodec/x86/me_cmp.asm
index cf4ee941f7..03db905346 100644
--- a/libavcodec/x86/me_cmp.asm
+++ b/libavcodec/x86/me_cmp.asm
@@ -283,7 +283,7 @@ HADAMARD8_DIFF 9
 %macro SUM_SQUARED_ERRORS 1
 cglobal sse%1, 5,5,8, v, pix1, pix2, lsize, h
     pxor      m0, m0         ; mm0 = 0
-    pxor      m7, m7         ; mm7 holds the sum
+    pxor      m5, m5         ; m5 holds the sum
 
 .next2lines: ; FIXME why are these unaligned movs? pix1[] is aligned
     movu      m1, [pix1q]    ; m1 = pix1[0][0-15], [0-7] for mmx
@@ -299,12 +299,12 @@ cglobal sse%1, 5,5,8, v, pix1, pix2, lsize, h
     ; todo: mm1-mm2, mm3-mm4
     ; algo: subtract mm1 from mm2 with saturation and vice versa
     ;       OR the result to get the absolute difference
-    mova      m5, m1
-    mova      m6, m3
+    mova      m6, m1
+    mova      m7, m3
     psubusb   m1, m2
     psubusb   m3, m4
-    psubusb   m2, m5
-    psubusb   m4, m6
+    psubusb   m2, m6
+    psubusb   m4, m7
 
     por       m2, m1
     por       m4, m3
@@ -325,8 +325,8 @@ cglobal sse%1, 5,5,8, v, pix1, pix2, lsize, h
 
     paddd     m1, m2
     paddd     m3, m4
-    paddd     m7, m1
-    paddd     m7, m3
+    paddd     m5, m1
+    paddd     m5, m3
 
 %if %1 == mmsize
     lea    pix1q, [pix1q + 2*lsizeq]
@@ -339,8 +339,8 @@ cglobal sse%1, 5,5,8, v, pix1, pix2, lsize, h
 %endif
     jnz .next2lines
 
-    HADDD     m7, m1
-    movd     eax, m7         ; return value
+    HADDD     m5, m1
+    movd     eax, m5         ; return value
     RET
 %endmacro
 
-- 
2.49.1


>>From d2e5fe5863476a420ae97c07e37e4003d6af3d61 Mon Sep 17 00:00:00 2001
From: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Date: Sun, 2 Nov 2025 18:02:27 +0100
Subject: [PATCH 3/5] avcodec/x86/me_cmp: Add ff_sse8_sse2()

Benchmarks:
sse_1_c:                                                51.9 ( 1.00x)
sse_1_mmx:                                              16.5 ( 3.15x)
sse_1_sse2:                                              9.7 ( 5.36x)

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
---
 libavcodec/x86/me_cmp.asm    | 20 ++++++++++++++++++--
 libavcodec/x86/me_cmp_init.c |  3 +++
 2 files changed, 21 insertions(+), 2 deletions(-)
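
A scalar sketch of what the new 8-pixel path computes (illustration only,
not the checkasm reference): bytes are widened to 16 bit, subtracted,
squared and accumulated, two rows per iteration. The name sse8_ref is
hypothetical, and h is assumed to be even, as in the asm loop.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar reference for an 8-pixel-wide sum of squared errors.
 * h is assumed even: two rows are consumed per outer iteration. */
static int sse8_ref(const uint8_t *pix1, const uint8_t *pix2,
                    ptrdiff_t stride, int h)
{
    int sum = 0;
    for (int y = 0; y < h; y += 2) {
        for (int row = 0; row < 2; row++) {
            const uint8_t *a = pix1 + row * stride;
            const uint8_t *b = pix2 + row * stride;
            for (int x = 0; x < 8; x++) {
                /* widen to 16 bit (punpcklbw with zero), subtract (psubw) */
                int16_t d = (int16_t)(a[x] - b[x]);
                /* square and accumulate (pmaddwd, then paddd) */
                sum += d * d;
            }
        }
        pix1 += 2 * stride;   /* lea pix1q, [pix1q + 2*lsizeq] */
        pix2 += 2 * stride;   /* lea pix2q, [pix2q + 2*lsizeq] */
    }
    return sum;
}
```

Because the differences fit in 16 bit after widening, this path does not
need the saturating-subtract trick used by the 16-pixel code.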

diff --git a/libavcodec/x86/me_cmp.asm b/libavcodec/x86/me_cmp.asm
index 03db905346..57a09a2b75 100644
--- a/libavcodec/x86/me_cmp.asm
+++ b/libavcodec/x86/me_cmp.asm
@@ -281,11 +281,25 @@ HADAMARD8_DIFF 9
 ;               ptrdiff_t line_size, int h)
 
 %macro SUM_SQUARED_ERRORS 1
-cglobal sse%1, 5,5,8, v, pix1, pix2, lsize, h
+cglobal sse%1, 5,5,%1 < mmsize ? 6 : 8, v, pix1, pix2, lsize, h
     pxor      m0, m0         ; mm0 = 0
     pxor      m5, m5         ; m5 holds the sum
 
 .next2lines: ; FIXME why are these unaligned movs? pix1[] is aligned
+%if %1 < mmsize
+    movh      m1, [pix1q]
+    movh      m2, [pix2q]
+    movh      m3, [pix1q+lsizeq]
+    movh      m4, [pix2q+lsizeq]
+    punpcklbw m1, m0
+    punpcklbw m2, m0
+    punpcklbw m3, m0
+    punpcklbw m4, m0
+    psubw     m1, m2
+    psubw     m3, m4
+    pmaddwd   m1, m1
+    pmaddwd   m3, m3
+%else
     movu      m1, [pix1q]    ; m1 = pix1[0][0-15], [0-7] for mmx
     movu      m2, [pix2q]    ; m2 = pix2[0][0-15], [0-7] for mmx
 %if %1 == mmsize
@@ -325,10 +339,11 @@ cglobal sse%1, 5,5,8, v, pix1, pix2, lsize, h
 
     paddd     m1, m2
     paddd     m3, m4
+%endif
     paddd     m5, m1
     paddd     m5, m3
 
-%if %1 == mmsize
+%if %1 <= mmsize
     lea    pix1q, [pix1q + 2*lsizeq]
     lea    pix2q, [pix2q + 2*lsizeq]
     sub       hd, 2
@@ -351,6 +366,7 @@ INIT_MMX mmx
 SUM_SQUARED_ERRORS 16
 
 INIT_XMM sse2
+SUM_SQUARED_ERRORS 8
 SUM_SQUARED_ERRORS 16
 
 ;-----------------------------------------------
diff --git a/libavcodec/x86/me_cmp_init.c b/libavcodec/x86/me_cmp_init.c
index 9b23cbe4dc..dd5ffe0f45 100644
--- a/libavcodec/x86/me_cmp_init.c
+++ b/libavcodec/x86/me_cmp_init.c
@@ -32,6 +32,8 @@ int ff_sum_abs_dctelem_sse2(const int16_t *block);
 int ff_sum_abs_dctelem_ssse3(const int16_t *block);
 int ff_sse8_mmx(MPVEncContext *v, const uint8_t *pix1, const uint8_t *pix2,
                 ptrdiff_t stride, int h);
+int ff_sse8_sse2(MPVEncContext *v, const uint8_t *pix1, const uint8_t *pix2,
+                 ptrdiff_t stride, int h);
 int ff_sse16_mmx(MPVEncContext *v, const uint8_t *pix1, const uint8_t *pix2,
                  ptrdiff_t stride, int h);
 int ff_sse16_sse2(MPVEncContext *v, const uint8_t *pix1, const uint8_t *pix2,
@@ -152,6 +154,7 @@ av_cold void ff_me_cmp_init_x86(MECmpContext *c, AVCodecContext *avctx)
 
     if (EXTERNAL_SSE2(cpu_flags)) {
         c->sse[0] = ff_sse16_sse2;
+        c->sse[1]            = ff_sse8_sse2;
         c->sum_abs_dctelem   = ff_sum_abs_dctelem_sse2;
 
         c->pix_abs[0][0] = ff_sad16_sse2;
-- 
2.49.1


>>From e5d144288852433a7546c8360563089e612b451b Mon Sep 17 00:00:00 2001
From: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Date: Sun, 2 Nov 2025 21:24:49 +0100
Subject: [PATCH 4/5] avcodec/x86/me_cmp: Port nsse{8,16} to SSSE3

Even nsse8 has to operate on eight words and therefore gains
a lot from xmm registers (and pabsw).

Old benchmarks:
nsse_0_c:                                              359.2 ( 1.00x)
nsse_0_mmx:                                            151.8 ( 2.37x)
nsse_1_c:                                              151.2 ( 1.00x)
nsse_1_mmx:                                             77.5 ( 1.95x)

New benchmarks:
nsse_0_c:                                              358.8 ( 1.00x)
nsse_0_ssse3:                                           62.2 ( 5.77x)
nsse_1_c:                                              151.2 ( 1.00x)
nsse_1_ssse3:                                           33.6 ( 4.50x)

The MMX nsse functions have been removed.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
---
 libavcodec/x86/me_cmp.asm    | 106 +++++++++++++++++++----------------
 libavcodec/x86/me_cmp_init.c |  38 ++++++-------
 2 files changed, 75 insertions(+), 69 deletions(-)
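
A scalar sketch of the two ideas in this patch (illustration only): how
the pb_unpack1/pb_unpack2 masks widen seven bytes each via pshufb, and
what the hf_noise functions are meant to sum. pshufb_ref and hf_noise_ref
are hypothetical names; the mask tables are copied from the patch.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* The two shuffle masks from the patch: 0xFF lanes are zeroed by pshufb,
 * so each mask widens seven source bytes to 16-bit words. */
static const uint8_t pb_unpack1[16] = { 0, 0xFF, 1, 0xFF, 2, 0xFF, 3, 0xFF,
                                        4, 0xFF, 5, 0xFF, 6, 0xFF, 0xFF, 0xFF };
static const uint8_t pb_unpack2[16] = { 1, 0xFF, 2, 0xFF, 3, 0xFF, 4, 0xFF,
                                        5, 0xFF, 6, 0xFF, 7, 0xFF, 0xFF, 0xFF };

/* Per-lane pshufb semantics: a mask byte with the high bit set yields 0,
 * otherwise its low four bits select a source byte. */
static void pshufb_ref(uint8_t dst[16], const uint8_t src[16],
                       const uint8_t mask[16])
{
    for (int i = 0; i < 16; i++)
        dst[i] = (mask[i] & 0x80) ? 0 : src[mask[i] & 0x0F];
}

/* Hypothetical scalar model of the high-frequency noise measure for one
 * block of width w (8 or 16) and height h. */
static int hf_noise_ref(const uint8_t *pix, ptrdiff_t stride, int h, int w)
{
    int sum = 0;
    for (int y = 0; y + 1 < h; y++) {
        const uint8_t *cur  = pix + y * stride;
        const uint8_t *next = cur + stride;
        /* only w - 1 horizontal differences exist per row */
        for (int x = 0; x < w - 1; x++) {
            int d_cur  = cur[x]  - cur[x + 1];
            int d_next = next[x] - next[x + 1];
            sum += abs(d_cur - d_next);   /* psubw, pabsw, paddw */
        }
    }
    return sum;
}
```

With the masks, words 0-6 of the two shuffled registers hold pix[x] and
pix[x+1], and the eighth word is zero in both, so it contributes nothing
to the sum.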

diff --git a/libavcodec/x86/me_cmp.asm b/libavcodec/x86/me_cmp.asm
index 57a09a2b75..770a6f22ec 100644
--- a/libavcodec/x86/me_cmp.asm
+++ b/libavcodec/x86/me_cmp.asm
@@ -23,10 +23,15 @@
 
 %include "libavutil/x86/x86util.asm"
 
+SECTION_RODATA
+
 cextern pb_1
 cextern pb_80
 cextern pw_2
 
+pb_unpack1: db 0, 0xFF, 1, 0xFF, 2, 0xFF, 3, 0xFF, 4, 0xFF, 5, 0xFF, 6, 0xFF, 0xFF, 0xFF
+pb_unpack2: db 1, 0xFF, 2, 0xFF, 3, 0xFF, 4, 0xFF, 5, 0xFF, 6, 0xFF, 7, 0xFF, 0xFF, 0xFF
+
 SECTION .text
 
 %macro DIFF_PIXELS_1 4
@@ -403,19 +408,16 @@ INIT_XMM ssse3
 SUM_ABS_DCTELEM 6, 2
 
 ;------------------------------------------------------------------------------
-; int ff_hf_noise*_mmx(const uint8_t *pix1, ptrdiff_t lsize, int h)
+; int ff_hf_noise*_ssse3(const uint8_t *pix1, ptrdiff_t lsize, int h)
 ;------------------------------------------------------------------------------
-; %1 = 8/16. %2-5=m#
-%macro HF_NOISE_PART1 5
-    mova      m%2, [pix1q]
-%if %1 == 8
+; %1 = 8/16, %2-5=m#, %6 = src
+%macro HF_NOISE_PART1 6
+%if %1 == mmsize
+    movu      m%2, [%6]
     mova      m%3, m%2
-    psllq     m%2, 8
-    psrlq     m%3, 8
-    psrlq     m%2, 8
-%else
-    mova      m%3, [pix1q+1]
-%endif
+    pslldq    m%2, 1
+    psrldq    m%3, 1
+    psrldq    m%2, 1
     mova      m%4, m%2
     mova      m%5, m%3
     punpcklbw m%2, m7
@@ -424,57 +426,65 @@ SUM_ABS_DCTELEM 6, 2
     punpckhbw m%5, m7
     psubw     m%2, m%3
     psubw     m%4, m%5
+%else
+    movh      m%2, [%6]
+    pshufb    m%3, m%2, m5
+    pshufb    m%2, m%2, m4
+    psubw     m%2, m%3
+%endif
 %endmacro
 
-; %1-2 = m#
-%macro HF_NOISE_PART2 4
-    psubw     m%1, m%3
-    psubw     m%2, m%4
-    pxor       m3, m3
-    pxor       m1, m1
-    pcmpgtw    m3, m%1
-    pcmpgtw    m1, m%2
-    pxor      m%1, m3
-    pxor      m%2, m1
-    psubw     m%1, m3
-    psubw     m%2, m1
-    paddw     m%2, m%1
-    paddw      m6, m%2
+; %1 = 8/16, %2-5 = m#
+%macro HF_NOISE_PART2 5
+%if %1 == mmsize
+    psubw     m%2, m%3
+    psubw     m%4, m%5
+    pabsw     m%2, m%2
+    pabsw     m%4, m%4
+    paddw     m%2, m%4
+%else
+    psubw     m%2, m%3
+    pabsw     m%2, m%2
+%endif
+    paddw      m0, m%2
 %endmacro
 
 ; %1 = 8/16
 %macro HF_NOISE 1
-cglobal hf_noise%1, 3,3,0, pix1, lsize, h
+cglobal hf_noise%1, 3,3,(%1 == 8) ? 6 : 8, pix1, lsize, h
+%if %1 == 8
+    mova       m4, [pb_unpack1]
+    mova       m5, [pb_unpack2]
+%else
+    pxor       m4, m4
+%endif
     sub        hd, 2
-    pxor       m7, m7
-    pxor       m6, m6
-    HF_NOISE_PART1 %1, 0, 1, 2, 3
-    add     pix1q, lsizeq
-    HF_NOISE_PART1 %1, 4, 1, 5, 3
-    HF_NOISE_PART2     0, 2, 4, 5
-    add     pix1q, lsizeq
+    pxor       m0, m0
+    HF_NOISE_PART1 %1, 1, 2, 5, 7, pix1q
+    HF_NOISE_PART1 %1, 3, 2, 6, 7, pix1q+lsizeq
+    lea     pix1q, [pix1q+2*lsizeq]
+    HF_NOISE_PART2 %1, 1, 3, 5, 6
 .loop:
-    HF_NOISE_PART1 %1, 0, 1, 2, 3
-    HF_NOISE_PART2     4, 5, 0, 2
-    add     pix1q, lsizeq
-    HF_NOISE_PART1 %1, 4, 1, 5, 3
-    HF_NOISE_PART2     0, 2, 4, 5
-    add     pix1q, lsizeq
+    HF_NOISE_PART1 %1, 1, 2, 5, 7, pix1q
+    HF_NOISE_PART2 %1, 3, 1, 6, 5
+    HF_NOISE_PART1 %1, 3, 2, 6, 7, pix1q+lsizeq
+    lea     pix1q, [pix1q+2*lsizeq]
+    HF_NOISE_PART2 %1, 1, 3, 5, 6
     sub        hd, 2
         jne .loop
 
-    mova       m0, m6
-    punpcklwd  m0, m7
-    punpckhwd  m6, m7
-    paddd      m6, m0
-    mova       m0, m6
-    psrlq      m6, 32
-    paddd      m0, m6
-    movd      eax, m0   ; eax = result of hf_noise8;
+%if %1 == 8
+    pxor       m4, m4
+%endif
+    movhlps    m1, m0
+    paddw      m0, m1
+    punpcklwd  m0, m4
+    HADDD      m0, m1
+    movd      eax, m0   ; eax = result of hf_noise;
     RET                 ; return eax;
 %endmacro
 
-INIT_MMX mmx
+INIT_XMM ssse3
 HF_NOISE 8
 HF_NOISE 16
 
diff --git a/libavcodec/x86/me_cmp_init.c b/libavcodec/x86/me_cmp_init.c
index dd5ffe0f45..e166af8dab 100644
--- a/libavcodec/x86/me_cmp_init.c
+++ b/libavcodec/x86/me_cmp_init.c
@@ -38,8 +38,8 @@ int ff_sse16_mmx(MPVEncContext *v, const uint8_t *pix1, const uint8_t *pix2,
                  ptrdiff_t stride, int h);
 int ff_sse16_sse2(MPVEncContext *v, const uint8_t *pix1, const uint8_t *pix2,
                   ptrdiff_t stride, int h);
-int ff_hf_noise8_mmx(const uint8_t *pix1, ptrdiff_t stride, int h);
-int ff_hf_noise16_mmx(const uint8_t *pix1, ptrdiff_t stride, int h);
+int ff_hf_noise8_ssse3(const uint8_t *pix1, ptrdiff_t stride, int h);
+int ff_hf_noise16_ssse3(const uint8_t *pix1, ptrdiff_t stride, int h);
 int ff_sad8_mmxext(MPVEncContext *v, const uint8_t *pix1, const uint8_t *pix2,
                    ptrdiff_t stride, int h);
 int ff_sad16_sse2(MPVEncContext *v, const uint8_t *pix1, const uint8_t *pix2,
@@ -86,17 +86,12 @@ hadamard_func(sse2)
 hadamard_func(ssse3)
 
 #if HAVE_X86ASM
-static int nsse16_mmx(MPVEncContext *c, const uint8_t *pix1, const uint8_t *pix2,
-                      ptrdiff_t stride, int h)
+static int nsse16_ssse3(MPVEncContext *c, const uint8_t *pix1, const uint8_t *pix2,
+                        ptrdiff_t stride, int h)
 {
-    int score1, score2;
-
-    if (c)
-        score1 = c->sse_cmp[0](c, pix1, pix2, stride, h);
-    else
-        score1 = ff_sse16_mmx(c, pix1, pix2, stride, h);
-    score2 = ff_hf_noise16_mmx(pix1, stride, h) + ff_hf_noise8_mmx(pix1+8, stride, h)
-           - ff_hf_noise16_mmx(pix2, stride, h) - ff_hf_noise8_mmx(pix2+8, stride, h);
+    int score1 = ff_sse16_sse2(c, pix1, pix2, stride, h);
+    int score2 = ff_hf_noise16_ssse3(pix1, stride, h) -
+                 ff_hf_noise16_ssse3(pix2, stride, h);
 
     if (c)
         return score1 + FFABS(score2) * c->c.avctx->nsse_weight;
@@ -104,12 +99,12 @@ static int nsse16_mmx(MPVEncContext *c, const uint8_t *pix1, const uint8_t *pix2
         return score1 + FFABS(score2) * 8;
 }
 
-static int nsse8_mmx(MPVEncContext *c, const uint8_t *pix1, const uint8_t *pix2,
-                     ptrdiff_t stride, int h)
+static int nsse8_ssse3(MPVEncContext *c, const uint8_t *pix1, const uint8_t *pix2,
+                       ptrdiff_t stride, int h)
 {
-    int score1 = ff_sse8_mmx(c, pix1, pix2, stride, h);
-    int score2 = ff_hf_noise8_mmx(pix1, stride, h) -
-                 ff_hf_noise8_mmx(pix2, stride, h);
+    int score1 = ff_sse8_sse2(c, pix1, pix2, stride, h);
+    int score2 = ff_hf_noise8_ssse3(pix1, stride, h) -
+                 ff_hf_noise8_ssse3(pix2, stride, h);
 
     if (c)
         return score1 + FFABS(score2) * c->c.avctx->nsse_weight;
@@ -121,14 +116,11 @@ static int nsse8_mmx(MPVEncContext *c, const uint8_t *pix1, const uint8_t *pix2,
 
 av_cold void ff_me_cmp_init_x86(MECmpContext *c, AVCodecContext *avctx)
 {
+#if HAVE_X86ASM
     int cpu_flags = av_get_cpu_flags();
 
     if (EXTERNAL_MMX(cpu_flags)) {
         c->sse[1]            = ff_sse8_mmx;
-#if HAVE_X86ASM
-        c->nsse[0]           = nsse16_mmx;
-        c->nsse[1]           = nsse8_mmx;
-#endif
     }
 
     if (EXTERNAL_MMXEXT(cpu_flags)) {
@@ -191,10 +183,14 @@ av_cold void ff_me_cmp_init_x86(MECmpContext *c, AVCodecContext *avctx)
     }
 
     if (EXTERNAL_SSSE3(cpu_flags)) {
+        c->nsse[0]           = nsse16_ssse3;
+        c->nsse[1]           = nsse8_ssse3;
+
         c->sum_abs_dctelem   = ff_sum_abs_dctelem_ssse3;
 #if HAVE_ALIGNED_STACK
         c->hadamard8_diff[0] = ff_hadamard8_diff16_ssse3;
         c->hadamard8_diff[1] = ff_hadamard8_diff_ssse3;
 #endif
     }
+#endif
 }
-- 
2.49.1


>>From 22067082c467e4f55c7c2b968b4b486674865490 Mon Sep 17 00:00:00 2001
From: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Date: Sun, 2 Nov 2025 21:36:25 +0100
Subject: [PATCH 5/5] avcodec/x86/me_cmp: Remove MMX sse functions

They are overridden by the SSE2 versions and are no longer needed now
that the MMX nsse functions have been removed. Saves 240B here.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
---
 libavcodec/x86/me_cmp.asm    | 29 ++++++-----------------------
 libavcodec/x86/me_cmp_init.c |  8 --------
 2 files changed, 6 insertions(+), 31 deletions(-)
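
Why "overridden" makes the MMX versions dead code (illustration only): the
init function assigns function pointers in order of increasing instruction
set, so a later assignment replaces an earlier one. A miniature of that
pattern follows; the flag values, stubs and pick_sse8 are made up for this
sketch and are not FFmpeg's.

```c
#include <stddef.h>

/* Hypothetical miniature of the init pattern; flag values and names are
 * invented for illustration. */
#define FLAG_MMX  1
#define FLAG_SSE2 2

typedef int (*cmp_fn)(void);

static int sse8_mmx_stub(void)  { return 1; }
static int sse8_sse2_stub(void) { return 2; }

static cmp_fn pick_sse8(int cpu_flags)
{
    cmp_fn fn = NULL;
    if (cpu_flags & FLAG_MMX)
        fn = sse8_mmx_stub;    /* assigned first ... */
    if (cpu_flags & FLAG_SSE2)
        fn = sse8_sse2_stub;   /* ... and replaced whenever SSE2 is present */
    return fn;
}
```

On any CPU that reports SSE2, which includes every x86-64 CPU, the MMX
pointer never survives initialization.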

diff --git a/libavcodec/x86/me_cmp.asm b/libavcodec/x86/me_cmp.asm
index 770a6f22ec..4545eae276 100644
--- a/libavcodec/x86/me_cmp.asm
+++ b/libavcodec/x86/me_cmp.asm
@@ -282,8 +282,8 @@ INIT_XMM ssse3
 %define ABS_SUM_8x8 ABS_SUM_8x8_64
 HADAMARD8_DIFF 9
 
-; int ff_sse*_*(MPVEncContext *v, const uint8_t *pix1, const uint8_t *pix2,
-;               ptrdiff_t line_size, int h)
+; int ff_sse*_sse2(MPVEncContext *v, const uint8_t *pix1, const uint8_t *pix2,
+;                  ptrdiff_t line_size, int h)
 
 %macro SUM_SQUARED_ERRORS 1
 cglobal sse%1, 5,5,%1 < mmsize ? 6 : 8, v, pix1, pix2, lsize, h
@@ -305,15 +305,10 @@ cglobal sse%1, 5,5,%1 < mmsize ? 6 : 8, v, pix1, pix2, lsize, h
     pmaddwd   m1, m1
     pmaddwd   m3, m3
 %else
-    movu      m1, [pix1q]    ; m1 = pix1[0][0-15], [0-7] for mmx
-    movu      m2, [pix2q]    ; m2 = pix2[0][0-15], [0-7] for mmx
-%if %1 == mmsize
-    movu      m3, [pix1q+lsizeq] ; m3 = pix1[1][0-15], [0-7] for mmx
-    movu      m4, [pix2q+lsizeq] ; m4 = pix2[1][0-15], [0-7] for mmx
-%else  ; %1 / 2 == mmsize; mmx only
-    mova      m3, [pix1q+8]  ; m3 = pix1[0][8-15]
-    mova      m4, [pix2q+8]  ; m4 = pix2[0][8-15]
-%endif
+    movu      m1, [pix1q]        ; m1 = pix1[0][0-15]
+    movu      m2, [pix2q]        ; m2 = pix2[0][0-15]
+    movu      m3, [pix1q+lsizeq] ; m3 = pix1[1][0-15]
+    movu      m4, [pix2q+lsizeq] ; m4 = pix2[1][0-15]
 
     ; todo: mm1-mm2, mm3-mm4
     ; algo: subtract mm1 from mm2 with saturation and vice versa
@@ -348,15 +343,9 @@ cglobal sse%1, 5,5,%1 < mmsize ? 6 : 8, v, pix1, pix2, lsize, h
     paddd     m5, m1
     paddd     m5, m3
 
-%if %1 <= mmsize
     lea    pix1q, [pix1q + 2*lsizeq]
     lea    pix2q, [pix2q + 2*lsizeq]
     sub       hd, 2
-%else
-    add    pix1q, lsizeq
-    add    pix2q, lsizeq
-    dec       hd
-%endif
     jnz .next2lines
 
     HADDD     m5, m1
@@ -364,12 +353,6 @@ cglobal sse%1, 5,5,%1 < mmsize ? 6 : 8, v, pix1, pix2, lsize, h
     RET
 %endmacro
 
-INIT_MMX mmx
-SUM_SQUARED_ERRORS 8
-
-INIT_MMX mmx
-SUM_SQUARED_ERRORS 16
-
 INIT_XMM sse2
 SUM_SQUARED_ERRORS 8
 SUM_SQUARED_ERRORS 16
diff --git a/libavcodec/x86/me_cmp_init.c b/libavcodec/x86/me_cmp_init.c
index e166af8dab..d4503eef3b 100644
--- a/libavcodec/x86/me_cmp_init.c
+++ b/libavcodec/x86/me_cmp_init.c
@@ -30,12 +30,8 @@
 
 int ff_sum_abs_dctelem_sse2(const int16_t *block);
 int ff_sum_abs_dctelem_ssse3(const int16_t *block);
-int ff_sse8_mmx(MPVEncContext *v, const uint8_t *pix1, const uint8_t *pix2,
-                ptrdiff_t stride, int h);
 int ff_sse8_sse2(MPVEncContext *v, const uint8_t *pix1, const uint8_t *pix2,
                  ptrdiff_t stride, int h);
-int ff_sse16_mmx(MPVEncContext *v, const uint8_t *pix1, const uint8_t *pix2,
-                 ptrdiff_t stride, int h);
 int ff_sse16_sse2(MPVEncContext *v, const uint8_t *pix1, const uint8_t *pix2,
                   ptrdiff_t stride, int h);
 int ff_hf_noise8_ssse3(const uint8_t *pix1, ptrdiff_t stride, int h);
@@ -119,10 +115,6 @@ av_cold void ff_me_cmp_init_x86(MECmpContext *c, AVCodecContext *avctx)
 #if HAVE_X86ASM
     int cpu_flags = av_get_cpu_flags();
 
-    if (EXTERNAL_MMX(cpu_flags)) {
-        c->sse[1]            = ff_sse8_mmx;
-    }
-
     if (EXTERNAL_MMXEXT(cpu_flags)) {
 #if !HAVE_ALIGNED_STACK
         c->hadamard8_diff[0] = ff_hadamard8_diff16_mmxext;
-- 
2.49.1
