From mboxrd@z Thu Jan 1 00:00:00 1970
MIME-Version: 1.0
To: ffmpeg-devel@ffmpeg.org
Date: Tue, 23 Sep 2025 04:45:32 -0000
Message-ID: <175860273296.25.419656424508006256@463a07221176>
Message-ID-Hash: DEY2N2I6IJGTSWUMCJVLS7THIZ4YJCV7
X-MailFrom: code@ffmpeg.org
X-Mailman-Version: 3.3.10
Precedence: list
Reply-To: FFmpeg development discussions and patches
Subject: [FFmpeg-devel] [PATCH] Nuke a few MMX functions, HpelDSP patches (PR #20582)
List-Id: FFmpeg development discussions and patches
From: mkver via ffmpeg-devel
Cc: mkver
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit

PR #20582 opened by mkver
URL: https://code.ffmpeg.org/FFmpeg/FFmpeg/pulls/20582
Patch URL: https://code.ffmpeg.org/FFmpeg/FFmpeg/pulls/20582.patch

>>From 45f89dbd435e83ed76acc410b14a44dce1a72f95 Mon Sep 17 00:00:00 2001
From: Andreas Rheinhardt
Date: Sun, 21 Sep 2025 22:51:18 +0200
Subject: [PATCH 01/20] avcodec/hpeldsp: Fix documentation

This commit fixes two issues in the documentation:
a) The documentation for {put,avg}_pixels_tab only mentions widths 16
   and 8, although it explicitly mentions that there are four horizontal
   blocksizes. This part of the patch basically reverts
   e5771f4f37b67951485205e110f4da5e7e32ea74.
b) The restrictions on height don't match the reality. While most users
   abide by it, some do not:
   i) vp56.c copies a 16x12 block.
   ii) indeo3 can copy an arbitrary multiple of four lines for block
       widths 4, 8 and 16.
   iii) SVQ3 can use luma block sizes 16x16, 8x16, 16x8, 8x8, 4x8, 8x4
        and 4x4 and the corresponding 8x8, 4x8, 8x4, 4x4, 2x4, 4x2 and
        2x2 chroma block sizes.
This implies that for widths 2 and 4 the height can be as small as two
and is only guaranteed to be even. For all other widths, the height is
only guaranteed to be a multiple of four.

Furthermore, a comment for the SVQ3 blocksizes has been added.

Signed-off-by: Andreas Rheinhardt
---
 libavcodec/hpeldsp.h | 21 +++++++++++----------
 libavcodec/svq3.c | 1 +
 2 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/libavcodec/hpeldsp.h b/libavcodec/hpeldsp.h
index 41a46f0760..1f6a165bf6 100644
--- a/libavcodec/hpeldsp.h
+++ b/libavcodec/hpeldsp.h
@@ -31,11 +31,12 @@
 #include
 #include
 
-/* add and put pixel (decoding) */
-// blocksizes for hpel_pixels_func are 8x4,8x8 16x8 16x16
-// h for hpel_pixels_func is limited to {width/2, width} but never larger
-// than 16 and never smaller than 4
-typedef void (*op_pixels_func)(uint8_t *block /*align width (8 or 16)*/,
+/**
+ * Average and put pixel
+ * Widths can be 16, 8, 4 or 2. For widths 2 and 4, h is always a positive
+ * multiple of 2; otherwise, it is a positive multiple of 4.
+ */
+typedef void (*op_pixels_func)(uint8_t *block /* align width */,
                                const uint8_t *pixels /*align 1*/,
                                ptrdiff_t line_size, int h);
 
@@ -46,8 +47,8 @@ typedef struct HpelDSPContext {
     /**
      * Halfpel motion compensation with rounding (a+b+1)>>1.
      * this is an array[4][4] of motion compensation functions for 4
-     * horizontal blocksizes (8,16) and the 4 halfpel positions
-     * *pixels_tab[ 0->16xH 1->8xH ][ xhalfpel + 2*yhalfpel ]
+     * horizontal blocksizes (2,4,8,16) and the 4 halfpel positions
+     * *pixels_tab[ 0->16xH 1->8xH 2->4xH 3->2xH ][ xhalfpel + 2*yhalfpel ]
      * @param block destination where the result is stored
      * @param pixels source
      * @param line_size number of bytes in a horizontal line of block
@@ -58,8 +59,8 @@ typedef struct HpelDSPContext {
     /**
      * Halfpel motion compensation with rounding (a+b+1)>>1.
      * This is an array[4][4] of motion compensation functions for 4
-     * horizontal blocksizes (8,16) and the 4 halfpel positions
-     * *pixels_tab[ 0->16xH 1->8xH ][ xhalfpel + 2*yhalfpel ]
+     * horizontal blocksizes (2,4,8,16) and the 4 halfpel positions
+     * *pixels_tab[ 0->16xH 1->8xH 2->4xH 3->2xH ][ xhalfpel + 2*yhalfpel ]
      * @param block destination into which the result is averaged (a+b+1)>>1
      * @param pixels source
      * @param line_size number of bytes in a horizontal line of block
@@ -85,7 +86,7 @@ typedef struct HpelDSPContext {
      * Halfpel motion compensation with no rounding (a+b)>>1.
      * this is an array[4] of motion compensation functions for 1
      * horizontal blocksize (16) and the 4 halfpel positions
-     * *pixels_tab[0][ xhalfpel + 2*yhalfpel ]
+     * *pixels_tab[ xhalfpel + 2*yhalfpel ]
      * @param block destination into which the result is averaged (a+b)>>1
      * @param pixels source
      * @param line_size number of bytes in a horizontal line of block
diff --git a/libavcodec/svq3.c b/libavcodec/svq3.c
index 4c4f3018c5..dfcfce77d3 100644
--- a/libavcodec/svq3.c
+++ b/libavcodec/svq3.c
@@ -504,6 +504,7 @@ static inline int svq3_mc_dir(SVQ3Context *s, int size, int mode,
                               int dir, int avg)
 {
     int i, j, k, mx, my, dx, dy, x, y;
+    // 0->16x16,1->8x16,2->16x8,3->8x8,4->4x8,5->8x4,6->4x4
     const int part_width = ((size & 5) == 4) ? 4 : 16 >> (size & 1);
     const int part_height = 16 >> ((unsigned)(size + 1) / 3);
     const int extra_width = (mode == PREDICT_MODE) ? -16 * 6 : 0;
-- 
2.49.1

>>From b9b6c00e625b31e5fe34963d8b3d14ef89086cf8 Mon Sep 17 00:00:00 2001
From: Andreas Rheinhardt
Date: Tue, 23 Sep 2025 05:34:37 +0200
Subject: [PATCH 02/20] avcodec/hpel_template: Fix unintentional usage of unsigned offsets

The value of sizeof() is of type size_t which means that an expression
like src1[i * src_stride1 + 4 * sizeof(pixel)] will use a very large
offset if src_stride1 is sufficiently negative. It works in practice
(because it is correct modulo SIZE_MAX + 1), but UBSan treats it as an
error:
libavcodec/hpel_template.c:104:1: runtime error: addition of unsigned offset to 0x7ffdfa0391d8 overflowed to 0x7ffdfa0391cc

Fix this by casting sizeof(pixel) to int.

(This has been uncovered by a checkasm test for the hpeldsp which will
be added in a later commit.)
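
The conversion described above can be reproduced in isolation. The following is a minimal sketch and not code from this patch: the helper names offset_unsigned and offset_signed are hypothetical, and pixel is assumed to be uint8_t. It only computes the index expression in both forms and does not dereference anything.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint8_t pixel; /* assumption: 8-bit pixels */

/* Index as written before the fix (hypothetical helper): the int product
 * i * stride is converted to size_t before the addition, so a negative
 * stride wraps around to a huge unsigned offset. */
static size_t offset_unsigned(int i, int stride)
{
    return i * stride + 4 * sizeof(pixel);
}

/* Index as written after the fix (hypothetical helper): with sizeof cast
 * to int, the whole expression stays in signed arithmetic. */
static int offset_signed(int i, int stride)
{
    return i * stride + 4 * (int)sizeof(pixel);
}
```

For example, with i = 1 and a stride of -16, offset_signed() returns -12, whereas offset_unsigned() returns SIZE_MAX - 11. Both denote the same byte once the addition wraps, which is why the old code works in practice; the two addresses in the UBSan message are likewise 12 bytes apart.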
Signed-off-by: Andreas Rheinhardt --- libavcodec/hpel_template.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/libavcodec/hpel_template.c b/libavcodec/hpel_template.c index fccfe7610f..77ebcd74a2 100644 --- a/libavcodec/hpel_template.c +++ b/libavcodec/hpel_template.c @@ -40,9 +40,9 @@ static inline void FUNC(OPNAME ## _pixels8_l2)(uint8_t *dst, \ a = AV_RN4P(&src1[i * src_stride1]); \ b = AV_RN4P(&src2[i * src_stride2]); \ OP(*((pixel4 *) &dst[i * dst_stride]), rnd_avg_pixel4(a, b)); \ - a = AV_RN4P(&src1[i * src_stride1 + 4 * sizeof(pixel)]); \ - b = AV_RN4P(&src2[i * src_stride2 + 4 * sizeof(pixel)]); \ - OP(*((pixel4 *) &dst[i * dst_stride + 4 * sizeof(pixel)]), \ + a = AV_RN4P(&src1[i * src_stride1 + 4 * (int)sizeof(pixel)]); \ + b = AV_RN4P(&src2[i * src_stride2 + 4 * (int)sizeof(pixel)]); \ + OP(*((pixel4 *) &dst[i * dst_stride + 4 * (int)sizeof(pixel)]), \ rnd_avg_pixel4(a, b)); \ } \ } \ -- 2.49.1 >>From 17c45ae1a65f20eab35d81d310ff80edde4456c4 Mon Sep 17 00:00:00 2001 From: Andreas Rheinhardt Date: Tue, 23 Sep 2025 06:11:43 +0200 Subject: [PATCH 03/20] avcodec/hpel{dsp,_template}: Use ptrdiff_t for strides Signed-off-by: Andreas Rheinhardt --- libavcodec/hpel_template.c | 24 ++++++++++++------------ libavcodec/hpeldsp.c | 6 +++--- 2 files changed, 15 insertions(+), 15 deletions(-) diff --git a/libavcodec/hpel_template.c b/libavcodec/hpel_template.c index 77ebcd74a2..67bee665a9 100644 --- a/libavcodec/hpel_template.c +++ b/libavcodec/hpel_template.c @@ -29,9 +29,9 @@ static inline void FUNC(OPNAME ## _pixels8_l2)(uint8_t *dst, \ const uint8_t *src1, \ const uint8_t *src2, \ - int dst_stride, \ - int src_stride1, \ - int src_stride2, \ + ptrdiff_t dst_stride, \ + ptrdiff_t src_stride1, \ + ptrdiff_t src_stride2, \ int h) \ { \ int i; \ @@ -50,9 +50,9 @@ static inline void FUNC(OPNAME ## _pixels8_l2)(uint8_t *dst, \ static inline void FUNC(OPNAME ## _pixels4_l2)(uint8_t *dst, \ const uint8_t *src1, \ const uint8_t *src2, \ - int 
dst_stride, \ - int src_stride1, \ - int src_stride2, \ + ptrdiff_t dst_stride, \ + ptrdiff_t src_stride1, \ + ptrdiff_t src_stride2, \ int h) \ { \ int i; \ @@ -67,9 +67,9 @@ static inline void FUNC(OPNAME ## _pixels4_l2)(uint8_t *dst, \ static inline void FUNC(OPNAME ## _pixels2_l2)(uint8_t *dst, \ const uint8_t *src1, \ const uint8_t *src2, \ - int dst_stride, \ - int src_stride1, \ - int src_stride2, \ + ptrdiff_t dst_stride, \ + ptrdiff_t src_stride1, \ + ptrdiff_t src_stride2, \ int h) \ { \ int i; \ @@ -84,9 +84,9 @@ static inline void FUNC(OPNAME ## _pixels2_l2)(uint8_t *dst, \ static inline void FUNC(OPNAME ## _pixels16_l2)(uint8_t *dst, \ const uint8_t *src1, \ const uint8_t *src2, \ - int dst_stride, \ - int src_stride1, \ - int src_stride2, \ + ptrdiff_t dst_stride, \ + ptrdiff_t src_stride1, \ + ptrdiff_t src_stride2, \ int h) \ { \ FUNC(OPNAME ## _pixels8_l2)(dst, src1, src2, dst_stride, \ diff --git a/libavcodec/hpeldsp.c b/libavcodec/hpeldsp.c index db0e02ee93..688939ad3f 100644 --- a/libavcodec/hpeldsp.c +++ b/libavcodec/hpeldsp.c @@ -39,9 +39,9 @@ static inline void OPNAME ## _no_rnd_pixels8_l2_8(uint8_t *dst, \ const uint8_t *src1, \ const uint8_t *src2, \ - int dst_stride, \ - int src_stride1, \ - int src_stride2, \ + ptrdiff_t dst_stride, \ + ptrdiff_t src_stride1,\ + ptrdiff_t src_stride2,\ int h) \ { \ int i; \ -- 2.49.1 >>From 4cf16f9014cee0a6ca00118c95d7772a220ff877 Mon Sep 17 00:00:00 2001 From: Andreas Rheinhardt Date: Mon, 22 Sep 2025 03:43:20 +0200 Subject: [PATCH 04/20] tests/checkasm: Add hpeldsp checkasm Signed-off-by: Andreas Rheinhardt --- tests/checkasm/Makefile | 1 + tests/checkasm/checkasm.c | 3 + tests/checkasm/checkasm.h | 1 + tests/checkasm/hpeldsp.c | 115 ++++++++++++++++++++++++++++++++++++++ tests/fate/checkasm.mak | 1 + 5 files changed, 121 insertions(+) create mode 100644 tests/checkasm/hpeldsp.c diff --git a/tests/checkasm/Makefile b/tests/checkasm/Makefile index 0a54adc96a..c41d719e82 100644 --- 
a/tests/checkasm/Makefile +++ b/tests/checkasm/Makefile @@ -12,6 +12,7 @@ AVCODECOBJS-$(CONFIG_H264CHROMA) += h264chroma.o AVCODECOBJS-$(CONFIG_H264DSP) += h264dsp.o AVCODECOBJS-$(CONFIG_H264PRED) += h264pred.o AVCODECOBJS-$(CONFIG_H264QPEL) += h264qpel.o +AVCODECOBJS-$(CONFIG_HPELDSP) += hpeldsp.o AVCODECOBJS-$(CONFIG_IDCTDSP) += idctdsp.o AVCODECOBJS-$(CONFIG_LLAUDDSP) += llauddsp.o AVCODECOBJS-$(CONFIG_LLVIDDSP) += llviddsp.o diff --git a/tests/checkasm/checkasm.c b/tests/checkasm/checkasm.c index ad4d9b53b6..b23e4ce889 100644 --- a/tests/checkasm/checkasm.c +++ b/tests/checkasm/checkasm.c @@ -184,6 +184,9 @@ static const struct { { "hevc_pel", checkasm_check_hevc_pel }, { "hevc_sao", checkasm_check_hevc_sao }, #endif + #if CONFIG_HPELDSP + { "hpeldsp", checkasm_check_hpeldsp }, + #endif #if CONFIG_HUFFYUV_DECODER { "huffyuvdsp", checkasm_check_huffyuvdsp }, #endif diff --git a/tests/checkasm/checkasm.h b/tests/checkasm/checkasm.h index 1684c427d6..0f02c4fb6d 100644 --- a/tests/checkasm/checkasm.h +++ b/tests/checkasm/checkasm.h @@ -110,6 +110,7 @@ void checkasm_check_hevc_deblock(void); void checkasm_check_hevc_idct(void); void checkasm_check_hevc_pel(void); void checkasm_check_hevc_sao(void); +void checkasm_check_hpeldsp(void); void checkasm_check_huffyuvdsp(void); void checkasm_check_idctdsp(void); void checkasm_check_idet(void); diff --git a/tests/checkasm/hpeldsp.c b/tests/checkasm/hpeldsp.c new file mode 100644 index 0000000000..ba290b3ab8 --- /dev/null +++ b/tests/checkasm/hpeldsp.c @@ -0,0 +1,115 @@ +/* + * This file is part of FFmpeg. + * + * FFmpeg is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. 
+ * + * FFmpeg is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License along + * with FFmpeg; if not, write to the Free Software Foundation, Inc., + * 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA. + */ + +#include +#include + +#include "checkasm.h" +#include "libavutil/intreadwrite.h" +#include "libavutil/macros.h" +#include "libavutil/mem_internal.h" +#include "libavcodec/avcodec.h" +#include "libavcodec/hpeldsp.h" + +#define MAX_BLOCK_SIZE 16 +#define MAX_HEIGHT 16 +#define MAX_STRIDE 64 +// BUF_SIZE is bigger than necessary in order to test strides > block width. +#define BUF_SIZE ((MAX_HEIGHT - 1) * MAX_STRIDE + MAX_BLOCK_SIZE) +// Due to hpel interpolation the input needs to have one more line than +// the output and the last line needs one more element. +// The input is not subject to alignment requirements; making the input buffer +// bigger (by MAX_BLOCK_SIZE - 1) allows us to use a random misalignment. 
+#define INPUT_BUF_SIZE (MAX_HEIGHT * MAX_STRIDE + MAX_BLOCK_SIZE + 1 + (MAX_BLOCK_SIZE - 1)) + +#define randomize_buffers(buf0, buf1) \ + do { \ + static_assert(sizeof(buf0) == sizeof(buf1), "Incompatible buffers"); \ + static_assert(!(sizeof(buf0) % 4), "Tail handling needed"); \ + static_assert(sizeof(buf0[0]) == 1 && sizeof(buf1[0]) == 1, \ + "Pointer arithmetic needs to be adapted"); \ + for (size_t k = 0; k < sizeof(buf0); k += 4) { \ + uint32_t r = rnd(); \ + AV_WN32A(buf0 + k, r); \ + AV_WN32A(buf1 + k, r); \ + } \ + } while (0) + + +void checkasm_check_hpeldsp(void) +{ + DECLARE_ALIGNED(MAX_BLOCK_SIZE, uint8_t, srcbuf0)[INPUT_BUF_SIZE]; + DECLARE_ALIGNED(MAX_BLOCK_SIZE, uint8_t, srcbuf1)[INPUT_BUF_SIZE]; + DECLARE_ALIGNED(MAX_BLOCK_SIZE, uint8_t, dstbuf0)[BUF_SIZE]; + DECLARE_ALIGNED(MAX_BLOCK_SIZE, uint8_t, dstbuf1)[BUF_SIZE]; + HpelDSPContext hdsp; + static const struct { + const char *name; + size_t offset; + unsigned nb_blocksizes; + } tests[] = { +#define TEST(NAME, NB) { .name = #NAME, .offset = offsetof(HpelDSPContext, NAME), .nb_blocksizes = NB } + TEST(put_pixels_tab, 4), + TEST(avg_pixels_tab, 4), + TEST(put_no_rnd_pixels_tab, 2), // put_no_rnd_pixels_tab only has two usable blocksizes + TEST(avg_no_rnd_pixels_tab, 1), + }; + declare_func_emms(AV_CPU_FLAG_MMX | AV_CPU_FLAG_MMXEXT, void, uint8_t *dst, const uint8_t *src, ptrdiff_t stride, int h); + + ff_hpeldsp_init(&hdsp, AV_CODEC_FLAG_BITEXACT); + + for (size_t i = 0; i < FF_ARRAY_ELEMS(tests); ++i) { + op_pixels_func (*func_tab)[4] = (op_pixels_func (*)[4])((char*)&hdsp + tests[i].offset); + for (unsigned j = 0; j < tests[i].nb_blocksizes; ++j) { + const unsigned blocksize = MAX_BLOCK_SIZE >> j; + // h must always be a multiple of four, except when width is two or four. + const unsigned h_mult = blocksize <= 4 ? 2 : 4; + + for (unsigned dxy = 0; dxy < 4; ++dxy) { + if (check_func(func_tab[j][dxy], "%s[%u][%u]", tests[i].name, j, dxy)) { + // Don't always use output that is 16-aligned. 
+ size_t dst_offset = (rnd() % (MAX_BLOCK_SIZE / blocksize)) * blocksize; + size_t src_offset = rnd() % MAX_BLOCK_SIZE; + ptrdiff_t stride = (rnd() % (MAX_STRIDE / blocksize) + 1) * blocksize; + int h = (rnd() % (MAX_HEIGHT / h_mult) + 1) * h_mult; + const uint8_t *src0 = srcbuf0 + src_offset, *src1 = srcbuf1 + src_offset; + uint8_t *dst0 = dstbuf0 + dst_offset, *dst1 = dstbuf1 + dst_offset; + + if (rnd() & 1) { + // Flip stride. + dst1 += (h - 1) * stride; + dst0 += (h - 1) * stride; + // Due to interpolation potentially h + 1 lines are read + // from src, hence h * stride. + src0 += h * stride; + src1 += h * stride; + stride = -stride; + } + + randomize_buffers(srcbuf0, srcbuf1); + randomize_buffers(dstbuf0, dstbuf1); + call_ref(dst0, src0, stride, h); + call_new(dst1, src1, stride, h); + if (memcmp(srcbuf0, srcbuf1, sizeof(srcbuf0)) || memcmp(dstbuf0, dstbuf1, sizeof(dstbuf0))) + fail(); + bench_new(dst0, src0, stride, h); + } + } + } + } +} diff --git a/tests/fate/checkasm.mak b/tests/fate/checkasm.mak index 56476d254c..7570c89ad9 100644 --- a/tests/fate/checkasm.mak +++ b/tests/fate/checkasm.mak @@ -27,6 +27,7 @@ FATE_CHECKASM = fate-checkasm-aacencdsp \ fate-checkasm-hevc_idct \ fate-checkasm-hevc_pel \ fate-checkasm-hevc_sao \ + fate-checkasm-hpeldsp \ fate-checkasm-huffyuvdsp \ fate-checkasm-idctdsp \ fate-checkasm-jpeg2000dsp \ -- 2.49.1 >>From 19a297faf2096279b23e2bff502734c1fd9c5ffb Mon Sep 17 00:00:00 2001 From: Andreas Rheinhardt Date: Mon, 22 Sep 2025 05:24:49 +0200 Subject: [PATCH 05/20] avcodec/x86/hpeldsp_init: Remove MMX functions overridden by MMXEXT Forgotten in a51279bbdea0d6db920d71980262bccd0ce78226 because I only looked for MMX(EXT) functions overridden by SSE2. 
Signed-off-by: Andreas Rheinhardt --- libavcodec/x86/hpeldsp_init.c | 34 +--------------------------------- 1 file changed, 1 insertion(+), 33 deletions(-) diff --git a/libavcodec/x86/hpeldsp_init.c b/libavcodec/x86/hpeldsp_init.c index 6b2ad4494b..c190e7b473 100644 --- a/libavcodec/x86/hpeldsp_init.c +++ b/libavcodec/x86/hpeldsp_init.c @@ -161,38 +161,6 @@ static void avg_no_rnd_pixels8_xy2_mmx(uint8_t *block, const uint8_t *pixels, :FF_REG_a, "memory"); } -static void put_no_rnd_pixels8_x2_mmx(uint8_t *block, const uint8_t *pixels, ptrdiff_t line_size, int h) -{ - MOVQ_BFE(mm6); - __asm__ volatile( - "lea (%3, %3), %%"FF_REG_a" \n\t" - ".p2align 3 \n\t" - "1: \n\t" - "movq (%1), %%mm0 \n\t" - "movq 1(%1), %%mm1 \n\t" - "movq (%1, %3), %%mm2 \n\t" - "movq 1(%1, %3), %%mm3 \n\t" - PAVGBP_MMX_NO_RND(%%mm0, %%mm1, %%mm4, %%mm2, %%mm3, %%mm5) - "movq %%mm4, (%2) \n\t" - "movq %%mm5, (%2, %3) \n\t" - "add %%"FF_REG_a", %1 \n\t" - "add %%"FF_REG_a", %2 \n\t" - "movq (%1), %%mm0 \n\t" - "movq 1(%1), %%mm1 \n\t" - "movq (%1, %3), %%mm2 \n\t" - "movq 1(%1, %3), %%mm3 \n\t" - PAVGBP_MMX_NO_RND(%%mm0, %%mm1, %%mm4, %%mm2, %%mm3, %%mm5) - "movq %%mm4, (%2) \n\t" - "movq %%mm5, (%2, %3) \n\t" - "add %%"FF_REG_a", %1 \n\t" - "add %%"FF_REG_a", %2 \n\t" - "subl $4, %0 \n\t" - "jnz 1b \n\t" - :"+g"(h), "+S"(pixels), "+D"(block) - :"r"((x86_reg)line_size) - :FF_REG_a, "memory"); -} - static void put_no_rnd_pixels16_x2_mmx(uint8_t *block, const uint8_t *pixels, ptrdiff_t line_size, int h) { MOVQ_BFE(mm6); @@ -405,7 +373,7 @@ static void hpeldsp_init_mmx(HpelDSPContext *c, int flags) SET_HPEL_FUNCS12(avg_no_rnd, , 16, mmx); c->avg_no_rnd_pixels_tab[3] = avg_no_rnd_pixels16_xy2_mmx; SET_HPEL_FUNCS03(put, [1], 8, mmx); - SET_HPEL_FUNCS(put_no_rnd, [1], 8, mmx); + SET_HPEL_FUNCS03(put_no_rnd, [1], 8, mmx); #endif } -- 2.49.1 >>From d76026c6179671f7a5e42acf10bb2d8e3690ee0c Mon Sep 17 00:00:00 2001 From: Andreas Rheinhardt Date: Sun, 21 Sep 2025 00:26:32 +0200 Subject: [PATCH 06/20] 
avcodec/x86/hpeldsp_init: Remove MMX(EXT) functions overridden by SSE2FAST CPUs which support SSE2, but not in a fast way (so that they get the additional AV_CPU_FLAG_SSE2SLOW) are ancient nowadays (2007 and older), so ignore the distinction between the two and remove MMX and MMXEXT functions that are now overridden by SSE2 functions. Signed-off-by: Andreas Rheinhardt --- libavcodec/x86/hpeldsp.asm | 40 ---------------------------------- libavcodec/x86/hpeldsp_init.c | 41 +++++------------------------------ 2 files changed, 6 insertions(+), 75 deletions(-) diff --git a/libavcodec/x86/hpeldsp.asm b/libavcodec/x86/hpeldsp.asm index 3bc278618c..b59195de95 100644 --- a/libavcodec/x86/hpeldsp.asm +++ b/libavcodec/x86/hpeldsp.asm @@ -84,47 +84,7 @@ cglobal put_pixels8_x2, 4,5 INIT_MMX mmxext PUT_PIXELS8_X2 - ; void ff_put_pixels16_x2(uint8_t *block, const uint8_t *pixels, ptrdiff_t line_size, int h) -%macro PUT_PIXELS_16 0 -cglobal put_pixels16_x2, 4,5 - lea r4, [r2*2] -.loop: - mova m0, [r1] - mova m1, [r1+r2] - mova m2, [r1+8] - mova m3, [r1+r2+8] - PAVGB m0, [r1+1] - PAVGB m1, [r1+r2+1] - PAVGB m2, [r1+9] - PAVGB m3, [r1+r2+9] - mova [r0], m0 - mova [r0+r2], m1 - mova [r0+8], m2 - mova [r0+r2+8], m3 - add r1, r4 - add r0, r4 - mova m0, [r1] - mova m1, [r1+r2] - mova m2, [r1+8] - mova m3, [r1+r2+8] - PAVGB m0, [r1+1] - PAVGB m1, [r1+r2+1] - PAVGB m2, [r1+9] - PAVGB m3, [r1+r2+9] - add r1, r4 - mova [r0], m0 - mova [r0+r2], m1 - mova [r0+8], m2 - mova [r0+r2+8], m3 - add r0, r4 - sub r3d, 4 - jne .loop - RET -%endmacro - -INIT_MMX mmxext -PUT_PIXELS_16 ; The 8_X2 macro can easily be used here INIT_XMM sse2 PUT_PIXELS8_X2 diff --git a/libavcodec/x86/hpeldsp_init.c b/libavcodec/x86/hpeldsp_init.c index c190e7b473..c0913552d5 100644 --- a/libavcodec/x86/hpeldsp_init.c +++ b/libavcodec/x86/hpeldsp_init.c @@ -36,8 +36,6 @@ void ff_put_pixels8_x2_mmxext(uint8_t *block, const uint8_t *pixels, ptrdiff_t line_size, int h); -void ff_put_pixels16_x2_mmxext(uint8_t *block, const 
uint8_t *pixels, - ptrdiff_t line_size, int h); void ff_put_pixels16_x2_sse2(uint8_t *block, const uint8_t *pixels, ptrdiff_t line_size, int h); void ff_avg_pixels16_x2_sse2(uint8_t *block, const uint8_t *pixels, @@ -66,10 +64,8 @@ void ff_avg_approx_pixels8_xy2_mmxext(uint8_t *block, const uint8_t *pixels, ptrdiff_t line_size, int h); #define put_pixels8_mmx ff_put_pixels8_mmx -#define put_pixels16_mmx ff_put_pixels16_mmx #define put_pixels8_xy2_mmx ff_put_pixels8_xy2_mmx #define put_no_rnd_pixels8_mmx ff_put_pixels8_mmx -#define put_no_rnd_pixels16_mmx ff_put_pixels16_mmx #if HAVE_INLINE_ASM @@ -323,10 +319,6 @@ CALL_2X_PIXELS(put_no_rnd_pixels16_xy2_mmx, put_no_rnd_pixels8_xy2_mmx, 8) #undef DEF #undef SET_RND -#if HAVE_MMX -CALL_2X_PIXELS(put_pixels16_xy2_mmx, ff_put_pixels8_xy2_mmx, 8) -#endif - #endif /* HAVE_INLINE_ASM */ @@ -334,12 +326,7 @@ CALL_2X_PIXELS(put_pixels16_xy2_mmx, ff_put_pixels8_xy2_mmx, 8) #define HPELDSP_AVG_PIXELS16(CPUEXT) \ CALL_2X_PIXELS(put_no_rnd_pixels16_x2 ## CPUEXT, ff_put_no_rnd_pixels8_x2 ## CPUEXT, 8) \ - CALL_2X_PIXELS(put_pixels16_y2 ## CPUEXT, ff_put_pixels8_y2 ## CPUEXT, 8) \ - CALL_2X_PIXELS(put_no_rnd_pixels16_y2 ## CPUEXT, ff_put_no_rnd_pixels8_y2 ## CPUEXT, 8) \ - CALL_2X_PIXELS(avg_pixels16_x2 ## CPUEXT, ff_avg_pixels8_x2 ## CPUEXT, 8) \ - CALL_2X_PIXELS(avg_pixels16_y2 ## CPUEXT, ff_avg_pixels8_y2 ## CPUEXT, 8) \ - CALL_2X_PIXELS(avg_pixels16_xy2 ## CPUEXT, ff_avg_pixels8_xy2 ## CPUEXT, 8) \ - CALL_2X_PIXELS(avg_approx_pixels16_xy2## CPUEXT, ff_avg_approx_pixels8_xy2## CPUEXT, 8) + CALL_2X_PIXELS(put_no_rnd_pixels16_y2 ## CPUEXT, ff_put_no_rnd_pixels8_y2 ## CPUEXT, 8) HPELDSP_AVG_PIXELS16(_mmxext) @@ -359,17 +346,12 @@ HPELDSP_AVG_PIXELS16(_mmxext) c->PFX ## _pixels_tab IDX [1] = PFX ## _pixels ## SIZE ## _x2_ ## CPU; \ c->PFX ## _pixels_tab IDX [2] = PFX ## _pixels ## SIZE ## _y2_ ## CPU; \ } while (0) -#define SET_HPEL_FUNCS(PFX, IDX, SIZE, CPU) \ - do { \ - SET_HPEL_FUNCS03(PFX, IDX, SIZE, CPU); \ - 
SET_HPEL_FUNCS12(PFX, IDX, SIZE, CPU); \ - } while (0) static void hpeldsp_init_mmx(HpelDSPContext *c, int flags) { #if HAVE_MMX_INLINE - SET_HPEL_FUNCS03(put, [0], 16, mmx); - SET_HPEL_FUNCS(put_no_rnd, [0], 16, mmx); + SET_HPEL_FUNCS12(put_no_rnd, [0], 16, mmx); + c->put_no_rnd_pixels_tab[0][3] = put_no_rnd_pixels16_xy2_mmx; SET_HPEL_FUNCS12(avg_no_rnd, , 16, mmx); c->avg_no_rnd_pixels_tab[3] = avg_no_rnd_pixels16_xy2_mmx; SET_HPEL_FUNCS03(put, [1], 8, mmx); @@ -380,14 +362,6 @@ static void hpeldsp_init_mmx(HpelDSPContext *c, int flags) static void hpeldsp_init_mmxext(HpelDSPContext *c, int flags) { #if HAVE_MMXEXT_EXTERNAL - c->put_pixels_tab[0][1] = ff_put_pixels16_x2_mmxext; - c->put_pixels_tab[0][2] = put_pixels16_y2_mmxext; - - c->avg_pixels_tab[0][0] = ff_avg_pixels16_mmxext; - c->avg_pixels_tab[0][1] = avg_pixels16_x2_mmxext; - c->avg_pixels_tab[0][2] = avg_pixels16_y2_mmxext; - c->avg_pixels_tab[0][3] = avg_pixels16_xy2_mmxext; - c->put_pixels_tab[1][1] = ff_put_pixels8_x2_mmxext; c->put_pixels_tab[1][2] = ff_put_pixels8_y2_mmxext; @@ -399,21 +373,18 @@ static void hpeldsp_init_mmxext(HpelDSPContext *c, int flags) c->put_no_rnd_pixels_tab[1][1] = ff_put_no_rnd_pixels8_x2_exact_mmxext; c->put_no_rnd_pixels_tab[1][2] = ff_put_no_rnd_pixels8_y2_exact_mmxext; - c->avg_no_rnd_pixels_tab[0] = ff_avg_pixels16_mmxext; - if (!(flags & AV_CODEC_FLAG_BITEXACT)) { c->put_no_rnd_pixels_tab[0][1] = put_no_rnd_pixels16_x2_mmxext; c->put_no_rnd_pixels_tab[0][2] = put_no_rnd_pixels16_y2_mmxext; c->put_no_rnd_pixels_tab[1][1] = ff_put_no_rnd_pixels8_x2_mmxext; c->put_no_rnd_pixels_tab[1][2] = ff_put_no_rnd_pixels8_y2_mmxext; - c->avg_pixels_tab[0][3] = avg_approx_pixels16_xy2_mmxext; c->avg_pixels_tab[1][3] = ff_avg_approx_pixels8_xy2_mmxext; } #endif /* HAVE_MMXEXT_EXTERNAL */ } -static void hpeldsp_init_sse2_fast(HpelDSPContext *c, int flags) +static void hpeldsp_init_sse2(HpelDSPContext *c, int flags) { #if HAVE_SSE2_EXTERNAL c->put_pixels_tab[0][0] = 
ff_put_pixels16_sse2; @@ -449,8 +420,8 @@ av_cold void ff_hpeldsp_init_x86(HpelDSPContext *c, int flags) if (EXTERNAL_MMXEXT(cpu_flags)) hpeldsp_init_mmxext(c, flags); - if (EXTERNAL_SSE2_FAST(cpu_flags)) - hpeldsp_init_sse2_fast(c, flags); + if (EXTERNAL_SSE2(cpu_flags)) + hpeldsp_init_sse2(c, flags); if (EXTERNAL_SSSE3(cpu_flags)) hpeldsp_init_ssse3(c, flags); -- 2.49.1 >>From 6110a5bafa0bcd5da20d483bad86fe676718f777 Mon Sep 17 00:00:00 2001 From: Andreas Rheinhardt Date: Sun, 21 Sep 2025 01:18:54 +0200 Subject: [PATCH 07/20] avcodec/x86/h264_qpel: Remove MMX(EXT) functions overridden by SSE2FAST CPUs which support SSE2, but not in a fast way (so that they get the additional AV_CPU_FLAG_SSE2SLOW) are ancient nowadays (2007 and older), so ignore the distinction between the two and remove MMX and MMXEXT functions that are now overridden by SSE2 functions. Signed-off-by: Andreas Rheinhardt --- libavcodec/x86/h264_qpel.c | 18 +++++------------- 1 file changed, 5 insertions(+), 13 deletions(-) diff --git a/libavcodec/x86/h264_qpel.c b/libavcodec/x86/h264_qpel.c index d69ccda89c..69ffd001e0 100644 --- a/libavcodec/x86/h264_qpel.c +++ b/libavcodec/x86/h264_qpel.c @@ -46,7 +46,6 @@ void ff_avg_pixels16_l2_mmxext(uint8_t *dst, const uint8_t *src1, const uint8_t #define ff_avg_pixels8_l2_sse2 ff_avg_pixels8_l2_mmxext #define ff_put_pixels16_l2_sse2 ff_put_pixels16_l2_mmxext #define ff_avg_pixels16_l2_sse2 ff_avg_pixels16_l2_mmxext -#define ff_put_pixels16_mmxext ff_put_pixels16_mmx #define ff_put_pixels8_mmxext(...) #define ff_put_pixels4_mmxext(...) 
@@ -217,7 +216,6 @@ static void avg_h264_qpel16_mc00_sse2 (uint8_t *dst, const uint8_t *src, { ff_avg_pixels16_sse2(dst, src, stride, 16); } -#define avg_h264_qpel8_mc00_sse2 avg_h264_qpel8_mc00_mmxext #define H264_MC_C(OPNAME, SIZE, MMX, ALIGN) \ static void av_unused OPNAME ## h264_qpel ## SIZE ## _mc00_ ## MMX (uint8_t *dst, const uint8_t *src, ptrdiff_t stride)\ @@ -359,7 +357,7 @@ QPEL_H264_HV_XMM(avg_,AVG_MMXEXT_OP, ssse3) H264_MC(H264_MC_C_V_H_HV, 4, mmxext, 8) H264_MC(H264_MC_C_H, 8, mmxext, 8) -H264_MC(H264_MC_C_H, 16, mmxext, 8) +H264_MC(H264_MC_H, 16, mmxext, 8) H264_MC_816(H264_MC_V, sse2) H264_MC_816(H264_MC_HV, sse2) H264_MC_816(H264_MC_H, ssse3) @@ -480,10 +478,10 @@ av_cold void ff_h264qpel_init_x86(H264QpelContext *c, int bit_depth) if (EXTERNAL_MMXEXT(cpu_flags)) { if (!high_bit_depth) { - SET_QPEL_FUNCS0123(put_h264_qpel, 0, 16, mmxext, ); + SET_QPEL_FUNCS123 (put_h264_qpel, 0, 16, mmxext, ); SET_QPEL_FUNCS123 (put_h264_qpel, 1, 8, mmxext, ); SET_QPEL_FUNCS_1PP(put_h264_qpel, 2, 4, mmxext, ); - SET_QPEL_FUNCS0123(avg_h264_qpel, 0, 16, mmxext, ); + SET_QPEL_FUNCS123 (avg_h264_qpel, 0, 16, mmxext, ); SET_QPEL_FUNCS0123(avg_h264_qpel, 1, 8, mmxext, ); SET_QPEL_FUNCS(avg_h264_qpel, 2, 4, mmxext, ); } else if (bit_depth == 10) { @@ -506,6 +504,8 @@ av_cold void ff_h264qpel_init_x86(H264QpelContext *c, int bit_depth) H264_QPEL_FUNCS(3, 1, sse2); H264_QPEL_FUNCS(3, 2, sse2); H264_QPEL_FUNCS(3, 3, sse2); + c->put_h264_qpel_pixels_tab[0][0] = put_h264_qpel16_mc00_sse2; + c->avg_h264_qpel_pixels_tab[0][0] = avg_h264_qpel16_mc00_sse2; } if (bit_depth == 10) { @@ -519,14 +519,6 @@ av_cold void ff_h264qpel_init_x86(H264QpelContext *c, int bit_depth) } } - if (EXTERNAL_SSE2_FAST(cpu_flags)) { - if (!high_bit_depth) { - c->put_h264_qpel_pixels_tab[0][0] = put_h264_qpel16_mc00_sse2; - c->avg_h264_qpel_pixels_tab[0][0] = avg_h264_qpel16_mc00_sse2; - c->avg_h264_qpel_pixels_tab[1][0] = avg_h264_qpel8_mc00_sse2; - } - } - if (EXTERNAL_SSSE3(cpu_flags)) { if 
(!high_bit_depth) { H264_QPEL_FUNCS(1, 0, ssse3); -- 2.49.1 >>From 5e2e4491ed77f22e20c8cca90bce968081916692 Mon Sep 17 00:00:00 2001 From: Andreas Rheinhardt Date: Sun, 21 Sep 2025 02:08:03 +0200 Subject: [PATCH 08/20] avcodec/x86/qpeldsp_init: Use SSE2 versions where possible The mc00 versions (i.e. the qdsp functions with no subpixel interpolation) are just wrappers around their fpel versions. There are SSE2 versions of these, yet the qpel code only uses the MMX(EXT) versions. This commit changes this and also removes the MMX(EXT) versions. This also allows removing ff_avg_pixels16_mmxext and ff_put_pixels16_mmx. Signed-off-by: Andreas Rheinhardt --- libavcodec/x86/fpel.asm | 2 -- libavcodec/x86/fpel.h | 4 ---- libavcodec/x86/qpeldsp_init.c | 41 ++++++++++++++++++----------------- 3 files changed, 21 insertions(+), 26 deletions(-) diff --git a/libavcodec/x86/fpel.asm b/libavcodec/x86/fpel.asm index b07b789074..8551ff1ff3 100644 --- a/libavcodec/x86/fpel.asm +++ b/libavcodec/x86/fpel.asm @@ -67,12 +67,10 @@ cglobal %1_pixels%2, 4,5,4 INIT_MMX mmx OP_PIXELS put, 8 -OP_PIXELS put, 16 INIT_MMX mmxext OP_PIXELS avg, 4 OP_PIXELS avg, 8 -OP_PIXELS avg, 16 INIT_XMM sse2 OP_PIXELS put, 16 diff --git a/libavcodec/x86/fpel.h b/libavcodec/x86/fpel.h index 47ffc8eec7..851a70b99f 100644 --- a/libavcodec/x86/fpel.h +++ b/libavcodec/x86/fpel.h @@ -26,14 +26,10 @@ void ff_avg_pixels4_mmxext(uint8_t *block, const uint8_t *pixels, ptrdiff_t line_size, int h); void ff_avg_pixels8_mmxext(uint8_t *block, const uint8_t *pixels, ptrdiff_t line_size, int h); -void ff_avg_pixels16_mmxext(uint8_t *block, const uint8_t *pixels, - ptrdiff_t line_size, int h); void ff_avg_pixels16_sse2(uint8_t *block, const uint8_t *pixels, ptrdiff_t line_size, int h); void ff_put_pixels8_mmx(uint8_t *block, const uint8_t *pixels, ptrdiff_t line_size, int h); -void ff_put_pixels16_mmx(uint8_t *block, const uint8_t *pixels, - ptrdiff_t line_size, int h); void ff_put_pixels16_sse2(uint8_t *block, const uint8_t 
*pixels, ptrdiff_t line_size, int h); diff --git a/libavcodec/x86/qpeldsp_init.c b/libavcodec/x86/qpeldsp_init.c index 3b05e156cc..097cda0106 100644 --- a/libavcodec/x86/qpeldsp_init.c +++ b/libavcodec/x86/qpeldsp_init.c @@ -79,22 +79,10 @@ void ff_avg_mpeg4_qpel8_v_lowpass_mmxext(uint8_t *dst, const uint8_t *src, void ff_put_no_rnd_mpeg4_qpel8_v_lowpass_mmxext(uint8_t *dst, const uint8_t *src, int dstStride, int srcStride); -#define ff_put_no_rnd_pixels16_mmxext ff_put_pixels16_mmx -#define ff_put_no_rnd_pixels8_mmxext ff_put_pixels8_mmx #if HAVE_X86ASM -#define ff_put_pixels16_mmxext ff_put_pixels16_mmx -#define ff_put_pixels8_mmxext ff_put_pixels8_mmx - #define QPEL_OP(OPNAME, RND, MMX) \ -static void OPNAME ## qpel8_mc00_ ## MMX(uint8_t *dst, \ - const uint8_t *src, \ - ptrdiff_t stride) \ -{ \ - ff_ ## OPNAME ## pixels8_ ## MMX(dst, src, stride, 8); \ -} \ - \ static void OPNAME ## qpel8_mc10_ ## MMX(uint8_t *dst, \ const uint8_t *src, \ ptrdiff_t stride) \ @@ -291,13 +279,6 @@ static void OPNAME ## qpel8_mc22_ ## MMX(uint8_t *dst, \ stride, 8); \ } \ \ -static void OPNAME ## qpel16_mc00_ ## MMX(uint8_t *dst, \ - const uint8_t *src, \ - ptrdiff_t stride) \ -{ \ - ff_ ## OPNAME ## pixels16_ ## MMX(dst, src, stride, 16); \ -} \ - \ static void OPNAME ## qpel16_mc10_ ## MMX(uint8_t *dst, \ const uint8_t *src, \ ptrdiff_t stride) \ @@ -504,11 +485,23 @@ QPEL_OP(put_, _, mmxext) QPEL_OP(avg_, _, mmxext) QPEL_OP(put_no_rnd_, _no_rnd_, mmxext) +#define MC00(OPNAME, SIZE, EXT) \ +static void OPNAME ## _qpel ## SIZE ## _mc00_ ## EXT(uint8_t *dst, \ + const uint8_t *src,\ + ptrdiff_t stride) \ +{ \ + ff_ ## OPNAME ## _pixels ## SIZE ##_ ## EXT(dst, src, stride, SIZE);\ +} + +MC00(put, 8, mmx) +MC00(avg, 8, mmxext) +MC00(put, 16, sse2) +MC00(avg, 16, sse2) + #endif /* HAVE_X86ASM */ #define SET_QPEL_FUNCS(PFX, IDX, SIZE, CPU, PREFIX) \ do { \ - c->PFX ## _pixels_tab[IDX][ 0] = PREFIX ## PFX ## SIZE ## _mc00_ ## CPU; \ c->PFX ## _pixels_tab[IDX][ 1] = PREFIX ## PFX ## 
SIZE ## _mc10_ ## CPU; \ c->PFX ## _pixels_tab[IDX][ 2] = PREFIX ## PFX ## SIZE ## _mc20_ ## CPU; \ c->PFX ## _pixels_tab[IDX][ 3] = PREFIX ## PFX ## SIZE ## _mc30_ ## CPU; \ @@ -533,12 +526,20 @@ av_cold void ff_qpeldsp_init_x86(QpelDSPContext *c) if (X86_MMXEXT(cpu_flags)) { #if HAVE_MMXEXT_EXTERNAL SET_QPEL_FUNCS(avg_qpel, 0, 16, mmxext, ); + c->avg_qpel_pixels_tab[1][0] = avg_qpel8_mc00_mmxext; SET_QPEL_FUNCS(avg_qpel, 1, 8, mmxext, ); SET_QPEL_FUNCS(put_qpel, 0, 16, mmxext, ); + c->put_no_rnd_qpel_pixels_tab[1][0] = + c->put_qpel_pixels_tab[1][0] = put_qpel8_mc00_mmx; SET_QPEL_FUNCS(put_qpel, 1, 8, mmxext, ); SET_QPEL_FUNCS(put_no_rnd_qpel, 0, 16, mmxext, ); SET_QPEL_FUNCS(put_no_rnd_qpel, 1, 8, mmxext, ); #endif /* HAVE_MMXEXT_EXTERNAL */ } + if (EXTERNAL_SSE2(cpu_flags)) { + c->put_no_rnd_qpel_pixels_tab[0][0] = + c->put_qpel_pixels_tab[0][0] = put_qpel16_mc00_sse2; + c->avg_qpel_pixels_tab[0][0] = avg_qpel16_mc00_sse2; + } } -- 2.49.1 >>From 311bb85dfaf0960e9d4bfd230a96e7ada7af9e8d Mon Sep 17 00:00:00 2001 From: Andreas Rheinhardt Date: Sun, 21 Sep 2025 05:28:17 +0200 Subject: [PATCH 09/20] avcodec/x86/hpeldsp_init: Remove MMX(EXT) funcs overridden by SSSE3 SSSE3 is already quite old (introduced 2006 for Intel, 2011 for AMD), so that the overwhelming majority of our users (particularly those that actually update their FFmpeg) will be using the SSSE3 versions. This commit therefore removes the MMX(EXT) functions overridden by them (which don't abide by the ABI) to get closer to a removal of emms_c. 
Signed-off-by: Andreas Rheinhardt --- libavcodec/x86/hpeldsp.asm | 40 ----------------------------------- libavcodec/x86/hpeldsp_init.c | 11 +++------- 2 files changed, 3 insertions(+), 48 deletions(-) diff --git a/libavcodec/x86/hpeldsp.asm b/libavcodec/x86/hpeldsp.asm index b59195de95..9d8b58f929 100644 --- a/libavcodec/x86/hpeldsp.asm +++ b/libavcodec/x86/hpeldsp.asm @@ -370,46 +370,6 @@ INIT_XMM sse2 AVG_PIXELS8_Y2 -; void ff_avg_pixels8_xy2(uint8_t *block, const uint8_t *pixels, ptrdiff_t line_size, int h) -; Note this is not correctly rounded, and is therefore used for -; not-bitexact output -INIT_MMX mmxext -cglobal avg_approx_pixels8_xy2, 4,5 - mova m6, [pb_1] - lea r4, [r2*2] - mova m0, [r1] - PAVGB m0, [r1+1] -.loop: - mova m2, [r1+r4] - mova m1, [r1+r2] - psubusb m2, m6 - PAVGB m1, [r1+r2+1] - PAVGB m2, [r1+r4+1] - add r1, r4 - PAVGB m0, m1 - PAVGB m1, m2 - PAVGB m0, [r0] - PAVGB m1, [r0+r2] - mova [r0], m0 - mova [r0+r2], m1 - mova m1, [r1+r2] - mova m0, [r1+r4] - PAVGB m1, [r1+r2+1] - PAVGB m0, [r1+r4+1] - add r0, r4 - add r1, r4 - PAVGB m2, m1 - PAVGB m1, m0 - PAVGB m2, [r0] - PAVGB m1, [r0+r2] - mova [r0], m2 - mova [r0+r2], m1 - add r0, r4 - sub r3d, 4 - jne .loop - RET - - ; void ff_avg_pixels16_xy2(uint8_t *block, const uint8_t *pixels, ptrdiff_t line_size, int h) %macro SET_PIXELS_XY2 1 %if cpuflag(sse2) diff --git a/libavcodec/x86/hpeldsp_init.c b/libavcodec/x86/hpeldsp_init.c index c0913552d5..7ee2db1358 100644 --- a/libavcodec/x86/hpeldsp_init.c +++ b/libavcodec/x86/hpeldsp_init.c @@ -60,11 +60,7 @@ void ff_avg_pixels8_x2_mmxext(uint8_t *block, const uint8_t *pixels, ptrdiff_t line_size, int h); void ff_avg_pixels8_y2_mmxext(uint8_t *block, const uint8_t *pixels, ptrdiff_t line_size, int h); -void ff_avg_approx_pixels8_xy2_mmxext(uint8_t *block, const uint8_t *pixels, - ptrdiff_t line_size, int h); -#define put_pixels8_mmx ff_put_pixels8_mmx -#define put_pixels8_xy2_mmx ff_put_pixels8_xy2_mmx #define put_no_rnd_pixels8_mmx ff_put_pixels8_mmx 
#if HAVE_INLINE_ASM @@ -354,7 +350,9 @@ static void hpeldsp_init_mmx(HpelDSPContext *c, int flags) c->put_no_rnd_pixels_tab[0][3] = put_no_rnd_pixels16_xy2_mmx; SET_HPEL_FUNCS12(avg_no_rnd, , 16, mmx); c->avg_no_rnd_pixels_tab[3] = avg_no_rnd_pixels16_xy2_mmx; - SET_HPEL_FUNCS03(put, [1], 8, mmx); +#if HAVE_MMX_EXTERNAL + c->put_pixels_tab[1][0] = ff_put_pixels8_mmx; +#endif SET_HPEL_FUNCS03(put_no_rnd, [1], 8, mmx); #endif } @@ -368,7 +366,6 @@ static void hpeldsp_init_mmxext(HpelDSPContext *c, int flags) c->avg_pixels_tab[1][0] = ff_avg_pixels8_mmxext; c->avg_pixels_tab[1][1] = ff_avg_pixels8_x2_mmxext; c->avg_pixels_tab[1][2] = ff_avg_pixels8_y2_mmxext; - c->avg_pixels_tab[1][3] = ff_avg_pixels8_xy2_mmxext; c->put_no_rnd_pixels_tab[1][1] = ff_put_no_rnd_pixels8_x2_exact_mmxext; c->put_no_rnd_pixels_tab[1][2] = ff_put_no_rnd_pixels8_y2_exact_mmxext; @@ -378,8 +375,6 @@ static void hpeldsp_init_mmxext(HpelDSPContext *c, int flags) c->put_no_rnd_pixels_tab[0][2] = put_no_rnd_pixels16_y2_mmxext; c->put_no_rnd_pixels_tab[1][1] = ff_put_no_rnd_pixels8_x2_mmxext; c->put_no_rnd_pixels_tab[1][2] = ff_put_no_rnd_pixels8_y2_mmxext; - - c->avg_pixels_tab[1][3] = ff_avg_approx_pixels8_xy2_mmxext; } #endif /* HAVE_MMXEXT_EXTERNAL */ } -- 2.49.1 >>From a1ff28538e3ee94a6ec3877db25a98ea008655d3 Mon Sep 17 00:00:00 2001 From: Andreas Rheinhardt Date: Sun, 21 Sep 2025 05:55:07 +0200 Subject: [PATCH 10/20] avcodec/x86/rv40dsp_init: Remove MMX(EXT) funcs overridden by SSSE3 SSSE3 is already quite old (introduced 2006 for Intel, 2011 for AMD), so that the overwhelming majority of our users (particularly those that actually update their FFmpeg) will be using the SSSE3 versions. This commit therefore removes the MMX(EXT) functions overridden by them (which don't abide by the ABI) to get closer to a removal of emms_c. 
Signed-off-by: Andreas Rheinhardt --- libavcodec/x86/hpeldsp.asm | 6 ------ libavcodec/x86/hpeldsp.h | 8 -------- libavcodec/x86/hpeldsp_init.c | 14 -------------- libavcodec/x86/rv40dsp_init.c | 12 ------------ 4 files changed, 40 deletions(-) diff --git a/libavcodec/x86/hpeldsp.asm b/libavcodec/x86/hpeldsp.asm index 9d8b58f929..859894856d 100644 --- a/libavcodec/x86/hpeldsp.asm +++ b/libavcodec/x86/hpeldsp.asm @@ -372,11 +372,7 @@ AVG_PIXELS8_Y2 ; void ff_avg_pixels16_xy2(uint8_t *block, const uint8_t *pixels, ptrdiff_t line_size, int h) %macro SET_PIXELS_XY2 1 -%if cpuflag(sse2) cglobal %1_pixels16_xy2, 4,5,8 -%else -cglobal %1_pixels8_xy2, 4,5 -%endif pxor m7, m7 mova m6, [pw_2] movu m0, [r1] @@ -448,8 +444,6 @@ cglobal %1_pixels8_xy2, 4,5 RET %endmacro -INIT_MMX mmxext -SET_PIXELS_XY2 avg INIT_XMM sse2 SET_PIXELS_XY2 put SET_PIXELS_XY2 avg diff --git a/libavcodec/x86/hpeldsp.h b/libavcodec/x86/hpeldsp.h index ac7e625fda..8208e43ac1 100644 --- a/libavcodec/x86/hpeldsp.h +++ b/libavcodec/x86/hpeldsp.h @@ -25,22 +25,14 @@ void ff_avg_pixels8_x2_mmx(uint8_t *block, const uint8_t *pixels, ptrdiff_t line_size, int h); -void ff_avg_pixels8_xy2_mmx(uint8_t *block, const uint8_t *pixels, - ptrdiff_t line_size, int h); -void ff_avg_pixels8_xy2_mmxext(uint8_t *block, const uint8_t *pixels, - ptrdiff_t line_size, int h); void ff_avg_pixels8_xy2_ssse3(uint8_t *block, const uint8_t *pixels, ptrdiff_t line_size, int h); -void ff_avg_pixels16_xy2_mmx(uint8_t *block, const uint8_t *pixels, - ptrdiff_t line_size, int h); void ff_avg_pixels16_xy2_sse2(uint8_t *block, const uint8_t *pixels, ptrdiff_t line_size, int h); void ff_avg_pixels16_xy2_ssse3(uint8_t *block, const uint8_t *pixels, ptrdiff_t line_size, int h); -void ff_put_pixels8_xy2_mmx(uint8_t *block, const uint8_t *pixels, - ptrdiff_t line_size, int h); void ff_put_pixels8_xy2_ssse3(uint8_t *block, const uint8_t *pixels, ptrdiff_t line_size, int h); void ff_put_pixels16_xy2_sse2(uint8_t *block, const uint8_t *pixels, 
diff --git a/libavcodec/x86/hpeldsp_init.c b/libavcodec/x86/hpeldsp_init.c index 7ee2db1358..ab32b825c9 100644 --- a/libavcodec/x86/hpeldsp_init.c +++ b/libavcodec/x86/hpeldsp_init.c @@ -301,20 +301,6 @@ CALL_2X_PIXELS(put_no_rnd_pixels16_y2_mmx, put_no_rnd_pixels8_y2_mmx, 8) CALL_2X_PIXELS(avg_no_rnd_pixels16_xy2_mmx, avg_no_rnd_pixels8_xy2_mmx, 8) CALL_2X_PIXELS(put_no_rnd_pixels16_xy2_mmx, put_no_rnd_pixels8_xy2_mmx, 8) #endif - -/***********************************/ -/* MMX rounding */ - -#define SET_RND MOVQ_WTWO -#define DEF(x, y) ff_ ## x ## _ ## y ## _mmx -#define STATIC - -#include "rnd_template.c" - -#undef NO_AVG -#undef DEF -#undef SET_RND - #endif /* HAVE_INLINE_ASM */ diff --git a/libavcodec/x86/rv40dsp_init.c b/libavcodec/x86/rv40dsp_init.c index ab9e644c60..780358abc2 100644 --- a/libavcodec/x86/rv40dsp_init.c +++ b/libavcodec/x86/rv40dsp_init.c @@ -174,34 +174,22 @@ DEFINE_FN(put, 8, ssse3) DEFINE_FN(put, 16, sse2) DEFINE_FN(put, 16, ssse3) -DEFINE_FN(avg, 8, mmxext) DEFINE_FN(avg, 8, ssse3) DEFINE_FN(avg, 16, sse2) DEFINE_FN(avg, 16, ssse3) #endif /* HAVE_X86ASM */ -#if HAVE_MMX_INLINE -DEFINE_FN(put, 8, mmx) -#endif - av_cold void ff_rv40dsp_init_x86(RV34DSPContext *c) { av_unused int cpu_flags = av_get_cpu_flags(); -#if HAVE_MMX_INLINE - if (INLINE_MMX(cpu_flags)) { - c->put_pixels_tab[1][15] = put_rv40_qpel8_mc33_mmx; - } -#endif /* HAVE_MMX_INLINE */ - #if HAVE_X86ASM if (EXTERNAL_MMX(cpu_flags)) { c->put_chroma_pixels_tab[0] = ff_put_rv40_chroma_mc8_mmx; c->put_chroma_pixels_tab[1] = ff_put_rv40_chroma_mc4_mmx; } if (EXTERNAL_MMXEXT(cpu_flags)) { - c->avg_pixels_tab[1][15] = avg_rv40_qpel8_mc33_mmxext; c->avg_chroma_pixels_tab[0] = ff_avg_rv40_chroma_mc8_mmxext; c->avg_chroma_pixels_tab[1] = ff_avg_rv40_chroma_mc4_mmxext; } -- 2.49.1 >>From 2de73b2c42ce522c6ca537a9f58bb45394a4e2b7 Mon Sep 17 00:00:00 2001 From: Andreas Rheinhardt Date: Sun, 21 Sep 2025 16:17:53 +0200 Subject: [PATCH 11/20] avcodec/x86/vvc/sao_10bit: Remove unused functions 
Saves 65280B here. Signed-off-by: Andreas Rheinhardt --- libavcodec/x86/vvc/sao_10bit.asm | 38 -------------------------------- 1 file changed, 38 deletions(-) diff --git a/libavcodec/x86/vvc/sao_10bit.asm b/libavcodec/x86/vvc/sao_10bit.asm index b7d3d08008..ccf14a34a4 100644 --- a/libavcodec/x86/vvc/sao_10bit.asm +++ b/libavcodec/x86/vvc/sao_10bit.asm @@ -28,28 +28,6 @@ H2656_SAO_BAND_FILTER vvc, %1, %2, %3 %endmacro -%macro VVC_SAO_BAND_FILTER_FUNCS 1 - VVC_SAO_BAND_FILTER %1, 8, 1 - VVC_SAO_BAND_FILTER %1, 16, 2 - VVC_SAO_BAND_FILTER %1, 32, 4 - VVC_SAO_BAND_FILTER %1, 48, 6 - VVC_SAO_BAND_FILTER %1, 64, 8 - VVC_SAO_BAND_FILTER %1, 80, 10 - VVC_SAO_BAND_FILTER %1, 96, 12 - VVC_SAO_BAND_FILTER %1, 112, 14 - VVC_SAO_BAND_FILTER %1, 128, 16 -%endmacro - -%macro VVC_SAO_BAND_FILTER_FUNCS 0 - VVC_SAO_BAND_FILTER_FUNCS 10 - VVC_SAO_BAND_FILTER_FUNCS 12 -%endmacro - -INIT_XMM sse2 -VVC_SAO_BAND_FILTER_FUNCS -INIT_XMM avx -VVC_SAO_BAND_FILTER_FUNCS - %if HAVE_AVX2_EXTERNAL %macro VVC_SAO_BAND_FILTER_FUNCS_AVX2 1 @@ -75,22 +53,6 @@ VVC_SAO_BAND_FILTER_FUNCS_AVX2 12 H2656_SAO_EDGE_FILTER vvc, %1, %2, %3 %endmacro -%macro VVC_SAO_EDGE_FILTER_FUNCS 1 - VVC_SAO_EDGE_FILTER %1, 8, 1 - VVC_SAO_EDGE_FILTER %1, 16, 2 - VVC_SAO_EDGE_FILTER %1, 32, 4 - VVC_SAO_EDGE_FILTER %1, 48, 6 - VVC_SAO_EDGE_FILTER %1, 64, 8 - VVC_SAO_EDGE_FILTER %1, 80, 10 - VVC_SAO_EDGE_FILTER %1, 96, 12 - VVC_SAO_EDGE_FILTER %1, 112, 14 - VVC_SAO_EDGE_FILTER %1, 128, 16 -%endmacro - -INIT_XMM sse2 -VVC_SAO_EDGE_FILTER_FUNCS 10 -VVC_SAO_EDGE_FILTER_FUNCS 12 - %if HAVE_AVX2_EXTERNAL %macro VVC_SAO_EDGE_FILTER_FUNCS_AVX2 1 -- 2.49.1 >>From bb6679418e2ca224141308c08bb0081dd0eaede5 Mon Sep 17 00:00:00 2001 From: Andreas Rheinhardt Date: Sun, 21 Sep 2025 06:22:05 +0200 Subject: [PATCH 12/20] avcodec/x86/mpegvideoencdsp_init: Remove MMX, 3DNOW funcs overridden by SSSE3 SSSE3 is already quite old (introduced 2006 for Intel, 2011 for AMD), so that the overwhelming majority of our users (particularly those that 
actually update their FFmpeg) will be using the SSSE3 versions. This commit therefore removes the MMX and 3DNOW functions overridden by them (which don't abide by the ABI) to get closer to a removal of emms_c. Also merge the mpegvideoenc_qns_template.c file into the main file. The 3DNOW functions removed in this commit were the last in the codebase. Signed-off-by: Andreas Rheinhardt --- libavcodec/x86/mpegvideoenc_qns_template.c | 109 --------------- libavcodec/x86/mpegvideoencdsp_init.c | 152 +++++++++++---------- 2 files changed, 83 insertions(+), 178 deletions(-) delete mode 100644 libavcodec/x86/mpegvideoenc_qns_template.c diff --git a/libavcodec/x86/mpegvideoenc_qns_template.c b/libavcodec/x86/mpegvideoenc_qns_template.c deleted file mode 100644 index 0d6454f45f..0000000000 --- a/libavcodec/x86/mpegvideoenc_qns_template.c +++ /dev/null @@ -1,109 +0,0 @@ -/* - * QNS functions are compiled 3 times for MMX/3DNOW/SSSE3 - * Copyright (c) 2004 Michael Niedermayer - * - * MMX optimization by Michael Niedermayer - * 3DNow! and SSSE3 optimization by Zuxy Meng - * - * This file is part of FFmpeg. - * - * FFmpeg is free software; you can redistribute it and/or - * modify it under the terms of the GNU Lesser General Public - * License as published by the Free Software Foundation; either - * version 2.1 of the License, or (at your option) any later version. - * - * FFmpeg is distributed in the hope that it will be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU - * Lesser General Public License for more details. 
- * - * You should have received a copy of the GNU Lesser General Public - * License along with FFmpeg; if not, write to the Free Software - * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA - */ - -#include - -#include "libavutil/avassert.h" -#include "libavutil/common.h" -#include "libavutil/x86/asm.h" - -#include "inline_asm.h" - -#define MAX_ABS (512 >> (SCALE_OFFSET>0 ? SCALE_OFFSET : 0)) - -static int DEF(try_8x8basis)(const int16_t rem[64], const int16_t weight[64], const int16_t basis[64], int scale) -{ - x86_reg i=0; - - av_assert2(FFABS(scale) < MAX_ABS); - scale<<= 16 + SCALE_OFFSET - BASIS_SHIFT + RECON_SHIFT; - - SET_RND(mm6); - __asm__ volatile( - "pxor %%mm7, %%mm7 \n\t" - "movd %4, %%mm5 \n\t" - "punpcklwd %%mm5, %%mm5 \n\t" - "punpcklwd %%mm5, %%mm5 \n\t" - ".p2align 4 \n\t" - "1: \n\t" - "movq (%1, %0), %%mm0 \n\t" - "movq 8(%1, %0), %%mm1 \n\t" - PMULHRW(%%mm0, %%mm1, %%mm5, %%mm6) - "paddw (%2, %0), %%mm0 \n\t" - "paddw 8(%2, %0), %%mm1 \n\t" - "psraw $6, %%mm0 \n\t" - "psraw $6, %%mm1 \n\t" - "pmullw (%3, %0), %%mm0 \n\t" - "pmullw 8(%3, %0), %%mm1 \n\t" - "pmaddwd %%mm0, %%mm0 \n\t" - "pmaddwd %%mm1, %%mm1 \n\t" - "paddd %%mm1, %%mm0 \n\t" - "psrld $4, %%mm0 \n\t" - "paddd %%mm0, %%mm7 \n\t" - "add $16, %0 \n\t" - "cmp $128, %0 \n\t" //FIXME optimize & bench - " jb 1b \n\t" - PHADDD(%%mm7, %%mm6) - "psrld $2, %%mm7 \n\t" - "movd %%mm7, %0 \n\t" - - : "+r" (i) - : "r"(basis), "r"(rem), "r"(weight), "g"(scale) - ); - return i; -} - -static void DEF(add_8x8basis)(int16_t rem[64], const int16_t basis[64], int scale) -{ - x86_reg i=0; - - if(FFABS(scale) < MAX_ABS){ - scale<<= 16 + SCALE_OFFSET - BASIS_SHIFT + RECON_SHIFT; - SET_RND(mm6); - __asm__ volatile( - "movd %3, %%mm5 \n\t" - "punpcklwd %%mm5, %%mm5 \n\t" - "punpcklwd %%mm5, %%mm5 \n\t" - ".p2align 4 \n\t" - "1: \n\t" - "movq (%1, %0), %%mm0 \n\t" - "movq 8(%1, %0), %%mm1 \n\t" - PMULHRW(%%mm0, %%mm1, %%mm5, %%mm6) - "paddw (%2, %0), %%mm0 \n\t" - "paddw 8(%2, 
%0), %%mm1 \n\t" - "movq %%mm0, (%2, %0) \n\t" - "movq %%mm1, 8(%2, %0) \n\t" - "add $16, %0 \n\t" - "cmp $128, %0 \n\t" // FIXME optimize & bench - " jb 1b \n\t" - - : "+r" (i) - : "r"(basis), "r"(rem), "g"(scale) - ); - }else{ - for(i=0; i<8*8; i++){ - rem[i] += (basis[i]*scale + (1<<(BASIS_SHIFT - RECON_SHIFT-1)))>>(BASIS_SHIFT - RECON_SHIFT); - } - } -} diff --git a/libavcodec/x86/mpegvideoencdsp_init.c b/libavcodec/x86/mpegvideoencdsp_init.c index d39091a5c9..78c2ef87b8 100644 --- a/libavcodec/x86/mpegvideoencdsp_init.c +++ b/libavcodec/x86/mpegvideoencdsp_init.c @@ -16,9 +16,13 @@ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA */ +#include + #include "libavutil/attributes.h" #include "libavutil/avassert.h" +#include "libavutil/common.h" #include "libavutil/cpu.h" +#include "libavutil/x86/asm.h" #include "libavutil/x86/cpu.h" #include "libavcodec/avcodec.h" #include "libavcodec/mpegvideoencdsp.h" @@ -28,71 +32,93 @@ int ff_pix_sum16_xop(const uint8_t *pix, ptrdiff_t line_size); int ff_pix_norm1_sse2(const uint8_t *pix, ptrdiff_t line_size); #if HAVE_INLINE_ASM - -#define PHADDD(a, t) \ - "movq " #a ", " #t " \n\t" \ - "psrlq $32, " #a " \n\t" \ - "paddd " #t ", " #a " \n\t" - -/* - * pmulhw: dst[0 - 15] = (src[0 - 15] * dst[0 - 15])[16 - 31] - * pmulhrw: dst[0 - 15] = (src[0 - 15] * dst[0 - 15] + 0x8000)[16 - 31] - * pmulhrsw: dst[0 - 15] = (src[0 - 15] * dst[0 - 15] + 0x4000)[15 - 30] - */ -#define PMULHRW(x, y, s, o) \ - "pmulhw " #s ", " #x " \n\t" \ - "pmulhw " #s ", " #y " \n\t" \ - "paddw " #o ", " #x " \n\t" \ - "paddw " #o ", " #y " \n\t" \ - "psraw $1, " #x " \n\t" \ - "psraw $1, " #y " \n\t" -#define DEF(x) x ## _mmx -#define SET_RND MOVQ_WONE -#define SCALE_OFFSET 1 - -#include "mpegvideoenc_qns_template.c" - -#undef DEF -#undef SET_RND -#undef SCALE_OFFSET -#undef PMULHRW - -#define DEF(x) x ## _3dnow -#define SET_RND(x) -#define SCALE_OFFSET 0 -#define PMULHRW(x, y, s, o) \ - "pmulhrw " #s ", " #x " \n\t" \ - 
"pmulhrw " #s ", " #y " \n\t" - -#include "mpegvideoenc_qns_template.c" - -#undef DEF -#undef SET_RND -#undef SCALE_OFFSET -#undef PMULHRW - #if HAVE_SSSE3_INLINE -#undef PHADDD -#define DEF(x) x ## _ssse3 -#define SET_RND(x) #define SCALE_OFFSET -1 -#define PHADDD(a, t) \ - "pshufw $0x0E, " #a ", " #t " \n\t" \ - /* faster than phaddd on core2 */ \ - "paddd " #t ", " #a " \n\t" - +/* + * pmulhrsw: dst[0 - 15] = (src[0 - 15] * dst[0 - 15] + 0x4000)[15 - 30] + */ #define PMULHRW(x, y, s, o) \ "pmulhrsw " #s ", " #x " \n\t" \ "pmulhrsw " #s ", " #y " \n\t" -#include "mpegvideoenc_qns_template.c" +#define MAX_ABS 512 + +static int try_8x8basis_ssse3(const int16_t rem[64], const int16_t weight[64], const int16_t basis[64], int scale) +{ + x86_reg i=0; + + av_assert2(FFABS(scale) < MAX_ABS); + scale <<= 16 + SCALE_OFFSET - BASIS_SHIFT + RECON_SHIFT; + + __asm__ volatile( + "pxor %%mm7, %%mm7 \n\t" + "movd %4, %%mm5 \n\t" + "punpcklwd %%mm5, %%mm5 \n\t" + "punpcklwd %%mm5, %%mm5 \n\t" + ".p2align 4 \n\t" + "1: \n\t" + "movq (%1, %0), %%mm0 \n\t" + "movq 8(%1, %0), %%mm1 \n\t" + PMULHRW(%%mm0, %%mm1, %%mm5, %%mm6) + "paddw (%2, %0), %%mm0 \n\t" + "paddw 8(%2, %0), %%mm1 \n\t" + "psraw $6, %%mm0 \n\t" + "psraw $6, %%mm1 \n\t" + "pmullw (%3, %0), %%mm0 \n\t" + "pmullw 8(%3, %0), %%mm1 \n\t" + "pmaddwd %%mm0, %%mm0 \n\t" + "pmaddwd %%mm1, %%mm1 \n\t" + "paddd %%mm1, %%mm0 \n\t" + "psrld $4, %%mm0 \n\t" + "paddd %%mm0, %%mm7 \n\t" + "add $16, %0 \n\t" + "cmp $128, %0 \n\t" //FIXME optimize & bench + " jb 1b \n\t" + "pshufw $0x0E, %%mm7, %%mm6 \n\t" + "paddd %%mm6, %%mm7 \n\t" // faster than phaddd on core2 + "psrld $2, %%mm7 \n\t" + "movd %%mm7, %0 \n\t" + + : "+r" (i) + : "r"(basis), "r"(rem), "r"(weight), "g"(scale) + ); + return i; +} + +static void add_8x8basis_ssse3(int16_t rem[64], const int16_t basis[64], int scale) +{ + x86_reg i=0; + + if (FFABS(scale) < MAX_ABS) { + scale <<= 16 + SCALE_OFFSET - BASIS_SHIFT + RECON_SHIFT; + __asm__ volatile( + "movd %3, %%mm5 \n\t" 
+ "punpcklwd %%mm5, %%mm5 \n\t" + "punpcklwd %%mm5, %%mm5 \n\t" + ".p2align 4 \n\t" + "1: \n\t" + "movq (%1, %0), %%mm0 \n\t" + "movq 8(%1, %0), %%mm1 \n\t" + PMULHRW(%%mm0, %%mm1, %%mm5, %%mm6) + "paddw (%2, %0), %%mm0 \n\t" + "paddw 8(%2, %0), %%mm1 \n\t" + "movq %%mm0, (%2, %0) \n\t" + "movq %%mm1, 8(%2, %0) \n\t" + "add $16, %0 \n\t" + "cmp $128, %0 \n\t" // FIXME optimize & bench + " jb 1b \n\t" + + : "+r" (i) + : "r"(basis), "r"(rem), "g"(scale) + ); + } else { + for (i=0; i<8*8; i++) { + rem[i] += (basis[i]*scale + (1<<(BASIS_SHIFT - RECON_SHIFT-1)))>>(BASIS_SHIFT - RECON_SHIFT); + } + } +} -#undef DEF -#undef SET_RND -#undef SCALE_OFFSET -#undef PMULHRW -#undef PHADDD #endif /* HAVE_SSSE3_INLINE */ /* Draw the edges of width 'w' of an image of size width, height */ @@ -197,23 +223,11 @@ av_cold void ff_mpegvideoencdsp_init_x86(MpegvideoEncDSPContext *c, #if HAVE_INLINE_ASM if (INLINE_MMX(cpu_flags)) { - if (!(avctx->flags & AV_CODEC_FLAG_BITEXACT)) { - c->try_8x8basis = try_8x8basis_mmx; - } - c->add_8x8basis = add_8x8basis_mmx; - if (avctx->bits_per_raw_sample <= 8) { c->draw_edges = draw_edges_mmx; } } - if (INLINE_AMD3DNOW(cpu_flags)) { - if (!(avctx->flags & AV_CODEC_FLAG_BITEXACT)) { - c->try_8x8basis = try_8x8basis_3dnow; - } - c->add_8x8basis = add_8x8basis_3dnow; - } - #if HAVE_SSSE3_INLINE if (INLINE_SSSE3(cpu_flags)) { if (!(avctx->flags & AV_CODEC_FLAG_BITEXACT)) { -- 2.49.1 >>From a357241e421ff73bdd0ddf2e879266949a1a9bf9 Mon Sep 17 00:00:00 2001 From: Andreas Rheinhardt Date: Sun, 21 Sep 2025 13:12:31 +0200 Subject: [PATCH 13/20] avfilter/x86/vf_gradfun: Remove MMXEXT func overridden by SSSE3 SSSE3 is already quite old (introduced 2006 for Intel, 2011 for AMD), so that the overwhelming majority of our users (particularly those that actually update their FFmpeg) will be using the SSSE3 version of filter_line. 
This commit therefore removes the overridden MMXEXT version (which didn't abide by the ABI) which allows us to remove an emms_c() from vf_gradfun.c, so that users with SSSE3 no longer pay a price for the mere existence of an MMXEXT version. Signed-off-by: Andreas Rheinhardt --- libavfilter/vf_gradfun.c | 2 -- libavfilter/x86/vf_gradfun.asm | 42 ++++++++----------------------- libavfilter/x86/vf_gradfun_init.c | 22 ---------------- 3 files changed, 10 insertions(+), 56 deletions(-) diff --git a/libavfilter/vf_gradfun.c b/libavfilter/vf_gradfun.c index 088b3c9143..4f211c3ddf 100644 --- a/libavfilter/vf_gradfun.c +++ b/libavfilter/vf_gradfun.c @@ -32,7 +32,6 @@ * Dither it back to 8bit. */ -#include "libavutil/emms.h" #include "libavutil/imgutils.h" #include "libavutil/common.h" #include "libavutil/mem.h" @@ -119,7 +118,6 @@ static void filter(GradFunContext *ctx, uint8_t *dst, const uint8_t *src, int wi ctx->filter_line(dst + y * dst_linesize, src + y * src_linesize, dc - r / 2, width, thresh, dither[y & 7]); if (++y >= height) break; } - emms_c(); } static av_cold int init(AVFilterContext *ctx) diff --git a/libavfilter/x86/vf_gradfun.asm b/libavfilter/x86/vf_gradfun.asm index d106d52100..55e7c1ea0f 100644 --- a/libavfilter/x86/vf_gradfun.asm +++ b/libavfilter/x86/vf_gradfun.asm @@ -27,7 +27,15 @@ pw_ff: times 8 dw 0xFF SECTION .text -%macro FILTER_LINE 1 +INIT_XMM ssse3 +cglobal gradfun_filter_line, 6, 6, 8 + movd m5, r4d + pxor m7, m7 + pshuflw m5, m5, 0 + mova m6, [pw_7f] + punpcklqdq m5, m5 + mova m4, [r5] +.loop: movh m0, [r2+r0] movh m1, [r3+r0] punpcklbw m0, m7 @@ -40,42 +48,12 @@ SECTION .text pminsw m2, m7 pmullw m2, m2 psllw m1, 2 - paddw m0, %1 + paddw m0, m4 pmulhw m1, m2 paddw m0, m1 psraw m0, 7 packuswb m0, m0 movh [r1+r0], m0 -%endmacro - -INIT_MMX mmxext -cglobal gradfun_filter_line, 6, 6 - movh m5, r4d - pxor m7, m7 - pshufw m5, m5,0 - mova m6, [pw_7f] - mova m3, [r5] - mova m4, [r5+8] -.loop: - FILTER_LINE m3 - add r0, 4 - jge .end - FILTER_LINE m4 
- add r0, 4 - jl .loop -.end: - RET - -INIT_XMM ssse3 -cglobal gradfun_filter_line, 6, 6, 8 - movd m5, r4d - pxor m7, m7 - pshuflw m5, m5, 0 - mova m6, [pw_7f] - punpcklqdq m5, m5 - mova m4, [r5] -.loop: - FILTER_LINE m4 add r0, 8 jl .loop RET diff --git a/libavfilter/x86/vf_gradfun_init.c b/libavfilter/x86/vf_gradfun_init.c index 56e6774a79..f262f0a1bb 100644 --- a/libavfilter/x86/vf_gradfun_init.c +++ b/libavfilter/x86/vf_gradfun_init.c @@ -24,9 +24,6 @@ #include "libavutil/x86/cpu.h" #include "libavfilter/gradfun.h" -void ff_gradfun_filter_line_mmxext(intptr_t x, uint8_t *dst, const uint8_t *src, - const uint16_t *dc, int thresh, - const uint16_t *dithers); void ff_gradfun_filter_line_ssse3(intptr_t x, uint8_t *dst, const uint8_t *src, const uint16_t *dc, int thresh, const uint16_t *dithers); @@ -39,23 +36,6 @@ void ff_gradfun_blur_line_movdqu_sse2(intptr_t x, uint16_t *buf, const uint8_t *src1, const uint8_t *src2); #if HAVE_X86ASM -static void gradfun_filter_line_mmxext(uint8_t *dst, const uint8_t *src, - const uint16_t *dc, - int width, int thresh, - const uint16_t *dithers) -{ - intptr_t x; - if (width & 3) { - x = width & ~3; - ff_gradfun_filter_line_c(dst + x, src + x, dc + x / 2, - width - x, thresh, dithers); - width = x; - } - x = -width; - ff_gradfun_filter_line_mmxext(x, dst + width, src + width, dc + width / 2, - thresh, dithers); -} - static void gradfun_filter_line_ssse3(uint8_t *dst, const uint8_t *src, const uint16_t *dc, int width, int thresh, const uint16_t *dithers) @@ -93,8 +73,6 @@ av_cold void ff_gradfun_init_x86(GradFunContext *gf) #if HAVE_X86ASM int cpu_flags = av_get_cpu_flags(); - if (EXTERNAL_MMXEXT(cpu_flags)) - gf->filter_line = gradfun_filter_line_mmxext; if (EXTERNAL_SSSE3(cpu_flags)) gf->filter_line = gradfun_filter_line_ssse3; -- 2.49.1 >>From 9f6592994212f7138497fa9a0419e43a69194b27 Mon Sep 17 00:00:00 2001 From: Andreas Rheinhardt Date: Sun, 21 Sep 2025 15:12:49 +0200 Subject: [PATCH 14/20] avcodec/x86/h264_qpel: Remove 
MMX(EXT) funcs overridden by SSSE3 SSSE3 is already quite old (introduced 2006 for Intel, 2011 for AMD), so that the overwhelming majority of our users (particularly those that actually update their FFmpeg) will be using the SSSE3 versions. This commit therefore removes the MMX(EXT) functions overridden by them (which don't abide by the ABI) to get closer to a removal of emms_c. Signed-off-by: Andreas Rheinhardt --- libavcodec/x86/h264_qpel.c | 34 ++---------------- libavcodec/x86/h264_qpel_8bit.asm | 60 ------------------------------- 2 files changed, 3 insertions(+), 91 deletions(-) diff --git a/libavcodec/x86/h264_qpel.c b/libavcodec/x86/h264_qpel.c index 69ffd001e0..18d80a52f6 100644 --- a/libavcodec/x86/h264_qpel.c +++ b/libavcodec/x86/h264_qpel.c @@ -46,12 +46,10 @@ void ff_avg_pixels16_l2_mmxext(uint8_t *dst, const uint8_t *src1, const uint8_t #define ff_avg_pixels8_l2_sse2 ff_avg_pixels8_l2_mmxext #define ff_put_pixels16_l2_sse2 ff_put_pixels16_l2_mmxext #define ff_avg_pixels16_l2_sse2 ff_avg_pixels16_l2_mmxext -#define ff_put_pixels8_mmxext(...) #define ff_put_pixels4_mmxext(...) 
#define DEF_QPEL(OPNAME)\ void ff_ ## OPNAME ## _h264_qpel4_h_lowpass_mmxext(uint8_t *dst, const uint8_t *src, int dstStride, int srcStride);\ -void ff_ ## OPNAME ## _h264_qpel8_h_lowpass_mmxext(uint8_t *dst, const uint8_t *src, int dstStride, int srcStride);\ void ff_ ## OPNAME ## _h264_qpel8_h_lowpass_ssse3(uint8_t *dst, const uint8_t *src, int dstStride, int srcStride);\ void ff_ ## OPNAME ## _h264_qpel4_h_lowpass_l2_mmxext(uint8_t *dst, const uint8_t *src, const uint8_t *src2, int dstStride, int src2Stride);\ void ff_ ## OPNAME ## _h264_qpel8_h_lowpass_l2_mmxext(uint8_t *dst, const uint8_t *src, const uint8_t *src2, int dstStride, int src2Stride);\ @@ -91,15 +89,6 @@ static av_always_inline void ff_ ## OPNAME ## h264_qpel8or16_hv2_lowpass_ ## MMX }while(w--);\ }\ \ -static av_always_inline void ff_ ## OPNAME ## h264_qpel16_h_lowpass_ ## MMX(uint8_t *dst, const uint8_t *src, int dstStride, int srcStride){\ - ff_ ## OPNAME ## h264_qpel8_h_lowpass_ ## MMX(dst , src , dstStride, srcStride);\ - ff_ ## OPNAME ## h264_qpel8_h_lowpass_ ## MMX(dst+8, src+8, dstStride, srcStride);\ - src += 8*srcStride;\ - dst += 8*dstStride;\ - ff_ ## OPNAME ## h264_qpel8_h_lowpass_ ## MMX(dst , src , dstStride, srcStride);\ - ff_ ## OPNAME ## h264_qpel8_h_lowpass_ ## MMX(dst+8, src+8, dstStride, srcStride);\ -}\ -\ static av_always_inline void ff_ ## OPNAME ## h264_qpel16_h_lowpass_l2_ ## MMX(uint8_t *dst, const uint8_t *src, const uint8_t *src2, int dstStride, int src2Stride){\ ff_ ## OPNAME ## h264_qpel8_h_lowpass_l2_ ## MMX(dst , src , src2 , dstStride, src2Stride);\ ff_ ## OPNAME ## h264_qpel8_h_lowpass_l2_ ## MMX(dst+8, src+8, src2+8, dstStride, src2Stride);\ @@ -196,10 +185,6 @@ static av_always_inline void ff_ ## OPNAME ## h264_qpel16_hv_lowpass_ ## MMX(uin #define ff_put_h264_qpel8or16_hv2_lowpass_sse2 ff_put_h264_qpel8or16_hv2_lowpass_mmxext #define ff_avg_h264_qpel8or16_hv2_lowpass_sse2 ff_avg_h264_qpel8or16_hv2_lowpass_mmxext -#define H264_MC_C_H(OPNAME, SIZE, MMX, ALIGN) \ 
-H264_MC_C(OPNAME, SIZE, MMX, ALIGN)\ -H264_MC_H(OPNAME, SIZE, MMX, ALIGN)\ - #define H264_MC_C_V_H_HV(OPNAME, SIZE, MMX, ALIGN) \ H264_MC_C(OPNAME, SIZE, MMX, ALIGN)\ H264_MC_V(OPNAME, SIZE, MMX, ALIGN)\ @@ -356,8 +341,7 @@ QPEL_H264_HV_XMM(put_, PUT_OP, ssse3) QPEL_H264_HV_XMM(avg_,AVG_MMXEXT_OP, ssse3) H264_MC(H264_MC_C_V_H_HV, 4, mmxext, 8) -H264_MC(H264_MC_C_H, 8, mmxext, 8) -H264_MC(H264_MC_H, 16, mmxext, 8) +H264_MC_C(avg_, 8, mmxext, 8) H264_MC_816(H264_MC_V, sse2) H264_MC_816(H264_MC_HV, sse2) H264_MC_816(H264_MC_H, ssse3) @@ -421,20 +405,11 @@ LUMA_MC_816(10, mc33, sse2) #endif /* HAVE_X86ASM */ -#define SET_QPEL_FUNCS123(PFX, IDX, SIZE, CPU, PREFIX) \ +#define SET_QPEL_FUNCS_1PP(PFX, IDX, SIZE, CPU, PREFIX) \ do { \ c->PFX ## _pixels_tab[IDX][ 1] = PREFIX ## PFX ## SIZE ## _mc10_ ## CPU; \ c->PFX ## _pixels_tab[IDX][ 2] = PREFIX ## PFX ## SIZE ## _mc20_ ## CPU; \ c->PFX ## _pixels_tab[IDX][ 3] = PREFIX ## PFX ## SIZE ## _mc30_ ## CPU; \ - } while (0) -#define SET_QPEL_FUNCS0123(PFX, IDX, SIZE, CPU, PREFIX) \ - do { \ - c->PFX ## _pixels_tab[IDX][ 0] = PREFIX ## PFX ## SIZE ## _mc00_ ## CPU; \ - SET_QPEL_FUNCS123(PFX, IDX, SIZE, CPU, PREFIX); \ - } while (0) -#define SET_QPEL_FUNCS_1PP(PFX, IDX, SIZE, CPU, PREFIX) \ - do { \ - SET_QPEL_FUNCS123(PFX, IDX, SIZE, CPU, PREFIX); \ c->PFX ## _pixels_tab[IDX][ 4] = PREFIX ## PFX ## SIZE ## _mc01_ ## CPU; \ c->PFX ## _pixels_tab[IDX][ 5] = PREFIX ## PFX ## SIZE ## _mc11_ ## CPU; \ c->PFX ## _pixels_tab[IDX][ 6] = PREFIX ## PFX ## SIZE ## _mc21_ ## CPU; \ @@ -478,11 +453,8 @@ av_cold void ff_h264qpel_init_x86(H264QpelContext *c, int bit_depth) if (EXTERNAL_MMXEXT(cpu_flags)) { if (!high_bit_depth) { - SET_QPEL_FUNCS123 (put_h264_qpel, 0, 16, mmxext, ); - SET_QPEL_FUNCS123 (put_h264_qpel, 1, 8, mmxext, ); SET_QPEL_FUNCS_1PP(put_h264_qpel, 2, 4, mmxext, ); - SET_QPEL_FUNCS123 (avg_h264_qpel, 0, 16, mmxext, ); - SET_QPEL_FUNCS0123(avg_h264_qpel, 1, 8, mmxext, ); + c->avg_h264_qpel_pixels_tab[1][0] = 
avg_h264_qpel8_mc00_mmxext; SET_QPEL_FUNCS(avg_h264_qpel, 2, 4, mmxext, ); } else if (bit_depth == 10) { SET_QPEL_FUNCS(put_h264_qpel, 2, 4, 10_mmxext, ff_); diff --git a/libavcodec/x86/h264_qpel_8bit.asm b/libavcodec/x86/h264_qpel_8bit.asm index 4e64329991..89e7c282b2 100644 --- a/libavcodec/x86/h264_qpel_8bit.asm +++ b/libavcodec/x86/h264_qpel_8bit.asm @@ -96,66 +96,6 @@ INIT_MMX mmxext QPEL4_H_LOWPASS_OP put QPEL4_H_LOWPASS_OP avg -%macro QPEL8_H_LOWPASS_OP 1 -cglobal %1_h264_qpel8_h_lowpass, 4,5 ; dst, src, dstStride, srcStride - movsxdifnidn r2, r2d - movsxdifnidn r3, r3d - mov r4d, 8 - pxor m7, m7 - mova m6, [pw_5] -.loop: - mova m0, [r1] - mova m2, [r1+1] - mova m1, m0 - mova m3, m2 - punpcklbw m0, m7 - punpckhbw m1, m7 - punpcklbw m2, m7 - punpckhbw m3, m7 - paddw m0, m2 - paddw m1, m3 - psllw m0, 2 - psllw m1, 2 - mova m2, [r1-1] - mova m4, [r1+2] - mova m3, m2 - mova m5, m4 - punpcklbw m2, m7 - punpckhbw m3, m7 - punpcklbw m4, m7 - punpckhbw m5, m7 - paddw m2, m4 - paddw m5, m3 - psubw m0, m2 - psubw m1, m5 - pmullw m0, m6 - pmullw m1, m6 - movd m2, [r1-2] - movd m5, [r1+7] - punpcklbw m2, m7 - punpcklbw m5, m7 - paddw m2, m3 - paddw m4, m5 - mova m5, [pw_16] - paddw m2, m5 - paddw m4, m5 - paddw m0, m2 - paddw m1, m4 - psraw m0, 5 - psraw m1, 5 - packuswb m0, m1 - op_%1 m0, [r0], m4 - add r0, r2 - add r1, r3 - dec r4d - jg .loop - RET -%endmacro - -INIT_MMX mmxext -QPEL8_H_LOWPASS_OP put -QPEL8_H_LOWPASS_OP avg - %macro QPEL8_H_LOWPASS_OP_XMM 1 cglobal %1_h264_qpel8_h_lowpass, 4,5,8 ; dst, src, dstStride, srcStride movsxdifnidn r2, r2d -- 2.49.1 >>From be0a5f3cae765bf06d445192d838c08d1f6d213e Mon Sep 17 00:00:00 2001 From: Andreas Rheinhardt Date: Mon, 22 Sep 2025 05:41:04 +0200 Subject: [PATCH 15/20] avcodec/hpeldsp: Make put_no_rnd_pixels_tab smaller Only the blocksizes 16 and 8 are implemented, yet the motion estimation code touches the blocksize 4 entries. 
But really nothing touches the blocksize 2 entries, so that we can reduce
the put_no_rnd_pixels_tab array size to [3][4].

Signed-off-by: Andreas Rheinhardt
---
 libavcodec/hpeldsp.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/libavcodec/hpeldsp.h b/libavcodec/hpeldsp.h
index 1f6a165bf6..6c9fdce0c1 100644
--- a/libavcodec/hpeldsp.h
+++ b/libavcodec/hpeldsp.h
@@ -77,10 +77,10 @@ typedef struct HpelDSPContext {
      * @param pixels source
      * @param line_size number of bytes in a horizontal line of block
      * @param h height
-     * @note The size is kept at [4][4] to match the above pixel_tabs and avoid
-     *       out of bounds reads in the motion estimation code.
+     * @note The size is kept at [3][4] to avoid out of bounds accesses
+     *       in the motion estimation code.
      */
-    op_pixels_func put_no_rnd_pixels_tab[4][4];
+    op_pixels_func put_no_rnd_pixels_tab[3][4];
 
     /**
      * Halfpel motion compensation with no rounding (a+b)>>1.
-- 
2.49.1

>>From ea75e807901542aa3834b0505511d6736a5ce3c4 Mon Sep 17 00:00:00 2001
From: Andreas Rheinhardt
Date: Tue, 23 Sep 2025 03:52:28 +0200
Subject: [PATCH 16/20] avcodec/x86/hpeldsp: Add SSE2 put_no_rnd size 16 versions

These currently only exist as MMX and (not bitexact) MMXEXT versions.
The added functions occupy 288B here.
So far, they are only for the x2 and y2 (i.e. right and down,
not down-right) directions.
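
For readers unfamiliar with the bitexact no_rnd variants: the assembly in this patch inverts both inputs with pxor against an all-ones register, applies the rounding byte average (PAVGB), and inverts the result, which should yield the truncating average (a+b)>>1. The scalar sketch below only illustrates that identity; the helper names (rnd_avg_u8, no_rnd_avg_u8, count_no_rnd_mismatches, check_no_rnd_trick) are made up for this note and are not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: per-byte result of PAVGB, i.e. (a + b + 1) >> 1. */
uint8_t rnd_avg_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)((a + b + 1) >> 1);
}

/* Hypothetical helper: complement, rounding average, complement again. */
uint8_t no_rnd_avg_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)~rnd_avg_u8((uint8_t)~a, (uint8_t)~b);
}

/* Counts the byte pairs for which the trick differs from (a + b) >> 1. */
int count_no_rnd_mismatches(void)
{
    int mismatches = 0;
    for (int a = 0; a < 256; a++)
        for (int b = 0; b < 256; b++)
            if (no_rnd_avg_u8((uint8_t)a, (uint8_t)b) != ((a + b) >> 1))
                mismatches++;
    return mismatches;
}

/* Exhaustive check over all 65536 input pairs. */
void check_no_rnd_trick(void)
{
    assert(count_no_rnd_mismatches() == 0);
}
```

Since the check is exhaustive over all byte pairs, it gives some confidence that the pxor/PAVGB/pxor sequence is exact rather than an approximation, which is presumably why the same code can serve as the "_exact" MMXEXT function.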
Signed-off-by: Andreas Rheinhardt --- libavcodec/x86/hpeldsp.asm | 94 +++++++++++++++++++++-------------- libavcodec/x86/hpeldsp_init.c | 10 +++- 2 files changed, 65 insertions(+), 39 deletions(-) diff --git a/libavcodec/x86/hpeldsp.asm b/libavcodec/x86/hpeldsp.asm index 859894856d..522a349e21 100644 --- a/libavcodec/x86/hpeldsp.asm +++ b/libavcodec/x86/hpeldsp.asm @@ -125,38 +125,42 @@ cglobal put_no_rnd_pixels8_x2, 4,5 RET +%macro NO_RND_PIXELS_X2 0 +%if cpuflag(sse2) +cglobal put_no_rnd_pixels16_x2, 4,5,5 +%else ; void ff_put_no_rnd_pixels8_x2_exact(uint8_t *block, const uint8_t *pixels, ptrdiff_t line_size, int h) -INIT_MMX mmxext cglobal put_no_rnd_pixels8_x2_exact, 4,5 +%endif lea r4, [r2*3] - pcmpeqb m6, m6 + pcmpeqb m4, m4 .loop: - mova m0, [r1] - mova m2, [r1+r2] - mova m1, [r1+1] - mova m3, [r1+r2+1] - pxor m0, m6 - pxor m2, m6 - pxor m1, m6 - pxor m3, m6 + movu m0, [r1] + movu m2, [r1+r2] + movu m1, [r1+1] + movu m3, [r1+r2+1] + pxor m0, m4 + pxor m2, m4 + pxor m1, m4 + pxor m3, m4 PAVGB m0, m1 PAVGB m2, m3 - pxor m0, m6 - pxor m2, m6 + pxor m0, m4 + pxor m2, m4 mova [r0], m0 mova [r0+r2], m2 - mova m0, [r1+r2*2] - mova m1, [r1+r2*2+1] - mova m2, [r1+r4] - mova m3, [r1+r4+1] - pxor m0, m6 - pxor m1, m6 - pxor m2, m6 - pxor m3, m6 + movu m0, [r1+r2*2] + movu m1, [r1+r2*2+1] + movu m2, [r1+r4] + movu m3, [r1+r4+1] + pxor m0, m4 + pxor m1, m4 + pxor m2, m4 + pxor m3, m4 PAVGB m0, m1 PAVGB m2, m3 - pxor m0, m6 - pxor m2, m6 + pxor m0, m4 + pxor m2, m4 mova [r0+r2*2], m0 mova [r0+r4], m2 lea r1, [r1+r2*4] @@ -164,7 +168,12 @@ cglobal put_no_rnd_pixels8_x2_exact, 4,5 sub r3d, 4 jg .loop RET +%endmacro +INIT_MMX mmxext +NO_RND_PIXELS_X2 +INIT_XMM sse2 +NO_RND_PIXELS_X2 ; void ff_put_pixels8_y2(uint8_t *block, const uint8_t *pixels, ptrdiff_t line_size, int h) %macro PUT_PIXELS8_Y2 0 @@ -236,33 +245,37 @@ cglobal put_no_rnd_pixels8_y2, 4,5 RET +%macro NO_RND_PIXELS_Y2 0 +%if cpuflag(sse2) +cglobal put_no_rnd_pixels16_y2, 4,5,4 +%else ; void 
ff_put_no_rnd_pixels8_y2_exact(uint8_t *block, const uint8_t *pixels, ptrdiff_t line_size, int h) -INIT_MMX mmxext cglobal put_no_rnd_pixels8_y2_exact, 4,5 +%endif lea r4, [r2*3] - mova m0, [r1] - pcmpeqb m6, m6 + movu m0, [r1] + pcmpeqb m3, m3 add r1, r2 - pxor m0, m6 + pxor m0, m3 .loop: - mova m1, [r1] - mova m2, [r1+r2] - pxor m1, m6 - pxor m2, m6 + movu m1, [r1] + movu m2, [r1+r2] + pxor m1, m3 + pxor m2, m3 PAVGB m0, m1 PAVGB m1, m2 - pxor m0, m6 - pxor m1, m6 + pxor m0, m3 + pxor m1, m3 mova [r0], m0 mova [r0+r2], m1 - mova m1, [r1+r2*2] - mova m0, [r1+r4] - pxor m1, m6 - pxor m0, m6 + movu m1, [r1+r2*2] + movu m0, [r1+r4] + pxor m1, m3 + pxor m0, m3 PAVGB m2, m1 PAVGB m1, m0 - pxor m2, m6 - pxor m1, m6 + pxor m2, m3 + pxor m1, m3 mova [r0+r2*2], m2 mova [r0+r4], m1 lea r1, [r1+r2*4] @@ -270,7 +283,12 @@ cglobal put_no_rnd_pixels8_y2_exact, 4,5 sub r3d, 4 jg .loop RET +%endmacro +INIT_MMX mmxext +NO_RND_PIXELS_Y2 +INIT_XMM sse2 +NO_RND_PIXELS_Y2 ; void ff_avg_pixels8_x2(uint8_t *block, const uint8_t *pixels, ptrdiff_t line_size, int h) %macro AVG_PIXELS8_X2 0 diff --git a/libavcodec/x86/hpeldsp_init.c b/libavcodec/x86/hpeldsp_init.c index ab32b825c9..c8ccd7b011 100644 --- a/libavcodec/x86/hpeldsp_init.c +++ b/libavcodec/x86/hpeldsp_init.c @@ -49,6 +49,8 @@ void ff_put_no_rnd_pixels8_x2_mmxext(uint8_t *block, const uint8_t *pixels, void ff_put_no_rnd_pixels8_x2_exact_mmxext(uint8_t *block, const uint8_t *pixels, ptrdiff_t line_size, int h); +void ff_put_no_rnd_pixels16_x2_sse2(uint8_t *block, const uint8_t *pixels, + ptrdiff_t line_size, int h); void ff_put_pixels8_y2_mmxext(uint8_t *block, const uint8_t *pixels, ptrdiff_t line_size, int h); void ff_put_no_rnd_pixels8_y2_mmxext(uint8_t *block, const uint8_t *pixels, @@ -56,6 +58,8 @@ void ff_put_no_rnd_pixels8_y2_mmxext(uint8_t *block, const uint8_t *pixels, void ff_put_no_rnd_pixels8_y2_exact_mmxext(uint8_t *block, const uint8_t *pixels, ptrdiff_t line_size, int h); +void 
ff_put_no_rnd_pixels16_y2_sse2(uint8_t *block, const uint8_t *pixels, + ptrdiff_t line_size, int h); void ff_avg_pixels8_x2_mmxext(uint8_t *block, const uint8_t *pixels, ptrdiff_t line_size, int h); void ff_avg_pixels8_y2_mmxext(uint8_t *block, const uint8_t *pixels, @@ -369,10 +373,14 @@ static void hpeldsp_init_sse2(HpelDSPContext *c, int flags) { #if HAVE_SSE2_EXTERNAL c->put_pixels_tab[0][0] = ff_put_pixels16_sse2; - c->put_no_rnd_pixels_tab[0][0] = ff_put_pixels16_sse2; c->put_pixels_tab[0][1] = ff_put_pixels16_x2_sse2; c->put_pixels_tab[0][2] = ff_put_pixels16_y2_sse2; c->put_pixels_tab[0][3] = ff_put_pixels16_xy2_sse2; + + c->put_no_rnd_pixels_tab[0][0] = ff_put_pixels16_sse2; + c->put_no_rnd_pixels_tab[0][1] = ff_put_no_rnd_pixels16_x2_sse2; + c->put_no_rnd_pixels_tab[0][2] = ff_put_no_rnd_pixels16_y2_sse2; + c->avg_pixels_tab[0][0] = ff_avg_pixels16_sse2; c->avg_pixels_tab[0][1] = ff_avg_pixels16_x2_sse2; c->avg_pixels_tab[0][2] = ff_avg_pixels16_y2_sse2; -- 2.49.1 >>From 1985a14fb21ff831cc15cf446a90d836e6e2c9e5 Mon Sep 17 00:00:00 2001 From: Andreas Rheinhardt Date: Tue, 23 Sep 2025 04:15:22 +0200 Subject: [PATCH 17/20] avcodec/x86/hpeldsp: Add SSE2 avg_no_rnd size 16 versions These currently only exist as MMX versions. The added functions occupy 320B here. So far, they are only for the x2 and y2 (i.e. right and down, not down-right) directions. 
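
As a rough scalar model of what the avg variant appears to compute: the diff adds a plain pavgb against the destination after the no_rnd interpolation, so the interpolated value is truncated while the blend with the existing block is rounded up. The sketch below assumes exactly that reading; the function names avg_no_rnd_pixels16_y2_ref and check_avg_no_rnd_y2 are hypothetical and do not exist in FFmpeg.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model of the y2 (vertical halfpel) avg_no_rnd case.
 * Reads h + 1 source rows of 16 bytes each. */
void avg_no_rnd_pixels16_y2_ref(uint8_t *block, const uint8_t *pixels,
                                ptrdiff_t line_size, int h)
{
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < 16; x++) {
            /* truncating average of the pixel and the one below it */
            int interp = (pixels[x] + pixels[x + line_size]) >> 1;
            /* rounding average with the existing destination (pavgb) */
            block[x] = (uint8_t)((block[x] + interp + 1) >> 1);
        }
        block  += line_size;
        pixels += line_size;
    }
}

/* Small self-check: (10 + 13) >> 1 = 11, then (20 + 11 + 1) >> 1 = 16. */
void check_avg_no_rnd_y2(void)
{
    uint8_t src[2 * 16], dst[16];
    for (int i = 0; i < 16; i++) {
        src[i]      = 10;
        src[16 + i] = 13;
        dst[i]      = 20;
    }
    avg_no_rnd_pixels16_y2_ref(dst, src, 16, 1);
    for (int i = 0; i < 16; i++)
        assert(dst[i] == 16);
}
```

The x2 case would be the same with pixels[x + 1] in place of pixels[x + line_size].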
Signed-off-by: Andreas Rheinhardt --- libavcodec/x86/hpeldsp.asm | 38 ++++++++++++++++++++++++++--------- libavcodec/x86/hpeldsp_init.c | 7 +++++++ 2 files changed, 35 insertions(+), 10 deletions(-) diff --git a/libavcodec/x86/hpeldsp.asm b/libavcodec/x86/hpeldsp.asm index 522a349e21..e9f988f7b5 100644 --- a/libavcodec/x86/hpeldsp.asm +++ b/libavcodec/x86/hpeldsp.asm @@ -125,12 +125,12 @@ cglobal put_no_rnd_pixels8_x2, 4,5 RET -%macro NO_RND_PIXELS_X2 0 +%macro NO_RND_PIXELS_X2 1 %if cpuflag(sse2) -cglobal put_no_rnd_pixels16_x2, 4,5,5 +cglobal %1_no_rnd_pixels16_x2, 4,5,5 %else ; void ff_put_no_rnd_pixels8_x2_exact(uint8_t *block, const uint8_t *pixels, ptrdiff_t line_size, int h) -cglobal put_no_rnd_pixels8_x2_exact, 4,5 +cglobal %1_no_rnd_pixels8_x2_exact, 4,5 %endif lea r4, [r2*3] pcmpeqb m4, m4 @@ -147,6 +147,10 @@ cglobal put_no_rnd_pixels8_x2_exact, 4,5 PAVGB m2, m3 pxor m0, m4 pxor m2, m4 +%ifidn %1, avg + pavgb m0, [r0] + pavgb m2, [r0+r2] +%endif mova [r0], m0 mova [r0+r2], m2 movu m0, [r1+r2*2] @@ -161,6 +165,10 @@ cglobal put_no_rnd_pixels8_x2_exact, 4,5 PAVGB m2, m3 pxor m0, m4 pxor m2, m4 +%ifidn %1, avg + pavgb m0, [r0+r2*2] + pavgb m2, [r0+r4] +%endif mova [r0+r2*2], m0 mova [r0+r4], m2 lea r1, [r1+r2*4] @@ -171,9 +179,10 @@ cglobal put_no_rnd_pixels8_x2_exact, 4,5 %endmacro INIT_MMX mmxext -NO_RND_PIXELS_X2 +NO_RND_PIXELS_X2 put INIT_XMM sse2 -NO_RND_PIXELS_X2 +NO_RND_PIXELS_X2 avg +NO_RND_PIXELS_X2 put ; void ff_put_pixels8_y2(uint8_t *block, const uint8_t *pixels, ptrdiff_t line_size, int h) %macro PUT_PIXELS8_Y2 0 @@ -245,12 +254,12 @@ cglobal put_no_rnd_pixels8_y2, 4,5 RET -%macro NO_RND_PIXELS_Y2 0 +%macro NO_RND_PIXELS_Y2 1 %if cpuflag(sse2) -cglobal put_no_rnd_pixels16_y2, 4,5,4 +cglobal %1_no_rnd_pixels16_y2, 4,5,4 %else ; void ff_put_no_rnd_pixels8_y2_exact(uint8_t *block, const uint8_t *pixels, ptrdiff_t line_size, int h) -cglobal put_no_rnd_pixels8_y2_exact, 4,5 +cglobal %1_no_rnd_pixels8_y2_exact, 4,5 %endif lea r4, [r2*3] movu m0, [r1] 
@@ -266,6 +275,10 @@ cglobal put_no_rnd_pixels8_y2_exact, 4,5 PAVGB m1, m2 pxor m0, m3 pxor m1, m3 +%ifidn %1, avg + pavgb m0, [r0] + pavgb m1, [r0+r2] +%endif mova [r0], m0 mova [r0+r2], m1 movu m1, [r1+r2*2] @@ -276,6 +289,10 @@ cglobal put_no_rnd_pixels8_y2_exact, 4,5 PAVGB m1, m0 pxor m2, m3 pxor m1, m3 +%ifidn %1, avg + pavgb m2,[r0+r2*2] + pavgb m1,[r0+r4] +%endif mova [r0+r2*2], m2 mova [r0+r4], m1 lea r1, [r1+r2*4] @@ -286,9 +303,10 @@ cglobal put_no_rnd_pixels8_y2_exact, 4,5 %endmacro INIT_MMX mmxext -NO_RND_PIXELS_Y2 +NO_RND_PIXELS_Y2 put INIT_XMM sse2 -NO_RND_PIXELS_Y2 +NO_RND_PIXELS_Y2 avg +NO_RND_PIXELS_Y2 put ; void ff_avg_pixels8_x2(uint8_t *block, const uint8_t *pixels, ptrdiff_t line_size, int h) %macro AVG_PIXELS8_X2 0 diff --git a/libavcodec/x86/hpeldsp_init.c b/libavcodec/x86/hpeldsp_init.c index c8ccd7b011..4f369c9731 100644 --- a/libavcodec/x86/hpeldsp_init.c +++ b/libavcodec/x86/hpeldsp_init.c @@ -51,6 +51,8 @@ void ff_put_no_rnd_pixels8_x2_exact_mmxext(uint8_t *block, ptrdiff_t line_size, int h); void ff_put_no_rnd_pixels16_x2_sse2(uint8_t *block, const uint8_t *pixels, ptrdiff_t line_size, int h); +void ff_avg_no_rnd_pixels16_x2_sse2(uint8_t *block, const uint8_t *pixels, + ptrdiff_t line_size, int h); void ff_put_pixels8_y2_mmxext(uint8_t *block, const uint8_t *pixels, ptrdiff_t line_size, int h); void ff_put_no_rnd_pixels8_y2_mmxext(uint8_t *block, const uint8_t *pixels, @@ -60,6 +62,8 @@ void ff_put_no_rnd_pixels8_y2_exact_mmxext(uint8_t *block, ptrdiff_t line_size, int h); void ff_put_no_rnd_pixels16_y2_sse2(uint8_t *block, const uint8_t *pixels, ptrdiff_t line_size, int h); +void ff_avg_no_rnd_pixels16_y2_sse2(uint8_t *block, const uint8_t *pixels, + ptrdiff_t line_size, int h); void ff_avg_pixels8_x2_mmxext(uint8_t *block, const uint8_t *pixels, ptrdiff_t line_size, int h); void ff_avg_pixels8_y2_mmxext(uint8_t *block, const uint8_t *pixels, @@ -385,7 +389,10 @@ static void hpeldsp_init_sse2(HpelDSPContext *c, int flags) 
c->avg_pixels_tab[0][1] = ff_avg_pixels16_x2_sse2; c->avg_pixels_tab[0][2] = ff_avg_pixels16_y2_sse2; c->avg_pixels_tab[0][3] = ff_avg_pixels16_xy2_sse2; + c->avg_no_rnd_pixels_tab[0] = ff_avg_pixels16_sse2; + c->avg_no_rnd_pixels_tab[1] = ff_avg_no_rnd_pixels16_x2_sse2; + c->avg_no_rnd_pixels_tab[2] = ff_avg_no_rnd_pixels16_y2_sse2; #endif /* HAVE_SSE2_EXTERNAL */ } -- 2.49.1 >>From 5c8e11290064b31bf82357ee535d31eba2cf4ade Mon Sep 17 00:00:00 2001 From: Andreas Rheinhardt Date: Tue, 23 Sep 2025 04:32:55 +0200 Subject: [PATCH 18/20] avcodec/x86/hpeldsp_init: Remove MMX(EXT) funcs overridden by SSE2 This affects the {avg,put}_no_rnd_pixels16_{x,y}2 MMX and (put-only) MMXEXT versions. Removing these functions saved 1184B here. Signed-off-by: Andreas Rheinhardt --- libavcodec/x86/hpeldsp_init.c | 164 ---------------------------------- 1 file changed, 164 deletions(-) diff --git a/libavcodec/x86/hpeldsp_init.c b/libavcodec/x86/hpeldsp_init.c index 4f369c9731..48a1aa7a2c 100644 --- a/libavcodec/x86/hpeldsp_init.c +++ b/libavcodec/x86/hpeldsp_init.c @@ -161,167 +161,12 @@ static void avg_no_rnd_pixels8_xy2_mmx(uint8_t *block, const uint8_t *pixels, :FF_REG_a, "memory"); } -static void put_no_rnd_pixels16_x2_mmx(uint8_t *block, const uint8_t *pixels, ptrdiff_t line_size, int h) -{ - MOVQ_BFE(mm6); - __asm__ volatile( - "lea (%3, %3), %%"FF_REG_a" \n\t" - ".p2align 3 \n\t" - "1: \n\t" - "movq (%1), %%mm0 \n\t" - "movq 1(%1), %%mm1 \n\t" - "movq (%1, %3), %%mm2 \n\t" - "movq 1(%1, %3), %%mm3 \n\t" - PAVGBP_MMX_NO_RND(%%mm0, %%mm1, %%mm4, %%mm2, %%mm3, %%mm5) - "movq %%mm4, (%2) \n\t" - "movq %%mm5, (%2, %3) \n\t" - "movq 8(%1), %%mm0 \n\t" - "movq 9(%1), %%mm1 \n\t" - "movq 8(%1, %3), %%mm2 \n\t" - "movq 9(%1, %3), %%mm3 \n\t" - PAVGBP_MMX_NO_RND(%%mm0, %%mm1, %%mm4, %%mm2, %%mm3, %%mm5) - "movq %%mm4, 8(%2) \n\t" - "movq %%mm5, 8(%2, %3) \n\t" - "add %%"FF_REG_a", %1 \n\t" - "add %%"FF_REG_a", %2 \n\t" - "movq (%1), %%mm0 \n\t" - "movq 1(%1), %%mm1 \n\t" - "movq (%1, %3), 
%%mm2 \n\t" - "movq 1(%1, %3), %%mm3 \n\t" - PAVGBP_MMX_NO_RND(%%mm0, %%mm1, %%mm4, %%mm2, %%mm3, %%mm5) - "movq %%mm4, (%2) \n\t" - "movq %%mm5, (%2, %3) \n\t" - "movq 8(%1), %%mm0 \n\t" - "movq 9(%1), %%mm1 \n\t" - "movq 8(%1, %3), %%mm2 \n\t" - "movq 9(%1, %3), %%mm3 \n\t" - PAVGBP_MMX_NO_RND(%%mm0, %%mm1, %%mm4, %%mm2, %%mm3, %%mm5) - "movq %%mm4, 8(%2) \n\t" - "movq %%mm5, 8(%2, %3) \n\t" - "add %%"FF_REG_a", %1 \n\t" - "add %%"FF_REG_a", %2 \n\t" - "subl $4, %0 \n\t" - "jnz 1b \n\t" - :"+g"(h), "+S"(pixels), "+D"(block) - :"r"((x86_reg)line_size) - :FF_REG_a, "memory"); -} - -static void put_no_rnd_pixels8_y2_mmx(uint8_t *block, const uint8_t *pixels, ptrdiff_t line_size, int h) -{ - MOVQ_BFE(mm6); - __asm__ volatile( - "lea (%3, %3), %%"FF_REG_a" \n\t" - "movq (%1), %%mm0 \n\t" - ".p2align 3 \n\t" - "1: \n\t" - "movq (%1, %3), %%mm1 \n\t" - "movq (%1, %%"FF_REG_a"),%%mm2\n\t" - PAVGBP_MMX_NO_RND(%%mm1, %%mm0, %%mm4, %%mm2, %%mm1, %%mm5) - "movq %%mm4, (%2) \n\t" - "movq %%mm5, (%2, %3) \n\t" - "add %%"FF_REG_a", %1 \n\t" - "add %%"FF_REG_a", %2 \n\t" - "movq (%1, %3), %%mm1 \n\t" - "movq (%1, %%"FF_REG_a"),%%mm0\n\t" - PAVGBP_MMX_NO_RND(%%mm1, %%mm2, %%mm4, %%mm0, %%mm1, %%mm5) - "movq %%mm4, (%2) \n\t" - "movq %%mm5, (%2, %3) \n\t" - "add %%"FF_REG_a", %1 \n\t" - "add %%"FF_REG_a", %2 \n\t" - "subl $4, %0 \n\t" - "jnz 1b \n\t" - :"+g"(h), "+S"(pixels), "+D"(block) - :"r"((x86_reg)line_size) - :FF_REG_a, "memory"); -} - -static void avg_no_rnd_pixels16_x2_mmx(uint8_t *block, const uint8_t *pixels, ptrdiff_t line_size, int h) -{ - MOVQ_BFE(mm6); - __asm__ volatile( - ".p2align 3 \n\t" - "1: \n\t" - "movq (%1), %%mm0 \n\t" - "movq 1(%1), %%mm1 \n\t" - "movq (%2), %%mm3 \n\t" - PAVGB_MMX_NO_RND(%%mm0, %%mm1, %%mm2, %%mm6) - PAVGB_MMX(%%mm3, %%mm2, %%mm0, %%mm6) - "movq %%mm0, (%2) \n\t" - "movq 8(%1), %%mm0 \n\t" - "movq 9(%1), %%mm1 \n\t" - "movq 8(%2), %%mm3 \n\t" - PAVGB_MMX_NO_RND(%%mm0, %%mm1, %%mm2, %%mm6) - PAVGB_MMX(%%mm3, %%mm2, %%mm0, %%mm6) - "movq 
%%mm0, 8(%2) \n\t" - "add %3, %1 \n\t" - "add %3, %2 \n\t" - "subl $1, %0 \n\t" - "jnz 1b \n\t" - :"+g"(h), "+S"(pixels), "+D"(block) - :"r"((x86_reg)line_size) - :"memory"); -} - -static void avg_no_rnd_pixels8_y2_mmx(uint8_t *block, const uint8_t *pixels, ptrdiff_t line_size, int h) -{ - MOVQ_BFE(mm6); - __asm__ volatile( - "lea (%3, %3), %%"FF_REG_a" \n\t" - "movq (%1), %%mm0 \n\t" - ".p2align 3 \n\t" - "1: \n\t" - "movq (%1, %3), %%mm1 \n\t" - "movq (%1, %%"FF_REG_a"), %%mm2 \n\t" - PAVGBP_MMX_NO_RND(%%mm1, %%mm0, %%mm4, %%mm2, %%mm1, %%mm5) - "movq (%2), %%mm3 \n\t" - PAVGB_MMX(%%mm3, %%mm4, %%mm0, %%mm6) - "movq (%2, %3), %%mm3 \n\t" - PAVGB_MMX(%%mm3, %%mm5, %%mm1, %%mm6) - "movq %%mm0, (%2) \n\t" - "movq %%mm1, (%2, %3) \n\t" - "add %%"FF_REG_a", %1 \n\t" - "add %%"FF_REG_a", %2 \n\t" - - "movq (%1, %3), %%mm1 \n\t" - "movq (%1, %%"FF_REG_a"), %%mm0 \n\t" - PAVGBP_MMX_NO_RND(%%mm1, %%mm2, %%mm4, %%mm0, %%mm1, %%mm5) - "movq (%2), %%mm3 \n\t" - PAVGB_MMX(%%mm3, %%mm4, %%mm2, %%mm6) - "movq (%2, %3), %%mm3 \n\t" - PAVGB_MMX(%%mm3, %%mm5, %%mm1, %%mm6) - "movq %%mm2, (%2) \n\t" - "movq %%mm1, (%2, %3) \n\t" - "add %%"FF_REG_a", %1 \n\t" - "add %%"FF_REG_a", %2 \n\t" - - "subl $4, %0 \n\t" - "jnz 1b \n\t" - :"+g"(h), "+S"(pixels), "+D"(block) - :"r"((x86_reg)line_size) - :FF_REG_a, "memory"); -} - #if HAVE_MMX -CALL_2X_PIXELS(avg_no_rnd_pixels16_y2_mmx, avg_no_rnd_pixels8_y2_mmx, 8) -CALL_2X_PIXELS(put_no_rnd_pixels16_y2_mmx, put_no_rnd_pixels8_y2_mmx, 8) - CALL_2X_PIXELS(avg_no_rnd_pixels16_xy2_mmx, avg_no_rnd_pixels8_xy2_mmx, 8) CALL_2X_PIXELS(put_no_rnd_pixels16_xy2_mmx, put_no_rnd_pixels8_xy2_mmx, 8) #endif #endif /* HAVE_INLINE_ASM */ - -#if HAVE_X86ASM - -#define HPELDSP_AVG_PIXELS16(CPUEXT) \ - CALL_2X_PIXELS(put_no_rnd_pixels16_x2 ## CPUEXT, ff_put_no_rnd_pixels8_x2 ## CPUEXT, 8) \ - CALL_2X_PIXELS(put_no_rnd_pixels16_y2 ## CPUEXT, ff_put_no_rnd_pixels8_y2 ## CPUEXT, 8) - -HPELDSP_AVG_PIXELS16(_mmxext) - -#endif /* HAVE_X86ASM */ - #define 
SET_HPEL_FUNCS_EXT(PFX, IDX, SIZE, CPU) \ if (HAVE_MMX_EXTERNAL) \ c->PFX ## _pixels_tab IDX [0] = PFX ## _pixels ## SIZE ## _ ## CPU @@ -331,18 +176,11 @@ HPELDSP_AVG_PIXELS16(_mmxext) SET_HPEL_FUNCS_EXT(PFX, IDX, SIZE, CPU); \ c->PFX ## _pixels_tab IDX [3] = PFX ## _pixels ## SIZE ## _xy2_ ## CPU; \ } while (0) -#define SET_HPEL_FUNCS12(PFX, IDX, SIZE, CPU) \ - do { \ - c->PFX ## _pixels_tab IDX [1] = PFX ## _pixels ## SIZE ## _x2_ ## CPU; \ - c->PFX ## _pixels_tab IDX [2] = PFX ## _pixels ## SIZE ## _y2_ ## CPU; \ - } while (0) static void hpeldsp_init_mmx(HpelDSPContext *c, int flags) { #if HAVE_MMX_INLINE - SET_HPEL_FUNCS12(put_no_rnd, [0], 16, mmx); c->put_no_rnd_pixels_tab[0][3] = put_no_rnd_pixels16_xy2_mmx; - SET_HPEL_FUNCS12(avg_no_rnd, , 16, mmx); c->avg_no_rnd_pixels_tab[3] = avg_no_rnd_pixels16_xy2_mmx; #if HAVE_MMX_EXTERNAL c->put_pixels_tab[1][0] = ff_put_pixels8_mmx; @@ -365,8 +203,6 @@ static void hpeldsp_init_mmxext(HpelDSPContext *c, int flags) c->put_no_rnd_pixels_tab[1][2] = ff_put_no_rnd_pixels8_y2_exact_mmxext; if (!(flags & AV_CODEC_FLAG_BITEXACT)) { - c->put_no_rnd_pixels_tab[0][1] = put_no_rnd_pixels16_x2_mmxext; - c->put_no_rnd_pixels_tab[0][2] = put_no_rnd_pixels16_y2_mmxext; c->put_no_rnd_pixels_tab[1][1] = ff_put_no_rnd_pixels8_x2_mmxext; c->put_no_rnd_pixels_tab[1][2] = ff_put_no_rnd_pixels8_y2_mmxext; } -- 2.49.1 >>From 1ada1af98cd0ef59a56cd3db75521df2f94f2bcd Mon Sep 17 00:00:00 2001 From: Andreas Rheinhardt Date: Tue, 23 Sep 2025 04:49:44 +0200 Subject: [PATCH 19/20] avcodec/x86/hpeldsp_init: Avoid complicating macro Signed-off-by: Andreas Rheinhardt --- libavcodec/x86/hpeldsp_init.c | 19 ++++--------------- 1 file changed, 4 insertions(+), 15 deletions(-) diff --git a/libavcodec/x86/hpeldsp_init.c b/libavcodec/x86/hpeldsp_init.c index 48a1aa7a2c..66ed886ea9 100644 --- a/libavcodec/x86/hpeldsp_init.c +++ b/libavcodec/x86/hpeldsp_init.c @@ -69,8 +69,6 @@ void ff_avg_pixels8_x2_mmxext(uint8_t *block, const uint8_t *pixels, void 
ff_avg_pixels8_y2_mmxext(uint8_t *block, const uint8_t *pixels, ptrdiff_t line_size, int h); -#define put_no_rnd_pixels8_mmx ff_put_pixels8_mmx - #if HAVE_INLINE_ASM /***********************************/ @@ -167,25 +165,16 @@ CALL_2X_PIXELS(put_no_rnd_pixels16_xy2_mmx, put_no_rnd_pixels8_xy2_mmx, 8) #endif #endif /* HAVE_INLINE_ASM */ -#define SET_HPEL_FUNCS_EXT(PFX, IDX, SIZE, CPU) \ - if (HAVE_MMX_EXTERNAL) \ - c->PFX ## _pixels_tab IDX [0] = PFX ## _pixels ## SIZE ## _ ## CPU - -#define SET_HPEL_FUNCS03(PFX, IDX, SIZE, CPU) \ - do { \ - SET_HPEL_FUNCS_EXT(PFX, IDX, SIZE, CPU); \ - c->PFX ## _pixels_tab IDX [3] = PFX ## _pixels ## SIZE ## _xy2_ ## CPU; \ - } while (0) - static void hpeldsp_init_mmx(HpelDSPContext *c, int flags) { #if HAVE_MMX_INLINE c->put_no_rnd_pixels_tab[0][3] = put_no_rnd_pixels16_xy2_mmx; + c->put_no_rnd_pixels_tab[1][3] = put_no_rnd_pixels8_xy2_mmx; c->avg_no_rnd_pixels_tab[3] = avg_no_rnd_pixels16_xy2_mmx; -#if HAVE_MMX_EXTERNAL - c->put_pixels_tab[1][0] = ff_put_pixels8_mmx; #endif - SET_HPEL_FUNCS03(put_no_rnd, [1], 8, mmx); +#if HAVE_MMX_EXTERNAL + c->put_no_rnd_pixels_tab[1][0] = + c->put_pixels_tab[1][0] = ff_put_pixels8_mmx; #endif } -- 2.49.1 >>From 82ce2745be25d2d8b73fb4d33472209e6d2625ec Mon Sep 17 00:00:00 2001 From: Andreas Rheinhardt Date: Tue, 23 Sep 2025 05:01:41 +0200 Subject: [PATCH 20/20] avcodec/x86/rnd_template: Merge into hpeldsp_init.c It is now only included exactly once. 
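
For orientation, the one function that remains from the template, put_no_rnd_pixels8_xy2_mmx, sums the four neighbouring pixels in 16-bit lanes, adds the constant loaded by MOVQ_WONE (1 for the no_rnd version, per the comment in the code) and shifts right by 2. A scalar sketch under that reading follows; put_no_rnd_pixels8_xy2_ref and check_put_no_rnd_xy2 are hypothetical names used only here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model of the xy2 (diagonal halfpel) no_rnd case.
 * Reads h + 1 source rows of 9 bytes each. */
void put_no_rnd_pixels8_xy2_ref(uint8_t *block, const uint8_t *pixels,
                                ptrdiff_t line_size, int h)
{
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < 8; x++) {
            int sum = pixels[x] + pixels[x + 1] +
                      pixels[x + line_size] + pixels[x + line_size + 1];
            /* bias of 1 instead of the 2 used by the rounding version */
            block[x] = (uint8_t)((sum + 1) >> 2);
        }
        block  += line_size;
        pixels += line_size;
    }
}

/* Rows of 1s and 2s: (1 + 1 + 2 + 2 + 1) >> 2 = 1, whereas a bias of 2
 * would give 2, so this input distinguishes the two variants. */
void check_put_no_rnd_xy2(void)
{
    uint8_t src[2 * 16], dst[16] = { 0 };
    for (int i = 0; i < 16; i++) {
        src[i]      = 1;
        src[16 + i] = 2;
    }
    put_no_rnd_pixels8_xy2_ref(dst, src, 16, 1);
    for (int i = 0; i < 8; i++)
        assert(dst[i] == 1);
}
```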
Signed-off-by: Andreas Rheinhardt --- libavcodec/x86/hpeldsp_init.c | 74 +++++++++++++++++++++++--- libavcodec/x86/rnd_template.c | 98 ----------------------------------- 2 files changed, 67 insertions(+), 105 deletions(-) delete mode 100644 libavcodec/x86/rnd_template.c diff --git a/libavcodec/x86/hpeldsp_init.c b/libavcodec/x86/hpeldsp_init.c index 66ed886ea9..cb47cb7752 100644 --- a/libavcodec/x86/hpeldsp_init.c +++ b/libavcodec/x86/hpeldsp_init.c @@ -33,6 +33,7 @@ #include "libavcodec/pixels.h" #include "fpel.h" #include "hpeldsp.h" +#include "inline_asm.h" void ff_put_pixels8_x2_mmxext(uint8_t *block, const uint8_t *pixels, ptrdiff_t line_size, int h); @@ -73,15 +74,74 @@ void ff_avg_pixels8_y2_mmxext(uint8_t *block, const uint8_t *pixels, /***********************************/ /* MMX no rounding */ -#define DEF(x, y) x ## _no_rnd_ ## y ## _mmx -#define SET_RND MOVQ_WONE -#define STATIC static -#include "rnd_template.c" +// put_pixels +static void put_no_rnd_pixels8_xy2_mmx(uint8_t *block, const uint8_t *pixels, + ptrdiff_t line_size, int h) +{ + MOVQ_ZERO(mm7); + MOVQ_WONE(mm6); // =1 for no_rnd version + __asm__ volatile( + "movq (%1), %%mm0 \n\t" + "movq 1(%1), %%mm4 \n\t" + "movq %%mm0, %%mm1 \n\t" + "movq %%mm4, %%mm5 \n\t" + "punpcklbw %%mm7, %%mm0 \n\t" + "punpcklbw %%mm7, %%mm4 \n\t" + "punpckhbw %%mm7, %%mm1 \n\t" + "punpckhbw %%mm7, %%mm5 \n\t" + "paddusw %%mm0, %%mm4 \n\t" + "paddusw %%mm1, %%mm5 \n\t" + "xor %%"FF_REG_a", %%"FF_REG_a" \n\t" + "add %3, %1 \n\t" + ".p2align 3 \n\t" + "1: \n\t" + "movq (%1, %%"FF_REG_a"), %%mm0 \n\t" + "movq 1(%1, %%"FF_REG_a"), %%mm2 \n\t" + "movq %%mm0, %%mm1 \n\t" + "movq %%mm2, %%mm3 \n\t" + "punpcklbw %%mm7, %%mm0 \n\t" + "punpcklbw %%mm7, %%mm2 \n\t" + "punpckhbw %%mm7, %%mm1 \n\t" + "punpckhbw %%mm7, %%mm3 \n\t" + "paddusw %%mm2, %%mm0 \n\t" + "paddusw %%mm3, %%mm1 \n\t" + "paddusw %%mm6, %%mm4 \n\t" + "paddusw %%mm6, %%mm5 \n\t" + "paddusw %%mm0, %%mm4 \n\t" + "paddusw %%mm1, %%mm5 \n\t" + "psrlw $2, %%mm4 
\n\t" + "psrlw $2, %%mm5 \n\t" + "packuswb %%mm5, %%mm4 \n\t" + "movq %%mm4, (%2, %%"FF_REG_a") \n\t" + "add %3, %%"FF_REG_a" \n\t" -#undef DEF -#undef SET_RND -#undef STATIC + "movq (%1, %%"FF_REG_a"), %%mm2 \n\t" // 0 <-> 2 1 <-> 3 + "movq 1(%1, %%"FF_REG_a"), %%mm4 \n\t" + "movq %%mm2, %%mm3 \n\t" + "movq %%mm4, %%mm5 \n\t" + "punpcklbw %%mm7, %%mm2 \n\t" + "punpcklbw %%mm7, %%mm4 \n\t" + "punpckhbw %%mm7, %%mm3 \n\t" + "punpckhbw %%mm7, %%mm5 \n\t" + "paddusw %%mm2, %%mm4 \n\t" + "paddusw %%mm3, %%mm5 \n\t" + "paddusw %%mm6, %%mm0 \n\t" + "paddusw %%mm6, %%mm1 \n\t" + "paddusw %%mm4, %%mm0 \n\t" + "paddusw %%mm5, %%mm1 \n\t" + "psrlw $2, %%mm0 \n\t" + "psrlw $2, %%mm1 \n\t" + "packuswb %%mm1, %%mm0 \n\t" + "movq %%mm0, (%2, %%"FF_REG_a") \n\t" + "add %3, %%"FF_REG_a" \n\t" + + "subl $2, %0 \n\t" + "jnz 1b \n\t" + :"+g"(h), "+S"(pixels) + :"D"(block), "r"((x86_reg)line_size) + :FF_REG_a, "memory"); +} // this routine is 'slightly' suboptimal but mostly unused static void avg_no_rnd_pixels8_xy2_mmx(uint8_t *block, const uint8_t *pixels, diff --git a/libavcodec/x86/rnd_template.c b/libavcodec/x86/rnd_template.c deleted file mode 100644 index 4590aeddf0..0000000000 --- a/libavcodec/x86/rnd_template.c +++ /dev/null @@ -1,98 +0,0 @@ -/* - * SIMD-optimized halfpel functions are compiled twice for rnd/no_rnd - * Copyright (c) 2000, 2001 Fabrice Bellard - * Copyright (c) 2003-2004 Michael Niedermayer - * - * MMX optimization by Nick Kurshev - * mostly rewritten by Michael Niedermayer - * and improved by Zdenek Kabelac - * - * This file is part of FFmpeg. - * - * FFmpeg is free software; you can redistribute it and/or - * modify it under the terms of the GNU Lesser General Public - * License as published by the Free Software Foundation; either - * version 2.1 of the License, or (at your option) any later version. 
- * - * FFmpeg is distributed in the hope that it will be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU - * Lesser General Public License for more details. - * - * You should have received a copy of the GNU Lesser General Public - * License along with FFmpeg; if not, write to the Free Software - * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA - */ - -#include -#include - -#include "inline_asm.h" - -// put_pixels -av_unused STATIC void DEF(put, pixels8_xy2)(uint8_t *block, const uint8_t *pixels, - ptrdiff_t line_size, int h) -{ - MOVQ_ZERO(mm7); - SET_RND(mm6); // =2 for rnd and =1 for no_rnd version - __asm__ volatile( - "movq (%1), %%mm0 \n\t" - "movq 1(%1), %%mm4 \n\t" - "movq %%mm0, %%mm1 \n\t" - "movq %%mm4, %%mm5 \n\t" - "punpcklbw %%mm7, %%mm0 \n\t" - "punpcklbw %%mm7, %%mm4 \n\t" - "punpckhbw %%mm7, %%mm1 \n\t" - "punpckhbw %%mm7, %%mm5 \n\t" - "paddusw %%mm0, %%mm4 \n\t" - "paddusw %%mm1, %%mm5 \n\t" - "xor %%"FF_REG_a", %%"FF_REG_a" \n\t" - "add %3, %1 \n\t" - ".p2align 3 \n\t" - "1: \n\t" - "movq (%1, %%"FF_REG_a"), %%mm0 \n\t" - "movq 1(%1, %%"FF_REG_a"), %%mm2 \n\t" - "movq %%mm0, %%mm1 \n\t" - "movq %%mm2, %%mm3 \n\t" - "punpcklbw %%mm7, %%mm0 \n\t" - "punpcklbw %%mm7, %%mm2 \n\t" - "punpckhbw %%mm7, %%mm1 \n\t" - "punpckhbw %%mm7, %%mm3 \n\t" - "paddusw %%mm2, %%mm0 \n\t" - "paddusw %%mm3, %%mm1 \n\t" - "paddusw %%mm6, %%mm4 \n\t" - "paddusw %%mm6, %%mm5 \n\t" - "paddusw %%mm0, %%mm4 \n\t" - "paddusw %%mm1, %%mm5 \n\t" - "psrlw $2, %%mm4 \n\t" - "psrlw $2, %%mm5 \n\t" - "packuswb %%mm5, %%mm4 \n\t" - "movq %%mm4, (%2, %%"FF_REG_a") \n\t" - "add %3, %%"FF_REG_a" \n\t" - - "movq (%1, %%"FF_REG_a"), %%mm2 \n\t" // 0 <-> 2 1 <-> 3 - "movq 1(%1, %%"FF_REG_a"), %%mm4 \n\t" - "movq %%mm2, %%mm3 \n\t" - "movq %%mm4, %%mm5 \n\t" - "punpcklbw %%mm7, %%mm2 \n\t" - "punpcklbw %%mm7, %%mm4 \n\t" - "punpckhbw %%mm7, %%mm3 \n\t" - 
"punpckhbw %%mm7, %%mm5 \n\t" - "paddusw %%mm2, %%mm4 \n\t" - "paddusw %%mm3, %%mm5 \n\t" - "paddusw %%mm6, %%mm0 \n\t" - "paddusw %%mm6, %%mm1 \n\t" - "paddusw %%mm4, %%mm0 \n\t" - "paddusw %%mm5, %%mm1 \n\t" - "psrlw $2, %%mm0 \n\t" - "psrlw $2, %%mm1 \n\t" - "packuswb %%mm1, %%mm0 \n\t" - "movq %%mm0, (%2, %%"FF_REG_a") \n\t" - "add %3, %%"FF_REG_a" \n\t" - - "subl $2, %0 \n\t" - "jnz 1b \n\t" - :"+g"(h), "+S"(pixels) - :"D"(block), "r"((x86_reg)line_size) - :FF_REG_a, "memory"); -} -- 2.49.1 _______________________________________________ ffmpeg-devel mailing list -- ffmpeg-devel@ffmpeg.org To unsubscribe send an email to ffmpeg-devel-leave@ffmpeg.org