From: ZaneHam via ffmpeg-devel
Cc: ZaneHam
To: ffmpeg-devel@ffmpeg.org
Date: Tue, 16 Dec 2025 22:54:44 +1300
Message-ID: <20251216095444.47177-1-zanehambly@gmail.com>
X-Mailer: git-send-email 2.51.0.windows.2
Reply-To: FFmpeg development discussions and patches
List-Id: FFmpeg development discussions and patches
Subject: [FFmpeg-devel] [PATCH] avcodec/x86/h264_intrapred: add AVX2 for 10-bit pred16x16 functions

Add AVX2 implementations for 10-bit H.264 16x16 intra prediction:
- pred16x16_vertical_10
- pred16x16_horizontal_10
- pred16x16_dc_10
- pred16x16_top_dc_10
- pred16x16_left_dc_10
- pred16x16_128_dc_10

10-bit 16x16 blocks are 32 bytes per row, perfectly matching AVX2's
256-bit YMM registers, allowing single-instruction row operations
versus two XMM operations with SSE2.

checkasm benchmarks on Zen3 (cycles, lower is better):

                          C      SSE2   AVX2
pred16x16_dc_10           65.7   40.3   27.3   (1.48x vs SSE2)
pred16x16_128_dc_10       31.1   28.1   21.4   (1.31x vs SSE2)
pred16x16_horizontal      67.8   28.1   21.6   (1.30x vs SSE2)
pred16x16_left_dc_10      55.6   35.0   22.9   (1.53x vs SSE2)
pred16x16_top_dc_10       49.5   32.3   21.8   (1.48x vs SSE2)
pred16x16_vertical_10     32.3   28.3   24.1   (1.17x vs SSE2)

Merry Christmas from New Zealand!
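
For reference when checking the arithmetic, the DC variants and the
constant fill can be modelled in scalar C. This is only a minimal sketch
under stated assumptions, not FFmpeg's C reference code: the ref_* names
and the fill16x16, sum_top and sum_left helpers are hypothetical, and
stride is counted in pixels here, whereas the assembly takes a byte
stride.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model; src points at the top-left pixel of the
 * 16x16 block and stride is in pixels (assumption for this sketch). */
static void fill16x16(uint16_t *src, ptrdiff_t stride, unsigned v)
{
    for (int y = 0; y < 16; y++)
        for (int x = 0; x < 16; x++)
            src[y * stride + x] = (uint16_t)v;
}

/* Sum of the 16 pixels in the row directly above the block. */
static unsigned sum_top(const uint16_t *src, ptrdiff_t stride)
{
    unsigned s = 0;
    for (int x = 0; x < 16; x++)
        s += src[x - stride];
    return s;
}

/* Sum of the 16 pixels in the column directly left of the block. */
static unsigned sum_left(const uint16_t *src, ptrdiff_t stride)
{
    unsigned s = 0;
    for (int y = 0; y < 16; y++)
        s += src[y * stride - 1];
    return s;
}

/* DC = (top sum + left sum + 16) >> 5 */
static void ref_pred16x16_dc_10(uint16_t *src, ptrdiff_t stride)
{
    fill16x16(src, stride,
              (sum_top(src, stride) + sum_left(src, stride) + 16) >> 5);
}

/* DC = (top sum + 8) >> 4 */
static void ref_pred16x16_top_dc_10(uint16_t *src, ptrdiff_t stride)
{
    fill16x16(src, stride, (sum_top(src, stride) + 8) >> 4);
}

/* DC = (left sum + 8) >> 4 */
static void ref_pred16x16_left_dc_10(uint16_t *src, ptrdiff_t stride)
{
    fill16x16(src, stride, (sum_left(src, stride) + 8) >> 4);
}

/* No neighbours available: fill with the 10-bit midpoint, 1 << 9. */
static void ref_pred16x16_128_dc_10(uint16_t *src, ptrdiff_t stride)
{
    fill16x16(src, stride, 1u << 9);
}

/* Self-check with a constant top row (100) and left column (300). */
static void ref_self_check(void)
{
    enum { STRIDE = 32 };
    uint16_t buf[17 * STRIDE] = { 0 };
    uint16_t *blk = buf + STRIDE + 1; /* leave one row and column of border */

    for (int i = 0; i < 16; i++) {
        blk[i - STRIDE]     = 100;    /* top neighbours  */
        blk[i * STRIDE - 1] = 300;    /* left neighbours */
    }

    ref_pred16x16_dc_10(blk, STRIDE);      /* (1600 + 4800 + 16) >> 5 */
    assert(blk[0] == 200 && blk[15 * STRIDE + 15] == 200);

    ref_pred16x16_top_dc_10(blk, STRIDE);  /* (1600 + 8) >> 4 */
    assert(blk[0] == 100);

    ref_pred16x16_left_dc_10(blk, STRIDE); /* (4800 + 8) >> 4 */
    assert(blk[0] == 300);

    ref_pred16x16_128_dc_10(blk, STRIDE);
    assert(blk[0] == 512);
}
```

The largest possible sum is 32 * 1023 = 32736, which fits in 16 bits.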
---
 libavcodec/x86/h264_intrapred_10bit.asm | 186 ++++++++++++++++++++++++
 libavcodec/x86/h264_intrapred_init.c    |  14 ++
 2 files changed, 200 insertions(+)

diff --git a/libavcodec/x86/h264_intrapred_10bit.asm b/libavcodec/x86/h264_intrapred_10bit.asm
index 2f30807332..78e2f263bc 100644
--- a/libavcodec/x86/h264_intrapred_10bit.asm
+++ b/libavcodec/x86/h264_intrapred_10bit.asm
@@ -1117,3 +1117,189 @@ cglobal pred16x16_128_dc_10, 2,3
     dec       r2d
     jg .loop
     RET
+
+;-----------------------------------------------------------------------------
+; AVX2 versions of pred16x16 10-bit functions
+; For 10-bit: 16 pixels * 2 bytes = 32 bytes = 1 YMM register (perfect match!)
+;-----------------------------------------------------------------------------
+
+%if HAVE_AVX2_EXTERNAL
+
+;-----------------------------------------------------------------------------
+; void ff_pred16x16_vertical_10_avx2(pixel *src, ptrdiff_t stride)
+;-----------------------------------------------------------------------------
+INIT_YMM avx2
+cglobal pred16x16_vertical_10, 2, 4
+    sub           r0, r1
+    movu          m0, [r0]            ; Load all 16 pixels (32 bytes) from top row
+    mov          r2d, 4
+    lea           r3, [r1*3]
+.loop:
+    movu [r0+r1*1], m0
+    movu [r0+r1*2], m0
+    movu [r0+r3  ], m0
+    lea           r0, [r0+r1*2]
+    movu [r0+r1*2], m0
+    lea           r0, [r0+r1*2]
+    dec          r2d
+    jg .loop
+    RET
+
+;-----------------------------------------------------------------------------
+; void ff_pred16x16_horizontal_10_avx2(pixel *src, ptrdiff_t stride)
+;-----------------------------------------------------------------------------
+INIT_YMM avx2
+cglobal pred16x16_horizontal_10, 2, 4
+    lea           r2, [r1*3]
+    mov          r3d, 4
+.loop:
+    vpbroadcastw  m0, [r0-2]
+    movu        [r0], m0
+    vpbroadcastw  m0, [r0+r1-2]
+    movu     [r0+r1], m0
+    vpbroadcastw  m0, [r0+r1*2-2]
+    movu   [r0+r1*2], m0
+    vpbroadcastw  m0, [r0+r2-2]
+    movu     [r0+r2], m0
+    lea           r0, [r0+r1*4]
+    dec          r3d
+    jg .loop
+    RET
+
+;-----------------------------------------------------------------------------
+; void ff_pred16x16_dc_10_avx2(pixel *src, ptrdiff_t stride)
+; DC = (sum of 16 top pixels + sum of 16 left pixels + 16) >> 5
+;-----------------------------------------------------------------------------
+INIT_YMM avx2
+cglobal pred16x16_dc_10, 2, 6
+    mov           r5, r0              ; Save dest pointer
+    sub           r0, r1
+    movu          m0, [r0]            ; Load top row (32 bytes)
+    vextracti128 xm1, m0, 1           ; Get high 128 bits
+    paddw        xm0, xm1             ; Sum to 8 words
+    phaddw       xm0, xm0             ; 4 words
+    phaddw       xm0, xm0             ; 2 words
+    phaddw       xm0, xm0             ; 1 word (top sum in low word)
+    movd         r3d, xm0
+    and          r3d, 0xFFFF          ; Keep only low 16 bits
+
+    ; Sum left column using lea-based pointer advancement
+    lea           r0, [r0+r1-2]       ; Point to left pixel of row 0
+    movzx        r4d, word [r0]
+    add          r3d, r4d
+    movzx        r4d, word [r0+r1]
+    add          r3d, r4d
+%rep 7
+    lea           r0, [r0+r1*2]
+    movzx        r4d, word [r0]
+    add          r3d, r4d
+    movzx        r4d, word [r0+r1]
+    add          r3d, r4d
+%endrep
+    add          r3d, 16              ; Rounding
+    shr          r3d, 5               ; Divide by 32
+
+    movd         xm0, r3d
+    vpbroadcastw  m0, xm0             ; Broadcast to all 16 words
+
+    ; Fill all 16 rows
+    mov          r3d, 4
+    lea           r4, [r1*3]
+.loop:
+    movu [r5+r1*0], m0
+    movu [r5+r1*1], m0
+    movu [r5+r1*2], m0
+    movu [r5+r4  ], m0
+    lea           r5, [r5+r1*4]
+    dec          r3d
+    jg .loop
+    RET
+
+;-----------------------------------------------------------------------------
+; void ff_pred16x16_top_dc_10_avx2(pixel *src, ptrdiff_t stride)
+; DC = (sum of 16 top pixels + 8) >> 4
+;-----------------------------------------------------------------------------
+INIT_YMM avx2
+cglobal pred16x16_top_dc_10, 2, 4
+    sub           r0, r1
+    movu          m0, [r0]            ; Load top row
+    vextracti128 xm1, m0, 1
+    paddw        xm0, xm1
+    phaddw       xm0, xm0
+    phaddw       xm0, xm0
+    phaddw       xm0, xm0
+    paddw        xm0, [pw_8]          ; Add 8 for rounding
+    psrlw        xm0, 4               ; Divide by 16
+    vpbroadcastw  m0, xm0
+
+    mov          r2d, 4
+    lea           r3, [r1*3]
+.loop:
+    movu [r0+r1*1], m0
+    movu [r0+r1*2], m0
+    movu [r0+r3  ], m0
+    lea           r0, [r0+r1*2]
+    movu [r0+r1*2], m0
+    lea           r0, [r0+r1*2]
+    dec          r2d
+    jg .loop
+    RET
+
+;-----------------------------------------------------------------------------
+; void ff_pred16x16_left_dc_10_avx2(pixel *src, ptrdiff_t stride)
+; DC = (sum of 16 left pixels + 8) >> 4
+;-----------------------------------------------------------------------------
+INIT_YMM avx2
+cglobal pred16x16_left_dc_10, 2, 6
+    mov           r4, r0              ; Save dest pointer
+
+    ; Sum left column using lea-based pointer advancement
+    sub           r0, 2               ; Point to left pixel of row 0
+    movzx        r2d, word [r0]
+    movzx        r3d, word [r0+r1]
+%rep 7
+    lea           r0, [r0+r1*2]
+    movzx        r5d, word [r0]       ; r5 as scratch: eax aliases r0 on x86-32
+    add          r2d, r5d
+    movzx        r5d, word [r0+r1]
+    add          r3d, r5d
+%endrep
+    lea          r2d, [r2+r3+8]       ; Sum with rounding
+    shr          r2d, 4               ; Divide by 16
+
+    movd         xm0, r2d
+    vpbroadcastw  m0, xm0
+
+    ; Fill all 16 rows
+    mov          r2d, 4
+    lea           r3, [r1*3]
+.loop:
+    movu [r4+r1*0], m0
+    movu [r4+r1*1], m0
+    movu [r4+r1*2], m0
+    movu [r4+r3  ], m0
+    lea           r4, [r4+r1*4]
+    dec          r2d
+    jg .loop
+    RET
+
+;-----------------------------------------------------------------------------
+; void ff_pred16x16_128_dc_10_avx2(pixel *src, ptrdiff_t stride)
+; Fill with constant 512 (1 << 9 for 10-bit midpoint)
+;-----------------------------------------------------------------------------
+INIT_YMM avx2
+cglobal pred16x16_128_dc_10, 2, 4
+    vpbroadcastw  m0, [pw_512]
+    mov          r2d, 4
+    lea           r3, [r1*3]
+.loop:
+    movu [r0+r1*0], m0
+    movu [r0+r1*1], m0
+    movu [r0+r1*2], m0
+    movu [r0+r3  ], m0
+    lea           r0, [r0+r1*4]
+    dec          r2d
+    jg .loop
+    RET
+
+%endif ; HAVE_AVX2_EXTERNAL
diff --git a/libavcodec/x86/h264_intrapred_init.c b/libavcodec/x86/h264_intrapred_init.c
index aa9bc721f0..6918c7f985 100644
--- a/libavcodec/x86/h264_intrapred_init.c
+++ b/libavcodec/x86/h264_intrapred_init.c
@@ -97,6 +97,12 @@ PRED16x16(128_dc, 10, sse2)
 PRED16x16(left_dc, 10, sse2)
 PRED16x16(vertical, 10, sse2)
 PRED16x16(horizontal, 10, sse2)
+PRED16x16(dc, 10, avx2)
+PRED16x16(top_dc, 10, avx2)
+PRED16x16(128_dc, 10, avx2)
+PRED16x16(left_dc, 10, avx2)
+PRED16x16(vertical, 10, avx2)
+PRED16x16(horizontal, 10, avx2)
 
 /* 8-bit versions */
 PRED16x16(vertical, 8, sse)
@@ -328,5 +334,13 @@ av_cold void ff_h264_pred_init_x86(H264PredContext *h, int codec_id,
             h->pred8x8l[VERT_RIGHT_PRED     ] = ff_pred8x8l_vertical_right_10_avx;
             h->pred8x8l[HOR_UP_PRED         ] = ff_pred8x8l_horizontal_up_10_avx;
         }
+        if (EXTERNAL_AVX2(cpu_flags)) {
+            h->pred16x16[DC_PRED8x8         ] = ff_pred16x16_dc_10_avx2;
+            h->pred16x16[TOP_DC_PRED8x8     ] = ff_pred16x16_top_dc_10_avx2;
+            h->pred16x16[DC_128_PRED8x8     ] = ff_pred16x16_128_dc_10_avx2;
+            h->pred16x16[LEFT_DC_PRED8x8    ] = ff_pred16x16_left_dc_10_avx2;
+            h->pred16x16[VERT_PRED8x8       ] = ff_pred16x16_vertical_10_avx2;
+            h->pred16x16[HOR_PRED8x8        ] = ff_pred16x16_horizontal_10_avx2;
+        }
     }
 }
-- 
2.51.0.windows.2

_______________________________________________
ffmpeg-devel mailing list -- ffmpeg-devel@ffmpeg.org
To unsubscribe send an email to ffmpeg-devel-leave@ffmpeg.org