From mboxrd@z Thu Jan 1 00:00:00 1970
MIME-Version: 1.0
To: ffmpeg-devel@ffmpeg.org
Message-ID: <175672499026.25.13389439321243697441@463a07221176>
Reply-To: FFmpeg development discussions and patches
Subject: [FFmpeg-devel] [PATCH] vp9: Add 8bpc intra prediction AVX2 asm (PR #20386)
From: Henrik Gramner via ffmpeg-devel
Cc: Henrik Gramner
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit

PR #20386 opened by Henrik Gramner (gramner)
URL: https://code.ffmpeg.org/FFmpeg/FFmpeg/pulls/20386
Patch URL: https://code.ffmpeg.org/FFmpeg/FFmpeg/pulls/20386.patch

A few of the most basic variants already had AVX2 implementations.
Those were rewritten to reduce code size.

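For readers checking the rewritten DC paths: the AVX2 code in the patch sums the edge pixels with `psadbw` and then scales the sum with `pmulhrsw` against `pw_512` (top and left, 64 samples) or `pw_1024` (one edge, 32 samples). The scalar C sketch below, which assumes the usual per-lane definition of `pmulhrsw`, suggests why that yields a rounded average, and models the store loop that the dc, dc_top, dc_left and v entry points share. All function names here are hypothetical and are not part of FFmpeg.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of one signed 16-bit lane of pmulhrsw:
 * ((a * b >> 14) + 1) >> 1. Inputs here are small and non-negative. */
static int pmulhrsw_lane(int a, int b)
{
    return ((a * b >> 14) + 1) >> 1;
}

/* Hypothetical helper: DC value from both edges (64 samples, pw_512). */
static uint8_t dc_value_32x32(const uint8_t *left, const uint8_t *top)
{
    int sum = 0;
    for (int i = 0; i < 32; i++)
        sum += left[i] + top[i];
    return (uint8_t)pmulhrsw_lane(sum, 512);
}

/* Hypothetical helper: DC value from one edge (32 samples, pw_1024). */
static uint8_t dc_value_1d_32x32(const uint8_t *edge)
{
    int sum = 0;
    for (int i = 0; i < 32; i++)
        sum += edge[i];
    return (uint8_t)pmulhrsw_lane(sum, 1024);
}

/* Model of the shared store loop: one 32-byte row written to 32 rows.
 * For the DC variants the row is a broadcast of the DC value; for the
 * vertical predictor it is the top edge itself. */
static void store_32x32(uint8_t *dst, ptrdiff_t stride, const uint8_t row[32])
{
    for (int y = 0; y < 32; y++)
        memcpy(dst + y * stride, row, 32);
}

/* Exhaustively confirm that the scaling equals a rounded average
 * for every sum that 8-bit edge pixels can produce. */
static void check_dc_rounding(void)
{
    for (int sum = 0; sum <= 64 * 255; sum++)
        assert(pmulhrsw_lane(sum, 512) == (sum + 32) >> 6);
    for (int sum = 0; sum <= 32 * 255; sum++)
        assert(pmulhrsw_lane(sum, 1024) == (sum + 16) >> 5);
}
```

Because only the scaling constant and the source of the 32-byte row differ between these variants, letting them jump into one shared tail is a plausible way to save code size, which is the goal the description states.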
Checkasm numbers on Zen 5 (Strix Halo):
```
vp9_dc_32x32_8bpp_ssse3:              24.2
vp9_dc_32x32_8bpp_avx2:               10.3
vp9_dc_left_32x32_8bpp_ssse3:         23.6
vp9_dc_left_32x32_8bpp_avx2:           9.9
vp9_dc_top_32x32_8bpp_ssse3:          22.9
vp9_dc_top_32x32_8bpp_avx2:           10.0
vp9_diag_downleft_32x32_8bpp_avx:     28.5
vp9_diag_downleft_32x32_8bpp_avx2:    13.5
vp9_diag_downright_32x32_8bpp_avx:    35.0
vp9_diag_downright_32x32_8bpp_avx2:   17.0
vp9_hor_32x32_8bpp_avx:               22.3
vp9_hor_32x32_8bpp_avx2:              11.1
vp9_hor_down_32x32_8bpp_avx:          27.5
vp9_hor_down_32x32_8bpp_avx2:         19.8
vp9_hor_up_32x32_8bpp_avx:            26.0
vp9_hor_up_32x32_8bpp_avx2:           16.0
vp9_tm_32x32_8bpp_avx:                97.9
vp9_tm_32x32_8bpp_avx2:               23.6
vp9_vert_32x32_8bpp_sse:              20.8
vp9_vert_32x32_8bpp_avx2:              8.9
vp9_vert_left_32x32_8bpp_avx:         28.1
vp9_vert_left_32x32_8bpp_avx2:        15.2
vp9_vert_right_32x32_8bpp_avx:        32.0
vp9_vert_right_32x32_8bpp_avx2:       21.3
```

>>From ce6ff1b6229f2346e3caee18efbe36e794a94c6d Mon Sep 17 00:00:00 2001
From: Henrik Gramner
Date: Mon, 1 Sep 2025 02:03:00 +0200
Subject: [PATCH] vp9: Add 8bpc intra prediction AVX2 asm

---
 libavcodec/x86/vp9dsp_init.c    |  13 +-
 libavcodec/x86/vp9intrapred.asm | 467 +++++++++++++++++++++-----------
 2 files changed, 309 insertions(+), 171 deletions(-)

diff --git a/libavcodec/x86/vp9dsp_init.c b/libavcodec/x86/vp9dsp_init.c
index 9836b3321c..bbabcf38c3 100644
--- a/libavcodec/x86/vp9dsp_init.c
+++ b/libavcodec/x86/vp9dsp_init.c
@@ -207,11 +207,8 @@ ipred_dir_tm_h_funcs(8, avx);
 ipred_dir_tm_h_funcs(16, avx);
 ipred_dir_tm_h_funcs(32, avx);
 
-ipred_func(32, v, avx);
-
-ipred_dc_funcs(32, avx2);
-ipred_func(32, h, avx2);
-ipred_func(32, tm, avx2);
+ipred_all_funcs(32, avx2);
+ipred_func(32, v, avx2);
 
 #undef ipred_func
 #undef ipred_dir_tm_h_funcs
@@ -388,7 +385,6 @@ av_cold void ff_vp9dsp_init_x86(VP9DSPContext *dsp, int bpp, int bitexact)
     if (EXTERNAL_AVX_FAST(cpu_flags)) {
         init_fpel_func(1, 0, 32, put, , avx);
         init_fpel_func(0, 0, 64, put, , avx);
-        init_ipred(32, avx, v, VERT);
     }
 
     if (EXTERNAL_AVX2_FAST(cpu_flags)) {
@@ -408,9 +404,8 @@ av_cold void ff_vp9dsp_init_x86(VP9DSPContext *dsp, int bpp, int bitexact)
         init_subpel3_32_64(1, avg, 8, avx2);
 #endif
         }
-        init_dc_ipred(32, avx2);
-        init_ipred(32, avx2, h, HOR);
-        init_ipred(32, avx2, tm, TM_VP8);
+        init_all_ipred(32, avx2);
+        init_ipred(32, avx2, v, VERT);
     }
 
 #if ARCH_X86_64
diff --git a/libavcodec/x86/vp9intrapred.asm b/libavcodec/x86/vp9intrapred.asm
index 31f7d449fd..b67addd7e3 100644
--- a/libavcodec/x86/vp9intrapred.asm
+++ b/libavcodec/x86/vp9intrapred.asm
@@ -2,6 +2,7 @@
 ;* VP9 Intra prediction SIMD optimizations
 ;*
 ;* Copyright (c) 2013 Ronald S. Bultje
+;* Copyright (c) 2025 Two Orioles, LLC
 ;*
 ;* Parts based on:
 ;* H.264 intra prediction asm optimizations
@@ -230,40 +231,6 @@ DC_16to32_FUNCS
 INIT_XMM ssse3
 DC_16to32_FUNCS
 
-%if HAVE_AVX2_EXTERNAL
-INIT_YMM avx2
-cglobal vp9_ipred_dc_32x32, 4, 4, 3, dst, stride, l, a
-    mova            m0, [lq]
-    mova            m1, [aq]
-    DEFINE_ARGS dst, stride, stride3, cnt
-    lea             stride3q, [strideq*3]
-    pxor            m2, m2
-    psadbw          m0, m2
-    psadbw          m1, m2
-    paddw           m0, m1
-    vextracti128    xm1, m0, 1
-    paddw           xm0, xm1
-    movhlps         xm1, xm0
-    paddw           xm0, xm1
-    pmulhrsw        xm0, [pw_512]
-    vpbroadcastb    m0, xm0
-    mov             cntd, 4
-.loop:
-    mova            [dstq+strideq*0], m0
-    mova            [dstq+strideq*1], m0
-    mova            [dstq+strideq*2], m0
-    mova            [dstq+stride3q ], m0
-    lea             dstq, [dstq+strideq*4]
-    mova            [dstq+strideq*0], m0
-    mova            [dstq+strideq*1], m0
-    mova            [dstq+strideq*2], m0
-    mova            [dstq+stride3q ], m0
-    lea             dstq, [dstq+strideq*4]
-    dec             cntd
-    jg              .loop
-    RET
-%endif
-
 ; dc_top/left_NxN(uint8_t *dst, ptrdiff_t stride, const uint8_t *l, const uint8_t *a)
 
 %macro DC_1D_4to8_FUNCS 2 ; dir (top or left), arg (a or l)
@@ -395,44 +362,6 @@ INIT_XMM ssse3
 DC_1D_16to32_FUNCS top, a
 DC_1D_16to32_FUNCS left, l
 
-%macro DC_1D_AVX2_FUNCS 2 ; dir (top or left), arg (a or l)
-%if HAVE_AVX2_EXTERNAL
-cglobal vp9_ipred_dc_%1_32x32, 4, 4, 3, dst, stride, l, a
-    mova            m0, [%2q]
-    DEFINE_ARGS dst, stride, stride3, cnt
-    lea             stride3q, [strideq*3]
-    pxor            m2, m2
-    psadbw          m0, m2
-    vextracti128    xm1, m0, 1
-    paddw           xm0, xm1
-    movhlps         xm1, xm0
-    paddw           xm0, xm1
-    pmulhrsw        xm0, [pw_1024]
-    vpbroadcastb    m0, xm0
-    mov             cntd, 4
-.loop:
-    mova            [dstq+strideq*0], m0
-    mova            [dstq+strideq*1], m0
-    mova            [dstq+strideq*2], m0
-    mova            [dstq+stride3q ], m0
-    lea             dstq, [dstq+strideq*4]
-    mova            [dstq+strideq*0], m0
-    mova            [dstq+strideq*1], m0
-    mova            [dstq+strideq*2], m0
-    mova            [dstq+stride3q ], m0
-    lea             dstq, [dstq+strideq*4]
-    dec             cntd
-    jg              .loop
-    RET
-%endif
-%endmacro
-
-INIT_YMM avx2
-DC_1D_AVX2_FUNCS top, a
-DC_1D_AVX2_FUNCS left, l
-
-; v
-
 INIT_MMX mmx
 cglobal vp9_ipred_v_8x8, 4, 4, 0, dst, stride, l, a
     movq            m0, [aq]
@@ -486,29 +415,6 @@ cglobal vp9_ipred_v_32x32, 4, 4, 2, dst, stride, l, a
     jg              .loop
     RET
 
-INIT_YMM avx
-cglobal vp9_ipred_v_32x32, 4, 4, 1, dst, stride, l, a
-    mova            m0, [aq]
-    DEFINE_ARGS dst, stride, stride3, cnt
-    lea             stride3q, [strideq*3]
-    mov             cntd, 4
-.loop:
-    mova            [dstq+strideq*0], m0
-    mova            [dstq+strideq*1], m0
-    mova            [dstq+strideq*2], m0
-    mova            [dstq+stride3q ], m0
-    lea             dstq, [dstq+strideq*4]
-    mova            [dstq+strideq*0], m0
-    mova            [dstq+strideq*1], m0
-    mova            [dstq+strideq*2], m0
-    mova            [dstq+stride3q ], m0
-    lea             dstq, [dstq+strideq*4]
-    dec             cntd
-    jg              .loop
-    RET
-
-; h
-
 %macro H_XMM_FUNCS 2
 %if notcpuflag(avx)
 cglobal vp9_ipred_h_4x4, 3, 4, 1, dst, stride, l, stride3
@@ -642,34 +548,6 @@ H_XMM_FUNCS 4, 8
 INIT_XMM avx
 H_XMM_FUNCS 4, 8
 
-%if HAVE_AVX2_EXTERNAL
-INIT_YMM avx2
-cglobal vp9_ipred_h_32x32, 3, 5, 8, dst, stride, l, stride3, cnt
-    mova            m5, [pb_1]
-    mova            m6, [pb_2]
-    mova            m7, [pb_3]
-    pxor            m4, m4
-    lea             stride3q, [strideq*3]
-    mov             cntq, 7
-.loop:
-    movd            xm3, [lq+cntq*4]
-    vinserti128     m3, m3, xm3, 1
-    pshufb          m0, m3, m7
-    pshufb          m1, m3, m6
-    mova            [dstq+strideq*0], m0
-    mova            [dstq+strideq*1], m1
-    pshufb          m2, m3, m5
-    pshufb          m3, m4
-    mova            [dstq+strideq*2], m2
-    mova            [dstq+stride3q ], m3
-    lea             dstq, [dstq+strideq*4]
-    dec             cntq
-    jge             .loop
-    RET
-%endif
-
-; tm
-
 %macro TM_MMX_FUNCS 0
 cglobal vp9_ipred_tm_4x4, 4, 4, 0, dst, stride, l, a
     pxor            m1, m1
@@ -898,46 +776,9 @@ TM_XMM_FUNCS
 INIT_XMM avx
 TM_XMM_FUNCS
 
-%if HAVE_AVX2_EXTERNAL
-INIT_YMM avx2
-cglobal vp9_ipred_tm_32x32, 4, 4, 8, dst, stride, l, a
-    pxor            m3, m3
-    pinsrw          xm2, [aq-1], 0
-    vinserti128     m2, m2, xm2, 1
-    mova            m0, [aq]
-    DEFINE_ARGS dst, stride, l, cnt
-    mova            m4, [pw_m256]
-    mova            m5, [pw_m255]
-    pshufb          m2, m4
-    punpckhbw       m1, m0, m3
-    punpcklbw       m0, m3
-    psubw           m1, m2
-    psubw           m0, m2
-    mov             cntq, 15
-.loop:
-    pinsrw          xm7, [lq+cntq*2], 0
-    vinserti128     m7, m7, xm7, 1
-    pshufb          m3, m7, m5
-    pshufb          m7, m4
-    paddw           m2, m3, m0
-    paddw           m3, m1
-    paddw           m6, m7, m0
-    paddw           m7, m1
-    packuswb        m2, m3
-    packuswb        m6, m7
-    mova            [dstq+strideq*0], m2
-    mova            [dstq+strideq*1], m6
-    lea             dstq, [dstq+strideq*2]
-    dec             cntq
-    jge             .loop
-    RET
-%endif
-
-; dl
-
-%macro LOWPASS 4 ; left [dst], center, right, tmp
+%macro LOWPASS 4-5 [pb_1] ; left [dst], center, right, tmp, pb_1
     pxor            m%4, m%1, m%3
-    pand            m%4, [pb_1]
+    pand            m%4, %5
     pavgb           m%1, m%3
     psubusb         m%1, m%4
     pavgb           m%1, m%2
@@ -2041,4 +1882,306 @@ HU_XMM_FUNCS 7
 INIT_XMM avx
 HU_XMM_FUNCS 7
 
+%if HAVE_AVX2_EXTERNAL
+INIT_YMM avx2
+cglobal vp9_ipred_dc_32x32, 4, 4, 3, dst, stride, l, a
+    pxor            m1, m1
+    psadbw          m0, m1, [lq]
+    psadbw          m1, [aq]
+    movd            xm2, [pw_512]
+    paddw           m0, m1
+    vextracti128    xm1, m0, 1
+.main:
+    paddw           xm0, xm1
+    punpckhqdq      xm1, xm0, xm0
+    paddw           xm0, xm1
+    pmulhrsw        xm0, xm2
+    vpbroadcastb    m0, xm0
+.main2:
+    lea             r2, [strideq*3]
+    mov             r3d, 8
+.loop:
+    mova            [dstq+strideq*0], m0
+    mova            [dstq+strideq*1], m0
+    mova            [dstq+strideq*2], m0
+    mova            [dstq+r2 ], m0
+    lea             dstq, [dstq+strideq*4]
+    dec             r3d
+    jg              .loop
+    RET
+
+cglobal vp9_ipred_dc_top_32x32, 0, 4, 3, dst, stride, l, a
+    mov             lq, amp
+%if ARCH_X86_32
+    jmp             mangle(private_prefix %+ _vp9_ipred_dc_left_32x32 %+ SUFFIX).main
+%endif
+
+%assign function_align 1
+cglobal vp9_ipred_dc_left_32x32, 0, 4, 3, dst, stride, l, a
+    movifnidn       lq, lmp
+.main:
+    movifnidn       dstq, dstmp
+    movifnidn       strideq, stridemp
+    pxor            xm1, xm1
+    psadbw          xm0, xm1, [lq]
+    psadbw          xm1, [lq+16]
+    movd            xm2, [pw_1024]
+    jmp             mangle(private_prefix %+ _vp9_ipred_dc_32x32 %+ SUFFIX).main
+
+cglobal vp9_ipred_v_32x32, 2, 4, 3, dst, stride, l, a
+    movifnidn       aq, amp
+    mova            m0, [aq]
+    jmp             mangle(private_prefix %+ _vp9_ipred_dc_32x32 %+ SUFFIX).main2
+
+%assign function_align 16
+cglobal vp9_ipred_h_32x32, 3, 5, 6, dst, stride, l
+    vpbroadcastd    m2, [pb_3]
+    mov             r3d, 7
+    vpbroadcastd    m3, [pb_2]
+    pxor            m5, m5
+    vpbroadcastd    m4, [pb_1]
+    lea             r4, [strideq*3]
+.loop:
+    vpbroadcastd    m1, [lq+r3*4]
+    pshufb          m0, m1, m2
+    mova            [dstq+strideq*0], m0
+    pshufb          m0, m1, m3
+    mova            [dstq+strideq*1], m0
+    pshufb          m0, m1, m4
+    mova            [dstq+strideq*2], m0
+    pshufb          m1, m5
+    mova            [dstq+r4 ], m1
+    lea             dstq, [dstq+strideq*4]
+    dec             r3d
+    jge             .loop
+    RET
+
+cglobal vp9_ipred_tm_32x32, 4, 4, 8, dst, stride, l, a
+    vpbroadcastd    m0, [aq-1]
+    mova            m7, [aq]
+    pxor            m1, m1
+    vpbroadcastd    m4, [pw_m255]
+    mov             r3d, 15
+    vpbroadcastd    m5, [pw_m256]
+    pshufb          m0, m5
+    punpcklbw       m6, m7, m1
+    punpckhbw       m7, m1
+    psubw           m6, m0
+    psubw           m7, m0
+.loop:
+    vpbroadcastd    m3, [lq+r3*2]
+    pshufb          m2, m3, m4
+    pshufb          m3, m5
+    paddw           m0, m2, m6
+    paddw           m2, m7
+    paddw           m1, m3, m6
+    paddw           m3, m7
+    packuswb        m0, m2
+    packuswb        m1, m3
+    mova            [dstq+strideq*0], m0
+    mova            [dstq+strideq*1], m1
+    lea             dstq, [dstq+strideq*2]
+    dec             r3d
+    jge             .loop
+    RET
+
+cglobal vp9_ipred_dl_32x32, 2, 5, 6, dst, stride, l, a
+    movifnidn       aq, amp
+    vpbroadcastb    m2, [aq+31]
+    vinserti128     m3, m2, [aq+16], 0
+    mova            m0, [aq+ 0]
+    vpbroadcastd    m5, [pb_1]
+    palignr         m4, m3, m0, 2
+    lea             r3, [strideq*2]
+    palignr         m3, m0, 1
+    LOWPASS         0, 3, 4, 1, m5
+    lea             r4, [strideq*3]
+    vperm2i128      m1, m0, m2, 0x31
+    mov             r2d, 8
+.loop:
+    shufpd          m3, m0, m1, 0x05
+    mova            [dstq+r3*0], m0
+    punpckhqdq      m4, m1, m2
+    mova            [dstq+r3*4], m3
+    palignr         m0, m1, m0, 1
+    mova            [dstq+r3*8], m1
+    palignr         m1, m2, m1, 1
+    mova            [dstq+r4*8], m4
+    add             dstq, strideq
+    dec             r2d
+    jg              .loop
+    RET
+
+cglobal vp9_ipred_dr_32x32, 4, 5, 7, dst, stride, l, a
+    mova            m3, [lq+ 0]
+    movu            m1, [aq- 1]
+    mova            m0, [aq+ 0]
+    vpbroadcastd    m6, [pb_1]
+    vperm2i128      m2, m3, m1, 0x21
+    lea             r3, [strideq*2]
+    palignr         m4, m1, m2, 15
+    LOWPASS         0, 1, 4, 5, m6
+    pslldq          xm4, xm3, 1
+    palignr         m2, m3, 1
+    vinserti128     m4, [lq+15], 1
+    LOWPASS         2, 3, 4, 5, m6
+    lea             r4, [strideq*3]
+    vperm2i128      m1, m2, m0, 0x21
+    mov             r2d, 8
+.loop:
+    shufpd          m3, m1, m0, 0x05
+    mova            [dstq+r3*0], m0
+    shufpd          m4, m2, m1, 0x05
+    mova            [dstq+r3*4], m3
+    palignr         m0, m1, 15
+    mova            [dstq+r3*8], m1
+    palignr         m1, m2, 15
+    mova            [dstq+r4*8], m4
+    add             dstq, strideq
+    pslldq          m2, 1
+    dec             r2d
+    jg              .loop
+    RET
+
+cglobal vp9_ipred_hd_32x32, 4, 6, 7, dst, stride, l, a
+    movu            m1, [aq-1]
+    mova            m0, [lq]
+    vpbroadcastd    m6, [pb_1]
+    vperm2i128      m4, m0, m1, 0x21
+    palignr         m3, m4, m0, 1
+    palignr         m4, m0, 2
+    LOWPASS         4, 3, 0, 2, m6
+    pavgb           m3, m0
+    movu            xm0, [aq+15]
+    punpcklbw       m2, m3, m4
+    punpckhbw       m3, m4
+    palignr         m4, m0, m1, 2
+    palignr         m0, m1, 1
+    LOWPASS         4, 0, 1, 5, m6
+    lea             r2, [strideq*8]
+    vinserti128     m0, m2, xm3, 1
+    lea             r3, [dstq+r2*1]
+    vpblendd        m1, m2, m3, 0x0f
+    lea             r4, [dstq+r2*2]
+    vperm2i128      m2, m3, 0x31
+    lea             r5, [r3 +r2*2]
+    vperm2i128      m3, m4, 0x21
+.loop:
+    sub             r2, strideq
+    mova            [r5 +r2], m0
+    palignr         m0, m1, m0, 2
+    mova            [r4 +r2], m1
+    palignr         m1, m2, m1, 2
+    mova            [r3 +r2], m2
+    palignr         m2, m3, m2, 2
+    mova            [dstq+r2], m3
+    palignr         m3, m4, m3, 2
+    psrldq          m4, 2
+    jg              .loop
+    RET
+
+cglobal vp9_ipred_hu_32x32, 3, 5, 6, dst, stride, l, a
+    mova            m0, [lq]
+    vpbroadcastb    xm3, [lq+31]
+    vpbroadcastd    m1, [pb_1]
+    vbroadcasti128  m4, [pb_2toE_3xF]
+    vperm2i128      m3, m0, 0x03
+    palignr         m5, m3, m0, 2
+    palignr         m3, m0, 1
+    LOWPASS         5, 3, 0, 2, m1
+    vpbroadcastd    m1, [pb_15]
+    pavgb           m3, m0
+    punpcklbw       m2, m3, m5
+    punpckhbw       m3, m5
+    vinserti128     m0, m2, xm3, 1
+    pshufb          m5, m1
+    vperm2i128      m1, m2, m3, 0x12
+    lea             r3, [strideq*2]
+    vperm2i128      m2, m3, 0x31
+    lea             r4, [strideq*3]
+    vperm2i128      m3, m5, 0x31
+    mov             r2d, 8
+.loop:
+    mova            [dstq+r3*0], m0
+    palignr         m0, m1, m0, 2
+    mova            [dstq+r3*4], m1
+    palignr         m1, m2, m1, 2
+    mova            [dstq+r3*8], m2
+    palignr         m2, m3, m2, 2
+    mova            [dstq+r4*8], m3
+    pshufb          m3, m4
+    add             dstq, strideq
+    dec             r2d
+    jg              .loop
+    RET
+
+cglobal vp9_ipred_vl_32x32, 2, 5, 6, dst, stride, l, a
+    movifnidn       aq, amp
+    vpbroadcastb    m4, [aq+31]
+    vinserti128     m0, m4, [aq+16], 0
+    mova            m1, [aq+ 0]
+    vpbroadcastd    m5, [pb_1]
+    palignr         m2, m0, m1, 2
+    palignr         m0, m1, 1
+    LOWPASS         2, 0, 1, 3, m5
+    pavgb           m0, m1
+    lea             r3, [strideq*2]
+    vperm2i128      m1, m0, m4, 0x31
+    lea             r4, [strideq+r3*8]
+    vperm2i128      m3, m2, m4, 0x31
+    mov             r2d, 8
+.loop:
+    shufpd          m4, m0, m1, 0x05
+    mova            [dstq+strideq*0], m0
+    shufpd          m5, m2, m3, 0x05
+    mova            [dstq+strideq*1], m2
+    palignr         m0, m1, m0, 1
+    mova            [dstq+r3*8 ], m4
+    psrldq          m1, 1
+    mova            [dstq+r4 ], m5
+    palignr         m2, m3, m2, 1
+    add             dstq, r3
+    psrldq          m3, 1
+    dec             r2d
+    jg              .loop
+    RET
+
+cglobal vp9_ipred_vr_32x32, 4, 5, 7, dst, stride, l, a
+    mova            m4, [lq+ 0]
+    movu            m0, [aq- 1]
+    vpbroadcastd    m6, [pb_1]
+    vperm2i128      m2, m4, m0, 0x21
+    pslldq          xm5, xm4, 1
+    palignr         m3, m2, m4, 1
+    vinserti128     m5, [lq+15], 1
+    LOWPASS         3, 4, 5, 1, m6
+    mova            m1, [aq+ 0]
+    vbroadcasti128  m4, [pb_02468ACE_13579BDF]
+    palignr         m2, m0, m2, 15
+    LOWPASS         2, 0, 1, 5, m6
+    pshufb          m3, m4
+    lea             r3, [strideq*2]
+    vpermq          m3, m3, q2031
+    pavgb           m0, m1
+    vinserti128     m1, m3, xm0, 1
+    lea             r4, [strideq+r3*8]
+    vperm2i128      m3, m2, 0x21
+    mov             r2d, 8
+.loop:
+    shufpd          m4, m1, m0, 0x05
+    mova            [dstq+strideq*0], m0
+    shufpd          m5, m3, m2, 0x05
+    mova            [dstq+strideq*1], m2
+    palignr         m0, m1, 15
+    mova            [dstq+r3*8 ], m4
+    pslldq          m1, 1
+    mova            [dstq+r4 ], m5
+    palignr         m2, m3, 15
+    add             dstq, r3
+    pslldq          m3, 1
+    dec             r2d
+    jg              .loop
+    RET
+%endif
+
 ; FIXME 127, 128, 129 ?
-- 
2.49.1

_______________________________________________
ffmpeg-devel mailing list -- ffmpeg-devel@ffmpeg.org
To unsubscribe send an email to ffmpeg-devel-leave@ffmpeg.org
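
A note on the `LOWPASS` macro that the directional predictors in the patch rely on: its five instructions (`pxor`, `pand` with `pb_1`, `pavgb`, `psubusb`, `pavgb`) appear to compute the 3-tap filter `(left + 2*center + right + 2) >> 2` per byte without widening to 16 bits. The C sketch below models one byte lane under that reading and checks it exhaustively against the plain formula. The function names are hypothetical and only the instruction sequence is taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* One byte lane of pavgb: average rounded up. */
static uint8_t pavgb_lane(uint8_t a, uint8_t b)
{
    return (uint8_t)((a + b + 1) >> 1);
}

/* One byte lane of psubusb: unsigned subtract with saturation at 0. */
static uint8_t psubusb_lane(uint8_t a, uint8_t b)
{
    return a > b ? (uint8_t)(a - b) : 0;
}

/* Scalar model of the LOWPASS instruction sequence for one lane. */
static uint8_t lowpass_lane(uint8_t left, uint8_t center, uint8_t right)
{
    uint8_t tmp = (uint8_t)((left ^ right) & 1); /* pxor, then pand with pb_1 */
    uint8_t avg = pavgb_lane(left, right);       /* (left + right + 1) >> 1   */
    avg = psubusb_lane(avg, tmp);                /* undo the round-up         */
    return pavgb_lane(avg, center);              /* fold in the center tap    */
}

/* Plain reference for the 3-tap filter. */
static uint8_t lowpass_ref(uint8_t left, uint8_t center, uint8_t right)
{
    return (uint8_t)((left + 2 * center + right + 2) >> 2);
}

/* Compare the two for every combination of three byte values. */
static void check_lowpass(void)
{
    for (int l = 0; l < 256; l++)
        for (int c = 0; c < 256; c++)
            for (int r = 0; r < 256; r++)
                assert(lowpass_lane((uint8_t)l, (uint8_t)c, (uint8_t)r) ==
                       lowpass_ref((uint8_t)l, (uint8_t)c, (uint8_t)r));
}
```

The subtraction step matters because `pavgb` rounds up: without it, the two chained averages would round twice and could differ from the reference by one. Passing `pb_1` in a register through the new optional fifth macro argument, as the AVX2 functions do, leaves this arithmetic unchanged.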