From: Shreesh Adiga <16567adigashreesh@gmail.com>
To: ffmpeg-devel@ffmpeg.org
Date: Thu, 20 Feb 2025 18:51:38 +0530
Message-ID: <20250220132138.96479-1-16567adigashreesh@gmail.com>
Subject: [FFmpeg-devel] [PATCH] swscale/x86/rgb2rgb: optimize AVX2 version of uyvytoyuv422

Currently the SIMD loop of the AVX2 version of uyvytoyuv422 does the following:
4 vinserti128 to interleave the vector lanes during the load from memory.
4 vperm2i128 inside 4 RSHIFT_COPY calls to achieve the desired layout.

This patch replaces the above 8 instructions with 2 vpermq and 2 vpermd
(the latter taking its indices from a vector register), similar to the
AVX512ICL version.

Observed the following numbers on various microarchitectures:

On AMD Zen3 laptop:
Before:
uyvytoyuv422_c:           51979.7 ( 1.00x)
uyvytoyuv422_sse2:         5410.5 ( 9.61x)
uyvytoyuv422_avx:          4642.7 (11.20x)
uyvytoyuv422_avx2:         4249.0 (12.23x)

After:
uyvytoyuv422_c:           51659.8 ( 1.00x)
uyvytoyuv422_sse2:         5420.8 ( 9.53x)
uyvytoyuv422_avx:          4651.2 (11.11x)
uyvytoyuv422_avx2:         3953.8 (13.07x)

On Intel Macbook Pro 2019:
Before:
uyvytoyuv422_c:          185014.4 ( 1.00x)
uyvytoyuv422_sse2:        22800.4 ( 8.11x)
uyvytoyuv422_avx:         19796.9 ( 9.35x)
uyvytoyuv422_avx2:        13141.9 (14.08x)

After:
uyvytoyuv422_c:          185093.4 ( 1.00x)
uyvytoyuv422_sse2:        22795.4 ( 8.12x)
uyvytoyuv422_avx:         19791.9 ( 9.35x)
uyvytoyuv422_avx2:        12043.1 (15.37x)

On AMD Zen4 desktop:
Before:
uyvytoyuv422_c:           29105.0 ( 1.00x)
uyvytoyuv422_sse2:         3888.0 ( 7.49x)
uyvytoyuv422_avx:          3374.2 ( 8.63x)
uyvytoyuv422_avx2:         2649.8 (10.98x)
uyvytoyuv422_avx512icl:    1615.0 (18.02x)

After:
uyvytoyuv422_c:           29093.4 ( 1.00x)
uyvytoyuv422_sse2:         3874.4 ( 7.51x)
uyvytoyuv422_avx:          3371.6 ( 8.63x)
uyvytoyuv422_avx2:         2174.6 (13.38x)
uyvytoyuv422_avx512icl:    1625.1 (17.90x)

Signed-off-by: Shreesh Adiga <16567adigashreesh@gmail.com>
---
 libswscale/x86/rgb_2_rgb.asm | 68 ++++++++++++++++++------------------
 1 file changed, 34 insertions(+), 34 deletions(-)

diff --git a/libswscale/x86/rgb_2_rgb.asm b/libswscale/x86/rgb_2_rgb.asm
index 6e4df17298..871bb21127 100644
--- a/libswscale/x86/rgb_2_rgb.asm
+++ b/libswscale/x86/rgb_2_rgb.asm
@@ -49,18 +49,21 @@ shuf_perm2b: db 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25,
                  97, 99, 101, 103, 105, 107, 109, 111, 113, 115, 117, 119, 121, 123, 125, 127
 %endif
 
+%if HAVE_AVX2_EXTERNAL
+; shuffle vector to rearrange packuswb result to be linear
+shuf_packus_avx2: db 0, 0, 0, 0, 4, 0, 0, 0, 1, 0, 0, 0, 5, 0, 0, 0,\
+                     2, 0, 0, 0, 6, 0, 0, 0, 3, 0, 0, 0, 7, 0, 0, 0,
+%endif
+
 SECTION .text
 
-%macro RSHIFT_COPY 5
+%macro RSHIFT_COPY 3
 ; %1 dst ; %2 src ; %3 shift
-%if mmsize == 32
-    vperm2i128 %1, %2, %3, %5
-    RSHIFT %1, %4
-%elif cpuflag(avx)
-    psrldq %1, %2, %4
+%if cpuflag(avx) || cpuflag(avx2) || cpuflag(avx512icl)
+    psrldq %1, %2, %3
 %else
     mova %1, %2
-    RSHIFT %1, %4
+    RSHIFT %1, %3
 %endif
 %endmacro
 
@@ -170,18 +173,16 @@ SHUFFLE_BYTES 1, 2, 0, 3
 ; int lumStride, int chromStride, int srcStride)
 ;-----------------------------------------------------------------------------------------------
 %macro UYVY_TO_YUV422 0
-%if mmsize == 64
-; need two more registers to store shuffle vectors for AVX512ICL
-cglobal uyvytoyuv422, 9, 14, 10, ydst, udst, vdst, src, w, h, lum_stride, chrom_stride, src_stride, wtwo, whalf, tmp, x, back_w
-%else
-cglobal uyvytoyuv422, 9, 14, 8, ydst, udst, vdst, src, w, h, lum_stride, chrom_stride, src_stride, wtwo, whalf, tmp, x, back_w
-%endif
+cglobal uyvytoyuv422, 9, 14, 8 + cpuflag(avx2) + cpuflag(avx512icl), ydst, udst, vdst, src, w, h, lum_stride, chrom_stride, src_stride, wtwo, whalf, tmp, x, back_w
     pxor m0, m0
 %if mmsize == 64
     vpternlogd m1, m1, m1, 0xff ; m1 = _mm512_set1_epi8(0xff)
     movu m8, [shuf_packus]
     movu m9, [shuf_perm2b]
 %else
+    %if cpuflag(avx2)
+        movu m8, [shuf_packus_avx2]
+    %endif
     pcmpeqw m1, m1
 %endif
     psrlw m1, 8
@@ -295,21 +296,10 @@ cglobal uyvytoyuv422, 9, 14, 8, ydst, udst, vdst, src, w, h, lum_stride, chrom_s
     jge .end_line
 
     .loop_simd:
-%if mmsize == 32
-    movu xm2, [srcq + wtwoq         ]
-    movu xm3, [srcq + wtwoq + 16    ]
-    movu xm4, [srcq + wtwoq + 16 * 2]
-    movu xm5, [srcq + wtwoq + 16 * 3]
-    vinserti128 m2, m2, [srcq + wtwoq + 16 * 4], 1
-    vinserti128 m3, m3, [srcq + wtwoq + 16 * 5], 1
-    vinserti128 m4, m4, [srcq + wtwoq + 16 * 6], 1
-    vinserti128 m5, m5, [srcq + wtwoq + 16 * 7], 1
-%else
     movu m2, [srcq + wtwoq             ]
     movu m3, [srcq + wtwoq + mmsize    ]
     movu m4, [srcq + wtwoq + mmsize * 2]
     movu m5, [srcq + wtwoq + mmsize * 3]
-%endif
 
 %if mmsize == 64
     ; extract y part 1
@@ -323,23 +313,29 @@ cglobal uyvytoyuv422, 9, 14, 8, ydst, udst, vdst, src, w, h, lum_stride, chrom_s
     movu [ydstq + wq + mmsize], m7
 %else
     ; extract y part 1
-    RSHIFT_COPY m6, m2, m4, 1, 0x20 ; UYVY UYVY -> YVYU YVY...
-    pand m6, m1; YxYx YxYx...
+    RSHIFT_COPY m6, m2, 1 ; UYVY UYVY -> YVYU YVY...
+    pand m6, m1 ; YxYx YxYx...
 
-    RSHIFT_COPY m7, m3, m5, 1, 0x20 ; UYVY UYVY -> YVYU YVY...
-    pand m7, m1 ; YxYx YxYx...
+    RSHIFT_COPY m7, m3, 1 ; UYVY UYVY -> YVYU YVY...
+    pand m7, m1 ; YxYx YxYx...
 
-    packuswb m6, m7 ; YYYY YYYY...
+    packuswb m6, m7 ; YYYY YYYY...
+%if mmsize == 32
+    vpermq m6, m6, 0xd8
+%endif
     movu [ydstq + wq], m6
 
     ; extract y part 2
-    RSHIFT_COPY m6, m4, m2, 1, 0x13 ; UYVY UYVY -> YVYU YVY...
-    pand m6, m1; YxYx YxYx...
+    RSHIFT_COPY m6, m4, 1 ; UYVY UYVY -> YVYU YVY...
+    pand m6, m1 ; YxYx YxYx...
 
-    RSHIFT_COPY m7, m5, m3, 1, 0x13 ; UYVY UYVY -> YVYU YVY...
-    pand m7, m1 ; YxYx YxYx...
+    RSHIFT_COPY m7, m5, 1 ; UYVY UYVY -> YVYU YVY...
+    pand m7, m1 ; YxYx YxYx...
 
-    packuswb m6, m7 ; YYYY YYYY...
+    packuswb m6, m7 ; YYYY YYYY...
+%if mmsize == 32
+    vpermq m6, m6, 0xd8
+%endif
     movu [ydstq + wq + mmsize], m6
 %endif
 
@@ -359,6 +355,8 @@ cglobal uyvytoyuv422, 9, 14, 8, ydst, udst, vdst, src, w, h, lum_stride, chrom_s
     packuswb m6, m7 ; UUUU
 %if mmsize == 64
     vpermb m6, m8, m6
+%elif mmsize == 32
+    vpermd m6, m8, m6
 %endif
     movu [udstq + whalfq], m6
 
@@ -369,6 +367,8 @@ cglobal uyvytoyuv422, 9, 14, 8, ydst, udst, vdst, src, w, h, lum_stride, chrom_s
     packuswb m2, m4 ; VVVV
 %if mmsize == 64
     vpermb m2, m8, m2
+%elif mmsize == 32
+    vpermd m2, m8, m2
 %endif
     movu [vdstq + whalfq], m2
 
-- 
2.45.3
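
For anyone who wants to sanity-check the two new permutes without running checkasm, the lane behaviour can be modelled in scalar Python. This is only a rough sketch under stated assumptions: the helper functions below are hypothetical stand-ins that mimic the named instructions on lists of 32 bytes, the test pattern is made up, and the U/V extraction sequence (pand, packuswb, psrlw) is assumed from the parts of the loop that the hunks show only partially. It illustrates why packuswb on linearly loaded ymm registers needs vpermq 0xd8 for Y and the shuf_packus_avx2 indices for U and V.

```python
LANE = 16  # bytes per 128-bit lane of a 256-bit ymm register

def lanes(v):
    return [v[i:i + LANE] for i in range(0, len(v), LANE)]

def psrldq(v, n):
    """Byte shift right by n within each 128-bit lane, zero filled."""
    return [b for lane in lanes(v) for b in lane[n:] + [0] * n]

def pand_00ff(v):
    """pand with m1 (0x00ff in every word): keep the low byte of each word."""
    return [b if i % 2 == 0 else 0 for i, b in enumerate(v)]

def psrlw8(v):
    """psrlw by 8: move the high byte of each word into the low byte."""
    return [v[i + 1] if i % 2 == 0 else 0 for i in range(len(v))]

def packuswb(a, b):
    """Pack signed words to unsigned bytes with saturation, lane by lane."""
    def pack(lane):
        out = []
        for i in range(0, LANE, 2):
            w = lane[i] | (lane[i + 1] << 8)
            w = w - 0x10000 if w >= 0x8000 else w
            out.append(min(max(w, 0), 255))
        return out
    return [x for la, lb in zip(lanes(a), lanes(b)) for x in pack(la) + pack(lb)]

def vpermq(v, imm):
    """Select four qwords using 2-bit fields of the immediate."""
    q = [v[i:i + 8] for i in range(0, 32, 8)]
    return [b for i in range(4) for b in q[(imm >> (2 * i)) & 3]]

def vpermd(v, idx):
    """Select eight dwords using the low 3 bits of each index."""
    d = [v[i:i + 4] for i in range(0, 32, 4)]
    return [b for i in idx for b in d[i & 7]]

# shuf_packus_avx2 from the patch, decoded as eight little-endian dwords.
SHUF_BYTES = [0, 0, 0, 0, 4, 0, 0, 0, 1, 0, 0, 0, 5, 0, 0, 0,
              2, 0, 0, 0, 6, 0, 0, 0, 3, 0, 0, 0, 7, 0, 0, 0]
SHUF_IDX = [int.from_bytes(bytes(SHUF_BYTES[i:i + 4]), "little")
            for i in range(0, 32, 4)]

# Made-up UYVY test data: group g carries U=100+g, Y=2g, V=200+g, Y=2g+1.
src = [x for g in range(32) for x in (100 + g, 2 * g, 200 + g, 2 * g + 1)]
m2, m3, m4, m5 = (src[i:i + 32] for i in range(0, 128, 32))

# Y part 1: shift, mask, pack. The pack leaves the qwords in order 0, 2, 1, 3.
y_raw = packuswb(pand_00ff(psrldq(m2, 1)), pand_00ff(psrldq(m3, 1)))
y = vpermq(y_raw, 0xD8)

# U and V: two pack stages, so the dwords come out as 0, 2, 4, 6, 1, 3, 5, 7.
uv01 = packuswb(pand_00ff(m2), pand_00ff(m3))
uv23 = packuswb(pand_00ff(m4), pand_00ff(m5))
u_raw = packuswb(pand_00ff(uv01), pand_00ff(uv23))
v_raw = packuswb(psrlw8(uv01), psrlw8(uv23))
u = vpermd(u_raw, SHUF_IDX)
v = vpermd(v_raw, SHUF_IDX)

assert y_raw != list(range(32)) and y == list(range(32))
assert u == [100 + g for g in range(32)]
assert v == [200 + g for g in range(32)]
print(SHUF_IDX)    # [0, 4, 1, 5, 2, 6, 3, 7]
print(y_raw[::8])  # [0, 16, 8, 24]
```

The asserts pass only when the permuted outputs are linear, and the second print shows the first byte of each qword before vpermq, i.e. the lane-interleaved order that the old vinserti128/vperm2i128 sequence avoided by construction.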