From: Dmitriy Kovalenko <dmtr.kovalenko@outlook.com>
To: ffmpeg-devel@ffmpeg.org
Date: Fri, 30 May 2025 09:27:05 +0200
Message-ID: <DBAP193MB0956E4FE4B60344958D908F58D61A@DBAP193MB0956.EURP193.PROD.OUTLOOK.COM>
In-Reply-To: <20250530072706.15067-1-dmtr.kovalenko@outlook.com>
References: <20250530072706.15067-1-dmtr.kovalenko@outlook.com>
X-Mailer: git-send-email 2.49.0
Subject: [FFmpeg-devel] [PATCH v3 2/2] swscale: Neon rgb_to_yuv_half process 32 pixels at a time
Archived-At: <https://master.gitmailbox.com/ffmpegdev/DBAP193MB0956E4FE4B60344958D908F58D61A@DBAP193MB0956.EURP193.PROD.OUTLOOK.COM/>

This patch adds so-called double buffering: each loop iteration loads two
batches of 16 pixels and then processes them in parallel. On modern Arm
processors, especially Apple Silicon, this gives a visible benefit. It is
especially useful for subsampled pixel processing, because it allows the
elements to be read with two instructions and written with a single one
(most visible on platforms with slower memory, such as iOS).

With the previous patch in the stack, rgb_to_yuv_half in checkasm on a
MacBook Pro M4 Max reaches roughly 2x the speed of the C version.
---
 libswscale/aarch64/input.S | 103 ++++++++++++++++++++++++++++---------
 1 file changed, 79 insertions(+), 24 deletions(-)

diff --git a/libswscale/aarch64/input.S b/libswscale/aarch64/input.S
index dc07bd1b48..f305d87935 100644
--- a/libswscale/aarch64/input.S
+++ b/libswscale/aarch64/input.S
@@ -197,40 +197,94 @@ function ff_\fmt_rgb()ToUV_half_neon, export=1
         ldp             w12, w13, [x6, #20]     // w12: bu, w13: rv
         ldp             w14, w15, [x6, #28]     // w14: gv, w15: bv
 4:
-        cmp             w5, #8
         rgb_set_uv_coeff half=1
-        b.lt            2f
-1:  // load 16 pixels and prefetch memory for the next block
+
+        cmp             w5, #16
+        b.lt            2f                      // Go directly to scalar if < 16
+
+1:
     .if \element == 3
-        ld3             { v16.16b, v17.16b, v18.16b }, [x3], #48
-        prfm            pldl1strm, [x3, #48]
+        ld3             { v16.16b, v17.16b, v18.16b }, [x3], #48 // First 16 pixels
+        ld3             { v26.16b, v27.16b, v28.16b }, [x3], #48 // Second 16 pixels
+        prfm            pldl1keep, [x3, #96]
     .else
-        ld4             { v16.16b, v17.16b, v18.16b, v19.16b }, [x3], #64
-        prfm            pldl1strm, [x3, #64]
+        ld4             { v16.16b, v17.16b, v18.16b, v19.16b }, [x3], #64 // First 16 pixels
+        ld4             { v26.16b, v27.16b, v28.16b, v29.16b }, [x3], #64 // Second 16 pixels
+        prfm            pldl1keep, [x3, #128]
     .endif
 
+        // Sum adjacent pixel pairs
     .if \alpha_first
-        uaddlp          v21.8h, v19.16b         // v21: summed b pairs
-        uaddlp          v20.8h, v18.16b         // v20: summed g pairs
-        uaddlp          v19.8h, v17.16b         // v19: summed r pairs
+        uaddlp          v21.8h, v19.16b         // Block 1: B sums
+        uaddlp          v20.8h, v18.16b         // Block 1: G sums
+        uaddlp          v19.8h, v17.16b         // Block 1: R sums
+        uaddlp          v31.8h, v29.16b         // Block 2: B sums
+        uaddlp          v30.8h, v28.16b         // Block 2: G sums
+        uaddlp          v29.8h, v27.16b         // Block 2: R sums
     .else
-        uaddlp          v19.8h, v16.16b         // v19: summed r pairs
-        uaddlp          v20.8h, v17.16b         // v20: summed g pairs
-        uaddlp          v21.8h, v18.16b         // v21: summed b pairs
+        uaddlp          v19.8h, v16.16b         // Block 1: R sums
+        uaddlp          v20.8h, v17.16b         // Block 1: G sums
+        uaddlp          v21.8h, v18.16b         // Block 1: B sums
+        uaddlp          v29.8h, v26.16b         // Block 2: R sums
+        uaddlp          v30.8h, v27.16b         // Block 2: G sums
+        uaddlp          v31.8h, v28.16b         // Block 2: B sums
     .endif
 
-        mov             v22.16b, v6.16b         // U first half
-        mov             v23.16b, v6.16b         // U second half
-        mov             v24.16b, v6.16b         // V first half
-        mov             v25.16b, v6.16b         // V second half
+        // init accumulators for both blocks (AAPCS64: d8-d14 are callee-saved)
+        mov             v7.16b, v6.16b          // Block 1: U_low
+        mov             v8.16b, v6.16b          // Block 1: U_high
+        mov             v9.16b, v6.16b          // Block 1: V_low
+        mov             v10.16b, v6.16b         // Block 1: V_high
+        mov             v11.16b, v6.16b         // Block 2: U_low
+        mov             v12.16b, v6.16b         // Block 2: U_high
+        mov             v13.16b, v6.16b         // Block 2: V_low
+        mov             v14.16b, v6.16b         // Block 2: V_high
+
+        smlal           v7.4s, v0.4h, v19.4h    // U += ru * r (0-3)
+        smlal           v9.4s, v3.4h, v19.4h    // V += rv * r (0-3)
+        smlal           v11.4s, v0.4h, v29.4h   // U += ru * r (8-11)
+        smlal           v13.4s, v3.4h, v29.4h   // V += rv * r (8-11)
+
+        smlal2          v8.4s, v0.8h, v19.8h    // U += ru * r (4-7)
+        smlal2          v10.4s, v3.8h, v19.8h   // V += rv * r (4-7)
+        smlal2          v12.4s, v0.8h, v29.8h   // U += ru * r (12-15)
+        smlal2          v14.4s, v3.8h, v29.8h   // V += rv * r (12-15)
+
+        smlal           v7.4s, v1.4h, v20.4h    // U += gu * g (0-3)
+        smlal           v9.4s, v4.4h, v20.4h    // V += gv * g (0-3)
+        smlal           v11.4s, v1.4h, v30.4h   // U += gu * g (8-11)
+        smlal           v13.4s, v4.4h, v30.4h   // V += gv * g (8-11)
+
+        smlal2          v8.4s, v1.8h, v20.8h    // U += gu * g (4-7)
+        smlal2          v10.4s, v4.8h, v20.8h   // V += gv * g (4-7)
+        smlal2          v12.4s, v1.8h, v30.8h   // U += gu * g (12-15)
+        smlal2          v14.4s, v4.8h, v30.8h   // V += gv * g (12-15)
+
+        smlal           v7.4s, v2.4h, v21.4h    // U += bu * b (0-3)
+        smlal           v9.4s, v5.4h, v21.4h    // V += bv * b (0-3)
+        smlal           v11.4s, v2.4h, v31.4h   // U += bu * b (8-11)
+        smlal           v13.4s, v5.4h, v31.4h   // V += bv * b (8-11)
+
+        smlal2          v8.4s, v2.8h, v21.8h    // U += bu * b (4-7)
+        smlal2          v10.4s, v5.8h, v21.8h   // V += bv * b (4-7)
+        smlal2          v12.4s, v2.8h, v31.8h   // U += bu * b (12-15)
+        smlal2          v14.4s, v5.8h, v31.8h   // V += bv * b (12-15)
+
+        sqshrn          v16.4h, v7.4s, #10      // U (0-3)
+        sqshrn          v17.4h, v9.4s, #10      // V (0-3)
+        sqshrn          v22.4h, v11.4s, #10     // U (8-11)
+        sqshrn          v23.4h, v13.4s, #10     // V (8-11)
+
+        sqshrn2         v16.8h, v8.4s, #10      // U (0-7)
+        sqshrn2         v17.8h, v10.4s, #10     // V (0-7)
+        sqshrn2         v22.8h, v12.4s, #10     // U (8-15)
+        sqshrn2         v23.8h, v14.4s, #10     // V (8-15)
+
+        stp             q16, q22, [x0], #32     // Store all 16 U values
+        stp             q17, q23, [x1], #32     // Store all 16 V values
 
-        rgb_to_uv_interleaved_product v19, v20, v21, v0, v1, v2, v3, v4, v5, v22, v23, v24, v25, v16, v17, #10
-
-        str             q16, [x0], #16          // store dst_u
-        str             q17, [x1], #16          // store dst_v
-
-        sub             w5, w5, #8              // width -= 8
-        cmp             w5, #8                  // width >= 8 ?
+        sub             w5, w5, #16             // width -= 16
+        cmp             w5, #16                 // width >= 16 ?
         b.ge            1b
 
         cbz             w5, 3f                  // No pixels left? Exit
@@ -444,3 +498,4 @@ endfunc
 
 DISABLE_DOTPROD
 #endif
+
-- 
2.49.0
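
For readers who do not read AArch64 assembly, the loop structure in the
patch can be approximated by the scalar C sketch below. It is only a model
under stated assumptions: the UVCoeffs struct, its bias field (standing in
for the constant the assembly keeps in v6) and all function names are
hypothetical, only the 3-byte RGB layout is covered, and the saturation
performed by sqshrn is not reproduced. It is not the swscale C code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical coefficient bundle; the field names are illustrative only. */
typedef struct UVCoeffs {
    int32_t ru, gu, bu;   /* U coefficients */
    int32_t rv, gv, bv;   /* V coefficients */
    int32_t bias;         /* constant preloaded into every accumulator */
} UVCoeffs;

/* One subsampled output: sum two adjacent RGB24 pixels (uaddlp), run the
 * multiply-accumulate chain (smlal/smlal2), then shift right by 10. */
static void uv_half_pixel(const uint8_t *src, const UVCoeffs *c,
                          int16_t *u, int16_t *v)
{
    int32_t r = src[0] + src[3];
    int32_t g = src[1] + src[4];
    int32_t b = src[2] + src[5];

    *u = (int16_t)((c->bias + c->ru * r + c->gu * g + c->bu * b) >> 10);
    *v = (int16_t)((c->bias + c->rv * r + c->gv * g + c->bv * b) >> 10);
}

/* Straightforward reference: one output per iteration. */
static void uv_half_ref(const uint8_t *src, int width, const UVCoeffs *c,
                        int16_t *dst_u, int16_t *dst_v)
{
    assert(width >= 0);
    for (int i = 0; i < width; i++)
        uv_half_pixel(src + 6 * i, c, &dst_u[i], &dst_v[i]);
}

/* Blocked variant: the main loop consumes two blocks of 16 input pixels
 * (16 outputs in total) per iteration; the remainder goes to a tail loop,
 * mirroring the "width >= 16" check and the scalar fallback in the patch. */
static void uv_half_blocked(const uint8_t *src, int width, const UVCoeffs *c,
                            int16_t *dst_u, int16_t *dst_v)
{
    int i = 0;

    assert(width >= 0);
    for (; width - i >= 16; i += 16) {
        for (int k = 0; k < 8; k++) {
            /* Block 1 produces outputs i..i+7, block 2 outputs i+8..i+15. */
            uv_half_pixel(src + 6 * (i + k), c,
                          &dst_u[i + k], &dst_v[i + k]);
            uv_half_pixel(src + 6 * (i + 8 + k), c,
                          &dst_u[i + 8 + k], &dst_v[i + 8 + k]);
        }
    }
    for (; i < width; i++)
        uv_half_pixel(src + 6 * i, c, &dst_u[i], &dst_v[i]);
}
```

Both functions must produce identical output for any width. The blocked
form only changes how much independent work is available per iteration,
which is what the interleaved block 1 / block 2 instructions in the
assembly exploit.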