From mboxrd@z Thu Jan 1 00:00:00 1970
From: Dmitriy Kovalenko
To: ffmpeg-devel@ffmpeg.org
Cc: Dmitriy Kovalenko
Date: Sat, 31 May 2025 11:11:44 +0200
Subject: [FFmpeg-devel] [PATCH 1/2] swscale: rgb_to_yuv neon optimizations
X-Microsoft-Original-Message-ID: <20250531091631.45342-2-dmtr.kovalenko@outlook.com>
In-Reply-To: <20250531091631.45342-1-dmtr.kovalenko@outlook.com>
References: <20250531091631.45342-1-dmtr.kovalenko@outlook.com>
X-Mailer: git-send-email 2.49.0
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
List-Id: FFmpeg development discussions and patches
Reply-To: FFmpeg development discussions and patches

I've found quite a few ways to optimize FFmpeg's existing RGB to YUV
subsampled conversion. In this patch stack I'll try to improve the
performance.

This particular set of changes is a small improvement to all the
existing functions and macros. The biggest performance gain comes from
the post-load increment of the pointer (the immediate prefetching of
the memory blocks was moved to the next patch in the stack) and from
interleaving the multiplication and shifting operations of different
registers for better scheduling.
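To illustrate the interleaving mentioned above, here is a minimal scalar
C sketch (not part of the patch). It assumes 16-bit coefficients and a
shared rounding offset, and every name in it (UVCoeffs, sat16,
uv_sequential, uv_interleaved, orders_agree) is hypothetical. It only
shows that alternating the U and V multiply-accumulate steps over two
independent accumulators produces the same values as finishing U before
starting V; any speedup comes from instruction scheduling in the NEON
code, which this C model does not capture.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical coefficient bundle; the field names follow the register
 * comments in the patch (ru, gu, bu, rv, gv, bv). */
typedef struct { int16_t ru, gu, bu, rv, gv, bv; } UVCoeffs;

/* Clamp to the int16_t range, modelling the saturating narrow of sqshrn. */
static int16_t sat16(int32_t x)
{
    if (x > INT16_MAX) return INT16_MAX;
    if (x < INT16_MIN) return INT16_MIN;
    return (int16_t)x;
}

/* Sequential order: the U chain completes before the V chain starts.
 * Assumes >> on a negative int32_t is an arithmetic shift, which is
 * implementation-defined in C but holds on GCC and Clang. */
static void uv_sequential(int r, int g, int b, const UVCoeffs *c,
                          int32_t offset, int shift, int16_t *u, int16_t *v)
{
    assert(shift >= 0 && shift < 31);
    int32_t acc = offset;
    acc += c->ru * r;
    acc += c->gu * g;
    acc += c->bu * b;
    *u = sat16(acc >> shift);

    acc = offset;
    acc += c->rv * r;
    acc += c->gv * g;
    acc += c->bv * b;
    *v = sat16(acc >> shift);
}

/* Interleaved order: two independent accumulators, updated alternately,
 * so no step has to wait for the result of the step right before it. */
static void uv_interleaved(int r, int g, int b, const UVCoeffs *c,
                           int32_t offset, int shift, int16_t *u, int16_t *v)
{
    assert(shift >= 0 && shift < 31);
    int32_t acc_u = offset;
    int32_t acc_v = offset;
    acc_u += c->ru * r;
    acc_v += c->rv * r;
    acc_u += c->gu * g;
    acc_v += c->gv * g;
    acc_u += c->bu * b;
    acc_v += c->bv * b;
    *u = sat16(acc_u >> shift);
    *v = sat16(acc_v >> shift);
}

/* Returns 1 if both orders agree on a coarse grid of 8-bit r, g, b values. */
static int orders_agree(const UVCoeffs *c, int32_t offset, int shift)
{
    for (int r = 0; r < 256; r += 5)
        for (int g = 0; g < 256; g += 5)
            for (int b = 0; b < 256; b += 5) {
                int16_t u0, v0, u1, v1;
                uv_sequential(r, g, b, c, offset, shift, &u0, &v0);
                uv_interleaved(r, g, b, c, offset, shift, &u1, &v1);
                if (u0 != u1 || v0 != v1)
                    return 0;
            }
    return 1;
}
```

Calling orders_agree() with any coefficient set and the shift used by a
given path (#9 or #10 in the diff) is one way to convince yourself that
the reordering is purely a scheduling change.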
Also changed a bunch of places where cmp + b.le was used instead of a
single cbz instruction, plus some other small cleanups.

Here are checkasm results on a MacBook Pro with the latest M4 Max:

bgra_to_uv_1080_c:              257.5 ( 1.00x)
bgra_to_uv_1080_neon:           211.9 ( 1.22x)
bgra_to_uv_1920_c:              467.1 ( 1.00x)
bgra_to_uv_1920_neon:           379.3 ( 1.23x)
bgra_to_uv_half_1080_c:         198.9 ( 1.00x)
bgra_to_uv_half_1080_neon:      125.7 ( 1.58x)
bgra_to_uv_half_1920_c:         346.3 ( 1.00x)
bgra_to_uv_half_1920_neon:      223.7 ( 1.55x)

bgra_to_uv_1080_c:              268.3 ( 1.00x)
bgra_to_uv_1080_neon:           176.0 ( 1.53x)
bgra_to_uv_1920_c:              456.6 ( 1.00x)
bgra_to_uv_1920_neon:           307.7 ( 1.48x)
bgra_to_uv_half_1080_c:         193.2 ( 1.00x)
bgra_to_uv_half_1080_neon:       96.8 ( 2.00x)
bgra_to_uv_half_1920_c:         347.2 ( 1.00x)
bgra_to_uv_half_1920_neon:      182.6 ( 1.92x)

In my proprietary test on iOS it gives around a 70% performance
improvement when converting a 1920x1920 BGRA image to yuv420p.

On my Linux ARM Cortex-R processor the improvement is less visible, but
it is still consistently 5-10% faster than the current implementation.
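One detail of the cmp + b.le to cbz replacement is worth spelling out:
cbz branches only when the register is exactly zero, while cmp #0
followed by b.le also branches for negative values. The C sketch below
(hypothetical helper names, not part of the patch) models the two
early-exit conditions. They agree for every non-negative width; that
callers never pass a negative width is an assumption here, not
something the patch states.

```c
#include <assert.h>
#include <stdint.h>

/* Models `cmp w, #0` followed by `b.le`: skip for zero or negative widths. */
static int skips_with_cmp_ble(int32_t width)
{
    return width <= 0;
}

/* Models `cbz w`: skip only when the width is exactly zero. */
static int skips_with_cbz(int32_t width)
{
    return width == 0;
}

/* Asserts that both early-exit checks agree for widths 0..max_width. */
static void check_agreement_nonneg(int32_t max_width)
{
    for (int32_t w = 0; w <= max_width; w++)
        assert(skips_with_cmp_ble(w) == skips_with_cbz(w));
}
```

For a negative width the two models differ: skips_with_cmp_ble() reports
an early exit while skips_with_cbz() does not.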
---
 libswscale/aarch64/input.S | 143 +++++++++++++++++++++++--------------
 1 file changed, 91 insertions(+), 52 deletions(-)

diff --git a/libswscale/aarch64/input.S b/libswscale/aarch64/input.S
index c1c0adffc8..260a26e965 100644
--- a/libswscale/aarch64/input.S
+++ b/libswscale/aarch64/input.S
@@ -22,9 +22,9 @@
 
 .macro rgb_to_yuv_load_rgb src, element=3
     .if \element == 3
-        ld3             { v16.16b, v17.16b, v18.16b }, [\src]
+        ld3             { v16.16b, v17.16b, v18.16b }, [\src], #48
     .else
-        ld4             { v16.16b, v17.16b, v18.16b, v19.16b }, [\src]
+        ld4             { v16.16b, v17.16b, v18.16b, v19.16b }, [\src], #64
     .endif
         uxtl            v19.8h, v16.8b               // v19: r
         uxtl            v20.8h, v17.8b               // v20: g
@@ -35,7 +35,7 @@
 .endm
 
 .macro argb_to_yuv_load_rgb src
-        ld4             { v16.16b, v17.16b, v18.16b, v19.16b }, [\src]
+        ld4             { v16.16b, v17.16b, v18.16b, v19.16b }, [\src], #64
         uxtl            v21.8h, v19.8b               // v21: b
         uxtl2           v24.8h, v19.16b              // v24: b
         uxtl            v19.8h, v17.8b               // v19: r
@@ -57,20 +57,41 @@
         sqshrn2         \dst\().8h, \dst2\().4s, \right_shift   // dst_higher_half = dst2 >> right_shift
 .endm
 
+// interleaved product version of the rgb to yuv gives slightly better performance on non-performant mobile cores
+.macro rgb_to_uv_interleaved_product r, g, b, u_coef0, u_coef1, u_coef2, v_coef0, v_coef1, v_coef2, u_dst1, u_dst2, v_dst1, v_dst2, u_dst, v_dst, right_shift
+        smlal           \u_dst1\().4s, \u_coef0\().4h, \r\().4h     // U += ru * r (first 4)
+        smlal           \v_dst1\().4s, \v_coef0\().4h, \r\().4h     // V += rv * r (first 4)
+        smlal2          \u_dst2\().4s, \u_coef0\().8h, \r\().8h     // U += ru * r (second 4)
+        smlal2          \v_dst2\().4s, \v_coef0\().8h, \r\().8h     // V += rv * r (second 4)
+
+        smlal           \u_dst1\().4s, \u_coef1\().4h, \g\().4h     // U += gu * g (first 4)
+        smlal           \v_dst1\().4s, \v_coef1\().4h, \g\().4h     // V += gv * g (first 4)
+        smlal2          \u_dst2\().4s, \u_coef1\().8h, \g\().8h     // U += gu * g (second 4)
+        smlal2          \v_dst2\().4s, \v_coef1\().8h, \g\().8h     // V += gv * g (second 4)
+
+        smlal           \u_dst1\().4s, \u_coef2\().4h, \b\().4h     // U += bu * b (first 4)
+        smlal           \v_dst1\().4s, \v_coef2\().4h, \b\().4h     // V += bv * b (first 4)
+        smlal2          \u_dst2\().4s, \u_coef2\().8h, \b\().8h     // U += bu * b (second 4)
+        smlal2          \v_dst2\().4s, \v_coef2\().8h, \b\().8h     // V += bv * b (second 4)
+
+        sqshrn          \u_dst\().4h, \u_dst1\().4s, \right_shift   // U first 4 pixels
+        sqshrn2         \u_dst\().8h, \u_dst2\().4s, \right_shift   // U all 8 pixels
+        sqshrn          \v_dst\().4h, \v_dst1\().4s, \right_shift   // V first 4 pixels
+        sqshrn2         \v_dst\().8h, \v_dst2\().4s, \right_shift   // V all 8 pixels
+.endm
+
 .macro rgbToY_neon fmt_bgr, fmt_rgb, element, alpha_first=0
 function ff_\fmt_bgr\()ToY_neon, export=1
-        cmp             w4, #0                  // check width > 0
+        cbz             w4, 3f                  // check width > 0
         ldp             w12, w11, [x5]          // w12: ry, w11: gy
         ldr             w10, [x5, #8]           // w10: by
-        b.gt            4f
-        ret
+        b               4f
 endfunc
 
 function ff_\fmt_rgb\()ToY_neon, export=1
-        cmp             w4, #0                  // check width > 0
+        cbz             w4, 3f                  // check width > 0
         ldp             w10, w11, [x5]          // w10: ry, w11: gy
         ldr             w12, [x5, #8]           // w12: by
-        b.le            3f
 4:
         mov             w9, #256                // w9 = 1 << (RGB2YUV_SHIFT - 7)
         movk            w9, #8, lsl #16         // w9 += 32 << (RGB2YUV_SHIFT - 1)
@@ -90,7 +111,6 @@ function ff_\fmt_rgb\()ToY_neon, export=1
         rgb_to_yuv_product v19, v20, v21, v25, v26, v16, v0, v1, v2, #9
         rgb_to_yuv_product v22, v23, v24, v27, v28, v17, v0, v1, v2, #9
         sub             w4, w4, #16             // width -= 16
-        add             x1, x1, #(16*\element)
         cmp             w4, #16                 // width >= 16 ?
         stp             q16, q17, [x0], #32     // store to dst
         b.ge            1b
@@ -158,8 +178,7 @@ rgbToY_neon abgr32, argb32, element=4, alpha_first=1
 
 .macro rgbToUV_half_neon fmt_bgr, fmt_rgb, element, alpha_first=0
 function ff_\fmt_bgr\()ToUV_half_neon, export=1
-        cmp             w5, #0                  // check width > 0
-        b.le            3f
+        cbz             w5, 3f                  // check width > 0
 
         ldp             w12, w11, [x6, #12]
         ldp             w10, w15, [x6, #20]
@@ -168,7 +187,7 @@ function ff_\fmt_bgr\()ToUV_half_neon, export=1
 endfunc
 
 function ff_\fmt_rgb\()ToUV_half_neon, export=1
-        cmp             w5, #0          // check width > 0
+        cmp             w5, #0                  // check width > 0
         b.le            3f
 
         ldp             w10, w11, [x6, #12]     // w10: ru, w11: gu
@@ -178,32 +197,39 @@ function ff_\fmt_rgb\()ToUV_half_neon, export=1
         cmp             w5, #8
         rgb_set_uv_coeff half=1
         b.lt            2f
-1:
+1:  // load 16 pixels
     .if \element == 3
-        ld3             { v16.16b, v17.16b, v18.16b }, [x3]
+        ld3             { v16.16b, v17.16b, v18.16b }, [x3], #48
     .else
-        ld4             { v16.16b, v17.16b, v18.16b, v19.16b }, [x3]
+        ld4             { v16.16b, v17.16b, v18.16b, v19.16b }, [x3], #64
     .endif
+
     .if \alpha_first
-        uaddlp          v21.8h, v19.16b
-        uaddlp          v20.8h, v18.16b
-        uaddlp          v19.8h, v17.16b
+        uaddlp          v21.8h, v19.16b         // v21: summed b pairs
+        uaddlp          v20.8h, v18.16b         // v20: summed g pairs
+        uaddlp          v19.8h, v17.16b         // v19: summed r pairs
     .else
-        uaddlp          v19.8h, v16.16b         // v19: r
-        uaddlp          v20.8h, v17.16b         // v20: g
-        uaddlp          v21.8h, v18.16b         // v21: b
+        uaddlp          v19.8h, v16.16b         // v19: summed r pairs
+        uaddlp          v20.8h, v17.16b         // v20: summed g pairs
+        uaddlp          v21.8h, v18.16b         // v21: summed b pairs
     .endif
 
-        rgb_to_yuv_product v19, v20, v21, v22, v23, v16, v0, v1, v2, #10
-        rgb_to_yuv_product v19, v20, v21, v24, v25, v17, v3, v4, v5, #10
-        sub             w5, w5, #8              // width -= 8
-        add             x3, x3, #(16*\element)
-        cmp             w5, #8                  // width >= 8 ?
+        mov             v22.16b, v6.16b         // U first half
+        mov             v23.16b, v6.16b         // U second half
+        mov             v24.16b, v6.16b         // V first half
+        mov             v25.16b, v6.16b         // V second half
+
+        rgb_to_uv_interleaved_product v19, v20, v21, v0, v1, v2, v3, v4, v5, v22, v23, v24, v25, v16, v17, #10
+
         str             q16, [x0], #16          // store dst_u
         str             q17, [x1], #16          // store dst_v
+
+        sub             w5, w5, #8              // width -= 8
+        cmp             w5, #8                  // width >= 8 ?
         b.ge            1b
-        cbz             w5, 3f
-2:
+        cbz             w5, 3f                  // No pixels left? Exit
+
+2:  // Scalar fallback for remaining pixels
     .if \alpha_first
         rgb_load_add_half 1, 5, 2, 6, 3, 7
     .else
@@ -213,21 +239,24 @@ function ff_\fmt_rgb\()ToUV_half_neon, export=1
         rgb_load_add_half 0, 4, 1, 5, 2, 6
     .endif
     .endif
-
         smaddl          x8, w2, w10, x9         // dst_u = ru * r + const_offset
+        smaddl          x16, w2, w13, x9        // dst_v = rv * r + const_offset (parallel)
+
         smaddl          x8, w4, w11, x8         // dst_u += gu * g
+        smaddl          x16, w4, w14, x16       // dst_v += gv * g (parallel)
+
         smaddl          x8, w7, w12, x8         // dst_u += bu * b
-        asr             x8, x8, #10             // dst_u >>= 10
+        smaddl          x16, w7, w15, x16       // dst_v += bv * b (parallel)
+
+        asr             w8, w8, #10             // dst_u >>= 10
+        asr             w16, w16, #10           // dst_v >>= 10
+
         strh            w8, [x0], #2            // store dst_u
+        strh            w16, [x1], #2           // store dst_v
 
-        smaddl          x8, w2, w13, x9         // dst_v = rv * r + const_offset
-        smaddl          x8, w4, w14, x8         // dst_v += gv * g
-        smaddl          x8, w7, w15, x8         // dst_v += bv * b
-        asr             x8, x8, #10             // dst_v >>= 10
-        sub             w5, w5, #1
-        add             x3, x3, #(2*\element)
-        strh            w8, [x1], #2            // store dst_v
-        cbnz            w5, 2b
+        sub             w5, w5, #1              // width--
+        add             x3, x3, #(2*\element)   // Advance source pointer
+        cbnz            w5, 2b                  // Process next pixel if any left
 3:
         ret
 endfunc
@@ -244,9 +273,9 @@ function ff_\fmt_bgr\()ToUV_neon, export=1
         cmp             w5, #0                  // check width > 0
         b.le            3f
 
-        ldp             w12, w11, [x6, #12]
-        ldp             w10, w15, [x6, #20]
-        ldp             w14, w13, [x6, #28]
+        ldp             w12, w11, [x6, #12]     // bu, gu
+        ldp             w10, w15, [x6, #20]     // ru, bv
+        ldp             w14, w13, [x6, #28]     // gv, rv
         b               4f
 endfunc
 
@@ -267,17 +296,26 @@ function ff_\fmt_rgb\()ToUV_neon, export=1
     .else
         rgb_to_yuv_load_rgb x3, \element
     .endif
-        rgb_to_yuv_product v19, v20, v21, v25, v26, v16, v0, v1, v2, #9
-        rgb_to_yuv_product v22, v23, v24, v27, v28, v17, v0, v1, v2, #9
-        rgb_to_yuv_product v19, v20, v21, v25, v26, v18, v3, v4, v5, #9
-        rgb_to_yuv_product v22, v23, v24, v27, v28, v19, v3, v4, v5, #9
-        sub             w5, w5, #16
-        add             x3, x3, #(16*\element)
-        cmp             w5, #16
-        stp             q16, q17, [x0], #32     // store to dst_u
-        stp             q18, q19, [x1], #32     // store to dst_v
+        // process 2 groups of 8 pixels
+        mov             v25.16b, v6.16b         // U_dst1 = const_offset (32-bit accumulators)
+        mov             v26.16b, v6.16b         // U_dst2 = const_offset
+        mov             v27.16b, v6.16b         // V_dst1 = const_offset
+        mov             v28.16b, v6.16b         // V_dst2 = const_offset
+        rgb_to_uv_interleaved_product v19, v20, v21, v0, v1, v2, v3, v4, v5, v25, v26, v27, v28, v16, v18, #9
+
+        mov             v25.16b, v6.16b
+        mov             v26.16b, v6.16b
+        mov             v27.16b, v6.16b
+        mov             v28.16b, v6.16b
+        rgb_to_uv_interleaved_product v22, v23, v24, v0, v1, v2, v3, v4, v5, v25, v26, v27, v28, v17, v19, #9
+
+        sub             w5, w5, #16             // width -= 16
+        cmp             w5, #16                 // width >= 16 ?
+        stp             q16, q17, [x0], #32     // store to dst_u (post-increment)
+        stp             q18, q19, [x1], #32     // store to dst_v (post-increment)
         b.ge            1b
-        cbz             w5, 3f
+        cbz             w5, 3f                  // No pixels left? Exit
+
 2:
     .if \alpha_first
         ldrb            w16, [x3, #1]           // w16: r
@@ -292,7 +330,7 @@ function ff_\fmt_rgb\()ToUV_neon, export=1
         smaddl          x8, w16, w10, x9        // x8 = ru * r + const_offset
         smaddl          x8, w17, w11, x8        // x8 += gu * g
         smaddl          x8, w4, w12, x8         // x8 += bu * b
-        asr             w8, w8, #9              // x8 >>= 9
+        asr             x8, x8, #9              // x8 >>= 9
         strh            w8, [x0], #2            // store to dst_u
 
         smaddl          x8, w16, w13, x9        // x8 = rv * r + const_offset
@@ -401,3 +439,4 @@ endfunc
 
 DISABLE_DOTPROD
 #endif
+
-- 
2.49.0

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".