From: Logaprakash Ramajayam
To: FFmpeg development discussions and patches
Date: Wed, 2 Jul 2025 09:27:32 +0000
Subject: Re: [FFmpeg-devel] [PATCH v2 1/1] swscale/aarch64/output: Implement neon assembly for yuv2planeX_10_c_template()

From 3e14b4c2e763d2d0c8979e3e99578f5492b7130c Mon Sep 17 00:00:00 2001
From: Logaprakash Ramajayam
Date: Tue, 1 Jul 2025 23:48:36
 -0700
Subject: [PATCH v2 1/1] swscale/aarch64/output: Implement neon assembly for yuv2planeX_10_c_template()

---
 libswscale/aarch64/output.S  | 189 +++++++++++++++++++++++++++++++++++
 libswscale/aarch64/swscale.c |  38 +++++++
 tests/checkasm/sw_scale.c    | 170 ++++++++++++++++++++-----------
 3 files changed, 337 insertions(+), 60 deletions(-)

diff --git a/libswscale/aarch64/output.S b/libswscale/aarch64/output.S
index 190c438870..2aad420db2 100644
--- a/libswscale/aarch64/output.S
+++ b/libswscale/aarch64/output.S
@@ -20,6 +20,195 @@
 
 #include "libavutil/aarch64/asm.S"
 
+function ff_yuv2planeX_10_neon, export=1
+// x0 = filter (int16_t*)
+// w1 = filterSize
+// x2 = src (int16_t**)
+// x3 = dest (uint16_t*)
+// w4 = dstW
+// w5 = big_endian
+// w6 = output_bits
+
+        mov             w8, #27
+        sub             w8, w8, w6                  // shift = 11 + 16 - output_bits
+
+        sub             w9, w8, #1
+        mov             w10, #1
+        lsl             w9, w10, w9                 // val = 1 << (shift - 1)
+
+        dup             v1.4s, w9
+        dup             v2.4s, w9                   // Create vectors with val
+
+        neg             w16, w8
+        dup             v20.4s, w16                 // Create (-shift) vector for right shift
+
+        mov             w10, #1
+        lsl             w10, w10, w6
+        sub             w10, w10, #1                // (1U << output_bits) - 1
+        dup             v21.4s, w10                 // Create Clip vector for upper bound
+
+        mov             x7, #0                      // i = 0
+
+1:
+        cmp             w4, #16                     // Process 16-pixels if available
+        blt             4f
+
+        mov             v3.16b, v1.16b
+        mov             v4.16b, v2.16b
+        mov             v5.16b, v1.16b
+        mov             v6.16b, v2.16b
+
+        mov             w11, w1                     // tmpfilterSize = filterSize
+        mov             x12, x2                     // srcp = src
+        mov             x13, x0                     // filterp = filter
+
+2:      // Filter loop
+
+        ldp             x14, x15, [x12], #16        // get 2 pointers: src[j] and src[j+1]
+        ldr             s7, [x13], #4               // load filter coefficients
+        add             x14, x14, x7, lsl #1
+        add             x15, x15, x7, lsl #1
+        ld1             {v16.8h, v17.8h}, [x14]
+        ld1             {v18.8h, v19.8h}, [x15]
+
+        // Multiply-accumulate
+        smlal           v3.4s, v16.4h, v7.h[0]
+        smlal2          v4.4s, v16.8h, v7.h[0]
+        smlal           v5.4s, v17.4h, v7.h[0]
+        smlal2          v6.4s, v17.8h, v7.h[0]
+
+        smlal           v3.4s, v18.4h, v7.h[1]
+        smlal2          v4.4s, v18.8h, v7.h[1]
+        smlal           v5.4s, v19.4h, v7.h[1]
+        smlal2          v6.4s, v19.8h, v7.h[1]
+
+        subs            w11, w11, #2                // tmpfilterSize -= 2
+        b.gt            2b                          // continue filter loop
+
+        // Shift results
+        sshl            v3.4s, v3.4s, v20.4s
+        sshl            v4.4s, v4.4s, v20.4s
+        sshl            v5.4s, v5.4s, v20.4s
+        sshl            v6.4s, v6.4s, v20.4s
+
+        // Clamp to upper bound
+        smin            v3.4s, v3.4s, v21.4s
+        smin            v4.4s, v4.4s, v21.4s
+        smin            v5.4s, v5.4s, v21.4s
+        smin            v6.4s, v6.4s, v21.4s
+
+        // Narrow and clamp to 0
+        sqxtun          v23.4h, v3.4s
+        sqxtun2         v23.8h, v4.4s
+        sqxtun          v24.4h, v5.4s
+        sqxtun2         v24.8h, v6.4s
+
+        cbz             w5, 3f                      // Check if big endian
+        rev16           v23.16b, v23.16b
+        rev16           v24.16b, v24.16b            // Swap bits for big endian
+3:
+        st1             {v23.8h, v24.8h}, [x3], #32
+
+        subs            w4, w4, #16                 // dstW = dstW - 16
+        add             x7, x7, #16                 // i = i + 16
+        b               1b                          // Continue loop
+
+4:
+        cmp             w4, #8                      // Process 8-pixels if available
+        blt             8f
+5:
+        mov             v3.16b, v1.16b
+        mov             v4.16b, v2.16b
+
+        mov             w11, w1                     // tmpfilterSize = filterSize
+        mov             x12, x2                     // srcp = src
+        mov             x13, x0                     // filterp = filter
+
+6:      // Filter loop
+
+        ldp             x14, x15, [x12], #16
+        ldr             s7, [x13], #4
+        add             x14, x14, x7, lsl #1
+        add             x15, x15, x7, lsl #1
+        ld1             {v5.8h}, [x14]
+        ld1             {v6.8h}, [x15]
+
+        // Multiply-accumulate
+        smlal           v3.4s, v5.4h, v7.h[0]
+        smlal2          v4.4s, v5.8h, v7.h[0]
+        smlal           v3.4s, v6.4h, v7.h[1]
+        smlal2          v4.4s, v6.8h, v7.h[1]
+
+        subs            w11, w11, #2                // tmpfilterSize -= 2
+        b.gt            6b                          // loop until filterSize consumed
+
+        // Shift results
+        sshl            v3.4s, v3.4s, v20.4s
+        sshl            v4.4s, v4.4s, v20.4s
+
+        // Clip upper bound
+        smin            v3.4s, v3.4s, v21.4s
+        smin            v4.4s, v4.4s, v21.4s
+
+        // Narrow and clamp to 0
+        sqxtun          v25.4h, v3.4s
+        sqxtun          v26.4h, v4.4s
+
+        cbz             w5, 7f                      // Check if big endian
+        rev16           v25.8b, v25.8b
+        rev16           v26.8b, v26.8b              // Swap bits for big endian
+
+7:
+        // Store 8 pixels
+        st1             {v25.4h, v26.4h}, [x3], #16
+
+        subs            w4, w4, #8                  // dstW = dstW - 8
+        add             x7, x7, #8                  // i = i + 8
+
+8:
+        cbz             w4, 12f                     // Scalar loop for remaining pixels
+9:
+        mov             w11, w1                     // tmpfilterSize = filterSize
+        mov             x12, x2                     // srcp = src
+        mov             x13, x0                     // filterp = filter
+        sxtw            x9, w9
+        mov             x17, x9
+
+10:     // Filter loop
+        ldr             x14, [x12], #8              // Load src pointer
+        ldrsh           w15, [x13], #2              // Load filter coefficient
+        add             x14, x14, x7, lsl #1        // Add pixel offset
+        ldrh            w16, [x14]
+
+        sxtw            x16, w16
+        sxtw            x15, w15
+        madd            x17, x16, x15, x17
+
+        subs            w11, w11, #1                // tmpfilterSize -= 1
+        b.gt            10b                         // loop until filterSize consumed
+
+        sxtw            x8, w8
+        asr             x17, x17, x8
+        cmp             x17, #0
+        csel            x17, x17, xzr, ge           // Clamp to 0 if negative
+
+        sxtw            x10, w10
+        cmp             x17, x10
+        csel            x17, x10, x17, gt           // Clamp to max if greater than max
+
+        cbz             w5, 11f                     // Check if big endian
+        rev16           x17, x17                    // Swap bits for big endian
+11:
+        strh            w17, [x3], #2
+
+        subs            w4, w4, #1                  // dstW = dstW - 1
+        add             x7, x7, #1                  // i = i + 1
+        b.gt            9b                          // Loop if more pixels
+
+12:
+        ret
+endfunc
+
 function ff_yuv2planeX_8_neon, export=1
 // x0 - const int16_t *filter,
 // x1 - int filterSize,
diff --git a/libswscale/aarch64/swscale.c b/libswscale/aarch64/swscale.c
index 6e5a721c1f..2c3f096a84 100644
--- a/libswscale/aarch64/swscale.c
+++ b/libswscale/aarch64/swscale.c
@@ -158,6 +158,29 @@ void ff_hscale ## from_bpc ## to ## to_bpc ## _ ## filter_n ## _ ## opt( \
 
 ALL_SCALE_FUNCS(neon);
 
+void ff_yuv2planeX_10_neon(const int16_t *filter, int filterSize,
+                           const int16_t **src, uint16_t *dest, int dstW,
+                           int big_endian, int output_bits);
+
+#define yuv2NBPS(bits, BE_LE, is_be, template_size, typeX_t) \
+static void yuv2planeX_ ## bits ## BE_LE ## _neon(const int16_t *filter, int filterSize, \
+                            const int16_t **src, uint8_t *dest, int dstW, \
+                            const uint8_t *dither, int offset) \
+{ \
+    ff_yuv2planeX_## template_size ## _neon(filter, \
+                            filterSize, (const typeX_t **) src, \
+                            (uint16_t *) dest, dstW, is_be, bits); \
+}
+
+yuv2NBPS( 9, BE, 1, 10, int16_t)
+yuv2NBPS( 9, LE, 0, 10, int16_t)
+yuv2NBPS(10, BE, 1, 10, int16_t)
+yuv2NBPS(10, LE, 0, 10, int16_t)
+yuv2NBPS(12, BE, 1, 10, int16_t)
+yuv2NBPS(12, LE, 0, 10, int16_t)
+yuv2NBPS(14, BE, 1, 10, int16_t)
+yuv2NBPS(14, LE, 0, 10, int16_t)
+
 void ff_yuv2planeX_8_neon(const int16_t *filter, int filterSize,
                           const int16_t **src, uint8_t *dest, int dstW,
                           const uint8_t *dither, int offset);
@@ -268,6 +291,8 @@ av_cold void ff_sws_init_range_convert_aarch64(SwsInternal *c)
 av_cold void ff_sws_init_swscale_aarch64(SwsInternal *c)
 {
     int cpu_flags = av_get_cpu_flags();
+    enum AVPixelFormat dstFormat = c->opts.dst_format;
+    const AVPixFmtDescriptor *desc = av_pix_fmt_desc_get(dstFormat);
 
     if (have_neon(cpu_flags)) {
         ASSIGN_SCALE_FUNC(c->hyScale, c->hLumFilterSize, neon);
@@ -276,6 +301,19 @@ av_cold void ff_sws_init_swscale_aarch64(SwsInternal *c)
         if (c->dstBpc == 8) {
             c->yuv2planeX = ff_yuv2planeX_8_neon;
         }
+
+        if (isNBPS(dstFormat) && !isSemiPlanarYUV(dstFormat)) {
+            if (desc->comp[0].depth == 9) {
+                c->yuv2planeX = isBE(dstFormat) ? yuv2planeX_9BE_neon : yuv2planeX_9LE_neon;
+            } else if (desc->comp[0].depth == 10) {
+                c->yuv2planeX = isBE(dstFormat) ? yuv2planeX_10BE_neon : yuv2planeX_10LE_neon;
+            } else if (desc->comp[0].depth == 12) {
+                c->yuv2planeX = isBE(dstFormat) ? yuv2planeX_12BE_neon : yuv2planeX_12LE_neon;
+            } else if (desc->comp[0].depth == 14) {
+                c->yuv2planeX = isBE(dstFormat) ? yuv2planeX_14BE_neon : yuv2planeX_14LE_neon;
+            } else
+                av_assert0(0);
+        }
         switch (c->opts.src_format) {
         case AV_PIX_FMT_ABGR:
             c->lumToYV12 = ff_abgr32ToY_neon;
diff --git a/tests/checkasm/sw_scale.c b/tests/checkasm/sw_scale.c
index 11c9174a6b..5a659571df 100644
--- a/tests/checkasm/sw_scale.c
+++ b/tests/checkasm/sw_scale.c
@@ -52,50 +52,59 @@ static void yuv2planeX_8_ref(const int16_t *filter, int filterSize,
     }
 }
 
-static int cmp_off_by_n(const uint8_t *ref, const uint8_t *test, size_t n, int accuracy)
-{
-    for (size_t i = 0; i < n; i++) {
-        if (abs(ref[i] - test[i]) > accuracy)
-            return 1;
-    }
-    return 0;
+#define CMP_FUNC(bits) \
+static int cmp_off_by_##bits(const uint##bits##_t *ref, const uint##bits##_t *test, \
+                             size_t n, int accuracy) \
+{ \
+    for (size_t i = 0; i < n; i++) { \
+        if (abs((int)ref[i] - (int)test[i]) > accuracy) \
+            return 1; \
+    } \
+    return 0; \
 }
 
-static void print_data(uint8_t *p, size_t len, size_t offset)
-{
-    size_t i = 0;
-    for (; i < len; i++) {
-        if (i % 8 == 0) {
-            printf("0x%04zx: ", i+offset);
-        }
-        printf("0x%02x ", (uint32_t) p[i]);
-        if (i % 8 == 7) {
-            printf("\n");
-        }
-    }
-    if (i % 8 != 0) {
-        printf("\n");
-    }
+CMP_FUNC(8)
+CMP_FUNC(16)
+
+#define SHOW_DIFF_FUNC(bits) \
+static void print_data_##bits(const uint##bits##_t *p, size_t len, size_t offset) \
+{ \
+    size_t i = 0; \
+    for (; i < len; i++) { \
+        if (i % 8 == 0) { \
+            printf("0x%04zx: ", i+offset); \
+        } \
+        printf("0x%02x ", (uint32_t) p[i]); \
+        if (i % 8 == 7) { \
+            printf("\n"); \
+        } \
+    } \
+    if (i % 8 != 0) { \
+        printf("\n"); \
+    } \
+} \
+static size_t show_differences_##bits(const uint##bits##_t *a, const uint##bits##_t *b, \
+                                      size_t len) \
+{ \
+    for (size_t i = 0; i < len; i++) { \
+        if (a[i] != b[i]) { \
+            size_t offset_of_mismatch = i; \
+            size_t offset; \
+            if (i >= 8) i-=8; \
+            offset = i & (~7); \
+            printf("test a:\n"); \
+            print_data_##bits(&a[offset], 32, offset); \
+            printf("\ntest b:\n"); \
+            print_data_##bits(&b[offset], 32, offset); \
+            printf("\n"); \
+            return offset_of_mismatch; \
+        } \
+    } \
+    return len; \
 }
 
-static size_t show_differences(uint8_t *a, uint8_t *b, size_t len)
-{
-    for (size_t i = 0; i < len; i++) {
-        if (a[i] != b[i]) {
-            size_t offset_of_mismatch = i;
-            size_t offset;
-            if (i >= 8) i-=8;
-            offset = i & (~7);
-            printf("test a:\n");
-            print_data(&a[offset], 32, offset);
-            printf("\ntest b:\n");
-            print_data(&b[offset], 32, offset);
-            printf("\n");
-            return offset_of_mismatch;
-        }
-    }
-    return len;
-}
+SHOW_DIFF_FUNC(8)
+SHOW_DIFF_FUNC(16)
 
 static void check_yuv2yuv1(int accurate)
 {
@@ -140,10 +149,10 @@ static void check_yuv2yuv1(int accurate)
 
                 call_ref(src_pixels, dst0, dstW, dither, offset);
                 call_new(src_pixels, dst1, dstW, dither, offset);
-                if (cmp_off_by_n(dst0, dst1, dstW * sizeof(dst0[0]), accurate ? 0 : 2)) {
+                if (cmp_off_by_8(dst0, dst1, dstW * sizeof(dst0[0]), accurate ? 0 : 2)) {
                     fail();
                     printf("failed: yuv2yuv1_%d_%di_%s\n", offset, dstW, accurate_str);
-                    fail_offset = show_differences(dst0, dst1, LARGEST_INPUT_SIZE * sizeof(dst0[0]));
+                    fail_offset = show_differences_8(dst0, dst1, LARGEST_INPUT_SIZE * sizeof(dst0[0]));
                     printf("failing values: src: 0x%04x dither: 0x%02x dst-c: %02x dst-asm: %02x\n",
                             (int) src_pixels[fail_offset],
                             (int) dither[(fail_offset + fail_offset) & 7],
@@ -158,7 +167,7 @@ static void check_yuv2yuv1(int accurate)
     sws_freeContext(sws);
 }
 
-static void check_yuv2yuvX(int accurate)
+static void check_yuv2yuvX(int accurate, int bit_depth, int dst_pix_format)
 {
     SwsContext *sws;
     SwsInternal *c;
@@ -179,8 +188,8 @@ static void check_yuv2yuvX(int accurate)
     const int16_t **src;
     LOCAL_ALIGNED_16(int16_t, src_pixels, [LARGEST_FILTER * LARGEST_INPUT_SIZE]);
     LOCAL_ALIGNED_16(int16_t, filter_coeff, [LARGEST_FILTER]);
-    LOCAL_ALIGNED_16(uint8_t, dst0, [LARGEST_INPUT_SIZE]);
-    LOCAL_ALIGNED_16(uint8_t, dst1, [LARGEST_INPUT_SIZE]);
+    LOCAL_ALIGNED_16(uint16_t, dst0, [LARGEST_INPUT_SIZE]);
+    LOCAL_ALIGNED_16(uint16_t, dst1, [LARGEST_INPUT_SIZE]);
     LOCAL_ALIGNED_16(uint8_t, dither, [LARGEST_INPUT_SIZE]);
     union VFilterData{
         const int16_t *src;
@@ -190,12 +199,14 @@ static void check_yuv2yuvX(int accurate)
     memset(dither, d_val, LARGEST_INPUT_SIZE);
     randomize_buffers((uint8_t*)src_pixels, LARGEST_FILTER * LARGEST_INPUT_SIZE * sizeof(int16_t));
     sws = sws_alloc_context();
+    sws->dst_format = dst_pix_format;
     if (accurate)
         sws->flags |= SWS_ACCURATE_RND;
     if (sws_init_context(sws, NULL, NULL) < 0)
         fail();
 
     c = sws_internal(sws);
+    c->dstBpc = bit_depth;
     ff_sws_init_scale(c);
     for(isi = 0; isi < FF_ARRAY_ELEMS(input_sizes); ++isi){
         dstW = input_sizes[isi];
@@ -227,24 +238,39 @@ static void check_yuv2yuvX(int accurate)
                     for(j = 0; j < 4; ++j)
                         vFilterData[i].coeff[j + 4] = filter_coeff[i];
                 }
-                if (check_func(c->yuv2planeX, "yuv2yuvX_%d_%d_%d_%s", filter_sizes[fsi], osi, dstW, accurate_str)){
+                if (check_func(c->yuv2planeX, "yuv2yuvX_%d_%s_%d_%d_%d_%s", bit_depth, isBE(dst_pix_format) ? "BE" : "LE", filter_sizes[fsi], osi, dstW, accurate_str)){
                     // use vFilterData for the mmx function
                     const int16_t *filter = c->use_mmx_vfilter ? (const int16_t*)vFilterData : &filter_coeff[0];
                     memset(dst0, 0, LARGEST_INPUT_SIZE * sizeof(dst0[0]));
                     memset(dst1, 0, LARGEST_INPUT_SIZE * sizeof(dst1[0]));
 
-                    // We can't use call_ref here, because we don't know if use_mmx_vfilter was set for that
-                    // function or not, so we can't pass it the parameters correctly.
-                    yuv2planeX_8_ref(&filter_coeff[0], filter_sizes[fsi], src, dst0, dstW - osi, dither, osi);
+                    if(c->dstBpc == 8)
+                    {
+                        // We can't use call_ref here, because we don't know if use_mmx_vfilter was set for that
+                        // function or not, so we can't pass it the parameters correctly.
 
-                    call_new(filter, filter_sizes[fsi], src, dst1, dstW - osi, dither, osi);
-                    if (cmp_off_by_n(dst0, dst1, LARGEST_INPUT_SIZE * sizeof(dst0[0]), accurate ? 0 : 2)) {
-                        fail();
-                        printf("failed: yuv2yuvX_%d_%d_%d_%s\n", filter_sizes[fsi], osi, dstW, accurate_str);
-                        show_differences(dst0, dst1, LARGEST_INPUT_SIZE * sizeof(dst0[0]));
+                        yuv2planeX_8_ref(&filter_coeff[0], filter_sizes[fsi], src, (uint8_t*)dst0, dstW - osi, dither, osi);
+                        call_new(filter, filter_sizes[fsi], src, (uint8_t*)dst1, dstW - osi, dither, osi);
+
+                        if (cmp_off_by_8((uint8_t*)dst0, (uint8_t*)dst1, LARGEST_INPUT_SIZE, accurate ? 0 : 2)) {
+                            fail();
+                            printf("failed: yuv2yuvX_%d_%s_%d_%d_%d_%s\n", bit_depth, isBE(dst_pix_format) ? "BE" : "LE", filter_sizes[fsi], osi, dstW, accurate_str);
+                            show_differences_8((uint8_t*)dst0, (uint8_t*)dst1, LARGEST_INPUT_SIZE);
+                        }
+                    }
+                    else
+                    {
+                        call_ref(&filter_coeff[0], filter_sizes[fsi], src, (uint8_t*)dst0, dstW - osi, dither, osi);
+                        call_new(&filter_coeff[0], filter_sizes[fsi], src, (uint8_t*)dst1, dstW - osi, dither, osi);
+
+                        if (cmp_off_by_16(dst0, dst1, LARGEST_INPUT_SIZE, accurate ? 0 : 2)) {
+                            fail();
+                            printf("failed: yuv2yuvX_%d_%s_%d_%d_%d_%s\n", bit_depth, isBE(dst_pix_format) ? "BE" : "LE", filter_sizes[fsi], osi, dstW, accurate_str);
+                            show_differences_16(dst0, dst1, LARGEST_INPUT_SIZE);
+                        }
                     }
                     if(dstW == LARGEST_INPUT_SIZE)
-                        bench_new((const int16_t*)vFilterData, filter_sizes[fsi], src, dst1, dstW - osi, dither, osi);
+                        bench_new(filter, filter_sizes[fsi], src, (uint8_t*)dst1, dstW - osi, dither, osi);
 
                 }
                 av_freep(&src);
@@ -311,10 +337,10 @@ static void check_yuv2nv12cX(int accurate)
 
                 call_ref(sws->dst_format, dither, &filter_coeff[0], filter_size, srcU, srcV, dst0, dstW);
                 call_new(sws->dst_format, dither, &filter_coeff[0], filter_size, srcU, srcV, dst1, dstW);
-                if (cmp_off_by_n(dst0, dst1, dstW * 2 * sizeof(dst0[0]), accurate ? 0 : 2)) {
+                if (cmp_off_by_8(dst0, dst1, dstW * 2 * sizeof(dst0[0]), accurate ? 0 : 2)) {
                     fail();
                     printf("failed: yuv2nv12wX_%d_%d_%s\n", filter_size, dstW, accurate_str);
-                    show_differences(dst0, dst1, dstW * 2 * sizeof(dst0[0]));
+                    show_differences_8(dst0, dst1, dstW * 2 * sizeof(dst0[0]));
                 }
                 if (dstW == LARGEST_INPUT_SIZE)
                     bench_new(sws->dst_format, dither, &filter_coeff[0], filter_size, srcU, srcV, dst1, dstW);
@@ -441,9 +467,33 @@ void checkasm_check_sw_scale(void)
     check_yuv2yuv1(0);
     check_yuv2yuv1(1);
     report("yuv2yuv1");
-    check_yuv2yuvX(0);
-    check_yuv2yuvX(1);
-    report("yuv2yuvX");
+    check_yuv2yuvX(0, 8, AV_PIX_FMT_YUV420P);
+    check_yuv2yuvX(1, 8, AV_PIX_FMT_YUV420P);
+    report("yuv2yuvX_8");
+    check_yuv2yuvX(0, 9, AV_PIX_FMT_YUV420P9LE);
+    check_yuv2yuvX(1, 9, AV_PIX_FMT_YUV420P9LE);
+    report("yuv2yuvX_9LE");
+    check_yuv2yuvX(0, 9, AV_PIX_FMT_YUV420P9BE);
+    check_yuv2yuvX(1, 9, AV_PIX_FMT_YUV420P9BE);
+    report("yuv2yuvX_9BE");
+    check_yuv2yuvX(0, 10, AV_PIX_FMT_YUV420P10LE);
+    check_yuv2yuvX(1, 10, AV_PIX_FMT_YUV420P10LE);
+    report("yuv2yuvX_10LE");
+    check_yuv2yuvX(0, 10, AV_PIX_FMT_YUV420P10BE);
+    check_yuv2yuvX(1, 10, AV_PIX_FMT_YUV420P10BE);
+    report("yuv2yuvX_10BE");
+    check_yuv2yuvX(0, 12, AV_PIX_FMT_YUV420P12LE);
+    check_yuv2yuvX(1, 12, AV_PIX_FMT_YUV420P12LE);
+    report("yuv2yuvX_12LE");
+    check_yuv2yuvX(0, 12, AV_PIX_FMT_YUV420P12BE);
+    check_yuv2yuvX(1, 12, AV_PIX_FMT_YUV420P12BE);
+    report("yuv2yuvX_12BE");
+    check_yuv2yuvX(0, 14, AV_PIX_FMT_YUV420P14LE);
+    check_yuv2yuvX(1, 14, AV_PIX_FMT_YUV420P14LE);
+    report("yuv2yuvX_14LE");
+    check_yuv2yuvX(0, 14, AV_PIX_FMT_YUV420P14BE);
+    check_yuv2yuvX(1, 14, AV_PIX_FMT_YUV420P14BE);
+    report("yuv2yuvX_14BE");
     check_yuv2nv12cX(0);
     check_yuv2nv12cX(1);
     report("yuv2nv12cX");
-- 
2.34.1

________________________________
From: Logaprakash Ramajayam
Sent: Wednesday, July 2, 2025 1:01 PM
To: FFmpeg development discussions and patches
Subject: [FFmpeg-devel] [PATCH v2 1/1] swscale/aarch64/output: Implement neon assembly for yuv2planeX_10_c_template()

Handled all the comments and updated
checkasm for yuv2planeX_10_c()

Checkasm Benchmark results:
yuv2yuvX_10_LE_16_0_512_accurate_c:          7836.9 ( 1.00x)
yuv2yuvX_10_LE_16_0_512_accurate_neon:        840.4 ( 9.33x)
yuv2yuvX_10_LE_16_0_512_approximate_c:       7930.8 ( 1.00x)
yuv2yuvX_10_LE_16_0_512_approximate_neon:     838.5 ( 9.46x)
yuv2yuvX_10_LE_16_16_512_accurate_c:         7594.3 ( 1.00x)
yuv2yuvX_10_LE_16_16_512_accurate_neon:       815.2 ( 9.32x)
yuv2yuvX_10_LE_16_16_512_approximate_c:      7687.0 ( 1.00x)
yuv2yuvX_10_LE_16_16_512_approximate_neon:    811.9 ( 9.47x)
yuv2yuvX_10_LE_16_32_512_accurate_c:         7366.4 ( 1.00x)
yuv2yuvX_10_LE_16_32_512_accurate_neon:       785.8 ( 9.37x)
yuv2yuvX_10_LE_16_32_512_approximate_c:      7426.5 ( 1.00x)
yuv2yuvX_10_LE_16_32_512_approximate_neon:    786.4 ( 9.44x)
yuv2yuvX_10_LE_16_48_512_accurate_c:         7123.1 ( 1.00x)
yuv2yuvX_10_LE_16_48_512_accurate_neon:       761.7 ( 9.35x)
yuv2yuvX_10_LE_16_48_512_approximate_c:      7182.7 ( 1.00x)
yuv2yuvX_10_LE_16_48_512_approximate_neon:    763.0 ( 9.41x)
yuv2yuvX_10_BE_16_0_512_accurate_c:          8092.6 ( 1.00x)
yuv2yuvX_10_BE_16_0_512_accurate_neon:        860.2 ( 9.41x)
yuv2yuvX_10_BE_16_0_512_approximate_c:       8183.5 ( 1.00x)
yuv2yuvX_10_BE_16_0_512_approximate_neon:     861.4 ( 9.50x)
yuv2yuvX_10_BE_16_16_512_accurate_c:         7837.4 ( 1.00x)
yuv2yuvX_10_BE_16_16_512_accurate_neon:       834.0 ( 9.40x)
yuv2yuvX_10_BE_16_16_512_approximate_c:      7927.9 ( 1.00x)
yuv2yuvX_10_BE_16_16_512_approximate_neon:    834.6 ( 9.50x)
yuv2yuvX_10_BE_16_32_512_accurate_c:         7605.1 ( 1.00x)
yuv2yuvX_10_BE_16_32_512_accurate_neon:       807.5 ( 9.42x)
yuv2yuvX_10_BE_16_32_512_approximate_c:      7691.4 ( 1.00x)
yuv2yuvX_10_BE_16_32_512_approximate_neon:    807.3 ( 9.53x)
yuv2yuvX_10_BE_16_48_512_accurate_c:         7344.3 ( 1.00x)
yuv2yuvX_10_BE_16_48_512_accurate_neon:       782.7 ( 9.38x)
yuv2yuvX_10_BE_16_48_512_approximate_c:      7440.1 ( 1.00x)
yuv2yuvX_10_BE_16_48_512_approximate_neon:    781.9 ( 9.51x)
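
For anyone reviewing the assembly without a NEON reference at hand, the per-pixel arithmetic described by the comments in ff_yuv2planeX_10_neon (shift = 11 + 16 - output_bits, a rounding term of 1 << (shift - 1), clamp to [0, (1 << output_bits) - 1], optional byte swap) can be sketched in plain C. This is only a reading of those comments, not FFmpeg's yuv2planeX_10_c_template(). The function name is hypothetical, the accumulator is 64-bit here for simplicity whereas the vector paths accumulate in 32-bit lanes, and the byte swap assumes a little-endian host like the assembly does.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical name; a plain C reading of the comments in the NEON routine. */
static void yuv2planeX_nbps_sketch(const int16_t *filter, int filterSize,
                                   const int16_t **src, uint16_t *dest,
                                   int dstW, int big_endian, int output_bits)
{
    /* The patch only wires up 9, 10, 12 and 14 bit outputs. */
    assert(output_bits >= 9 && output_bits <= 14);

    const int     shift = 11 + 16 - output_bits;
    const int64_t max   = ((int64_t)1 << output_bits) - 1;

    for (int i = 0; i < dstW; i++) {
        /* Start from the rounding term, then accumulate src[j][i] * filter[j]. */
        int64_t val = (int64_t)1 << (shift - 1);
        for (int j = 0; j < filterSize; j++)
            val += (int64_t)src[j][i] * filter[j];

        /* Clamp negatives to 0 before shifting, then clamp to the maximum. */
        if (val < 0)
            val = 0;
        else
            val >>= shift;
        if (val > max)
            val = max;

        /* Byte swap for big-endian output (little-endian host assumed). */
        uint16_t out = (uint16_t)val;
        if (big_endian)
            out = (uint16_t)((out >> 8) | (out << 8));
        dest[i] = out;
    }
}
```

With a single tap of 4096 and output_bits = 10, an input sample s maps to roughly s / 32 + 0.5 rounded down, which makes it easy to compare a few values by hand against the NEON output.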