From: Chitra Dey Sarkar via ffmpeg-devel
Reply-To: FFmpeg development discussions and patches
To: FFmpeg development discussions and patches
Cc: Chitra Dey Sarkar, Michael Niedermayer, Linda Zhang, Vittal Pai
Date: Fri, 16 May 2025 20:06:05 +0000
Subject: Re: [FFmpeg-devel] [EXTERNAL] Re: [PATCH] Boost FPS and performance: Optimize vertical loop for cache-friendly access [libavcodec/jpeg2000dwt.c:dwt_decode97_float]
References: <20250514164003.GL29660@pb2> <20250515201957.GW29660@pb2> <20250516094335.GY29660@pb2>
In-Reply-To: <20250516094335.GY29660@pb2>
List-Id: FFmpeg development discussions and patches

-----Original Message-----
From: ffmpeg-devel On Behalf Of Michael Niedermayer
Sent: Friday, May 16, 2025 2:44 AM
To: FFmpeg development
discussions and patches
Subject: [EXTERNAL] Re: [FFmpeg-devel] [PATCH] Boost FPS and performance: Optimize vertical loop for cache-friendly access [libavcodec/jpeg2000dwt.c:dwt_decode97_float]

On Thu, May 15, 2025 at 10:19:57PM +0200, Michael Niedermayer wrote:
> Hi
>
> On Wed, May 14, 2025 at 06:40:03PM +0200, Michael Niedermayer wrote:
> > Hi Chitra
> >
> > On Wed, May 14, 2025 at 03:55:59AM +0000, Chitra Dey Sarkar via ffmpeg-devel wrote:
> > > Original Implementation:
> > > ---------------------------------
> > > In the original implementation, the "VER_SD" section processes image
> > > data stored in *data using strided memory access in a vertical
> > > fashion. This leads to inefficient memory access patterns and cache
> > > thrashing due to non-sequential data access across multiple inner
> > > loops.
> > >
> > > Proposed Refactor:
> > > ---------------------------------
> > > The proposed refactor replaces this by allocating a cache-friendly 2D
> > > array buffer. This change eliminates strided memory access across the
> > > three inner loops, significantly improving cache locality and
> > > reducing cache thrashing.
> > >
> > > Additionally, the data is transposed outside the lp loop, which
> > > allows for efficient per-line access and write-back to the l buffer,
> > > further optimizing performance.
> > >
> > > Performance improvements
> > > -------------------------------------------------------
> > > This change results in a substantial performance improvement.
> > > Sharing the FPS data benchmarked on our end for the file 'Tears of
> > > Steel' using HandBrake.
> > >
> > > Device / CPU Model                               Official FPS   Optimized FPS   % Improvement
> > > Surface Laptop 11 (10-core X1P64100, L2: 36MB)   3.18           6.15            +93%
> > > Surface Laptop 11 (10-core X1P64100, L2: 36MB)   5.16           7.31            +41%
> > > Surface Laptop 11 (10-core X1P64100, L2: 36MB)   5.57           9.21            +65%
> > > AMD Ryzen + NVIDIA RTX 4060 Laptop (12C/24T)     9.97           11.22           +12%
> > > Mac Mini Apple M4 Chip                           9.00           12.00           +30%
> > >
> > > -------------------------------------------------------------------------------------------------------
> > > ---
> > >  libavcodec/jpeg2000dwt.c | 72 +++++++++++++++++++++++++++++++---------
> > >  1 file changed, 57 insertions(+), 15 deletions(-)
> > >
> > > diff --git a/libavcodec/jpeg2000dwt.c b/libavcodec/jpeg2000dwt.c
> > > index 9ee8122658..45d7897893 100644
> > > --- a/libavcodec/jpeg2000dwt.c
> > > +++ b/libavcodec/jpeg2000dwt.c
> > > @@ -409,6 +409,15 @@ static void dwt_decode97_float(DWTContext *s, float *t)
> > > +
> > >      for (lev = 0; lev < s->ndeclevels; lev++) {
> > >          int lh = s->linelen[lev][0],
> > >              lv = s->linelen[lev][1],
> > > @@ -431,23 +440,56 @@ static void dwt_decode97_float(DWTContext *s, float *t)
> > >              for (i = 0; i < lh; i++)
> > >                  data[w * lp + i] = l[i];
> > >          }
> > > -
> > > -        // VER_SD
> > > -        l = line + mv;
> > > -        for (lp = 0; lp < lh; lp++) {
> > > -            int i, j = 0;
> > > -            // copy with interleaving
> > > -            for (i = mv; i < lv; i += 2, j++)
> > > -                l[i] = data[w * j + lp];
> > > -            for (i = 1 - mv; i < lv; i += 2, j++)
> > > -                l[i] = data[w * j + lp];
> > > -
> >
> > > -            sr_1d97_float(line, mv, mv + lv);
> >
> > This should be run linewise, not columnwise. If you don't understand
> > what I mean here, please say so and I'll elaborate.
>
> For the record: (may be
interesting for others, or others may have comments too)
>
> The 1D transform is made of 4 passes of lifting transforms. Currently
> all pixels of a column are run through the first lifting step before
> anything runs through the 2nd. One could try to run the first pixel
> through the 4th lifting step after the 2nd pixel finishes the 3rd
> lifting step and the 3rd pixel of the column finishes the 2nd lifting
> step, and then do this for all pixels in a row, so that the whole
> transform is finished for the whole first row before more than 5 rows
> or so have been touched. This _COULD_ be faster, but it needs to be
> tried to be sure.
>
> Basically, there would be an area covering a small number of whole rows,
> and this area would move down by 1 or 2 rows in each iteration. Above
> it, both horizontal and vertical transforms are finished; below it is
> the untouched input.
> As long as this sliding window fits in the cache, this should
> outperform anything that copies the data around.
>
> I think it's important to look into this before optimizing the code for
> CPU or GPU, because the structure of this is different from the
> transpose-based code. In fact, the horizontal transform is quite
> unfriendly for SIMD,
>
> so the code as is in C might be the worst possible starting point for
> SIMD: it transforms the easy vertical transform with a transpose into a
> horizontal one ...

More elaboration, I think what I wrote is still unclear.
If one looks at spatial_decompose97i() in libavcodec/snow_dwt.c, that's
approximately what I had in mind. It's just one page of code.

==========> Acknowledged. I will look at this part of the code to see if I am able to do something similar.

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
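
As a rough illustration of the "linewise, not columnwise" point raised in this thread, here is a minimal C sketch of a single vertical lifting step written both ways. It is not FFmpeg code and not the proposed patch: lift_columnwise(), lift_rowwise(), row_below() and the coefficient passed in are hypothetical names and values chosen for illustration. It shows only one lifting step on the odd rows with a mirrored bottom edge, not the full four-step 9/7 transform and not the sliding-window scheduling used by spatial_decompose97i().

```c
#include <stddef.h>

/* Hypothetical helper: index of the row below y, mirrored at the bottom edge. */
static size_t row_below(size_t y, size_t h)
{
    return (y + 1 < h) ? y + 1 : y - 1;
}

/* Column-wise: for each column, walk down the image with stride w.
 * Consecutive accesses are w floats apart, which is the strided,
 * cache-unfriendly pattern discussed in the thread. */
static void lift_columnwise(float *data, size_t w, size_t h, float coef)
{
    for (size_t x = 0; x < w; x++)
        for (size_t y = 1; y < h; y += 2)
            data[y * w + x] += coef * (data[(y - 1) * w + x] +
                                       data[row_below(y, h) * w + x]);
}

/* Row-wise: same arithmetic, but the inner loop runs along x, so each of
 * the three rows involved is read or written sequentially and no
 * transpose or temporary column buffer is needed. */
static void lift_rowwise(float *data, size_t w, size_t h, float coef)
{
    for (size_t y = 1; y < h; y += 2) {
        float       *cur   = data + y * w;
        const float *above = data + (y - 1) * w;
        const float *below = data + row_below(y, h) * w;
        for (size_t x = 0; x < w; x++)
            cur[x] += coef * (above[x] + below[x]);
    }
}
```

Both functions update only odd rows from their even neighbours, so they produce the same output; the difference is purely the order in which memory is touched. The row-wise inner loop is also a plain element-wise operation over contiguous floats, which is the kind of loop the thread describes as the easy case for SIMD.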