Date: Sat, 1 Mar 2025 23:39:34 +0200 (EET)
From: Martin Storsjö
To: FFmpeg development discussions and patches
Cc: Zhao Zhili
Subject: Re: [FFmpeg-devel] [PATCH 1/2] aarch64/hevcdsp_idct_neon: Optimize idct dc
Message-ID: <40368528-9f44-ce93-4b4b-fefebe984bae@martin.st>

On Thu, 20 Feb 2025, Zhao Zhili wrote:

> From: Zhao Zhili
>
> clang does better than the assembly code before the patch, especially
> for small size:
>
> hevc_idct_4x4_dc_8_c:         11.2 ( 1.00x)
> hevc_idct_4x4_dc_8_neon:      15.5 ( 0.73x)
> hevc_idct_4x4_dc_10_c:        12.0 ( 1.00x)
> hevc_idct_4x4_dc_10_neon:     15.2 ( 0.79x)
> hevc_idct_8x8_dc_8_c:         13.2 ( 1.00x)
> hevc_idct_8x8_dc_8_neon:      18.2 ( 0.73x)
> hevc_idct_8x8_dc_10_c:        13.5 ( 1.00x)
> hevc_idct_8x8_dc_10_neon:     17.2 ( 0.78x)
> hevc_idct_16x16_dc_8_c:       41.8 ( 1.00x)
> hevc_idct_16x16_dc_8_neon:    37.8 ( 1.11x)
> hevc_idct_16x16_dc_10_c:      41.8 ( 1.00x)
> hevc_idct_16x16_dc_10_neon:   37.8 ( 1.11x)
> hevc_idct_32x32_dc_8_c:      130.2 ( 1.00x)
> hevc_idct_32x32_dc_8_neon:   132.2 ( 0.98x)
> hevc_idct_32x32_dc_10_c:     130.2 ( 1.00x)
> hevc_idct_32x32_dc_10_neon:  132.2 ( 0.98x)
>
> This patch basically clone what the compiler does, so the performance
> is the same.
> ---
>  libavcodec/aarch64/hevcdsp_idct_neon.S | 59 ++++++++++++++------------
>  1 file changed, 33 insertions(+), 26 deletions(-)
>
> diff --git a/libavcodec/aarch64/hevcdsp_idct_neon.S b/libavcodec/aarch64/hevcdsp_idct_neon.S
> index 3cac6e6db9..4543ab6b07 100644
> --- a/libavcodec/aarch64/hevcdsp_idct_neon.S
> +++ b/libavcodec/aarch64/hevcdsp_idct_neon.S
> @@ -888,38 +888,45 @@ function ff_hevc_transform_luma_4x4_neon_8, export=1
>          ret
>  endfunc
>
> +.macro idct_8x8_dc_store offset
> +.irp    i, 0x0, 0x20, 0x40, 0x60
> +        stp             q0, q0, [x0, #(\offset + \i)]
> +.endr
> +.endm
> +
> +.macro idct_16x16_dc_store
> +.irp    index, 0x0, 0x80, 0x100, 0x180
> +        idct_8x8_dc_store offset=\index
> +.endr
> +.endm
> +
>  // void ff_hevc_idct_NxN_dc_DEPTH_neon(int16_t *coeffs)
>  .macro idct_dc size, bitdepth
>  function ff_hevc_idct_\size\()x\size\()_dc_\bitdepth\()_neon, export=1
> -        ld1r            {v4.8h}, [x0]
> -        srshr           v4.8h, v4.8h, #1
> -        srshr           v0.8h, v4.8h, #(14 - \bitdepth)
> -        srshr           v1.8h, v4.8h, #(14 - \bitdepth)
> -.if \size > 4
> -        srshr           v2.8h, v4.8h, #(14 - \bitdepth)
> -        srshr           v3.8h, v4.8h, #(14 - \bitdepth)
> -.if \size > 16 /* dc 32x32 */
> -        mov             x2, #4
> +        ldrsh           w1, [x0]
> +        add             w1, w1, #1
> +        asr             w1, w1, #1
> +        add             w1, w1, #(1 << (13 - \bitdepth))
> +        asr             w1, w1, #(14 - \bitdepth)
> +        dup             v0.8h, w1
> +
> +.if \size < 8
> +        stp             q0, q0, [x0]
> +.else
> +.if \size < 16

You can use .elseif instead of .else and a nested .if.

Other than that, this patch looks good to me, thanks!

// Martin
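
For illustration, the .elseif form suggested above could look roughly like the sketch below. Only the first branch is taken from the quoted hunk; the wrapper macro name dc_store_sketch is hypothetical, and the bodies of the other two branches are placeholders because the rest of the patch is not quoted here.

```asm
// Hypothetical wrapper macro, used only to make the fragment self-contained.
.macro dc_store_sketch size
.if \size < 8
        // 4x4 case, as in the quoted hunk: 16 coefficients, 32 bytes
        stp             q0, q0, [x0]
.elseif \size < 16
        // 8x8 case: placeholder, body not shown in the quoted hunk
.else
        // 16x16 and 32x32 cases: placeholder, body not shown in the quoted hunk
.endif
.endm
```

With .elseif the whole chain is closed by a single .endif, whereas .else followed by a nested .if needs one .endif per level.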