From mboxrd@z Thu Jan 1 00:00:00 1970
MIME-Version: 1.0
To: ffmpeg-devel@ffmpeg.org
Date: Sun, 14 Sep 2025 19:56:10 -0000
Message-ID: <175787977148.25.5147700478238832743@463a07221176>
Reply-To: FFmpeg development discussions and patches
Subject: [FFmpeg-devel] [PATCH] avcodec/aarch64/vvc: Unroll vvc_bdof_grad_filter_8x_neon (PR #20519)
List-Id: FFmpeg development discussions and patches
From: welder via ffmpeg-devel
Cc: welder
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit

PR #20519 opened by welder
URL: https://code.ffmpeg.org/FFmpeg/FFmpeg/pulls/20519
Patch URL: https://code.ffmpeg.org/FFmpeg/FFmpeg/pulls/20519.patch

I hope this isn't overkill: I unrolled the 16-wide variant and interleaved the
loads, stores, and arithmetic ops to the best of my ability. I also got rid of
the inner loop together with its mov/add preamble and epilogue.

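For anyone following the NEON code, here is a rough scalar sketch in C of what one row of the gradient filter computes, reconstructed from the comments and the `sshr #6` shifts in the patch. The names `grad_h`, `grad_v`, `bdof_grad_row_ref` and the `stride` parameter (counted in elements, standing in for VVC_MAX_PB_SIZE) are made up for illustration and are not taken from the FFmpeg tree. It also assumes that `>>` on a negative value is an arithmetic shift, which C leaves implementation-defined.

```c
#include <stddef.h>
#include <stdint.h>

#define GRAD_SHIFT 6 /* assumption: mirrors the `sshr ..., #6` in the patch */

/* Horizontal gradient at p[0]: right neighbour minus left neighbour. */
static int16_t grad_h(const int16_t *p)
{
    return (int16_t)((p[1] >> GRAD_SHIFT) - (p[-1] >> GRAD_SHIFT));
}

/* Vertical gradient at p[0]: row below minus row above. */
static int16_t grad_v(const int16_t *p, ptrdiff_t stride)
{
    return (int16_t)((p[stride] >> GRAD_SHIFT) - (p[-stride] >> GRAD_SHIFT));
}

/*
 * Hypothetical scalar reference for one output row of `width` samples.
 * gh0/gv0 receive the halved sums (the `shadd` results), gh1/gv1 receive
 * the differences (the `sub` results), matching the stores in the patch.
 */
static void bdof_grad_row_ref(int16_t *gh0, int16_t *gh1,
                              int16_t *gv0, int16_t *gv1,
                              const int16_t *src0, const int16_t *src1,
                              ptrdiff_t stride, int width)
{
    for (int x = 0; x < width; x++) {
        int h0 = grad_h(src0 + x);
        int h1 = grad_h(src1 + x);
        int v0 = grad_v(src0 + x, stride);
        int v1 = grad_v(src1 + x, stride);

        gh0[x] = (int16_t)((h0 + h1) >> 1); /* (gradient_h0 + gradient_h1) >> 1 */
        gh1[x] = (int16_t)(h0 - h1);        /* gradient_h0 - gradient_h1 */
        gv0[x] = (int16_t)((v0 + v1) >> 1); /* (gradient_v0 + gradient_v1) >> 1 */
        gv1[x] = (int16_t)(v0 - v1);        /* gradient_v0 - gradient_v1 */
    }
}
```

The 16-wide path in the patch produces the same four outputs per sample, sixteen samples per row at a time, with the loads, shifts and stores reordered so they can overlap.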
>>From 668ddf2d4a0b9213403f7468ab9d7542a0119afb Mon Sep 17 00:00:00 2001
From: Krzysztof Pyrkosz
Date: Sun, 14 Sep 2025 19:13:24 +0200
Subject: [PATCH] avcodec/aarch64/vvc: Unroll vvc_bdof_grad_filter_8x_neon

Before and after:
A53:
apply_bdof_8_16x8_neon:       2733.1 ( 4.88x)
apply_bdof_8_16x16_neon:      5458.6 ( 4.86x)
apply_bdof_10_16x8_neon:      2789.8 ( 4.64x)
apply_bdof_10_16x16_neon:     5523.8 ( 4.68x)
apply_bdof_12_16x8_neon:      2792.8 ( 4.58x)
apply_bdof_12_16x16_neon:     5519.5 ( 4.63x)

apply_bdof_8_16x8_neon:       2571.8 ( 5.12x)
apply_bdof_8_16x16_neon:      5173.3 ( 5.12x)
apply_bdof_10_16x8_neon:      2635.1 ( 4.87x)
apply_bdof_10_16x16_neon:     5243.0 ( 4.89x)
apply_bdof_12_16x8_neon:      2613.0 ( 4.89x)
apply_bdof_12_16x16_neon:     5231.7 ( 4.90x)

A78:
apply_bdof_8_16x8_neon:        565.3 ( 8.43x)
apply_bdof_8_16x16_neon:      1109.5 ( 8.60x)
apply_bdof_10_16x8_neon:       568.2 ( 7.92x)
apply_bdof_10_16x16_neon:     1114.1 ( 8.08x)
apply_bdof_12_16x8_neon:       570.2 ( 7.87x)
apply_bdof_12_16x16_neon:     1116.3 ( 8.03x)

apply_bdof_8_16x8_neon:        541.4 ( 8.81x)
apply_bdof_8_16x16_neon:      1065.9 ( 8.97x)
apply_bdof_10_16x8_neon:       543.2 ( 8.32x)
apply_bdof_10_16x16_neon:     1071.5 ( 8.39x)
apply_bdof_12_16x8_neon:       544.2 ( 8.25x)
apply_bdof_12_16x16_neon:     1074.1 ( 8.37x)
---
 libavcodec/aarch64/vvc/inter.S | 85 +++++++++++++++++++++++++++++++---
 1 file changed, 78 insertions(+), 7 deletions(-)

diff --git a/libavcodec/aarch64/vvc/inter.S b/libavcodec/aarch64/vvc/inter.S
index 01d2ff155c..47810ec3c1 100644
--- a/libavcodec/aarch64/vvc/inter.S
+++ b/libavcodec/aarch64/vvc/inter.S
@@ -827,10 +827,10 @@ function vvc_bdof_grad_filter_8x_neon, export=0
         src1            .req x5
         width           .req w6
         height          .req w7
+        tbnz            w6, #4, 16f
 
 1:
         mov             x10, src0
-        mov             w11, width
         mov             x12, gh0
         mov             x13, gv0
         mov             x14, src1
@@ -863,16 +863,11 @@ function vvc_bdof_grad_filter_8x_neon, export=0
         // results of gradient_v1
         sub             v6.8h, v6.8h, v7.8h
 
-        add             x10, x10, #16
-        add             x14, x14, #16
-
         // (gradient_h0 + gradient_h1) >> 1
         shadd           v1.8h, v0.8h, v4.8h
         // gradient_h0 - gradient_h1
         sub             v5.8h, v0.8h, v4.8h
 
-        subs            w11, w11, #8
-
         // (gradient_v0 + gradient_v1) >> 1
         shadd           v3.8h, v2.8h, v6.8h
         // gradient_v0 - gradient_v1
@@ -882,7 +877,6 @@ function vvc_bdof_grad_filter_8x_neon, export=0
         st1             {v5.8h}, [x15], #16
         st1             {v3.8h}, [x13], #16
         st1             {v7.8h}, [x16], #16
-        b.ne            2b
 
         subs            height, height, #1
         add             gh0, gh0, #(BDOF_BLOCK_SIZE << 1)
@@ -894,6 +888,83 @@ function vvc_bdof_grad_filter_8x_neon, export=0
         b.ne            1b
         ret
 
+16:
+        ldur            q0, [x4, #2]
+        ldur            q1, [x4, #18]
+        ldur            q16, [x4, #-2]
+        sshr            v0.8h, v0.8h, #6
+        ldur            q17, [x4, #14]
+        sshr            v1.8h, v1.8h, #6
+        ldp             q18, q19, [x4, #-(VVC_MAX_PB_SIZE << 1)]
+        sshr            v16.8h, v16.8h, #6
+        ldp             q2, q3, [x4, #(VVC_MAX_PB_SIZE << 1)]!
+        ldur            q20, [x5, #2]
+        sshr            v17.8h, v17.8h, #6
+        ldur            q21, [x5, #18]
+        sshr            v2.8h, v2.8h, #6
+        ldur            q22, [x5, #-2]
+        sshr            v3.8h, v3.8h, #6
+        ldur            q23, [x5, #14]
+        sshr            v18.8h, v18.8h, #6
+        ldp             q26, q27, [x5, #-(VVC_MAX_PB_SIZE << 1)]
+        sshr            v19.8h, v19.8h, #6
+        ldp             q24, q25, [x5, #(VVC_MAX_PB_SIZE << 1)]!
+
+        // results of gradient_h0
+        sub             v0.8h, v0.8h, v16.8h
+        sub             v1.8h, v1.8h, v17.8h
+
+        // results of gradient_v0
+        sub             v2.8h, v2.8h, v18.8h
+        sub             v3.8h, v3.8h, v19.8h
+
+        sshr            v20.8h, v20.8h, #6
+        sshr            v21.8h, v21.8h, #6
+        sshr            v22.8h, v22.8h, #6
+        sshr            v23.8h, v23.8h, #6
+
+        // results of gradient_h1
+        sub             v20.8h, v20.8h, v22.8h
+        sub             v21.8h, v21.8h, v23.8h
+
+        sshr            v24.8h, v24.8h, #6
+        sshr            v25.8h, v25.8h, #6
+
+        // gradient_h0 - gradient_h1
+        sub             v22.8h, v0.8h, v20.8h
+        sub             v23.8h, v1.8h, v21.8h
+
+        // (gradient_h0 + gradient_h1) >> 1
+        shadd           v16.8h, v0.8h, v20.8h
+        shadd           v17.8h, v1.8h, v21.8h
+
+        st1             {v22.8h, v23.8h}, [gh1], #32
+
+        sshr            v26.8h, v26.8h, #6
+        sshr            v27.8h, v27.8h, #6
+
+        st1             {v16.8h, v17.8h}, [gh0], #32
+
+        // results of gradient_v1
+        sub             v24.8h, v24.8h, v26.8h
+        sub             v25.8h, v25.8h, v27.8h
+
+        // (gradient_v0 + gradient_v1) >> 1
+        shadd           v18.8h, v2.8h, v24.8h
+        shadd           v19.8h, v3.8h, v25.8h
+
+        // gradient_v0 - gradient_v1
+        sub             v26.8h, v2.8h, v24.8h
+        sub             v27.8h, v3.8h, v25.8h
+
+        st1             {v18.8h,v19.8h}, [gv0], #32
+
+        subs            height, height, #1
+        st1             {v26.8h,v27.8h}, [gv1], #32
+
+        b.ne            16b
+        ret
+
         .unreq          gh0
         .unreq          gh1
         .unreq          gv0
-- 
2.49.1

_______________________________________________
ffmpeg-devel mailing list -- ffmpeg-devel@ffmpeg.org
To unsubscribe send an email to ffmpeg-devel-leave@ffmpeg.org