Date: Thu, 29 May 2025 21:53:38 +0300 (EEST)
From: Martin Storsjö
To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Cc: Dmitriy Kovalenko
Subject: Re: [FFmpeg-devel] [PATCH 1/2] swscale: rgb_to_yuv neon optimizations
Message-ID: <379d301d-1ee0-243a-256b-4a4b73e2ac99@martin.st>

On Tue, 27 May 2025, Dmitriy Kovalenko wrote:

> This particular set of changes is a small improvement to all the
> existing functions and macro. The biggest performance gain is
> coming from post loading increment of the pointer and immediate
> prefetching

How certain are you that the prefetching actually makes a notable
difference here? I tried this patch on an M3 Pro, and I see no difference
in the checkasm benchmark numbers if I remove those prefetch instructions.
Do you have a setup where you can actually measure that they make a
difference?

So far, nothing within ffmpeg uses prefetch instructions anywhere, and I
haven't seen a case where they actually do anything beneficial. (The
conventional wisdom I've heard is that they mostly do nothing, or actually
end up being harmful. In this case, they also add a direct dependence on
the updated pointer register from the directly preceding instruction,
which _is_ harmful on in-order cores, unless the core ignores the
instruction entirely.)
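To illustrate the dependency I mean: the patch produces sequences like
this, where the prfm can't issue until the load has written back the
updated x3:

        ld3             { v16.16b, v17.16b, v18.16b }, [x3], #48
        prfm            pldl1strm, [x3, #48]    // needs the post-incremented x3

If the prefetching did turn out to help somewhere, issuing it before the
load, with the increment folded into the offset, would at least avoid that
dependency; both forms target the same address, the old x3 + 96 (untested
sketch):

        prfm            pldl1strm, [x3, #96]    // uses x3 before the writeback
        ld3             { v16.16b, v17.16b, v18.16b }, [x3], #48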
> diff --git a/libswscale/aarch64/input.S b/libswscale/aarch64/input.S
> index c1c0adffc8..ee8eb24c14 100644
> --- a/libswscale/aarch64/input.S
> +++ b/libswscale/aarch64/input.S
> @@ -1,5 +1,4 @@
> -/*
> - * Copyright (c) 2024 Zhao Zhili
> +/* Copyright (c) 2024 Zhao Zhili

This is an unrelated change.

>   *
>   * This file is part of FFmpeg.
>   *
> @@ -57,20 +56,41 @@
>          sqshrn2         \dst\().8h, \dst2\().4s, \right_shift // dst_higher_half = dst2 >> right_shift
>  .endm
>
> +// interleaved product version of the rgb to yuv gives slightly better performance on non-performant mobile
> +.macro rgb_to_uv_interleaved_product r, g, b, u_coef0, u_coef1, u_coef2, v_coef0, v_coef1, v_coef2, u_dst1, u_dst2, v_dst1, v_dst2, u_dst, v_dst, right_shift
> +    smlal           \u_dst1\().4s, \u_coef0\().4h, \r\().4h // U += ru * r (first 4)
> +    smlal           \v_dst1\().4s, \v_coef0\().4h, \r\().4h // V += rv * r (first 4)
> +    smlal2          \u_dst2\().4s, \u_coef0\().8h, \r\().8h // U += ru * r (second 4)
> +    smlal2          \v_dst2\().4s, \v_coef0\().8h, \r\().8h // V += rv * r (second 4)
> +

The patch adds trailing whitespace here and in many other places; make
sure you don't do that. (It is visible by doing e.g. "git show".) Also, as
a general rule, indent instructions within macros in the same way as
elsewhere.

> +    smlal           \u_dst1\().4s, \u_coef1\().4h, \g\().4h // U += gu * g (first 4)
> +    smlal           \v_dst1\().4s, \v_coef1\().4h, \g\().4h // V += gv * g (first 4)
> +    smlal2          \u_dst2\().4s, \u_coef1\().8h, \g\().8h // U += gu * g (second 4)
> +    smlal2          \v_dst2\().4s, \v_coef1\().8h, \g\().8h // V += gv * g (second 4)

If by "non-performant mobile" you mean small in-order cores, most of them
can handle repeated accumulations like these even faster if you sequence
them so that all accumulations into one register are consecutive: e.g.
first all "smlal \u_dst1\().4s", followed by all "smlal2 \u_dst2\().4s",
followed by \v_dst1, followed by \v_dst2. It's worth benchmarking this if
you do have access to such cores (e.g. Cortex-A53/A55; perhaps that's also
the case on the Cortex-R you mentioned in the commit message).

That kind of code sequence is very counterintuitive when considering
instruction scheduling for an in-order core, but there is an explicit
mention of it in the code optimization guides for them, so it's usually
worthwhile to test out such a form of these accumulations.
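Concretely, for the macro quoted above, that reordering would look roughly
like this (untested sketch, same operations, just resequenced):

        smlal           \u_dst1\().4s, \u_coef0\().4h, \r\().4h
        smlal           \u_dst1\().4s, \u_coef1\().4h, \g\().4h
        smlal           \u_dst1\().4s, \u_coef2\().4h, \b\().4h
        smlal2          \u_dst2\().4s, \u_coef0\().8h, \r\().8h
        smlal2          \u_dst2\().4s, \u_coef1\().8h, \g\().8h
        smlal2          \u_dst2\().4s, \u_coef2\().8h, \b\().8h
        smlal           \v_dst1\().4s, \v_coef0\().4h, \r\().4h
        smlal           \v_dst1\().4s, \v_coef1\().4h, \g\().4h
        smlal           \v_dst1\().4s, \v_coef2\().4h, \b\().4h
        smlal2          \v_dst2\().4s, \v_coef0\().8h, \r\().8h
        smlal2          \v_dst2\().4s, \v_coef1\().8h, \g\().8h
        smlal2          \v_dst2\().4s, \v_coef2\().8h, \b\().8h

As far as I understand those optimization guides, the multiply-accumulate
pipeline can forward the accumulator directly from the immediately
preceding multiply-accumulate, so a chain into the same register issues
back to back, while switching the destination register on every
instruction defeats that forwarding.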
> +
> +    smlal           \u_dst1\().4s, \u_coef2\().4h, \b\().4h // U += bu * b (first 4)
> +    smlal           \v_dst1\().4s, \v_coef2\().4h, \b\().4h // V += bv * b (first 4)
> +    smlal2          \u_dst2\().4s, \u_coef2\().8h, \b\().8h // U += bu * b (second 4)
> +    smlal2          \v_dst2\().4s, \v_coef2\().8h, \b\().8h // V += bv * b (second 4)
> +
> +    sqshrn          \u_dst\().4h, \u_dst1\().4s, \right_shift // U first 4 pixels
> +    sqshrn2         \u_dst\().8h, \u_dst2\().4s, \right_shift // U all 8 pixels
> +    sqshrn          \v_dst\().4h, \v_dst1\().4s, \right_shift // V first 4 pixels
> +    sqshrn2         \v_dst\().8h, \v_dst2\().4s, \right_shift // V all 8 pixels
> +.endm
> +
>  .macro rgbToY_neon fmt_bgr, fmt_rgb, element, alpha_first=0
>  function ff_\fmt_bgr\()ToY_neon, export=1
> -        cmp             w4, #0 // check width > 0
> +        cbz             w4, 3f // check width > 0
>          ldp             w12, w11, [x5] // w12: ry, w11: gy
>          ldr             w10, [x5, #8] // w10: by
> -        b.gt            4f
> -        ret
> +        b               4f
>  endfunc
>
>  function ff_\fmt_rgb\()ToY_neon, export=1
> -        cmp             w4, #0 // check width > 0
> +        cbz             w4, 3f // check width > 0
>          ldp             w10, w11, [x5] // w10: ry, w11: gy
>          ldr             w12, [x5, #8] // w12: by
> -        b.le            3f
>  4:
>          mov             w9, #256 // w9 = 1 << (RGB2YUV_SHIFT - 7)
>          movk            w9, #8, lsl #16 // w9 += 32 << (RGB2YUV_SHIFT - 1)
> @@ -158,8 +178,7 @@ rgbToY_neon abgr32, argb32, element=4, alpha_first=1
>
>  .macro rgbToUV_half_neon fmt_bgr, fmt_rgb, element, alpha_first=0
>  function ff_\fmt_bgr\()ToUV_half_neon, export=1
> -        cmp             w5, #0 // check width > 0
> -        b.le            3f
> +        cbz             w5, 3f // check width > 0
>
>          ldp             w12, w11, [x6, #12]
>          ldp             w10, w15, [x6, #20]
> @@ -168,7 +187,7 @@ function ff_\fmt_bgr\()ToUV_half_neon, export=1
>  endfunc
>
>  function ff_\fmt_rgb\()ToUV_half_neon, export=1
> -        cmp             w5, #0 // check width > 0
> +        cmp             w5, #0 // check width > 0
>          b.le            3f
>
>          ldp             w10, w11, [x6, #12] // w10: ru, w11: gu
> @@ -178,32 +197,41 @@ function ff_\fmt_rgb\()ToUV_half_neon, export=1
>          cmp             w5, #8
>          rgb_set_uv_coeff half=1
>          b.lt            2f
> -1:
> +1:      // load 16 pixels and prefetch memory for the next block
>  .if \element == 3
> -        ld3             { v16.16b, v17.16b, v18.16b }, [x3]
> +        ld3             { v16.16b, v17.16b, v18.16b }, [x3], #48
> +        prfm            pldl1strm, [x3, #48]
>  .else
> -        ld4             { v16.16b, v17.16b, v18.16b, v19.16b }, [x3]
> +        ld4             { v16.16b, v17.16b, v18.16b, v19.16b }, [x3], #64
> +        prfm            pldl1strm, [x3, #64]
>  .endif
> +
>  .if \alpha_first
> -        uaddlp          v21.8h, v19.16b
> -        uaddlp          v20.8h, v18.16b
> -        uaddlp          v19.8h, v17.16b
> +        uaddlp          v21.8h, v19.16b // v21: summed b pairs
> +        uaddlp          v20.8h, v18.16b // v20: summed g pairs
> +        uaddlp          v19.8h, v17.16b // v19: summed r pairs
>  .else
> -        uaddlp          v19.8h, v16.16b // v19: r
> -        uaddlp          v20.8h, v17.16b // v20: g
> -        uaddlp          v21.8h, v18.16b // v21: b
> +        uaddlp          v19.8h, v16.16b // v19: summed r pairs
> +        uaddlp          v20.8h, v17.16b // v20: summed g pairs
> +        uaddlp          v21.8h, v18.16b // v21: summed b pairs
>  .endif
>
> -        rgb_to_yuv_product v19, v20, v21, v22, v23, v16, v0, v1, v2, #10
> -        rgb_to_yuv_product v19, v20, v21, v24, v25, v17, v3, v4, v5, #10
> -        sub             w5, w5, #8 // width -= 8
> -        add             x3, x3, #(16*\element)
> -        cmp             w5, #8 // width >= 8 ?
> +        mov             v22.16b, v6.16b // U first half
> +        mov             v23.16b, v6.16b // U second half
> +        mov             v24.16b, v6.16b // V first half
> +        mov             v25.16b, v6.16b // V second half
> +
> +        rgb_to_uv_interleaved_product v19, v20, v21, v0, v1, v2, v3, v4, v5, v22, v23, v24, v25, v16, v17, #10
> +
>          str             q16, [x0], #16 // store dst_u
>          str             q17, [x1], #16 // store dst_v
> +
> +        sub             w5, w5, #8 // width -= 8
> +        cmp             w5, #8 // width >= 8 ?
>          b.ge            1b
> -        cbz             w5, 3f
> -2:
> +        cbz             w5, 3f // No pixels left? Exit
> +
> +2:      // Scalar fallback for remaining pixels
>  .if \alpha_first
>          rgb_load_add_half 1, 5, 2, 6, 3, 7
>  .else
> @@ -213,21 +241,24 @@ function ff_\fmt_rgb\()ToUV_half_neon, export=1
>          rgb_load_add_half 0, 4, 1, 5, 2, 6
>  .endif
>  .endif
> -
>          smaddl          x8, w2, w10, x9 // dst_u = ru * r + const_offset
> +        smaddl          x16, w2, w13, x9 // dst_v = rv * r + const_offset (parallel)
> +
>          smaddl          x8, w4, w11, x8 // dst_u += gu * g
> +        smaddl          x16, w4, w14, x16 // dst_v += gv * g (parallel)
> +
>          smaddl          x8, w7, w12, x8 // dst_u += bu * b
> -        asr             x8, x8, #10 // dst_u >>= 10
> +        smaddl          x16, w7, w15, x16 // dst_v += bv * b (parallel)
> +
> +        asr             w8, w8, #10 // dst_u >>= 10
> +        asr             w16, w16, #10 // dst_v >>= 10
> +
>          strh            w8, [x0], #2 // store dst_u
> -
> -        smaddl          x8, w2, w13, x9 // dst_v = rv * r + const_offset
> -        smaddl          x8, w4, w14, x8 // dst_v += gv * g
> -        smaddl          x8, w7, w15, x8 // dst_v += bv * b
> -        asr             x8, x8, #10 // dst_v >>= 10
> -        sub             w5, w5, #1
> -        add             x3, x3, #(2*\element)
> -        strh            w8, [x1], #2 // store dst_v
> -        cbnz            w5, 2b
> +        strh            w16, [x1], #2 // store dst_v
> +
> +        sub             w5, w5, #1 // width--
> +        add             x3, x3, #(2*\element) // Advance source pointer
> +        cbnz            w5, 2b // Process next pixel if any left
>  3:
>          ret
>  endfunc
> @@ -244,9 +275,9 @@ function ff_\fmt_bgr\()ToUV_neon, export=1
>          cmp             w5, #0 // check width > 0
>          b.le            3f
>
> -        ldp             w12, w11, [x6, #12]
> -        ldp             w10, w15, [x6, #20]
> -        ldp             w14, w13, [x6, #28]
> +        ldp             w12, w11, [x6, #12] // bu, gu
> +        ldp             w10, w15, [x6, #20] // ru, bv
> +        ldp             w14, w13, [x6, #28] // gv, rv
>          b               4f
>  endfunc
>
> @@ -263,21 +294,48 @@ function ff_\fmt_rgb\()ToUV_neon, export=1
>          b.lt            2f
>  1:
>  .if \alpha_first
> -        argb_to_yuv_load_rgb x3
> +        ld4             { v16.16b, v17.16b, v18.16b, v19.16b }, [x3], #64
> +        uxtl            v21.8h, v19.8b // v21: b
> +        uxtl2           v24.8h, v19.16b // v24: b
> +        uxtl            v19.8h, v17.8b // v19: r
> +        uxtl            v20.8h, v18.8b // v20: g
> +        uxtl2           v22.8h, v17.16b // v22: r
> +        uxtl2           v23.8h, v18.16b // v23: g

The code here, and below, is exactly the same as before, except for the
postincrement on the load (plus the prefetch). Can we add the
postincrement to the macro rather than unmacroing the code?

>  .else
> -        rgb_to_yuv_load_rgb x3, \element
> +    .if \element == 3
> +            ld3             { v16.16b, v17.16b, v18.16b }, [x3], #48
> +            prfm            pldl1strm, [x3, #48]

Instead of adding unusual indentation of the instructions here, you could
use 2 instead of 4 spaces for the .if directives, to keep the vertical
alignment of the instructions.
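To sketch both points (hypothetical, untested: an extra "inc" parameter on
the existing rgb_to_yuv_load_rgb macro selects postincrement addressing,
2-space .if indentation keeps the instruction columns aligned, and the
prefetches, if they are kept at all, would stay in the caller):

.macro rgb_to_yuv_load_rgb src, element, inc=0
  .if \element == 3
    .if \inc
        ld3             { v16.16b, v17.16b, v18.16b }, [\src], #48
    .else
        ld3             { v16.16b, v17.16b, v18.16b }, [\src]
    .endif
  .else
    .if \inc
        ld4             { v16.16b, v17.16b, v18.16b, v19.16b }, [\src], #64
    .else
        ld4             { v16.16b, v17.16b, v18.16b, v19.16b }, [\src]
    .endif
  .endif
        uxtl            v19.8h, v16.8b          // v19: r
        uxtl            v20.8h, v17.8b          // v20: g
        uxtl            v21.8h, v18.8b          // v21: b
        uxtl2           v22.8h, v16.16b         // v22: r
        uxtl2           v23.8h, v17.16b         // v23: g
        uxtl2           v24.8h, v18.16b         // v24: b
.endm

The loop here would then just call "rgb_to_yuv_load_rgb x3, \element, 1",
and the existing non-incrementing callers would stay unchanged.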
> +    .else // element == 4
> +            ld4             { v16.16b, v17.16b, v18.16b, v19.16b }, [x3], #64
> +            prfm            pldl1strm, [x3, #64]
> +    .endif
> +        uxtl            v19.8h, v16.8b // v19: r
> +        uxtl            v20.8h, v17.8b // v20: g
> +        uxtl            v21.8h, v18.8b // v21: b
> +        uxtl2           v22.8h, v16.16b // v22: r
> +        uxtl2           v23.8h, v17.16b // v23: g
> +        uxtl2           v24.8h, v18.16b // v24: b
>  .endif
> -        rgb_to_yuv_product v19, v20, v21, v25, v26, v16, v0, v1, v2, #9
> -        rgb_to_yuv_product v22, v23, v24, v27, v28, v17, v0, v1, v2, #9
> -        rgb_to_yuv_product v19, v20, v21, v25, v26, v18, v3, v4, v5, #9
> -        rgb_to_yuv_product v22, v23, v24, v27, v28, v19, v3, v4, v5, #9
> -        sub             w5, w5, #16
> -        add             x3, x3, #(16*\element)
> -        cmp             w5, #16
> -        stp             q16, q17, [x0], #32 // store to dst_u
> -        stp             q18, q19, [x1], #32 // store to dst_v
> +        // process 2 groups of 8 pixels
> +        mov             v25.16b, v6.16b // U_dst1 = const_offset (32-bit accumulators)
> +        mov             v26.16b, v6.16b // U_dst2 = const_offset
> +        mov             v27.16b, v6.16b // V_dst1 = const_offset
> +        mov             v28.16b, v6.16b // V_dst2 = const_offset
> +        rgb_to_uv_interleaved_product v19, v20, v21, v0, v1, v2, v3, v4, v5, v25, v26, v27, v28, v16, v18, #9
> +
> +        mov             v25.16b, v6.16b
> +        mov             v26.16b, v6.16b
> +        mov             v27.16b, v6.16b
> +        mov             v28.16b, v6.16b
> +        rgb_to_uv_interleaved_product v22, v23, v24, v0, v1, v2, v3, v4, v5, v25, v26, v27, v28, v17, v19, #9
> +
> +        sub             w5, w5, #16 // width -= 16
> +        cmp             w5, #16 // width >= 16 ?
> +        stp             q16, q17, [x0], #32 // store to dst_u (post-increment)
> +        stp             q18, q19, [x1], #32 // store to dst_v (post-increment)
>          b.ge            1b
> -        cbz             w5, 3f
> +        cbz             w5, 3f // No pixels left? Exit
> +
>  2:
>  .if \alpha_first
>          ldrb            w16, [x3, #1] // w16: r
> @@ -292,13 +350,13 @@ function ff_\fmt_rgb\()ToUV_neon, export=1
>          smaddl          x8, w16, w10, x9 // x8 = ru * r + const_offset
>          smaddl          x8, w17, w11, x8 // x8 += gu * g
>          smaddl          x8, w4, w12, x8 // x8 += bu * b
> -        asr             w8, w8, #9 // x8 >>= 9
> +        asr             x8, x8, #9 // x8 >>= 9
>          strh            w8, [x0], #2 // store to dst_u

Does this make any practical difference, as we're just storing the lower
32 bits anyway?

// Martin