[FFmpeg-devel] [PATCH 1/5] libavcodec: aarch64: Don't clobber v8 in the h%4 case in ff_pix_abs16_xy2

Git Inbox Mirror of the ffmpeg-devel mailing list - see https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
 help / color / mirror / Atom feed

* [FFmpeg-devel] [PATCH 1/5] libavcodec: aarch64: Don't clobber v8 in the h%4 case in ff_pix_abs16_xy2_neon
@ 2022-07-13 20:48 Martin Storsjö
  2022-07-13 20:48 ` [FFmpeg-devel] [PATCH 2/5] checkasm: motion: Make the benchmarks more stable Martin Storsjö
                   ` (4 more replies)
  0 siblings, 5 replies; 20+ messages in thread
From: Martin Storsjö @ 2022-07-13 20:48 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Jonathan Swinney

Checkasm doesn't currently test this codepath.
---
 libavcodec/aarch64/me_cmp_neon.S | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/libavcodec/aarch64/me_cmp_neon.S b/libavcodec/aarch64/me_cmp_neon.S
index e49d049fc2..31db3793d9 100644
--- a/libavcodec/aarch64/me_cmp_neon.S
+++ b/libavcodec/aarch64/me_cmp_neon.S
@@ -189,11 +189,11 @@ function ff_pix_abs16_xy2_neon, export=1
         urshr           v16.8h, v16.8h, #2          // shift right by 2 0..7 (rounding shift right)
         urshr           v17.8h, v17.8h, #2          // shift right by 2 8..15
 
-        uxtl2           v8.8h, v1.16b               // 8->16 bits pix1 8..15
+        uxtl2           v7.8h, v1.16b               // 8->16 bits pix1 8..15
         uxtl            v1.8h, v1.8b                // 8->16 bits pix1 0..7
 
         uabd            v6.8h, v1.8h, v16.8h        // absolute difference 0..7
-        uaba            v6.8h, v8.8h, v17.8h        // absolute difference accumulate 8..15
+        uaba            v6.8h, v7.8h, v17.8h        // absolute difference accumulate 8..15
         mov             v2.16b, v18.16b             // pix3 -> pix2
         mov             v3.16b, v19.16b             // pix3+1 -> pix2+1
         uaddlv          s6, v6.8h                   // add up accumulator in v6
-- 
2.25.1

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [FFmpeg-devel] [PATCH 2/5] checkasm: motion: Make the benchmarks more stable
  2022-07-13 20:48 [FFmpeg-devel] [PATCH 1/5] libavcodec: aarch64: Don't clobber v8 in the h%4 case in ff_pix_abs16_xy2_neon Martin Storsjö
@ 2022-07-13 20:48 ` Martin Storsjö
  2022-07-15 19:35   ` Swinney, Jonathan
  2022-07-13 20:48 ` [FFmpeg-devel] [PATCH 3/5] aarch64: me_cmp: Interleave some of the loads in ff_pix_abs16_xy2_neon Martin Storsjö
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 20+ messages in thread
From: Martin Storsjö @ 2022-07-13 20:48 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Jonathan Swinney

Don't use the last random offset, but a static one.
---
 tests/checkasm/motion.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/tests/checkasm/motion.c b/tests/checkasm/motion.c
index 79e4358941..87b20d1c10 100644
--- a/tests/checkasm/motion.c
+++ b/tests/checkasm/motion.c
@@ -81,7 +81,8 @@ static void test_motion(const char *name, me_cmp_func test_func)
                 break;
             }
         }
-        // benchmark with the final value of ptr
+        // Test with a fixed offset, for benchmark stability
+        ptr = img2 + 3 * WIDTH + 3;
         bench_new(NULL, img1, ptr, WIDTH, 8);
     }
 }
-- 
2.25.1

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [FFmpeg-devel] [PATCH 2/5] checkasm: motion: Make the benchmarks more stable
  2022-07-13 20:48 ` [FFmpeg-devel] [PATCH 2/5] checkasm: motion: Make the benchmarks more stable Martin Storsjö
@ 2022-07-15 19:35   ` Swinney, Jonathan
  0 siblings, 0 replies; 20+ messages in thread
From: Swinney, Jonathan @ 2022-07-15 19:35 UTC (permalink / raw)
  To: Martin Storsjö, ffmpeg-devel

LGTM.

-- 

Jonathan Swinney

On 7/13/22, 3:49 PM, "Martin Storsjö" <martin@martin.st> wrote:

    CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.



    Don't use the last random offset, but a static one.
    ---
     tests/checkasm/motion.c | 3 ++-
     1 file changed, 2 insertions(+), 1 deletion(-)

    diff --git a/tests/checkasm/motion.c b/tests/checkasm/motion.c
    index 79e4358941..87b20d1c10 100644
    --- a/tests/checkasm/motion.c
    +++ b/tests/checkasm/motion.c
    @@ -81,7 +81,8 @@ static void test_motion(const char *name, me_cmp_func test_func)
                     break;
                 }
             }
    -        // benchmark with the final value of ptr
    +        // Test with a fixed offset, for benchmark stability
    +        ptr = img2 + 3 * WIDTH + 3;
             bench_new(NULL, img1, ptr, WIDTH, 8);
         }
     }
    --
    2.25.1


_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [FFmpeg-devel] [PATCH 3/5] aarch64: me_cmp: Interleave some of the loads in ff_pix_abs16_xy2_neon
  2022-07-13 20:48 [FFmpeg-devel] [PATCH 1/5] libavcodec: aarch64: Don't clobber v8 in the h%4 case in ff_pix_abs16_xy2_neon Martin Storsjö
  2022-07-13 20:48 ` [FFmpeg-devel] [PATCH 2/5] checkasm: motion: Make the benchmarks more stable Martin Storsjö
@ 2022-07-13 20:48 ` Martin Storsjö
  2022-07-15 19:34   ` Swinney, Jonathan
  2022-07-13 20:48 ` [FFmpeg-devel] [PATCH 4/5] aarch64: me_cmp: Switch from uabd to uabal " Martin Storsjö
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 20+ messages in thread
From: Martin Storsjö @ 2022-07-13 20:48 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Jonathan Swinney

Before:       Cortex A53    A72    A73   Graviton 3
pix_abs_0_3_neon:  183.7  112.7   97.5   41.2
After:
pix_abs_0_3_neon:  175.7  109.2   92.0   41.2
---
 libavcodec/aarch64/me_cmp_neon.S | 22 ++++++++++++++--------
 1 file changed, 14 insertions(+), 8 deletions(-)

diff --git a/libavcodec/aarch64/me_cmp_neon.S b/libavcodec/aarch64/me_cmp_neon.S
index 31db3793d9..0ae23d8922 100644
--- a/libavcodec/aarch64/me_cmp_neon.S
+++ b/libavcodec/aarch64/me_cmp_neon.S
@@ -101,26 +101,32 @@ function ff_pix_abs16_xy2_neon, export=1
         ld1             {v6.16b}, [x5], x3          // load pix3
         ld1             {v16.16b}, [x1], x3         // load pix1
 
-        ldur            q19, [x5, #1]               // load pix3+1
-        ld1             {v18.16b}, [x5], x3         // load pix3
-        ld1             {v17.16b}, [x1], x3         // load pix1
-
-        ldur            q22, [x5, #1]               // load pix3+1
-        ld1             {v21.16b}, [x5], x3         // load pix3
-        ld1             {v20.16b}, [x1], x3         // load pix1
-
         // These blocks compute the average: avg(pix2[n], pix2[n+1], pix3[n], pix3[n+1])
         uaddl           v30.8h, v4.8b, v5.8b        // pix3 + pix3+1 0..7
         uaddl2          v31.8h, v4.16b, v5.16b      // pix3 + pix3+1 8..15
+
+        ldur            q19, [x5, #1]               // load pix3+1
+
         add             v23.8h, v2.8h, v30.8h       // add up 0..7, using pix2 + pix2+1 values from previous iteration
         add             v24.8h, v3.8h, v31.8h       // add up 8..15, using pix2 + pix2+1 values from previous iteration
+
+        ld1             {v18.16b}, [x5], x3         // load pix3
+        ld1             {v17.16b}, [x1], x3         // load pix1
+
         rshrn           v23.8b, v23.8h, #2          // shift right 2 0..7 (rounding shift right)
         rshrn2          v23.16b, v24.8h, #2         // shift right 2 8..15
 
         uaddl           v2.8h, v6.8b, v7.8b         // pix3 + pix3+1 0..7
         uaddl2          v3.8h, v6.16b, v7.16b       // pix3 + pix3+1 8..15
+
+        ldur            q22, [x5, #1]               // load pix3+1
+
         add             v26.8h, v30.8h, v2.8h       // add up 0..7, using pix2 + pix2+1 values from pix3 above
         add             v27.8h, v31.8h, v3.8h       // add up 8..15, using pix2 + pix2+1 values from pix3 above
+
+        ld1             {v21.16b}, [x5], x3         // load pix3
+        ld1             {v20.16b}, [x1], x3         // load pix1
+
         rshrn           v26.8b, v26.8h, #2          // shift right 2 0..7 (rounding shift right)
         rshrn2          v26.16b, v27.8h, #2         // shift right 2 8..15
 
-- 
2.25.1

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [FFmpeg-devel] [PATCH 3/5] aarch64: me_cmp: Interleave some of the loads in ff_pix_abs16_xy2_neon
  2022-07-13 20:48 ` [FFmpeg-devel] [PATCH 3/5] aarch64: me_cmp: Interleave some of the loads in ff_pix_abs16_xy2_neon Martin Storsjö
@ 2022-07-15 19:34   ` Swinney, Jonathan
  0 siblings, 0 replies; 20+ messages in thread
From: Swinney, Jonathan @ 2022-07-15 19:34 UTC (permalink / raw)
  To: FFmpeg development discussions and patches; +Cc: Martin Storsjö

This looks good to me. As you can see it makes very little difference on Graviton 3, which is why I had not already done it. :)

-- 

Jonathan Swinney

On 7/13/22, 3:49 PM, "ffmpeg-devel on behalf of Martin Storsjö" <ffmpeg-devel-bounces@ffmpeg.org on behalf of martin@martin.st> wrote:

    CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.



    Before:       Cortex A53    A72    A73   Graviton 3
    pix_abs_0_3_neon:  183.7  112.7   97.5   41.2
    After:
    pix_abs_0_3_neon:  175.7  109.2   92.0   41.2
    ---
     libavcodec/aarch64/me_cmp_neon.S | 22 ++++++++++++++--------
     1 file changed, 14 insertions(+), 8 deletions(-)

    diff --git a/libavcodec/aarch64/me_cmp_neon.S b/libavcodec/aarch64/me_cmp_neon.S
    index 31db3793d9..0ae23d8922 100644
    --- a/libavcodec/aarch64/me_cmp_neon.S
    +++ b/libavcodec/aarch64/me_cmp_neon.S
    @@ -101,26 +101,32 @@ function ff_pix_abs16_xy2_neon, export=1
             ld1             {v6.16b}, [x5], x3          // load pix3
             ld1             {v16.16b}, [x1], x3         // load pix1

    -        ldur            q19, [x5, #1]               // load pix3+1
    -        ld1             {v18.16b}, [x5], x3         // load pix3
    -        ld1             {v17.16b}, [x1], x3         // load pix1
    -
    -        ldur            q22, [x5, #1]               // load pix3+1
    -        ld1             {v21.16b}, [x5], x3         // load pix3
    -        ld1             {v20.16b}, [x1], x3         // load pix1
    -
             // These blocks compute the average: avg(pix2[n], pix2[n+1], pix3[n], pix3[n+1])
             uaddl           v30.8h, v4.8b, v5.8b        // pix3 + pix3+1 0..7
             uaddl2          v31.8h, v4.16b, v5.16b      // pix3 + pix3+1 8..15
    +
    +        ldur            q19, [x5, #1]               // load pix3+1
    +
             add             v23.8h, v2.8h, v30.8h       // add up 0..7, using pix2 + pix2+1 values from previous iteration
             add             v24.8h, v3.8h, v31.8h       // add up 8..15, using pix2 + pix2+1 values from previous iteration
    +
    +        ld1             {v18.16b}, [x5], x3         // load pix3
    +        ld1             {v17.16b}, [x1], x3         // load pix1
    +
             rshrn           v23.8b, v23.8h, #2          // shift right 2 0..7 (rounding shift right)
             rshrn2          v23.16b, v24.8h, #2         // shift right 2 8..15

             uaddl           v2.8h, v6.8b, v7.8b         // pix3 + pix3+1 0..7
             uaddl2          v3.8h, v6.16b, v7.16b       // pix3 + pix3+1 8..15
    +
    +        ldur            q22, [x5, #1]               // load pix3+1
    +
             add             v26.8h, v30.8h, v2.8h       // add up 0..7, using pix2 + pix2+1 values from pix3 above
             add             v27.8h, v31.8h, v3.8h       // add up 8..15, using pix2 + pix2+1 values from pix3 above
    +
    +        ld1             {v21.16b}, [x5], x3         // load pix3
    +        ld1             {v20.16b}, [x1], x3         // load pix1
    +
             rshrn           v26.8b, v26.8h, #2          // shift right 2 0..7 (rounding shift right)
             rshrn2          v26.16b, v27.8h, #2         // shift right 2 8..15

    --
    2.25.1

    _______________________________________________
    ffmpeg-devel mailing list
    ffmpeg-devel@ffmpeg.org
    https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

    To unsubscribe, visit link above, or email
    ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [FFmpeg-devel] [PATCH 4/5] aarch64: me_cmp: Switch from uabd to uabal in ff_pix_abs16_xy2_neon
  2022-07-13 20:48 [FFmpeg-devel] [PATCH 1/5] libavcodec: aarch64: Don't clobber v8 in the h%4 case in ff_pix_abs16_xy2_neon Martin Storsjö
  2022-07-13 20:48 ` [FFmpeg-devel] [PATCH 2/5] checkasm: motion: Make the benchmarks more stable Martin Storsjö
  2022-07-13 20:48 ` [FFmpeg-devel] [PATCH 3/5] aarch64: me_cmp: Interleave some of the loads in ff_pix_abs16_xy2_neon Martin Storsjö
@ 2022-07-13 20:48 ` Martin Storsjö
  2022-07-15 19:38   ` Swinney, Jonathan
  2022-07-13 20:48 ` [FFmpeg-devel] [PATCH 5/5] aarch64: me_cmp: Don't do uaddlv once per iteration Martin Storsjö
  2022-07-15 19:35 ` [FFmpeg-devel] [PATCH 1/5] libavcodec: aarch64: Don't clobber v8 in the h%4 case in ff_pix_abs16_xy2_neon Swinney, Jonathan
  4 siblings, 1 reply; 20+ messages in thread
From: Martin Storsjö @ 2022-07-13 20:48 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Jonathan Swinney

Using absolute-difference-accumulate does use twice the amount of
absolute-difference instructions, but avoids the need for the
uaddl and add instructions, reducing the total number of instructions
by 3.

These can be interleaved in the rest of the calculation, to avoid
tight dependencies at the end. Unfortunately, this is marginally
slower on Cortex A53, but faster on A72 and A73.

Before:       Cortex A53    A72    A73   Graviton 3
pix_abs_0_3_neon:  175.7  109.2   92.0   41.2
After:
pix_abs_0_3_neon:  179.7   96.7   87.5   41.2
---
 libavcodec/aarch64/me_cmp_neon.S | 32 +++++++++++---------------------
 1 file changed, 11 insertions(+), 21 deletions(-)

diff --git a/libavcodec/aarch64/me_cmp_neon.S b/libavcodec/aarch64/me_cmp_neon.S
index 0ae23d8922..89546869fb 100644
--- a/libavcodec/aarch64/me_cmp_neon.S
+++ b/libavcodec/aarch64/me_cmp_neon.S
@@ -124,6 +124,9 @@ function ff_pix_abs16_xy2_neon, export=1
         add             v26.8h, v30.8h, v2.8h       // add up 0..7, using pix2 + pix2+1 values from pix3 above
         add             v27.8h, v31.8h, v3.8h       // add up 8..15, using pix2 + pix2+1 values from pix3 above
 
+        uabdl           v24.8h, v1.8b,  v23.8b      // absolute difference 0..7, i=0
+        uabdl2          v23.8h, v1.16b, v23.16b     // absolute difference 8..15, i=0
+
         ld1             {v21.16b}, [x5], x3         // load pix3
         ld1             {v20.16b}, [x1], x3         // load pix1
 
@@ -137,6 +140,9 @@ function ff_pix_abs16_xy2_neon, export=1
         rshrn           v28.8b, v28.8h, #2          // shift right 2 0..7 (rounding shift right)
         rshrn2          v28.16b, v29.8h, #2         // shift right 2 8..15
 
+        uabal           v24.8h, v16.8b,  v26.8b     // absolute difference 0..7, i=1
+        uabal2          v23.8h, v16.16b, v26.16b    // absolute difference 8..15, i=1
+
         uaddl           v2.8h, v21.8b, v22.8b       // pix3 + pix3+1 0..7
         uaddl2          v3.8h, v21.16b, v22.16b     // pix3 + pix3+1 8..15
         add             v30.8h, v4.8h, v2.8h        // add up 0..7, using pix2 + pix2+1 values from pix3 above
@@ -144,33 +150,17 @@ function ff_pix_abs16_xy2_neon, export=1
         rshrn           v30.8b, v30.8h, #2          // shift right 2 0..7 (rounding shift right)
         rshrn2          v30.16b, v31.8h, #2         // shift right 2 8..15
 
-        // Averages are now stored in these registers:
-        // v23, v16, v28, v30
-        // pix1 values in these registers:
-        // v1, v16, v17, v20
-        // available:
-        // v4, v5, v7, v18, v19, v24, v25, v27, v29, v31
+        uabal           v24.8h, v17.8b,  v28.8b     // absolute difference 0..7, i=2
+        uabal2          v23.8h, v17.16b, v28.16b    // absolute difference 8..15, i=2
 
         sub             w4, w4, #4                  // h -= 4
 
-        // Using absolute-difference instructions instead of absolute-difference-accumulate allows
-        // us to keep the results in 16b vectors instead of widening values with twice the instructions.
-        // This approach also has fewer data dependencies, allowing better instruction level parallelism.
-        uabd            v4.16b, v1.16b, v23.16b     // absolute difference 0..15, i=0
-        uabd            v5.16b, v16.16b, v26.16b    // absolute difference 0..15, i=1
-        uabd            v6.16b, v17.16b, v28.16b    // absolute difference 0..15, i=2
-        uabd            v7.16b, v20.16b, v30.16b    // absolute difference 0..15, i=3
+        uabal           v24.8h, v20.8b,  v30.8b     // absolute difference 0..7, i=3
+        uabal2          v23.8h, v20.16b, v30.16b    // absolute difference 8..15, i=3
 
         cmp             w4, #4                      // loop if h >= 4
 
-        // Now add up all the values in each vector, v4-v7 with widening adds
-        uaddl           v19.8h, v4.8b, v5.8b
-        uaddl2          v18.8h, v4.16b, v5.16b
-        uaddl           v4.8h, v6.8b, v7.8b
-        uaddl2          v5.8h, v6.16b, v7.16b
-        add             v4.8h, v4.8h, v5.8h
-        add             v4.8h, v4.8h, v18.8h
-        add             v4.8h, v4.8h, v19.8h
+        add             v4.8h, v23.8h, v24.8h
         uaddlv          s4, v4.8h                   // finish adding up accumulated values
         add             d0, d0, d4                  // add the value to the top level accumulator
 
-- 
2.25.1

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [FFmpeg-devel] [PATCH 4/5] aarch64: me_cmp: Switch from uabd to uabal in ff_pix_abs16_xy2_neon
  2022-07-13 20:48 ` [FFmpeg-devel] [PATCH 4/5] aarch64: me_cmp: Switch from uabd to uabal " Martin Storsjö
@ 2022-07-15 19:38   ` Swinney, Jonathan
  0 siblings, 0 replies; 20+ messages in thread
From: Swinney, Jonathan @ 2022-07-15 19:38 UTC (permalink / raw)
  To: FFmpeg development discussions and patches, Martin Storsjö

LGTM.

-- 

Jonathan Swinney

On 7/13/22, 3:49 PM, "ffmpeg-devel on behalf of Martin Storsjö" <ffmpeg-devel-bounces@ffmpeg.org on behalf of martin@martin.st> wrote:

    CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.



    Using absolute-difference-accumulate does use twice the amount of
    absolute-difference instructions, but avoids the need for the
    uaddl and add instructions, reducing the total number of instructions
    by 3.

    These can be interleaved in the rest of the calculation, to avoid
    tight dependencies at the end. Unfortunately, this is marginally
    slower on Cortex A53, but faster on A72 and A73.

    Before:       Cortex A53    A72    A73   Graviton 3
    pix_abs_0_3_neon:  175.7  109.2   92.0   41.2
    After:
    pix_abs_0_3_neon:  179.7   96.7   87.5   41.2
    ---
     libavcodec/aarch64/me_cmp_neon.S | 32 +++++++++++---------------------
     1 file changed, 11 insertions(+), 21 deletions(-)

    diff --git a/libavcodec/aarch64/me_cmp_neon.S b/libavcodec/aarch64/me_cmp_neon.S
    index 0ae23d8922..89546869fb 100644
    --- a/libavcodec/aarch64/me_cmp_neon.S
    +++ b/libavcodec/aarch64/me_cmp_neon.S
    @@ -124,6 +124,9 @@ function ff_pix_abs16_xy2_neon, export=1
             add             v26.8h, v30.8h, v2.8h       // add up 0..7, using pix2 + pix2+1 values from pix3 above
             add             v27.8h, v31.8h, v3.8h       // add up 8..15, using pix2 + pix2+1 values from pix3 above

    +        uabdl           v24.8h, v1.8b,  v23.8b      // absolute difference 0..7, i=0
    +        uabdl2          v23.8h, v1.16b, v23.16b     // absolute difference 8..15, i=0
    +
             ld1             {v21.16b}, [x5], x3         // load pix3
             ld1             {v20.16b}, [x1], x3         // load pix1

    @@ -137,6 +140,9 @@ function ff_pix_abs16_xy2_neon, export=1
             rshrn           v28.8b, v28.8h, #2          // shift right 2 0..7 (rounding shift right)
             rshrn2          v28.16b, v29.8h, #2         // shift right 2 8..15

    +        uabal           v24.8h, v16.8b,  v26.8b     // absolute difference 0..7, i=1
    +        uabal2          v23.8h, v16.16b, v26.16b    // absolute difference 8..15, i=1
    +
             uaddl           v2.8h, v21.8b, v22.8b       // pix3 + pix3+1 0..7
             uaddl2          v3.8h, v21.16b, v22.16b     // pix3 + pix3+1 8..15
             add             v30.8h, v4.8h, v2.8h        // add up 0..7, using pix2 + pix2+1 values from pix3 above
    @@ -144,33 +150,17 @@ function ff_pix_abs16_xy2_neon, export=1
             rshrn           v30.8b, v30.8h, #2          // shift right 2 0..7 (rounding shift right)
             rshrn2          v30.16b, v31.8h, #2         // shift right 2 8..15

    -        // Averages are now stored in these registers:
    -        // v23, v16, v28, v30
    -        // pix1 values in these registers:
    -        // v1, v16, v17, v20
    -        // available:
    -        // v4, v5, v7, v18, v19, v24, v25, v27, v29, v31
    +        uabal           v24.8h, v17.8b,  v28.8b     // absolute difference 0..7, i=2
    +        uabal2          v23.8h, v17.16b, v28.16b    // absolute difference 8..15, i=2

             sub             w4, w4, #4                  // h -= 4

    -        // Using absolute-difference instructions instead of absolute-difference-accumulate allows
    -        // us to keep the results in 16b vectors instead of widening values with twice the instructions.
    -        // This approach also has fewer data dependencies, allowing better instruction level parallelism.
    -        uabd            v4.16b, v1.16b, v23.16b     // absolute difference 0..15, i=0
    -        uabd            v5.16b, v16.16b, v26.16b    // absolute difference 0..15, i=1
    -        uabd            v6.16b, v17.16b, v28.16b    // absolute difference 0..15, i=2
    -        uabd            v7.16b, v20.16b, v30.16b    // absolute difference 0..15, i=3
    +        uabal           v24.8h, v20.8b,  v30.8b     // absolute difference 0..7, i=3
    +        uabal2          v23.8h, v20.16b, v30.16b    // absolute difference 8..15, i=3

             cmp             w4, #4                      // loop if h >= 4

    -        // Now add up all the values in each vector, v4-v7 with widening adds
    -        uaddl           v19.8h, v4.8b, v5.8b
    -        uaddl2          v18.8h, v4.16b, v5.16b
    -        uaddl           v4.8h, v6.8b, v7.8b
    -        uaddl2          v5.8h, v6.16b, v7.16b
    -        add             v4.8h, v4.8h, v5.8h
    -        add             v4.8h, v4.8h, v18.8h
    -        add             v4.8h, v4.8h, v19.8h
    +        add             v4.8h, v23.8h, v24.8h
             uaddlv          s4, v4.8h                   // finish adding up accumulated values
             add             d0, d0, d4                  // add the value to the top level accumulator

    --
    2.25.1

    _______________________________________________
    ffmpeg-devel mailing list
    ffmpeg-devel@ffmpeg.org
    https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

    To unsubscribe, visit link above, or email
    ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [FFmpeg-devel] [PATCH 5/5] aarch64: me_cmp: Don't do uaddlv once per iteration
  2022-07-13 20:48 [FFmpeg-devel] [PATCH 1/5] libavcodec: aarch64: Don't clobber v8 in the h%4 case in ff_pix_abs16_xy2_neon Martin Storsjö
                   ` (2 preceding siblings ...)
  2022-07-13 20:48 ` [FFmpeg-devel] [PATCH 4/5] aarch64: me_cmp: Switch from uabd to uabal " Martin Storsjö
@ 2022-07-13 20:48 ` Martin Storsjö
  2022-07-15 19:32   ` Swinney, Jonathan
  2022-07-15 19:35 ` [FFmpeg-devel] [PATCH 1/5] libavcodec: aarch64: Don't clobber v8 in the h%4 case in ff_pix_abs16_xy2_neon Swinney, Jonathan
  4 siblings, 1 reply; 20+ messages in thread
From: Martin Storsjö @ 2022-07-13 20:48 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Jonathan Swinney

The max height is 16; the max difference per pixel is 255, and
a .8h element can easily contain 16*255, thus keep accumulating
in two .8h vectors, and just do the final accumulation at the
end.

This requires a minor register renumbering in ff_pix_abs16_xy2_neon.

Before:       Cortex A53    A72    A73   Graviton 3
pix_abs_0_0_neon:   97.7   47.0   37.5   22.7
pix_abs_0_1_neon:  154.0   59.0   52.0   25.0
pix_abs_0_3_neon:  179.7   96.7   87.5   41.2
After:
pix_abs_0_0_neon:   96.0   39.2   31.2   22.0
pix_abs_0_1_neon:  150.7   59.7   46.2   23.7
pix_abs_0_3_neon:  175.7   83.7   81.7   38.2

---
I remember suggesting this before, and having it dismissed for
some reason I don't remember - maybe that the element size wasn't
big enough to hold the intermediate results? At least as far as I
can see, it can hold the results just fine, and this passes all
tests.
---
 libavcodec/aarch64/me_cmp_neon.S | 102 +++++++++++++++----------------
 1 file changed, 49 insertions(+), 53 deletions(-)

diff --git a/libavcodec/aarch64/me_cmp_neon.S b/libavcodec/aarch64/me_cmp_neon.S
index 89546869fb..cda7ce0408 100644
--- a/libavcodec/aarch64/me_cmp_neon.S
+++ b/libavcodec/aarch64/me_cmp_neon.S
@@ -27,15 +27,16 @@ function ff_pix_abs16_neon, export=1
         // x3           ptrdiff_t stride
         // w4           int h
         cmp             w4, #4                      // if h < 4, jump to completion section
-        movi            v18.4S, #0                  // clear result accumulator
+        movi            v16.8h, #0                  // clear result accumulator
+        movi            v17.8h, #0                  // clear result accumulator
         b.lt            2f
 1:
         ld1             {v0.16b}, [x1], x3          // load pix1
         ld1             {v4.16b}, [x2], x3          // load pix2
         ld1             {v1.16b}, [x1], x3          // load pix1
         ld1             {v5.16b}, [x2], x3          // load pix2
-        uabdl           v16.8h, v0.8b, v4.8b        // absolute difference accumulate
-        uabdl2          v17.8h, v0.16b, v4.16b
+        uabal           v16.8h, v0.8b, v4.8b        // absolute difference accumulate
+        uabal2          v17.8h, v0.16b, v4.16b
         ld1             {v2.16b}, [x1], x3          // load pix1
         ld1             {v6.16b}, [x2], x3          // load pix2
         uabal           v16.8h, v1.8b, v5.8b        // absolute difference accumulate
@@ -48,27 +49,26 @@ function ff_pix_abs16_neon, export=1
         uabal           v16.8h, v3.8b, v7.8b
         uabal2          v17.8h, v3.16b, v7.16b
         cmp             w4, #4                      // if h >= 4, loop
-        add             v16.8h, v16.8h, v17.8h
-        uaddlv          s16, v16.8h                 // add up everything in v16 accumulator
-        add             d18, d16, d18               // add to the end result register
 
         b.ge            1b
         cbnz            w4, 2f                      // if iterations remain, jump to completion section
 
-        fmov            w0, s18                     // copy result to general purpose register
+        add             v16.8h, v16.8h, v17.8h
+        uaddlv          s16, v16.8h                 // add up everything in v16 accumulator
+        fmov            w0, s16                     // copy result to general purpose register
         ret
 
 2:
         ld1             {v0.16b}, [x1], x3          // load pix1
         ld1             {v4.16b}, [x2], x3          // load pix2
-        uabdl           v16.8h, v0.8b, v4.8b        // absolute difference accumulate
-        uabal2          v16.8h, v0.16b, v4.16b
         subs            w4, w4, #1                  // h -= 1
-        addv            h16, v16.8h                 // add up v16
-        add             d18, d16, d18               // add to result
+        uabal           v16.8h, v0.8b, v4.8b        // absolute difference accumulate
+        uabal2          v17.8h, v0.16b, v4.16b
         b.ne            2b
 
-        fmov            w0, s18                     // copy result to general purpose register
+        add             v16.8h, v16.8h, v17.8h
+        uaddlv          s16, v16.8h                 // add up everything in v16 accumulator
+        fmov            w0, s16                     // copy result to general purpose register
         ret
 endfunc
 
@@ -80,7 +80,8 @@ function ff_pix_abs16_xy2_neon, export=1
         // w4           int h
 
         add             x5, x2, x3                  // use x5 to hold uint8_t *pix3
-        movi            v0.2d, #0                   // initialize the result register
+        movi            v21.8h, #0                  // initialize the result register
+        movi            v22.8h, #0                  // initialize the result register
 
         // Load initial pix2 values for either the unrolled version or completion version.
         ldur            q4, [x2, #1]                // load pix2+1
@@ -119,15 +120,15 @@ function ff_pix_abs16_xy2_neon, export=1
         uaddl           v2.8h, v6.8b, v7.8b         // pix3 + pix3+1 0..7
         uaddl2          v3.8h, v6.16b, v7.16b       // pix3 + pix3+1 8..15
 
-        ldur            q22, [x5, #1]               // load pix3+1
+        ldur            q7, [x5, #1]                // load pix3+1
 
         add             v26.8h, v30.8h, v2.8h       // add up 0..7, using pix2 + pix2+1 values from pix3 above
         add             v27.8h, v31.8h, v3.8h       // add up 8..15, using pix2 + pix2+1 values from pix3 above
 
-        uabdl           v24.8h, v1.8b,  v23.8b      // absolute difference 0..7, i=0
-        uabdl2          v23.8h, v1.16b, v23.16b     // absolute difference 8..15, i=0
+        uabal           v21.8h, v1.8b,  v23.8b      // absolute difference 0..7, i=0
+        uabal2          v22.8h, v1.16b, v23.16b     // absolute difference 8..15, i=0
 
-        ld1             {v21.16b}, [x5], x3         // load pix3
+        ld1             {v6.16b}, [x5], x3          // load pix3
         ld1             {v20.16b}, [x1], x3         // load pix1
 
         rshrn           v26.8b, v26.8h, #2          // shift right 2 0..7 (rounding shift right)
@@ -140,33 +141,33 @@ function ff_pix_abs16_xy2_neon, export=1
         rshrn           v28.8b, v28.8h, #2          // shift right 2 0..7 (rounding shift right)
         rshrn2          v28.16b, v29.8h, #2         // shift right 2 8..15
 
-        uabal           v24.8h, v16.8b,  v26.8b     // absolute difference 0..7, i=1
-        uabal2          v23.8h, v16.16b, v26.16b    // absolute difference 8..15, i=1
+        uabal           v21.8h, v16.8b,  v26.8b     // absolute difference 0..7, i=1
+        uabal2          v22.8h, v16.16b, v26.16b    // absolute difference 8..15, i=1
 
-        uaddl           v2.8h, v21.8b, v22.8b       // pix3 + pix3+1 0..7
-        uaddl2          v3.8h, v21.16b, v22.16b     // pix3 + pix3+1 8..15
+        uaddl           v2.8h, v6.8b,  v7.8b        // pix3 + pix3+1 0..7
+        uaddl2          v3.8h, v6.16b, v7.16b       // pix3 + pix3+1 8..15
         add             v30.8h, v4.8h, v2.8h        // add up 0..7, using pix2 + pix2+1 values from pix3 above
         add             v31.8h, v5.8h, v3.8h        // add up 8..15, using pix2 + pix2+1 values from pix3 above
         rshrn           v30.8b, v30.8h, #2          // shift right 2 0..7 (rounding shift right)
         rshrn2          v30.16b, v31.8h, #2         // shift right 2 8..15
 
-        uabal           v24.8h, v17.8b,  v28.8b     // absolute difference 0..7, i=2
-        uabal2          v23.8h, v17.16b, v28.16b    // absolute difference 8..15, i=2
-
         sub             w4, w4, #4                  // h -= 4
 
-        uabal           v24.8h, v20.8b,  v30.8b     // absolute difference 0..7, i=3
-        uabal2          v23.8h, v20.16b, v30.16b    // absolute difference 8..15, i=3
+        uabal           v21.8h, v17.8b,  v28.8b     // absolute difference 0..7, i=2
+        uabal2          v22.8h, v17.16b, v28.16b    // absolute difference 8..15, i=2
 
         cmp             w4, #4                      // loop if h >= 4
 
-        add             v4.8h, v23.8h, v24.8h
-        uaddlv          s4, v4.8h                   // finish adding up accumulated values
-        add             d0, d0, d4                  // add the value to the top level accumulator
+
+        uabal           v21.8h, v20.8b,  v30.8b     // absolute difference 0..7, i=3
+        uabal2          v22.8h, v20.16b, v30.16b    // absolute difference 8..15, i=3
 
         b.ge            1b
         cbnz            w4, 2f                      // if iterations remain jump to completion section
 
+        add             v4.8h, v21.8h, v22.8h
+        uaddlv          s0, v4.8h                   // finish adding up accumulated values
+
         fmov            w0, s0                      // copy result to general purpose register
         ret
 2:
@@ -182,20 +183,18 @@ function ff_pix_abs16_xy2_neon, export=1
         add             v16.8h, v2.8h, v18.8h       // add up 0..7, using pix2 + pix2+1 values from previous iteration
         add             v17.8h, v3.8h, v19.8h       // add up 8..15, using pix2 + pix2+1 values from previous iteration
         // divide by 4 to compute the average of values summed above
-        urshr           v16.8h, v16.8h, #2          // shift right by 2 0..7 (rounding shift right)
-        urshr           v17.8h, v17.8h, #2          // shift right by 2 8..15
-
-        uxtl2           v7.8h, v1.16b               // 8->16 bits pix1 8..15
-        uxtl            v1.8h, v1.8b                // 8->16 bits pix1 0..7
+        rshrn           v16.8b,  v16.8h, #2         // shift right by 2 0..7 (rounding shift right)
+        rshrn2          v16.16b, v17.8h, #2         // shift right by 2 8..15
 
-        uabd            v6.8h, v1.8h, v16.8h        // absolute difference 0..7
-        uaba            v6.8h, v7.8h, v17.8h        // absolute difference accumulate 8..15
+        uabal           v21.8h, v1.8b,  v16.8b      // absolute difference 0..7
+        uabal2          v22.8h, v1.16b, v16.16b     // absolute difference accumulate 8..15
         mov             v2.16b, v18.16b             // pix3 -> pix2
         mov             v3.16b, v19.16b             // pix3+1 -> pix2+1
-        uaddlv          s6, v6.8h                   // add up accumulator in v6
-        add             d0, d0, d6                  // add to the final result
 
         b.ne            2b                          // loop if h > 0
+
+        add             v4.8h, v21.8h, v22.8h
+        uaddlv          s0, v4.8h                   // finish adding up accumulated values
         fmov            w0, s0                      // copy result to general purpose register
         ret
 endfunc
@@ -209,7 +208,8 @@ function ff_pix_abs16_x2_neon, export=1
 
         cmp             w4, #4
         // initialize buffers
-        movi            d20, #0
+        movi            v16.8h, #0
+        movi            v17.8h, #0
         add             x5, x2, #1 // pix2 + 1
         b.lt            2f
 
@@ -224,9 +224,9 @@ function ff_pix_abs16_x2_neon, export=1
         ld1             {v2.16b}, [x5], x3
         urhadd          v30.16b, v1.16b, v2.16b
         ld1             {v0.16b}, [x1], x3
-        uabdl           v16.8h, v0.8b, v30.8b
+        uabal           v16.8h, v0.8b, v30.8b
         ld1             {v4.16b}, [x2], x3
-        uabdl2          v17.8h, v0.16b, v30.16b
+        uabal2          v17.8h, v0.16b, v30.16b
         ld1             {v5.16b}, [x5], x3
         urhadd          v29.16b, v4.16b, v5.16b
         ld1             {v3.16b}, [x1], x3
@@ -238,20 +238,15 @@ function ff_pix_abs16_x2_neon, export=1
         ld1             {v6.16b}, [x1], x3
         uabal           v16.8h, v6.8b, v28.8b
         ld1             {v24.16b}, [x2], x3
+        sub             w4, w4, #4
         uabal2          v17.8h, v6.16b, v28.16b
         ld1             {v25.16b}, [x5], x3
         urhadd          v27.16b, v24.16b, v25.16b
         ld1             {v23.16b}, [x1], x3
+        cmp             w4, #4
         uabal           v16.8h, v23.8b, v27.8b
         uabal2          v17.8h, v23.16b, v27.16b
 
-        sub             w4, w4, #4
-
-        add             v16.8h, v16.8h, v17.8h
-        uaddlv          s16, v16.8h
-        cmp             w4, #4
-        add             d20, d20, d16
-
         b.ge            1b
         cbz             w4, 3f
 
@@ -259,18 +254,19 @@ function ff_pix_abs16_x2_neon, export=1
 2:
         ld1             {v1.16b}, [x2], x3
         ld1             {v2.16b}, [x5], x3
+        subs            w4, w4, #1
         urhadd          v29.16b, v1.16b, v2.16b
         ld1             {v0.16b}, [x1], x3
-        uabd            v28.16b, v0.16b, v29.16b
+        uabal           v16.8h, v0.8b,  v29.8b
+        uabal2          v17.8h, v0.16b, v29.16b
 
-        uaddlv          h28, v28.16b
-        subs            w4, w4, #1
 
-        add             d20, d20, d28
         b.ne            2b
 
 3:
-        fmov            w0, s20
+        add             v16.8h, v16.8h, v17.8h
+        uaddlv          s16, v16.8h
+        fmov            w0, s16
 
         ret
 endfunc
-- 
2.25.1

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [FFmpeg-devel] [PATCH 5/5] aarch64: me_cmp: Don't do uaddlv once per iteration
  2022-07-13 20:48 ` [FFmpeg-devel] [PATCH 5/5] aarch64: me_cmp: Don't do uaddlv once per iteration Martin Storsjö
@ 2022-07-15 19:32   ` Swinney, Jonathan
  2022-07-15 19:56     ` Martin Storsjö
  0 siblings, 1 reply; 20+ messages in thread
From: Swinney, Jonathan @ 2022-07-15 19:32 UTC (permalink / raw)
  To: Martin Storsjö, ffmpeg-devel

If the max height is just 16, then this should be fine. I assumed that h could have a much higher value (>1024), but if that is not the case, then this is a useful optimization.

Thanks!

-- 

Jonathan Swinney

On 7/13/22, 3:49 PM, "Martin Storsjö" <martin@martin.st> wrote:

    CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.



    The max height is 16; the max difference per pixel is 255, and
    a .8h element can easily contain 16*255, thus keep accumulating
    in two .8h vectors, and just do the final accumulation at the
    end.

    This requires a minor register renumbering in ff_pix_abs16_xy2_neon.

    Before:       Cortex A53    A72    A73   Graviton 3
    pix_abs_0_0_neon:   97.7   47.0   37.5   22.7
    pix_abs_0_1_neon:  154.0   59.0   52.0   25.0
    pix_abs_0_3_neon:  179.7   96.7   87.5   41.2
    After:
    pix_abs_0_0_neon:   96.0   39.2   31.2   22.0
    pix_abs_0_1_neon:  150.7   59.7   46.2   23.7
    pix_abs_0_3_neon:  175.7   83.7   81.7   38.2

    ---
    I remember suggesting this before, and having it dismissed for
    some reason I don't remember - maybe that the element size wasn't
    big enough to hold the intermediate results? At least as far as I
    can see, it can hold the results just fine, and this passes all
    tests.
    ---
     libavcodec/aarch64/me_cmp_neon.S | 102 +++++++++++++++----------------
     1 file changed, 49 insertions(+), 53 deletions(-)

    diff --git a/libavcodec/aarch64/me_cmp_neon.S b/libavcodec/aarch64/me_cmp_neon.S
    index 89546869fb..cda7ce0408 100644
    --- a/libavcodec/aarch64/me_cmp_neon.S
    +++ b/libavcodec/aarch64/me_cmp_neon.S
    @@ -27,15 +27,16 @@ function ff_pix_abs16_neon, export=1
             // x3           ptrdiff_t stride
             // w4           int h
             cmp             w4, #4                      // if h < 4, jump to completion section
    -        movi            v18.4S, #0                  // clear result accumulator
    +        movi            v16.8h, #0                  // clear result accumulator
    +        movi            v17.8h, #0                  // clear result accumulator
             b.lt            2f
     1:
             ld1             {v0.16b}, [x1], x3          // load pix1
             ld1             {v4.16b}, [x2], x3          // load pix2
             ld1             {v1.16b}, [x1], x3          // load pix1
             ld1             {v5.16b}, [x2], x3          // load pix2
    -        uabdl           v16.8h, v0.8b, v4.8b        // absolute difference accumulate
    -        uabdl2          v17.8h, v0.16b, v4.16b
    +        uabal           v16.8h, v0.8b, v4.8b        // absolute difference accumulate
    +        uabal2          v17.8h, v0.16b, v4.16b
             ld1             {v2.16b}, [x1], x3          // load pix1
             ld1             {v6.16b}, [x2], x3          // load pix2
             uabal           v16.8h, v1.8b, v5.8b        // absolute difference accumulate
    @@ -48,27 +49,26 @@ function ff_pix_abs16_neon, export=1
             uabal           v16.8h, v3.8b, v7.8b
             uabal2          v17.8h, v3.16b, v7.16b
             cmp             w4, #4                      // if h >= 4, loop
    -        add             v16.8h, v16.8h, v17.8h
    -        uaddlv          s16, v16.8h                 // add up everything in v16 accumulator
    -        add             d18, d16, d18               // add to the end result register

             b.ge            1b
             cbnz            w4, 2f                      // if iterations remain, jump to completion section

    -        fmov            w0, s18                     // copy result to general purpose register
    +        add             v16.8h, v16.8h, v17.8h
    +        uaddlv          s16, v16.8h                 // add up everything in v16 accumulator
    +        fmov            w0, s16                     // copy result to general purpose register
             ret

     2:
             ld1             {v0.16b}, [x1], x3          // load pix1
             ld1             {v4.16b}, [x2], x3          // load pix2
    -        uabdl           v16.8h, v0.8b, v4.8b        // absolute difference accumulate
    -        uabal2          v16.8h, v0.16b, v4.16b
             subs            w4, w4, #1                  // h -= 1
    -        addv            h16, v16.8h                 // add up v16
    -        add             d18, d16, d18               // add to result
    +        uabal           v16.8h, v0.8b, v4.8b        // absolute difference accumulate
    +        uabal2          v17.8h, v0.16b, v4.16b
             b.ne            2b

    -        fmov            w0, s18                     // copy result to general purpose register
    +        add             v16.8h, v16.8h, v17.8h
    +        uaddlv          s16, v16.8h                 // add up everything in v16 accumulator
    +        fmov            w0, s16                     // copy result to general purpose register
             ret
     endfunc

    @@ -80,7 +80,8 @@ function ff_pix_abs16_xy2_neon, export=1
             // w4           int h

             add             x5, x2, x3                  // use x5 to hold uint8_t *pix3
    -        movi            v0.2d, #0                   // initialize the result register
    +        movi            v21.8h, #0                  // initialize the result register
    +        movi            v22.8h, #0                  // initialize the result register

             // Load initial pix2 values for either the unrolled version or completion version.
             ldur            q4, [x2, #1]                // load pix2+1
    @@ -119,15 +120,15 @@ function ff_pix_abs16_xy2_neon, export=1
             uaddl           v2.8h, v6.8b, v7.8b         // pix3 + pix3+1 0..7
             uaddl2          v3.8h, v6.16b, v7.16b       // pix3 + pix3+1 8..15

    -        ldur            q22, [x5, #1]               // load pix3+1
    +        ldur            q7, [x5, #1]                // load pix3+1

             add             v26.8h, v30.8h, v2.8h       // add up 0..7, using pix2 + pix2+1 values from pix3 above
             add             v27.8h, v31.8h, v3.8h       // add up 8..15, using pix2 + pix2+1 values from pix3 above

    -        uabdl           v24.8h, v1.8b,  v23.8b      // absolute difference 0..7, i=0
    -        uabdl2          v23.8h, v1.16b, v23.16b     // absolute difference 8..15, i=0
    +        uabal           v21.8h, v1.8b,  v23.8b      // absolute difference 0..7, i=0
    +        uabal2          v22.8h, v1.16b, v23.16b     // absolute difference 8..15, i=0

    -        ld1             {v21.16b}, [x5], x3         // load pix3
    +        ld1             {v6.16b}, [x5], x3          // load pix3
             ld1             {v20.16b}, [x1], x3         // load pix1

             rshrn           v26.8b, v26.8h, #2          // shift right 2 0..7 (rounding shift right)
    @@ -140,33 +141,33 @@ function ff_pix_abs16_xy2_neon, export=1
             rshrn           v28.8b, v28.8h, #2          // shift right 2 0..7 (rounding shift right)
             rshrn2          v28.16b, v29.8h, #2         // shift right 2 8..15

    -        uabal           v24.8h, v16.8b,  v26.8b     // absolute difference 0..7, i=1
    -        uabal2          v23.8h, v16.16b, v26.16b    // absolute difference 8..15, i=1
    +        uabal           v21.8h, v16.8b,  v26.8b     // absolute difference 0..7, i=1
    +        uabal2          v22.8h, v16.16b, v26.16b    // absolute difference 8..15, i=1

    -        uaddl           v2.8h, v21.8b, v22.8b       // pix3 + pix3+1 0..7
    -        uaddl2          v3.8h, v21.16b, v22.16b     // pix3 + pix3+1 8..15
    +        uaddl           v2.8h, v6.8b,  v7.8b        // pix3 + pix3+1 0..7
    +        uaddl2          v3.8h, v6.16b, v7.16b       // pix3 + pix3+1 8..15
             add             v30.8h, v4.8h, v2.8h        // add up 0..7, using pix2 + pix2+1 values from pix3 above
             add             v31.8h, v5.8h, v3.8h        // add up 8..15, using pix2 + pix2+1 values from pix3 above
             rshrn           v30.8b, v30.8h, #2          // shift right 2 0..7 (rounding shift right)
             rshrn2          v30.16b, v31.8h, #2         // shift right 2 8..15

    -        uabal           v24.8h, v17.8b,  v28.8b     // absolute difference 0..7, i=2
    -        uabal2          v23.8h, v17.16b, v28.16b    // absolute difference 8..15, i=2
    -
             sub             w4, w4, #4                  // h -= 4

    -        uabal           v24.8h, v20.8b,  v30.8b     // absolute difference 0..7, i=3
    -        uabal2          v23.8h, v20.16b, v30.16b    // absolute difference 8..15, i=3
    +        uabal           v21.8h, v17.8b,  v28.8b     // absolute difference 0..7, i=2
    +        uabal2          v22.8h, v17.16b, v28.16b    // absolute difference 8..15, i=2

             cmp             w4, #4                      // loop if h >= 4

    -        add             v4.8h, v23.8h, v24.8h
    -        uaddlv          s4, v4.8h                   // finish adding up accumulated values
    -        add             d0, d0, d4                  // add the value to the top level accumulator
    +
    +        uabal           v21.8h, v20.8b,  v30.8b     // absolute difference 0..7, i=3
    +        uabal2          v22.8h, v20.16b, v30.16b    // absolute difference 8..15, i=3

             b.ge            1b
             cbnz            w4, 2f                      // if iterations remain jump to completion section

    +        add             v4.8h, v21.8h, v22.8h
    +        uaddlv          s0, v4.8h                   // finish adding up accumulated values
    +
             fmov            w0, s0                      // copy result to general purpose register
             ret
     2:
    @@ -182,20 +183,18 @@ function ff_pix_abs16_xy2_neon, export=1
             add             v16.8h, v2.8h, v18.8h       // add up 0..7, using pix2 + pix2+1 values from previous iteration
             add             v17.8h, v3.8h, v19.8h       // add up 8..15, using pix2 + pix2+1 values from previous iteration
             // divide by 4 to compute the average of values summed above
    -        urshr           v16.8h, v16.8h, #2          // shift right by 2 0..7 (rounding shift right)
    -        urshr           v17.8h, v17.8h, #2          // shift right by 2 8..15
    -
    -        uxtl2           v7.8h, v1.16b               // 8->16 bits pix1 8..15
    -        uxtl            v1.8h, v1.8b                // 8->16 bits pix1 0..7
    +        rshrn           v16.8b,  v16.8h, #2         // shift right by 2 0..7 (rounding shift right)
    +        rshrn2          v16.16b, v17.8h, #2         // shift right by 2 8..15

    -        uabd            v6.8h, v1.8h, v16.8h        // absolute difference 0..7
    -        uaba            v6.8h, v7.8h, v17.8h        // absolute difference accumulate 8..15
    +        uabal           v21.8h, v1.8b,  v16.8b      // absolute difference 0..7
    +        uabal2          v22.8h, v1.16b, v16.16b     // absolute difference accumulate 8..15
             mov             v2.16b, v18.16b             // pix3 -> pix2
             mov             v3.16b, v19.16b             // pix3+1 -> pix2+1
    -        uaddlv          s6, v6.8h                   // add up accumulator in v6
    -        add             d0, d0, d6                  // add to the final result

             b.ne            2b                          // loop if h > 0
    +
    +        add             v4.8h, v21.8h, v22.8h
    +        uaddlv          s0, v4.8h                   // finish adding up accumulated values
             fmov            w0, s0                      // copy result to general purpose register
             ret
     endfunc
    @@ -209,7 +208,8 @@ function ff_pix_abs16_x2_neon, export=1

             cmp             w4, #4
             // initialize buffers
    -        movi            d20, #0
    +        movi            v16.8h, #0
    +        movi            v17.8h, #0
             add             x5, x2, #1 // pix2 + 1
             b.lt            2f

    @@ -224,9 +224,9 @@ function ff_pix_abs16_x2_neon, export=1
             ld1             {v2.16b}, [x5], x3
             urhadd          v30.16b, v1.16b, v2.16b
             ld1             {v0.16b}, [x1], x3
    -        uabdl           v16.8h, v0.8b, v30.8b
    +        uabal           v16.8h, v0.8b, v30.8b
             ld1             {v4.16b}, [x2], x3
    -        uabdl2          v17.8h, v0.16b, v30.16b
    +        uabal2          v17.8h, v0.16b, v30.16b
             ld1             {v5.16b}, [x5], x3
             urhadd          v29.16b, v4.16b, v5.16b
             ld1             {v3.16b}, [x1], x3
    @@ -238,20 +238,15 @@ function ff_pix_abs16_x2_neon, export=1
             ld1             {v6.16b}, [x1], x3
             uabal           v16.8h, v6.8b, v28.8b
             ld1             {v24.16b}, [x2], x3
    +        sub             w4, w4, #4
             uabal2          v17.8h, v6.16b, v28.16b
             ld1             {v25.16b}, [x5], x3
             urhadd          v27.16b, v24.16b, v25.16b
             ld1             {v23.16b}, [x1], x3
    +        cmp             w4, #4
             uabal           v16.8h, v23.8b, v27.8b
             uabal2          v17.8h, v23.16b, v27.16b

    -        sub             w4, w4, #4
    -
    -        add             v16.8h, v16.8h, v17.8h
    -        uaddlv          s16, v16.8h
    -        cmp             w4, #4
    -        add             d20, d20, d16
    -
             b.ge            1b
             cbz             w4, 3f

    @@ -259,18 +254,19 @@ function ff_pix_abs16_x2_neon, export=1
     2:
             ld1             {v1.16b}, [x2], x3
             ld1             {v2.16b}, [x5], x3
    +        subs            w4, w4, #1
             urhadd          v29.16b, v1.16b, v2.16b
             ld1             {v0.16b}, [x1], x3
    -        uabd            v28.16b, v0.16b, v29.16b
    +        uabal           v16.8h, v0.8b,  v29.8b
    +        uabal2          v17.8h, v0.16b, v29.16b

    -        uaddlv          h28, v28.16b
    -        subs            w4, w4, #1

    -        add             d20, d20, d28
             b.ne            2b

     3:
    -        fmov            w0, s20
    +        add             v16.8h, v16.8h, v17.8h
    +        uaddlv          s16, v16.8h
    +        fmov            w0, s16

             ret
     endfunc
    --
    2.25.1


_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [FFmpeg-devel] [PATCH 5/5] aarch64: me_cmp: Don't do uaddlv once per iteration
  2022-07-15 19:32   ` Swinney, Jonathan
@ 2022-07-15 19:56     ` Martin Storsjö
  2022-07-15 21:19       ` Michael Niedermayer
  0 siblings, 1 reply; 20+ messages in thread
From: Martin Storsjö @ 2022-07-15 19:56 UTC (permalink / raw)
  To: Swinney, Jonathan; +Cc: ffmpeg-devel

On Fri, 15 Jul 2022, Swinney, Jonathan wrote:

> If the max height is just 16, then this should be fine. I assumed that h 
> could have a much higher value (>1024), but if that is not the case, 
> then this is a useful optimization.

At least according to the me_cmp.h header, which says:

/* Motion estimation:
  * h is limited to { width / 2, width, 2 * width },
  * but never larger than 16 and never smaller than 2.
  * Although currently h < 4 is not used as functions with
  * width < 8 are neither used nor implemented. */

So with that in mind, I think this should be safe to do.

// Martin

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [FFmpeg-devel] [PATCH 5/5] aarch64: me_cmp: Don't do uaddlv once per iteration
  2022-07-15 19:56     ` Martin Storsjö
@ 2022-07-15 21:19       ` Michael Niedermayer
  2022-07-15 21:25         ` Martin Storsjö
  2022-07-16  9:18         ` Martin Storsjö
  0 siblings, 2 replies; 20+ messages in thread
From: Michael Niedermayer @ 2022-07-15 21:19 UTC (permalink / raw)
  To: FFmpeg development discussions and patches


[-- Attachment #1.1: Type: text/plain, Size: 1035 bytes --]

On Fri, Jul 15, 2022 at 10:56:03PM +0300, Martin Storsjö wrote:
> On Fri, 15 Jul 2022, Swinney, Jonathan wrote:
> 
> > If the max height is just 16, then this should be fine. I assumed that h
> > could have a much higher value (>1024), but if that is not the case,
> > then this is a useful optimization.
> 
> At least according to the me_cmp.h header, which says:
> 
> /* Motion estimation:
>  * h is limited to { width / 2, width, 2 * width },
>  * but never larger than 16 and never smaller than 2.
>  * Although currently h < 4 is not used as functions with
>  * width < 8 are neither used nor implemented. */

These rules where written with support for encoding of all
standard formats in mind at the time that was written.
today it may make sense to extend these rules to cover the
things which where created since then

thx

[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

Republics decline into democracies and democracies degenerate into
despotisms. -- Aristotle

[-- Attachment #1.2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

[-- Attachment #2: Type: text/plain, Size: 251 bytes --]

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [FFmpeg-devel] [PATCH 5/5] aarch64: me_cmp: Don't do uaddlv once per iteration
  2022-07-15 21:19       ` Michael Niedermayer
@ 2022-07-15 21:25         ` Martin Storsjö
  2022-07-16 11:23           ` Michael Niedermayer
  2022-07-16  9:18         ` Martin Storsjö
  1 sibling, 1 reply; 20+ messages in thread
From: Martin Storsjö @ 2022-07-15 21:25 UTC (permalink / raw)
  To: FFmpeg development discussions and patches

On Fri, 15 Jul 2022, Michael Niedermayer wrote:

> On Fri, Jul 15, 2022 at 10:56:03PM +0300, Martin Storsjö wrote:
>> On Fri, 15 Jul 2022, Swinney, Jonathan wrote:
>>
>>> If the max height is just 16, then this should be fine. I assumed that h
>>> could have a much higher value (>1024), but if that is not the case,
>>> then this is a useful optimization.
>>
>> At least according to the me_cmp.h header, which says:
>>
>> /* Motion estimation:
>>  * h is limited to { width / 2, width, 2 * width },
>>  * but never larger than 16 and never smaller than 2.
>>  * Although currently h < 4 is not used as functions with
>>  * width < 8 are neither used nor implemented. */
>
> These rules where written with support for encoding of all
> standard formats in mind at the time that was written.
> today it may make sense to extend these rules to cover the
> things which where created since then

Right, but if that suddenly changes, such a change also must expect that 
it might need updates to all assembly implementations that implement that 
interface currently. Right now, both the defacto case (any callers in the 
codebase) and the explicit documentation says that it can't be called with 
parameters outside of that range.

Even if it's raised from the current <= 16, this particular optimization 
should be fine as long as h <= 256 - which should be fine for at least all 
current-gen mainstream codecs since, I think?

// Martin
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [FFmpeg-devel] [PATCH 5/5] aarch64: me_cmp: Don't do uaddlv once per iteration
  2022-07-15 21:25         ` Martin Storsjö
@ 2022-07-16 11:23           ` Michael Niedermayer
  2022-07-16 12:30             ` Martin Storsjö
  2022-07-16 12:50             ` Ronald S. Bultje
  0 siblings, 2 replies; 20+ messages in thread
From: Michael Niedermayer @ 2022-07-16 11:23 UTC (permalink / raw)
  To: FFmpeg development discussions and patches


[-- Attachment #1.1: Type: text/plain, Size: 2138 bytes --]

On Sat, Jul 16, 2022 at 12:25:37AM +0300, Martin Storsjö wrote:
> On Fri, 15 Jul 2022, Michael Niedermayer wrote:
> 
> > On Fri, Jul 15, 2022 at 10:56:03PM +0300, Martin Storsjö wrote:
> > > On Fri, 15 Jul 2022, Swinney, Jonathan wrote:
> > > 
> > > > If the max height is just 16, then this should be fine. I assumed that h
> > > > could have a much higher value (>1024), but if that is not the case,
> > > > then this is a useful optimization.
> > > 
> > > At least according to the me_cmp.h header, which says:
> > > 
> > > /* Motion estimation:
> > >  * h is limited to { width / 2, width, 2 * width },
> > >  * but never larger than 16 and never smaller than 2.
> > >  * Although currently h < 4 is not used as functions with
> > >  * width < 8 are neither used nor implemented. */
> > 
> > These rules where written with support for encoding of all
> > standard formats in mind at the time that was written.
> > today it may make sense to extend these rules to cover the
> > things which where created since then
> 
> Right, but if that suddenly changes, such a change also must expect that it
> might need updates to all assembly implementations that implement that
> interface currently. Right now, both the defacto case (any callers in the
> codebase) and the explicit documentation says that it can't be called with
> parameters outside of that range.

What i meant was that newly added functions should be more flexible than
these old rules. That is 2 sets of rules
1. What a caller ATM can do and expect to work (thats whats written there)
2. What an implementor of new functions should make sure is supported


> 
> Even if it's raised from the current <= 16, this particular optimization
> should be fine as long as h <= 256 - which should be fine for at least all
> current-gen mainstream codecs since, I think?

[...]

-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

Many things microsoft did are stupid, but not doing something just because
microsoft did it is even more stupid. If everything ms did were stupid they
would be bankrupt already.

[-- Attachment #1.2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

[-- Attachment #2: Type: text/plain, Size: 251 bytes --]

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [FFmpeg-devel] [PATCH 5/5] aarch64: me_cmp: Don't do uaddlv once per iteration
  2022-07-16 11:23           ` Michael Niedermayer
@ 2022-07-16 12:30             ` Martin Storsjö
  2022-07-16 13:20               ` Michael Niedermayer
  2022-07-16 12:50             ` Ronald S. Bultje
  1 sibling, 1 reply; 20+ messages in thread
From: Martin Storsjö @ 2022-07-16 12:30 UTC (permalink / raw)
  To: FFmpeg development discussions and patches

On Sat, 16 Jul 2022, Michael Niedermayer wrote:

> On Sat, Jul 16, 2022 at 12:25:37AM +0300, Martin Storsjö wrote:
>> On Fri, 15 Jul 2022, Michael Niedermayer wrote:
>>
>>> On Fri, Jul 15, 2022 at 10:56:03PM +0300, Martin Storsjö wrote:
>>>> On Fri, 15 Jul 2022, Swinney, Jonathan wrote:
>>>>
>>>>> If the max height is just 16, then this should be fine. I assumed that h
>>>>> could have a much higher value (>1024), but if that is not the case,
>>>>> then this is a useful optimization.
>>>>
>>>> At least according to the me_cmp.h header, which says:
>>>>
>>>> /* Motion estimation:
>>>>  * h is limited to { width / 2, width, 2 * width },
>>>>  * but never larger than 16 and never smaller than 2.
>>>>  * Although currently h < 4 is not used as functions with
>>>>  * width < 8 are neither used nor implemented. */
>>>
>>> These rules where written with support for encoding of all
>>> standard formats in mind at the time that was written.
>>> today it may make sense to extend these rules to cover the
>>> things which where created since then
>>
>> Right, but if that suddenly changes, such a change also must expect that it
>> might need updates to all assembly implementations that implement that
>> interface currently. Right now, both the defacto case (any callers in the
>> codebase) and the explicit documentation says that it can't be called with
>> parameters outside of that range.
>
> What i meant was that newly added functions should be more flexible than
> these old rules. That is 2 sets of rules
> 1. What a caller ATM can do and expect to work (thats whats written there)
> 2. What an implementor of new functions should make sure is supported

With 2., do you mean if adding a new function into the same struct, or if 
implementing the existing pix_abs[0][..] functions?

If you mean new implementations of the existing function interface, you 
say they "should be more flexible". How flexible must they be? Is it ok to 
assume h<=256 for the w=16 functions?

Gradually increasing the requirements for existing function interfaces 
like you suggest is really problematic.

If we want to require more of the functions, we should document it, and 
extend the checkasm test to test that new requirement - which also extends 
the requirement to the existing functions. If we don't have a checkasm 
test for the required behaviour, we can pretty much assume it's broken, 
even in new implementations.

// Martin
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [FFmpeg-devel] [PATCH 5/5] aarch64: me_cmp: Don't do uaddlv once per iteration
  2022-07-16 12:30             ` Martin Storsjö
@ 2022-07-16 13:20               ` Michael Niedermayer
  2022-07-16 14:23                 ` Martin Storsjö
  0 siblings, 1 reply; 20+ messages in thread
From: Michael Niedermayer @ 2022-07-16 13:20 UTC (permalink / raw)
  To: FFmpeg development discussions and patches


[-- Attachment #1.1: Type: text/plain, Size: 3401 bytes --]

On Sat, Jul 16, 2022 at 03:30:23PM +0300, Martin Storsjö wrote:
> On Sat, 16 Jul 2022, Michael Niedermayer wrote:
> 
> > On Sat, Jul 16, 2022 at 12:25:37AM +0300, Martin Storsjö wrote:
> > > On Fri, 15 Jul 2022, Michael Niedermayer wrote:
> > > 
> > > > On Fri, Jul 15, 2022 at 10:56:03PM +0300, Martin Storsjö wrote:
> > > > > On Fri, 15 Jul 2022, Swinney, Jonathan wrote:
> > > > > 
> > > > > > If the max height is just 16, then this should be fine. I assumed that h
> > > > > > could have a much higher value (>1024), but if that is not the case,
> > > > > > then this is a useful optimization.
> > > > > 
> > > > > At least according to the me_cmp.h header, which says:
> > > > > 
> > > > > /* Motion estimation:
> > > > >  * h is limited to { width / 2, width, 2 * width },
> > > > >  * but never larger than 16 and never smaller than 2.
> > > > >  * Although currently h < 4 is not used as functions with
> > > > >  * width < 8 are neither used nor implemented. */
> > > > 
> > > > These rules where written with support for encoding of all
> > > > standard formats in mind at the time that was written.
> > > > today it may make sense to extend these rules to cover the
> > > > things which where created since then
> > > 
> > > Right, but if that suddenly changes, such a change also must expect that it
> > > might need updates to all assembly implementations that implement that
> > > interface currently. Right now, both the defacto case (any callers in the
> > > codebase) and the explicit documentation says that it can't be called with
> > > parameters outside of that range.
> > 
> > What i meant was that newly added functions should be more flexible than
> > these old rules. That is 2 sets of rules
> > 1. What a caller ATM can do and expect to work (thats whats written there)
> > 2. What an implementor of new functions should make sure is supported
> 
> With 2., do you mean if adding a new function into the same struct, or if
> implementing the existing pix_abs[0][..] functions?

i would say both


> 
> If you mean new implementations of the existing function interface, you say
> they "should be more flexible". How flexible must they be? Is it ok to
> assume h<=256 for the w=16 functions?

i think thats fine


> 
> Gradually increasing the requirements for existing function interfaces like
> you suggest is really problematic.

why ?
iam really just saying
"when you add new code, dont base it on old limitations"

And i suggest this because its much easier to do when a function is added than
when later one wants to use it 

When i originally wrote this list of restrictions it didnt list what our
code actually used but tried to look ahead to what we would need when we
write encoders for all standard formats at that time. We never wrote a
h264 encoder but this API was designed to support it.


> 
> If we want to require more of the functions, we should document it, and
> extend the checkasm test to test that new requirement - which also extends
> the requirement to the existing functions. If we don't have a checkasm test
> for the required behaviour, we can pretty much assume it's broken, even in
> new implementations.

yes

[...]

-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

No snowflake in an avalanche ever feels responsible. -- Voltaire

[-- Attachment #1.2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

[-- Attachment #2: Type: text/plain, Size: 251 bytes --]

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [FFmpeg-devel] [PATCH 5/5] aarch64: me_cmp: Don't do uaddlv once per iteration
  2022-07-16 13:20               ` Michael Niedermayer
@ 2022-07-16 14:23                 ` Martin Storsjö
  0 siblings, 0 replies; 20+ messages in thread
From: Martin Storsjö @ 2022-07-16 14:23 UTC (permalink / raw)
  To: FFmpeg development discussions and patches

On Sat, 16 Jul 2022, Michael Niedermayer wrote:

> On Sat, Jul 16, 2022 at 03:30:23PM +0300, Martin Storsjö wrote:
>> On Sat, 16 Jul 2022, Michael Niedermayer wrote:
>>
>>> On Sat, Jul 16, 2022 at 12:25:37AM +0300, Martin Storsjö wrote:
>>>> On Fri, 15 Jul 2022, Michael Niedermayer wrote:
>>>>
>>>>> On Fri, Jul 15, 2022 at 10:56:03PM +0300, Martin Storsjö wrote:
>>>>>> On Fri, 15 Jul 2022, Swinney, Jonathan wrote:
>>>>>>
>>>>>>> If the max height is just 16, then this should be fine. I assumed that h
>>>>>>> could have a much higher value (>1024), but if that is not the case,
>>>>>>> then this is a useful optimization.
>>>>>>
>>>>>> At least according to the me_cmp.h header, which says:
>>>>>>
>>>>>> /* Motion estimation:
>>>>>>  * h is limited to { width / 2, width, 2 * width },
>>>>>>  * but never larger than 16 and never smaller than 2.
>>>>>>  * Although currently h < 4 is not used as functions with
>>>>>>  * width < 8 are neither used nor implemented. */
>>>>>
>>>>> These rules where written with support for encoding of all
>>>>> standard formats in mind at the time that was written.
>>>>> today it may make sense to extend these rules to cover the
>>>>> things which where created since then
>>>>
>>>> Right, but if that suddenly changes, such a change also must expect that it
>>>> might need updates to all assembly implementations that implement that
>>>> interface currently. Right now, both the defacto case (any callers in the
>>>> codebase) and the explicit documentation says that it can't be called with
>>>> parameters outside of that range.
>>>
>>> What i meant was that newly added functions should be more flexible than
>>> these old rules. That is 2 sets of rules
>>> 1. What a caller ATM can do and expect to work (thats whats written there)
>>> 2. What an implementor of new functions should make sure is supported
>>
>> With 2., do you mean if adding a new function into the same struct, or if
>> implementing the existing pix_abs[0][..] functions?
>
> i would say both
>
>
>>
>> If you mean new implementations of the existing function interface, you say
>> they "should be more flexible". How flexible must they be? Is it ok to
>> assume h<=256 for the w=16 functions?
>
> i think thats fine

Ok, I'll go ahead and push this then.

>>
>> Gradually increasing the requirements for existing function interfaces like
>> you suggest is really problematic.
>
> why ?
> iam really just saying
> "when you add new code, dont base it on old limitations"

For this case, I just quoted what the header said, which seemed 
authoritative to me - but it's fine for me with a wider spec too, up to 
h=256. But saying arbitrarily "any height" really inhibits what you can do 
in asm.

Also if I shouldn't reference that limitation in the header, please 
update/reword it, as it's actively misleading for anyone working on 
this right now.

And in general, we can design for an intended use case and calculate 
whether it should work or not - but as long as it's not tested, those 
cases will often have hidden bugs - see e.g. patch 1/5 in this series.

// Martin
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [FFmpeg-devel] [PATCH 5/5] aarch64: me_cmp: Don't do uaddlv once per iteration
  2022-07-16 11:23           ` Michael Niedermayer
  2022-07-16 12:30             ` Martin Storsjö
@ 2022-07-16 12:50             ` Ronald S. Bultje
  2022-07-16 13:06               ` Michael Niedermayer
  1 sibling, 1 reply; 20+ messages in thread
From: Ronald S. Bultje @ 2022-07-16 12:50 UTC (permalink / raw)
  To: FFmpeg development discussions and patches

Hi,

On Sat, Jul 16, 2022 at 7:23 PM Michael Niedermayer <michael@niedermayer.cc>
wrote:

> What i meant was that newly added functions should be more flexible than
> these old rules. That is 2 sets of rules
> 1. What a caller ATM can do and expect to work (thats whats written there)
> 2. What an implementor of new functions should make sure is supported
>

What's the use case exactly that you'd like to support that we currently
don't?

Ronald
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [FFmpeg-devel] [PATCH 5/5] aarch64: me_cmp: Don't do uaddlv once per iteration
  2022-07-16 12:50             ` Ronald S. Bultje
@ 2022-07-16 13:06               ` Michael Niedermayer
  0 siblings, 0 replies; 20+ messages in thread
From: Michael Niedermayer @ 2022-07-16 13:06 UTC (permalink / raw)
  To: FFmpeg development discussions and patches


[-- Attachment #1.1: Type: text/plain, Size: 1074 bytes --]

On Sat, Jul 16, 2022 at 08:50:24PM +0800, Ronald S. Bultje wrote:
> Hi,
> 
> On Sat, Jul 16, 2022 at 7:23 PM Michael Niedermayer <michael@niedermayer.cc>
> wrote:
> 
> > What i meant was that newly added functions should be more flexible than
> > these old rules. That is 2 sets of rules
> > 1. What a caller ATM can do and expect to work (thats whats written there)
> > 2. What an implementor of new functions should make sure is supported
> >
> 
> What's the use case exactly that you'd like to support that we currently
> don't?

The idea is to have general purpose functions which can be used if someone
wants to write a new encoder. 
(this is kind of not a different goal, its rather that the original
 limitations where based on mpeg2/mpeg4/h264 which was the most flexible at
 the time)
 
thx


[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

Take away the freedom of one citizen and you will be jailed, take away
the freedom of all citizens and you will be congratulated by your peers
in Parliament.

[-- Attachment #1.2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

[-- Attachment #2: Type: text/plain, Size: 251 bytes --]

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [FFmpeg-devel] [PATCH 5/5] aarch64: me_cmp: Don't do uaddlv once per iteration
  2022-07-15 21:19       ` Michael Niedermayer
  2022-07-15 21:25         ` Martin Storsjö
@ 2022-07-16  9:18         ` Martin Storsjö
  1 sibling, 0 replies; 20+ messages in thread
From: Martin Storsjö @ 2022-07-16  9:18 UTC (permalink / raw)
  To: FFmpeg development discussions and patches

On Fri, 15 Jul 2022, Michael Niedermayer wrote:

> On Fri, Jul 15, 2022 at 10:56:03PM +0300, Martin Storsjö wrote:
>> On Fri, 15 Jul 2022, Swinney, Jonathan wrote:
>>
>>> If the max height is just 16, then this should be fine. I assumed that h
>>> could have a much higher value (>1024), but if that is not the case,
>>> then this is a useful optimization.
>>
>> At least according to the me_cmp.h header, which says:
>>
>> /* Motion estimation:
>>  * h is limited to { width / 2, width, 2 * width },
>>  * but never larger than 16 and never smaller than 2.
>>  * Although currently h < 4 is not used as functions with
>>  * width < 8 are neither used nor implemented. */
>
> These rules where written with support for encoding of all
> standard formats in mind at the time that was written.
> today it may make sense to extend these rules to cover the
> things which where created since then

Also - if extending this, I would expect that you want other widths too. 
Right now, most of the functions seem to be arranged such as [0] is w=16 
and [0] is w=8. For those, for w=8, it seems to be mostly hardcoded to 
only assume h=8, while the w=16 functions actually honor the h parameter.

If it ever would be relevant with h>256, that wouldn't be for the existing 
w=8 or w=16 functions, but for newer functions with a larger width too.

So I think this patch is safe (which works for h up to 256), and if 
someone wants to extend the interface later, that can be done.

// Martin
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [FFmpeg-devel] [PATCH 1/5] libavcodec: aarch64: Don't clobber v8 in the h%4 case in ff_pix_abs16_xy2_neon
  2022-07-13 20:48 [FFmpeg-devel] [PATCH 1/5] libavcodec: aarch64: Don't clobber v8 in the h%4 case in ff_pix_abs16_xy2_neon Martin Storsjö
                   ` (3 preceding siblings ...)
  2022-07-13 20:48 ` [FFmpeg-devel] [PATCH 5/5] aarch64: me_cmp: Don't do uaddlv once per iteration Martin Storsjö
@ 2022-07-15 19:35 ` Swinney, Jonathan
  4 siblings, 0 replies; 20+ messages in thread
From: Swinney, Jonathan @ 2022-07-15 19:35 UTC (permalink / raw)
  To: Martin Storsjö, ffmpeg-devel

LGTM.

-- 

Jonathan Swinney

On 7/13/22, 3:49 PM, "Martin Storsjö" <martin@martin.st> wrote:

    CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.



    Checkasm doesn't currently test this codepath.
    ---
     libavcodec/aarch64/me_cmp_neon.S | 4 ++--
     1 file changed, 2 insertions(+), 2 deletions(-)

    diff --git a/libavcodec/aarch64/me_cmp_neon.S b/libavcodec/aarch64/me_cmp_neon.S
    index e49d049fc2..31db3793d9 100644
    --- a/libavcodec/aarch64/me_cmp_neon.S
    +++ b/libavcodec/aarch64/me_cmp_neon.S
    @@ -189,11 +189,11 @@ function ff_pix_abs16_xy2_neon, export=1
             urshr           v16.8h, v16.8h, #2          // shift right by 2 0..7 (rounding shift right)
             urshr           v17.8h, v17.8h, #2          // shift right by 2 8..15

    -        uxtl2           v8.8h, v1.16b               // 8->16 bits pix1 8..15
    +        uxtl2           v7.8h, v1.16b               // 8->16 bits pix1 8..15
             uxtl            v1.8h, v1.8b                // 8->16 bits pix1 0..7

             uabd            v6.8h, v1.8h, v16.8h        // absolute difference 0..7
    -        uaba            v6.8h, v8.8h, v17.8h        // absolute difference accumulate 8..15
    +        uaba            v6.8h, v7.8h, v17.8h        // absolute difference accumulate 8..15
             mov             v2.16b, v18.16b             // pix3 -> pix2
             mov             v3.16b, v19.16b             // pix3+1 -> pix2+1
             uaddlv          s6, v6.8h                   // add up accumulator in v6
    --
    2.25.1


_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2022-07-16 14:24 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-07-13 20:48 [FFmpeg-devel] [PATCH 1/5] libavcodec: aarch64: Don't clobber v8 in the h%4 case in ff_pix_abs16_xy2_neon Martin Storsjö
2022-07-13 20:48 ` [FFmpeg-devel] [PATCH 2/5] checkasm: motion: Make the benchmarks more stable Martin Storsjö
2022-07-15 19:35   ` Swinney, Jonathan
2022-07-13 20:48 ` [FFmpeg-devel] [PATCH 3/5] aarch64: me_cmp: Interleave some of the loads in ff_pix_abs16_xy2_neon Martin Storsjö
2022-07-15 19:34   ` Swinney, Jonathan
2022-07-13 20:48 ` [FFmpeg-devel] [PATCH 4/5] aarch64: me_cmp: Switch from uabd to uabal " Martin Storsjö
2022-07-15 19:38   ` Swinney, Jonathan
2022-07-13 20:48 ` [FFmpeg-devel] [PATCH 5/5] aarch64: me_cmp: Don't do uaddlv once per iteration Martin Storsjö
2022-07-15 19:32   ` Swinney, Jonathan
2022-07-15 19:56     ` Martin Storsjö
2022-07-15 21:19       ` Michael Niedermayer
2022-07-15 21:25         ` Martin Storsjö
2022-07-16 11:23           ` Michael Niedermayer
2022-07-16 12:30             ` Martin Storsjö
2022-07-16 13:20               ` Michael Niedermayer
2022-07-16 14:23                 ` Martin Storsjö
2022-07-16 12:50             ` Ronald S. Bultje
2022-07-16 13:06               ` Michael Niedermayer
2022-07-16  9:18         ` Martin Storsjö
2022-07-15 19:35 ` [FFmpeg-devel] [PATCH 1/5] libavcodec: aarch64: Don't clobber v8 in the h%4 case in ff_pix_abs16_xy2_neon Swinney, Jonathan

Git Inbox Mirror of the ffmpeg-devel mailing list - see https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

This inbox may be cloned and mirrored by anyone:

	git clone --mirror https://master.gitmailbox.com/ffmpegdev/0 ffmpegdev/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 ffmpegdev ffmpegdev/ https://master.gitmailbox.com/ffmpegdev \
		ffmpegdev@gitmailbox.com
	public-inbox-index ffmpegdev

Example config snippet for mirrors.


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git