From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from ffbox0-bg.mplayerhq.hu (ffbox0-bg.ffmpeg.org [79.124.17.100]) by master.gitmailbox.com (Postfix) with ESMTP id C07284220B for ; Mon, 28 Feb 2022 20:41:53 +0000 (UTC) Received: from [127.0.1.1] (localhost [127.0.0.1]) by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id DA25068AF85; Mon, 28 Feb 2022 22:41:50 +0200 (EET) Received: from smtp-fw-9103.amazon.com (smtp-fw-9103.amazon.com [207.171.188.200]) by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTPS id BB88768AB96 for ; Mon, 28 Feb 2022 22:41:43 +0200 (EET) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=amazon.com; i=@amazon.com; q=dns/txt; s=amazon201209; t=1646080909; x=1677616909; h=from:to:subject:date:message-id: content-transfer-encoding:mime-version; bh=wLUzhrVmiSHbofqdJBK78NDzrvcSqUTSeIQjeofiUFE=; b=lDnS0kgemMf76HLE1aMLY6/8yYhJT5+OCyERCOviuQQdYmVrMMcvhdHc pt9DSqFG7bsu5yZKD/PiQVp5P4ju57Q5WQdl11tgxOoojBo06y2ZyVPZv FEDxHgEdKZk7voL8JrUjL+/k9CRzF8TS6gUHawT4GYypGecZhgGDvnx8n A=; X-IronPort-AV: E=Sophos;i="5.90,144,1643673600"; d="scan'208";a="995595018" Received: from pdx4-co-svc-p1-lb2-vlan2.amazon.com (HELO email-inbound-relay-pdx-2c-b09ea7fa.us-west-2.amazon.com) ([10.25.36.210]) by smtp-border-fw-9103.sea19.amazon.com with ESMTP; 28 Feb 2022 20:41:40 +0000 Received: from EX13MTAUWB001.ant.amazon.com (pdx1-ws-svc-p6-lb9-vlan3.pdx.amazon.com [10.236.137.198]) by email-inbound-relay-pdx-2c-b09ea7fa.us-west-2.amazon.com (Postfix) with ESMTPS id 403A741ADD for ; Mon, 28 Feb 2022 20:41:40 +0000 (UTC) Received: from EX13D07UWB004.ant.amazon.com (10.43.161.196) by EX13MTAUWB001.ant.amazon.com (10.43.161.207) with Microsoft SMTP Server (TLS) id 15.0.1497.28; Mon, 28 Feb 2022 20:41:39 +0000 Received: from EX13D07UWB004.ant.amazon.com (10.43.161.196) by EX13D07UWB004.ant.amazon.com (10.43.161.196) with Microsoft SMTP Server (TLS) id 15.0.1497.28; Mon, 28 Feb 2022 20:41:39 +0000 Received: from EX13D07UWB004.ant.amazon.com ([10.43.161.196]) by EX13D07UWB004.ant.amazon.com ([10.43.161.196]) with mapi id 15.00.1497.028; Mon, 28 Feb 2022 20:41:39 +0000 From: "Swinney, Jonathan" To: "ffmpeg-devel@ffmpeg.org" Thread-Topic: [PATCH] swscale/aarch64: add hscale specializations Thread-Index: AdgqlYlVD0RgImpmSr6Sxw6I7MzmFg== Date: Mon, 28 Feb 2022 20:41:39 +0000 Message-ID: <097207e51edb42fdbb79fd3f36b42254@EX13D07UWB004.ant.amazon.com> Accept-Language: en-US Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: x-ms-exchange-transport-fromentityheader: Hosted x-originating-ip: [10.43.161.217] MIME-Version: 1.0 Subject: [FFmpeg-devel] [PATCH] swscale/aarch64: add hscale specializations X-BeenThere: ffmpeg-devel@ffmpeg.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: FFmpeg development discussions and patches List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Reply-To: FFmpeg development discussions and patches Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: quoted-printable Errors-To: ffmpeg-devel-bounces@ffmpeg.org Sender: "ffmpeg-devel" Archived-At: List-Archive: List-Post: This patch adds specializations for hscale for filterSize =3D=3D 4 and 8 and converts the existing implementation for the X8 version. For the old code, = now used for the X8 version, it improves the efficiency of the final summations= by reducing 11 instructions to 7. ff_hscale8to15_8_neon is mostly unchanged from the original except for a few changes. - The loads for the filter data were consolidated into a single 64 byte ld1 instruction. - The final summations were improved. - The inner loop on filterSize was completely removed ff_hscale8to15_4_neon is a complete rewrite. Since the main bottleneck here= is loading the data from src, this data is loaded a whole block ahead and stor= ed back to the stack to be loaded again with ld4. This arranges the data for m= ost efficient use of the vector instructions and removes the need for completion adds at the end. The number of iterations of the C per iteration of the ass= embly is increased from 4 to 8, but because of the prefetching, it can only be us= ed when dstW is >=3D 16. This improves speed by 26% on Graviton 2 (Neoverse N1) ffmpeg -nostats -f lavfi -i testsrc2=3D4k:d=3D2 -vf bench=3Dstart,scale=3D1= 024x1024,bench=3Dstop -f null - before: t:0.001796 avg:0.001839 max:0.002756 min:0.001733 after: t:0.001690 avg:0.001352 max:0.002171 min:0.001292 In direct micro benchmarks I wrote the benefit is more dramatic when filter= Size =3D=3D 4. | (seconds) | c6g | | | ----------- | ------- | ----- | | filterSize | 4 | 8 | | original | 7.554 | 7.621 | | optimized | 3.736 | 7.054 | | improvement | 102.19% | 8.04% | Signed-off-by: Jonathan Swinney --- libswscale/aarch64/hscale.S | 263 +++++++++++++++++++++++++++++++++-- libswscale/aarch64/swscale.c | 41 ++++-- libswscale/utils.c | 2 +- 3 files changed, 284 insertions(+), 22 deletions(-) diff --git a/libswscale/aarch64/hscale.S b/libswscale/aarch64/hscale.S index af55ffe2b7..a934653a46 100644 --- a/libswscale/aarch64/hscale.S +++ b/libswscale/aarch64/hscale.S @@ -1,5 +1,7 @@ /* * Copyright (c) 2016 Cl=C3=A9ment B=C5"sch + * Copyright (c) 2019-2021 Sebastian Pop + * Copyright (c) 2022 Jonathan Swinney * * This file is part of FFmpeg. * @@ -20,7 +22,25 @@ = #include "libavutil/aarch64/asm.S" = -function ff_hscale_8_to_15_neon, export=3D1 +/* +;-------------------------------------------------------------------------= ---- +; horizontal line scaling +; +; void hscaleto__ +; (SwsContext *c, int{16,32}_t *dst, +; int dstW, const uint{8,16}_t *src, +; const int16_t *filter, +; const int32_t *filterPos, int filterSize); +; +; Scale one horizontal line. Input is either 8-bit width or 16-bit width +; ($source_width can be either 8, 9, 10 or 16, difference is whether we ha= ve to +; downscale before multiplying). Filter is 14 bits. Output is either 15 bi= ts +; (in int16_t) or 19 bits (in int32_t), as given in $intermediate_nbits. E= ach +; output pixel is generated from $filterSize input pixels, the position of +; the first pixel is given in filterPos[nOutputPixel]. +;-------------------------------------------------------------------------= ---- */ + +function ff_hscale8to15_X8_neon, export=3D1 sbfiz x7, x6, #1, #32 // filterSize*2 (*= 2 because int16) 1: ldr w8, [x5], #4 // filterPos[idx] ldr w0, [x5], #4 // filterPos[idx += 1] @@ -61,20 +81,239 @@ function ff_hscale_8_to_15_neon, export=3D1 smlal v3.4S, v18.4H, v19.4H // v3 accumulates = srcp[filterPos[3] + {0..3}] * filter[{0..3}] smlal2 v3.4S, v18.8H, v19.8H // v3 accumulates = srcp[filterPos[3] + {4..7}] * filter[{4..7}] b.gt 2b // inner loop if f= ilterSize not consumed completely - addp v0.4S, v0.4S, v0.4S // part0 horizonta= l pair adding - addp v1.4S, v1.4S, v1.4S // part1 horizonta= l pair adding - addp v2.4S, v2.4S, v2.4S // part2 horizonta= l pair adding - addp v3.4S, v3.4S, v3.4S // part3 horizonta= l pair adding - addp v0.4S, v0.4S, v0.4S // part0 horizonta= l pair adding - addp v1.4S, v1.4S, v1.4S // part1 horizonta= l pair adding - addp v2.4S, v2.4S, v2.4S // part2 horizonta= l pair adding - addp v3.4S, v3.4S, v3.4S // part3 horizonta= l pair adding - zip1 v0.4S, v0.4S, v1.4S // part01 =3D zip = values from part0 and part1 - zip1 v2.4S, v2.4S, v3.4S // part23 =3D zip = values from part2 and part3 - mov v0.d[1], v2.d[0] // part0123 =3D zi= p values from part01 and part23 + uzp1 v4.4S, v0.4S, v1.4S // unzip low parts= 0 and 1 + uzp2 v5.4S, v0.4S, v1.4S // unzip high part= s 0 and 1 + uzp1 v6.4S, v2.4S, v3.4S // unzip low parts= 2 and 3 + uzp2 v7.4S, v2.4S, v3.4S // unzip high part= s 2 and 3 + add v16.4S, v4.4S, v5.4S // add half of eac= h of part 0 and 1 + add v17.4S, v6.4S, v7.4S // add half of eac= h of part 2 and 3 + addp v0.4S, v16.4S, v17.4S // pairwise add to= complete half adds in earlier steps subs w2, w2, #4 // dstW -=3D 4 sqshrn v0.4H, v0.4S, #7 // shift and clip = the 2x16-bit final values st1 {v0.4H}, [x1], #8 // write to destin= ation part0123 b.gt 1b // loop until end = of line ret endfunc + + +function ff_hscale8to15_8_neon, export=3D1 +// x0 SwsContext *c (not used) +// x1 int16_t *dst +// x2 int dstW +// x3 const uint8_t *src +// x4 const int16_t *filter +// x5 const int32_t *filterPos +// x6 int filterSize +// x8-x11 filterPos values + +// v0-v3 multiply add accumulators +// v4-v7 filter data, temp for final horizontal sum +// v16-v19 src data +1: + ld1 {v4.8H, v5.8H, v6.8H, v7.8H}, [x4], #64 // loa= d filter[idx=3D0..3, j=3D0..7] + ldp w8, w9, [x5] // filterPos[idx += 0], [idx + 1] + ldp w10, w11, [x5, 8] // filterPos[idx += 2], [idx + 3] + movi v0.2D, #0 // val sum part 1 = (for dst[0]) + movi v1.2D, #0 // val sum part 2 = (for dst[1]) + add x5, x5, #16 // increment filte= rPos + + add x8, x3, w8, UXTW // srcp + filterPo= s[0] + add x9, x3, w9, UXTW // srcp + filterPo= s[1] + add x10, x3, w10, UXTW // srcp + filterPo= s[2] + add x11, x3, w11, UXTW // srcp + filterPo= s[3] + + ld1 {v16.8B}, [x8], #8 // srcp[filterPos[= 0] + {0..7}] + ld1 {v17.8B}, [x9], #8 // srcp[filterPos[= 1] + {0..7}] + + movi v2.2D, #0 // val sum part 3 = (for dst[2]) + movi v3.2D, #0 // val sum part 4 = (for dst[3]) + + uxtl v16.8H, v16.8B // unpack part 1 t= o 16-bit + uxtl v17.8H, v17.8B // unpack part 2 t= o 16-bit + + smlal v0.4S, v16.4H, v4.4H // v0 accumulates = srcp[filterPos[0] + {0..3}] * filter[{0..3}] + smlal v1.4S, v17.4H, v5.4H // v1 accumulates = srcp[filterPos[1] + {0..3}] * filter[{0..3}] + + ld1 {v18.8B}, [x10], #8 // srcp[filterPos[= 2] + {0..7}] + ld1 {v19.8B}, [x11], #8 // srcp[filterPos[= 3] + {0..7}] + + smlal2 v0.4S, v16.8H, v4.8H // v0 accumulates = srcp[filterPos[0] + {4..7}] * filter[{4..7}] + smlal2 v1.4S, v17.8H, v5.8H // v1 accumulates = srcp[filterPos[1] + {4..7}] * filter[{4..7}] + + uxtl v18.8H, v18.8B // unpack part 3 t= o 16-bit + uxtl v19.8H, v19.8B // unpack part 4 t= o 16-bit + + smlal v2.4S, v18.4H, v6.4H // v2 accumulates = srcp[filterPos[2] + {0..3}] * filter[{0..3}] + smlal v3.4S, v19.4H, v7.4H // v3 accumulates = srcp[filterPos[3] + {0..3}] * filter[{0..3}] + + smlal2 v2.4S, v18.8H, v6.8H // v2 accumulates = srcp[filterPos[2] + {4..7}] * filter[{4..7}] + smlal2 v3.4S, v19.8H, v7.8H // v3 accumulates = srcp[filterPos[3] + {4..7}] * filter[{4..7}] + + uzp1 v4.4S, v0.4S, v1.4S // unzip low parts= 0 and 1 + uzp2 v5.4S, v0.4S, v1.4S // unzip high part= s 0 and 1 + uzp1 v6.4S, v2.4S, v3.4S // unzip low parts= 2 and 3 + uzp2 v7.4S, v2.4S, v3.4S // unzip high part= s 2 and 3 + + add v0.4S, v4.4S, v5.4S // add half of eac= h of part 0 and 1 + add v1.4S, v6.4S, v7.4S // add half of eac= h of part 2 and 3 + + addp v4.4S, v0.4S, v1.4S // pairwise add to= complete half adds in earlier steps + + subs w2, w2, #4 // dstW -=3D 4 + sqshrn v0.4H, v4.4S, #7 // shift and clip = the 2x16-bit final values + st1 {v0.4H}, [x1], #8 // write to destin= ation part0123 + b.gt 1b // loop until end = of line + ret +endfunc + +function ff_hscale8to15_4_neon, export=3D1 +// x0 SwsContext *c (not used) +// x1 int16_t *dst +// x2 int dstW +// x3 const uint8_t *src +// x4 const int16_t *filter +// x5 const int32_t *filterPos +// x6 int filterSize +// x8-x15 registers for gathering src data + +// v0 madd accumulator 4S +// v1-v4 filter values (16 bit) 8H +// v5 madd accumulator 4S +// v16-v19 src values (8 bit) 8B + +// This implementation has 4 sections: +// 1. Prefetch src data +// 2. Interleaved prefetching src data and madd +// 3. Complete madd +// 4. Complete remaining iterations when dstW % 8 !=3D 0 + + add sp, sp, #-32 // allocate 32 byt= es on the stack + cmp w2, #16 // if dstW <16, sk= ip to the last block used for wrapping up + b.lt 2f + + // load 8 values from filterPos to be used as offsets into src + ldp w8, w9, [x5] // filterPos[idx += 0], [idx + 1] + ldp w10, w11, [x5, 8] // filterPos[idx += 2], [idx + 3] + ldp w12, w13, [x5, 16] // filterPos[idx += 4], [idx + 5] + ldp w14, w15, [x5, 24] // filterPos[idx += 6], [idx + 7] + add x5, x5, #32 // advance filterP= os + + // gather random access data from src into contiguous memory + ldr w8, [x3, w8, UXTW] // src[filterPos[i= dx + 0]][0..3] + ldr w9, [x3, w9, UXTW] // src[filterPos[i= dx + 1]][0..3] + ldr w10, [x3, w10, UXTW] // src[filterPos[i= dx + 2]][0..3] + ldr w11, [x3, w11, UXTW] // src[filterPos[i= dx + 3]][0..3] + ldr w12, [x3, w12, UXTW] // src[filterPos[i= dx + 4]][0..3] + ldr w13, [x3, w13, UXTW] // src[filterPos[i= dx + 5]][0..3] + ldr w14, [x3, w14, UXTW] // src[filterPos[i= dx + 6]][0..3] + ldr w15, [x3, w15, UXTW] // src[filterPos[i= dx + 7]][0..3] + stp w8, w9, [sp] // *scratch_mem = =3D { src[filterPos[idx + 0]][0..3], src[filterPos[idx + 1]][0..3] } + stp w10, w11, [sp, 8] // *scratch_mem = =3D { src[filterPos[idx + 2]][0..3], src[filterPos[idx + 3]][0..3] } + stp w12, w13, [sp, 16] // *scratch_mem = =3D { src[filterPos[idx + 4]][0..3], src[filterPos[idx + 5]][0..3] } + stp w14, w15, [sp, 24] // *scratch_mem = =3D { src[filterPos[idx + 6]][0..3], src[filterPos[idx + 7]][0..3] } + +1: + ld4 {v16.8B, v17.8B, v18.8B, v19.8B}, [sp] // tran= spose 8 bytes each from src into 4 registers + + // load 8 values from filterPos to be used as offsets into src + ldp w8, w9, [x5] // filterPos[idx += 0][0..3], [idx + 1][0..3], next iteration + ldp w10, w11, [x5, 8] // filterPos[idx += 2][0..3], [idx + 3][0..3], next iteration + ldp w12, w13, [x5, 16] // filterPos[idx += 4][0..3], [idx + 5][0..3], next iteration + ldp w14, w15, [x5, 24] // filterPos[idx += 6][0..3], [idx + 7][0..3], next iteration + + movi v0.2D, #0 // Clear madd accu= mulator for idx 0..3 + movi v5.2D, #0 // Clear madd accu= mulator for idx 4..7 + + ld4 {v1.8H, v2.8H, v3.8H, v4.8H}, [x4], #64 // loa= d filter idx + 0..7 + + add x5, x5, #32 // advance filterP= os + + // interleaved SIMD and prefetching intended to keep ld/st and vec= tor pipelines busy + uxtl v16.8H, v16.8B // unsigned extend= long, covert src data to 16-bit + uxtl v17.8H, v17.8B // unsigned extend= long, covert src data to 16-bit + ldr w8, [x3, w8, UXTW] // src[filterPos[i= dx + 0]], next iteration + ldr w9, [x3, w9, UXTW] // src[filterPos[i= dx + 1]], next iteration + uxtl v18.8H, v18.8B // unsigned extend= long, covert src data to 16-bit + uxtl v19.8H, v19.8B // unsigned extend= long, covert src data to 16-bit + ldr w10, [x3, w10, UXTW] // src[filterPos[i= dx + 2]], next iteration + ldr w11, [x3, w11, UXTW] // src[filterPos[i= dx + 3]], next iteration + + smlal v0.4S, v1.4H, v16.4H // multiply accumu= late inner loop j =3D 0, idx =3D 0..3 + smlal v0.4S, v2.4H, v17.4H // multiply accumu= late inner loop j =3D 1, idx =3D 0..3 + ldr w12, [x3, w12, UXTW] // src[filterPos[i= dx + 4]], next iteration + ldr w13, [x3, w13, UXTW] // src[filterPos[i= dx + 5]], next iteration + smlal v0.4S, v3.4H, v18.4H // multiply accumu= late inner loop j =3D 2, idx =3D 0..3 + smlal v0.4S, v4.4H, v19.4H // multiply accumu= late inner loop j =3D 3, idx =3D 0..3 + ldr w14, [x3, w14, UXTW] // src[filterPos[i= dx + 6]], next iteration + ldr w15, [x3, w15, UXTW] // src[filterPos[i= dx + 7]], next iteration + + smlal2 v5.4S, v1.8H, v16.8H // multiply accumu= late inner loop j =3D 0, idx =3D 4..7 + smlal2 v5.4S, v2.8H, v17.8H // multiply accumu= late inner loop j =3D 1, idx =3D 4..7 + stp w8, w9, [sp] // *scratch_mem = =3D { src[filterPos[idx + 0]][0..3], src[filterPos[idx + 1]][0..3] } + stp w10, w11, [sp, 8] // *scratch_mem = =3D { src[filterPos[idx + 2]][0..3], src[filterPos[idx + 3]][0..3] } + smlal2 v5.4S, v3.8H, v18.8H // multiply accumu= late inner loop j =3D 2, idx =3D 4..7 + smlal2 v5.4S, v4.8H, v19.8H // multiply accumu= late inner loop j =3D 3, idx =3D 4..7 + stp w12, w13, [sp, 16] // *scratch_mem = =3D { src[filterPos[idx + 4]][0..3], src[filterPos[idx + 5]][0..3] } + stp w14, w15, [sp, 24] // *scratch_mem = =3D { src[filterPos[idx + 6]][0..3], src[filterPos[idx + 7]][0..3] } + + sub w2, w2, #8 // dstW -=3D 8 + sqshrn v0.4H, v0.4S, #7 // shift and clip = the 2x16-bit final values + sqshrn v1.4H, v5.4S, #7 // shift and clip = the 2x16-bit final values + st1 {v0.4H, v1.4H}, [x1], #16 // write to dst[id= x + 0..7] + cmp w2, #16 // continue on mai= n loop if there are at least 16 iterations left + b.ge 1b + + // last full iteration + ld4 {v16.8B, v17.8B, v18.8B, v19.8B}, [sp] + ld4 {v1.8H, v2.8H, v3.8H, v4.8H}, [x4], #64 // loa= d filter idx + 0..7 + + movi v0.2D, #0 // Clear madd accu= mulator for idx 0..3 + movi v5.2D, #0 // Clear madd accu= mulator for idx 4..7 + + uxtl v16.8H, v16.8B // unsigned extend= long, covert src data to 16-bit + uxtl v17.8H, v17.8B // unsigned extend= long, covert src data to 16-bit + uxtl v18.8H, v18.8B // unsigned extend= long, covert src data to 16-bit + uxtl v19.8H, v19.8B // unsigned extend= long, covert src data to 16-bit + + smlal v0.4S, v1.4H, v16.4H // multiply accumu= late inner loop j =3D 0, idx =3D 0..3 + smlal v0.4S, v2.4H, v17.4H // multiply accumu= late inner loop j =3D 1, idx =3D 0..3 + smlal v0.4S, v3.4H, v18.4H // multiply accumu= late inner loop j =3D 2, idx =3D 0..3 + smlal v0.4S, v4.4H, v19.4H // multiply accumu= late inner loop j =3D 3, idx =3D 0..3 + + smlal2 v5.4S, v1.8H, v16.8H // multiply accumu= late inner loop j =3D 0, idx =3D 4..7 + smlal2 v5.4S, v2.8H, v17.8H // multiply accumu= late inner loop j =3D 1, idx =3D 4..7 + smlal2 v5.4S, v3.8H, v18.8H // multiply accumu= late inner loop j =3D 2, idx =3D 4..7 + smlal2 v5.4S, v4.8H, v19.8H // multiply accumu= late inner loop j =3D 3, idx =3D 4..7 + + subs w2, w2, #8 // dstW -=3D 8 + sqshrn v0.4H, v0.4S, #7 // shift and clip = the 2x16-bit final values + sqshrn v1.4H, v5.4S, #7 // shift and clip = the 2x16-bit final values + st1 {v0.4H, v1.4H}, [x1], #16 // write to dst[id= x + 0..7] + + cbnz w2, 2f // if >0 iteration= s remain, jump to the wrap up section + + add sp, sp, #32 // clean up stack + ret + + // finish up when dstW % 8 !=3D 0 or dstW < 16 +2: + // load src + ldr w8, [x5], #4 // filterPos[i] + ldr w9, [x3, w8, UXTW] // src[filterPos[i= ] + 0..3] + ins v5.S[0], w9 // move to simd re= gister + // load filter + ld1 {v6.4H}, [x4], #8 // filter[filterSi= ze * i + 0..3] + + uxtl v5.8H, v5.8B // unsigned exten = long, convert src data to 16-bit + smull v0.4S, v5.4H, v6.4H // 4 iterations of= src[...] * filter[...] + addp v0.4S, v0.4S, v0.4S // accumulate the = smull results + addp v0.4S, v0.4S, v0.4S // accumulate the = smull results + sqshrn v0.4H, v0.4S, #7 // shift and clip = the 2x16-bit final values + mov w10, v0.S[0] // move back to ge= neral register (only one value from simd reg is used) + strh w10, [x1], #2 // dst[i] =3D ... + sub w2, w2, #1 // dstW-- + cbnz w2, 2b + + add sp, sp, #32 // clean up stack + ret +endfunc diff --git a/libswscale/aarch64/swscale.c b/libswscale/aarch64/swscale.c index 09d0a7130e..2ea4ccb3a6 100644 --- a/libswscale/aarch64/swscale.c +++ b/libswscale/aarch64/swscale.c @@ -22,25 +22,48 @@ #include "libswscale/swscale_internal.h" #include "libavutil/aarch64/cpu.h" = -void ff_hscale_8_to_15_neon(SwsContext *c, int16_t *dst, int dstW, - const uint8_t *src, const int16_t *filter, - const int32_t *filterPos, int filterSize); +#define SCALE_FUNC(filter_n, from_bpc, to_bpc, opt) \ +void ff_hscale ## from_bpc ## to ## to_bpc ## _ ## filter_n ## _ ## opt( \ + SwsContext *c, int16_t *da= ta, \ + int dstW, const uint8_t *s= rc, \ + const int16_t *filter, \ + const int32_t *filterPos, = int filterSize) +#define SCALE_FUNCS(filter_n, opt) \ + SCALE_FUNC(filter_n, 8, 15, opt); +#define ALL_SCALE_FUNCS(opt) \ + SCALE_FUNCS(4, opt); \ + SCALE_FUNCS(8, opt); \ + SCALE_FUNCS(X8, opt) + +ALL_SCALE_FUNCS(neon); = void ff_yuv2planeX_8_neon(const int16_t *filter, int filterSize, const int16_t **src, uint8_t *dest, int dstW, const uint8_t *dither, int offset); = +#define ASSIGN_SCALE_FUNC2(hscalefn, filtersize, opt) do { \ + if (c->srcBpc =3D=3D 8 && c->dstBpc <=3D 14) { = \ + hscalefn =3D \ + ff_hscale8to15_ ## filtersize ## _ ## opt; \ + } \ +} while (0) + +#define ASSIGN_SCALE_FUNC(hscalefn, filtersize, opt) \ + switch (filtersize) { \ + case 4: ASSIGN_SCALE_FUNC2(hscalefn, 4, opt); break; \ + case 8: ASSIGN_SCALE_FUNC2(hscalefn, 8, opt); break; \ + default: if (filtersize % 8 =3D=3D 0) = \ + ASSIGN_SCALE_FUNC2(hscalefn, X8, opt); \ + break; \ + } + av_cold void ff_sws_init_swscale_aarch64(SwsContext *c) { int cpu_flags =3D av_get_cpu_flags(); = if (have_neon(cpu_flags)) { - if (c->srcBpc =3D=3D 8 && c->dstBpc <=3D 14 && - (c->hLumFilterSize % 8) =3D=3D 0 && - (c->hChrFilterSize % 8) =3D=3D 0) - { - c->hyScale =3D c->hcScale =3D ff_hscale_8_to_15_neon; - } + ASSIGN_SCALE_FUNC(c->hyScale, c->hLumFilterSize, neon); + ASSIGN_SCALE_FUNC(c->hcScale, c->hChrFilterSize, neon); if (c->dstBpc =3D=3D 8) { c->yuv2planeX =3D ff_yuv2planeX_8_neon; } diff --git a/libswscale/utils.c b/libswscale/utils.c index c5ea8853d5..2f2b8e73a9 100644 --- a/libswscale/utils.c +++ b/libswscale/utils.c @@ -1825,7 +1825,7 @@ av_cold int sws_init_context(SwsContext *c, SwsFilter= *srcFilter, { const int filterAlign =3D X86_MMX(cpu_flags) ? 4 : PPC_ALTIVEC(cpu_flags) ? 8 : - have_neon(cpu_flags) ? 8 : 1; + have_neon(cpu_flags) ? 4 : 1; = if ((ret =3D initFilter(&c->hLumFilter, &c->hLumFilterPos, &c->hLumFilterSize, c->lumXInc, -- = 2.32.0 _______________________________________________ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org https://ffmpeg.org/mailman/listinfo/ffmpeg-devel To unsubscribe, visit link above, or email ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".