From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <ffmpeg-devel-bounces@ffmpeg.org>
Received: from ffbox0-bg.mplayerhq.hu (ffbox0-bg.ffmpeg.org [79.124.17.100])
	by master.gitmailbox.com (Postfix) with ESMTP id 6A1504256F
	for <ffmpegdev@gitmailbox.com>; Fri, 18 Mar 2022 19:10:23 +0000 (UTC)
Received: from [127.0.1.1] (localhost [127.0.0.1])
	by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id AD95868B0C3;
	Fri, 18 Mar 2022 21:10:20 +0200 (EET)
Received: from EUR05-DB8-obe.outbound.protection.outlook.com
 (mail-db8eur05olkn2107.outbound.protection.outlook.com [40.92.89.107])
 by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTPS id 9BC3C68020F
 for <ffmpeg-devel@ffmpeg.org>; Fri, 18 Mar 2022 21:10:14 +0200 (EET)
ARC-Seal: i=1; a=rsa-sha256; s=arcselector9901; d=microsoft.com; cv=none;
 b=Ri+ZBR5IdeKxsFay4w+U11IaMySMX1jERZfseZTpHiPETL/YNyYlf+4Do7G0+elJb6rnghOmU+XuagcK6O7kKG/5yW+VT4OPIH6WlUmM0CbKlMnRoe83vF5g0Fj1ne63bDC2K7Kst0IMiJnSmQzp7CFnKdGKlSsfliquowToTsvWRbUHx3MvvhIlefAQI2y9PWprnGz/Ua0BUl2N8Si1E9wrD54d8tlkhA6plWdSlv1rrB1xn0xUGh/M9PD86SofYAPOQcXW0TOzFmmtgF1zbHVWiYBtC26u6Kbvmtn+lTOhPDn8bAPAgF1rWfkCgyqr9pN3WCArQSYf6fpNFb3OIQ==
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; 
 s=arcselector9901;
 h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-AntiSpam-MessageData-ChunkCount:X-MS-Exchange-AntiSpam-MessageData-0:X-MS-Exchange-AntiSpam-MessageData-1;
 bh=+5BOKAEuw/k/M3aibH9FVxr1DQdotu3GB0uKvGFUK3Q=;
 b=Dn0HiLP98P2vlkeN3tKX45x0KwVWduOtmrbweE3+fwEs3a+kaItAdtNfrw1MKAFHjSGuc+w27prm7igJ6vWShGOTB4NnVwFO3BEDvMGkckuZ/z8gKIRzxw0tAk/YUTKPbsgq47VT9E2DRkLV0TWHNPdppRO6GqTwHM90163tQuLePu2bx03yc3RlnFMwTjm/AnhROGRxtCSrWkPbNW3qjmy2qQdY3FXHVuWLzZ2sGQUs+QNTmzzMAFPDLTuXaSmsJ32u6QJrl77jYqLD3Zf5wmez6l4odgny2aahLMTSRsKzsxYYVk1SmNJnjYtzvHG+lIE6UZiqMCo5s+iW28s8OQ==
ARC-Authentication-Results: i=1; mx.microsoft.com 1; spf=none; dmarc=none;
 dkim=none; arc=none
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=outlook.com;
 s=selector1;
 h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-SenderADCheck;
 bh=+5BOKAEuw/k/M3aibH9FVxr1DQdotu3GB0uKvGFUK3Q=;
 b=cdjoIWQivezcvJkTeOnP2Zf2m2ZkhXV9u6wspa41oNNLo3OL3tYuEBfXA9aEcZeUytOmqr9JDApsY0yWCS77SILQEjrQn/Kr4R+ObAWTPeBTXd6u1+teXnHoQNK9Nj24xB7AizQCkKWDyq/SDdwD42V1/CZiY2dNE5Iw+mYX57glF0GfRnqX4167kcyI0l8q3D89wspu+K9jgh1StZG546SdLwGmmQFp3WEQZD0qdrXt/ZOad37Rd12A6EMXi03ppsXsG5hBDfhmNnXPcwee/DGjwVrB9+ZUaBxr7E5PU6O37Qotz8oUecM7LGqwou4gtpgEcaA5y6Hp/HbHXjtIAQ==
Received: from AS1PR01MB9564.eurprd01.prod.exchangelabs.com
 (2603:10a6:20b:4d1::16) by AM0PR01MB5923.eurprd01.prod.exchangelabs.com
 (2603:10a6:208:15b::27) with Microsoft SMTP Server (version=TLS1_2,
 cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.5081.15; Fri, 18 Mar
 2022 19:10:12 +0000
Received: from AS1PR01MB9564.eurprd01.prod.exchangelabs.com
 ([fe80::9070:a5fd:e532:bdf8]) by AS1PR01MB9564.eurprd01.prod.exchangelabs.com
 ([fe80::9070:a5fd:e532:bdf8%3]) with mapi id 15.20.5081.016; Fri, 18 Mar 2022
 19:10:12 +0000
Message-ID: <AS1PR01MB9564305644E9C9578DA3F0E08F139@AS1PR01MB9564.eurprd01.prod.exchangelabs.com>
Date: Fri, 18 Mar 2022 20:10:11 +0100
Content-Language: en-US
To: ffmpeg-devel@ffmpeg.org
References: <20220317185819.466470-1-bavison@riscosopen.org>
 <20220317185819.466470-7-bavison@riscosopen.org>
From: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
In-Reply-To: <20220317185819.466470-7-bavison@riscosopen.org>
X-TMN: [fCDwVA0lD1CQCMrZ8TiO8VCxVC9eqDiS]
X-ClientProxiedBy: AM5PR0201CA0022.eurprd02.prod.outlook.com
 (2603:10a6:203:3d::32) To AS1PR01MB9564.eurprd01.prod.exchangelabs.com
 (2603:10a6:20b:4d1::16)
X-Microsoft-Original-Message-ID: <1cb6e1bb-f699-f754-0ff9-ffb049b44838@outlook.com>
MIME-Version: 1.0
X-MS-Exchange-MessageSentRepresentingType: 1
X-MS-PublicTrafficType: Email
X-MS-Office365-Filtering-Correlation-Id: aa14ad0e-7397-4e29-0bf2-08da0912ed65
X-MS-TrafficTypeDiagnostic: AM0PR01MB5923:EE_
X-Microsoft-Antispam: BCL:0;
X-Microsoft-Antispam-Message-Info: rvK3h6vKkH2c1fiaI9Zh0emFLuy8CJZ0Fpr0um9Rrxr7dh/5d0ane2CxbSzhvtPId0l5GYeW9FADtnDFEJKsUinfWOEDqxLnJk2xStjbey+pQ/4reNodIRKRbSjFZMGtyk1TguOeef3cxdfgWfh/wAgvKTvd1KyPgk+rtbwnY0H6Vwk/CHjYsaqd7Yev9ZkWvESqRI3JFU3+6AMIs3QdflIyHDW2OYhlp1xM4tPBx7lfLZhe0vSvQfUkyLA3RdjvuSsD5G/TixUdSIXfeplbV/BK1L0yEctiS1sim0TSokN2pcwbPa0xEphqP2X7luPRTuGLXsiva8jSZn2tqrTlqbML5DkrvIcqN66sd56bMZ4tatGdWH1T2EFFmBTF25+5SHgKq8bnVxuZCEk8lPLpMESAhoUV33+CBIiOMcSERZXIfjELj1lHxvjvcM8QmfGXDNavF11zrffEn39+5vhACT42pORqTs72sLBYuTST9YeJzz1XmenEXtiYhTghCDfRzXqM6ZEfFdFtXHVmpg9vWvleEYOxMN1IcQcAGforkL5S8LS6zGsEzw4fsjEo+vHHPDddzZY5zE9FLuf8hjAzPzjUtAaR6x7+an9G6AMv5ypV32NCLdpo1qiSH7qrJ6qx
X-MS-Exchange-AntiSpam-MessageData-ChunkCount: 1
X-MS-Exchange-AntiSpam-MessageData-0: =?utf-8?B?bVJod215TytPc2tSczV0Sm5CaWtwUXBZK0dMazdPZFNjcEZNSURGN0JUZjYw?=
 =?utf-8?B?UVhzanJVMTl0cnpISFJXRWZNVmhtWSt1VHIwNWpnNnJGQXlZZ2ppckdyUlNt?=
 =?utf-8?B?RVkyTzdTemQyVnB0UjBLZUJvQ0V0a1RQVndSeUNja1Zwb3lFUktmcUp2Tmlx?=
 =?utf-8?B?d05PS0hHMXByNm1sSHJ5MTEvd0VjNHF0Z21oSzhDcUUwOWFraGZKdjV4YUhG?=
 =?utf-8?B?dkx6Y0FsQk0vMFVVdXpCMlVITTd1YjU3alk3NHl5Q1RzektNSXZTd2x2bVlR?=
 =?utf-8?B?NytRaEgzYjBFQ1F1REtOajZlQ2JPY1kvdEpGOVdCM2RNMEZ2NUNpWGpnK3FV?=
 =?utf-8?B?WjY5clVyK3BoRk9QcERxNklKZGRtZXFQbGNDbTdFMVkwQmJ0c0ZBMkg2WDVl?=
 =?utf-8?B?b2dNTTdxUTdOZDF3NzN5em0wUklwZTZCajl6ODNkS0EzalZ2THNjU2JmZ2dn?=
 =?utf-8?B?dHIxWFJVVEhzL1JrU2NBUW1oL1V6ZHRjcmpRRjJUaTdwaUhtL09zRkxoOUFs?=
 =?utf-8?B?SVFzM0lpNjkyS0x1Sjl2Sjl2NDkxWGFnb01EM0ZtSDBZaGF3cDJtd1BpSWlq?=
 =?utf-8?B?Tk5LSytOUHVYa0ZXQTZqajRTdFMrQk5ubER3RkNSMFV5djNvMnAwZ1BnWVhk?=
 =?utf-8?B?ZmVYQmJKOUp3Q3F1cVhhU2RINHVjUUxKMi9OWXljSmhGQ2RuMnVnK0lPeTVU?=
 =?utf-8?B?N210YXlvWGJvM0xFS0J0dXpmMkxEcVJqZDdZVzV6bG9vdmZ3cVZXYXczV2dT?=
 =?utf-8?B?Y2pVZUp1aG5oTktOeC9UemtiKzJOaUlCOUYvQjJBcjNudTRmNzB1Ky9JUC94?=
 =?utf-8?B?MWRQdzlaTy9zSGMzTVhiaHg4V1E3UmN2LzlCZERDVU9YK05OZG9PSlMrR3dr?=
 =?utf-8?B?YUZKSDV6Q1ZOV2s4ZWpXZDh3OHpoSWJyUXo3d0Q1VG0xTldzRFNlaG5uVVFn?=
 =?utf-8?B?em56ZkhXWGRBeC9CWlVscytiMGpEdXBWTm5UdkNQdkE2NkVsUFFXSVc5bitu?=
 =?utf-8?B?NE93WlJudzRvckRPWkNzT0lyZ1orZm5EWkVQbGd5WkJnZWRyc3E2SldIUEtw?=
 =?utf-8?B?cU1SeTI0SGpzQ0M0Vm5NK3ZxV2lXV0Jyd0FjVFU2bDVQOGRaUnhONHlHa1ZZ?=
 =?utf-8?B?MHRvWXBweS9WeHY5RFhCelZWNmhYWGFvMzVFOWlvOFNrekIyNkFrU0p1UGFM?=
 =?utf-8?B?cHJkTnZjMmNpV0dtN1p1Y0g2ellFaTY4SVVNaXRhcENBdDdlaFc4VDIwOGVr?=
 =?utf-8?B?dlhaUW81ZUdCRFVLeGM1bVlyWE5zMUVkVXIyaDFOZUZyRjFyQXJQdXpReFR6?=
 =?utf-8?B?dzJYVFVXTXVWdkJUT1hiT2NTU2VHSzc5Z2xlZDNCWS9BRVh1WFpqSlBIQit1?=
 =?utf-8?B?akc1VWNZckFjdWl4WWJraFA5aHV4d1ovNURJVU1ucFhGaHU5SzliL29uQUxt?=
 =?utf-8?B?dWxvRys4WVA2SU1MVCtnb1RJQzZDYTNGbE9Pa1JoNk1WbysySXZLcUtaNkdB?=
 =?utf-8?B?aHZQaTFkc2x3dS9xR0ppUitoNTd6ZkRTa1RwbTI4N3hkK29wWGV1VEFVMFNT?=
 =?utf-8?Q?uQ0ksrri8pHecH8O8Mj8sXxRc=3D?=
X-OriginatorOrg: outlook.com
X-MS-Exchange-CrossTenant-Network-Message-Id: aa14ad0e-7397-4e29-0bf2-08da0912ed65
X-MS-Exchange-CrossTenant-AuthSource: AS1PR01MB9564.eurprd01.prod.exchangelabs.com
X-MS-Exchange-CrossTenant-AuthAs: Internal
X-MS-Exchange-CrossTenant-OriginalArrivalTime: 18 Mar 2022 19:10:12.6177 (UTC)
X-MS-Exchange-CrossTenant-FromEntityHeader: Hosted
X-MS-Exchange-CrossTenant-Id: 84df9e7f-e9f6-40af-b435-aaaaaaaaaaaa
X-MS-Exchange-CrossTenant-RMS-PersistedConsumerOrg: 00000000-0000-0000-0000-000000000000
X-MS-Exchange-Transport-CrossTenantHeadersStamped: AM0PR01MB5923
Subject: Re: [FFmpeg-devel] [PATCH 6/6] avcodec/vc1: Introduce fast path for
 unescaping bitstream buffer
X-BeenThere: ffmpeg-devel@ffmpeg.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: FFmpeg development discussions and patches <ffmpeg-devel.ffmpeg.org>
List-Unsubscribe: <https://ffmpeg.org/mailman/options/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=unsubscribe>
List-Archive: <https://ffmpeg.org/pipermail/ffmpeg-devel>
List-Post: <mailto:ffmpeg-devel@ffmpeg.org>
List-Help: <mailto:ffmpeg-devel-request@ffmpeg.org?subject=help>
List-Subscribe: <https://ffmpeg.org/mailman/listinfo/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=subscribe>
Reply-To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Errors-To: ffmpeg-devel-bounces@ffmpeg.org
Sender: "ffmpeg-devel" <ffmpeg-devel-bounces@ffmpeg.org>
Archived-At: <https://master.gitmailbox.com/ffmpegdev/AS1PR01MB9564305644E9C9578DA3F0E08F139@AS1PR01MB9564.eurprd01.prod.exchangelabs.com/>
List-Archive: <https://master.gitmailbox.com/ffmpegdev/>
List-Post: <mailto:ffmpegdev@gitmailbox.com>

Ben Avison:
> Populate with implementations suitable for 32-bit and 64-bit Arm.
> 
> Signed-off-by: Ben Avison <bavison@riscosopen.org>
> ---
>  libavcodec/aarch64/vc1dsp_init_aarch64.c |  60 ++++++++
>  libavcodec/aarch64/vc1dsp_neon.S         | 176 +++++++++++++++++++++++
>  libavcodec/arm/vc1dsp_init_neon.c        |  60 ++++++++
>  libavcodec/arm/vc1dsp_neon.S             | 118 +++++++++++++++
>  libavcodec/vc1dec.c                      |  20 +--
>  libavcodec/vc1dsp.c                      |   2 +
>  libavcodec/vc1dsp.h                      |   3 +
>  7 files changed, 429 insertions(+), 10 deletions(-)
> 
> diff --git a/libavcodec/aarch64/vc1dsp_init_aarch64.c b/libavcodec/aarch64/vc1dsp_init_aarch64.c
> index b672b2aa99..2fc2d5d1d3 100644
> --- a/libavcodec/aarch64/vc1dsp_init_aarch64.c
> +++ b/libavcodec/aarch64/vc1dsp_init_aarch64.c
> @@ -51,6 +51,64 @@ void ff_put_vc1_chroma_mc4_neon(uint8_t *dst, uint8_t *src, ptrdiff_t stride,
>  void ff_avg_vc1_chroma_mc4_neon(uint8_t *dst, uint8_t *src, ptrdiff_t stride,
>                                  int h, int x, int y);
>  
> +int ff_vc1_unescape_buffer_helper_neon(const uint8_t *src, int size, uint8_t *dst);
> +
> +static int vc1_unescape_buffer_neon(const uint8_t *src, int size, uint8_t *dst)
> +{
> +    /* Dealing with starting and stopping, and removing escape bytes, are
> +     * comparatively less time-sensitive, so are more clearly expressed using
> +     * a C wrapper around the assembly inner loop. Note that we assume a
> +     * little-endian machine that supports unaligned loads. */
> +    int dsize = 0;
> +    while (size >= 4)
> +    {
> +        int found = 0;
> +        while (!found && (((uintptr_t) dst) & 7) && size >= 4)
> +        {
> +            found = (*(uint32_t *)src &~ 0x03000000) == 0x00030000;
> +            if (!found)
> +            {
> +                *dst++ = *src++;
> +                --size;
> +                ++dsize;
> +            }
> +        }
> +        if (!found)
> +        {
> +            int skip = size - ff_vc1_unescape_buffer_helper_neon(src, size, dst);
> +            dst += skip;
> +            src += skip;
> +            size -= skip;
> +            dsize += skip;
> +            while (!found && size >= 4)
> +            {
> +                found = (*(uint32_t *)src &~ 0x03000000) == 0x00030000;
> +                if (!found)
> +                {
> +                    *dst++ = *src++;
> +                    --size;
> +                    ++dsize;
> +                }
> +            }
> +        }
> +        if (found)
> +        {
> +            *dst++ = *src++;
> +            *dst++ = *src++;
> +            ++src;
> +            size -= 3;
> +            dsize += 2;
> +        }
> +    }
> +    while (size > 0)
> +    {
> +        *dst++ = *src++;
> +        --size;
> +        ++dsize;
> +    }
> +    return dsize;
> +}
> +
>  av_cold void ff_vc1dsp_init_aarch64(VC1DSPContext *dsp)
>  {
>      int cpu_flags = av_get_cpu_flags();
> @@ -76,5 +134,7 @@ av_cold void ff_vc1dsp_init_aarch64(VC1DSPContext *dsp)
>          dsp->avg_no_rnd_vc1_chroma_pixels_tab[0] = ff_avg_vc1_chroma_mc8_neon;
>          dsp->put_no_rnd_vc1_chroma_pixels_tab[1] = ff_put_vc1_chroma_mc4_neon;
>          dsp->avg_no_rnd_vc1_chroma_pixels_tab[1] = ff_avg_vc1_chroma_mc4_neon;
> +
> +        dsp->vc1_unescape_buffer = vc1_unescape_buffer_neon;
>      }
>  }
> diff --git a/libavcodec/aarch64/vc1dsp_neon.S b/libavcodec/aarch64/vc1dsp_neon.S
> index c3ca3eae1e..8bdeffab44 100644
> --- a/libavcodec/aarch64/vc1dsp_neon.S
> +++ b/libavcodec/aarch64/vc1dsp_neon.S
> @@ -1374,3 +1374,179 @@ function ff_vc1_h_loop_filter16_neon, export=1
>          st2     {v2.b, v3.b}[7], [x6]
>  4:      ret
>  endfunc
> +
> +// Copy at most the specified number of bytes from source to destination buffer,
> +// stopping at a multiple of 32 bytes, none of which are the start of an escape sequence
> +// On entry:
> +//   x0 -> source buffer
> +//   w1 = max number of bytes to copy
> +//   x2 -> destination buffer, optimally 8-byte aligned
> +// On exit:
> +//   w0 = number of bytes not copied
> +function ff_vc1_unescape_buffer_helper_neon, export=1
> +        // Offset by 80 to screen out cases that are too short for us to handle,
> +        // and also make it easy to test for loop termination, or to determine
> +        // whether we need an odd number of half-iterations of the loop.
> +        subs    w1, w1, #80
> +        b.mi    90f
> +
> +        // Set up useful constants
> +        movi    v20.4s, #3, lsl #24
> +        movi    v21.4s, #3, lsl #16
> +
> +        tst     w1, #32
> +        b.ne    1f
> +
> +          ld1     {v0.16b, v1.16b, v2.16b}, [x0], #48
> +          ext     v25.16b, v0.16b, v1.16b, #1
> +          ext     v26.16b, v0.16b, v1.16b, #2
> +          ext     v27.16b, v0.16b, v1.16b, #3
> +          ext     v29.16b, v1.16b, v2.16b, #1
> +          ext     v30.16b, v1.16b, v2.16b, #2
> +          ext     v31.16b, v1.16b, v2.16b, #3
> +          bic     v24.16b, v0.16b, v20.16b
> +          bic     v25.16b, v25.16b, v20.16b
> +          bic     v26.16b, v26.16b, v20.16b
> +          bic     v27.16b, v27.16b, v20.16b
> +          bic     v28.16b, v1.16b, v20.16b
> +          bic     v29.16b, v29.16b, v20.16b
> +          bic     v30.16b, v30.16b, v20.16b
> +          bic     v31.16b, v31.16b, v20.16b
> +          eor     v24.16b, v24.16b, v21.16b
> +          eor     v25.16b, v25.16b, v21.16b
> +          eor     v26.16b, v26.16b, v21.16b
> +          eor     v27.16b, v27.16b, v21.16b
> +          eor     v28.16b, v28.16b, v21.16b
> +          eor     v29.16b, v29.16b, v21.16b
> +          eor     v30.16b, v30.16b, v21.16b
> +          eor     v31.16b, v31.16b, v21.16b
> +          cmeq    v24.4s, v24.4s, #0
> +          cmeq    v25.4s, v25.4s, #0
> +          cmeq    v26.4s, v26.4s, #0
> +          cmeq    v27.4s, v27.4s, #0
> +          add     w1, w1, #32
> +          b       3f
> +
> +1:      ld1     {v3.16b, v4.16b, v5.16b}, [x0], #48
> +        ext     v25.16b, v3.16b, v4.16b, #1
> +        ext     v26.16b, v3.16b, v4.16b, #2
> +        ext     v27.16b, v3.16b, v4.16b, #3
> +        ext     v29.16b, v4.16b, v5.16b, #1
> +        ext     v30.16b, v4.16b, v5.16b, #2
> +        ext     v31.16b, v4.16b, v5.16b, #3
> +        bic     v24.16b, v3.16b, v20.16b
> +        bic     v25.16b, v25.16b, v20.16b
> +        bic     v26.16b, v26.16b, v20.16b
> +        bic     v27.16b, v27.16b, v20.16b
> +        bic     v28.16b, v4.16b, v20.16b
> +        bic     v29.16b, v29.16b, v20.16b
> +        bic     v30.16b, v30.16b, v20.16b
> +        bic     v31.16b, v31.16b, v20.16b
> +        eor     v24.16b, v24.16b, v21.16b
> +        eor     v25.16b, v25.16b, v21.16b
> +        eor     v26.16b, v26.16b, v21.16b
> +        eor     v27.16b, v27.16b, v21.16b
> +        eor     v28.16b, v28.16b, v21.16b
> +        eor     v29.16b, v29.16b, v21.16b
> +        eor     v30.16b, v30.16b, v21.16b
> +        eor     v31.16b, v31.16b, v21.16b
> +        cmeq    v24.4s, v24.4s, #0
> +        cmeq    v25.4s, v25.4s, #0
> +        cmeq    v26.4s, v26.4s, #0
> +        cmeq    v27.4s, v27.4s, #0
> +        // Drop through...
> +2:        mov     v0.16b, v5.16b
> +          ld1     {v1.16b, v2.16b}, [x0], #32
> +        cmeq    v28.4s, v28.4s, #0
> +        cmeq    v29.4s, v29.4s, #0
> +        cmeq    v30.4s, v30.4s, #0
> +        cmeq    v31.4s, v31.4s, #0
> +        orr     v24.16b, v24.16b, v25.16b
> +        orr     v26.16b, v26.16b, v27.16b
> +        orr     v28.16b, v28.16b, v29.16b
> +        orr     v30.16b, v30.16b, v31.16b
> +          ext     v25.16b, v0.16b, v1.16b, #1
> +        orr     v22.16b, v24.16b, v26.16b
> +          ext     v26.16b, v0.16b, v1.16b, #2
> +          ext     v27.16b, v0.16b, v1.16b, #3
> +          ext     v29.16b, v1.16b, v2.16b, #1
> +        orr     v23.16b, v28.16b, v30.16b
> +          ext     v30.16b, v1.16b, v2.16b, #2
> +          ext     v31.16b, v1.16b, v2.16b, #3
> +          bic     v24.16b, v0.16b, v20.16b
> +          bic     v25.16b, v25.16b, v20.16b
> +          bic     v26.16b, v26.16b, v20.16b
> +        orr     v22.16b, v22.16b, v23.16b
> +          bic     v27.16b, v27.16b, v20.16b
> +          bic     v28.16b, v1.16b, v20.16b
> +          bic     v29.16b, v29.16b, v20.16b
> +          bic     v30.16b, v30.16b, v20.16b
> +          bic     v31.16b, v31.16b, v20.16b
> +        addv    s22, v22.4s
> +          eor     v24.16b, v24.16b, v21.16b
> +          eor     v25.16b, v25.16b, v21.16b
> +          eor     v26.16b, v26.16b, v21.16b
> +          eor     v27.16b, v27.16b, v21.16b
> +          eor     v28.16b, v28.16b, v21.16b
> +        mov     w3, v22.s[0]
> +          eor     v29.16b, v29.16b, v21.16b
> +          eor     v30.16b, v30.16b, v21.16b
> +          eor     v31.16b, v31.16b, v21.16b
> +          cmeq    v24.4s, v24.4s, #0
> +          cmeq    v25.4s, v25.4s, #0
> +          cmeq    v26.4s, v26.4s, #0
> +          cmeq    v27.4s, v27.4s, #0
> +        cbnz    w3, 90f
> +        st1     {v3.16b, v4.16b}, [x2], #32
> +3:          mov     v3.16b, v2.16b
> +            ld1     {v4.16b, v5.16b}, [x0], #32
> +          cmeq    v28.4s, v28.4s, #0
> +          cmeq    v29.4s, v29.4s, #0
> +          cmeq    v30.4s, v30.4s, #0
> +          cmeq    v31.4s, v31.4s, #0
> +          orr     v24.16b, v24.16b, v25.16b
> +          orr     v26.16b, v26.16b, v27.16b
> +          orr     v28.16b, v28.16b, v29.16b
> +          orr     v30.16b, v30.16b, v31.16b
> +            ext     v25.16b, v3.16b, v4.16b, #1
> +          orr     v22.16b, v24.16b, v26.16b
> +            ext     v26.16b, v3.16b, v4.16b, #2
> +            ext     v27.16b, v3.16b, v4.16b, #3
> +            ext     v29.16b, v4.16b, v5.16b, #1
> +          orr     v23.16b, v28.16b, v30.16b
> +            ext     v30.16b, v4.16b, v5.16b, #2
> +            ext     v31.16b, v4.16b, v5.16b, #3
> +            bic     v24.16b, v3.16b, v20.16b
> +            bic     v25.16b, v25.16b, v20.16b
> +            bic     v26.16b, v26.16b, v20.16b
> +          orr     v22.16b, v22.16b, v23.16b
> +            bic     v27.16b, v27.16b, v20.16b
> +            bic     v28.16b, v4.16b, v20.16b
> +            bic     v29.16b, v29.16b, v20.16b
> +            bic     v30.16b, v30.16b, v20.16b
> +            bic     v31.16b, v31.16b, v20.16b
> +          addv    s22, v22.4s
> +            eor     v24.16b, v24.16b, v21.16b
> +            eor     v25.16b, v25.16b, v21.16b
> +            eor     v26.16b, v26.16b, v21.16b
> +            eor     v27.16b, v27.16b, v21.16b
> +            eor     v28.16b, v28.16b, v21.16b
> +          mov     w3, v22.s[0]
> +            eor     v29.16b, v29.16b, v21.16b
> +            eor     v30.16b, v30.16b, v21.16b
> +            eor     v31.16b, v31.16b, v21.16b
> +            cmeq    v24.4s, v24.4s, #0
> +            cmeq    v25.4s, v25.4s, #0
> +            cmeq    v26.4s, v26.4s, #0
> +            cmeq    v27.4s, v27.4s, #0
> +          cbnz    w3, 91f
> +          st1     {v0.16b, v1.16b}, [x2], #32
> +        subs    w1, w1, #64
> +        b.pl    2b
> +
> +90:     add     w0, w1, #80
> +        ret
> +
> +91:     sub     w1, w1, #32
> +        b       90b
> +endfunc
> diff --git a/libavcodec/arm/vc1dsp_init_neon.c b/libavcodec/arm/vc1dsp_init_neon.c
> index f5f5c702d7..3aefbcaf6d 100644
> --- a/libavcodec/arm/vc1dsp_init_neon.c
> +++ b/libavcodec/arm/vc1dsp_init_neon.c
> @@ -84,6 +84,64 @@ void ff_put_vc1_chroma_mc4_neon(uint8_t *dst, uint8_t *src, ptrdiff_t stride,
>  void ff_avg_vc1_chroma_mc4_neon(uint8_t *dst, uint8_t *src, ptrdiff_t stride,
>                                  int h, int x, int y);
>  
> +int ff_vc1_unescape_buffer_helper_neon(const uint8_t *src, int size, uint8_t *dst);
> +
> +static int vc1_unescape_buffer_neon(const uint8_t *src, int size, uint8_t *dst)
> +{
> +    /* Dealing with starting and stopping, and removing escape bytes, are
> +     * comparatively less time-sensitive, so are more clearly expressed using
> +     * a C wrapper around the assembly inner loop. Note that we assume a
> +     * little-endian machine that supports unaligned loads. */

You should nevertheless use AV_RL32 for your unaligned LE loads

> +    int dsize = 0;
> +    while (size >= 4)
> +    {
> +        int found = 0;
> +        while (!found && (((uintptr_t) dst) & 7) && size >= 4)
> +        {
> +            found = (*(uint32_t *)src &~ 0x03000000) == 0x00030000;
> +            if (!found)
> +            {
> +                *dst++ = *src++;
> +                --size;
> +                ++dsize;
> +            }
> +        }
> +        if (!found)
> +        {
> +            int skip = size - ff_vc1_unescape_buffer_helper_neon(src, size, dst);
> +            dst += skip;
> +            src += skip;
> +            size -= skip;
> +            dsize += skip;
> +            while (!found && size >= 4)
> +            {
> +                found = (*(uint32_t *)src &~ 0x03000000) == 0x00030000;
> +                if (!found)
> +                {
> +                    *dst++ = *src++;
> +                    --size;
> +                    ++dsize;
> +                }
> +            }
> +        }
> +        if (found)
> +        {
> +            *dst++ = *src++;
> +            *dst++ = *src++;
> +            ++src;
> +            size -= 3;
> +            dsize += 2;
> +        }
> +    }
> +    while (size > 0)
> +    {
> +        *dst++ = *src++;
> +        --size;
> +        ++dsize;
> +    }
> +    return dsize;
> +}
> +
>  #define FN_ASSIGN(X, Y) \
>      dsp->put_vc1_mspel_pixels_tab[0][X+4*Y] = ff_put_vc1_mspel_mc##X##Y##_16_neon; \
>      dsp->put_vc1_mspel_pixels_tab[1][X+4*Y] = ff_put_vc1_mspel_mc##X##Y##_neon
> @@ -130,4 +188,6 @@ av_cold void ff_vc1dsp_init_neon(VC1DSPContext *dsp)
>      dsp->avg_no_rnd_vc1_chroma_pixels_tab[0] = ff_avg_vc1_chroma_mc8_neon;
>      dsp->put_no_rnd_vc1_chroma_pixels_tab[1] = ff_put_vc1_chroma_mc4_neon;
>      dsp->avg_no_rnd_vc1_chroma_pixels_tab[1] = ff_avg_vc1_chroma_mc4_neon;
> +
> +    dsp->vc1_unescape_buffer = vc1_unescape_buffer_neon;
>  }
> diff --git a/libavcodec/arm/vc1dsp_neon.S b/libavcodec/arm/vc1dsp_neon.S
> index 4ef083102b..9d7333cf12 100644
> --- a/libavcodec/arm/vc1dsp_neon.S
> +++ b/libavcodec/arm/vc1dsp_neon.S
> @@ -1804,3 +1804,121 @@ function ff_vc1_h_loop_filter16_neon, export=1
>  4:      vpop            {d8-d15}
>          pop             {r4-r6,pc}
>  endfunc
> +
> +@ Copy at most the specified number of bytes from source to destination buffer,
> +@ stopping at a multiple of 16 bytes, none of which are the start of an escape sequence
> +@ On entry:
> +@   r0 -> source buffer
> +@   r1 = max number of bytes to copy
> +@   r2 -> destination buffer, optimally 8-byte aligned
> +@ On exit:
> +@   r0 = number of bytes not copied
> +function ff_vc1_unescape_buffer_helper_neon, export=1
> +        @ Offset by 48 to screen out cases that are too short for us to handle,
> +        @ and also make it easy to test for loop termination, or to determine
> +        @ whether we need an odd number of half-iterations of the loop.
> +        subs    r1, r1, #48
> +        bmi     90f
> +
> +        @ Set up useful constants
> +        vmov.i32        q0, #0x3000000
> +        vmov.i32        q1, #0x30000
> +
> +        tst             r1, #16
> +        bne             1f
> +
> +          vld1.8          {q8, q9}, [r0]!
> +          vbic            q12, q8, q0
> +          vext.8          q13, q8, q9, #1
> +          vext.8          q14, q8, q9, #2
> +          vext.8          q15, q8, q9, #3
> +          veor            q12, q12, q1
> +          vbic            q13, q13, q0
> +          vbic            q14, q14, q0
> +          vbic            q15, q15, q0
> +          vceq.i32        q12, q12, #0
> +          veor            q13, q13, q1
> +          veor            q14, q14, q1
> +          veor            q15, q15, q1
> +          vceq.i32        q13, q13, #0
> +          vceq.i32        q14, q14, #0
> +          vceq.i32        q15, q15, #0
> +          add             r1, r1, #16
> +          b               3f
> +
> +1:      vld1.8          {q10, q11}, [r0]!
> +        vbic            q12, q10, q0
> +        vext.8          q13, q10, q11, #1
> +        vext.8          q14, q10, q11, #2
> +        vext.8          q15, q10, q11, #3
> +        veor            q12, q12, q1
> +        vbic            q13, q13, q0
> +        vbic            q14, q14, q0
> +        vbic            q15, q15, q0
> +        vceq.i32        q12, q12, #0
> +        veor            q13, q13, q1
> +        veor            q14, q14, q1
> +        veor            q15, q15, q1
> +        vceq.i32        q13, q13, #0
> +        vceq.i32        q14, q14, #0
> +        vceq.i32        q15, q15, #0
> +        @ Drop through...
> +2:        vmov            q8, q11
> +          vld1.8          {q9}, [r0]!
> +        vorr            q13, q12, q13
> +        vorr            q15, q14, q15
> +          vbic            q12, q8, q0
> +        vorr            q3, q13, q15
> +          vext.8          q13, q8, q9, #1
> +          vext.8          q14, q8, q9, #2
> +          vext.8          q15, q8, q9, #3
> +          veor            q12, q12, q1
> +        vorr            d6, d6, d7
> +          vbic            q13, q13, q0
> +          vbic            q14, q14, q0
> +          vbic            q15, q15, q0
> +          vceq.i32        q12, q12, #0
> +        vmov            r3, r12, d6
> +          veor            q13, q13, q1
> +          veor            q14, q14, q1
> +          veor            q15, q15, q1
> +          vceq.i32        q13, q13, #0
> +          vceq.i32        q14, q14, #0
> +          vceq.i32        q15, q15, #0
> +        orrs            r3, r3, r12
> +        bne             90f
> +        vst1.64         {q10}, [r2]!
> +3:          vmov            q10, q9
> +            vld1.8          {q11}, [r0]!
> +          vorr            q13, q12, q13
> +          vorr            q15, q14, q15
> +            vbic            q12, q10, q0
> +          vorr            q3, q13, q15
> +            vext.8          q13, q10, q11, #1
> +            vext.8          q14, q10, q11, #2
> +            vext.8          q15, q10, q11, #3
> +            veor            q12, q12, q1
> +          vorr            d6, d6, d7
> +            vbic            q13, q13, q0
> +            vbic            q14, q14, q0
> +            vbic            q15, q15, q0
> +            vceq.i32        q12, q12, #0
> +          vmov            r3, r12, d6
> +            veor            q13, q13, q1
> +            veor            q14, q14, q1
> +            veor            q15, q15, q1
> +            vceq.i32        q13, q13, #0
> +            vceq.i32        q14, q14, #0
> +            vceq.i32        q15, q15, #0
> +          orrs            r3, r3, r12
> +          bne             91f
> +          vst1.64         {q8}, [r2]!
> +        subs            r1, r1, #32
> +        bpl             2b
> +
> +90:     add             r0, r1, #48
> +        bx              lr
> +
> +91:     sub             r1, r1, #16
> +        b               90b
> +endfunc
> diff --git a/libavcodec/vc1dec.c b/libavcodec/vc1dec.c
> index 1c92b9d401..6a30b5b664 100644
> --- a/libavcodec/vc1dec.c
> +++ b/libavcodec/vc1dec.c
> @@ -490,7 +490,7 @@ static av_cold int vc1_decode_init(AVCodecContext *avctx)
>              size = next - start - 4;
>              if (size <= 0)
>                  continue;
> -            buf2_size = vc1_unescape_buffer(start + 4, size, buf2);
> +            buf2_size = v->vc1dsp.vc1_unescape_buffer(start + 4, size, buf2);
>              init_get_bits(&gb, buf2, buf2_size * 8);
>              switch (AV_RB32(start)) {
>              case VC1_CODE_SEQHDR:
> @@ -680,7 +680,7 @@ static int vc1_decode_frame(AVCodecContext *avctx, void *data,
>                  case VC1_CODE_FRAME:
>                      if (avctx->hwaccel)
>                          buf_start = start;
> -                    buf_size2 = vc1_unescape_buffer(start + 4, size, buf2);
> +                    buf_size2 = v->vc1dsp.vc1_unescape_buffer(start + 4, size, buf2);
>                      break;
>                  case VC1_CODE_FIELD: {
>                      int buf_size3;
> @@ -697,8 +697,8 @@ static int vc1_decode_frame(AVCodecContext *avctx, void *data,
>                          ret = AVERROR(ENOMEM);
>                          goto err;
>                      }
> -                    buf_size3 = vc1_unescape_buffer(start + 4, size,
> -                                                    slices[n_slices].buf);
> +                    buf_size3 = v->vc1dsp.vc1_unescape_buffer(start + 4, size,
> +                                                              slices[n_slices].buf);
>                      init_get_bits(&slices[n_slices].gb, slices[n_slices].buf,
>                                    buf_size3 << 3);
>                      slices[n_slices].mby_start = avctx->coded_height + 31 >> 5;
> @@ -709,7 +709,7 @@ static int vc1_decode_frame(AVCodecContext *avctx, void *data,
>                      break;
>                  }
>                  case VC1_CODE_ENTRYPOINT: /* it should be before frame data */
> -                    buf_size2 = vc1_unescape_buffer(start + 4, size, buf2);
> +                    buf_size2 = v->vc1dsp.vc1_unescape_buffer(start + 4, size, buf2);
>                      init_get_bits(&s->gb, buf2, buf_size2 * 8);
>                      ff_vc1_decode_entry_point(avctx, v, &s->gb);
>                      break;
> @@ -726,8 +726,8 @@ static int vc1_decode_frame(AVCodecContext *avctx, void *data,
>                          ret = AVERROR(ENOMEM);
>                          goto err;
>                      }
> -                    buf_size3 = vc1_unescape_buffer(start + 4, size,
> -                                                    slices[n_slices].buf);
> +                    buf_size3 = v->vc1dsp.vc1_unescape_buffer(start + 4, size,
> +                                                              slices[n_slices].buf);
>                      init_get_bits(&slices[n_slices].gb, slices[n_slices].buf,
>                                    buf_size3 << 3);
>                      slices[n_slices].mby_start = get_bits(&slices[n_slices].gb, 9);
> @@ -761,7 +761,7 @@ static int vc1_decode_frame(AVCodecContext *avctx, void *data,
>                      ret = AVERROR(ENOMEM);
>                      goto err;
>                  }
> -                buf_size3 = vc1_unescape_buffer(divider + 4, buf + buf_size - divider - 4, slices[n_slices].buf);
> +                buf_size3 = v->vc1dsp.vc1_unescape_buffer(divider + 4, buf + buf_size - divider - 4, slices[n_slices].buf);
>                  init_get_bits(&slices[n_slices].gb, slices[n_slices].buf,
>                                buf_size3 << 3);
>                  slices[n_slices].mby_start = s->mb_height + 1 >> 1;
> @@ -770,9 +770,9 @@ static int vc1_decode_frame(AVCodecContext *avctx, void *data,
>                  n_slices1 = n_slices - 1;
>                  n_slices++;
>              }
> -            buf_size2 = vc1_unescape_buffer(buf, divider - buf, buf2);
> +            buf_size2 = v->vc1dsp.vc1_unescape_buffer(buf, divider - buf, buf2);
>          } else {
> -            buf_size2 = vc1_unescape_buffer(buf, buf_size, buf2);
> +            buf_size2 = v->vc1dsp.vc1_unescape_buffer(buf, buf_size, buf2);
>          }
>          init_get_bits(&s->gb, buf2, buf_size2*8);
>      } else{
> diff --git a/libavcodec/vc1dsp.c b/libavcodec/vc1dsp.c
> index a29b91bf3d..11d493f002 100644
> --- a/libavcodec/vc1dsp.c
> +++ b/libavcodec/vc1dsp.c
> @@ -34,6 +34,7 @@
>  #include "rnd_avg.h"
>  #include "vc1dsp.h"
>  #include "startcode.h"
> +#include "vc1_common.h"
>  
>  /* Apply overlap transform to horizontal edge */
>  static void vc1_v_overlap_c(uint8_t *src, int stride)
> @@ -1030,6 +1031,7 @@ av_cold void ff_vc1dsp_init(VC1DSPContext *dsp)
>  #endif /* CONFIG_WMV3IMAGE_DECODER || CONFIG_VC1IMAGE_DECODER */
>  
>      dsp->startcode_find_candidate = ff_startcode_find_candidate_c;
> +    dsp->vc1_unescape_buffer      = vc1_unescape_buffer;
>  
>      if (ARCH_AARCH64)
>          ff_vc1dsp_init_aarch64(dsp);
> diff --git a/libavcodec/vc1dsp.h b/libavcodec/vc1dsp.h
> index c6443acb20..8be1198071 100644
> --- a/libavcodec/vc1dsp.h
> +++ b/libavcodec/vc1dsp.h
> @@ -80,6 +80,9 @@ typedef struct VC1DSPContext {
>       * one or more further zero bytes and a one byte.
>       */
>      int (*startcode_find_candidate)(const uint8_t *buf, int size);
> +
> +    /* Copy a buffer, removing startcode emulation escape bytes as we go */
> +    int (*vc1_unescape_buffer)(const uint8_t *src, int size, uint8_t *dst);
>  } VC1DSPContext;
>  
>  void ff_vc1dsp_init(VC1DSPContext* c);

1. You should add some benchmarks to the commit message.
2. The unescaping process for VC1 is basically the same as for H.264 and
HEVC* and for those we already have better optimized code in
libavcodec/h2645_parse.c. Can you check the performance of this code
here against (re)using the code from h2645_parse.c?
(3. Btw: The code in h2645_parse.c could even be optimized further along
the lines of
https://ffmpeg.org/pipermail/ffmpeg-devel/2019-June/245203.html (The
H.264 and VC1 parsers use a quite suboptimal startcode search; this
patch is part of a patchset I submitted ages ago to improve it.).)

- Andreas

*: Except for the fact that VC-1 seems to allow 0x00 0x00 0x03 0xXY with
0xXY > 3 (where the 0x03 is not escaped) to occur inside a EBDU; it also
allows 0x00 0x00 0x02 (while the informative process for encoders is the
same as for H.2645; it does not produce the byte sequences disallows by
H.264).
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".