From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Sun, 9 Mar 2025 22:28:07 +0100
Message-ID: <20250309222807.GB706835@haasn.xyz>
From: Niklas Haas <ffmpeg@haasn.xyz>
To: ffmpeg-devel@ffmpeg.org
In-Reply-To: <20250309221349.GG683063@haasn.xyz>
References: <20250308235342.GB669161@haasn.xyz>
 <20250309221349.GG683063@haasn.xyz>
MIME-Version: 1.0
Content-Disposition: inline
Subject: Re: [FFmpeg-devel] [RFC] New swscale internal design prototype
Archived-At: <https://master.gitmailbox.com/ffmpegdev/20250309222807.GB706835@haasn.xyz/>

On Sun, 09 Mar 2025 22:13:49 +0100 Niklas Haas <ffmpeg@haasn.xyz> wrote:
> The worst slowdowns are currently those involving any sort of packed swizzle
> for which there exist dedicated MMX functions currently:
> 
>   Conversion pass for bgr24 -> abgr:
>     [ u8 XXXX -> +++X] SWS_OP_READ         : 3 elem(s) packed >> 0
>     [ u8 ...X -> X+++] SWS_OP_SWIZZLE      : 0012
>     [ u8 X... -> ++++] SWS_OP_CLEAR        : {255 _ _ _}
>     [ u8 .... -> XXXX] SWS_OP_WRITE        : 4 elem(s) packed >> 0
>       (X = unused, + = exact, 0 = zero)
>   bgr24 1920x1080 -> abgr 1920x1080, flags=0 dither=1, SSIM {Y=0.999997 U=0.999989 V=1.000000 A=1.000000}
>     time=1710 us, ref=826 us, speedup=0.483x slower
> 
> I have previously identified these as a particularly weak spot in the compiler
> output, since no matter what C code I write, the result will always be roughly
> 0.5x compared to the existing hand-written MMX. That said, I also plan on taking
> that existing MMX code and simply plugging it into the new architecture, which
> should get rid of these last few slow cases.

I also wanted to point out that many of our conversions are more
*accurate* than the previous implementations. An illustrative example:

  Conversion pass for gray -> gray10le:
    [ u8 XXXX -> +XXX] SWS_OP_READ         : 1 elem(s) packed >> 0
    [ u8 .XXX -> +XXX] SWS_OP_CONVERT      : u8 -> f32
    [f32 .XXX -> .XXX] SWS_OP_SCALE        : * 341/85
    [f32 .XXX -> .XXX] SWS_OP_DITHER       : 16x16 matrix
    [f32 .XXX -> .XXX] SWS_OP_CLAMP        : 0 <= x <= {1023 _ _ _}
    [f32 .XXX -> +XXX] SWS_OP_CONVERT      : f32 -> u16
    [u16 .XXX -> XXXX] SWS_OP_WRITE        : 1 elem(s) packed >> 0
      (X = unused, + = exact, 0 = zero)
  gray 1920x1080 -> gray10le 1920x1080, flags=0 dither=1, SSIM {Y=0.999974 U=1.000000 V=1.000000 A=1.000000}
    time=1317 us, ref=1300 us, speedup=0.987x slower
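
As a rough per-pixel sketch of what the op list above does (this is not the
actual swscale code; the helper name is hypothetical, and I am assuming the
dither step adds a matrix entry in [0, 1) before a truncating conversion):

```c
#include <stdint.h>

/* Hypothetical helper, not part of swscale: one pixel of gray -> gray10le.
 * `dither` stands in for the 16x16 matrix entry at this pixel position and
 * is assumed to lie in [0, 1). */
static uint16_t gray8_to_gray10(uint8_t gray, float dither)
{
    float x = (float) gray;          /* SWS_OP_CONVERT: u8 -> f32 */
    x = x * 341.0f / 85.0f;          /* SWS_OP_SCALE: 341/85 == 1023/255 */
    x += dither;                     /* SWS_OP_DITHER (assumed additive) */
    if (x < 0.0f)    x = 0.0f;       /* SWS_OP_CLAMP: 0 <= x <= 1023 */
    if (x > 1023.0f) x = 1023.0f;
    return (uint16_t) x;             /* SWS_OP_CONVERT: f32 -> u16, truncating */
}
```

Under these assumptions, an input of 200 comes out as 802 for small dither
values and as 803 for large ones, while 0 and 255 still map to 0 and 1023.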

The reference implementation handles this as a full range shift:

  gray10 = (gray << 2) | (gray >> 6)

But this is *not* accurate and will therefore introduce round-trip error. For
example, a value of 200 produces 200 << 2 | 200 >> 6 = 803, while the correct
result would be 200 / 255 * 1023 = 802.3529411764706. Our new implementation
accurately handles this conversion in floating point math and dithers the
result down to a roughly 65%/35% mix of 802 and 803.
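
The arithmetic is easy to check by hand. A minimal sketch in C, with helper
names that are purely illustrative and not taken from either implementation,
and assuming an ideal dither that rounds up with probability equal to the
fractional part:

```c
#include <stdint.h>

/* Bit-replication expansion, as in the shift formula above. */
static unsigned expand_shift(uint8_t gray)
{
    return ((unsigned) gray << 2) | ((unsigned) gray >> 6);
}

/* Exact full-range rescale from [0, 255] to [0, 1023]. */
static double expand_exact(uint8_t gray)
{
    return gray * 1023.0 / 255.0;
}

/* Fractional part of the exact result. Under the ideal-dither assumption,
 * this is the share of pixels that end up at the next higher code value. */
static double round_up_share(uint8_t gray)
{
    double x = expand_exact(gray);
    return x - (double) (unsigned) x;
}
```

For an input of 200, expand_shift() gives 803, expand_exact() gives about
802.353, and round_up_share() gives about 0.353, i.e. roughly 35% of the
dithered pixels land on 803 and the rest on 802.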