From: Niklas Haas <ffmpeg@haasn.xyz>
To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Subject: Re: [FFmpeg-devel] [RFC]] swscale modernization proposal
Date: Sat, 6 Jul 2024 14:27:55 +0200
Message-ID: <20240706142755.GB4446@haasn.xyz>
In-Reply-To: <20240705213406.GE4991@pb2>

On Fri, 05 Jul 2024 23:34:06 +0200 Michael Niedermayer <michael@niedermayer.cc> wrote:
> On Fri, Jul 05, 2024 at 08:31:17PM +0200, Niklas Haas wrote:
> > On Wed, 03 Jul 2024 15:25:58 +0200 Niklas Haas <ffmpeg@haasn.xyz> wrote:
> > > On Tue, 02 Jul 2024 15:27:00 +0200 Niklas Haas <ffmpeg@haasn.xyz> wrote:
> > > >
> > > > 1. Is this a good idea, or too confusing / complex to be worth the gain?
> > > >    Specifically, I am worried about confusion arising due to differences
> > > >    in behavior, and implemented options, between all of the above.
> > > >
> > > >    That said, I think there is a big win to be had from unifying all of
> > > >    the different scaling and/or conversion filters we have in e.g.
> > > >    libavfilter, as well as making it trivial for users of this API to
> > > >    try using e.g. GPU scaling instead of CPU scaling.
> > >
> > > After prototyping this approach a bit (using an internal struct
> > > AVScaleBackend), I think I like it. It specifically makes handling
> > > unscaled special converters pretty straightforward, for example - the
> > > "unscaled" backend can be separate from the generic/scaling backend.
> > >
> > > We could also trivially plug in something like libyuv, or some other
> > > limited-use-case fast path, without the user really noticing.
> >
> > Small update: I decided to scrap the idea of separate user-visible
> > "backends" for now, but preserved the internal API boundary between the
> > avscale_* "front-end" and the actual back-end implementation, which
> > I have called 'AVScaleGraph' for now.
> >
> > The idea is that this will grow into a full colorspace <-> colorspace
> > "solver", but for now it is just hooked up to sws_createContext().
> >
> > Attached is my revised working draft of <avscale.h>.
>
> I don't agree to the renaming of swscale, that is heading toward

I am not married to the name AVScaleContext; I only named it this way
because it is a new API and the old name is already taken. Consider it
a draft name - if you can come up with something better, we can switch
to it.

> > [...]
> > /**
> >  * The exact interpretation of these quality presets depends on the backend
> >  * used, but the backend-invariant common settings are derived as follows:
> >  */
> > enum AVScaleQuality {
> >     AV_SCALE_ULTRAFAST = 1,  /* no dither, nearest+nearest */
> >     AV_SCALE_SUPERFAST = 2,  /* no dither, bilinear+nearest */
> >     AV_SCALE_VERYFAST  = 3,  /* no dither, bilinear+bilinear */
> >     AV_SCALE_FASTER    = 4,  /* bayer dither, bilinear+bilinear */
> >     AV_SCALE_FAST      = 5,  /* bayer dither, bicubic+bilinear */
> >     AV_SCALE_MEDIUM    = 6,  /* bayer dither, bicubic+bicubic */
> >     AV_SCALE_SLOW      = 7,  /* bayer dither, lanczos+bicubic */
> >     AV_SCALE_SLOWER    = 8,  /* full dither, lanczos+bicubic */
> >     AV_SCALE_VERYSLOW  = 9,  /* full dither, lanczos+lanczos */
> >     AV_SCALE_PLACEBO   = 10, /* full dither, lanczos+lanczos */
>
> I don't think it's a good idea to hardcode dither and the "FIR" filter to
> the quality level in the API
>
> > [...]
> > /**
> >  * Like `avscale_frame`, but operates only on the (source) range from
> >  * `slice_start` to `slice_start + slice_height`.
> >  *
> >  * @param ctx The scaling context.
> >  * @param dst The destination frame. The data buffers may either be already
> >  *            allocated by the caller or left clear, in which case they will
> >  *            be allocated by the scaler. The latter may have performance
> >  *            advantages - e.g. in certain cases some (or all) output planes
> >  *            may be references to input planes, rather than copies.
> >  * @param src The source frame. If the data buffers are set to NULL, then
> >  *            this function behaves identically to `avscale_frame_setup`.
> >  * @param slice_start  First row of slice, relative to `src`. Must be a
> >  *                     multiple of avscale_slice_alignment(src).
> >  * @param slice_height Number of (source) rows in the slice. Must be a
> >  *                     multiple of avscale_slice_alignment(src).
> >  *
> >  * @return 0 on success, a negative AVERROR code on failure.
> >  */
> > int avscale_frame_slice(AVScaleContext *ctx, AVFrame *dst,
> >                         const AVFrame *src, int slice_start, int slice_height);
> >
> > /**
> >  * Like `avscale_frame`, but without actually scaling. It will instead merely
> >  * initialize internal state that *would* be required to perform the operation,
> >  * as well as returning the correct error code for unsupported frame
> >  * combinations.
> >  *
> >  * @param ctx The scaling context.
> >  * @param dst The destination frame to consider.
> >  * @param src The source frame to consider.
> >  * @return 0 on success, a negative AVERROR code on failure.
> >  */
> > int avscale_frame_setup(AVScaleContext *ctx, const AVFrame *dst,
> >                         const AVFrame *src);
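
To illustrate how I picture a caller driving this, here is a rough sketch
against the draft above. Everything named avscale_* is taken from the quoted
excerpt; the helper name scale_in_slices() is made up, `ctx` is assumed to
come from whatever constructor the final API ends up with, and I am assuming
that avscale_slice_alignment() returns the required row multiple as an int
and that a shorter final slice is permitted:

#include "avscale.h"           /* the draft header quoted above */
#include <libavutil/frame.h>
#include <libavutil/macros.h>  /* FFMIN */

static int scale_in_slices(AVScaleContext *ctx, AVFrame *dst, const AVFrame *src)
{
    /* validate the conversion up front, without touching any pixel data */
    int ret = avscale_frame_setup(ctx, dst, src);
    if (ret < 0)
        return ret;

    /* feed the source to the scaler slice by slice; both the start row and
     * the slice height have to respect the per-frame alignment */
    const int align = avscale_slice_alignment(src);
    for (int y = 0; y < src->height; y += align) {
        ret = avscale_frame_slice(ctx, dst, src, y,
                                  FFMIN(align, src->height - y));
        if (ret < 0)
            return ret;
    }
    return 0;
}

If a shorter final slice is *not* meant to be allowed, that would be worth
spelling out in the doxygen.
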
> somewhat off topic as this is public API, but the swscale filtering code
> could internally use libavutil/executor.h - with filters and slices that
> interdepend and need to execute both "in order" and in parallel, maybe
> that API is useful, not sure, just wanted to mention it

Here is the design that I think both works and minimizes work:

1. Split the input image up into slices, dispatch one slice per thread.

2. Input processing, horizontal scaling etc. happens in thread-local buffers.

3. The output of this process is written to a (shared) thread frame with a
   synchronization primitive per slice. (*)

That way, there is a (mostly) clear hierarchy - the high-level code splits
the frame into slices, each thread sees only one slice, and the dynamic
input/output/conversion/etc. code does not need to care about threading at
all. The only nontrivial component is the vertical scaler, which will have
to somehow wait for the following slice's completion before processing the
last few rows.

(*) To save a bit on memory, we could again use a ring buffer for the rows
    that are *not* shared between adjacent threads, but I think that is an
    optimization we can worry about after it works and is fast enough.
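
To make the synchronization in (3) a bit more concrete, here is a rough,
self-contained sketch of what I have in mind - plain pthreads with one
completion flag per slice. This is purely illustrative; the real code would
go through FFmpeg's threading wrappers and publish actual rows rather than
printing:

#include <pthread.h>
#include <stdio.h>

#define NUM_SLICES 4

typedef struct SliceSync {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    int             done[NUM_SLICES]; /* 1 once a slice's rows are written */
} SliceSync;

typedef struct Worker {
    SliceSync *sync;
    int        slice; /* index of the slice owned by this thread */
} Worker;

/* Block until `slice` has been fully written to the shared intermediate. */
static void wait_for_slice(SliceSync *s, int slice)
{
    pthread_mutex_lock(&s->lock);
    while (!s->done[slice])
        pthread_cond_wait(&s->cond, &s->lock);
    pthread_mutex_unlock(&s->lock);
}

static void mark_slice_done(SliceSync *s, int slice)
{
    pthread_mutex_lock(&s->lock);
    s->done[slice] = 1;
    pthread_cond_broadcast(&s->cond);
    pthread_mutex_unlock(&s->lock);
}

static void *run_slice(void *opaque)
{
    Worker *w = opaque;

    /* steps 1+2: input conversion and horizontal scaling into thread-local
     * buffers, then publish the result to the shared intermediate frame */
    mark_slice_done(w->sync, w->slice);

    /* step 3: the vertical filter taps for the last few output rows reach
     * into the slice below, so wait for that slice before finishing */
    if (w->slice + 1 < NUM_SLICES)
        wait_for_slice(w->sync, w->slice + 1);

    printf("slice %d done\n", w->slice);
    return NULL;
}

int main(void)
{
    static SliceSync sync = { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER };
    pthread_t threads[NUM_SLICES];
    Worker    workers[NUM_SLICES];

    for (int i = 0; i < NUM_SLICES; i++) {
        workers[i] = (Worker){ &sync, i };
        pthread_create(&threads[i], NULL, run_slice, &workers[i]);
    }
    for (int i = 0; i < NUM_SLICES; i++)
        pthread_join(threads[i], NULL);
    return 0;
}

Because every worker marks its own slice done before it starts waiting on
the slice below, the waits cannot deadlock regardless of scheduling order,
and only the last few rows of each slice ever block on a neighbour.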