Re: [FFmpeg-devel] [PATCH v6 14/14] vvcdec: add full vvc decoder

From: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
To: ffmpeg-devel@ffmpeg.org
Subject: Re: [FFmpeg-devel] [PATCH v6 14/14] vvcdec: add full vvc decoder
Date: Sat, 9 Dec 2023 06:14:57 +0100
Message-ID: <DU0P250MB074757365CDF9D12E95F5D188F89A@DU0P250MB0747.EURP250.PROD.OUTLOOK.COM> (raw)
In-Reply-To: <CAFXK13fKBgbSZ54HyqA0zHd6DdXEcu4+j6KtnXmQ04ueKhJLfw@mail.gmail.com>

Nuo Mi:
> Hi Andreas,
> thank you for the review.
> On Fri, Dec 8, 2023 at 8:17 PM Andreas Rheinhardt <
> andreas.rheinhardt@outlook.com> wrote:
> 
>>
>>> +
>>> +static int min_pu_arrays_init(VVCFrameContext *fc, const int
>> pic_size_in_min_pu)
>>> +{
>>> +    if (fc->tab.pic_size_in_min_pu != pic_size_in_min_pu) {
>>> +        min_pu_arrays_free(fc);
>>> +        fc->tab.msf  = av_mallocz(pic_size_in_min_pu);
>>> +        fc->tab.iaf  = av_mallocz(pic_size_in_min_pu);
>>> +        fc->tab.mmi  = av_mallocz(pic_size_in_min_pu);
>>> +        fc->tab.mvf  = av_mallocz(pic_size_in_min_pu *
>> sizeof(*fc->tab.mvf));
>>
>> Do these have to be separate allocations? If there were allocated
>> jointly, one memset below would suffice.
>>
> They are separate flags, if we combine them. We can't use memset to set
> flags for a block.
> 

I disagree: You would still be able to use different pointers for
different parts of the large allocated block, it is just that you also
save some unnecessary allocations (and frees and errors checks for the
allocations) and also gain the ability to memset them via one memset
call in case one wants to set them to the same value.

>>
>>> +
>>> +static int init_slice_context(SliceContext *sc, VVCFrameContext *fc,
>> const H2645NAL *nal, const CodedBitstreamUnit *unit)
>>> +{
>>> +    const VVCSH *sh             = &sc->sh;
>>> +    const H266RawSlice *slice   = (const H266RawSlice *)unit->content;
>>
>> Please no pointless casts. Also, why is there unnecessary whitespace in
>> front of '='?
>>
> Fix here and serval other places
> The whitespace will make all = in a col.
> 

But there is nothing that needs that much whitespace.

>>> +
>>> +static av_cold int frame_context_init(VVCFrameContext *fc,
>> AVCodecContext *avctx)
>>> +{
>>> +
>>> +    fc->avctx = av_memdup(avctx, sizeof(*avctx));
>>
>> When I read this, I presumed you are using multiple AVCodecContexts to
>> store the ever changing state of the AVCodecContext fields similarly to
>> update_context_from_thread() in pthread_frame.c. But it seems you don't.
>> These contexts are only used as a) logcontexts (where the actual
>> user-facing AVCodecContext should be used, so that the user can make
>> sense of the logmessages!), b) in ff_thread_get_buffer() and c) in
>> export_frame_params() where only some basic fields
>> (dimension-related+pix_fmt) is set. Presumably c) is done for b).
>>
> I remember if i did not use a local AVCodecContext  it would trigger some
> assert when resolution changed.
> 

Can you be more specific about what assert has been triggered? And have
you set the AVFrame fields directly before ff_thread_get_buffer()?

>>
>> But the user is allowed to change the provided callbacks in the master
>> context at any time. E.g. the call to ff_thread_get_buffer() in
>> vvc_refs.c currently uses the VVCFrameContext and therefore uses the
>> get_buffer2 callback in place now (during av_memdup()). This is wrong.
>>
> This will not happen. av_memdup only happens in vvc_decode_init.
> Nobody will call ff_thread_get_buffer at this time
> 

You missed the point: If the user changes the get_buffer2 callback after
init, the new callback will not be used at all.

>>
>> I think you can just remove VVCFrameContext.avctx and use the
>> user-facing AVCodecContext if you set the AVFrame properties that are
>> normally derived from the AVCodecContext directly on the AVFrame before
>> ff_thread_get_buffer().
> 
> Could you explain more about how to create a user-facing  AVCodecContext?
> 

You do not create a user-facing AVCodecContext, the user does (and calls
avcodec_send_packet()/avcodec_receive_frame() with it).

>>
>>> +
>>> +static int decode_nal_unit(VVCContext *s, VVCFrameContext *fc, const
>> H2645NAL *nal, const CodedBitstreamUnit *unit)
>>> +{
>>> +    int  ret;
>>> +
>>> +    s->temporal_id   = nal->temporal_id;
>>> +
>>> +    switch (unit->type) {
>>> +    case VVC_VPS_NUT:
>>> +    case VVC_SPS_NUT:
>>> +    case VVC_PPS_NUT:
>>> +        /* vps, sps, sps cached by s->cbc */
>>> +        break;
>>> +    case VVC_TRAIL_NUT:
>>> +    case VVC_STSA_NUT:
>>> +    case VVC_RADL_NUT:
>>> +    case VVC_RASL_NUT:
>>> +    case VVC_IDR_W_RADL:
>>> +    case VVC_IDR_N_LP:
>>> +    case VVC_CRA_NUT:
>>> +    case VVC_GDR_NUT:
>>> +        ret = decode_slice(s, fc, nal, unit);
>>> +        if (ret < 0)
>>> +            goto fail;
>>> +        break;
>>> +    case VVC_PREFIX_APS_NUT:
>>> +    case VVC_SUFFIX_APS_NUT:
>>> +        ret = ff_vvc_decode_aps(&s->ps, unit);
>>> +        if (ret < 0)
>>> +            goto fail;
>>> +        break;
>>> +    default:
>>> +        av_log(s->avctx, AV_LOG_INFO,
>>> +               "Skipping NAL unit %d\n", unit->type);
>>
>> This will probably be very noisy (and warn for every SEI). I don't think
>> it is even needed, as h2645_parse.c already contains debug log messages
>> to display the unit type.
>>
> It's copied from hevcdec. It means we did not handle the nal diffrent than
> h2645_parser.c messages
> 

1. The message is unnecessary, because a user who wants to know which
NAL units have been handled or not can get the info about which units
are present from h2645_parse.c and then look up in this list whether
this type is processed.
2. hevcdec.c does "handle" quite a lot more NAL units; e.g. it actually
handles SEI messages and it ignores e.g. Access unit delimiters as well
as HEVC_NAL_UNSPEC62. Whereas you do not.

> A fail that is only "return ret" is pointless (not only here).
>>
> At someday if we need to add some cleanup code. we do not need to change
> all returns to goto.
> 

IMO a goto fail should be added if and when it is actually beneficial.

>>> +
>>> +    if (output) {
>>> +        while (s->nb_delayed) {
>>> +            wait_delayed_frame(s, output, &got_output);
>>> +            if (got_output) {
>>> +                av_frame_unref(output);
>>> +            }
>>> +        }
>>> +        av_frame_free(&output);
>>> +    }
>>>  }
>>>
>>>  static av_cold int vvc_decode_free(AVCodecContext *avctx)
>>>  {
>>> +    VVCContext *s = avctx->priv_data;
>>> +    int i;
>>> +
>>> +    ff_cbs_fragment_free(&s->current_frame);
>>
>> Is it sure that the fragment is not in use (given that other threads may
>> be running now before vvc_decode_flush())?
>>
> Do you mean the executor threads? If they want to use some data, they will
> take their own hip.
> see ff_refstruct_replace(&sps->r, rsps);
> 

"hip"?

I have now noticed that the SliceContexts contain a reference to the
packet's data via VVCSH->H266RawSliceHeader, which in reality points to
a H266RawSlice which contains a reference to the actual data, so this
point is moot. But using H266RawSliceHeader* for a H266RawSlice* is not
nice.

>>
>>> +
>>> +    s->nb_fcs = (avctx->flags & AV_CODEC_FLAG_LOW_DELAY) ? 1 :
>> FFMIN(av_cpu_count(), VVC_MAX_FRMAE_DELAY);
>>
>> This may evaluate av_cpu_count() multiple times. Furthermore I don't
>> know why this define is used here at all: With frame threading, the
>> number of frame threads is not limited by the delay/number of reordering
>> frames at all (we even have frame-threading for decoders without
>> frame-reordering at all).
>>
>  vvc_decode_frame only allows 1 frame in 1 frame out. We can remove the
> delay if we switch to FFCodec->receive_frame,
> 

I do not get how this is supposed to address my point.

>>
>> But worst of this is that you do not check avctx->thread_count at all.
>>
> we do not use avctx->thread_count.  we use the executor to manage threads.
> 

This is complete nonsense: It is the user who specifies how many threads
to use, regardless of which mechanism is used to manage threads.

>>
>>> +    s->fcs = av_calloc(s->nb_fcs, sizeof(*s->fcs));
>>> +    if (!s->fcs)
>>> +        goto fail;
>>> +
>>> +    for (int i = 0; i < s->nb_fcs; i++) {
>>> +        VVCFrameContext *fc = s->fcs + i;
>>> +        ret = frame_context_init(fc, avctx);
>>> +        if (ret < 0)
>>> +            goto fail;
>>> +    }
>>> +
>>> +    s->executor = ff_vvc_executor_alloc(s, s->nb_fcs);
>>> +    if (!s->executor)
>>> +        goto fail;
>>> +
>>> +    s->eos = 1;
>>> +    GDR_SET_RECOVERED(s);
>>> +    memset(&ff_vvc_default_scale_m, 16, sizeof(ff_vvc_default_scale_m));
>>
>> This needs to be done once (i.e. protected by an AVOnce) and not every
>> time a decoder is set up. Otherwise there might be data races.
>>
> It's not read and set, it will no data races:), I can change it to  AVOnce
> .

This is wrong: It is set here and presumably it will be read somewhere,
so there absolutely can be data races.
If you believe that there is no data race because every memset sets it
to the same value, then you should be aware that the C specification
disagrees with you (all references are to the C11 spec):

a) 5.1.2.4 25: "The execution of a program contains a data race if it
contains two conflicting actions in different threads, at least one of
which is not atomic, and neither happens before the other. Any such data
race results in undefined behavior."
b) 5.1.2.4 4: "Two expression evaluations conflict if one of them
modifies a memory location and the other one reads or modifies the same
memory location."
c) Note 2 in 3.1 (to the definition of "access"):
"‘‘Modify’’ includes the case where the new value being stored is the
same as the previous value."

With the current code data races will happen if a) two different decoder
instances are initialized without synchronisation (given that lavc does
not serialize initialization of codecs (except in rare cases based upon
a flag which this decoder does not set), this synchronization would have
to be performed by the user, but we do not require our users to do this)
or b) a decoder is initialized while another decoder runs and reads from
ff_vvc_default_scale_m:

Because the accesses performed by the initing thread are always
modifications according to c), the accesses by the different threads
conflict by definition b). memset() is not required to perform atomic
modifications (and according to the standard atomic modifications can
only happen with atomic objects, which ff_vvc_default_scale_m is not)
and by our assumption there is no synchronisation between these actions,
so it is a data race according to a). And data races are undefined
behaviour.

This clause allows compilers to optimize lots of code as if the program
were single-threaded (because concurrent accesses were UB, so presumably
they don't happen). In particular, speculative writes are legal (and
happen sometimes, but probably not in memset). In fact, it would be
legal for memset to always zero the memory it is supposed to set and
then overwrite it with the final value.

- Andreas

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".