[FFmpeg-devel] off: How to merge multiple H.264 streams into a single H.264 stream in HW/FPGA?

From: "Tamás Forgács" <forgi007@gmail.com>
To: ffmpeg-devel@ffmpeg.org
Subject: [FFmpeg-devel] off: How to merge multiple H.264 streams into a single H.264 stream in HW/FPGA?
Date: Sat, 12 Feb 2022 18:39:16 +0100
Message-ID: <CAMFiZG+q7pG6Y-wnBm4VR+smteFxBFxbWxVxyO7yC57zx8aV=Q@mail.gmail.com>
In-Reply-To: <CAMFiZGJ63uE==n5XLGjL53vEVyJWHp2MqJ19hF8Mp2vUZ6U9Lg@mail.gmail.com>

Sorry for posting something off-topic here, but I have little chance of
getting help elsewhere.

I encode camera images with several parallel H.264 baseline, I-frame-only
encoder cores in an FPGA (several parallel cores are needed to finish the
encoding fast enough without external memory). I use an open-source
[encoder][1]. In the simulator I use a 352x288 yuv420 input image and I
split it into 2 side-by-side halves (left and right, each 11 MBs wide).

The decoder runs on another embedded device, and I can decode the separate
slices correctly. But instead of several decoded video sequences, I need
just one video sequence (with the whole image) from a single H.264 byte
stream. I would prefer to merge the H.264 streams on the FPGA (instead of
post-processing the decoded slice video sequences on the embedded host).

So, the input image data looks like this:
    .--.-----.--.--.-----.--.
    |MB|     |MB|MB|     |MB|
    | 0| ... |10|11| ... |21|
    |--.     '--|--'     '--|
    |MB|        |           |
    |22|        |           |
    |--'        |           |
    |   Slice   |   Slice   |
    |     0     |     1     |
    '-----------'-----------'
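
To make the numbering in the diagram concrete, here is a minimal sketch of
the macroblock geometry, assuming the 352x288 test image and 2 cores. The
names (`core_of`, `cols_per_core`) are made up for illustration and are not
taken from the encoder.

```python
# Geometry of the 352x288 test image, split between 2 encoder cores.
# All names here are illustrative, not part of the hardh264 sources.
MB_SIZE = 16
WIDTH, HEIGHT = 352, 288
NUM_CORES = 2

mb_cols = WIDTH // MB_SIZE             # macroblocks per picture row
mb_rows = HEIGHT // MB_SIZE            # macroblock rows per picture
cols_per_core = mb_cols // NUM_CORES   # width of each half in MBs
total_mbs = mb_cols * mb_rows


def core_of(mb_addr: int) -> int:
    """Return the core (slice index in the diagram) owning a raster-order MB address."""
    return (mb_addr % mb_cols) // cols_per_core


if __name__ == "__main__":
    print(mb_cols, mb_rows, total_mbs)
    print([core_of(a) for a in (0, 10, 11, 21, 22)])
```

The total MB count computed here matches the number of concealed MBs that
ffmpeg reports for the whole I frame.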

And the encoder's block diagram is this (just one encoder core shown,
without merging):

                                 YUV420 MacroBlocks (16*16 pixel)
                                         |       |
                                  chroma |       | luma
                                         |       |
                      .----------------.-------. |
                      |                | |     | |
                      |                v v     v v
                 reconstruct      intra8x8cc  intra4x4 ---> header
                      ^                |        |             |
                      |                v        v             |
                inv.transform        coretransform            |
                      ^                |     |                |
                      |                |     v                |
                  dequantise           | dctransform          |
                   ^       ^           |     |                |
                   |       |           v     v                |
        inv.dc.transform <-'---------- quantise               |
                                           |                  |
                                           v                  |
                                         buffer               |
                                           |                  |
                                           v                  |
                                          cavlc               |
                                           |                  |
                                           v                  |
                                         tobytes <------------'
                                           |
                                           v
                                         H.264 byte stream

The encoded H.264 data from a single core (parsed with [H264Naked][2], see
[parse.txt][3], [ref.yuv420][4] and [ref.264][5]) contains 3 NAL units: the
first is a Sequence Parameter Set, the second a Picture Parameter Set and
the third a coded slice of an IDR picture. The SPS and PPS headers are
static and are generated on the host. The IDR header is generated by the
encoder on the FPGA. Currently I use a constant Quantization Parameter
(i.e. VBR; CBR with a variable QP is planned for later).
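
For checking the NAL layout without a full parser, a small sketch like the
following may help. It assumes the stream is in Annex B byte stream format
(start codes `00 00 01` or `00 00 00 01`) and only reads the
`nal_unit_type` from the first byte of each NAL unit. The function name
`split_annexb` is hypothetical.

```python
# Names of the three NAL unit types mentioned above (values from the H.264 spec).
NAL_NAMES = {5: "coded slice of an IDR picture", 7: "SPS", 8: "PPS"}


def split_annexb(data: bytes):
    """Yield (nal_unit_type, nal_bytes) for each NAL unit in an Annex B stream."""
    starts = []
    pos = 0
    while True:
        pos = data.find(b"\x00\x00\x01", pos)
        if pos < 0:
            break
        starts.append(pos + 3)  # first byte after the start code
        pos += 3
    for n, begin in enumerate(starts):
        end = starts[n + 1] - 3 if n + 1 < len(starts) else len(data)
        # A NAL unit never ends with 0x00, so trailing zeros are padding or
        # the leading zero of the next 4-byte start code.
        nal = data[begin:end].rstrip(b"\x00")
        if nal:
            yield nal[0] & 0x1F, nal


if __name__ == "__main__":
    # Tiny synthetic stream: only the NAL header bytes are meaningful here.
    demo = (b"\x00\x00\x00\x01\x67\x42"
            b"\x00\x00\x00\x01\x68\xce"
            b"\x00\x00\x01\x65\x88\x84")
    for nal_type, nal in split_annexb(demo):
        print(nal_type, NAL_NAMES.get(nal_type, "other"), len(nal))
```

Reading a real file such as ref.264 in binary mode and passing its bytes to
`split_annexb` should list the SPS, PPS and IDR slice in that order.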

To merge the encoded streams, I removed the IDR header from every slice
except slice 0 in the header block. Then I buffer the output of the CAVLC
block of each encoder for a whole MB row (this is the encoded MB data; the
MB headers can be identified from the output of the header block). From
this buffer I feed all the CAVLC data to a new instance of the tobytes
block in this sequence:
- CAVLC data of one MB row from slice 0
- CAVLC data of one MB row from slice 1
- the above is repeated till the end of the frame.
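
The sequence above can be sketched in software as follows. This is only a
model of the ordering, not of the hardware: it assumes 2 halves of 11 MBs
each and 18 MB rows, and the function names are invented for illustration.

```python
def merged_order(mb_rows: int, cols_per_slice: int, num_slices: int):
    """Yield (slice_idx, row, col_in_slice) in the order the data is fed to tobytes."""
    for row in range(mb_rows):
        for slice_idx in range(num_slices):
            for col in range(cols_per_slice):
                yield slice_idx, row, col


def picture_mb_addr(slice_idx: int, row: int, col: int,
                    cols_per_slice: int, num_slices: int) -> int:
    """Map a per-slice MB position to its raster-order address in the full picture."""
    return row * cols_per_slice * num_slices + slice_idx * cols_per_slice + col


if __name__ == "__main__":
    addrs = [picture_mb_addr(s, r, c, 11, 2) for s, r, c in merged_order(18, 11, 2)]
    print(addrs[:24])
    print(addrs == list(range(18 * 22)))
```

With this interleaving the MBs arrive in plain raster order of the full
picture, which is the order a decoder expects for a single slice that
covers the whole frame.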

When I decode this merged H.264 stream with ffmpeg, I get the following
error (see [merged_tobytes.264][6] and [ffplay_error.txt][7]):

```
[h264 @ 0x7f5f0c00ac00] dquant out of range (-112) at 13 1B f=0/0
[h264 @ 0x7f5f0c00ac00] error while decoding MB 13 1
[h264 @ 0x7f5f0c00ac00] concealing 396 DC, 396 AC, 396 MV errors in I frame
```

With the JM reference decoder I get this error ([jm_dec_error.txt][8]):

```
mb_qp_delta is out of range (-112)
illegal chroma intra pred mode!
```

I think the problem is that the decoder does not know about the slices and
expects neighboring MB data from the other slice, which is not available in
the encoded data. I have checked the macroblock prediction syntax in the
[standard][9] (chapter 7.3.5.1), but I do not see how I should correct this
(I have very little knowledge of H.264). On the encoder side, I see that
intra4x4 and intra8x8cc use the prediction and pixel data of the top and
left MB neighbors when those are available (within each slice there is no
top MB neighbor in the first MB row and no left MB neighbor for the first
MB of each MB row).
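
To see where the two views of neighbor availability differ, the following
sketch compares what a core that only knows its own half would see with
what a decoder would see if the merged stream were a single slice covering
the whole picture. It works at MB level only and assumes the 22x18 MB
layout with 11 MBs per core. In H.264 a neighboring MB counts as available
only if it lies in the same slice and has already been decoded, and both
intra prediction and the CAVLC coefficient-token context depend on that, so
a mismatch is one plausible source of a desynchronized bitstream. All names
are illustrative.

```python
MB_COLS, MB_ROWS, COLS_PER_CORE = 22, 18, 11

# Neighbor offsets (dx, dy) in MB units: A = left, B = top, C = top-right, D = top-left.
NEIGHBOURS = {"A": (-1, 0), "B": (0, -1), "C": (1, -1), "D": (-1, -1)}


def decoder_sees(x: int, y: int, name: str) -> bool:
    """Availability when the whole picture is treated as one slice."""
    dx, dy = NEIGHBOURS[name]
    nx, ny = x + dx, y + dy
    return 0 <= nx < MB_COLS and 0 <= ny < MB_ROWS


def encoder_sees(x: int, y: int, name: str) -> bool:
    """Availability when each core only knows the MBs of its own half."""
    dx, dy = NEIGHBOURS[name]
    nx, ny = x + dx, y + dy
    if not (0 <= nx < MB_COLS and 0 <= ny < MB_ROWS):
        return False
    return nx // COLS_PER_CORE == x // COLS_PER_CORE


def mismatches():
    """Return (x, y, [neighbor names]) for every MB where the two views differ."""
    out = []
    for y in range(MB_ROWS):
        for x in range(MB_COLS):
            bad = [n for n in NEIGHBOURS
                   if decoder_sees(x, y, n) != encoder_sees(x, y, n)]
            if bad:
                out.append((x, y, bad))
    return out


if __name__ == "__main__":
    for entry in mismatches()[:5]:
        print(entry)
```

In this model the differences are confined to the two MB columns on either
side of the split. Whether that is the actual cause of the errors above
would still have to be checked against the encoder.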

Regarding slicing into horizontal bands (which should be much easier in
theory): I do not have enough buffer for that (all the encoder cores have
to start processing data within the first or second MB row, and I do not
have a frame buffer). And if I had many small slices (like 2 slices per MB
row across the whole frame), then I think the compression efficiency would
drop dramatically.
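
For reference, the "2 slices per MB row" layout can be enumerated as below.
The sketch assumes the 22x18 MB picture with 11 MBs per core and only
computes `first_mb_in_slice`, the first field of every slice header, which
is coded as unsigned Exp-Golomb ue(v). The rest of the slice header is not
modeled, and the function names are made up. Note that with this layout the
MB above would belong to a different slice, so an encoder would presumably
have to treat the top neighbor as unavailable as well.

```python
MB_COLS, MB_ROWS, COLS_PER_CORE = 22, 18, 11


def ue(code_num: int) -> str:
    """Return the unsigned Exp-Golomb code ue(v) of code_num as a bit string."""
    value = code_num + 1
    return "0" * (value.bit_length() - 1) + format(value, "b")


def half_row_slices():
    """Yield (first_mb_in_slice, ue bits) for one slice per core per MB row."""
    for row in range(MB_ROWS):
        for core in range(MB_COLS // COLS_PER_CORE):
            first_mb = row * MB_COLS + core * COLS_PER_CORE
            yield first_mb, ue(first_mb)


if __name__ == "__main__":
    slices = list(half_row_slices())
    print(len(slices))
    for first_mb, bits in slices[:4]:
        print(first_mb, bits)
```

This gives a rough idea of the per-frame overhead: one NAL unit and one
slice header for each of these slices, in addition to the loss of
prediction across slice boundaries.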

I have also tried to use FMO and declared slice groups
(slice_group_map_type=2) as described [here][10], but FMO and slice groups
are not supported by typical H.264 decoders (see the ffmpeg [issue][11]; I
think they are used only in broadcast equipment). In fact, I also got an
error from the JM reference decoder even after I declared the slice groups
(see [merged_tobytes_with_slice_grops.264][12] and
[jm_dec_slice_groups_error.txt][13]):

```
warning: Intra_8x8_Horizontal prediction mode not allowed at mb 0
illegal chroma intra pred mode!
```
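
As a cross-check of the slice group declaration, here is a sketch of how
the MB-to-slice-group map for slice_group_map_type=2 ("foreground with
left-over") can be derived. Each foreground group is a rectangle given by a
top-left and a bottom-right MB address, and every MB not covered belongs to
the last group. The rectangle used below (left half as group 0, right half
as left-over) is an assumption for illustration and is not read from the
posted PPS.

```python
def foreground_leftover_map(mb_cols: int, mb_rows: int, rects):
    """Build the slice group map for map type 2.

    rects is a list of (top_left, bottom_right) MB addresses for slice
    groups 0..n-1; group n is the left-over group.
    """
    leftover = len(rects)
    sg_map = [leftover] * (mb_cols * mb_rows)
    # Lower-numbered groups are applied last, so they win where rectangles overlap.
    for group in range(len(rects) - 1, -1, -1):
        top_left, bottom_right = rects[group]
        x0, y0 = top_left % mb_cols, top_left // mb_cols
        x1, y1 = bottom_right % mb_cols, bottom_right // mb_cols
        for y in range(y0, y1 + 1):
            for x in range(x0, x1 + 1):
                sg_map[y * mb_cols + x] = group
    return sg_map


if __name__ == "__main__":
    # Assumed layout: left half (MB columns 0..10) is group 0, the rest is left-over.
    sg_map = foreground_leftover_map(22, 18, [(0, 17 * 22 + 10)])
    print(sg_map[:22])
```

With slice groups, the MBs of each slice are sent in raster order within
their own group, so the row-interleaved order described earlier would not
be what the decoder expects from such a stream.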

Any help is appreciated.

Reference photo is taken from [here][14].

PS: if someone can recommend a deblocker that could be used with this FPGA
core, that would also be nice.

  [1]: https://github.com/bcattle/hardh264/
  [2]: https://github.com/shi-yan/H264Naked
  [3]: https://www.dropbox.com/s/kktelcy06jmpa0t/h264naked_ref.txt?dl=0
  [4]: https://www.dropbox.com/s/mqkohwten6qax6z/ref.yuv420?dl=0
  [5]: https://www.dropbox.com/s/vbon3ul5jimkq60/ref.264?dl=0
  [6]: https://www.dropbox.com/s/ic3cms9p40g80od/merged_tobytes.264?dl=0
  [7]: https://www.dropbox.com/s/hkombdr087z7knf/ffplay_error.txt?dl=0
  [8]: https://www.dropbox.com/s/s42hxmjea5ul3qd/jm_dec_error.txt?dl=0
  [9]: https://www.itu.int/rec/dologin_pub.asp?lang=e&id=T-REC-H.264-202108-I!!PDF-E&type=items
  [10]: https://en.wikipedia.org/wiki/Flexible_Macroblock_Ordering
  [11]: https://www.mail-archive.com/ffmpeg-issues@lscube.org/msg03024.html
  [12]: https://www.dropbox.com/s/8mnavr5y4dgr469/merged_tobytes_with_slice_grops.264?dl=0
  [13]: https://www.dropbox.com/s/wgxnup7m19atm5j/jm_dec_slice_groups_error.txt?dl=0
  [14]: https://unsplash.com/photos/NuIbSaztf_g