Re: [FFmpeg-devel] [PATCH v4 1/3] lavc/hashtable: create generic robin hood hash table

From: Nicolas George <george@nsup.org>
To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Subject: Re: [FFmpeg-devel] [PATCH v4 1/3] lavc/hashtable: create generic robin hood hash table
Date: Tue, 22 Apr 2025 19:24:34 +0200
Message-ID: <aAfQ0r2EJVnfzBtg@phare.normalesup.org> (raw)
In-Reply-To: <20250422074254.77328-1-emma@emma.gg>

Emma Worley (HE12025-04-22):
> Adds a generic hash table with the DXV encoder as an initial use case.

Hi.

This code is already really good. I have some local remarks that will
not require a lot of work from you and make it better. And I have some
global remarks that would require a lot of work from you and might shave
a few cycles. For the rice wine of clarity, I have decided to send them
separately. This is the mail about the later.

After writing most of what follows I thought of checking how this data
structure is used in practice. What I wrote is more relevant as the size
of the entries grows, but in practice the entries are very small, which
means the current version is probably close to the best. Still, I wrote
it, I might as well send it.

As I said, the code is already really good. I do not consider necessary
to act on what I will say here, or even to read it. But it may contain
ideas to do even better, for you or somebody else, for now or for later.

This is about hash tables and algorithmic.

Vocabulary as I will use it here:

Hash bucket, or bucket: integer-indexed cell where the hash function
directs us to start searching for a key.

Entry: key and its associated value, as wanted by the code that uses the
hash table.

Entry slot, or slot: memory that can store an entry, especially relevant
if pre-allocated.

We all know the core difficulty with hash tables: different keys can map
to the same hash value and therefore hash bucket. There are two kinds of
solutions to this collision issue:

- try to find another hash bucket to find an available entry slot;

- make hash buckets capable of holding multiple entries, typically by
  the means of some kind of linked list.

In the most C and simplistic implementation, hash buckets and entry
slots are the same, and the code, common to add, get, del, just tries
the next bucket until it finds the right key or a hole.

In the most Java/GLib implementation, hash buckets are just pointers and
each entry is a separately allocated slot. (Though I am sure the
efficient Java hash tables are written much more in the C style.)

The issue with the second way is that it requires an extra number
(pointer/index, whatever) per bucket and per entry slot.

The first way has two problems. Clustering ruins the efficiency the
table. And deletion becomes hard, because just leaving a hole would
prevent from finding an entry that has the be stored after the hole
compared to its ideal place.

Clustering can be reduced with a secondary hash function (or even more
subtle): instead of looking at bucket h+1, look at bucket h+h'. Make
sure h' and the size of the hash table do not have common factors. But
that does not solve the issue of deletions.

I was not familiar with this “Robin Hood” algorithm. From what I
understand, it is a solution to both clustering and deletion. Excellent.

Except it costs a number per bucket/slot.

If we consider spending extra memory, we have to consider if the linked
list solution might not be more efficient.

This hash table has constant size. That makes it easy to pre-allocate
all slots in a big array. It can be the same array as the buckets or a
separate one.

If we keep buckets = slots and if we accept to move an entry on
occasion, then we can make it work with just one number per
entry/bucket. When adding a new entry, if the bucket is not empty, check
if it is a collision: same hash value? If it is a collision, then take
any empty bucket/slot (keeping them chained is easy) for the new entry
and link it to the existing one. If it is not a collision, then take the
bucket/slot for the new entry after moving the entry that was there into
an empty bucket.

What I described above is not very nice, but it is the best I know / can
find that only costs one number per bucket/slot. If we can afford to
spend more memory on it, we can do better by the virtue of breaking the
buckets = slots equivalence.

It is especially interesting if the entries are large, because this
version does not require copying entries. Also, we can have more buckets
than slots, reducing collisions.

All of this is very dependant on specifics. In particular, cache
locality can make a solution that seems slightly worse actually better.
It would require extensive benchmarking.

Regards,

-- 
  Nicolas George
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".