From: Nicolas George <george@nsup.org>
To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Subject: Re: [FFmpeg-devel] [PATCH v4 1/3] lavc/hashtable: create generic robin hood hash table
Date: Tue, 22 Apr 2025 19:24:34 +0200
Message-ID: <aAfQ0r2EJVnfzBtg@phare.normalesup.org> (raw)
In-Reply-To: <20250422074254.77328-1-emma@emma.gg>
Emma Worley (HE12025-04-22):
> Adds a generic hash table with the DXV encoder as an initial use case.
Hi.
This code is already really good. I have some local remarks that will
not require a lot of work from you and make it better. And I have some
global remarks that would require a lot of work from you and might shave
a few cycles. For the rice wine of clarity, I have decided to send them
separately. This is the mail about the later.
After writing most of what follows I thought of checking how this data
structure is used in practice. What I wrote is more relevant as the size
of the entries grows, but in practice the entries are very small, which
means the current version is probably close to the best. Still, I wrote
it, I might as well send it.
As I said, the code is already really good. I do not consider necessary
to act on what I will say here, or even to read it. But it may contain
ideas to do even better, for you or somebody else, for now or for later.
This is about hash tables and algorithmic.
Vocabulary as I will use it here:
Hash bucket, or bucket: integer-indexed cell where the hash function
directs us to start searching for a key.
Entry: key and its associated value, as wanted by the code that uses the
hash table.
Entry slot, or slot: memory that can store an entry, especially relevant
if pre-allocated.
We all know the core difficulty with hash tables: different keys can map
to the same hash value and therefore hash bucket. There are two kinds of
solutions to this collision issue:
- try to find another hash bucket to find an available entry slot;
- make hash buckets capable of holding multiple entries, typically by
the means of some kind of linked list.
In the most C and simplistic implementation, hash buckets and entry
slots are the same, and the code, common to add, get, del, just tries
the next bucket until it finds the right key or a hole.
In the most Java/GLib implementation, hash buckets are just pointers and
each entry is a separately allocated slot. (Though I am sure the
efficient Java hash tables are written much more in the C style.)
The issue with the second way is that it requires an extra number
(pointer/index, whatever) per bucket and per entry slot.
The first way has two problems. Clustering ruins the efficiency the
table. And deletion becomes hard, because just leaving a hole would
prevent from finding an entry that has the be stored after the hole
compared to its ideal place.
Clustering can be reduced with a secondary hash function (or even more
subtle): instead of looking at bucket h+1, look at bucket h+h'. Make
sure h' and the size of the hash table do not have common factors. But
that does not solve the issue of deletions.
I was not familiar with this “Robin Hood” algorithm. From what I
understand, it is a solution to both clustering and deletion. Excellent.
Except it costs a number per bucket/slot.
If we consider spending extra memory, we have to consider if the linked
list solution might not be more efficient.
This hash table has constant size. That makes it easy to pre-allocate
all slots in a big array. It can be the same array as the buckets or a
separate one.
If we keep buckets = slots and if we accept to move an entry on
occasion, then we can make it work with just one number per
entry/bucket. When adding a new entry, if the bucket is not empty, check
if it is a collision: same hash value? If it is a collision, then take
any empty bucket/slot (keeping them chained is easy) for the new entry
and link it to the existing one. If it is not a collision, then take the
bucket/slot for the new entry after moving the entry that was there into
an empty bucket.
What I described above is not very nice, but it is the best I know / can
find that only costs one number per bucket/slot. If we can afford to
spend more memory on it, we can do better by the virtue of breaking the
buckets = slots equivalence.
It is especially interesting if the entries are large, because this
version does not require copying entries. Also, we can have more buckets
than slots, reducing collisions.
All of this is very dependant on specifics. In particular, cache
locality can make a solution that seems slightly worse actually better.
It would require extensive benchmarking.
Regards,
--
Nicolas George
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
next prev parent reply other threads:[~2025-04-22 17:24 UTC|newest]
Thread overview: 9+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-04-22 7:42 Emma Worley
2025-04-22 7:42 ` [FFmpeg-devel] [PATCH v4 2/3] lavc/dxvenc: migrate DXT1 encoder to lavc hashtable Emma Worley
2025-04-22 7:42 ` [FFmpeg-devel] [PATCH v4 3/3] lavc/dxvenc: improve compatibility with Resolume products Emma Worley
2025-04-22 7:59 ` Andreas Rheinhardt
2025-04-22 8:14 ` Emma Worley
2025-04-22 17:24 ` Nicolas George [this message]
2025-04-22 17:57 ` [FFmpeg-devel] [PATCH v4 1/3] lavc/hashtable: create generic robin hood hash table Emma Worley
2025-04-22 22:04 ` Nicolas George
2025-04-22 22:02 ` Nicolas George
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=aAfQ0r2EJVnfzBtg@phare.normalesup.org \
--to=george@nsup.org \
--cc=ffmpeg-devel@ffmpeg.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Git Inbox Mirror of the ffmpeg-devel mailing list - see https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
This inbox may be cloned and mirrored by anyone:
git clone --mirror https://master.gitmailbox.com/ffmpegdev/0 ffmpegdev/git/0.git
# If you have public-inbox 1.1+ installed, you may
# initialize and index your mirror using the following commands:
public-inbox-init -V2 ffmpegdev ffmpegdev/ https://master.gitmailbox.com/ffmpegdev \
ffmpegdev@gitmailbox.com
public-inbox-index ffmpegdev
Example config snippet for mirrors.
AGPL code for this site: git clone https://public-inbox.org/public-inbox.git