From: Emma Worley <emma@emma.gg>
To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Subject: Re: [FFmpeg-devel] [PATCH v4 1/3] lavc/hashtable: create generic robin hood hash table
Date: Tue, 22 Apr 2025 10:57:30 -0700
Message-ID: <CAFQzNw4DkD-+xSoNQ5Uiv7OF8w-Awa4m6SQuq8KSb4y4tAuKyg@mail.gmail.com> (raw)
In-Reply-To: <aAfQ0r2EJVnfzBtg@phare.normalesup.org>
Perhaps I can add a `mode` enum parameter to the FFHashtableContext to
control behavior? Then we can benchmark different behaviors on a
per-use-case basis.
On Tue, Apr 22, 2025 at 10:24 AM Nicolas George <george@nsup.org> wrote:
>
> Emma Worley (HE12025-04-22):
> > Adds a generic hash table with the DXV encoder as an initial use case.
>
> Hi.
>
> This code is already really good. I have some local remarks that will
> not require a lot of work from you and make it better. And I have some
> global remarks that would require a lot of work from you and might shave
> a few cycles. For the rice wine of clarity, I have decided to send them
> separately. This is the mail about the later.
>
> After writing most of what follows I thought of checking how this data
> structure is used in practice. What I wrote is more relevant as the size
> of the entries grows, but in practice the entries are very small, which
> means the current version is probably close to the best. Still, I wrote
> it, I might as well send it.
>
> As I said, the code is already really good. I do not consider necessary
> to act on what I will say here, or even to read it. But it may contain
> ideas to do even better, for you or somebody else, for now or for later.
>
> This is about hash tables and algorithmic.
>
> Vocabulary as I will use it here:
>
> Hash bucket, or bucket: integer-indexed cell where the hash function
> directs us to start searching for a key.
>
> Entry: key and its associated value, as wanted by the code that uses the
> hash table.
>
> Entry slot, or slot: memory that can store an entry, especially relevant
> if pre-allocated.
>
> We all know the core difficulty with hash tables: different keys can map
> to the same hash value and therefore hash bucket. There are two kinds of
> solutions to this collision issue:
>
> - try to find another hash bucket to find an available entry slot;
>
> - make hash buckets capable of holding multiple entries, typically by
> the means of some kind of linked list.
>
> In the most C and simplistic implementation, hash buckets and entry
> slots are the same, and the code, common to add, get, del, just tries
> the next bucket until it finds the right key or a hole.
>
> In the most Java/GLib implementation, hash buckets are just pointers and
> each entry is a separately allocated slot. (Though I am sure the
> efficient Java hash tables are written much more in the C style.)
>
> The issue with the second way is that it requires an extra number
> (pointer/index, whatever) per bucket and per entry slot.
>
> The first way has two problems. Clustering ruins the efficiency the
> table. And deletion becomes hard, because just leaving a hole would
> prevent from finding an entry that has the be stored after the hole
> compared to its ideal place.
>
> Clustering can be reduced with a secondary hash function (or even more
> subtle): instead of looking at bucket h+1, look at bucket h+h'. Make
> sure h' and the size of the hash table do not have common factors. But
> that does not solve the issue of deletions.
>
> I was not familiar with this “Robin Hood” algorithm. From what I
> understand, it is a solution to both clustering and deletion. Excellent.
>
> Except it costs a number per bucket/slot.
>
> If we consider spending extra memory, we have to consider if the linked
> list solution might not be more efficient.
>
> This hash table has constant size. That makes it easy to pre-allocate
> all slots in a big array. It can be the same array as the buckets or a
> separate one.
>
> If we keep buckets = slots and if we accept to move an entry on
> occasion, then we can make it work with just one number per
> entry/bucket. When adding a new entry, if the bucket is not empty, check
> if it is a collision: same hash value? If it is a collision, then take
> any empty bucket/slot (keeping them chained is easy) for the new entry
> and link it to the existing one. If it is not a collision, then take the
> bucket/slot for the new entry after moving the entry that was there into
> an empty bucket.
>
> What I described above is not very nice, but it is the best I know / can
> find that only costs one number per bucket/slot. If we can afford to
> spend more memory on it, we can do better by the virtue of breaking the
> buckets = slots equivalence.
>
> It is especially interesting if the entries are large, because this
> version does not require copying entries. Also, we can have more buckets
> than slots, reducing collisions.
>
> All of this is very dependant on specifics. In particular, cache
> locality can make a solution that seems slightly worse actually better.
> It would require extensive benchmarking.
>
> Regards,
>
> --
> Nicolas George
> _______________________________________________
> ffmpeg-devel mailing list
> ffmpeg-devel@ffmpeg.org
> https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
>
> To unsubscribe, visit link above, or email
> ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
next prev parent reply other threads:[~2025-04-22 17:57 UTC|newest]
Thread overview: 9+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-04-22 7:42 Emma Worley
2025-04-22 7:42 ` [FFmpeg-devel] [PATCH v4 2/3] lavc/dxvenc: migrate DXT1 encoder to lavc hashtable Emma Worley
2025-04-22 7:42 ` [FFmpeg-devel] [PATCH v4 3/3] lavc/dxvenc: improve compatibility with Resolume products Emma Worley
2025-04-22 7:59 ` Andreas Rheinhardt
2025-04-22 8:14 ` Emma Worley
2025-04-22 17:24 ` [FFmpeg-devel] [PATCH v4 1/3] lavc/hashtable: create generic robin hood hash table Nicolas George
2025-04-22 17:57 ` Emma Worley [this message]
2025-04-22 22:04 ` Nicolas George
2025-04-22 22:02 ` Nicolas George
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=CAFQzNw4DkD-+xSoNQ5Uiv7OF8w-Awa4m6SQuq8KSb4y4tAuKyg@mail.gmail.com \
--to=emma@emma.gg \
--cc=ffmpeg-devel@ffmpeg.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Git Inbox Mirror of the ffmpeg-devel mailing list - see https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
This inbox may be cloned and mirrored by anyone:
git clone --mirror https://master.gitmailbox.com/ffmpegdev/0 ffmpegdev/git/0.git
# If you have public-inbox 1.1+ installed, you may
# initialize and index your mirror using the following commands:
public-inbox-init -V2 ffmpegdev ffmpegdev/ https://master.gitmailbox.com/ffmpegdev \
ffmpegdev@gitmailbox.com
public-inbox-index ffmpegdev
Example config snippet for mirrors.
AGPL code for this site: git clone https://public-inbox.org/public-inbox.git