* [FFmpeg-devel] [PATCH] slicethread: Limit the automatic number of threads to 16
From: Martin Storsjö @ 2022-09-05 12:20 UTC
To: ffmpeg-devel
This matches a similar cap on the number of automatic threads
in libavcodec/pthread_slice.c.
On systems with lots of cores, this does speed things up in
general (measurable on the level of the runtime of running
"make fate"), and fixes a couple fate failures in 32 bit mode on
such machines (where spawning a huge number of threads runs
out of address space).
---
libavutil/slicethread.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/libavutil/slicethread.c b/libavutil/slicethread.c
index ea1c9c8311..115b099736 100644
--- a/libavutil/slicethread.c
+++ b/libavutil/slicethread.c
@@ -24,6 +24,8 @@
 #include "thread.h"
 #include "avassert.h"

+#define MAX_AUTO_THREADS 16
+
 #if HAVE_PTHREADS || HAVE_W32THREADS || HAVE_OS2THREADS

 typedef struct WorkerContext {
@@ -105,7 +107,7 @@ int avpriv_slicethread_create(AVSliceThread **pctx, void *priv,
     if (!nb_threads) {
         int nb_cpus = av_cpu_count();
         if (nb_cpus > 1)
-            nb_threads = nb_cpus + 1;
+            nb_threads = FFMIN(nb_cpus + 1, MAX_AUTO_THREADS);
         else
             nb_threads = 1;
     }
--
2.25.1
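
For readers who want to try the patched branch in isolation, here is a minimal standalone sketch. It is not part of the patch: auto_thread_count() is a hypothetical helper that takes the CPU count as a parameter instead of calling av_cpu_count(), and FFMIN is re-declared locally only so the snippet compiles without the FFmpeg headers.

```c
#include <assert.h>

/* Cap taken from the patch. FFMIN is re-declared here only so that the
 * sketch builds without the FFmpeg headers. */
#define MAX_AUTO_THREADS 16
#define FFMIN(a, b) ((a) > (b) ? (b) : (a))

/* Hypothetical helper (not an FFmpeg function): mirrors the patched
 * "nb_threads == 0" branch of avpriv_slicethread_create(), with the
 * CPU count passed in rather than read from av_cpu_count(). */
static int auto_thread_count(int nb_cpus)
{
    assert(nb_cpus >= 0);
    if (nb_cpus > 1)
        return FFMIN(nb_cpus + 1, MAX_AUTO_THREADS);
    return 1;
}
```

Under these assumptions a 4-core machine still gets 5 threads, while anything from 15 cores upward is held at 16.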
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
* Re: [FFmpeg-devel] [PATCH] slicethread: Limit the automatic number of threads to 16
From: Martin Storsjö @ 2022-09-05 19:58 UTC
To: ffmpeg-devel
On Mon, 5 Sep 2022, Martin Storsjö wrote:
> This matches a similar cap on the number of automatic threads
> in libavcodec/pthread_slice.c.
>
> On systems with lots of cores, this does speed things up in
> general (measurable on the level of the runtime of running
> "make fate"), and fixes a couple fate failures in 32 bit mode on
> such machines (where spawning a huge number of threads runs
> out of address space).
> ---
On second thought - this observation that it speeds up "make -j$(nproc)
fate" isn't surprising at all; as long as there are jobs to saturate all
cores with the make level parallelism anyway, any threading within each
job just adds extra overhead, nothing more.
// Martin
* Re: [FFmpeg-devel] [PATCH] slicethread: Limit the automatic number of threads to 16
From: Lukas Fellechner @ 2022-09-06 19:50 UTC
To: ffmpeg-devel
>Sent: Monday, 5 September 2022 at 21:58
>From: "Martin Storsjö" <martin@martin.st>
>To: ffmpeg-devel@ffmpeg.org
>Subject: Re: [FFmpeg-devel] [PATCH] slicethread: Limit the automatic number of threads to 16
>On Mon, 5 Sep 2022, Martin Storsjö wrote:
>
>> This matches a similar cap on the number of automatic threads
>> in libavcodec/pthread_slice.c.
>>
>> On systems with lots of cores, this does speed things up in
>> general (measurable on the level of the runtime of running
>> "make fate"), and fixes a couple fate failures in 32 bit mode on
>> such machines (where spawning a huge number of threads runs
>> out of address space).
>> ---
>
> On second thought - this observation that it speeds up "make -j$(nproc)
> fate" isn't surprising at all; as long as there are jobs to saturate all
> cores with the make level parallelism anyway, any threading within each
> job just adds extra overhead, nothing more.
>
> // Martin
Agreed, this observation from massively parallel test runs does not tell
us much about real-world performance.

There are really two separate issues here:

1. Running out of address space in 32-bit processes

It probably makes sense to limit auto threads to 16, but it should only
be done in 32-bit processes. A 64-bit process should never run out of
address space. We should not cripple high-end machines running
64-bit applications.

Side notes about "it does not make sense to have more than 16 slices":

On 8K video, when using 32 threads, each thread will process 256 lines
or about 1MP (> FullHD!). Sure makes sense to me. But even for sw decoding
4K video, having more than 16 threads on a powerful machine makes sense.

Intel's next desktop CPUs will have up to 24 physical cores. The
proposed change would limit them to use only 16 cores, even on 64-bit.

2. Spawning too many threads when "auto" is used in multiple places

This can indeed be an efficiency problem, although probably not a major one.
Since usually only one part of the pipeline is active at any time,
many of the threads will be sleeping, consuming very few resources.

The issue only affects certain scenarios. If someone has such
a scenario and wants to optimize, they could explicitly set threads to
a lower value and see if it helps.

Putting an arbitrary limit on threads would only "solve" this issue
for the biggest CPUs (which have more than enough power anyway),
at the cost of crippling their performance in other scenarios.
A "normal" <= 8 core CPU might still end up with 16 threads for
the decoder, 16 threads for effects and 16 threads for encoding,
with 2/3 of them sleeping at any time.

--> The issue affects only certain scenarios. The proposed fix only
fixes it for a minority of all PCs, while it cripples performance
of these PCs in other scenarios.

--> I do not think that this 16 threads limit is a good idea.
IMHO "auto" should always use the logical CPU count,
except for 32-bit applications.
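
To make the alternative concrete, a minimal sketch of "logical CPU count, capped only on 32-bit" could look like the following. This is only an illustration of the suggestion in this message, not code from FFmpeg; the function names and the MAX_AUTO_THREADS_32BIT macro are made up for the example, and the pointer width is passed as a parameter so that both cases can be exercised on one machine.

```c
#include <assert.h>

/* Hypothetical name: cap applied only when pointers are 32 bits wide. */
#define MAX_AUTO_THREADS_32BIT 16

/* Hypothetical helper: "auto" uses the logical CPU count, except that
 * 32-bit builds are limited to MAX_AUTO_THREADS_32BIT. */
static int auto_thread_count_by_width(int nb_cpus, int pointer_bits)
{
    assert(nb_cpus >= 0);
    assert(pointer_bits > 0);
    if (nb_cpus <= 1)
        return 1;
    if (pointer_bits <= 32 && nb_cpus > MAX_AUTO_THREADS_32BIT)
        return MAX_AUTO_THREADS_32BIT;
    return nb_cpus;
}

/* Same rule, using the pointer width of the build being compiled. */
static int auto_thread_count_native(int nb_cpus)
{
    return auto_thread_count_by_width(nb_cpus, (int)(sizeof(void *) * 8));
}
```

With this rule a 24-core machine would get 24 threads in a 64-bit build and 16 in a 32-bit build.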

The only true solution to this problem would be adding a shared
thread pool. The application would create the pool at startup,
with the number of logical CPU cores as the maximum (maybe with a lower
limit on 32-bit). It would pass this pool to all created
decoders/encoders/filters. But doing this correctly is a major task, and
it would require major rework in all areas where multithreading is used
now. I am not sure the problem is really big enough to justify this effort.
* Re: [FFmpeg-devel] [PATCH] slicethread: Limit the automatic number of threads to 16
From: Martin Storsjö @ 2022-09-06 20:53 UTC
To: FFmpeg development discussions and patches
On Tue, 6 Sep 2022, Lukas Fellechner wrote:
> There are really two separate issues here:
>
> 1. Running out of address space in 32-bit processes
>
> It probably makes sense to limit auto threads to 16, but it should only
> be done in 32-bit processes.
FWIW, this was my first approach, until Andreas pointed out that we
already have such caps in all the other places where we pick an automatic
number of threads, including libavcodec/pthread_slice.c, where the limit
is already 16 threads today.

Also FWIW, this patch was already pushed, after being OK'd by Andreas on
IRC.
> A 64-bit process should never run out of address space. We should not
> cripple high end machines running 64-bit applications.
>
>
> Sidenotes about "it does not make sense to have more than 16 slices":
> On 8K video, when using 32 threads, each thread will process 256 lines
> or about 1MP (> FullHD!). Sure makes sense to me. But even for sw decoding
> 4K video, having more than 16 threads on a powerful machine makes sense.
>
> Intel's next desktop CPUs will have up to 24 physical cores. The
> proposed change would limit them to use only 16 cores, even on 64-bit.
>
>
> 2. Spawning too many threads when "auto" is used in multiple places
>
> This can indeed be an efficiency problem, although probably not major.
> Since usually only one part of the pipeline is active at any time,
> many of the threads will be sleeping, consuming very little resources.
For 32 bit processes running out of address space, yes, the issue is with
"auto" being used in many places at once.
But in general, allowing arbitrarily high numbers of auto threads isn't
beneficial - the optimal cap of threads depends a lot on the content at
hand.
The system I'm testing on has 160 cores - and it's quite certain that
doing slice threading with 160 slices doesn't make sense. Maybe the cap of
16 is indeed too low - I don't mind raising it to 32 or something like
that. Ideally, the auto mechanism would factor in the resolution of the
content.
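
A back-of-the-envelope sketch of why thread stacks matter for the 32-bit case on a machine like this one. All inputs are assumptions chosen for illustration, not measurements: 8 MiB of stack reserved per thread (a common default on Linux, but configurable), up to 161 threads per context (nb_cpus + 1 with 160 cores, as in the pre-patch code), and three contexts alive at once. stack_reservation_bytes() is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: address space reserved for thread stacks alone,
 * given the number of slice-thread contexts, the threads each one
 * spawns, and the stack size reserved per thread. */
static uint64_t stack_reservation_bytes(int nb_contexts,
                                        int threads_per_context,
                                        uint64_t stack_bytes)
{
    assert(nb_contexts >= 0 && threads_per_context >= 0);
    return (uint64_t)nb_contexts * (uint64_t)threads_per_context * stack_bytes;
}
```

Under those assumptions one uncapped context reserves 161 x 8 MiB = 1288 MiB, and three of them reserve 3864 MiB, which is already close to the 4 GiB that 32-bit pointers can address at all, before counting the heap or mapped libraries. With the cap of 16, three contexts reserve 384 MiB.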
Just for argument's sake, here's the output from 'time ffmpeg ...' for a
fairly straightforward transcode (decode, transpose, scale, encode), 1080p
10-bit input, 720p 8-bit output, with the number of threads set explicitly
("ffmpeg -threads N -i input -threads N -filter_threads N output").

12:
real    0m25.079s
user    5m22.318s
sys     0m5.047s

16:
real    0m19.967s
user    6m3.607s
sys     0m9.112s

20:
real    0m20.853s
user    6m21.841s
sys     0m28.829s

24:
real    0m20.642s
user    6m28.022s
sys     1m1.262s

32:
real    0m29.785s
user    6m8.442s
sys     4m45.290s

64:
real    1m0.808s
user    6m31.065s
sys     40m44.598s

I'm not testing this with 160 threads for each stage, since 64 already was
painfully slow - while you suggest that using threads==cores always should
be preferred, regardless of the number of cores. The optimum here seems to
be somewhere between 16 and 20.
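
One way to read the figures above is to look at how much of the CPU time is system time. The sketch below hard-codes the six runs (converted to seconds) and computes that fraction, plus the run with the lowest wall-clock time. The struct and function names are made up for this example; only the numbers come from the measurements.

```c
#include <assert.h>
#include <stddef.h>

typedef struct TimeSample {
    int    threads; /* N passed to -threads and -filter_threads */
    double real_s;  /* wall-clock seconds */
    double user_s;  /* user CPU seconds */
    double sys_s;   /* system (kernel) CPU seconds */
} TimeSample;

/* The six runs quoted above, converted from "XmY.Zs" to seconds. */
static const TimeSample samples[] = {
    { 12, 25.079, 322.318,    5.047 },
    { 16, 19.967, 363.607,    9.112 },
    { 20, 20.853, 381.841,   28.829 },
    { 24, 20.642, 388.022,   61.262 },
    { 32, 29.785, 368.442,  285.290 },
    { 64, 60.808, 391.065, 2444.598 },
};
static const size_t nb_samples = sizeof(samples) / sizeof(samples[0]);

/* Fraction of the total CPU time (user + sys) spent in the kernel. */
static double sys_fraction(const TimeSample *s)
{
    double total = s->user_s + s->sys_s;
    assert(total > 0.0);
    return s->sys_s / total;
}

/* Index of the run with the smallest wall-clock time. */
static size_t fastest_sample(const TimeSample *s, size_t n)
{
    size_t best = 0;
    assert(n > 0);
    for (size_t i = 1; i < n; i++)
        if (s[i].real_s < s[best].real_s)
            best = i;
    return best;
}
```

With these numbers the fastest run is N=16. User time stays in the same range throughout, while system time grows from under 3% of the CPU time at N=16 to roughly 86% at N=64.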
Also, in these cases, the decoder and encoder both warn that "Application
has requested N threads. Using a thread count greater than 16 is not
recommended" (see libavcodec/pthread.c).
I can also test varying only the -filter_threads parameter, while
keeping the decoder and encoder threads fixed at 16:

16:
real    0m20.303s
user    6m5.425s
sys     0m12.954s

20:
real    0m20.862s
user    6m12.625s
sys     0m21.860s

24:
real    0m20.445s
user    6m20.734s
sys     0m21.111s

32:
real    0m21.216s
user    6m15.926s
sys     0m42.264s

64:
real    0m20.687s
user    6m39.544s
sys     0m59.204s

Not quite as dramatic in this case, but (on this particular test clip,
mostly determined by the resolution) we still don't gain anything above 16
threads. On a larger test clip, the optimum number of slice threads
is probably a bit higher. But always using up to the number of cores isn't
really healthy.
// Martin
* Re: [FFmpeg-devel] [PATCH] slicethread: Limit the automatic number of threads to 16
From: Andreas Rheinhardt @ 2022-09-06 21:11 UTC
To: ffmpeg-devel
Lukas Fellechner:
>> Sent: Monday, 5 September 2022 at 21:58
>> From: "Martin Storsjö" <martin@martin.st>
>> To: ffmpeg-devel@ffmpeg.org
>> Subject: Re: [FFmpeg-devel] [PATCH] slicethread: Limit the automatic number of threads to 16
>> On Mon, 5 Sep 2022, Martin Storsjö wrote:
>>
>>> This matches a similar cap on the number of automatic threads
>>> in libavcodec/pthread_slice.c.
>>>
>>> On systems with lots of cores, this does speed things up in
>>> general (measurable on the level of the runtime of running
>>> "make fate"), and fixes a couple fate failures in 32 bit mode on
>>> such machines (where spawning a huge number of threads runs
>>> out of address space).
>>> ---
>>
>> On second thought - this observation that it speeds up "make -j$(nproc)
>> fate" isn't surprising at all; as long as there are jobs to saturate all
>> cores with the make level parallelism anyway, any threading within each
>> job just adds extra overhead, nothing more.
>>
>> // Martin
>
> Agreed, this observation of massively parallel test runs does not tell
> much about real world performance.
> There are really two separate issues here:
>
> 1. Running out of address space in 32-bit processes
>
> It probably makes sense to limit auto threads to 16, but it should only
> be done in 32-bit processes. A 64-bit process should never run out of
> address space. We should not cripple high end machines running
> 64-bit applications.
>
>
> Sidenotes about "it does not make sense to have more than 16 slices":
>
> On 8K video, when using 32 threads, each thread will process 256 lines
> or about 1MP (> FullHD!). Sure makes sense to me. But even for sw decoding
> 4K video, having more than 16 threads on a powerful machine makes sense.
>
> Intel's next desktop CPUs will have up to 24 physical cores. The
> proposed change would limit them to use only 16 cores, even on 64-bit.
>
This part is completely wrong: You can always set the number of threads
manually.
(Btw: 1. 8K refers to the horizontal resolution; the vertical resolution
is 4320 (when using 16:9), so with 32 threads every thread processes 135
lines, which have as many pixels as 540 lines of FullHD. 2. FullHD has
about 2MP.)
- Andreas
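
The slice arithmetic in this message can be checked with a few lines of C. This is only a sketch: slice_info() is a hypothetical helper, the frame sizes are the usual 16:9 ones (7680x4320 for 8K, the height that matches 135 lines per thread, and 1920x1080 for FullHD), and any remainder rows from the integer division are ignored.

```c
#include <assert.h>

typedef struct SliceInfo {
    int  lines;  /* rows handled by one thread */
    long pixels; /* pixels in those rows */
} SliceInfo;

/* Hypothetical helper: split a width x height frame evenly across
 * nb_threads horizontal slices (remainder rows are ignored). */
static SliceInfo slice_info(int width, int height, int nb_threads)
{
    SliceInfo s;
    assert(width > 0 && height > 0 && nb_threads > 0);
    s.lines  = height / nb_threads;
    s.pixels = (long)s.lines * width;
    return s;
}
```

For 8K with 32 threads this gives 135 lines and 1,036,800 pixels per slice, the same pixel count as 540 FullHD lines (1,036,800 / 1920), or half of a 2,073,600-pixel FullHD frame.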
* Re: [FFmpeg-devel] [PATCH] slicethread: Limit the automatic number of threads to 16
From: Lukas Fellechner @ 2022-09-11 18:42 UTC
To: ffmpeg-devel
Sent: Tuesday, 6 September 2022 at 23:11
From: "Andreas Rheinhardt" <andreas.rheinhardt@outlook.com>
To: ffmpeg-devel@ffmpeg.org
Subject: Re: [FFmpeg-devel] [PATCH] slicethread: Limit the automatic number of threads to 16
Lukas Fellechner:
>> 1. Running out of address space in 32-bit processes
>>
>> It probably makes sense to limit auto threads to 16, but it should only
>> be done in 32-bit processes. A 64-bit process should never run out of
>> address space. We should not cripple high end machines running
>> 64-bit applications.
>>
>>
>> Sidenotes about "it does not make sense to have more than 16 slices":
>>
>> On 8K video, when using 32 threads, each thread will process 256 lines
>> or about 1MP (> FullHD!). Sure makes sense to me. But even for sw decoding
>> 4K video, having more than 16 threads on a powerful machine makes sense.
>>
>> Intel's next desktop CPUs will have up to 24 physical cores. The
>> proposed change would limit them to use only 16 cores, even on 64-bit.
>
> This part is completely wrong: You can always set the number of threads
> manually.
> (Btw: 1. 8K refers to the horizontal resolution; the vertical resolution
> is 4320 (when using 16:9), so with 32 threads every thread processes 135
> lines, which have as many pixels as 540 lines of FullHD. 2. FullHD has
> about 2MP.)
>
> - Andreas
You are right. What I meant was: when someone does not explicitly set
threads, only 16 of their 24 cores will be used. I know that it is
always possible to manually override the auto value without limits.

And indeed I somehow confused the resolutions. Still, each thread would
process about 1MP of pixel data, which is a lot of data.
- Lukas
* Re: [FFmpeg-devel] [PATCH] slicethread: Limit the automatic number of threads to 16
From: Lukas Fellechner @ 2022-09-11 19:00 UTC
To: ffmpeg-devel
>> 2. Spawning too many threads when "auto" is used in multiple places
>>
>> This can indeed be an efficiency problem, although probably not major.
>> Since usually only one part of the pipeline is active at any time,
>> many of the threads will be sleeping, consuming very little resources.
>
> For 32 bit processes running out of address space, yes, the issue is with
> "auto" being used in many places at once.
>
> But in general, allowing arbitrarily high numbers of auto threads isn't
> beneficial - the optimal cap of threads depends a lot on the content at
> hand.
>
> The system I'm testing on has 160 cores - and it's quite certain that
> doing slice threading with 160 slices doesn't make sense. Maybe the cap of
> 16 is indeed too low - I don't mind raising it to 32 or something like
> that. Ideally, the auto mechanism would factor in the resolution of the
> content.
>
> Just for arguments sake - here's the output from 'time ffmpeg ...' for a
> fairly straightforward transcode (decode, transpose, scale, encode), 1080p
> input 10bit, 720p output 8bit, with explicitly setting the number of
> threads ("ffmpeg -threads N -i input -threads N -filter_threads N
> output").
>
> 12:
> real 0m25.079s
> user 5m22.318s
> sys 0m5.047s
>
> 16:
> real 0m19.967s
> user 6m3.607s
> sys 0m9.112s
>
> 20:
> real 0m20.853s
> user 6m21.841s
> sys 0m28.829s
>
> 24:
> real 0m20.642s
> user 6m28.022s
> sys 1m1.262s
>
> 32:
> real 0m29.785s
> user 6m8.442s
> sys 4m45.290s
>
> 64:
> real 1m0.808s
> user 6m31.065s
> sys 40m44.598s
>
> I'm not testing this with 160 threads for each stage, since 64 already was
> painfully slow - while you suggest that using threads==cores always should
> be preferred, regardless of the number of cores. The optimum here seems to
> be somewhere between 16 and 20.
These are interesting results. I would not have expected having too many
threads to have such a dramatic effect. You are probably right that always
using the core count as auto threads is not such a good idea.

But the encoding part works on 720p, so each of the 64 threads only
has 11 lines and about 14,000 pixels to process, which is really not much.
I do not have a CPU with that many cores, but when doing a 4K -> 4K
transcode, I do see a benefit from using 32 vs 16 cores.

Maybe the best approach would really be to base the auto thread count
on the number of pixels to process (I would not use the line count because
when the line count doubles, the pixel count usually goes up by a factor
of 4).

This would probably need some more test data. I will also try to do some
testing on my side.
- Lukas
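
As a rough illustration of the pixel-based idea, something along these lines could be a starting point. Everything here is hypothetical: the function name, the per-thread pixel budget and the hard cap are placeholders picked for the example, not values derived from the measurements in this thread or taken from FFmpeg.

```c
#include <assert.h>

/* Placeholder values, chosen only for illustration (not measured). */
#define HYP_MIN_PIXELS_PER_THREAD (128 * 1024)
#define HYP_HARD_THREAD_CAP       64

/* Hypothetical heuristic: give each thread at least
 * HYP_MIN_PIXELS_PER_THREAD pixels, and never exceed the CPU count
 * or the hard cap. */
static int auto_threads_for_frame(int width, int height, int nb_cpus)
{
    long long by_size;
    assert(width > 0 && height > 0 && nb_cpus >= 1);
    by_size = ((long long)width * height) / HYP_MIN_PIXELS_PER_THREAD;
    if (by_size < 1)
        by_size = 1;
    if (by_size > nb_cpus)
        by_size = nb_cpus;
    if (by_size > HYP_HARD_THREAD_CAP)
        by_size = HYP_HARD_THREAD_CAP;
    return (int)by_size;
}
```

With these placeholder constants and 160 CPUs, a 1280x720 frame gets 7 threads, 1920x1080 gets 15 and 3840x2160 gets 63; on an 8-core machine all three are limited to at most 8. Real constants would have to come from the kind of test data mentioned above.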