For the purpose of merging streams in a stream group, a filtergraph may have
to be created before we know whether it will actually be used. Therefore,
allow filtergraphs to be removed from the scheduler after being added.
Signed-off-by: James Almer <jamrial@gmail.com>
Currently, while the filter graph is initially being created, the scheduler
continues demuxing frames on the last input that happened to be active before
the filter graph was complete.
This can lead to an excessive number of decoded frames piling up on this
input, regardless of whether it will actually be requested by the configured
filter graph. Suspending the filter graph during this initialization phase
reduces the amount of wasted memory.
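A minimal sketch of the idea, with illustrative names that do not match the
actual scheduler code: while a filtergraph is still being configured, none of
its inputs report demand, so the scheduler stops feeding whichever input
happened to be active last.

    typedef struct FilterGraphState {
        int  configured;    // set once graph configuration has finished
        int *input_needed;  // per-input demand flags set by the graph
    } FilterGraphState;

    // The scheduler only keeps demuxing/decoding for input i while the
    // graph is configured and actually requests data on that input.
    static int fg_wants_input(const FilterGraphState *fg, int i)
    {
        if (!fg->configured)
            return 0;       // graph still initializing: suspend all inputs
        return fg->input_needed[i];
    }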
I tested this extensively under different conditions and could not come up
with any scenario where using a larger queue size was actually beneficial.
Moreover, such a large default queue is very wasteful, especially for larger
frame sizes, and can in the worst case lead to an extra ~50% memory footprint
per input (with the default 16 threads), regardless of whether that input is
currently in use.
My methodology was to add logging in the event of a queue underrun/overrun,
and then observe the frequency of such events in practice, as well as the
impact on performance. I came up with an example filter graph involving
decoding, filtering and encoding with several input files and various changes
to move the bottleneck around.
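As a rough illustration (a generic bounded producer/consumer queue, not the
actual fftools thread queue), such counters record an underrun whenever the
consumer finds the queue empty and an overrun whenever the producer finds it
full:

    #include <pthread.h>

    #define QUEUE_CAP 2

    typedef struct FrameQueue {
        void           *slots[QUEUE_CAP];
        int             head, count;
        unsigned        underruns, overruns;
        pthread_mutex_t lock;
        pthread_cond_t  not_full, not_empty;
    } FrameQueue;

    // Example instance, set up with static initializers.
    static FrameQueue fq = {
        .lock      = PTHREAD_MUTEX_INITIALIZER,
        .not_full  = PTHREAD_COND_INITIALIZER,
        .not_empty = PTHREAD_COND_INITIALIZER,
    };

    static void fq_push(FrameQueue *q, void *frame)
    {
        pthread_mutex_lock(&q->lock);
        if (q->count == QUEUE_CAP)
            q->overruns++;                 // producer must wait: queue full
        while (q->count == QUEUE_CAP)
            pthread_cond_wait(&q->not_full, &q->lock);
        q->slots[(q->head + q->count++) % QUEUE_CAP] = frame;
        pthread_cond_signal(&q->not_empty);
        pthread_mutex_unlock(&q->lock);
    }

    static void *fq_pop(FrameQueue *q)
    {
        void *frame;
        pthread_mutex_lock(&q->lock);
        if (q->count == 0)
            q->underruns++;                // consumer must wait: queue empty
        while (q->count == 0)
            pthread_cond_wait(&q->not_empty, &q->lock);
        frame = q->slots[q->head];
        q->head = (q->head + 1) % QUEUE_CAP;
        q->count--;
        pthread_cond_signal(&q->not_full);
        pthread_mutex_unlock(&q->lock);
        return frame;
    }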
I found that, in all configurations I tested, with all thread counts and
bottlenecks, using a queue size of 2 frames yielded practically identical
performance to a queue size of 8 frames. I was only able to consistently
measure a slowdown when restricting the queue to a single frame, where
underruns accounted for almost 1.1% of frame events in the worst case.
A summary of my test log follows:
= Bottleneck in decoder =
ffmpeg -i A -i B -i C -filter_complex "concat=n=3" -f null -
== 16 threads ==
=== Queue statistics (dec -> filtergraph) ===
- 8 frames = 91355 underruns, 1 overrun
- 4 frames = 91381 underruns, 2 overruns
- 2 frames = 91326 underruns, 21 overruns
- 1 frame = 91284 underruns, 102 overruns
=== Time elapsed ===
- 8 frames = 14.37s
- 4 frames = 14.28s
- 2 frames = 14.27s
- 1 frame = 14.35s
== 1 thread ==
=== Queue statistics (dec -> filtergraph) ===
- 8 frames = 91801 underruns, 0 overruns
- 4 frames = 91929 underruns, 1 overrun
- 2 frames = 91854 underruns, 7 overruns
- 1 frame = 91745 underruns, 83 overruns
=== Time elapsed ===
- 8 frames = 39.51s
- 4 frames = 39.94s
- 2 frames = 39.91s
- 1 frame = 41.69s
= Bottleneck in filter graph =
ffmpeg -i A -i B -i C -filter_complex "concat=n=3,scale=3840x2160" -f null -
== 16 threads ==
=== Queue statistics (dec -> filtergraph) ===
- 8 frames = 277 underruns, 84673 overruns
- 4 frames = 640 underruns, 86523 overruns
- 2 frames = 850 underruns, 88751 overruns
- 1 frame = 1028 underruns, 89957 overruns
=== Time elapsed ===
- 8 frames = 26.35s
- 4 frames = 26.31s
- 2 frames = 26.38s
- 1 frame = 26.55s
== 1 thread ==
=== Queue statistics (dec -> filtergraph) ===
- 8 frames = 29746 underruns, 57033 overruns
- 4 frames = 29940 underruns, 58948 overruns
- 2 frames = 30160 underruns, 60185 overruns
- 1 frame = 30259 underruns, 61126 overruns
=== Time elapsed ===
- 8 frames = 52.08s
- 4 frames = 52.49s
- 2 frames = 52.25s
- 1 frame = 52.69s
= Bottleneck in encoder =
ffmpeg -i A -i B -i C -filter_complex "concat=n=3" -c:v libx264 -preset veryfast -f null -
== 1 thread ==
=== Queue statistics (filtergraph -> enc) ===
- 8 frames = 26763 underruns, 63535 overruns
- 4 frames = 26863 underruns, 63810 overruns
- 2 frames = 27243 underruns, 63839 overruns
- 1 frame = 27670 underruns, 63953 overruns
=== Time elapsed ===
- 8 frames = 89.45s
- 4 frames = 89.04s
- 2 frames = 89.24s
- 1 frame = 90.26s
Since e0da916b8f, the ffmpeg utility has
held multiple frames output by the decoder in internal queues without
telling the decoder that it is going to do so. When the decoder has a
fixed-size pool of frames (common in some hardware APIs where the output
frames must be stored as an array texture) this could lead to the pool
being exhausted and the decoder getting stuck. Fix this by telling the
decoder to allocate additional frames according to the queue size.
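The field in question is AVCodecContext.extra_hw_frames; a hedged sketch of
the idea follows (the wrapper function and its queue_size parameter are
illustrative, not the exact code from this change):

    #include <libavcodec/avcodec.h>

    // Reserve extra surfaces in a fixed-size hardware frame pool to cover
    // the frames that may sit in inter-thread queues downstream of the
    // decoder. queue_size stands in for the actual queue capacity.
    static int dec_open(AVCodecContext *dec_ctx, const AVCodec *codec,
                        int queue_size)
    {
        if (dec_ctx->extra_hw_frames >= 0)
            dec_ctx->extra_hw_frames += queue_size; // add to user's value
        else
            dec_ctx->extra_hw_frames = queue_size;  // field defaults to -1

        return avcodec_open2(dec_ctx, codec, NULL);
    }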
These functions used to be passed directly to pthread_create(), which
required them to return void*. This is no longer the case, so they can
return a plain int.
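For context, a function passed directly to pthread_create() must have the
signature void *(*)(void *); with a trampoline in between, workers can return
a plain int. A hypothetical sketch of that pattern (names are illustrative):

    #include <pthread.h>

    typedef int (*ThreadFunc)(void *arg);

    typedef struct ThreadCtx {
        ThreadFunc func;  // worker that now returns a plain int
        void      *arg;
        int        ret;   // captured return value, read after joining
    } ThreadCtx;

    // pthread_create() still gets the void *(*)(void *) signature it
    // requires; the trampoline adapts the int-returning worker to it.
    static void *thread_trampoline(void *arg)
    {
        ThreadCtx *ctx = arg;
        ctx->ret = ctx->func(ctx->arg);
        return NULL;
    }

A thread is then started with pthread_create(&tid, NULL, thread_trampoline,
&ctx), and ctx.ret is inspected after pthread_join().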
Use 8 packets/frames by default rather than 1, which seems to provide
better throughput.
Allow -thread_queue_size to set the muxer queue size manually again.
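For example (illustrative values; per the ffmpeg documentation,
-thread_queue_size applies to the next input or output file on the command
line):
ffmpeg -i INPUT -thread_queue_size 64 OUTPUT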
See the comment block at the top of fftools/ffmpeg_sched.h for more
details on what this scheduler is for.
This commit adds the scheduling code itself, along with minimal
integration with the rest of the program:
* allocating and freeing the scheduler
* passing it throughout the call stack in order to register the
individual components (demuxers/decoders/filtergraphs/encoders/muxers)
with the scheduler
The scheduler is not actually used as of this commit, so it should not
result in any change in behavior. That will change in future commits.
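As a rough illustration of that integration flow (all names and signatures
are hypothetical stand-ins, not the actual fftools/ffmpeg_sched.h API):

    typedef struct Scheduler Scheduler;

    // Illustrative registration interface; see fftools/ffmpeg_sched.h for
    // the real declarations.
    Scheduler *sch_alloc(void);
    void       sch_free(Scheduler **schp);
    int        sch_add_demux(Scheduler *sch, void *demuxer);
    int        sch_add_dec(Scheduler *sch, void *decoder);
    int        sch_add_filtergraph(Scheduler *sch, void *fg);
    int        sch_add_enc(Scheduler *sch, void *encoder);
    int        sch_add_mux(Scheduler *sch, void *muxer);

    static int build_pipeline(Scheduler *sch, void *demuxer, void *decoder,
                              void *fg, void *encoder, void *muxer)
    {
        // Each component is registered while the transcoding pipeline is
        // being set up; the returned indices would later identify the
        // endpoints that exchange packets and frames.
        if (sch_add_demux(sch, demuxer)  < 0 ||
            sch_add_dec(sch, decoder)    < 0 ||
            sch_add_filtergraph(sch, fg) < 0 ||
            sch_add_enc(sch, encoder)    < 0 ||
            sch_add_mux(sch, muxer)      < 0)
            return -1;
        return 0;
    }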