doc: add swscale rewrite design document
This should hopefully serve as a better introduction to my new swscale redesign than hunting down random commit message monologues.

doc/swscale-v2.txt (new file, 344 lines)

New swscale design to change everything (tm)
============================================

SwsGraph
--------

The entry point to the new architecture, SwsGraph is what coordinates
multiple "passes". These can include cascaded scaling passes, error diffusion
dithering, and so on. Or we could have separate passes for the vertical and
horizontal scaling. In between each SwsPass lies a fully allocated image buffer.
Graph passes may have different levels of threading, e.g. we can have a
single-threaded error diffusion pass following a multi-threaded scaling pass.

SwsGraph is internally recreated whenever the image format, dimensions or
settings change in any way. sws_scale_frame() is itself just a lightweight
wrapper that runs ff_sws_graph_create() whenever the format changes, splits
interlaced images into separate fields, and calls ff_sws_graph_run() on each.

From the point of view of SwsGraph itself, all inputs are progressive.
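
To make that division of responsibilities concrete, the wrapper logic amounts
to roughly the following pseudo-code (a simplified sketch only: the actual
signatures of ff_sws_graph_create() / ff_sws_graph_run() differ, and helpers
such as formats_changed() and extract_field() are purely illustrative):

    int sws_scale_frame(SwsContext *ctx, AVFrame *dst, const AVFrame *src)
    {
        /* (Re)create the graph only when formats, dimensions or settings change */
        if (formats_changed(ctx, dst, src)) {
            int ret = ff_sws_graph_create(/* ctx, src/dst formats, ... */);
            if (ret < 0)
                return ret;
        }

        if (src->flags & AV_FRAME_FLAG_INTERLACED) {
            /* Split into two half-height fields and scale each independently;
             * SwsGraph itself only ever sees progressive input */
            for (int field = 0; field < 2; field++)
                ff_sws_graph_run(/* graph, extract_field(dst, field),
                                    extract_field(src, field), ... */);
        } else {
            ff_sws_graph_run(/* graph, dst, src, ... */);
        }

        return 0;
    }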

SwsOp / SwsOpList
-----------------

This is the newly introduced abstraction layer between the high-level format
handling logic and the low-level backing implementation. Each SwsOp is designed
to be as small and atomic as possible, with the possible exception of the
read / write operations due to their numerous variants.

The basic idea is to split logic between three major components:

1. The high-level format "business logic", which generates in a very
   naive way a sequence of operations guaranteed to get you from point A
   to point B. This logic is written with correctness in mind only,
   ignoring any performance concerns or low-level implementation decisions.
   Semantically, everything is always decoded from the input format to
   normalized (real valued) RGB, and then encoded back to the output format.

   This code lives in libswscale/format.c

2. The optimizer. This is where the "magic" happens, so to speak. The
   optimizer's job is to take the abstract sequence of operations
   produced by the high-level format analysis code and incrementally
   optimize it. Each optimization step is designed to be minute and provably
   lossless, or otherwise guarded behind the BITEXACT flag. This ensures that
   the resulting output is always identical, no matter how many layers of
   optimization we add.

   This code lives in libswscale/ops.c

3. The compiler. Once we have a sequence of operations as output by the
   optimizer, we "compile" this down to a callable function. This is then
   applied by the dispatch wrapper by striping it over the input image.

   See libswscale/ops_backend.c for the reference backend, or
   libswscale/x86/ops.c for a more complex SIMD example.
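
Putting the three components together, the life cycle of a conversion looks
conceptually like this (pseudo-code; the function names below are illustrative
stand-ins, not the actual entry points in format.c / ops.c):

    /* 1. format.c: naive but correct list of operations from A to B */
    SwsOpList *ops = generate_op_list(src_format, dst_format);

    /* 2. ops.c: incrementally rewrite the list into an equivalent,
     *    cheaper one (each step lossless or guarded by BITEXACT) */
    optimize_op_list(ops);

    /* 3. backend (ops_backend.c, x86/ops.c, ...): compile the optimized
     *    list into a callable kernel chain, then stripe it over the image */
    compiled = backend->compile(ops);
    dispatch(compiled, src_image, dst_image);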

This overall approach has a considerable number of benefits:

1. It allows us to verify correctness of logic and spot semantic errors at a
   very high level, by simply looking at the sequence of operations (available
   by default at debug / verbose log level), without having to dig through the
   multiple levels of complicated, interwoven format handling code that is
   legacy swscale.

2. Because most of the brains live inside the powerful optimizer, we get
   fast paths "for free" for any suitable format conversion, rather than having
   to enumerate them one by one. SIMD code itself can be written in a very
   general way and does not need to be tied to specific pixel formats -
   subsequent low-level implementations can be strung together without much
   overhead.

3. We can in the future, with relative ease, compile these operations
   down to SPIR-V (or even LLVM IR) and generate efficient GPU or
   target-machine specific implementations. This also opens the window for
   adding hardware frame support to libswscale, and even transparently using
   GPU acceleration for CPU frames.

4. Platform-specific SIMD can be reduced down to a comparatively small set of
   optimized routines, while still providing 100% coverage for all possible
   pixel formats and operations. (As of writing, the x86 example backend has
   about 60 unique implementations, of which 20 are trivial swizzles, 10 are
   read/write ops, 10 are pixel type conversions and the remaining 20 are the
   various logic/arithmetic ops.)

5. Backends hide behind a layer of abstraction offering them a considerable
   deal of flexibility in how they want to implement their operations. For
   example, the x86 backend has a dedicated function for compiling compatible
   operations down to a single in-place pshufb instruction.

   Platform-specific low-level data is self-contained within its own setup()
   function and private data structure, eliminating all reads into SwsContext
   or the possibility of conflicts between platforms.

6. We can compute an exact reference result for each operation with fixed
   precision (ff_sws_op_apply_q), and use that to e.g. measure the amount of
   error introduced by dithering, or even catch bugs in the reference C
   implementation. (In theory - currently checkasm just compares against C.)

Examples of SwsOp in action
---------------------------

For illustration, here is the sequence of operations currently generated by
my prototype, for a conversion from RGB24 to YUV444P:

Unoptimized operation list:
  [ u8 .... -> ....] SWS_OP_READ    : 3 elem(s) packed >> 0
  [ u8 .... -> ....] SWS_OP_SWIZZLE : 0123
  [ u8 .... -> ....] SWS_OP_RSHIFT  : >> 0
  [ u8 .... -> ....] SWS_OP_CLEAR   : {_ _ _ 0}
  [ u8 .... -> ....] SWS_OP_CONVERT : u8 -> f32
  [f32 .... -> ....] SWS_OP_LINEAR  : diag3+alpha [[1/255 0 0 0 0] [0 1/255 0 0 0] [0 0 1/255 0 0] [0 0 0 1 1]]
  [f32 .... -> ....] SWS_OP_LINEAR  : matrix3 [[0.299000 0.587000 0.114000 0 0] [-0.168736 -0.331264 1/2 0 0] [1/2 -0.418688 -57/701 0 0] [0 0 0 1 0]]
  [f32 .... -> ....] SWS_OP_LINEAR  : diag3+off3 [[219 0 0 0 16] [0 224 0 0 128] [0 0 224 0 128] [0 0 0 1 0]]
  [f32 .... -> ....] SWS_OP_DITHER  : 16x16 matrix
  [f32 .... -> ....] SWS_OP_MAX     : {0 0 0 0} <= x
  [f32 .... -> ....] SWS_OP_MIN     : x <= {255 255 255 _}
  [f32 .... -> ....] SWS_OP_CONVERT : f32 -> u8
  [ u8 .... -> ....] SWS_OP_LSHIFT  : << 0
  [ u8 .... -> ....] SWS_OP_SWIZZLE : 0123
  [ u8 .... -> ....] SWS_OP_WRITE   : 3 elem(s) planar >> 0

This is optimized into the following sequence:

Optimized operation list:
  [ u8 XXXX -> +++X] SWS_OP_READ    : 3 elem(s) packed >> 0
  [ u8 ...X -> +++X] SWS_OP_CONVERT : u8 -> f32
  [f32 ...X -> ...X] SWS_OP_LINEAR  : matrix3+off3 [[0.256788 0.504129 0.097906 0 16] [-0.148223 -0.290993 112/255 0 128] [112/255 -0.367788 -0.071427 0 128] [0 0 0 1 0]]
  [f32 ...X -> ...X] SWS_OP_DITHER  : 16x16 matrix
  [f32 ...X -> +++X] SWS_OP_CONVERT : f32 -> u8
  [ u8 ...X -> +++X] SWS_OP_WRITE   : 3 elem(s) planar >> 0
    (X = unused, + = exact, 0 = zero)

The extra metadata on the left of the operation list is a dump of the
internal state used by the optimizer. It keeps track of knowledge about the
pixel values, such as their value range, whether or not they're exact
integers, and so on.

In this example, you can see that the input values are exact (except for
the alpha channel, which is undefined), until the first SWS_OP_LINEAR
multiplies them by a non-integer constant. They regain their exact integer
status only after the (truncating) conversion to u8 in the output step.
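
Note also how the three SWS_OP_LINEAR stages of the unoptimized list (the
1/255 normalization, the RGB -> YCbCr matrix, and the range scaling plus
offsets) have been folded into the single matrix3+off3 above. Spot-checking a
few of the folded coefficients against the values printed by the optimizer:

  Y row:   0.299 * 219/255 = 0.256788      0.587 * 219/255 = 0.504129
           0.114 * 219/255 = 0.097906      offset: 16
  Cb row: -0.168736 * 224/255 = -0.148223  1/2 * 224/255 = 112/255
           offset: 128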

Example of more aggressive optimization
---------------------------------------

Conversion pass for gray -> rgb48:

Unoptimized operation list:
  [ u8 .... -> ....] SWS_OP_READ    : 1 elem(s) planar >> 0
  [ u8 .... -> ....] SWS_OP_SWIZZLE : 0123
  [ u8 .... -> ....] SWS_OP_RSHIFT  : >> 0
  [ u8 .... -> ....] SWS_OP_CLEAR   : {_ 0 0 0}
  [ u8 .... -> ....] SWS_OP_CONVERT : u8 -> f32
  [f32 .... -> ....] SWS_OP_LINEAR  : luma+alpha [[1/255 0 0 0 0] [0 1 0 0 0] [0 0 1 0 0] [0 0 0 1 1]]
  [f32 .... -> ....] SWS_OP_LINEAR  : matrix3 [[1 0 701/500 0 0] [1 -0.344136 -0.714136 0 0] [1 443/250 0 0 0] [0 0 0 1 0]]
  [f32 .... -> ....] SWS_OP_LINEAR  : diag3 [[65535 0 0 0 0] [0 65535 0 0 0] [0 0 65535 0 0] [0 0 0 1 0]]
  [f32 .... -> ....] SWS_OP_MAX     : {0 0 0 0} <= x
  [f32 .... -> ....] SWS_OP_MIN     : x <= {65535 65535 65535 _}
  [f32 .... -> ....] SWS_OP_CONVERT : f32 -> u16
  [u16 .... -> ....] SWS_OP_LSHIFT  : << 0
  [u16 .... -> ....] SWS_OP_SWIZZLE : 0123
  [u16 .... -> ....] SWS_OP_WRITE   : 3 elem(s) packed >> 0

Optimized operation list:
  [ u8 XXXX -> +XXX] SWS_OP_READ    : 1 elem(s) planar >> 0
  [ u8 .XXX -> +XXX] SWS_OP_CONVERT : u8 -> u16 (expand)
  [u16 .XXX -> +++X] SWS_OP_SWIZZLE : 0003
  [u16 ...X -> +++X] SWS_OP_WRITE   : 3 elem(s) packed >> 0
    (X = unused, + = exact, 0 = zero)

Here, the optimizer has managed to eliminate all of the unnecessary linear
operations on previously zeroed values, turn the resulting column matrix into
a swizzle operation, avoid the unnecessary dither (and round trip via float)
because the pixel values are guaranteed to be bit exact, and finally, turn
the multiplication by 65535 / 255 = 257 into a simple integer expand operation.

As a final bonus, the x86 backend further optimizes this into a 12-byte shuffle:

  pshufb = {0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, -1, -1, -1, -1}
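
This works because, for an 8-bit value x, multiplying by 257 is the same as
duplicating the byte into both halves of the 16-bit result:

  x * 257 = x * 256 + x = (x << 8) | x

so each input byte only needs to be copied into all six bytes of the packed
R16/G16/B16 output (the -1 entries zero the unused remainder of the register),
letting a single byte shuffle convert two gray pixels per 16 bytes.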

  time=208 us, ref=4212 us, speedup=20.236x faster (single thread)
  time=57 us, ref=472 us, speedup=8.160x faster (multi thread)

Compiler and underlying implementation layer (SwsOpChain)
---------------------------------------------------------

While the backend API is flexible enough to permit more exotic implementations
(e.g. using JIT code generation), we establish a common set of helpers for use
in "traditional" SIMD implementations.

The basic idea is to have one "kernel" (or implementation) per operation,
and then just chain a list of these kernels together as separate function
calls. For best performance, we want to keep data in vector registers in
between function calls using a custom calling convention, thus avoiding any
unnecessary memory accesses. Additionally, we want the per-kernel overhead to
be as low as possible, with each kernel ideally just jumping directly into
the next kernel.

As a result, we arrive at a design where we first divide the image into small
chunks, or "blocks", and then dispatch the "chain" of kernels on each chunk in
sequence. Each kernel processes a fixed number of pixels, with the overall
entry point taking care of looping. Remaining pixels (the "tail") are handled
generically by the backend-invariant dispatch code (located in ops.c), using a
partial memcpy into a suitably sized temporary buffer.
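
In pseudo-code, the dispatch logic looks roughly like this (names are
illustrative; the real code lives in ops.c):

    for (each line of the slice) {
        for (x = 0; x + SWS_BLOCK_SIZE <= width; x += SWS_BLOCK_SIZE)
            run_chain(chain, x);            /* full blocks, no branching */

        if (width % SWS_BLOCK_SIZE) {
            /* tail: memcpy the remaining pixels into a block-sized
             * temporary buffer, run one more block on it, then memcpy
             * the relevant part of the result back out */
        }
    }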

To minimize the per-kernel function call overhead, we use a "continuation
passing style" for chaining kernels. Each operation computes its result and
then directly calls the next operation in the sequence, with the appropriate
internal function signature.

The C reference backend reads data onto the stack and then passes the array
pointers to the next continuation as regular function arguments:

    void process(GlobalContext *ctx, OpContext *op,
                 block_t x, block_t y, block_t z, block_t w)
    {
        for (int i = 0; i < SWS_BLOCK_SIZE; i++) {
            /* do something with x[i], y[i], z[i], w[i] */
        }

        op->next(ctx, &op[1], x, y, z, w);
    }

With type conversions pushing the new data onto the stack as well:

    void convert8to16(GlobalContext *ctx, OpContext *op,
                      block_t x, block_t y, block_t z, block_t w)
    {
        /* Pseudo-code: widen each element from u8 to u16 */
        u16block_t x16 = (u16block_t) x;
        u16block_t y16 = (u16block_t) y;
        u16block_t z16 = (u16block_t) z;
        u16block_t w16 = (u16block_t) w;

        op->next(ctx, &op[1], x16, y16, z16, w16);
    }

By contrast, the x86 backend always keeps the X/Y/Z/W values pinned in specific
vector registers (ymm0-ymm3 for the lower half of the block, and ymm4-ymm7 for
the upper half).

Each kernel additionally has access to a 32 byte per-op context storing the
pointer to the next kernel plus 16 bytes of arbitrary private data. This is
used during construction of the function chain to place things like small
constants.
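
Conceptually, the per-op context looks something like the following (field
names and exact layout are illustrative, not the real definition):

    typedef struct OpContext {
        void *next;          /* pointer to the next kernel in the chain     */
        uint8_t priv[16];    /* small per-op constants, masks, offsets, ... */
        /* padded to 32 bytes so that advancing by 32 steps to the next op */
    } OpContext;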

In assembly, the per-kernel overhead looks like this:

    load $tmp, $arg1
    ...
    add  $arg1, 32
    jump $tmp

This design gives vastly better performance than the alternative of returning
out to a central loop or "trampoline". This is partly because the order of
kernels within a chain is always the same, so the branch predictor can easily
remember the target address of each "jump" instruction.

The only way to realistically improve on this design would be to directly
stitch the kernel bodies together using runtime code generation.

Future considerations and limitations
-------------------------------------

My current prototype has a number of severe limitations and opportunities
for improvement:

1. It does not handle scaling at all. I am not yet entirely sure how I want
   to handle scaling; this includes the handling of subsampled content. I have
   a number of vague ideas in my head, but nothing where I can say with
   certainty that it will work out well.

   It's possible that we won't come up with a perfect solution here, and will
   need to decide on which set of compromises we are comfortable accepting:

   1. Do we need the ability to scale YUV -> YUV by handling luma and chroma
      independently? When downscaling 100x100 4:2:0 to 50x50 4:4:4, should we
      support the option of reusing the chroma plane directly (even though
      this would introduce a subpixel shift for typical chroma siting)?

   Looking towards zimg, I am also thinking that we probably want to do
   scaling on floating point values, since this is best for both performance
   and accuracy, especially given that we need to go up to 32-bit
   intermediates during scaling anyway.

   So far, the most promising approach seems to be to handle subsampled
   input/output as a dedicated read/write operation type; perhaps even with a
   fixed/static subsampling kernel. To avoid compromising on performance when
   chroma resampling is not necessary, the optimizer could then relax the
   pipeline to use non-interpolating read/writes when all intermediate
   operations are component-independent.

2. Since each operation is conceptually defined on 4-component pixels, we end
   up defining a lot of variants of each implementation for each possible
   *subset*. For example, we have four different implementations for
   SWS_OP_SCALE in my current templates:
   - op_scale_1000
   - op_scale_1001
   - op_scale_1110
   - op_scale_1111

   This reflects the four different arrangements of pixel components that are
   typically present (or absent). While best for performance, it does turn into
   a bit of a chore when implementing these kernels.

   The only real alternative would be to either branch inside the kernel (bad),
   or to use separate kernels for each individual component and chain them all
   together. I have not yet tested whether the latter approach would be faster
   after the latest round of refactors to the kernel glue code.

3. I do not yet have any support for LUTs. But when I add them, something we
   could do is have the optimizer automatically "promote" a sequence of
   operations to LUTs. For example, any sequence that looks like:

   1. [u8] SWS_OP_CONVERT -> X
   2. [X]  ...             // only per-component operations
   3. [X]  SWS_OP_CONVERT -> Y
   4. [Y]  SWS_OP_WRITE

   could be replaced by a LUT with 256 entries (see the sketch at the end of
   this list). This is especially important for anything involving packed
   8-bit input (e.g. rgb8, rgb4_byte).

   We also definitely want to hook this up to the existing CMS code for
   transformations between different primaries.

4. Because we rely on AVRational math to generate the coefficients for
   operations, we need to be able to represent all pixel values as an
   AVRational. However, this presents a challenge for 32-bit formats (e.g.
   GRAY32, RGBA128), because their maximum pixel value exceeds INT_MAX, the
   largest value representable by an AVRational.

   It's possible we may want to introduce an AVRational64 for this, or,
   perhaps more flexibly, extend AVRational to an AVFloating type
   represented as { AVRational n; int exp; }, denoting n * 2^exp. This
   would preserve our ability to represent all pixel values exactly, while
   opening up the range arbitrarily.

5. Is there ever a situation where the use of floats introduces the risk of
   non bit-exact output? If so, and also for the possible performance
   advantages, we may want to explore a fixed-point 16-bit path as an
   alternative to the floating point math.

   So far, I have managed to avoid any bit exactness issues inside the x86
   backend by ensuring that the order of linear operations is identical
   between the C backend and the x86 backend, but this may not be practical
   to guarantee on all backends. The x86 float code is also dramatically
   faster than the old fixed point code, so I'm tentatively optimistic that
   a fixed point path will not be needed.
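
As a sketch of the LUT promotion idea from point 3 above (pseudo-code; the
helper names and the output type are hypothetical, and the real integration
would live in the optimizer), the 256-entry table could be built by simply
evaluating the existing operation list once per possible input value:

    /* Promote a chain "u8 -> (per-component ops) -> Y -> write" to a LUT */
    output_pixel_t lut[256];
    for (int i = 0; i < 256; i++)
        lut[i] = apply_op_list_exact(ops, i);  /* e.g. via the exact
                                                  reference path */

    /* The whole chain then collapses to: read u8, index LUT, write Y */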