When support for this was added the details weren't yet finalized.
This is no longer the case.
Fixes writing of mkv/webm files with HDR.
Reported-by: Kagami Hiiragi <kagami@genshiken.org>
Signed-off-by: Rostislav Pehlivanov <atomnuker@gmail.com>
Reviewed-by: James Almer <jamrial@gmail.com>
This work is sponsored by, and copyright, Google.
Previously all subpartitions except the eob=1 (DC) case ran with
the same runtime:
vp9_inv_dct_dct_16x16_sub16_add_neon: 1373.2
vp9_inv_dct_dct_32x32_sub32_add_neon: 8089.0
By skipping individual 8x16 or 8x32 pixel slices in the first pass,
we reduce the runtime of these functions like this:
vp9_inv_dct_dct_16x16_sub1_add_neon: 235.3
vp9_inv_dct_dct_16x16_sub2_add_neon: 1036.7
vp9_inv_dct_dct_16x16_sub4_add_neon: 1036.7
vp9_inv_dct_dct_16x16_sub8_add_neon: 1036.7
vp9_inv_dct_dct_16x16_sub12_add_neon: 1372.1
vp9_inv_dct_dct_16x16_sub16_add_neon: 1372.1
vp9_inv_dct_dct_32x32_sub1_add_neon: 555.1
vp9_inv_dct_dct_32x32_sub2_add_neon: 5190.2
vp9_inv_dct_dct_32x32_sub4_add_neon: 5180.0
vp9_inv_dct_dct_32x32_sub8_add_neon: 5183.1
vp9_inv_dct_dct_32x32_sub12_add_neon: 6161.5
vp9_inv_dct_dct_32x32_sub16_add_neon: 6155.5
vp9_inv_dct_dct_32x32_sub20_add_neon: 7136.3
vp9_inv_dct_dct_32x32_sub24_add_neon: 7128.4
vp9_inv_dct_dct_32x32_sub28_add_neon: 8098.9
vp9_inv_dct_dct_32x32_sub32_add_neon: 8098.8
I.e. in general a very minor overhead for the full subpartition case due
to the additional cmps, but a significant speedup for the cases when we
only need to process a small part of the actual input data.
This is cherrypicked from libav commits
cad42fadcd2c2ae1b3676bb398844a1f521a2d7b and
a0c443a3980dc22eb02b067ac4cb9ffa2f9b04d2.
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
This work is sponsored by, and copyright, Google.
Previously all subpartitions except the eob=1 (DC) case ran with
the same runtime:
Cortex A7 A8 A9 A53
vp9_inv_dct_dct_16x16_sub16_add_neon: 3188.1 2435.4 2499.0 1969.0
vp9_inv_dct_dct_32x32_sub32_add_neon: 18531.7 16582.3 14207.6 12000.3
By skipping individual 4x16 or 4x32 pixel slices in the first pass,
we reduce the runtime of these functions like this:
vp9_inv_dct_dct_16x16_sub1_add_neon: 274.6 189.5 211.7 235.8
vp9_inv_dct_dct_16x16_sub2_add_neon: 2064.0 1534.8 1719.4 1248.7
vp9_inv_dct_dct_16x16_sub4_add_neon: 2135.0 1477.2 1736.3 1249.5
vp9_inv_dct_dct_16x16_sub8_add_neon: 2446.7 1828.7 1993.6 1494.7
vp9_inv_dct_dct_16x16_sub12_add_neon: 2832.4 2118.3 2266.5 1735.1
vp9_inv_dct_dct_16x16_sub16_add_neon: 3211.7 2475.3 2523.5 1983.1
vp9_inv_dct_dct_32x32_sub1_add_neon: 756.2 456.7 862.0 553.9
vp9_inv_dct_dct_32x32_sub2_add_neon: 10682.2 8190.4 8539.2 6762.5
vp9_inv_dct_dct_32x32_sub4_add_neon: 10813.5 8014.9 8518.3 6762.8
vp9_inv_dct_dct_32x32_sub8_add_neon: 11859.6 9313.0 9347.4 7514.5
vp9_inv_dct_dct_32x32_sub12_add_neon: 12946.6 10752.4 10192.2 8280.2
vp9_inv_dct_dct_32x32_sub16_add_neon: 14074.6 11946.5 11001.4 9008.6
vp9_inv_dct_dct_32x32_sub20_add_neon: 15269.9 13662.7 11816.1 9762.6
vp9_inv_dct_dct_32x32_sub24_add_neon: 16327.9 14940.1 12626.7 10516.0
vp9_inv_dct_dct_32x32_sub28_add_neon: 17462.7 15776.1 13446.2 11264.7
vp9_inv_dct_dct_32x32_sub32_add_neon: 18575.5 17157.0 14249.3 12015.1
I.e. in general a very minor overhead for the full subpartition case due
to the additional loads and cmps, but a significant speedup for the cases
when we only need to process a small part of the actual input data.
In common VP9 content in a few inspected clips, 70-90% of the non-dc-only
16x16 and 32x32 IDCTs only have nonzero coefficients in the upper left
8x8 or 16x16 subpartitions respectively.
This is cherrypicked from libav commit
9c8bc74c2b40537b0997f646c87c008042d788c2.
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
This avoids reloading them if they haven't been clobbered, if the
first pass also was idct.
This is similar to what was done in the aarch64 version.
This is cherrypicked from libav commit
3c87039a404c5659ae9bf7454a04e186532eb40b.
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
Since the same parameter is used for both input and output,
the name inout is more fitting.
This matches the naming used below in the dmbutterfly macro.
This is cherrypicked from libav commit
79566ec8c77969d5f9be533de04b1349834cca62.
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
The clobbering tests in checkasm are only invoked when testing
correctness, so this bug didn't show up when benchmarking the
dc-only version.
This is cherrypicked from libav commit
4d960a11855f4212eb3a4e470ce890db7f01df29.
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
This is one instruction less for thumb, and only have got
1/2 arm/thumb specific instructions.
This is cherrypicked from libav commit
e5b0fc170f85b00f7dd0ac514918fb5c95253d39.
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
The latter is 1 cycle faster on a cortex-53 and since the operands are
bytewise (or larger) bitmask (impossible to overflow to zero) both are
equivalent.
This is cherrypicked from libav commit
e7ae8f7a715843a5089d18e033afb3ee19ab3057.
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
Since aarch64 has enough free general purpose registers use them to
branch to the appropiate storage code. 1-2 cycles faster for the
functions using loop_filter 8/16, ... on a cortex-a53. Mixed results
(up to 2 cycles faster/slower) on a cortex-a57.
This is cherrypicked from libav commit
d7595de0b25e7064fd9e06dea5d0425536cef6dc.
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
Seemingly ff_clear_block_sse assumed that the block array is aligned,
so make sure it is.
Fixes ticket #6079
Signed-off-by: James Almer <jamrial@gmail.com>
when hlsenc use flag second_level_segment_index,
second_level_segment_size and second_level_segment_duration,
the rename is ok but the output filename always use the old filename
so move the rename operation after the close the ts file and
before open new segment
Reported-by: Christian Johannesen <chrisjohannesen@gmail.com>
Reviewed-by: Bodecs Bela <bodecsb@vivanet.hu>
Signed-off-by: Steven Liu <lq@chinaffmpeg.org>
CID: 1396852
check the devices_list alloc status,
and release the devices_list when alloc devices error
Reviewed-by: Michael Niedermayer <michael@niedermayer.cc>
Signed-off-by: Steven Liu <lq@chinaffmpeg.org>
Fixes compilation with hardcoded tables after eaff1aa09e90e2711207c9463db8bf8e8dec8178
and e71b8119e7db675dd2dac3f7fb069b0df2943c38
Reviewed-by: Timo Rothenpieler <timo@rothenpieler.org>
Signed-off-by: James Almer <jamrial@gmail.com>
avfilter_graph_request_oldest() does work that should be done by
either the filter or the application.
The principle of this function, calling ff_request_frame() from
outside the filter was always shaky. This version is less elegant
since it requires making special cases for each filter, but it
is more robust since it no longer calls ff_request_frame()
directly without notifying the filter.
Eventually, avfilter_graph_request_oldest() will be deprecated
for a function to just run the graph.