Loads from this strictly doesn't require alignment, but specify it
just for consistency with the arm version.
Signed-off-by: Martin Storsjö <martin@martin.st>
* commit 'effc1430b2fe5997d9d55bf28dc507c27125eb27':
Revert "checkasm: vp9dsp: Benchmark the dc-only version of idct_idct separately"
Merged-by: James Almer <jamrial@gmail.com>
* commit 'c91d6a33f872574c95c8784277cf60ffcf6bff4f':
checkasm: aarch64: Add filler args to make sure all parameters are passed on the stack
Merged-by: James Almer <jamrial@gmail.com>
* commit 'f1b3e131385176c3c9d9783b25047856a0dcebf6':
checkasm: aarch64: Clobber the stack before calling functions
Merged-by: James Almer <jamrial@gmail.com>
* commit 'a05cc56124b4f1237f6355784de821e3290ddb44':
checkasm: arm/aarch64: Fix the amount of space reserved for stack parameters
Merged-by: James Almer <jamrial@gmail.com>
* commit '22c3ab18646924ce24dc6017a9e882ff69689e40':
checkasm: Add test for huffyuvdsp add_bytes
huffyuvdsp is renamed to llviddsp to be consistent with our codebase.
Note: af607b7e07 wasn't actually required for this test since this
commit is not actually testing huffyuvdsp.
Merged-by: Clément Bœsch <u@pkh.me>
* commit '12004a9a7f20e44f4da2ee6c372d5e1794c8d6c5':
audiodsp/x86: yasmify vector_clipf_sse
audiodsp: reorder arguments for vector_clipf
Merged the version from Libav after a discussion with James Almer on
IRC:
19:22 <ubitux> jamrial: opinion on 12004a9a7f20e44f4da2ee6c372d5e1794c8d6c5?
19:23 <ubitux> it was apparently yasmified differently
19:23 <ubitux> (it depends on the previous commit arg shuffle)
19:24 <ubitux> i don't see the magic movsxdifnidn in your port btw
19:24 <ubitux> it's a port from 1d36defe94
19:25 <jamrial> seems better thanks to said arg shuffle
19:25 <jamrial> the loop is the same, but init is simpler
19:25 <jamrial> probably worth merging
19:25 <ubitux> OK
19:25 <ubitux> thanks
19:26 <jamrial> curious they didn't make len ptrdiff_t after the previous bunch of commits, heh
19:26 <ubitux> yeah indeed
Both commits are merged at the same time to prevent a conflict with our
existing yasmified ff_vector_clipf_sse.
Merged-by: Clément Bœsch <u@pkh.me>
Previously, all link-time dependencies were added for all libraries,
resulting in bogus link-time dependencies since not all dependencies
are shared across libraries. Also, in some cases like libavutil, not
all dependencies were taken into account, resulting in some cases of
underlinking.
To address all this mess a machinery is added for tracking which
dependency belongs to which library component and then leveraged
to determine correct dependencies for all individual libraries.
* commit '71a0472114574993df7035f4de9aa007e03817b8':
checkasm: arm: report the first clobbered register in checkasm_checked_call
Also includes 446353ea18, 59aeed93e4, and 37961044c6 to avoid breaking
too much stuff.
Merged-by: Clément Bœsch <u@pkh.me>
This work is sponsored by, and copyright, Google.
Previously all subpartitions except the eob=1 (DC) case ran with
the same runtime:
Cortex A7 A8 A9 A53
vp9_inv_dct_dct_16x16_sub16_add_neon: 3188.1 2435.4 2499.0 1969.0
vp9_inv_dct_dct_32x32_sub32_add_neon: 18531.7 16582.3 14207.6 12000.3
By skipping individual 4x16 or 4x32 pixel slices in the first pass,
we reduce the runtime of these functions like this:
vp9_inv_dct_dct_16x16_sub1_add_neon: 274.6 189.5 211.7 235.8
vp9_inv_dct_dct_16x16_sub2_add_neon: 2064.0 1534.8 1719.4 1248.7
vp9_inv_dct_dct_16x16_sub4_add_neon: 2135.0 1477.2 1736.3 1249.5
vp9_inv_dct_dct_16x16_sub8_add_neon: 2446.7 1828.7 1993.6 1494.7
vp9_inv_dct_dct_16x16_sub12_add_neon: 2832.4 2118.3 2266.5 1735.1
vp9_inv_dct_dct_16x16_sub16_add_neon: 3211.7 2475.3 2523.5 1983.1
vp9_inv_dct_dct_32x32_sub1_add_neon: 756.2 456.7 862.0 553.9
vp9_inv_dct_dct_32x32_sub2_add_neon: 10682.2 8190.4 8539.2 6762.5
vp9_inv_dct_dct_32x32_sub4_add_neon: 10813.5 8014.9 8518.3 6762.8
vp9_inv_dct_dct_32x32_sub8_add_neon: 11859.6 9313.0 9347.4 7514.5
vp9_inv_dct_dct_32x32_sub12_add_neon: 12946.6 10752.4 10192.2 8280.2
vp9_inv_dct_dct_32x32_sub16_add_neon: 14074.6 11946.5 11001.4 9008.6
vp9_inv_dct_dct_32x32_sub20_add_neon: 15269.9 13662.7 11816.1 9762.6
vp9_inv_dct_dct_32x32_sub24_add_neon: 16327.9 14940.1 12626.7 10516.0
vp9_inv_dct_dct_32x32_sub28_add_neon: 17462.7 15776.1 13446.2 11264.7
vp9_inv_dct_dct_32x32_sub32_add_neon: 18575.5 17157.0 14249.3 12015.1
I.e. in general a very minor overhead for the full subpartition case due
to the additional loads and cmps, but a significant speedup for the cases
when we only need to process a small part of the actual input data.
In common VP9 content in a few inspected clips, 70-90% of the non-dc-only
16x16 and 32x32 IDCTs only have nonzero coefficients in the upper left
8x8 or 16x16 subpartitions respectively.
This is cherrypicked from libav commit
9c8bc74c2b.
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
Supporting the system was a nice joke for the 9 release, but it has
run its course. Nowadays Plan 9 receives no testing and has no
practical usefulness.
This work is sponsored by, and copyright, Google.
Previously all subpartitions except the eob=1 (DC) case ran with
the same runtime:
Cortex A7 A8 A9 A53
vp9_inv_dct_dct_16x16_sub16_add_neon: 3188.1 2435.4 2499.0 1969.0
vp9_inv_dct_dct_32x32_sub32_add_neon: 18531.7 16582.3 14207.6 12000.3
By skipping individual 4x16 or 4x32 pixel slices in the first pass,
we reduce the runtime of these functions like this:
vp9_inv_dct_dct_16x16_sub1_add_neon: 274.6 189.5 211.7 235.8
vp9_inv_dct_dct_16x16_sub2_add_neon: 2064.0 1534.8 1719.4 1248.7
vp9_inv_dct_dct_16x16_sub4_add_neon: 2135.0 1477.2 1736.3 1249.5
vp9_inv_dct_dct_16x16_sub8_add_neon: 2446.7 1828.7 1993.6 1494.7
vp9_inv_dct_dct_16x16_sub12_add_neon: 2832.4 2118.3 2266.5 1735.1
vp9_inv_dct_dct_16x16_sub16_add_neon: 3211.7 2475.3 2523.5 1983.1
vp9_inv_dct_dct_32x32_sub1_add_neon: 756.2 456.7 862.0 553.9
vp9_inv_dct_dct_32x32_sub2_add_neon: 10682.2 8190.4 8539.2 6762.5
vp9_inv_dct_dct_32x32_sub4_add_neon: 10813.5 8014.9 8518.3 6762.8
vp9_inv_dct_dct_32x32_sub8_add_neon: 11859.6 9313.0 9347.4 7514.5
vp9_inv_dct_dct_32x32_sub12_add_neon: 12946.6 10752.4 10192.2 8280.2
vp9_inv_dct_dct_32x32_sub16_add_neon: 14074.6 11946.5 11001.4 9008.6
vp9_inv_dct_dct_32x32_sub20_add_neon: 15269.9 13662.7 11816.1 9762.6
vp9_inv_dct_dct_32x32_sub24_add_neon: 16327.9 14940.1 12626.7 10516.0
vp9_inv_dct_dct_32x32_sub28_add_neon: 17462.7 15776.1 13446.2 11264.7
vp9_inv_dct_dct_32x32_sub32_add_neon: 18575.5 17157.0 14249.3 12015.1
I.e. in general a very minor overhead for the full subpartition case due
to the additional loads and cmps, but a significant speedup for the cases
when we only need to process a small part of the actual input data.
In common VP9 content in a few inspected clips, 70-90% of the non-dc-only
16x16 and 32x32 IDCTs only have nonzero coefficients in the upper left
8x8 or 16x16 subpartitions respectively.
Signed-off-by: Martin Storsjö <martin@martin.st>
This reverts commit 81d7f0bbca.
Instead of just benchmarking dc separately, test all relevant subparts
(in the next commit).
Signed-off-by: Martin Storsjö <martin@martin.st>
* commit '80fbb7becae530167373fe5178966b7d7604306e':
checkasm: vp8.mc: initialize the full src buffer after ec32574209
Merged-by: Hendrik Leppkes <h.leppkes@gmail.com>
* commit '8c816c0c9b12fdefd9046415e97df299880bc9b8':
checkasm/arm: align the clobber check data properly for ldrd
Merged-by: Hendrik Leppkes <h.leppkes@gmail.com>
* commit 'ec32574209f36467ef0d22c21a7e811ba98c15b6':
checkasm: vp8: mc: test unequal width/height for partitions
Merged-by: Hendrik Leppkes <h.leppkes@gmail.com>
The dc-only mode is already checked to work correctly above, but this
allows benchmarking this mode for performance tuning, and allows making
sure that it actually is correctly hooked up.
Signed-off-by: Martin Storsjö <martin@martin.st>
* commit 'e48746deec48e9ff195841bc3266b4e153a878cd':
checkasm: h264dsp: Move the x and y variables into the randomize_buffer macro
Merged-by: Hendrik Leppkes <h.leppkes@gmail.com>
This makes it match the pattern already used for VP8 MC functions.
This also makes the signature match ffmpeg's version of these
functions, easing porting of code in both directions.
Signed-off-by: Martin Storsjö <martin@martin.st>
x29 (FP) is a callee saved register and should be restored on
return. Instead of backing up x29 and restoring it here, back up
sp in a register that we are allowed to overwrite.
This fixes crashes in checkasm on aarch64 since f1b3e13138.
For some reason, gcc builds didn't crash, but clang builds do.
Signed-off-by: Martin Storsjö <martin@martin.st>
This, combined with clobbering the stack space prior to the call,
increases the chances of finding cases where 32 bit parameters
are erroneously treated as 64 bit.
Signed-off-by: Martin Storsjö <martin@martin.st>
Even if MAX_ARGS - 2 (for arm) or MAX_ARGS - 7 (for aarch64) parameters
are passed on the stack to checkasm_checked_call, we actually only
need to store MAX_ARGS - 4 (for arm) or MAX_ARGS - 8 (for aarch64)
parameters on the stack when calling the tested function.
Signed-off-by: Martin Storsjö <martin@martin.st>
The randomize_buffer() implementation assures that "most of the time",
we'll do a good mix of wide16/wide8/hev/regular/no filters for complete
code coverage. However, this is not mathematically assured because that
would make the code either much more complex, or much less random.
Some fixes and improvements by Rodger Combs <rodger.combs@gmail.com>
Signed-off-by: Anton Khirnov <anton@khirnov.net>
This fixes valgrind warnings about conditional jumps based on
uninitialized data (even though the uninitialized data only ever
was compared with a direct copy of the same uninitialized data).
Signed-off-by: Martin Storsjö <martin@martin.st>
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
When targeting COFF (windows), clang doesn't support this
directive (while binutils supports it for all targets).
Signed-off-by: Martin Storsjö <martin@martin.st>
These bits are set by exceptions in NEON instructions.
Also print the differing bits when FPSCR is clobbered,
and use bic instead of lsl, for clearing the topmost bits.
Signed-off-by: Martin Storsjö <martin@martin.st>
Each const block needs to be terminated by one endconst
invocation so either call endconst after each, or just
declare plain labels to the later strings.
This fixes errors such as this, on some binutils versions:
checkasm.S:38: Error: Macro `endconst' was already defined
Signed-off-by: Martin Storsjö <martin@martin.st>
The stack used by checkasm_checked_call_vfp was a multiple of 4 when the
checked function is called. AAPCS requires a double word (8 byte)
aligned stack public interfaces. Since both calls are public interfaces
the stack is misaligned when the checked is called.
Might fix the SIGBUS error in the armv7-linux-clang-3.7 fate config.
This avoids listing the same feature multiple times in the
test output. Previously the output contained something like this:
SSE2:
- hevc_mc.qpel [OK]
- hevc_mc.epel [OK]
- hevc_mc.unweighted_pred [OK]
- hevc_mc.qpel [OK]
- hevc_mc.epel [OK]
- hevc_mc.unweighted_pred [OK]
Signed-off-by: Martin Storsjö <martin@martin.st>
This avoids the risk of accidentally clobbering such variables outside
of the macro if the same variables are used there.
Signed-off-by: Martin Storsjö <martin@martin.st>
This fixes valgrind warnings about conditional jumps based on
uninitialized data (even though the uninitialized data only ever
was compared with a direct copy of the same uninitialized data).
Signed-off-by: Martin Storsjö <martin@martin.st>
The functions may not clean up properly after using MMX
registers. For the normal testing calls, the checkasm_checked_call
functions will do the cleanup (and check that functions that
should clean up do it as well), but when benchmarking functions
that don't clean up, we don't currently properly clean up at all.
This causes issues if a benchmarked function is followed by testing
of a function that is supposed to not clobber the MMX/FPU state but
doesn't touch it at all.
Signed-off-by: Martin Storsjö <martin@martin.st>
Restore alphabetical order in lists, break overly long lines, do some
prettyprinting, add some explanatory section comments, group parts
together that belong together logically.
Also bench a smaller buffer. This drastically reduces --bench runtime
and reports smaller, more readable numbers.
Reviewed-by: Michael Niedermayer <michael@niedermayer.cc>
Signed-off-by: James Almer <jamrial@gmail.com>
Some debuggers/profilers use this metadata to determine which function a
given instruction is in; without it they get can confused by local labels
(if you haven't stripped those). On the other hand, some tools are still
confused even with this metadata. e.g. this fixes `gdb`, but not `perf`.
Currently only implemented for ELF.
Signed-off-by: Anton Khirnov <anton@khirnov.net>
Some debuggers/profilers use this metadata to determine which function a
given instruction is in; without it they get can confused by local labels
(if you haven't stripped those). On the other hand, some tools are still
confused even with this metadata. e.g. this fixes `gdb`, but not `perf`.
Currently only implemented for ELF.
Use two separate functions, depending on whether VFP/NEON is available.
This is set to require armv5te - it uses blx, which is only available
since armv5t, but we don't have a separate configure item for that.
(It also uses ldrd, which requires armv5te, but this could be avoided
if necessary.)
Signed-off-by: Martin Storsjö <martin@martin.st>
* commit '2008f76054906e9ff6bf744800af0e5a5bfe61be':
dca: remove unused decode_hf function and quant_d tables
Merged-by: Hendrik Leppkes <h.leppkes@gmail.com>
* commit '711781d7a1714ea4eb0217eb1ba04811978c43d1':
x86: checkasm: check for or handle missing cleanup after MMX instructions
Merged-by: Hendrik Leppkes <h.leppkes@gmail.com>
Check the full FPU tag word instead of only the lower half and simplify
the comparison.
Use upper-case function base name as macro name to instantiate both
checked_call variants.
Not every asm routine is expected clear the MMX state after returning.
It is however a requisite for testing floating point code in checkasm.
Annotate functions requiring cleanup with declare_func_emms() and issue
emms after the call. The remaining functions are checked for having a
cleared MMX state after return.
The vector mode was deprecated in ARMv7-A/VFPv3 and various cpu
implementations do not support it in hardware. Vector mode code will
depending the OS either be emulated in software or result in an illegal
instruction on cpus which does not support it. This was not really
problem in practice since NEON implementations of the same functions are
preferred. It will however become a problem for checkasm which tests
every cpu flag separately.
Since this is a cpu feature newer cpu do not support anymore the
behaviour of this flag differs from the other flags. It can be only
activated by runtime cpu feature selection.
These aren't quite as helpful as the ones in 8bpp, since over there,
we can use pmulhrsw, but here the coefficients have too many bits to
be able to take advantage of pmulhrsw. However, we can still skip
cols for which all coefs are 0, and instead just zero the input data
for the row itx. This helps a few % on overall decoding speed.
The System V ABI on x86-64 specifies that the al register contains an upper
bound of the number of arguments passed in vector registers when calling
variadic functions, so we aren't allowed to clobber it.
checkasm_fail_func() is a variadic function so also zero al before calling it.
Signed-off-by: Anton Khirnov <anton@khirnov.net>
Tested functions are internally kept in a binary search tree for efficient
lookups. The downside of the current implementation is that the tree quickly
becomes unbalanced which causes an unneccessary amount of comparisons between
nodes. Improve this by changing the tree into a self-balancing left-leaning
red-black tree with a worst case lookup/insertion time complexity of O(log n).
Significantly reduces the recursion depth and makes the tests run around 10%
faster overall. The relative performance improvement compared to the existing
non-balanced tree will also most likely increase as more tests are added.
Signed-off-by: Anton Khirnov <anton@khirnov.net>
The System V ABI on x86-64 specifies that the al register contains an upper
bound of the number of arguments passed in vector registers when calling
variadic functions, so we aren't allowed to clobber it.
checkasm_fail_func() is a variadic function so also zero al before calling it.
Tested functions are internally kept in a binary search tree for efficient
lookups. The downside of the current implementation is that the tree quickly
becomes unbalanced which causes an unneccessary amount of comparisons between
nodes. Improve this by changing the tree into a self-balancing left-leaning
red-black tree with a worst case lookup/insertion time complexity of O(log n).
Significantly reduces the recursion depth and makes the tests run around 10%
faster overall. The relative performance improvement compared to the existing
non-balanced tree will also most likely increase as more tests are added.