Martin Storsjö
ad01d06f91
aarch64: hevc: Implement a neon version of hevc_qpel_uni_w_h*_8
...
AWS Graviton 3:
put_hevc_qpel_uni_w_h4_8_c: 159.0
put_hevc_qpel_uni_w_h4_8_neon: 64.2
put_hevc_qpel_uni_w_h4_8_i8mm: 40.0
put_hevc_qpel_uni_w_h6_8_c: 344.7
put_hevc_qpel_uni_w_h6_8_neon: 114.5
put_hevc_qpel_uni_w_h6_8_i8mm: 82.0
put_hevc_qpel_uni_w_h8_8_c: 596.2
put_hevc_qpel_uni_w_h8_8_neon: 132.2
put_hevc_qpel_uni_w_h8_8_i8mm: 106.0
put_hevc_qpel_uni_w_h12_8_c: 1325.0
put_hevc_qpel_uni_w_h12_8_neon: 299.0
put_hevc_qpel_uni_w_h12_8_i8mm: 211.5
put_hevc_qpel_uni_w_h16_8_c: 2300.0
put_hevc_qpel_uni_w_h16_8_neon: 422.0
put_hevc_qpel_uni_w_h16_8_i8mm: 286.2
put_hevc_qpel_uni_w_h24_8_c: 5059.0
put_hevc_qpel_uni_w_h24_8_neon: 912.2
put_hevc_qpel_uni_w_h24_8_i8mm: 664.2
put_hevc_qpel_uni_w_h32_8_c: 9198.2
put_hevc_qpel_uni_w_h32_8_neon: 1638.2
put_hevc_qpel_uni_w_h32_8_i8mm: 1033.7
put_hevc_qpel_uni_w_h48_8_c: 20754.7
put_hevc_qpel_uni_w_h48_8_neon: 3633.7
put_hevc_qpel_uni_w_h48_8_i8mm: 2300.7
put_hevc_qpel_uni_w_h64_8_c: 36854.7
put_hevc_qpel_uni_w_h64_8_neon: 6435.7
put_hevc_qpel_uni_w_h64_8_i8mm: 4039.2
Signed-off-by: Martin Storsjö <martin@martin.st>
2024-03-26 09:03:18 +02:00
Martin Storsjö
de23b384fd
aarch64: hevc: Produce epel_bi_hv functions for both neon and i8mm
...
In addition to just templating, this contains one change to
ff_hevc_put_hevc_epel_bi_hv32_8, by setting the w6 register
which ff_hevc_put_hevc_epel_h32_8_neon requires.
AWS Graviton 3:
put_hevc_epel_bi_hv4_8_c: 176.5
put_hevc_epel_bi_hv4_8_neon: 62.0
put_hevc_epel_bi_hv4_8_i8mm: 58.0
put_hevc_epel_bi_hv6_8_c: 343.7
put_hevc_epel_bi_hv6_8_neon: 109.7
put_hevc_epel_bi_hv6_8_i8mm: 105.7
put_hevc_epel_bi_hv8_8_c: 536.0
put_hevc_epel_bi_hv8_8_neon: 112.7
put_hevc_epel_bi_hv8_8_i8mm: 111.7
put_hevc_epel_bi_hv12_8_c: 1107.7
put_hevc_epel_bi_hv12_8_neon: 254.7
put_hevc_epel_bi_hv12_8_i8mm: 239.0
put_hevc_epel_bi_hv16_8_c: 1927.7
put_hevc_epel_bi_hv16_8_neon: 356.2
put_hevc_epel_bi_hv16_8_i8mm: 334.2
put_hevc_epel_bi_hv24_8_c: 4195.2
put_hevc_epel_bi_hv24_8_neon: 736.7
put_hevc_epel_bi_hv24_8_i8mm: 715.5
put_hevc_epel_bi_hv32_8_c: 7280.5
put_hevc_epel_bi_hv32_8_neon: 1287.7
put_hevc_epel_bi_hv32_8_i8mm: 1162.2
put_hevc_epel_bi_hv48_8_c: 16857.7
put_hevc_epel_bi_hv48_8_neon: 2836.2
put_hevc_epel_bi_hv48_8_i8mm: 2908.5
put_hevc_epel_bi_hv64_8_c: 29248.2
put_hevc_epel_bi_hv64_8_neon: 5051.7
put_hevc_epel_bi_hv64_8_i8mm: 4491.5
Signed-off-by: Martin Storsjö <martin@martin.st>
2024-03-26 09:03:16 +02:00
Martin Storsjö
96e5adda9f
aarch64: hevc: Produce epel_uni_w_hv functions for both neon and i8mm
...
AWS Graviton 3:
put_hevc_epel_uni_w_hv4_8_c: 191.2
put_hevc_epel_uni_w_hv4_8_neon: 87.7
put_hevc_epel_uni_w_hv4_8_i8mm: 83.2
put_hevc_epel_uni_w_hv6_8_c: 349.5
put_hevc_epel_uni_w_hv6_8_neon: 153.0
put_hevc_epel_uni_w_hv6_8_i8mm: 148.5
put_hevc_epel_uni_w_hv8_8_c: 581.2
put_hevc_epel_uni_w_hv8_8_neon: 166.7
put_hevc_epel_uni_w_hv8_8_i8mm: 163.5
put_hevc_epel_uni_w_hv12_8_c: 1230.0
put_hevc_epel_uni_w_hv12_8_neon: 387.7
put_hevc_epel_uni_w_hv12_8_i8mm: 370.2
put_hevc_epel_uni_w_hv16_8_c: 2003.2
put_hevc_epel_uni_w_hv16_8_neon: 501.5
put_hevc_epel_uni_w_hv16_8_i8mm: 490.2
put_hevc_epel_uni_w_hv24_8_c: 4448.7
put_hevc_epel_uni_w_hv24_8_neon: 1092.2
put_hevc_epel_uni_w_hv24_8_i8mm: 1069.7
put_hevc_epel_uni_w_hv32_8_c: 7817.2
put_hevc_epel_uni_w_hv32_8_neon: 1916.2
put_hevc_epel_uni_w_hv32_8_i8mm: 1829.5
put_hevc_epel_uni_w_hv48_8_c: 16728.2
put_hevc_epel_uni_w_hv48_8_neon: 4263.7
put_hevc_epel_uni_w_hv48_8_i8mm: 4342.7
put_hevc_epel_uni_w_hv64_8_c: 29563.2
put_hevc_epel_uni_w_hv64_8_neon: 7474.2
put_hevc_epel_uni_w_hv64_8_i8mm: 7128.5
Signed-off-by: Martin Storsjö <martin@martin.st>
2024-03-26 08:59:58 +02:00
Martin Storsjö
d7294199ab
aarch64: hevc: Produce epel_uni_hv functions for both neon and i8mm
...
AWS Graviton 3:
put_hevc_epel_uni_hv4_8_c: 163.5
put_hevc_epel_uni_hv4_8_neon: 59.7
put_hevc_epel_uni_hv4_8_i8mm: 57.5
put_hevc_epel_uni_hv6_8_c: 344.7
put_hevc_epel_uni_hv6_8_neon: 105.0
put_hevc_epel_uni_hv6_8_i8mm: 102.7
put_hevc_epel_uni_hv8_8_c: 552.2
put_hevc_epel_uni_hv8_8_neon: 111.2
put_hevc_epel_uni_hv8_8_i8mm: 104.0
put_hevc_epel_uni_hv12_8_c: 1195.0
put_hevc_epel_uni_hv12_8_neon: 248.7
put_hevc_epel_uni_hv12_8_i8mm: 229.5
put_hevc_epel_uni_hv16_8_c: 1910.2
put_hevc_epel_uni_hv16_8_neon: 339.5
put_hevc_epel_uni_hv16_8_i8mm: 323.2
put_hevc_epel_uni_hv24_8_c: 4048.2
put_hevc_epel_uni_hv24_8_neon: 737.7
put_hevc_epel_uni_hv24_8_i8mm: 713.7
put_hevc_epel_uni_hv32_8_c: 6865.7
put_hevc_epel_uni_hv32_8_neon: 1285.0
put_hevc_epel_uni_hv32_8_i8mm: 1206.0
put_hevc_epel_uni_hv48_8_c: 15830.5
put_hevc_epel_uni_hv48_8_neon: 2844.7
put_hevc_epel_uni_hv48_8_i8mm: 2914.0
put_hevc_epel_uni_hv64_8_c: 27912.7
put_hevc_epel_uni_hv64_8_neon: 4970.5
put_hevc_epel_uni_hv64_8_i8mm: 4653.7
Signed-off-by: Martin Storsjö <martin@martin.st>
2024-03-26 08:59:28 +02:00
Martin Storsjö
7bf3d14769
aarch64: hevc: Produce epel_hv functions for both plain neon and i8mm
...
AWS Graviton 3:
put_hevc_epel_hv4_8_c: 163.7
put_hevc_epel_hv4_8_neon: 52.5
put_hevc_epel_hv4_8_i8mm: 49.5
put_hevc_epel_hv6_8_c: 292.2
put_hevc_epel_hv6_8_neon: 97.7
put_hevc_epel_hv6_8_i8mm: 101.2
put_hevc_epel_hv8_8_c: 471.0
put_hevc_epel_hv8_8_neon: 106.7
put_hevc_epel_hv8_8_i8mm: 102.5
put_hevc_epel_hv12_8_c: 1030.2
put_hevc_epel_hv12_8_neon: 240.5
put_hevc_epel_hv12_8_i8mm: 215.0
put_hevc_epel_hv16_8_c: 1711.5
put_hevc_epel_hv16_8_neon: 340.2
put_hevc_epel_hv16_8_i8mm: 319.2
put_hevc_epel_hv24_8_c: 3670.0
put_hevc_epel_hv24_8_neon: 702.0
put_hevc_epel_hv24_8_i8mm: 666.5
put_hevc_epel_hv32_8_c: 6785.5
put_hevc_epel_hv32_8_neon: 1247.0
put_hevc_epel_hv32_8_i8mm: 1169.0
put_hevc_epel_hv48_8_c: 14689.7
put_hevc_epel_hv48_8_neon: 2665.2
put_hevc_epel_hv48_8_i8mm: 2740.0
put_hevc_epel_hv64_8_c: 25899.2
put_hevc_epel_hv64_8_neon: 4801.2
put_hevc_epel_hv64_8_i8mm: 4487.7
Signed-off-by: Martin Storsjö <martin@martin.st>
2024-03-26 08:59:19 +02:00
Martin Storsjö
5b5666e5ab
aarch64: hevc: Reorder epel_hv functions to prepare for templating
...
This is a pure reordering of code without changing anything in
the individual functions.
Signed-off-by: Martin Storsjö <martin@martin.st>
2024-03-26 08:59:07 +02:00
Martin Storsjö
e6d4c0e117
aarch64: hevc: Split the epel_*_hv functions into two parts
...
The first horizontal filter can use either i8mm or plain neon
versions, while the second part is a pure neon implementation.
Signed-off-by: Martin Storsjö <martin@martin.st>
2024-03-26 08:59:00 +02:00
Martin Storsjö
54af555bfa
aarch64: hevc: Implement a neon version of hevc_epel_uni_w_h*_8
...
AWS Graviton 3:
put_hevc_epel_uni_w_h4_8_c: 97.2
put_hevc_epel_uni_w_h4_8_neon: 41.2
put_hevc_epel_uni_w_h4_8_i8mm: 35.2
put_hevc_epel_uni_w_h6_8_c: 203.7
put_hevc_epel_uni_w_h6_8_neon: 84.7
put_hevc_epel_uni_w_h6_8_i8mm: 74.7
put_hevc_epel_uni_w_h8_8_c: 345.7
put_hevc_epel_uni_w_h8_8_neon: 94.0
put_hevc_epel_uni_w_h8_8_i8mm: 80.7
put_hevc_epel_uni_w_h12_8_c: 768.7
put_hevc_epel_uni_w_h12_8_neon: 196.7
put_hevc_epel_uni_w_h12_8_i8mm: 169.7
put_hevc_epel_uni_w_h16_8_c: 1313.0
put_hevc_epel_uni_w_h16_8_neon: 290.7
put_hevc_epel_uni_w_h16_8_i8mm: 238.0
put_hevc_epel_uni_w_h24_8_c: 2877.5
put_hevc_epel_uni_w_h24_8_neon: 650.0
put_hevc_epel_uni_w_h24_8_i8mm: 512.0
put_hevc_epel_uni_w_h32_8_c: 5113.5
put_hevc_epel_uni_w_h32_8_neon: 1129.5
put_hevc_epel_uni_w_h32_8_i8mm: 739.2
put_hevc_epel_uni_w_h48_8_c: 11757.0
put_hevc_epel_uni_w_h48_8_neon: 2518.7
put_hevc_epel_uni_w_h48_8_i8mm: 1688.5
put_hevc_epel_uni_w_h64_8_c: 20478.0
put_hevc_epel_uni_w_h64_8_neon: 4411.7
put_hevc_epel_uni_w_h64_8_i8mm: 2884.0
Signed-off-by: Martin Storsjö <martin@martin.st>
2024-03-26 08:58:47 +02:00
Martin Storsjö
6d384298ec
aarch64: hevc: Implement a neon version of put_hevc_epel_h*_8
...
AWS Graviton 3:
put_hevc_epel_h4_8_c: 64.7
put_hevc_epel_h4_8_neon: 25.0
put_hevc_epel_h4_8_i8mm: 21.2
put_hevc_epel_h6_8_c: 130.0
put_hevc_epel_h6_8_neon: 40.7
put_hevc_epel_h6_8_i8mm: 36.5
put_hevc_epel_h8_8_c: 209.0
put_hevc_epel_h8_8_neon: 45.2
put_hevc_epel_h8_8_i8mm: 41.2
put_hevc_epel_h12_8_c: 465.5
put_hevc_epel_h12_8_neon: 104.5
put_hevc_epel_h12_8_i8mm: 86.5
put_hevc_epel_h16_8_c: 830.7
put_hevc_epel_h16_8_neon: 134.2
put_hevc_epel_h16_8_i8mm: 114.0
put_hevc_epel_h24_8_c: 1844.7
put_hevc_epel_h24_8_neon: 282.2
put_hevc_epel_h24_8_i8mm: 277.2
put_hevc_epel_h32_8_c: 3227.5
put_hevc_epel_h32_8_neon: 501.5
put_hevc_epel_h32_8_i8mm: 396.0
put_hevc_epel_h48_8_c: 7229.2
put_hevc_epel_h48_8_neon: 1120.2
put_hevc_epel_h48_8_i8mm: 901.2
put_hevc_epel_h64_8_c: 12869.0
put_hevc_epel_h64_8_neon: 1999.2
put_hevc_epel_h64_8_i8mm: 1610.5
Signed-off-by: Martin Storsjö <martin@martin.st>
2024-03-26 08:58:29 +02:00
Martin Storsjö
8f03c30a17
aarch64: hevc: Use ld1r instead of ldr+dup in hevc_qpel_uni_w_h
...
Signed-off-by: Martin Storsjö <martin@martin.st>
2024-03-26 08:58:20 +02:00
Martin Storsjö
717cc82d28
aarch64: hevc: Specialize put_hevc_\type\()_h*_8_neon for horizontal looping
...
For widths of 32 pixels and more, loop first horizontally,
then vertically.
Previously, this function would process a 16 pixel wide slice
of the block, looping vertically. After processing the whole
height, it would backtrack and process the next 16 pixel wide
slice.
When doing 8tap filtering horizontally, the function must load
7 more pixels (in practice, 8) following the actual inputs, and
this was done for each slice.
By iterating first horizontally throughout each line, then
vertically, we access data in a more cache friendly order, and
we don't need to reload data unnecessarily.
Keep the original order in put_hevc_\type\()_h12_8_neon; the
only suboptimal case there is for width=24. But specializing
an optimal variant for that would require more code, which
might not be worth it.
For the h16 case, this implementation would give a slowdown,
as it now loads the first 8 pixels separately from the rest, but
for larger widths, it is a gain. Therefore, keep the h16 case
as it was (but remove the outer loop), and create a new specialized
version for horizontal looping with 16 pixels at a time.
Before: Cortex A53 A72 A73 Graviton 3
put_hevc_qpel_h16_8_neon: 710.5 667.7 692.5 211.0
put_hevc_qpel_h32_8_neon: 2791.5 2643.5 2732.0 883.5
put_hevc_qpel_h64_8_neon: 10954.0 10657.0 10874.2 3241.5
After:
put_hevc_qpel_h16_8_neon: 697.5 663.5 705.7 212.5
put_hevc_qpel_h32_8_neon: 2767.2 2684.5 2791.2 920.5
put_hevc_qpel_h64_8_neon: 10559.2 10471.5 10932.2 3051.7
Signed-off-by: Martin Storsjö <martin@martin.st>
2024-03-26 08:58:11 +02:00
Martin Storsjö
e3a54cabde
aarch64: hevc: Merge consecutive stores in put_hevc_\type\()_h16_8_neon
...
This gets rid of a couple instructions, but the actual performance
is almost identical on Cortex A72/A73. On Cortex A53, it is a
handful of cycles faster.
Signed-off-by: Martin Storsjö <martin@martin.st>
2024-03-26 08:58:01 +02:00
Martin Storsjö
78db8405c0
aarch64: hevc: Don't iterate with sp in ff_hevc_put_hevc_qpel_uni_w_hv32/64_8_neon_i8mm
...
Many of the routines within hevcdsp_epel_neon and hevcdsp_qpel_neon
store temporary buffers on the stack. When consuming it,
many of these functions use the stack pointer as incremental pointer
for reading the data (instead of storing it in another register),
which is rather unusual.
Technically, this is fine as long as the pointer remains properly
aligned.
However in the case of ff_hevc_put_hevc_qpel_uni_w_hv64_8_neon_i8mm,
after incrementing sp when reading data (within each 16 pixel
wide stripe) it would then reset the stack pointer back to a lower
value, for reading the next 16 pixel wide stripe, expecting the
data to remain untouched.
This can't be assumed; data on the stack below the stack pointer
can be clobbered (e.g. by a signal handler). Some OS ABIs
allow for a little margin that won't be touched, aka a red zone,
but not all do. The ones that do, guarantee 16 or 128 bytes, not
9 KB.
Convert this function to use a separate pointer register to
iterate through the data, retaining the stack pointer to point
at the bottom of the data we require to remain untouched.
Signed-off-by: Martin Storsjö <martin@martin.st>
2024-03-26 08:57:55 +02:00
Martin Storsjö
e66858fbab
aarch64: hevc: Reorder a misplaced function init line
...
Group the epel and qpel functions together.
Signed-off-by: Martin Storsjö <martin@martin.st>
2024-03-26 08:57:50 +02:00
Martin Storsjö
0c5da7be59
aarch64: Fix ff_hevc_put_hevc_epel_h48_8_neon_i8mm
...
The first 32 elements of each row were correct, while the
last 16 were scrambled.
This hasn't been noticed, because the checkasm test erroneously
only checked half of the output (for 8 bit functions), and
apparently none of the samples as part of "fate-hevc" seem to
trigger this specific function.
Signed-off-by: J. Dekker <jdek@itanimul.li>
2024-03-14 13:42:39 +01:00
J. Dekker
570052cd2a
avcodec/aarch64/hevc: add luma deblock NEON
...
Benched using single-threaded full decode on an Ampere Altra.
Bpp Before After Speedup
8 73,3s 65,2s 1.124x
10 114,2s 104,0s 1.098x
12 125,8s 115,7s 1.087x
Signed-off-by: J. Dekker <jdek@itanimul.li>
2024-02-28 10:14:58 +01:00
Mikhail Nitenko
0f745b74ec
lavc/aarch64: h264qpel, add 10-bit lowpass_8_10 based functions
...
Benchmarks A53 A55 A72 A76
avg_h264_qpel_8_mc01_10_c: 936.5 924.0 656.0 504.7
avg_h264_qpel_8_mc01_10_neon: 234.7 202.0 120.7 63.2
avg_h264_qpel_8_mc02_10_c: 921.0 920.0 669.2 493.7
avg_h264_qpel_8_mc02_10_neon: 202.0 173.2 102.7 58.5
avg_h264_qpel_8_mc03_10_c: 936.5 924.0 656.0 509.5
avg_h264_qpel_8_mc03_10_neon: 236.2 203.7 120.0 63.2
avg_h264_qpel_8_mc10_10_c: 1441.0 1437.7 806.7 478.5
avg_h264_qpel_8_mc10_10_neon: 325.7 324.0 153.7 94.2
avg_h264_qpel_8_mc11_10_c: 2160.7 2148.2 1366.7 906.7
avg_h264_qpel_8_mc11_10_neon: 492.0 464.0 242.5 134.5
avg_h264_qpel_8_mc13_10_c: 2157.0 2138.2 1357.0 908.2
avg_h264_qpel_8_mc13_10_neon: 494.0 467.2 242.0 140.0
avg_h264_qpel_8_mc20_10_c: 1433.5 1410.0 785.2 486.0
avg_h264_qpel_8_mc20_10_neon: 293.7 289.7 138.0 91.5
avg_h264_qpel_8_mc30_10_c: 1458.5 1461.7 813.7 483.2
avg_h264_qpel_8_mc30_10_neon: 341.7 339.2 154.0 95.2
avg_h264_qpel_8_mc31_10_c: 2194.7 2197.2 1358.7 928.0
avg_h264_qpel_8_mc31_10_neon: 520.0 495.0 245.5 142.5
avg_h264_qpel_8_mc33_10_c: 2188.0 2205.5 1356.7 910.7
avg_h264_qpel_8_mc33_10_neon: 521.0 494.5 245.7 145.7
avg_h264_qpel_16_mc01_10_c: 3717.2 3595.0 2610.0 2012.0
avg_h264_qpel_16_mc01_10_neon: 920.5 791.5 483.2 240.5
avg_h264_qpel_16_mc02_10_c: 3684.0 3633.0 2659.0 1919.7
avg_h264_qpel_16_mc02_10_neon: 790.7 678.2 409.2 217.0
avg_h264_qpel_16_mc03_10_c: 3726.5 3596.0 2606.7 2010.0
avg_h264_qpel_16_mc03_10_neon: 922.0 792.5 483.2 239.7
avg_h264_qpel_16_mc10_10_c: 5912.0 5803.2 3241.5 1916.7
avg_h264_qpel_16_mc10_10_neon: 1267.5 1277.2 616.5 365.0
avg_h264_qpel_16_mc11_10_c: 8599.2 8482.5 5338.0 3616.2
avg_h264_qpel_16_mc11_10_neon: 1913.0 1827.0 956.2 542.2
avg_h264_qpel_16_mc13_10_c: 8643.7 8488.5 5388.0 3628.5
avg_h264_qpel_16_mc13_10_neon: 1914.7 1828.7 969.2 530.5
avg_h264_qpel_16_mc20_10_c: 5719.5 5641.0 3147.0 1946.2
avg_h264_qpel_16_mc20_10_neon: 1139.5 1150.0 539.5 344.0
avg_h264_qpel_16_mc30_10_c: 5930.0 5872.5 3267.5 1918.0
avg_h264_qpel_16_mc30_10_neon: 1331.5 1341.2 616.5 369.5
avg_h264_qpel_16_mc31_10_c: 8758.7 8697.7 5353.0 3630.7
avg_h264_qpel_16_mc31_10_neon: 2018.7 1941.7 982.2 574.7
avg_h264_qpel_16_mc33_10_c: 8683.2 8675.2 5339.2 3634.7
avg_h264_qpel_16_mc33_10_neon: 2019.7 1940.2 994.5 566.0
put_h264_qpel_8_mc01_10_c: 854.2 843.0 599.2 478.0
put_h264_qpel_8_mc01_10_neon: 192.7 168.0 101.7 56.7
put_h264_qpel_8_mc02_10_c: 766.5 760.0 550.2 441.0
put_h264_qpel_8_mc02_10_neon: 160.0 139.2 88.7 53.0
put_h264_qpel_8_mc03_10_c: 854.2 843.0 599.2 479.0
put_h264_qpel_8_mc03_10_neon: 194.2 169.7 102.0 56.2
put_h264_qpel_8_mc10_10_c: 1352.7 1353.7 749.7 446.7
put_h264_qpel_8_mc10_10_neon: 289.7 294.2 135.5 88.5
put_h264_qpel_8_mc11_10_c: 2080.0 2066.2 1309.5 876.7
put_h264_qpel_8_mc11_10_neon: 450.0 429.7 229.7 131.2
put_h264_qpel_8_mc13_10_c: 2074.7 2060.2 1294.5 870.5
put_h264_qpel_8_mc13_10_neon: 452.5 434.5 226.5 130.0
put_h264_qpel_8_mc20_10_c: 1221.5 1216.0 684.5 399.7
put_h264_qpel_8_mc20_10_neon: 257.7 262.5 121.2 78.7
put_h264_qpel_8_mc30_10_c: 1379.0 1374.7 757.2 449.5
put_h264_qpel_8_mc30_10_neon: 305.7 310.2 135.5 86.5
put_h264_qpel_8_mc31_10_c: 2109.2 2119.7 1299.5 878.0
put_h264_qpel_8_mc31_10_neon: 478.0 458.5 226.0 137.2
put_h264_qpel_8_mc33_10_c: 2101.5 2115.2 1306.5 887.0
put_h264_qpel_8_mc33_10_neon: 479.0 458.7 229.7 141.7
put_h264_qpel_16_mc01_10_c: 3485.7 3396.7 2460.5 1914.5
put_h264_qpel_16_mc01_10_neon: 752.5 665.5 397.0 213.2
put_h264_qpel_16_mc02_10_c: 3103.5 3023.2 2154.7 1720.7
put_h264_qpel_16_mc02_10_neon: 622.7 551.2 347.7 196.2
put_h264_qpel_16_mc03_10_c: 3486.2 3394.0 2436.5 1917.7
put_h264_qpel_16_mc03_10_neon: 754.0 666.5 397.0 215.7
put_h264_qpel_16_mc10_10_c: 5533.0 5488.5 2989.0 1783.0
put_h264_qpel_16_mc10_10_neon: 1123.5 1165.2 535.2 334.7
put_h264_qpel_16_mc11_10_c: 8437.7 8281.2 5209.0 3510.7
put_h264_qpel_16_mc11_10_neon: 1745.0 1697.0 878.5 513.5
put_h264_qpel_16_mc13_10_c: 8567.7 8468.0 5221.5 3528.0
put_h264_qpel_16_mc13_10_neon: 1751.7 1698.2 889.2 507.0
put_h264_qpel_16_mc20_10_c: 4907.5 4885.0 2786.2 1607.5
put_h264_qpel_16_mc20_10_neon: 995.5 1034.5 475.5 307.0
put_h264_qpel_16_mc30_10_c: 5579.7 5537.7 3045.2 1789.5
put_h264_qpel_16_mc30_10_neon: 1187.5 1231.2 532.5 334.5
put_h264_qpel_16_mc31_10_c: 8677.2 8672.5 5204.2 3516.0
put_h264_qpel_16_mc31_10_neon: 1850.7 1813.2 893.0 545.2
put_h264_qpel_16_mc33_10_c: 8688.7 8671.2 5223.2 3512.0
put_h264_qpel_16_mc33_10_neon: 1851.7 1814.2 908.5 535.2
Signed-off-by: Mikhail Nitenko <mnitenko@gmail.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2023-12-07 23:20:14 +02:00
Logan Lyu
fa0470347e
lavc/aarch64: new optimization for 8-bit hevc_qpel_bi_hv
...
put_hevc_qpel_bi_hv4_8_c: 433.7
put_hevc_qpel_bi_hv4_8_i8mm: 117.9
put_hevc_qpel_bi_hv6_8_c: 803.9
put_hevc_qpel_bi_hv6_8_i8mm: 252.7
put_hevc_qpel_bi_hv8_8_c: 1296.4
put_hevc_qpel_bi_hv8_8_i8mm: 316.2
put_hevc_qpel_bi_hv12_8_c: 2867.4
put_hevc_qpel_bi_hv12_8_i8mm: 669.2
put_hevc_qpel_bi_hv16_8_c: 4709.4
put_hevc_qpel_bi_hv16_8_i8mm: 929.9
put_hevc_qpel_bi_hv24_8_c: 9639.7
put_hevc_qpel_bi_hv24_8_i8mm: 2072.4
put_hevc_qpel_bi_hv32_8_c: 16663.7
put_hevc_qpel_bi_hv32_8_i8mm: 3391.4
put_hevc_qpel_bi_hv48_8_c: 36972.9
put_hevc_qpel_bi_hv48_8_i8mm: 7505.7
put_hevc_qpel_bi_hv64_8_c: 64106.4
put_hevc_qpel_bi_hv64_8_i8mm: 13145.2
Co-Authored-By: J. Dekker <jdek@itanimul.li>
Signed-off-by: Martin Storsjö <martin@martin.st>
2023-12-01 21:25:39 +02:00
Logan Lyu
595f97028b
lavc/aarch64: new optimization for 8-bit hevc_qpel_bi_v
...
put_hevc_qpel_bi_v4_8_c: 166.1
put_hevc_qpel_bi_v4_8_neon: 61.9
put_hevc_qpel_bi_v6_8_c: 309.4
put_hevc_qpel_bi_v6_8_neon: 75.6
put_hevc_qpel_bi_v8_8_c: 531.1
put_hevc_qpel_bi_v8_8_neon: 78.1
put_hevc_qpel_bi_v12_8_c: 1139.9
put_hevc_qpel_bi_v12_8_neon: 238.1
put_hevc_qpel_bi_v16_8_c: 2063.6
put_hevc_qpel_bi_v16_8_neon: 308.9
put_hevc_qpel_bi_v24_8_c: 4317.1
put_hevc_qpel_bi_v24_8_neon: 629.9
put_hevc_qpel_bi_v32_8_c: 8241.9
put_hevc_qpel_bi_v32_8_neon: 1140.1
put_hevc_qpel_bi_v48_8_c: 18422.9
put_hevc_qpel_bi_v48_8_neon: 2533.9
put_hevc_qpel_bi_v64_8_c: 37508.6
put_hevc_qpel_bi_v64_8_neon: 4520.1
Co-Authored-By: J. Dekker <jdek@itanimul.li>
Signed-off-by: Martin Storsjö <martin@martin.st>
2023-12-01 21:25:39 +02:00
Logan Lyu
00290a64f7
lavc/aarch64: new optimization for 8-bit hevc_epel_bi_hv
...
put_hevc_epel_bi_hv4_8_c: 242.9
put_hevc_epel_bi_hv4_8_i8mm: 68.6
put_hevc_epel_bi_hv6_8_c: 402.4
put_hevc_epel_bi_hv6_8_i8mm: 135.9
put_hevc_epel_bi_hv8_8_c: 636.4
put_hevc_epel_bi_hv8_8_i8mm: 145.6
put_hevc_epel_bi_hv12_8_c: 1363.1
put_hevc_epel_bi_hv12_8_i8mm: 324.1
put_hevc_epel_bi_hv16_8_c: 2222.1
put_hevc_epel_bi_hv16_8_i8mm: 509.1
put_hevc_epel_bi_hv24_8_c: 4793.4
put_hevc_epel_bi_hv24_8_i8mm: 1091.9
put_hevc_epel_bi_hv32_8_c: 8393.9
put_hevc_epel_bi_hv32_8_i8mm: 1720.6
put_hevc_epel_bi_hv48_8_c: 19526.6
put_hevc_epel_bi_hv48_8_i8mm: 4285.9
put_hevc_epel_bi_hv64_8_c: 33915.4
put_hevc_epel_bi_hv64_8_i8mm: 6783.6
Co-Authored-By: J. Dekker <jdek@itanimul.li>
Signed-off-by: Martin Storsjö <martin@martin.st>
2023-12-01 21:25:39 +02:00
Logan Lyu
0448f27f41
lavc/aarch64: new optimization for 8-bit hevc_epel_bi_v
...
put_hevc_epel_bi_v4_8_c: 138.4
put_hevc_epel_bi_v4_8_neon: 33.7
put_hevc_epel_bi_v6_8_c: 302.9
put_hevc_epel_bi_v6_8_neon: 46.7
put_hevc_epel_bi_v8_8_c: 408.7
put_hevc_epel_bi_v8_8_neon: 48.7
put_hevc_epel_bi_v12_8_c: 779.4
put_hevc_epel_bi_v12_8_neon: 139.7
put_hevc_epel_bi_v16_8_c: 1344.9
put_hevc_epel_bi_v16_8_neon: 160.2
put_hevc_epel_bi_v24_8_c: 2981.7
put_hevc_epel_bi_v24_8_neon: 344.9
put_hevc_epel_bi_v32_8_c: 5280.9
put_hevc_epel_bi_v32_8_neon: 618.4
put_hevc_epel_bi_v48_8_c: 12494.9
put_hevc_epel_bi_v48_8_neon: 1364.4
put_hevc_epel_bi_v64_8_c: 22127.7
put_hevc_epel_bi_v64_8_neon: 2473.7
Co-Authored-By: J. Dekker <jdek@itanimul.li>
Signed-off-by: Martin Storsjö <martin@martin.st>
2023-12-01 21:25:39 +02:00
Logan Lyu
216275bd80
lavc/aarch64: new optimization for 8-bit hevc_epel_bi_h
...
put_hevc_epel_bi_h4_8_c: 96.0
put_hevc_epel_bi_h4_8_neon: 36.3
put_hevc_epel_bi_h6_8_c: 288.3
put_hevc_epel_bi_h6_8_neon: 59.3
put_hevc_epel_bi_h8_8_c: 358.5
put_hevc_epel_bi_h8_8_neon: 61.5
put_hevc_epel_bi_h12_8_c: 759.8
put_hevc_epel_bi_h12_8_neon: 159.5
put_hevc_epel_bi_h16_8_c: 1307.0
put_hevc_epel_bi_h16_8_neon: 182.0
put_hevc_epel_bi_h24_8_c: 2778.3
put_hevc_epel_bi_h24_8_neon: 430.5
put_hevc_epel_bi_h32_8_c: 4952.3
put_hevc_epel_bi_h32_8_neon: 679.5
put_hevc_epel_bi_h48_8_c: 11803.3
put_hevc_epel_bi_h48_8_neon: 1443.5
put_hevc_epel_bi_h64_8_c: 20654.8
put_hevc_epel_bi_h64_8_neon: 2737.0
put_hevc_qpel_bi_h4_8_c: 140.0
put_hevc_qpel_bi_h4_8_neon: 111.5
put_hevc_qpel_bi_h6_8_c: 318.0
put_hevc_qpel_bi_h6_8_neon: 85.8
put_hevc_qpel_bi_h8_8_c: 536.5
put_hevc_qpel_bi_h8_8_neon: 95.3
put_hevc_qpel_bi_h12_8_c: 1188.5
put_hevc_qpel_bi_h12_8_neon: 291.3
put_hevc_qpel_bi_h16_8_c: 2064.3
put_hevc_qpel_bi_h16_8_neon: 365.3
put_hevc_qpel_bi_h24_8_c: 4757.5
put_hevc_qpel_bi_h24_8_neon: 1010.0
put_hevc_qpel_bi_h32_8_c: 8351.8
put_hevc_qpel_bi_h32_8_neon: 2917.8
put_hevc_qpel_bi_h48_8_c: 19299.8
put_hevc_qpel_bi_h48_8_neon: 2976.8
put_hevc_qpel_bi_h64_8_c: 34182.5
put_hevc_qpel_bi_h64_8_neon: 5236.3
Co-Authored-By: J. Dekker <jdek@itanimul.li>
Signed-off-by: Martin Storsjö <martin@martin.st>
2023-12-01 21:25:39 +02:00
Logan Lyu
40cf4a5ca3
lavc/aarch64: new optimization for 8-bit hevc_pel_bi_pixels
...
put_hevc_pel_bi_pixels4_8_c: 54.7
put_hevc_pel_bi_pixels4_8_neon: 43.0
put_hevc_pel_bi_pixels6_8_c: 94.7
put_hevc_pel_bi_pixels6_8_neon: 37.0
put_hevc_pel_bi_pixels8_8_c: 171.0
put_hevc_pel_bi_pixels8_8_neon: 24.0
put_hevc_pel_bi_pixels12_8_c: 354.0
put_hevc_pel_bi_pixels12_8_neon: 68.7
put_hevc_pel_bi_pixels16_8_c: 588.2
put_hevc_pel_bi_pixels16_8_neon: 77.5
put_hevc_pel_bi_pixels24_8_c: 1670.7
put_hevc_pel_bi_pixels24_8_neon: 173.0
put_hevc_pel_bi_pixels32_8_c: 2267.7
put_hevc_pel_bi_pixels32_8_neon: 281.2
put_hevc_pel_bi_pixels48_8_c: 5787.5
put_hevc_pel_bi_pixels48_8_neon: 673.5
put_hevc_pel_bi_pixels64_8_c: 9897.0
put_hevc_pel_bi_pixels64_8_neon: 1159.5
Co-Authored-By: J. Dekker <jdek@itanimul.li>
Signed-off-by: Martin Storsjö <martin@martin.st>
2023-12-01 21:25:39 +02:00
xufuji456
cc86343b96
lavc/hevcdsp_qpel_neon: using movi.16b instead of movi.2d
...
Building iOS platform with arm64, the compiler has a warning: "instruction movi.2d with immediate #0 may not function correctly on this CPU, converting to movi.16b"
Signed-off-by: xufuji456 <839789740@qq.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2023-11-28 15:54:49 +02:00
Logan Lyu
55f28eb627
lavc/aarch64: new optimization for 8-bit hevc_qpel_hv
...
checkasm bench:
put_hevc_qpel_hv4_8_c: 422.1
put_hevc_qpel_hv4_8_i8mm: 101.6
put_hevc_qpel_hv6_8_c: 756.4
put_hevc_qpel_hv6_8_i8mm: 225.9
put_hevc_qpel_hv8_8_c: 1189.9
put_hevc_qpel_hv8_8_i8mm: 296.6
put_hevc_qpel_hv12_8_c: 2407.4
put_hevc_qpel_hv12_8_i8mm: 552.4
put_hevc_qpel_hv16_8_c: 4021.4
put_hevc_qpel_hv16_8_i8mm: 886.6
put_hevc_qpel_hv24_8_c: 8992.1
put_hevc_qpel_hv24_8_i8mm: 1968.9
put_hevc_qpel_hv32_8_c: 15197.9
put_hevc_qpel_hv32_8_i8mm: 3209.4
put_hevc_qpel_hv48_8_c: 32811.1
put_hevc_qpel_hv48_8_i8mm: 7442.1
put_hevc_qpel_hv64_8_c: 58106.1
put_hevc_qpel_hv64_8_i8mm: 12423.9
Co-Authored-By: J. Dekker <jdek@itanimul.li>
Signed-off-by: Martin Storsjö <martin@martin.st>
2023-10-31 14:14:21 +02:00
Logan Lyu
97a9d12657
lavc/aarch64: new optimization for 8-bit hevc_qpel_v
...
checkasm bench:
put_hevc_qpel_v4_8_c: 138.1
put_hevc_qpel_v4_8_neon: 41.1
put_hevc_qpel_v6_8_c: 276.6
put_hevc_qpel_v6_8_neon: 60.9
put_hevc_qpel_v8_8_c: 478.9
put_hevc_qpel_v8_8_neon: 72.9
put_hevc_qpel_v12_8_c: 1072.6
put_hevc_qpel_v12_8_neon: 203.9
put_hevc_qpel_v16_8_c: 1852.1
put_hevc_qpel_v16_8_neon: 264.1
put_hevc_qpel_v24_8_c: 4137.6
put_hevc_qpel_v24_8_neon: 586.9
put_hevc_qpel_v32_8_c: 7579.1
put_hevc_qpel_v32_8_neon: 1036.6
put_hevc_qpel_v48_8_c: 16355.6
put_hevc_qpel_v48_8_neon: 2326.4
put_hevc_qpel_v64_8_c: 33545.1
put_hevc_qpel_v64_8_neon: 4126.4
Co-Authored-By: J. Dekker <jdek@itanimul.li>
Signed-off-by: Martin Storsjö <martin@martin.st>
2023-10-31 14:14:21 +02:00
Logan Lyu
265450b89e
lavc/aarch64: new optimization for 8-bit hevc_epel_hv
...
checkasm bench:
put_hevc_epel_hv4_8_c: 213.7
put_hevc_epel_hv4_8_i8mm: 59.4
put_hevc_epel_hv6_8_c: 350.9
put_hevc_epel_hv6_8_i8mm: 130.2
put_hevc_epel_hv8_8_c: 548.7
put_hevc_epel_hv8_8_i8mm: 136.9
put_hevc_epel_hv12_8_c: 1126.7
put_hevc_epel_hv12_8_i8mm: 302.2
put_hevc_epel_hv16_8_c: 1925.2
put_hevc_epel_hv16_8_i8mm: 459.9
put_hevc_epel_hv24_8_c: 4301.9
put_hevc_epel_hv24_8_i8mm: 1024.9
put_hevc_epel_hv32_8_c: 7509.2
put_hevc_epel_hv32_8_i8mm: 1680.4
put_hevc_epel_hv48_8_c: 16566.9
put_hevc_epel_hv48_8_i8mm: 3945.4
put_hevc_epel_hv64_8_c: 29134.2
put_hevc_epel_hv64_8_i8mm: 6567.7
Co-Authored-By: J. Dekker <jdek@itanimul.li>
Signed-off-by: Martin Storsjö <martin@martin.st>
2023-10-31 14:02:53 +02:00
Logan Lyu
22c7291506
lavc/aarch64: new optimization for 8-bit hevc_epel_v
...
checkasm bench:
put_hevc_epel_v4_8_c: 79.9
put_hevc_epel_v4_8_neon: 25.7
put_hevc_epel_v6_8_c: 151.4
put_hevc_epel_v6_8_neon: 46.4
put_hevc_epel_v8_8_c: 250.9
put_hevc_epel_v8_8_neon: 41.7
put_hevc_epel_v12_8_c: 542.7
put_hevc_epel_v12_8_neon: 108.7
put_hevc_epel_v16_8_c: 939.4
put_hevc_epel_v16_8_neon: 169.2
put_hevc_epel_v24_8_c: 2104.9
put_hevc_epel_v24_8_neon: 307.9
put_hevc_epel_v32_8_c: 3713.9
put_hevc_epel_v32_8_neon: 524.2
put_hevc_epel_v48_8_c: 8175.2
put_hevc_epel_v48_8_neon: 1197.2
put_hevc_epel_v64_8_c: 16049.4
put_hevc_epel_v64_8_neon: 2094.9
Co-Authored-By: J. Dekker <jdek@itanimul.li>
Signed-off-by: Martin Storsjö <martin@martin.st>
2023-10-31 14:02:53 +02:00
Logan Lyu
772865717b
lavc/aarch64: new optimization for 8-bit hevc_epel_pixels and and hevc_qpel_pixels
...
checkasm bench:
put_hevc_pel_pixels4_8_c: 33.7
put_hevc_pel_pixels4_8_neon: 20.2
put_hevc_pel_pixels6_8_c: 61.4
put_hevc_pel_pixels6_8_neon: 25.4
put_hevc_pel_pixels8_8_c: 121.4
put_hevc_pel_pixels8_8_neon: 16.9
put_hevc_pel_pixels12_8_c: 199.9
put_hevc_pel_pixels12_8_neon: 40.2
put_hevc_pel_pixels16_8_c: 355.9
put_hevc_pel_pixels16_8_neon: 43.4
put_hevc_pel_pixels24_8_c: 774.7
put_hevc_pel_pixels24_8_neon: 78.9
put_hevc_pel_pixels32_8_c: 1345.2
put_hevc_pel_pixels32_8_neon: 152.2
put_hevc_pel_pixels48_8_c: 2963.7
put_hevc_pel_pixels48_8_neon: 309.4
put_hevc_pel_pixels64_8_c: 5236.2
put_hevc_pel_pixels64_8_neon: 514.2
Co-Authored-By: J. Dekker <jdek@itanimul.li>
Signed-off-by: Martin Storsjö <martin@martin.st>
2023-10-31 14:02:53 +02:00
Martin Storsjö
a4877f1ec1
aarch64: Only enable extensions in the intended files/regions
...
This eases actual development of the assembly functions, by only
allowing extension instructions within the sections that explicitly
enable them, instead of having all extensions enabled everywhere.
Signed-off-by: Martin Storsjö <martin@martin.st>
2023-10-24 14:46:20 +03:00
Martin Storsjö
1762975ba1
libavcodec/aarch64/hevc: Require consistent use of trailing semicolon
...
Signed-off-by: Martin Storsjö <martin@martin.st>
2023-10-23 10:39:12 +03:00
Martin Storsjö
a76b409dd0
aarch64: Reindent all assembly to 8/24 column indentation
...
libavcodec/aarch64/vc1dsp_neon.S is skipped here, as it intentionally
uses a layered indentation style to visually show how different
unrolled/interleaved phases fit together.
Signed-off-by: Martin Storsjö <martin@martin.st>
2023-10-21 23:25:54 +03:00
Martin Storsjö
7f905f3672
aarch64: Make the indentation more consistent
...
Some functions have slightly different indentation styles; try
to match the surrounding code.
libavcodec/aarch64/vc1dsp_neon.S is skipped here, as it intentionally
uses a layered indentation style to visually show how different
unrolled/interleaved phases fit together.
Signed-off-by: Martin Storsjö <martin@martin.st>
2023-10-21 23:25:29 +03:00
Martin Storsjö
93cda5a9c2
aarch64: Lowercase UXTW/SXTW and similar flags
...
Signed-off-by: Martin Storsjö <martin@martin.st>
2023-10-21 23:25:23 +03:00
Martin Storsjö
184103b310
aarch64: Consistently use lowercase for vector element specifiers
...
Signed-off-by: Martin Storsjö <martin@martin.st>
2023-10-21 23:25:18 +03:00
Andreas Rheinhardt
6f7bf64dbc
avcodec: Remove DCT, FFT, MDCT and RDFT
...
They were replaced by TX from libavutil; the tremendous work
to get to this point (both creating TX as well as porting
the users of the components removed in this commit) was
completely performed by Lynne alone.
Removing the subsystems from configure may break some command lines,
because the --disable-fft etc. options are no longer recognized.
Co-authored-by: Lynne <dev@lynne.ee>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2023-10-01 02:25:09 +02:00
Logan Lyu
8fa83ad70f
lavc/aarch64: new optimization for 8-bit hevc_qpel_uni_hv
...
checkasm bench:
put_hevc_qpel_uni_hv4_8_c: 489.2
put_hevc_qpel_uni_hv4_8_i8mm: 105.7
put_hevc_qpel_uni_hv6_8_c: 852.7
put_hevc_qpel_uni_hv6_8_i8mm: 268.7
put_hevc_qpel_uni_hv8_8_c: 1345.7
put_hevc_qpel_uni_hv8_8_i8mm: 300.4
put_hevc_qpel_uni_hv12_8_c: 2757.4
put_hevc_qpel_uni_hv12_8_i8mm: 581.4
put_hevc_qpel_uni_hv16_8_c: 4458.9
put_hevc_qpel_uni_hv16_8_i8mm: 860.2
put_hevc_qpel_uni_hv24_8_c: 9582.2
put_hevc_qpel_uni_hv24_8_i8mm: 2086.7
put_hevc_qpel_uni_hv32_8_c: 16401.9
put_hevc_qpel_uni_hv32_8_i8mm: 3217.4
put_hevc_qpel_uni_hv48_8_c: 36402.4
put_hevc_qpel_uni_hv48_8_i8mm: 7082.7
put_hevc_qpel_uni_hv64_8_c: 62713.2
put_hevc_qpel_uni_hv64_8_i8mm: 12408.9
Co-Authored-By: J. Dekker <jdek@itanimul.li>
Signed-off-by: Martin Storsjö <martin@martin.st>
2023-09-26 15:50:44 +03:00
Logan Lyu
23ca61b7de
lavc/aarch64: new optimization for 8-bit hevc_qpel_uni_v
...
checkasm bench:
put_hevc_qpel_uni_v4_8_c: 146.2
put_hevc_qpel_uni_v4_8_neon: 43.2
put_hevc_qpel_uni_v6_8_c: 303.9
put_hevc_qpel_uni_v6_8_neon: 69.7
put_hevc_qpel_uni_v8_8_c: 495.2
put_hevc_qpel_uni_v8_8_neon: 74.7
put_hevc_qpel_uni_v12_8_c: 1100.9
put_hevc_qpel_uni_v12_8_neon: 222.4
put_hevc_qpel_uni_v16_8_c: 1955.2
put_hevc_qpel_uni_v16_8_neon: 269.2
put_hevc_qpel_uni_v24_8_c: 4571.9
put_hevc_qpel_uni_v24_8_neon: 832.4
put_hevc_qpel_uni_v32_8_c: 8226.4
put_hevc_qpel_uni_v32_8_neon: 1035.7
put_hevc_qpel_uni_v48_8_c: 18324.2
put_hevc_qpel_uni_v48_8_neon: 2321.2
put_hevc_qpel_uni_v64_8_c: 37659.4
put_hevc_qpel_uni_v64_8_neon: 4122.2
Co-Authored-By: J. Dekker <jdek@itanimul.li>
Signed-off-by: Martin Storsjö <martin@martin.st>
2023-09-26 15:50:44 +03:00
Logan Lyu
b7a3150bc5
lavc/aarch64: new optimization for 8-bit hevc_epel_uni_hv
...
checkasm bench:
put_hevc_epel_uni_hv4_8_c: 204.7
put_hevc_epel_uni_hv4_8_i8mm: 70.2
put_hevc_epel_uni_hv6_8_c: 378.2
put_hevc_epel_uni_hv6_8_i8mm: 131.9
put_hevc_epel_uni_hv8_8_c: 637.7
put_hevc_epel_uni_hv8_8_i8mm: 137.9
put_hevc_epel_uni_hv12_8_c: 1301.9
put_hevc_epel_uni_hv12_8_i8mm: 314.2
put_hevc_epel_uni_hv16_8_c: 2203.4
put_hevc_epel_uni_hv16_8_i8mm: 454.7
put_hevc_epel_uni_hv24_8_c: 4848.2
put_hevc_epel_uni_hv24_8_i8mm: 1065.2
put_hevc_epel_uni_hv32_8_c: 8517.4
put_hevc_epel_uni_hv32_8_i8mm: 1898.4
put_hevc_epel_uni_hv48_8_c: 19591.7
put_hevc_epel_uni_hv48_8_i8mm: 4107.2
put_hevc_epel_uni_hv64_8_c: 33880.2
put_hevc_epel_uni_hv64_8_i8mm: 6568.7
Co-Authored-By: J. Dekker <jdek@itanimul.li>
Signed-off-by: Martin Storsjö <martin@martin.st>
2023-09-26 15:50:44 +03:00
Logan Lyu
c0374f77f4
lavc/aarch64: move macros calc_epelh, calc_epelh2, load_epel_filterh
...
Signed-off-by: Martin Storsjö <martin@martin.st>
2023-09-26 15:50:40 +03:00
Logan Lyu
7ce5a2f640
lavc/aarch64: new optimization for 8-bit hevc_epel_uni_v
...
checkasm bench:
put_hevc_epel_uni_hv64_8_i8mm: 6568.7
put_hevc_epel_uni_v4_8_c: 88.7
put_hevc_epel_uni_v4_8_neon: 32.7
put_hevc_epel_uni_v6_8_c: 185.4
put_hevc_epel_uni_v6_8_neon: 44.9
put_hevc_epel_uni_v8_8_c: 333.9
put_hevc_epel_uni_v8_8_neon: 44.4
put_hevc_epel_uni_v12_8_c: 728.7
put_hevc_epel_uni_v12_8_neon: 119.7
put_hevc_epel_uni_v16_8_c: 1224.2
put_hevc_epel_uni_v16_8_neon: 139.7
put_hevc_epel_uni_v24_8_c: 2531.2
put_hevc_epel_uni_v24_8_neon: 329.9
put_hevc_epel_uni_v32_8_c: 4739.9
put_hevc_epel_uni_v32_8_neon: 562.7
put_hevc_epel_uni_v48_8_c: 10618.7
put_hevc_epel_uni_v48_8_neon: 1256.2
put_hevc_epel_uni_v64_8_c: 19169.9
put_hevc_epel_uni_v64_8_neon: 2179.2
Co-Authored-By: J. Dekker <jdek@itanimul.li>
Signed-off-by: Martin Storsjö <martin@martin.st>
2023-09-26 15:50:40 +03:00
Casey Smalley
b98ee1a355
aarch64/hevc: Replace br return with ret
...
This patch changes the return instruction in the tr_32x4 macro from
BR to RET.
Function returns should always use the RET instruction instead of BR,
to avoid interfering with branch prediction.
On devices that support BTI, this is observeable as a landing pad is
required when branching with BR. The change fixes
fate-hevc-hdr-vivid-metadata when on hardware with BTI support.
Signed-off-by: Casey Smalley <casey.smalley@arm.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2023-08-08 13:46:07 +03:00
Reimar Döffinger
dcff15692d
hevcdsp_idct_neon.S: Avoid unnecessary mov.
...
ret can be given an argument instead.
This is also consistent with how other assembler code
in FFmpeg does it.
Signed-off-by: Reimar Döffinger <Reimar.Doeffinger@gmx.de>
2023-07-29 16:05:23 +02:00
Rémi Denis-Courmont
82cb4b1c05
lavc/aarch64: remove bogus HAVE_VFP guard
...
The IMDCT offset is only relevant for NEON optimisations. There are no
VFP optimisations here that would justify the HAVE_VFP flag. In
practice, this makes no difference because HAVE_NEON is practically
always true for standard Armv8 platforms.
2023-07-15 22:56:30 +03:00
Logan Lyu
9557bf26b3
lavc/aarch64: new optimization for 8-bit hevc_epel_uni_w_hv
...
put_hevc_epel_uni_w_hv4_8_c: 254.6
put_hevc_epel_uni_w_hv4_8_i8mm: 102.9
put_hevc_epel_uni_w_hv6_8_c: 411.6
put_hevc_epel_uni_w_hv6_8_i8mm: 221.6
put_hevc_epel_uni_w_hv8_8_c: 669.4
put_hevc_epel_uni_w_hv8_8_i8mm: 214.9
put_hevc_epel_uni_w_hv12_8_c: 1412.6
put_hevc_epel_uni_w_hv12_8_i8mm: 481.4
put_hevc_epel_uni_w_hv16_8_c: 2425.4
put_hevc_epel_uni_w_hv16_8_i8mm: 647.4
put_hevc_epel_uni_w_hv24_8_c: 5384.1
put_hevc_epel_uni_w_hv24_8_i8mm: 1450.6
put_hevc_epel_uni_w_hv32_8_c: 9470.9
put_hevc_epel_uni_w_hv32_8_i8mm: 2497.1
put_hevc_epel_uni_w_hv48_8_c: 20930.1
put_hevc_epel_uni_w_hv48_8_i8mm: 5635.9
put_hevc_epel_uni_w_hv64_8_c: 36682.9
put_hevc_epel_uni_w_hv64_8_i8mm: 9712.6
Signed-off-by: Martin Storsjö <martin@martin.st>
2023-07-14 21:19:12 +03:00
Logan Lyu
d48c89701c
lavc/aarch64: new optimization for 8-bit hevc_epel_h
...
put_hevc_epel_h4_8_c: 67.1
put_hevc_epel_h4_8_i8mm: 21.1
put_hevc_epel_h6_8_c: 147.1
put_hevc_epel_h6_8_i8mm: 45.1
put_hevc_epel_h8_8_c: 237.4
put_hevc_epel_h8_8_i8mm: 72.1
put_hevc_epel_h12_8_c: 527.4
put_hevc_epel_h12_8_i8mm: 115.4
put_hevc_epel_h16_8_c: 943.6
put_hevc_epel_h16_8_i8mm: 153.9
put_hevc_epel_h24_8_c: 2105.4
put_hevc_epel_h24_8_i8mm: 384.4
put_hevc_epel_h32_8_c: 3631.4
put_hevc_epel_h32_8_i8mm: 519.9
put_hevc_epel_h48_8_c: 8082.1
put_hevc_epel_h48_8_i8mm: 1110.4
put_hevc_epel_h64_8_c: 14400.6
put_hevc_epel_h64_8_i8mm: 2057.1
put_hevc_qpel_h4_8_c: 124.9
put_hevc_qpel_h4_8_neon: 43.1
put_hevc_qpel_h4_8_i8mm: 33.1
put_hevc_qpel_h6_8_c: 269.4
put_hevc_qpel_h6_8_neon: 90.6
put_hevc_qpel_h6_8_i8mm: 61.4
put_hevc_qpel_h8_8_c: 477.6
put_hevc_qpel_h8_8_neon: 82.1
put_hevc_qpel_h8_8_i8mm: 99.9
put_hevc_qpel_h12_8_c: 1062.4
put_hevc_qpel_h12_8_neon: 226.9
put_hevc_qpel_h12_8_i8mm: 170.9
put_hevc_qpel_h16_8_c: 1880.6
put_hevc_qpel_h16_8_neon: 302.9
put_hevc_qpel_h16_8_i8mm: 251.4
put_hevc_qpel_h24_8_c: 4221.9
put_hevc_qpel_h24_8_neon: 893.9
put_hevc_qpel_h24_8_i8mm: 626.1
put_hevc_qpel_h32_8_c: 7437.6
put_hevc_qpel_h32_8_neon: 1189.9
put_hevc_qpel_h32_8_i8mm: 959.1
put_hevc_qpel_h48_8_c: 16838.4
put_hevc_qpel_h48_8_neon: 2727.9
put_hevc_qpel_h48_8_i8mm: 2163.9
put_hevc_qpel_h64_8_c: 29982.1
put_hevc_qpel_h64_8_neon: 4777.6
Signed-off-by: Martin Storsjö <martin@martin.st>
2023-07-14 21:19:12 +03:00
Logan Lyu
668eb4c00e
lavc/aarch64: new optimization for 8-bit hevc_epel_uni_w_v
...
put_hevc_epel_uni_w_v4_8_c: 116.1
put_hevc_epel_uni_w_v4_8_neon: 48.6
put_hevc_epel_uni_w_v6_8_c: 248.9
put_hevc_epel_uni_w_v6_8_neon: 80.6
put_hevc_epel_uni_w_v8_8_c: 383.9
put_hevc_epel_uni_w_v8_8_neon: 91.9
put_hevc_epel_uni_w_v12_8_c: 806.1
put_hevc_epel_uni_w_v12_8_neon: 202.9
put_hevc_epel_uni_w_v16_8_c: 1411.1
put_hevc_epel_uni_w_v16_8_neon: 289.9
put_hevc_epel_uni_w_v24_8_c: 3168.9
put_hevc_epel_uni_w_v24_8_neon: 619.4
put_hevc_epel_uni_w_v32_8_c: 5632.9
put_hevc_epel_uni_w_v32_8_neon: 1161.1
put_hevc_epel_uni_w_v48_8_c: 12406.1
put_hevc_epel_uni_w_v48_8_neon: 2476.4
put_hevc_epel_uni_w_v64_8_c: 22001.4
put_hevc_epel_uni_w_v64_8_neon: 4343.9
Signed-off-by: Martin Storsjö <martin@martin.st>
2023-07-14 21:19:12 +03:00
Logan Lyu
0c604b1913
lavc/aarch64: new optimization for 8-bit hevc_epel_uni_w_h
...
put_hevc_epel_uni_w_h4_8_c: 126.1
put_hevc_epel_uni_w_h4_8_i8mm: 41.6
put_hevc_epel_uni_w_h6_8_c: 222.9
put_hevc_epel_uni_w_h6_8_i8mm: 91.4
put_hevc_epel_uni_w_h8_8_c: 374.4
put_hevc_epel_uni_w_h8_8_i8mm: 102.1
put_hevc_epel_uni_w_h12_8_c: 806.1
put_hevc_epel_uni_w_h12_8_i8mm: 225.6
put_hevc_epel_uni_w_h16_8_c: 1414.4
put_hevc_epel_uni_w_h16_8_i8mm: 333.4
put_hevc_epel_uni_w_h24_8_c: 3128.6
put_hevc_epel_uni_w_h24_8_i8mm: 713.1
put_hevc_epel_uni_w_h32_8_c: 5519.1
put_hevc_epel_uni_w_h32_8_i8mm: 1118.1
put_hevc_epel_uni_w_h48_8_c: 12364.4
put_hevc_epel_uni_w_h48_8_i8mm: 2541.1
put_hevc_epel_uni_w_h64_8_c: 21925.9
put_hevc_epel_uni_w_h64_8_i8mm: 4383.6
Signed-off-by: Martin Storsjö <martin@martin.st>
2023-07-14 21:19:12 +03:00
Logan Lyu
e652e7dcda
lavc/aarch64: new optimization for 8-bit hevc_pel_uni_pixels
...
put_hevc_pel_uni_pixels4_8_c: 35.9
put_hevc_pel_uni_pixels4_8_neon: 7.6
put_hevc_pel_uni_pixels6_8_c: 46.1
put_hevc_pel_uni_pixels6_8_neon: 20.6
put_hevc_pel_uni_pixels8_8_c: 53.4
put_hevc_pel_uni_pixels8_8_neon: 11.6
put_hevc_pel_uni_pixels12_8_c: 89.1
put_hevc_pel_uni_pixels12_8_neon: 25.9
put_hevc_pel_uni_pixels16_8_c: 106.4
put_hevc_pel_uni_pixels16_8_neon: 20.4
put_hevc_pel_uni_pixels24_8_c: 137.6
put_hevc_pel_uni_pixels24_8_neon: 47.1
put_hevc_pel_uni_pixels32_8_c: 173.6
put_hevc_pel_uni_pixels32_8_neon: 54.1
put_hevc_pel_uni_pixels48_8_c: 268.1
put_hevc_pel_uni_pixels48_8_neon: 117.1
put_hevc_pel_uni_pixels64_8_c: 346.1
put_hevc_pel_uni_pixels64_8_neon: 205.9
Signed-off-by: Martin Storsjö <martin@martin.st>
2023-07-14 21:19:12 +03:00
Logan Lyu
e79686be96
lavc/aarch64: new optimization for 8-bit hevc_qpel_h hevc_qpel_uni_w_hv
...
Signed-off-by: Martin Storsjö <martin@martin.st>
2023-06-06 12:50:18 +03:00