FFmpeg

mirror of https://github.com/FFmpeg/FFmpeg.git synced 2024-12-28 20:53:54 +02:00

Author	SHA1	Message	Date
Martin Storsjö	8f03c30a17	aarch64: hevc: Use ld1r instead of ldr+dup in hevc_qpel_uni_w_h Signed-off-by: Martin Storsjö <martin@martin.st>	2024-03-26 08:58:20 +02:00
Martin Storsjö	717cc82d28	aarch64: hevc: Specialize put_hevc_\type\()_h*_8_neon for horizontal looping For widths of 32 pixels and more, loop first horizontally, then vertically. Previously, this function would process a 16 pixel wide slice of the block, looping vertically. After processing the whole height, it would backtrack and process the next 16 pixel wide slice. When doing 8tap filtering horizontally, the function must load 7 more pixels (in practice, 8) following the actual inputs, and this was done for each slice. By iterating first horizontally throughout each line, then vertically, we access data in a more cache friendly order, and we don't need to reload data unnecessarily. Keep the original order in put_hevc_\type\()_h12_8_neon; the only suboptimal case there is for width=24. But specializing an optimal variant for that would require more code, which might not be worth it. For the h16 case, this implementation would give a slowdown, as it now loads the first 8 pixels separately from the rest, but for larger widths, it is a gain. Therefore, keep the h16 case as it was (but remove the outer loop), and create a new specialized version for horizontal looping with 16 pixels at a time. Before (columns: Cortex A53, A72, A73, Graviton 3): put_hevc_qpel_h16_8_neon: 710.5 667.7 692.5 211.0; put_hevc_qpel_h32_8_neon: 2791.5 2643.5 2732.0 883.5; put_hevc_qpel_h64_8_neon: 10954.0 10657.0 10874.2 3241.5. After (same columns): put_hevc_qpel_h16_8_neon: 697.5 663.5 705.7 212.5; put_hevc_qpel_h32_8_neon: 2767.2 2684.5 2791.2 920.5; put_hevc_qpel_h64_8_neon: 10559.2 10471.5 10932.2 3051.7. Signed-off-by: Martin Storsjö <martin@martin.st>	2024-03-26 08:58:11 +02:00
Martin Storsjö	e3a54cabde	aarch64: hevc: Merge consecutive stores in put_hevc_\type\()_h16_8_neon This gets rid of a couple instructions, but the actual performance is almost identical on Cortex A72/A73. On Cortex A53, it is a handful of cycles faster. Signed-off-by: Martin Storsjö <martin@martin.st>	2024-03-26 08:58:01 +02:00
Martin Storsjö	78db8405c0	aarch64: hevc: Don't iterate with sp in ff_hevc_put_hevc_qpel_uni_w_hv32/64_8_neon_i8mm Many of the routines within hevcdsp_epel_neon and hevcdsp_qpel_neon store temporary buffers on the stack. When consuming it, many of these functions use the stack pointer as incremental pointer for reading the data (instead of storing it in another register), which is rather unusual. Technically, this is fine as long as the pointer remains properly aligned. However in the case of ff_hevc_put_hevc_qpel_uni_w_hv64_8_neon_i8mm, after incrementing sp when reading data (within each 16 pixel wide stripe) it would then reset the stack pointer back to a lower value, for reading the next 16 pixel wide stripe, expecting the data to remain untouched. This can't be assumed; data on the stack below the stack pointer can be clobbered (e.g. by a signal handler). Some OS ABIs allow for a little margin that won't be touched, aka a red zone, but not all do. The ones that do, guarantee 16 or 128 bytes, not 9 KB. Convert this function to use a separate pointer register to iterate through the data, retaining the stack pointer to point at the bottom of the data we require to remain untouched. Signed-off-by: Martin Storsjö <martin@martin.st>	2024-03-26 08:57:55 +02:00
Logan Lyu	fa0470347e	lavc/aarch64: new optimization for 8-bit hevc_qpel_bi_hv put_hevc_qpel_bi_hv4_8_c: 433.7 put_hevc_qpel_bi_hv4_8_i8mm: 117.9 put_hevc_qpel_bi_hv6_8_c: 803.9 put_hevc_qpel_bi_hv6_8_i8mm: 252.7 put_hevc_qpel_bi_hv8_8_c: 1296.4 put_hevc_qpel_bi_hv8_8_i8mm: 316.2 put_hevc_qpel_bi_hv12_8_c: 2867.4 put_hevc_qpel_bi_hv12_8_i8mm: 669.2 put_hevc_qpel_bi_hv16_8_c: 4709.4 put_hevc_qpel_bi_hv16_8_i8mm: 929.9 put_hevc_qpel_bi_hv24_8_c: 9639.7 put_hevc_qpel_bi_hv24_8_i8mm: 2072.4 put_hevc_qpel_bi_hv32_8_c: 16663.7 put_hevc_qpel_bi_hv32_8_i8mm: 3391.4 put_hevc_qpel_bi_hv48_8_c: 36972.9 put_hevc_qpel_bi_hv48_8_i8mm: 7505.7 put_hevc_qpel_bi_hv64_8_c: 64106.4 put_hevc_qpel_bi_hv64_8_i8mm: 13145.2 Co-Authored-By: J. Dekker <jdek@itanimul.li> Signed-off-by: Martin Storsjö <martin@martin.st>	2023-12-01 21:25:39 +02:00
Logan Lyu	595f97028b	lavc/aarch64: new optimization for 8-bit hevc_qpel_bi_v put_hevc_qpel_bi_v4_8_c: 166.1 put_hevc_qpel_bi_v4_8_neon: 61.9 put_hevc_qpel_bi_v6_8_c: 309.4 put_hevc_qpel_bi_v6_8_neon: 75.6 put_hevc_qpel_bi_v8_8_c: 531.1 put_hevc_qpel_bi_v8_8_neon: 78.1 put_hevc_qpel_bi_v12_8_c: 1139.9 put_hevc_qpel_bi_v12_8_neon: 238.1 put_hevc_qpel_bi_v16_8_c: 2063.6 put_hevc_qpel_bi_v16_8_neon: 308.9 put_hevc_qpel_bi_v24_8_c: 4317.1 put_hevc_qpel_bi_v24_8_neon: 629.9 put_hevc_qpel_bi_v32_8_c: 8241.9 put_hevc_qpel_bi_v32_8_neon: 1140.1 put_hevc_qpel_bi_v48_8_c: 18422.9 put_hevc_qpel_bi_v48_8_neon: 2533.9 put_hevc_qpel_bi_v64_8_c: 37508.6 put_hevc_qpel_bi_v64_8_neon: 4520.1 Co-Authored-By: J. Dekker <jdek@itanimul.li> Signed-off-by: Martin Storsjö <martin@martin.st>	2023-12-01 21:25:39 +02:00
xufuji456	cc86343b96	lavc/hevcdsp_qpel_neon: using movi.16b instead of movi.2d Building iOS platform with arm64, the compiler has a warning: "instruction movi.2d with immediate #0 may not function correctly on this CPU, converting to movi.16b" Signed-off-by: xufuji456 <839789740@qq.com> Signed-off-by: Martin Storsjö <martin@martin.st>	2023-11-28 15:54:49 +02:00
Logan Lyu	55f28eb627	lavc/aarch64: new optimization for 8-bit hevc_qpel_hv checkasm bench: put_hevc_qpel_hv4_8_c: 422.1 put_hevc_qpel_hv4_8_i8mm: 101.6 put_hevc_qpel_hv6_8_c: 756.4 put_hevc_qpel_hv6_8_i8mm: 225.9 put_hevc_qpel_hv8_8_c: 1189.9 put_hevc_qpel_hv8_8_i8mm: 296.6 put_hevc_qpel_hv12_8_c: 2407.4 put_hevc_qpel_hv12_8_i8mm: 552.4 put_hevc_qpel_hv16_8_c: 4021.4 put_hevc_qpel_hv16_8_i8mm: 886.6 put_hevc_qpel_hv24_8_c: 8992.1 put_hevc_qpel_hv24_8_i8mm: 1968.9 put_hevc_qpel_hv32_8_c: 15197.9 put_hevc_qpel_hv32_8_i8mm: 3209.4 put_hevc_qpel_hv48_8_c: 32811.1 put_hevc_qpel_hv48_8_i8mm: 7442.1 put_hevc_qpel_hv64_8_c: 58106.1 put_hevc_qpel_hv64_8_i8mm: 12423.9 Co-Authored-By: J. Dekker <jdek@itanimul.li> Signed-off-by: Martin Storsjö <martin@martin.st>	2023-10-31 14:14:21 +02:00
Logan Lyu	97a9d12657	lavc/aarch64: new optimization for 8-bit hevc_qpel_v checkasm bench: put_hevc_qpel_v4_8_c: 138.1 put_hevc_qpel_v4_8_neon: 41.1 put_hevc_qpel_v6_8_c: 276.6 put_hevc_qpel_v6_8_neon: 60.9 put_hevc_qpel_v8_8_c: 478.9 put_hevc_qpel_v8_8_neon: 72.9 put_hevc_qpel_v12_8_c: 1072.6 put_hevc_qpel_v12_8_neon: 203.9 put_hevc_qpel_v16_8_c: 1852.1 put_hevc_qpel_v16_8_neon: 264.1 put_hevc_qpel_v24_8_c: 4137.6 put_hevc_qpel_v24_8_neon: 586.9 put_hevc_qpel_v32_8_c: 7579.1 put_hevc_qpel_v32_8_neon: 1036.6 put_hevc_qpel_v48_8_c: 16355.6 put_hevc_qpel_v48_8_neon: 2326.4 put_hevc_qpel_v64_8_c: 33545.1 put_hevc_qpel_v64_8_neon: 4126.4 Co-Authored-By: J. Dekker <jdek@itanimul.li> Signed-off-by: Martin Storsjö <martin@martin.st>	2023-10-31 14:14:21 +02:00
Martin Storsjö	a4877f1ec1	aarch64: Only enable extensions in the intended files/regions This eases actual development of the assembly functions, by only allowing extension instructions within the sections that explicitly enable them, instead of having all extensions enabled everywhere. Signed-off-by: Martin Storsjö <martin@martin.st>	2023-10-24 14:46:20 +03:00
Martin Storsjö	7f905f3672	aarch64: Make the indentation more consistent Some functions have slightly different indentation styles; try to match the surrounding code. libavcodec/aarch64/vc1dsp_neon.S is skipped here, as it intentionally uses a layered indentation style to visually show how different unrolled/interleaved phases fit together. Signed-off-by: Martin Storsjö <martin@martin.st>	2023-10-21 23:25:29 +03:00
Logan Lyu	8fa83ad70f	lavc/aarch64: new optimization for 8-bit hevc_qpel_uni_hv checkasm bench: put_hevc_qpel_uni_hv4_8_c: 489.2 put_hevc_qpel_uni_hv4_8_i8mm: 105.7 put_hevc_qpel_uni_hv6_8_c: 852.7 put_hevc_qpel_uni_hv6_8_i8mm: 268.7 put_hevc_qpel_uni_hv8_8_c: 1345.7 put_hevc_qpel_uni_hv8_8_i8mm: 300.4 put_hevc_qpel_uni_hv12_8_c: 2757.4 put_hevc_qpel_uni_hv12_8_i8mm: 581.4 put_hevc_qpel_uni_hv16_8_c: 4458.9 put_hevc_qpel_uni_hv16_8_i8mm: 860.2 put_hevc_qpel_uni_hv24_8_c: 9582.2 put_hevc_qpel_uni_hv24_8_i8mm: 2086.7 put_hevc_qpel_uni_hv32_8_c: 16401.9 put_hevc_qpel_uni_hv32_8_i8mm: 3217.4 put_hevc_qpel_uni_hv48_8_c: 36402.4 put_hevc_qpel_uni_hv48_8_i8mm: 7082.7 put_hevc_qpel_uni_hv64_8_c: 62713.2 put_hevc_qpel_uni_hv64_8_i8mm: 12408.9 Co-Authored-By: J. Dekker <jdek@itanimul.li> Signed-off-by: Martin Storsjö <martin@martin.st>	2023-09-26 15:50:44 +03:00
Logan Lyu	23ca61b7de	lavc/aarch64: new optimization for 8-bit hevc_qpel_uni_v checkasm bench: put_hevc_qpel_uni_v4_8_c: 146.2 put_hevc_qpel_uni_v4_8_neon: 43.2 put_hevc_qpel_uni_v6_8_c: 303.9 put_hevc_qpel_uni_v6_8_neon: 69.7 put_hevc_qpel_uni_v8_8_c: 495.2 put_hevc_qpel_uni_v8_8_neon: 74.7 put_hevc_qpel_uni_v12_8_c: 1100.9 put_hevc_qpel_uni_v12_8_neon: 222.4 put_hevc_qpel_uni_v16_8_c: 1955.2 put_hevc_qpel_uni_v16_8_neon: 269.2 put_hevc_qpel_uni_v24_8_c: 4571.9 put_hevc_qpel_uni_v24_8_neon: 832.4 put_hevc_qpel_uni_v32_8_c: 8226.4 put_hevc_qpel_uni_v32_8_neon: 1035.7 put_hevc_qpel_uni_v48_8_c: 18324.2 put_hevc_qpel_uni_v48_8_neon: 2321.2 put_hevc_qpel_uni_v64_8_c: 37659.4 put_hevc_qpel_uni_v64_8_neon: 4122.2 Co-Authored-By: J. Dekker <jdek@itanimul.li> Signed-off-by: Martin Storsjö <martin@martin.st>	2023-09-26 15:50:44 +03:00
Logan Lyu	e652e7dcda	lavc/aarch64: new optimization for 8-bit hevc_pel_uni_pixels put_hevc_pel_uni_pixels4_8_c: 35.9 put_hevc_pel_uni_pixels4_8_neon: 7.6 put_hevc_pel_uni_pixels6_8_c: 46.1 put_hevc_pel_uni_pixels6_8_neon: 20.6 put_hevc_pel_uni_pixels8_8_c: 53.4 put_hevc_pel_uni_pixels8_8_neon: 11.6 put_hevc_pel_uni_pixels12_8_c: 89.1 put_hevc_pel_uni_pixels12_8_neon: 25.9 put_hevc_pel_uni_pixels16_8_c: 106.4 put_hevc_pel_uni_pixels16_8_neon: 20.4 put_hevc_pel_uni_pixels24_8_c: 137.6 put_hevc_pel_uni_pixels24_8_neon: 47.1 put_hevc_pel_uni_pixels32_8_c: 173.6 put_hevc_pel_uni_pixels32_8_neon: 54.1 put_hevc_pel_uni_pixels48_8_c: 268.1 put_hevc_pel_uni_pixels48_8_neon: 117.1 put_hevc_pel_uni_pixels64_8_c: 346.1 put_hevc_pel_uni_pixels64_8_neon: 205.9 Signed-off-by: Martin Storsjö <martin@martin.st>	2023-07-14 21:19:12 +03:00
Logan Lyu	e79686be96	lavc/aarch64: new optimization for 8-bit hevc_qpel_h hevc_qpel_uni_w_hv Signed-off-by: Martin Storsjö <martin@martin.st>	2023-06-06 12:50:18 +03:00
Logan Lyu	15972cce8c	lavc/aarch64: new optimization for 8-bit hevc_qpel_uni_w_h Signed-off-by: Martin Storsjö <martin@martin.st>	2023-06-06 12:50:18 +03:00
Logan Lyu	0b7356c1b4	lavc/aarch64: new optimization for 8-bit hevc_pel_uni_w_pixels and qpel_uni_w_v Signed-off-by: Martin Storsjö <martin@martin.st>	2023-06-06 12:50:18 +03:00
J. Dekker	9bed814e1d	lavc/aarch64: add hevc horizontal qpel/uni/bi checkasm --benchmark on Ampere Altra (Neoverse N1): put_hevc_qpel_bi_h4_8_c: 170.7 put_hevc_qpel_bi_h4_8_neon: 64.5 put_hevc_qpel_bi_h6_8_c: 373.7 put_hevc_qpel_bi_h6_8_neon: 130.2 put_hevc_qpel_bi_h8_8_c: 662.0 put_hevc_qpel_bi_h8_8_neon: 138.5 put_hevc_qpel_bi_h12_8_c: 1529.5 put_hevc_qpel_bi_h12_8_neon: 422.0 put_hevc_qpel_bi_h16_8_c: 2735.5 put_hevc_qpel_bi_h16_8_neon: 560.5 put_hevc_qpel_bi_h24_8_c: 6015.7 put_hevc_qpel_bi_h24_8_neon: 1636.0 put_hevc_qpel_bi_h32_8_c: 10779.0 put_hevc_qpel_bi_h32_8_neon: 2204.5 put_hevc_qpel_bi_h48_8_c: 24375.0 put_hevc_qpel_bi_h48_8_neon: 4984.0 put_hevc_qpel_bi_h64_8_c: 42768.0 put_hevc_qpel_bi_h64_8_neon: 8795.7 put_hevc_qpel_h4_8_c: 149.0 put_hevc_qpel_h4_8_neon: 55.7 put_hevc_qpel_h6_8_c: 321.2 put_hevc_qpel_h6_8_neon: 106.0 put_hevc_qpel_h8_8_c: 578.7 put_hevc_qpel_h8_8_neon: 133.2 put_hevc_qpel_h12_8_c: 1279.0 put_hevc_qpel_h12_8_neon: 391.7 put_hevc_qpel_h16_8_c: 2286.2 put_hevc_qpel_h16_8_neon: 519.7 put_hevc_qpel_h24_8_c: 5100.7 put_hevc_qpel_h24_8_neon: 1546.2 put_hevc_qpel_h32_8_c: 9022.0 put_hevc_qpel_h32_8_neon: 2060.2 put_hevc_qpel_h48_8_c: 20293.5 put_hevc_qpel_h48_8_neon: 4656.7 put_hevc_qpel_h64_8_c: 36037.0 put_hevc_qpel_h64_8_neon: 8262.7 put_hevc_qpel_uni_h4_8_c: 162.2 put_hevc_qpel_uni_h4_8_neon: 61.7 put_hevc_qpel_uni_h6_8_c: 355.2 put_hevc_qpel_uni_h6_8_neon: 114.2 put_hevc_qpel_uni_h8_8_c: 651.0 put_hevc_qpel_uni_h8_8_neon: 135.7 put_hevc_qpel_uni_h12_8_c: 1412.5 put_hevc_qpel_uni_h12_8_neon: 402.7 put_hevc_qpel_uni_h16_8_c: 2551.0 put_hevc_qpel_uni_h16_8_neon: 533.5 put_hevc_qpel_uni_h24_8_c: 5782.2 put_hevc_qpel_uni_h24_8_neon: 1578.7 put_hevc_qpel_uni_h32_8_c: 10586.5 put_hevc_qpel_uni_h32_8_neon: 2102.2 put_hevc_qpel_uni_h48_8_c: 23812.0 put_hevc_qpel_uni_h48_8_neon: 4739.5 put_hevc_qpel_uni_h64_8_c: 42958.7 put_hevc_qpel_uni_h64_8_neon: 8366.5 Signed-off-by: J. Dekker <jdek@itanimul.li>	2022-10-25 14:56:38 +02:00
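
Several of the commit messages above quote checkasm figures as flat runs of `name: value` pairs, with the C reference and the SIMD version of each function side by side. The following is a minimal sketch for turning such a run into per-function ratios. The helper name `speedups` and the suffix list (`c`, `neon`, `i8mm`) are assumptions chosen to match the names that appear in this log; this is not part of FFmpeg or checkasm.

```python
import re

# Matches e.g. "put_hevc_qpel_bi_hv4_8_i8mm: 117.9" and splits off the
# implementation suffix. The suffix list is an assumption based on this log.
PAIR = re.compile(r"(\w+)_(c|neon|i8mm):\s*([0-9.]+)")

def speedups(text):
    """Return {function: C time / SIMD time} for each function in text."""
    timings = {}
    for base, impl, value in PAIR.findall(text):
        timings.setdefault(base, {})[impl] = float(value)
    ratios = {}
    for base, by_impl in timings.items():
        if "c" not in by_impl:
            continue  # no C reference to compare against
        for impl, value in by_impl.items():
            if impl != "c":
                ratios[base] = by_impl["c"] / value
    return ratios

# Figures copied from commit fa0470347e above.
sample = (
    "put_hevc_qpel_bi_hv4_8_c: 433.7 put_hevc_qpel_bi_hv4_8_i8mm: 117.9 "
    "put_hevc_qpel_bi_hv64_8_c: 64106.4 put_hevc_qpel_bi_hv64_8_i8mm: 13145.2"
)

for name, ratio in speedups(sample).items():
    print(f"{name}: {ratio:.2f}x")
```

A ratio above 1 means the SIMD version took less time than the C reference in that run. The figures come from different machines across commits, so ratios are only comparable within a single commit message.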

18 Commits
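
Commit 717cc82d28 describes switching wide blocks from stripe-by-stripe traversal to row-by-row traversal. The sketch below models that change in Python rather than AArch64 assembly. It checks that both orders produce the same output and counts input pixels under a simplified load model: 16 + 8 pixels per stripe per row in the old order, and 8 + width pixels per row in the new one. The function names, the load model, and the choice of taps are illustrative assumptions, not the actual NEON implementation.

```python
# Illustrative 8-tap filter (the HEVC half-sample luma taps); any 8 taps
# would serve for comparing the two traversal orders.
TAPS = (-1, 4, -11, 40, 40, -11, 4, -1)
BLOCK = 16  # pixels handled per inner step
EXTRA = 8   # 7 extra input pixels are needed; 8 are loaded in practice

def filter_pixel(row, x):
    # row[x : x + 8] is the filter window for output pixel x
    return sum(t * row[x + i] for i, t in enumerate(TAPS))

def sliced_order(src, width, height):
    """Old order: finish one 16-wide stripe over all rows, then the next."""
    out = [[0] * width for _ in range(height)]
    loads = 0
    for x0 in range(0, width, BLOCK):
        for y in range(height):
            loads += BLOCK + EXTRA  # extra pixels reloaded for every stripe
            for x in range(x0, x0 + BLOCK):
                out[y][x] = filter_pixel(src[y], x)
    return out, loads

def horizontal_order(src, width, height):
    """New order: walk each row left to right, then move to the next row."""
    out = [[0] * width for _ in range(height)]
    loads = 0
    for y in range(height):
        loads += EXTRA  # first 8 pixels loaded separately, once per row
        for x0 in range(0, width, BLOCK):
            loads += BLOCK
            for x in range(x0, x0 + BLOCK):
                out[y][x] = filter_pixel(src[y], x)
    return out, loads

width, height = 64, 4  # width must be a multiple of BLOCK in this sketch
src = [[(3 * x + 5 * y) % 256 for x in range(width + EXTRA)]
       for y in range(height)]

old_out, old_loads = sliced_order(src, width, height)
new_out, new_loads = horizontal_order(src, width, height)
print(old_out == new_out, old_loads, new_loads)
```

Under this model a width of 16 costs the same number of loaded pixels either way, which is in line with the commit keeping the h16 case on its original path; the model says nothing about cache behaviour or instruction scheduling.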