1
0
mirror of https://github.com/FFmpeg/FFmpeg.git synced 2025-02-04 06:08:26 +02:00

8 Commits

Author SHA1 Message Date
Martin Storsjö
f32690a298 aarch64: vp9lpf: Use dup+rev16+uzp1 instead of dup+lsr+dup+trn1
This is one cycle faster in total, and three instructions fewer.

Before:
vp9_loop_filter_mix2_v_44_16_neon: 123.2
After:
vp9_loop_filter_mix2_v_44_16_neon: 122.2

This is cherrypicked from libav commit
3bf9c48320f25f3d5557485b0202f22ae60748b0.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-11 13:14:50 +02:00
Martin Storsjö
3fbbad2984 arm/aarch64: vp9lpf: Keep the comparison to E within 8 bit
The theoretical maximum value of E is 193, so we can just
saturate the addition to 255.

Before:                     Cortex A7      A8      A9     A53  A53/AArch64
vp9_loop_filter_v_4_8_neon:     143.0   127.7   114.8    88.0         87.7
vp9_loop_filter_v_8_8_neon:     241.0   197.2   173.7   140.0        136.7
vp9_loop_filter_v_16_8_neon:    497.0   419.5   379.7   293.0        275.7
vp9_loop_filter_v_16_16_neon:   965.2   818.7   731.4   579.0        452.0
After:
vp9_loop_filter_v_4_8_neon:     136.0   125.7   112.6    84.0         83.0
vp9_loop_filter_v_8_8_neon:     234.0   195.5   171.5   136.0        133.7
vp9_loop_filter_v_16_8_neon:    490.0   417.5   377.7   289.0        271.0
vp9_loop_filter_v_16_16_neon:   951.2   814.7   732.3   571.0        446.7

This is cherrypicked from libav commit
c582cb8537367721bb399a5d01b652c20142b756.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-11 13:14:50 +02:00
Martin Storsjö
c8d6eec85d aarch64: vp9lpf: Fix broken indentation/vertical alignment
This is cherrypicked from libav commit
07b5136c481d394992c7e951967df0cfbb346c0b.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-11 13:14:49 +02:00
Martin Storsjö
9f3a886364 aarch64: vp9lpf: Interleave the start of flat8in into the calculation above
This adds lots of extra .ifs, but speeds it up by a couple cycles,
by avoiding stalls.

This is cherrypicked from libav commit
b0806088d3b27044145b20421da8d39089ae0c6a.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-11 13:14:49 +02:00
Martin Storsjö
f0ecbb13cf arm/aarch64: vp9lpf: Calculate !hev directly
Previously we first calculated hev, and then negated it.

Since we were able to schedule the negation in the middle
of another calculation, we don't see any gain in all cases.

Before:                     Cortex A7      A8      A9     A53  A53/AArch64
vp9_loop_filter_v_4_8_neon:     147.0   129.0   115.8    89.0         88.7
vp9_loop_filter_v_8_8_neon:     242.0   198.5   174.7   140.0        136.7
vp9_loop_filter_v_16_8_neon:    500.0   419.5   382.7   293.0        275.7
vp9_loop_filter_v_16_16_neon:   971.2   825.5   731.5   579.0        453.0
After:
vp9_loop_filter_v_4_8_neon:     143.0   127.7   114.8    88.0         87.7
vp9_loop_filter_v_8_8_neon:     241.0   197.2   173.7   140.0        136.7
vp9_loop_filter_v_16_8_neon:    497.0   419.5   379.7   293.0        275.7
vp9_loop_filter_v_16_16_neon:   965.2   818.7   731.4   579.0        452.0

This is cherrypicked from libav commit
e1f9de86f454861b69b199ad801adc2ec6c3b220.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-11 13:14:48 +02:00
Janne Grunau
cb220eeef9 aarch64: vp9: loop filter: replace 'orr; cbn?z' with 'adds; b.{eq,ne};
The latter is 1 cycle faster on a cortex-53 and since the operands are
bytewise (or larger) bitmask (impossible to overflow to zero) both are
equivalent.

This is cherrypicked from libav commit
e7ae8f7a715843a5089d18e033afb3ee19ab3057.

Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
2017-01-14 21:13:10 +01:00
Janne Grunau
62ea07d797 aarch64: vp9: use alternative returns in the core loop filter function
Since aarch64 has enough free general purpose registers use them to
branch to the appropiate storage code. 1-2 cycles faster for the
functions using loop_filter 8/16, ... on a cortex-a53. Mixed results
(up to 2 cycles faster/slower) on a cortex-a57.

This is cherrypicked from libav commit
d7595de0b25e7064fd9e06dea5d0425536cef6dc.

Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
2017-01-14 21:13:06 +01:00
Martin Storsjö
f1212e472b aarch64: vp9: Implement NEON loop filters
This work is sponsored by, and copyright, Google.

These are ported from the ARM version; thanks to the larger
amount of registers available, we can do the loop filters with
16 pixels at a time. The implementation is fully templated, with
a single macro which can generate versions for both 8 and
16 pixels wide, for both 4, 8 and 16 pixels loop filters
(and the 4/8 mixed versions as well).

For the 8 pixel wide versions, it is pretty close in speed (the
v_4_8 and v_8_8 filters are the best examples of this; the h_4_8
and h_8_8 filters seem to get some gain in the load/transpose/store
part). For the 16 pixels wide ones, we get a speedup of around
1.2-1.4x compared to the 32 bit version.

Examples of runtimes vs the 32 bit version, on a Cortex A53:
                                       ARM AArch64
vp9_loop_filter_h_4_8_neon:          144.0   127.2
vp9_loop_filter_h_8_8_neon:          207.0   182.5
vp9_loop_filter_h_16_8_neon:         415.0   328.7
vp9_loop_filter_h_16_16_neon:        672.0   558.6
vp9_loop_filter_mix2_h_44_16_neon:   302.0   203.5
vp9_loop_filter_mix2_h_48_16_neon:   365.0   305.2
vp9_loop_filter_mix2_h_84_16_neon:   365.0   305.2
vp9_loop_filter_mix2_h_88_16_neon:   376.0   305.2
vp9_loop_filter_mix2_v_44_16_neon:   193.2   128.2
vp9_loop_filter_mix2_v_48_16_neon:   246.7   218.4
vp9_loop_filter_mix2_v_84_16_neon:   248.0   218.5
vp9_loop_filter_mix2_v_88_16_neon:   302.0   218.2
vp9_loop_filter_v_4_8_neon:           89.0    88.7
vp9_loop_filter_v_8_8_neon:          141.0   137.7
vp9_loop_filter_v_16_8_neon:         295.0   272.7
vp9_loop_filter_v_16_16_neon:        546.0   453.7

The speedup vs C code in checkasm tests is around 2-7x, which is
pretty much the same as for the 32 bit version. Even if these functions
are faster than their 32 bit equivalent, the C version that we compare
to also became around 1.3-1.7x faster than the C version in 32 bit.

Based on START_TIMER/STOP_TIMER wrapping around a few individual
functions, the speedup vs C code is around 4-5x.

Examples of runtimes vs C on a Cortex A57 (for a slightly older version
of the patch):
                         A57 gcc-5.3  neon
loop_filter_h_4_8_neon:        256.6  93.4
loop_filter_h_8_8_neon:        307.3 139.1
loop_filter_h_16_8_neon:       340.1 254.1
loop_filter_h_16_16_neon:      827.0 407.9
loop_filter_mix2_h_44_16_neon: 524.5 155.4
loop_filter_mix2_h_48_16_neon: 644.5 173.3
loop_filter_mix2_h_84_16_neon: 630.5 222.0
loop_filter_mix2_h_88_16_neon: 697.3 222.0
loop_filter_mix2_v_44_16_neon: 598.5 100.6
loop_filter_mix2_v_48_16_neon: 651.5 127.0
loop_filter_mix2_v_84_16_neon: 591.5 167.1
loop_filter_mix2_v_88_16_neon: 855.1 166.7
loop_filter_v_4_8_neon:        271.7  65.3
loop_filter_v_8_8_neon:        312.5 106.9
loop_filter_v_16_8_neon:       473.3 206.5
loop_filter_v_16_16_neon:      976.1 327.8

The speed-up compared to the C functions is 2.5 to 6 and the cortex-a57
is again 30-50% faster than the cortex-a53.

This is an adapted cherry-pick from libav commits
9d2afd1eb8c5cc0633062430e66326dbf98c99e0 and
31756abe29eb039a11c59a42cb12e0cc2aef3b97.

Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
2016-11-15 15:10:03 -05:00