Mark that `huf_decompress_amd64.S` supports BTI and PAC, which it trivially does because it is empty for aarch64.
The issue only requested BTI markings, but it also makes sense to mark PAC, which is the only other feature.
Also run add a test for this mode to the ARM64 QEMU test. Before this PR it warns on `huf_decompress_amd64.S`, after it doesn't.
Fixes Issue #3841.
* Rename `ilimit` to `ilowest` and set it equal to `src` instead of
`src + 6 + 8`. This is safe because the fast decoding loops guarantee
to never read below `ilowest` already. This allows the fast decoder to
run for at least two more iterations, because it consumes at most 7
bytes per iteration.
* Continue the fast loop all the way until the number of safe iterations
is 0. Initially, I thought that when it got towards the end, the
computation of how many iterations of safe might become expensive. But
it ends up being slower to have to decode each of the 4 streams
individually, which makes sense.
This drastically speeds up the Huffman decoder on the `github` dataset
for the issue raised in #3762, measured with `zstd -b1e1r github/`.
| Decoder | Speed before | Speed after |
|----------|--------------|-------------|
| Fallback | 477 MB/s | 477 MB/s |
| Fast C | 384 MB/s | 492 MB/s |
| Assembly | 385 MB/s | 501 MB/s |
We can also look at the speed delta for different block sizes of silesia
using `zstd -b1e1r silesia.tar -B#`.
| Decoder | -B1K ∆ | -B2K ∆ | -B4K ∆ | -B8K ∆ | -B16K ∆ | -B32K ∆ | -B64K ∆ | -B128K ∆ |
|----------|--------|--------|--------|--------|---------|---------|---------|----------|
| Fast C | +11.2% | +8.2% | +6.1% | +4.4% | +2.7% | +1.5% | +0.6% | +0.2% |
| Assembly | +12.5% | +9.0% | +6.2% | +3.6% | +1.5% | +0.7% | +0.2% | +0.03% |
Add generic C versions of the fast decoding loops to serve architectures
that don't have an assembly implementation. Also allow selecting the C
decoding loop over the assembly decoding loop through a zstd
decompression parameter `ZSTD_d_disableHuffmanAssembly`.
I benchmarked on my Intel i9-9900K and my Macbook Air with an M1 processor.
The benchmark command forces zstd to compress without any matches, using
only literals compression, and measures only Huffman decompression speed:
```
zstd -b1e1 --compress-literals --zstd=tlen=131072 silesia.tar
```
The new fast decoding loops outperform the previous implementation uniformly,
but don't beat the x86-64 assembly. Additionally, the fast C decoding loops suffer
from the same stability problems that we've seen in the past, where the assembly
version doesn't. So even though clang gets close to assembly on x86-64, it still
has stability issues.
| Arch | Function | Compiler | Default (MB/s) | Assembly (MB/s) | Fast (MB/s) |
|---------|----------------|--------------|----------------|-----------------|-------------|
| x86-64 | decompress 4X1 | gcc-12.2.0 | 1029.6 | 1308.1 | 1208.1 |
| x86-64 | decompress 4X1 | clang-14.0.6 | 1019.3 | 1305.6 | 1276.3 |
| x86-64 | decompress 4X2 | gcc-12.2.0 | 1348.5 | 1657.0 | 1374.1 |
| x86-64 | decompress 4X2 | clang-14.0.6 | 1027.6 | 1659.9 | 1468.1 |
| aarch64 | decompress 4X1 | clang-12.0.5 | 1081.0 | N/A | 1234.9 |
| aarch64 | decompress 4X2 | clang-12.0.5 | 1270.0 | N/A | 1516.6 |
```
for f in $(find . \( -path ./.git -o -path ./tests/fuzz/corpora \) -prune -o -type f);
do
sed -i 's/Facebook, Inc\./Meta Platforms, Inc. and affiliates./' $f;
done
```
Apparently, even when the assembly file is empty (because
`ZSTD_ENABLE_ASM_X86_64_BMI2` is false), it still is marked as possibly
needing an executable stack and so the whole library is marked as such. This
commit applies a simple patch for this problem by moving the noexecstack
indication outside the macro guard.
This commit builds on #2857.
This commit addresses #2963.
Move portability macros to `lib/common/portability_macros.h`. This file
only contains platform/feature detection (e.g. 0/1 macros). This file is
shared between C and ASM code, so it cannot include any C code.
Rename `HUF_` ASM macros to be `ZSTD_` prefixed, and move to the new
header.
Restrict `ZSTD_ASM_SUPPORTED` to `__GNUC__`, because we need the GAS
assembler.
Finally, only include the ASM code if we are actually going to use it.
This disables it on all Windows platforms, which should resolve the
problem brought up in Issue #2789.
short-tests-0 were silently failing. I think because of the && make clean construction. Switch to ; instead.
Also fix all the test failures that were exposed.
`make all` is failing on CircleCI because it is missing Docker. Move that test
to GitHub actions, and switch the pedantic CircleCI test to `make allmost`.
Putting stack marking into every assembly files is required to indicate
that the stack does not need to be executable.
Executable flag on stack conflicts with some security measures, Systemd
MemoryDenyWriteExecute=yes for example.