Messages on stderr were being lost due to the error suppression used to customize the error message.
Also update the formatting to be more informative and concise.
Multi-repository implementations for the archive-push, check, info, stanza-create, stanza-upgrade, and stanza-delete commands.
Multi-repo configuration is disabled so there should be no behavioral changes between these commands and their current single-repo implementations.
Multi-repo documentation and integration tests are still in the multi-repo development branch. All unit tests work as multi-repo since they are able to bypass the configuration restrictions.
All unit tests now require full coverage so the "full" keyword is obsolete and has been removed.
The covered code modules are simply listed, with only "no code" modules annotated.
These options specify the number of local worker job retries and the retry interval after one immediate retry.
There is some value in allowing retries to be specified by the user but for the most part these options are for suppressing retries during testing, which can save a lot of time. The bug introduced in d1d25c7 and fixed in 8b86d5e also suggests it is better not to use retries in tests.
Remove the default delayed retries for archive-get/archive-push, leaving only the immediate retry. These commands are retried by PostgreSQL so it doesn't make sense to do too many retries internally.
These options are currently internal.
No timeout is expected here but the small timeout prevents errors from being thrown.
This is not a bug since the error would be thrown on the next archive-get call but it does make the tests harder to debug when there is an error.
It is not clear why there was a timeout here at all. It is likely cruft from a prior test or a copy/paste error.
Testing on Travis-CI has been getting slower (from ~18 minutes to 3-6 hours) and the travis-ci.org service will be terminated at the end of the year. Moving to travis-ci.com is an option but the quotas are too low for our purposes.
Instead use Github Actions, which does not currently have quotas, and runs our current tests with just a few tweaks.
This still leaves multi-architecture tests on Travis-CI but we may be able to run those and stay within the new quotas.
Also fix a minor bug in restoreTest.c exposed by Github Actions using a different name for the user and group.
Bug Fixes:
* Allow [, #, and space as the first character in database names. (Reviewed by Stefan Fercot, Cynthia Shang. Reported by Jefferson Alexandre.)
* Create standby.signal only on PostgreSQL 12 when restore type is standby. (Fixed by Stefan Fercot. Reviewed by David Steele. Reported by Keith Fiske.)
Features:
* Expire history files. (Contributed by Stefan Fercot. Reviewed by David Steele.)
* Report page checksum errors in info command text output. (Contributed by Stefan Fercot. Reviewed by Cynthia Shang.)
* Add repo-azure-endpoint option. (Reviewed by Cynthia Shang, Brian Peterson. Suggested by Brian Peterson.)
* Add pg-database option. (Reviewed by Cynthia Shang.)
Improvements:
* Improve info command output when a stanza is specified but missing. (Contributed by Stefan Fercot. Reviewed by Cynthia Shang, David Steele. Suggested by uspen.)
* Improve performance of large file lists in backup/restore commands. (Reviewed by Cynthia Shang, Oscar.)
* Add retries to PostgreSQL sleep when starting a backup. (Reviewed by Cynthia Shang. Suggested by Vitaliy Kukharik.)
Documentation Improvements:
* Replace RHEL/CentOS 6 documentation with RHEL/CentOS 8.
Update RHEL/CentOS 7 to cover the versions that were previously covered by RHEL/CentOS 6.
Since RHEL/CentOS 7/8 work the same update the documentation logic and labels to reflect this compatibility.
CentOS6 EOL'd and the mirrors were swiftly deleted, leading to failures in tests and documentation.
Remove CentOS 6 for now to get builds going again with the intention to replace it in the near future with CentOS 8.
Improve locking on remote processes by introducing an exec-id that is unique to the main process and passed to all remote processes. This allows the remote processes to determine if a lock is held by a remote from the same main process. If so, the lock is allowed.
The exec-id is also useful for associating remote logs with main logs for debugging purposes.
Currently indexes above 1 do not have dependencies checked, so this doesn't error.
In a future commit we will enable those checks and this will error if it is not fixed.
Add older PostgreSQL versions to the u18 container that were not available before.
This also updates all minor versions for prior versions of PostgreSQL.
These values are not used by the Perl integration tests so maybe it would be better to remove them, but for now just update since they should not be changing again for PG13.
Currently each module that needs to collect statistics implements custom code to do so. This is cumbersome.
Create a general purpose module for collecting and reporting statistics. Statistics are output in the log at detail level, but there are other uses they could be put to eventually.
No new functionality is added. This is just a drop-in replacement for the current statistics, with the advantage of being more flexible.
The new stats are slower because they involve a list lookup, but performance testing shows stats can be updated at about 40,000/ms which seems fast enough for our purposes.
This loop was using a lot of memory without freeing it at intervals.
Rewrite to use char arrays when possible to reduce memory that needs to be allocated and freed.
Remove all check and stanza-* tests except for the ones that are intended to succeed. The successful tests show that the queries run with expected results against each version of PG which should also validate queries for the failure tests in the unit tests.
Also remove the tests for --no-online backups since they don't require a database and are well tested in the unit tests.
There are a few non version specific tests that need to be run in integration because we can't get coverage in the unit tests.
To save some time we'll only run those tests against the same version we use for expect testing.
This caused restore to replace files based on timestamp and size rather than overwriting, which meant some files that should have been updated were left unchanged. Normal restore and restore --delta were not affected by this issue.
Azure and Azure-compatible object stores can now be used for repository storage.
Currently only shared key authentication is supported but SAS will be added soon.
There is no need to have parallelism enabled in a backup control session. In particular, 9.6 marks pg_stop_backup() as parallel-safe but an error will be thrown if pg_stop_backup() is run in a worker.
Test matrices were previously simplified for the mock/* tests (e.g. d4410611, d489eb87) but not for real/all since the rules for which tests would run with which options was extremely complex. This only got more complex when new compression formats were added.
Because the loop-generated matrix was so large, mosts tests were skipped for most option combinations following arcane logic which was nearly impossible to decipher even when reading the code, and completely impossible from the test.pl interface. As a consequence, important tests got excluded. For example, backup from standby was excluded for most versions of PostgreSQL because it was only run once per distro, against the latest version to be included in that distro.
Simplify the tests by having a single run per PostgreSQL version and vary test parameters according to the capabilities of each version and the underlying distro. So, ZST testing is based on whether the distro supports ZST. Every test is run for each set of parameters based on the capabilities of the PostgreSQL version, e.g. backup from standby is not attempted on versions that don't support it.
Note that since more tests are running the overall time to run the mock/all tests has increased by about 20-25%. Some time may be saved my removing tests that are adequately covered by unit tests but that should the subject of another commit. Another option would be to limit some non version-specific tests to a single, well defined version of PostgreSQL, .e.g the version that is run by expect tests, currently 9.6.
The motivation for this refactor is that new storage drivers are coming and the loop-generated test matrix simply was not up to the task of adding them.
The following is an example of the new test log (note longer runtime of each test):
module=real, test=all, run=1, pg-version=10 (106.91s)
module=real, test=all, run=1, pg-version=9.5 (151.09s)
module=real, test=all, run=1, pg-version=9.2 (123.11s)
module=real, test=all, run=1, pg-version=9.1 (129s)
vs. the old test log (sub-second tests were skipped entirely):
module=real, test=all, run=2, pg-version=10 (0.31s)
module=real, test=all, run=3, pg-version=10 (0.26s)
module=real, test=all, run=4, pg-version=10 (60.39s)
module=real, test=all, run=1, pg-version=10 (69.12s)
module=real, test=all, run=6, pg-version=10 (34s)
module=real, test=all, run=5, pg-version=10 (42.75s)
module=real, test=all, run=2, pg-version=9.5 (0.21s)
module=real, test=all, run=3, pg-version=9.5 (0.21s)
module=real, test=all, run=4, pg-version=9.5 (0.21s)
module=real, test=all, run=5, pg-version=9.5 (0.26s)
module=real, test=all, run=6, pg-version=9.5 (0.21s)
module=real, test=all, run=1, pg-version=9.2 (72.78s)
module=real, test=all, run=2, pg-version=9.2 (0.26s)
module=real, test=all, run=3, pg-version=9.2 (0.31s)
module=real, test=all, run=4, pg-version=9.2 (0.21s)
module=real, test=all, run=5, pg-version=9.2 (0.21s)
module=real, test=all, run=6, pg-version=9.2 (0.21s)
module=real, test=all, run=1, pg-version=9.5 (88.41s)
module=real, test=all, run=2, pg-version=9.1 (0.21s)
module=real, test=all, run=3, pg-version=9.1 (0.26s)
module=real, test=all, run=4, pg-version=9.1 (0.21s)
module=real, test=all, run=5, pg-version=9.1 (0.31s)
module=real, test=all, run=6, pg-version=9.1 (0.26s)
module=real, test=all, run=1, pg-version=9.1 (72.4s)
The unsupported version error is showing up on older versions of PostgreSQL (e.g. 9.1, 9.2) on RHEL6 when setting up a standby with streaming replication. The error occurs when a client does not properly send a version number and it's not clear why it is happening here, but it does not appear to have anything to do with pgBackRest and only affects RHEL6, i.e. 9.1 and 9.2 do not show this error on other distros.
For now ignore the error since RHEL6 is nearly EOL.
There is no sense in generating detailed coverage reports in CI environments where they will never be seen. It takes time and format differences in some older versions can cause problems in the report generation code.
Note that missing coverage will still be reported on stdout and the test will fail.
This aligns better with general PostgreSQL usage and our own documentation (updated in 4bcef702).
Usage in the backup.manifest tests has not been updated since it might break the file format.
There don't appear to be any behavioral changes since PostgreSQL 12 and all the tests pass.
Changes to the control/catalog/WAL versions in subsequent betas may break compatibility but pgBackRest will be updated with each release to keep pace.
Vendorized code is copied from another project when a library is not available and a git subproject won't work. Currently all the vendorized code is copied from PostgreSQL but it makes sense to have a more general mechanism for indicating vendorized code.
The .vendor extension will be used to denote vendorized code in the same way that .auto is used to denote auto-generated code.
Rather than bS3 use strStorage which can indicate more than two storage types.
For the moment there are still only two storage types but this change is required before more can be added.
These tests required sudo to achieve complete coverage.
Add a new coverage exception, vm_covered, that applies to code that can only be covered in a container. When the test is run outside of a container code sections that require a container will be excluded with TEST_CONTAINER_REQUIRED and the coverage exception will be added to prevent a coverage error.
This does require marking up the core code with vm_covered, which in some modules (e.g. common/io/tls/client) can be extensive. It's possible that some of these tests can be rewritten to be less dependent on sudo but no attempt was made to do that here.
Only allow coverage summaries in a vm since coverage summaries outside a vm will not be complete, which was true even before this commit.
Newer versions of sudo output this message to stderr when run in a container:
sudo: setrlimit(RLIMIT_CORE): Operation not permitted
See https://github.com/sudo-project/sudo/issues/42 for details.
A simple workaround is to prevent sudo from disabling core dumps. This seems safe enough because if sudo is segfaulting then core files are the least of our worries.
There are a number of Valgrind errors on Ubuntu 12.04 which do not happen on newer distro versions. However, suppressions for these errors have masked legitimate issues in subsequent code.
Instead, make suppressions VM specific so errors in other VMs are not masked.
bzip2 is a widely available, high-quality data compressor. It typically compresses files to within 10% to 15% of the best available techniques (the PPM family of statistical compressors), while being around twice as fast at compression and six times faster at decompression.
bzip2 is currently available on all supported platforms.
Zstandard is a fast lossless compression algorithm targeting real-time compression scenarios at zlib-level and better compression ratios. It's backed by a very fast entropy stage, provided by Huff0 and FSE library.
Zstandard version >= 1.0 is required, which is generally only available on newer distributions.
Previously when retention-archive was set (either by the user or by default), archives prior to the archive-start of the oldest remaining full backup (after backup expiration occurred) would be expired even though the retention-archive threshold had not been met. For example, if there were 1 full backup remaining after backup expiration and the retention-archive was set to 2 and retention-archive-type=full, then archives prior to the archive-start of the remaining full backup would still be removed even though retention-archive required 2 full backups remaining before archives should be expired.
The thought was to keep the archive directory clean and since the full backup did not require prior archives, it was safe to delete them. However, this has caused problems for some users in the past (because they needed the WAL for other purposes) and with the new adhoc and time-based retention features, it was decided that the archives should remain until the threshold was met. The archives will eventually be removed and if having them causes space issues, the expire command and the retention-archive can always be run and adjusted.
The specified backup set (i.e. the backup label provided and all of its dependent backups, if any) will be expired regardless of backup retention rules except that at least one full backup must remain in the repository.
Allows casting const-ness away from an expression, but doesn't allow changing the type. Enforcement of the latter currently only works for gcc-like compilers.
Note that it is not safe to cast const-ness away if the result will ever be modified (it would be undefined behavior). Doing so can cause compiler mis-optimizations or runtime crashes (by modifying read-only memory). It is only safe to use when the result will not be modified, but API design or language restrictions prevent you from declaring that (e.g. because a function returns both const and non-const variables).
Note that this only works in function scope, not for global variables (it would be nice, but not trivial, to improve that).
UNCONSTIFY() requires static assert which is a feature in its own right.