Bug Fixes:
* Unnest HTTP/TLS/socket timeouts. (Reviewed by David Christensen.)
* Fix possible segfault in page checksum error message. (Fixed by Zsolt Parragi. Reviewed by David Steele.)
Features:
* Add repo-symlink option to suppress creation of repository symlinks. (Reviewed by Douglas J Hunley. Suggested by Ron Johnson.)
Improvements:
* Add HTTP retries for 408 and 429 errors. (Reviewed by David Christensen.)
On recent versions of Docker it is not necessary to specify the architecture of the require image. The --platform option takes care of it.
Rebuild test images that were modified by this change.
This test was removed in c64cd8e0 because it was taking too long to be scheduled, which held up testing.
Now that Github Actions supports aarch64 move the test there.
Similar to adc5e5b23, STRDEF assignment to reason can be out of scope, which may lead to either garbage in the log message or a crash with SIGSEGV.
Fix by using simple char * constants for the reason.
The SFTP storage driver did not set the pathSync method but had logic to see if pathSync was set. This led to some tortured tests since pathSync had to be injected into the interface after the storage object had been created.
Remove the pathSync parameter from storageWriteSftpNew() and also remove related tests.
These timeouts were nested, i.e. an error in the socket connection would also retry in the TLS and HTTP layers. This led to a multiplying effect such that it took nine minutes to fully timeout with the default io-timeout of 60 seconds.
A fix for this was attempted in 5314dbff but it reduced retries by too much and had to be reverted in fa5b2d44.
Instead fix by moving connection attempts to the lower layer (e.g. TLS -> socket) out of the exception block but leave it within the retry loop. So, for example, if the socket connection fails after retries then the error will not be retried by TLS. But if the TLS session fails the socket will be reconnected.
The STRDEF definition of plural is out of scope in the error message just below it, which may lead to either garbage in the log message or a crash with SIGSEGV.
This is particularly visible with pg_tde, which encrypts tables: if pgbackrest is run with checksum enabled, it tries to emit this message for all encrypted files.
Fix by using simple char * constants in the error message.
Rebuild all containers to get the most recent versions of PostgreSQL.
Update the Debian repository install to match current recommendations. This has already been done for the documentation in fcd00a45.
HTTP client errors 408 and 429 were not being retried but there seems to be some benefit to doing so.
408 has only been seen once and in that case the server was returning 400 (with Request Timeout in the response body) but it seems worth doing since this could happen during times of high congestion. Requests are not sent until request content is ready so there is not much else to be done to handle this error.
429 has been seen occasionally in the past but now seems to be common on Cloudflare R2. Ideally we would have a large initial back off here but it is not clear that is worth it at this time. The existing Fibonacci back off should be enough to allow operations to proceed if possible during the configured timeout.
Bug Fixes:
* Fix issue with adhoc expiration when no backups in a repository. (Reviewed by Stefan Fercot. Reported by Anup Gupta.)
Features:
* Add restore progress to info command output. (Contributed by Denis Garsh, Maxim Michkov. Reviewed by David Steele.)
* Add progress-only detail level for info command output. (Contributed by Denis Garsh. Reviewed by David Steele, Stefan Fercot.)
Improvements:
* Retry failed reads on object stores. (Reviewed by David Christensen.)
* Fix defaults in command-line help. (Reviewed by David Christensen, Chris Bandy.)
Documentation Improvements:
* Describe discrete option values in a list where appropriate. (Contributed by Anton Kurochkin. Reviewed by David Steele.)
* Fix "less than" in help output for archive-mode option. (Contributed by Anton Kurochkin. Reviewed by David Steele.)
The prior implementation of the restore command did not provide any progress information. Implement it similarly to the backup command and also update the info command to display restore progress alongside backup progress.
Restore is a read operation and should not block other commands. The only exception is that multiple restores with the same lock and repository path are not allowed, as each restore must write progress to a separate file. Therefore, no lock is needed for restores with a remote role and the restore command should not be terminated when the stop command is executed.
The info command fetches a lot of information from the repository about backups and archives, so this operation can be slow. Because progress data is stored in local lock files, accessing the repository is unnecessary when only progress information is required.
This patch introduces the `--detail-level=[progress|full]` option, with `full` as the default. The `progress` level limits the info command output to progress details without querying the repository. The only remaining operations are scanning the folder structure to list available stanzas and reading lock files.
Note: When `progress` is selected, the info command performs no checks beyond verifying stanza availability.
If there are no backups in one or more repositories then the following error occurs during adhoc expiration:
ASSERT: [025]: cannot get index 0 from list with 0 value(s)
Fix this by skipping the adhoc logic when there are no backups in a repository.
The nonstring variable attribute specifies that an object or member declaration with type array of char, signed char, or unsigned char, or pointer to such a type is intended to store character arrays that do not necessarily contain a terminating NUL.
Newer versions of gcc will warn if this attribute is missing.
Some defaults and allow lists were determined in cfgLoadUpdateOption() because they depended on the values of other options, .e.g. compress-level.
Instead build this functionality into the config parser. Not only does this standardize the defaults and allow lists but it makes it possible to automate the documentation, which is also done in this commit.
It is currently not possible to determine the default of all options knowing just the command. Some defaults are set in cfgLoadUpdateOption() and in an upcoming commit defaults may be based on the value of other options.
It would be possible to update parser to provide this information but that will complicate the parser and since the logic is only used to simplify options passed to remotes it does not seem worth the effort.
For the same reason cfgParseOptionRequired() can also be removed.
There is currently a retry if the initial get request fails (depending on the error code) but if the read fails later on while fetching blocks of data it is fatal. In most cases there is a higher level retry (e.g. restore) but restarting the restore job might be expensive depending on how many files are being restored.
Add a retry that will catch read errors and retry from where the last data was successfully read.
A bit of history -- this patch was first started three years ago but the memory context model at that time would not allow the interface (StorageRead) to own the driver (e.g. StorageReadS3). Subsequent improvements in memory contexts have allowed this ownership model and in fact it is now the default so no ownership changes are required in this patch except in StorageReadRemote which was not updated in f6e3073.
In the documentation and help output the possible values for most options are described as a list but there were several commands for which formatting was done in a different style.
Formatting has been applied to commands and their options to display option values as a list:
* `info` command, `--type` option
* `verify` command, `--output` option
* `version` command, `--output` option
* `manifest` command, `--output` option
A few bugs have been reported over the years on Alpine with musl libc and generally it has been possible to figure them out without testing on that platform but a few newer ones cannot be reproduced elsewhere.
Also, testing with additional libc implementations helps portability so it makes sense to add support.
For the most part musl libc works as expected with a few caveats:
1) The FD_*() macros won't accept an int file descriptor without warning when -Wsign-conversion is enabled. I opened https://www.openwall.com/lists/musl/2025/05/23/4 to discuss this and I was referred to https://www.openwall.com/lists/musl/2024/07/16/1 which explains why this happens. It was not a very satisfying answer but clearly it is not going to be addressed so a meson probe was added to detect the issue and disable the warning.
2) Floating point numbers are rounded differently than in GNU/Mac libc when formatted with printf() and friends. This is fine for the core code but causes issues in the unit tests that expect log entries to match exactly. This was solved in ad7ba46b by adding our own rough and ready formatting routines.
3) Some error messages are different from GNU/Mac libc. This was solved with a new error macro that accepts multiple messages in b5fbb16c.
4) For some reason ninja on Alpine outputs "nothing to do" messages to stderr whereas they go to stdout on other distributions. Redirecting stderr to stdout is our standard fix for this issue so do that. A non-zero result code will let us know that something has gone wrong.
5) It appears that profiling is not supported on Alpine, which is pretty surprising. For now fix this by only unit testing the profiling code when coverage is enabled. This is not a great solution since we would rather test profiling on any system that supports it but for now this will do.
This macro allows the error to be tested against multiple strings and it passes if any of them match.
This is useful for supporting multiple libc versions/implementations that have different error messages and is needed for an upcoming commit to add unit test support for musl libc.
Some C libraries (e.g.musl) render floating point numbers differently when using printf(). This does not cause any problems for the core code but the unit tests require more predictability to function smoothly without a lot of exceptions.
To accomplish this, wrap the various floating point operations in functions that mostly continue doing math using double but format the output string using integers. This leads to more predictable output at the cost of some complexity.
Rounding could also be accomplished using nearbyint() but this would require linking the math library, which does not seem worth it for a fairly simple operation.
The prior check would result in a segfault from musl libc, which is understandable but not OK for a test.
Instead set all the flags to get the expected invalid result.
The second invalid test was not required for coverage so remove it. This was likely added before the earlier test was required for additional coverage.
The list pointer was not nulled out so calling lstClear() caused a double free.
This is not a production issue but was noticed in some upcoming test code.
The assignment was obviously not kosher but seemed fine for a test in short-lived fork. However, musl libc defines stderr as read-only so this trick won't work.
Instead use dup2(), which is pretty clearly the correct way to do this.
This is another attempt to fix broken defaults in the command-line help. PR #2395 tried to do this by removing current and default values (at least from the list) but this removed much of the value of command-line help being context sensitive. Improvements were made to current values in b3ca2e34 but defaults were left as-is, i.e. broken in some cases.
This new approach handles defaults much the same way as current values. If there are multiple defaults for an option then it is displayed in the option list as <multi> and enumerated in the specific option help. For example, if repo1-type=s3 and repo2-type=azure then the defaults for repo1-storage-upload-chunk-size and repo2-storage-upload-chunk-size will differ.
However, getting the above right means we can no longer display defaults for options that are not valid in the current context. For instance, if repo1-type=posix then none of the repo1-s3-* options are valid and therefore their defaults are not displayed in help. In this case they don't have defaults and if you tried to set one of these options you would get an error.
Since the addition of libcurl4-openssl-dev requires a rebuild of the Debian containers go ahead and rebuild all containers to include new PostgreSQL minor release versions.
This aligns with what is reported by 'uname -m' and required by Docker naming conventions.
Also convert a 'aarch64' string to the VM_ARCH_AARCH64 constant.
These data types are equivalent so no code changes required -- just renaming.
The advantage is that uint8_t makes it clear that this is byte data. It also has the advantage of being shorter.
These defaults were being set in cfgLoadUpdate() but they can be better documented if they are part of the option rules. This is part of a general trend to move away from custom-coded defaults.