These timeouts were nested, i.e. an error in the socket connection would also retry in the TLS and HTTP layers. This led to a multiplying effect such that it took nine minutes to fully timeout with the default io-timeout of 60 seconds.
A fix for this was attempted in 5314dbff but it reduced retries by too much and had to be reverted in fa5b2d44.
Instead fix by moving connection attempts to the lower layer (e.g. TLS -> socket) out of the exception block but leave it within the retry loop. So, for example, if the socket connection fails after retries then the error will not be retried by TLS. But if the TLS session fails the socket will be reconnected.
The STRDEF definition of plural is out of scope in the error message just below it, which may lead to either garbage in the log message or a crash with SIGSEGV.
This is particularly visible with pg_tde, which encrypts tables: if pgbackrest is run with checksum enabled, it tries to emit this message for all encrypted files.
Fix by using simple char * constants in the error message.
HTTP client errors 408 and 429 were not being retried but there seems to be some benefit to doing so.
408 has only been seen once and in that case the server was returning 400 (with Request Timeout in the response body) but it seems worth doing since this could happen during times of high congestion. Requests are not sent until request content is ready so there is not much else to be done to handle this error.
429 has been seen occasionally in the past but now seems to be common on Cloudflare R2. Ideally we would have a large initial back off here but it is not clear that is worth it at this time. The existing Fibonacci back off should be enough to allow operations to proceed if possible during the configured timeout.
Bug Fixes:
* Fix issue with adhoc expiration when no backups in a repository. (Reviewed by Stefan Fercot. Reported by Anup Gupta.)
Features:
* Add restore progress to info command output. (Contributed by Denis Garsh, Maxim Michkov. Reviewed by David Steele.)
* Add progress-only detail level for info command output. (Contributed by Denis Garsh. Reviewed by David Steele, Stefan Fercot.)
Improvements:
* Retry failed reads on object stores. (Reviewed by David Christensen.)
* Fix defaults in command-line help. (Reviewed by David Christensen, Chris Bandy.)
Documentation Improvements:
* Describe discrete option values in a list where appropriate. (Contributed by Anton Kurochkin. Reviewed by David Steele.)
* Fix "less than" in help output for archive-mode option. (Contributed by Anton Kurochkin. Reviewed by David Steele.)
The prior implementation of the restore command did not provide any progress information. Implement it similarly to the backup command and also update the info command to display restore progress alongside backup progress.
Restore is a read operation and should not block other commands. The only exception is that multiple restores with the same lock and repository path are not allowed, as each restore must write progress to a separate file. Therefore, no lock is needed for restores with a remote role and the restore command should not be terminated when the stop command is executed.
The info command fetches a lot of information from the repository about backups and archives, so this operation can be slow. Because progress data is stored in local lock files, accessing the repository is unnecessary when only progress information is required.
This patch introduces the `--detail-level=[progress|full]` option, with `full` as the default. The `progress` level limits the info command output to progress details without querying the repository. The only remaining operations are scanning the folder structure to list available stanzas and reading lock files.
Note: When `progress` is selected, the info command performs no checks beyond verifying stanza availability.
If there are no backups in one or more repositories then the following error occurs during adhoc expiration:
ASSERT: [025]: cannot get index 0 from list with 0 value(s)
Fix this by skipping the adhoc logic when there are no backups in a repository.
Some defaults and allow lists were determined in cfgLoadUpdateOption() because they depended on the values of other options, .e.g. compress-level.
Instead build this functionality into the config parser. Not only does this standardize the defaults and allow lists but it makes it possible to automate the documentation, which is also done in this commit.
There is currently a retry if the initial get request fails (depending on the error code) but if the read fails later on while fetching blocks of data it is fatal. In most cases there is a higher level retry (e.g. restore) but restarting the restore job might be expensive depending on how many files are being restored.
Add a retry that will catch read errors and retry from where the last data was successfully read.
A bit of history -- this patch was first started three years ago but the memory context model at that time would not allow the interface (StorageRead) to own the driver (e.g. StorageReadS3). Subsequent improvements in memory contexts have allowed this ownership model and in fact it is now the default so no ownership changes are required in this patch except in StorageReadRemote which was not updated in f6e3073.
In the documentation and help output the possible values for most options are described as a list but there were several commands for which formatting was done in a different style.
Formatting has been applied to commands and their options to display option values as a list:
* `info` command, `--type` option
* `verify` command, `--output` option
* `version` command, `--output` option
* `manifest` command, `--output` option
Some C libraries (e.g.musl) render floating point numbers differently when using printf(). This does not cause any problems for the core code but the unit tests require more predictability to function smoothly without a lot of exceptions.
To accomplish this, wrap the various floating point operations in functions that mostly continue doing math using double but format the output string using integers. This leads to more predictable output at the cost of some complexity.
Rounding could also be accomplished using nearbyint() but this would require linking the math library, which does not seem worth it for a fairly simple operation.
This is another attempt to fix broken defaults in the command-line help. PR #2395 tried to do this by removing current and default values (at least from the list) but this removed much of the value of command-line help being context sensitive. Improvements were made to current values in b3ca2e34 but defaults were left as-is, i.e. broken in some cases.
This new approach handles defaults much the same way as current values. If there are multiple defaults for an option then it is displayed in the option list as <multi> and enumerated in the specific option help. For example, if repo1-type=s3 and repo2-type=azure then the defaults for repo1-storage-upload-chunk-size and repo2-storage-upload-chunk-size will differ.
However, getting the above right means we can no longer display defaults for options that are not valid in the current context. For instance, if repo1-type=posix then none of the repo1-s3-* options are valid and therefore their defaults are not displayed in help. In this case they don't have defaults and if you tried to set one of these options you would get an error.
These defaults were being set in cfgLoadUpdate() but they can be better documented if they are part of the option rules. This is part of a general trend to move away from custom-coded defaults.
20bfd14 removed content-md5 where allowed by the specification but failed to notice that either content-md5 or x-amz-content-* is required for PUT when object lock is enabled.
On top of that it appears Scality S3 (at least?) won't accept alternate content checksums when object lock is enabled. Technically this is a violation of the specification but nonetheless the change breaks working installations.
For now it seems safer to revert this change and pursue a better solution for a future feature release.
Bug Fixes:
* Fix block incremental restore issue on non-default repository. (Reviewed by David Christensen, Aleksander Łukasz. Reported by Aleksander Łukasz.)
* Do not set recovery_target_timeline=current for PostgreSQL < 12. (Reviewed by Stefan Fercot.)
* Fix expire archive range logging. (Reviewed by Stefan Fercot. Reported by Aleš Zelený.)
* Fix error reporting for queries with no results. (Reviewed by Stefan Fercot. Reported by Susantha Bathige.)
Features:
* Verify recovery target timeline. (Reviewed by Stefan Fercot.)
* Allow verification of a specified backup. (Contributed by Maxim Michkov. Reviewed by David Steele.)
* Add support for S3/GCS requester pays. (Contributed by Timothée Peignier. Reviewed by David Steele.)
* PostgreSQL 18 experimental support. (Reviewed by Stefan Fercot.)
* Allow connections to PostgreSQL on abstract domain sockets. (Reviewed by Chris Bandy. Suggested by Chris Bandy.)
* Add numeric output to version command. (Contributed by Stefan Fercot. Reviewed by David Steele.)
Improvements:
* Allow backup command to operate on remote repositories. (Reviewed by Stefan Fercot.)
* Use lz4 for protocol compression. (Reviewed by Stefan Fercot.)
* Calculate content-md5 on S3 only when required. (Reviewed by David Christensen.)
* Warn when a value for a multi-key option is overwritten. (Reviewed by David Christensen, Stefan Fercot.)
* Add detail logging for expired archive path. (Contributed by Stefan Fercot. Reviewed by David Steele.)
* Remove support for PostgreSQL 9.4. (Reviewed by Stefan Fercot.)
* Remove autoconf/make build. (Reviewed by David Christensen.)
Documentation Improvements:
* Fix documentation for specifying multiple stanzas with tls-server-auth. (Reviewed by David Christensen, Stefan Fercot. Suggested by Terry MacAndrew.)
* Clarify incremental backup expiration. (Reviewed by Stefan Fercot.)
* Clarify requirement for local/remote pgBackRest versions to match. (Contributed by Greg Clough. Reviewed by David Steele.)
* Add FAQ about exporting self-contained cluster. (Contributed by Stefan Fercot. Reviewed by David Steele.)
* Caveat --tablespace-map-all regarding tablespace creation. (Reviewed by Stefan Fercot, Christophe Courtois. Suggested by Christophe Courtois.)
* Clarify behavior of --repo-retention-full-type. (Reviewed by Antoine Beaupré. Suggested by Antoine Beaupré.)
* Change --process-max recommendation for object stores to --repo-bundle. (Reviewed by Stefan Fercot.)
* Update unix_socket_directory to unix_socket_directories. (Contributed by hyunkyu han. Reviewed by David Steele.)
* Recommend not placing spool-path within pg_xlog/pg_wal. (Reviewed by Martín Marqués, Don Seiler. Suggested by Martín Marqués.)
87776bc9 updated the RHEL documentation to PostgreSQL 13/14 but did not update recovery highlighting to be compatible with RHEL. This was not caught because the RHEL documentation was being build as PDF, which does not do highlighting.
Instead build the RHEL documentation as HTML in the first stage (and PDF in the second) so the error is caught.
Finally, fix the RHEL documentation to generate the highlight by concatenating the log.
The documentation was a bit misleading regarding how incremental backups are expired. Update the misleading part ("Differential and Incremental backups are count-based...") and move the explanation of how incremental expiration works out of differential expiration into the introductory paragraph.
Also add a note about how full backups are considered as differential for the purpose of expiration.
Some options can contain multiple key/value pairs. However, if if the key is specified again the value will be silently overwritten. In most cases one value per key is appropriate, but it makes sense to warn the user about the overwrite.
Currently the pg-socket-path option is required to be a valid absolute path but this precludes the use of abstract domain sockets.
Set the option type to string so abstract domain sockets are allowed. This removes some validation but libpq will report if the path is invalid and we don't use it for any other purpose.
The content-md5 header was generated for all requests with content but it is only required for batch delete requests. It is not clear why this header is required when x-amz-content-sha256 is also provided or why it
is required only for this request but the documentation is clear on the matter. However, the content for these requests is relatively small compared to uploading files so omitting content-md5 where possible will
save some CPU cycles.
Current AWS S3 and recent Minio don't complain if this header is missing but since it is still required by older versions of Minio and it is specified in the documentation for batch delete it is makes sense to
keep it.
The prior documentation said that multiple stanzas should be specified by repeating the tls-server-auth option. This is incorrect -- in fact a comma-separated list of stanza should be used.
Fix the documentation and add a test to confirm this behavior.
In passing add some const qualifiers that were missing in the relevant code.
This method was introduced in cad595f but on further reflection it does not seem worth the added complexity just to make restore consistency faster without improving the speed of the overall backup. The most common recovery case is PITR and this method produces diminishing returns as the recovery time gets further from the backup end time.
A better solution (not implemented here) is to copy unmodified files from prior backups. This is much faster than recopying and compressing files from the cluster (especially on object stores with a copy command) and can even be done after the backup window to further reduce WAL replay required for consistency. It also reduces load on the host where the backup is made.
Add support for the verify command --set option. This (internal) option was already accepted without errors but was not implemented.
The default behavior for verify is to check all the backups present. With the --set option only the specified backup will be verified. If the specified backup label is not valid an error is added to the result and verification is skipped. In addition, only WAL required to make the specified backup consistent will be verified.
RHEL 7 is EOL so remove it from the label. Rather than update the version range just remove it from the label since the user guide is generally applicable to RHEL.
PostgreSQL 12 is EOL and no longer available in the yum.postgresql.org repository.
Update the base and update versions of the RHEL and Debian documentation to better cover supported versions.
If the selected backup to restore was not in the default (lowest number) repository and block incremental was used, then restore would erroneously try to load the file super block list from the default repository. Specifying --repo would fix this since it changed the default repository.
Fix by updating the super block read to the specified repository.
This macro is only ever called last in functions so this is not an active issue, but it makes sense to fix since it would pose a risk for future development.