The expect tests were originally a rough-and-ready type of unit test, so monitoring changes in the expect log helped us detect changes in behavior.
Now the stanza code is heavily unit-tested, so the detailed logs mainly cause churn and don't have any measurable benefit.
Reduce the log level to DETAIL to make the logs less verbose and volatile while still checking user-facing log messages.
The same test configurations are run on all four test VMs, which is a waste of resources.
Vary the tests per VM to increase coverage while reducing the total number of tests. Be sure to include each major feature (remote, s3, encryption) in each VM at least once.
The expect tests were originally a rough-and-ready type of unit test, so monitoring changes in the expect log helped us detect changes in behavior.
Now the archive code is heavily unit-tested, so the detailed logs mainly cause churn and don't have any measurable benefit.
Reduce the log level to DETAIL to make the logs less verbose and volatile while still checking user-facing log messages.
The file write object destructors called close() and finalized the file even if it was not completely written. This was an issue in both the C and Perl code.
Rewrite the destructors to simply free resources (like file handles) rather than calling the close() method. This leaves the temp file in place for filesystems that use temp files.
Add unit tests to prevent regression.
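In sketch form, the pattern looks like this (names are illustrative, not the actual pgBackRest objects):

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* Hypothetical file write object */
    typedef struct FileWrite
    {
        int handle;                                     /* open descriptor, -1 when closed */
        char *nameTmp;                                  /* temp file being written */
        char *name;                                     /* final name, set on close() */
    } FileWrite;

    /* Explicit close: finalize by renaming the temp file into place */
    void fileWriteClose(FileWrite *this)
    {
        close(this->handle);
        this->handle = -1;
        rename(this->nameTmp, this->name);
    }

    /* Destructor: free resources only -- an incomplete write is never
       finalized, so the temp file is left in place */
    void fileWriteFree(FileWrite *this)
    {
        if (this->handle != -1)
            close(this->handle);

        free(this->nameTmp);
        free(this->name);
        free(this);
    }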
Reported by blogh.
This was not being caught because the integration tests for S3 were running remotely and going through the Perl code rather than the new C code.
Implement the exists method for the S3 driver and add tests to prevent a regression.
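One plausible shape for the check, assuming a helper that issues an S3 HEAD request and returns the HTTP status (the real driver code differs):

    #include <stdbool.h>
    #include <stdlib.h>

    /* Assumed helper: performs a HEAD request and returns the HTTP status */
    int s3RequestHead(const char *path);

    /* Hypothetical exists implementation: 200 means the object exists and
       404 means it does not -- anything else is an error */
    bool storageS3Exists(const char *path)
    {
        int status = s3RequestHead(path);

        if (status == 200)
            return true;

        if (status == 404)
            return false;

        abort();                                        /* real code would throw */
    }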
Reported by mibiio.
Multiple status files were being created by asynchronous archiving if a high-level error occurred after one or more WAL segments had already been transferred successfully. Error files were being written for every file in the queue regardless of whether it had already succeeded. To fix this, add an option to skip writing error files when an ok file already exists.
There are other situations where both files might exist (various fsync and filesystem error scenarios), so it seems best to retry when multiple status files are found rather than throwing a hard error (which would leave archiving completely stuck). In the case of multiple status files, a warning will be logged to alert the user that something unusual is happening and the command will be retried.
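A sketch of the skip rule, with hypothetical callbacks standing in for the real storage calls:

    #include <stdbool.h>

    /* When recording a high-level error for the queue, skip any WAL segment
       that already has an ok status file so a prior success is never
       overwritten with an error file */
    void archiveAsyncErrorWrite(
        const char *const *walSegmentList, unsigned int listSize,
        bool (*okFileExists)(const char *walSegment),
        void (*errorFileWrite)(const char *walSegment))
    {
        for (unsigned int listIdx = 0; listIdx < listSize; listIdx++)
        {
            if (okFileExists(walSegmentList[listIdx]))
                continue;

            errorFileWrite(walSegmentList[listIdx]);
        }
    }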
Reported by fpa-postgres, Joe Ayers, Douglas J Hunley.
The C info code has already been committed but this commit wires it into main.
Also remove the info Perl code and tests since they are no longer called.
After a stanza-upgrade, backups for the old cluster are displayed until they expire. Cluster info was output newest to oldest, which meant that after an upgrade the most recent backup would no longer be output last.
Update the text output ordering so the most recent backup is always output last.
Contributed by Cynthia Shang.
Suggested by Ryan Lambert.
- Add detail to errors when info files are loaded with incorrect encryption settings.
- Throw FileMissingError rather than FileOpenError when both copies of the info file are missing.
- If one file is present (but errors) and the other is missing, then return the error for the file that was present.
Contributed by Cynthia Shang.
Previously chown() would be called even when no ownership changes were required.
In most cases changes are not required and it seems better to perform an extra stat() rather than an extra chown().
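A minimal sketch of the idea (the actual owner() code differs):

    #include <sys/stat.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* stat() first and call chown() only when the owner or group differs */
    int ownerSet(const char *path, uid_t uid, gid_t gid)
    {
        struct stat info;

        if (stat(path, &info) == -1)
            return -1;

        if (info.st_uid == uid && info.st_gid == gid)
            return 0;                                   /* nothing to do */

        return chown(path, uid, gid);
    }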
Also add unit tests for owner() since there weren't any.
The previous error message only showed the last error. In addition, some errors were missed (such as directory permission errors) that could prevent the copy from being checked.
Show both errors below a generic "unable to load" error. Details are now given explaining exactly why the primary and copy failed.
Previously if one file could not be loaded a warning would be output. This has been removed because it is not clear what the user should do in this case. Should they do a stanza-create --force? Maybe the best idea is to automatically repair the corrupt file, but on the other hand that might just spread corruption if pgBackRest makes the wrong choice.
The decryption filter was added in archiveGetFile() and archiveGetCheck() was modified to return the WAL decryption key stored in archive.info. The rest was plumbing.
The mock/archive/1 integration test added encryption to provide coverage for the new code paths while mock/archive/2 dropped encryption to provide coverage for the existing code paths. This caused some churn in the expect logs but there was no change in behavior.
The only change required was to remove the filter that prevented S3 storage from being used. The archive-get command did not require any modification which demonstrates that the storage interface is working as intended.
The mock/archive/3 integration test was modified to run S3 storage locally to provide coverage for the new code paths while mock/stanza/3 was modified to run S3 storage remotely to provide coverage for the existing code paths. This caused some churn in the expect logs but there was no change in behavior.
There are a number of cases where a checksum delta is more appropriate than the default time-based delta:
* Timeline has switched since the prior backup
* File timestamp is older than recorded in the prior backup
* File size changed but timestamp did not
* File timestamp is in the future compared to the start of the backup
* Online option has changed since the prior backup
A practical example is that checksum delta will be enabled after a failover to standby due to the timeline switch. In this case, timestamps can't be trusted and our recommendation has been to run a full backup, which can impact the retention schedule and requires manual intervention.
Now, a checksum delta will be performed if the backup type is incr/diff. This means more CPU will be used during the backup but the backup size will be smaller and the retention schedule will not be impacted.
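In sketch form, with illustrative flag names rather than the actual backup variables:

    #include <stdbool.h>

    /* Any one of these conditions means timestamps cannot be trusted, so
       fall back to checksum delta */
    bool backupDeltaChecksum(
        bool timelineSwitched, bool fileTimestampOlder, bool sizeChangedTimestampSame,
        bool fileTimestampInFuture, bool onlineOptionChanged)
    {
        return
            timelineSwitched || fileTimestampOlder || sizeChangedTimestampSame ||
            fileTimestampInFuture || onlineOptionChanged;
    }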
Contributed by Cynthia Shang.
We were already retrying 500 errors, but 503 (rate-limiting) errors were not being retried and would cause an instant failure that aborted the command.
There are only two 5xx errors currently implemented by S3, but instead of adding 503 simply retry all 5xx errors. This is consistent with the HTTP definition of this error class: "the server failed to fulfill an apparently valid request."
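The retry test then reduces to a class check, sketched here with an illustrative name:

    #include <stdbool.h>

    /* Retry the entire 5xx class instead of listing individual codes such
       as 500 and 503 */
    bool httpResponseCodeRetryable(int responseCode)
    {
        return responseCode >= 500 && responseCode <= 599;
    }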
Suggested by Craig A. James.
This calculation was missed when the WAL segment size was made dynamic in preparation for PostgreSQL 11.
Fix the calculation by checking the actual WAL file sizes instead of using an estimate based on WAL segment size. This is more accurate because it takes into account .history and .backup files, which are smaller. Since the calculation is done in the async process the additional processing time should not adversely affect performance.
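In sketch form, assuming the file sizes have already been gathered from the queue (names are illustrative):

    #include <stdint.h>

    /* Sum actual file sizes rather than multiplying a file count by the WAL
       segment size, so smaller .history and .backup files are counted
       accurately */
    uint64_t archiveQueueSize(const uint64_t *fileSizeList, unsigned int listSize)
    {
        uint64_t result = 0;

        for (unsigned int listIdx = 0; listIdx < listSize; listIdx++)
            result += fileSizeList[listIdx];

        return result;
    }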
Remove the PG_WAL_SIZE constant and instead use local constants where the old value is still required. This is only the case for some tests and PostgreSQL 8.3 which does not provide a way to get the WAL segment size from pg_control.
If an error occurred while acquiring a lock on a remote server the error would be reported correctly, but the queue max detection code was not reached. The tests failed to detect this because they fixed the connection before queue max was exceeded, allowing the code to be reached.
Move the queue max code before the lock so it will run even when remote connections are not working. This means that no attempt will be made to transfer WAL once queue max has been exceeded, but it makes it much more likely that the code will be reached without error.
Update the tests to continue generating errors up to the point where queue max is exceeded.
Reported by Lardière Sébastien.
PostgreSQL 11 introduces configurable WAL segment sizes, from 1MB to 1GB.
There are two areas that needed to be updated to support this: building the archive-get queue and checking that WAL has been archived after a backup. Both operations require the WAL segment size to properly build a list.
Checking the archive after a backup is still implemented in Perl and has an active database connection, so just get the WAL segment size from the database.
The archive-get command does not have a connection to the database, so get the WAL segment size from pg_control instead. This requires a deeper inspection of pg_control than has been done in the past, so it seemed best to copy the relevant data structures from each version of PostgreSQL and build a generic interface layer to address them. While this approach is a bit verbose, it has the advantage of being relatively simple, and can easily be updated for new versions of PostgreSQL.
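The interface layer might look roughly like this; struct fields, offsets, and names are illustrative, not the real pg_control definitions:

    #include <stdint.h>
    #include <string.h>

    /* Generic result decoded from any supported pg_control version */
    typedef struct PgControl
    {
        uint64_t systemId;
        unsigned int walSegmentSize;
    } PgControl;

    /* One decoder per supported version, built from that version's copied
       pg_control struct definition (body is illustrative) */
    static PgControl pgControlDecode11(const unsigned char *buffer)
    {
        PgControl result = {0};
        memcpy(&result.systemId, buffer, sizeof(result.systemId));
        /* ...read walSegmentSize at the version-specific offset... */
        return result;
    }

    /* Generic interface: select the decoder based on the cluster version */
    PgControl pgControlFromBuffer(const unsigned char *buffer, unsigned int pgVersion)
    {
        if (pgVersion >= 110000)                        /* PostgreSQL 11 */
            return pgControlDecode11(buffer);

        /* ...older versions dispatch to their own decoders... */
        PgControl result = {0};
        return result;
    }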
Since the integration tests generate pg_control files for testing, teach Perl how to generate files with the correct offsets for both 32-bit and 64-bit architectures.
Use checksums rather than timestamps to determine if files have changed. This is useful in cases where the timestamps may not be trustworthy, e.g. when performing an incremental after failing over to a standby.
If checksum delta is enabled then checksums will be used for verification of resumed backups, even if they are full. Resumes have always used checksums to verify the files in the repository, enabling delta performs checksums on the database files as well.
Note that the user must manually enable this feature in cases where it would be useful or just keep it enabled all the time. A future commit will address automatically enabling the feature in cases where it seems likely to be useful.
Contributed by Cynthia Shang.
Apparently we never needed to run this function remotely.
It will be needed by the backup checksum delta feature, so implement it now.
Contributed by Cynthia Shang.
For read-only repositories the Posix and CIFS drivers behave exactly the same. Since that's all we support in C right now it's valid to treat them as the same thing. An assertion has been added to remind us to add the CIFS driver before allowing the repository to be writable.
Mostly we want to make sure that the C code does not blow up when the repository type is CIFS.
% characters caused issues in backup/restore due to filenames being appended directly into a format string.
Reserved XML characters (<>&') caused issues in the S3 driver due to improper escaping.
Add a file with all common special characters to regression testing.
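The % problem is the classic format-string bug, e.g.:

    #include <stdio.h>

    void logFileNameBad(const char *fileName)
    {
        printf(fileName);                               /* "file%s" is read as a format */
    }

    void logFileNameGood(const char *fileName)
    {
        printf("%s", fileName);                         /* file name is plain data */
    }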
The new archive-get C code can't run (yet) when encryption is enabled. Therefore, move the encryption tests so we can test the new C code. We'll move them back when encryption is enabled in C.
Also, push one WAL segment with compression to test decompression in the C code.
A return code of 1 from archive-get was being logged as an error message at info level even though the command otherwise worked correctly.
Also improve info messages when an archive segment is or is not found.
Previously an error would be generated if other files were present and not owned by the PostgreSQL user. This hasn't been a big deal in practice but it could cause issues.
Also add tests to make sure the same logic applies with links to files, i.e. all other files in the directory should be ignored. This was actually working correctly, but there were no tests for it before.
The log-subprocess feature added in 22765670 failed to take into account the naming of remote processes spawned by local processes. Not only was the local command used to name the log files, but the process id was not passed through. This meant every remote log was named "[stanza]-local-remote-000", which is confusing and meant multiple processes were writing to the same log.
Instead, pass the real command and process id to the remote. This required a minor change in locking to ignore locks if the process id is greater than 0, since remotes started by locals never lock.
Relative link paths were being combined with the paths of previous links (relative or absolute) due to the $strPath variable being modified in the current iteration rather than simply being passed to the next level of recursion.
This issue did not affect absolute links and relative tablespace links were caught by other checks, though the error was confusing.
Reported by Cynthia Shang.
Offline operation runs counter to the purpose of this command, which is to check if archiving and backups are working correctly.
Reported by Jason O'Donnell.
Implemented using the same logic as the patches adding this feature to PostgreSQL, 8694cc96 and 920a5e50. Temporary relation exclusion is enabled in PostgreSQL ≥ 9.0. Unlogged relation exclusion is enabled in PostgreSQL ≥ 9.1, where the feature was introduced.
Contributed by Cynthia Shang.
A regression in v0.82 removed the timestamp comparison when deciding which files from the aborted backup to keep on resume. All resumed backups should be considered inconsistent. A resumed backup can be identified by checking the log for the message "aborted backup of same type exists, will be cleaned to remove invalid files and resumed".
Reported by David Youatt, Yogesh Sharma, Stephen Frost.
S3 (and gateways) always set content-length or transfer-encoding, but HTTP 1.1 does not require either and proxies (e.g. HAProxy) may not include them.
Suggested by Adam K. Sumner.
pgBackRest currently has no way to request new credentials so the entire command (e.g. backup, restore) must complete before the credentials expire.
Contributed by Yogesh Sharma.
Many options that were set per test can instead be inferred from the types, i.e. container, c, expect, and individual.
Also finish renaming Perl unit tests with the -perl suffix.
Configuration files are loaded from the directory specified by the --config-include-path option.
Add the --config-path option for overriding the default base path of the --config and --config-include-path options.
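For example, a base file plus a drop-in loaded from the include path (paths and option names are illustrative):

    # /etc/pgbackrest/pgbackrest.conf -- loaded via --config
    [global]
    repo1-path=/var/lib/pgbackrest

    # /etc/pgbackrest/conf.d/demo.conf -- loaded via --config-include-path
    [demo]
    pg1-path=/var/lib/postgresql/10/demo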
Contributed by Cynthia Shang.
The Perl process was exiting directly when called, but that interfered with proper locking for the forked async process. Now Perl returns results to the C process, which handles all errors, including signals.
This provides correct matching in the event there are system-id and db-version duplicates (e.g. after reverting a pg_upgrade).
Fixed by Cynthia Shang.
Reported by Adam K. Sumner.
This allows specific options in pgbackrest.conf to be ignored (and set to default) which reduces the need to write new configuration files for specific needs.
Note that boolean, non-command-line options are already negatable.
When a backup host is present, backups should only be allowed on the backup host and restores should only be allowed on the database host unless an alternate configuration is created that ignores the remote host.
Reported by Lardière Sébastien.
Required to test restores on the backup server, a fairly common scenario.
Improve the restore function to accept optional parameters rather than a long list of parameters. In passing, clean up extraneous use of strType and strComment variables.
When more than one db was specified, the path, port, and socket path for db1 were passed no matter which db was actually being addressed.
Reported by Uspen.
If the backup cannot map a group to a name it stores the group in the manifest as false, then uses either the owner of $PGDATA to set the group during restore or, failing that, the group of the current user. This logic was not working correctly because the selected group was overwriting the user on restore, leaving the group undefined and the user incorrectly set to the group.
Reported by Jeff McCormick.
The existing static files would not work with 32-bit or big-endian systems so create functions to generate these files dynamically rather than creating a bunch of new static files.
After a stanza-upgrade it should still be possible to restore backups from the previous version and perform recovery with archive-get. However, archive-get only checked the most recent db version/id and failed.
Also clean up some issues when the same db version/id appears multiple times in the history.
Fixed by Cynthia Shang.
Reported by Clinton Adams.
* Exclude contents of pg_snapshots, pg_serial, pg_notify, and pg_dynshmem from backup since they are rebuilt on startup.
* Exclude pg_internal.init files from backup since they are rebuilt on startup.
The archive_status directory is now recreated on restore to support PostgreSQL 8.3 which does not recreate it automatically like more recent versions do.
Also fixed log checking after PostgreSQL shuts down to include FATAL messages and to disallow immediate shutdowns, which can throw FATAL errors in the log.
Reported by Stephen Frost.
Modified the info command (both text and JSON output) to display the archive ID and minimum/maximum WAL currently present in the archive for the current and prior, if any, database cluster version.
Contributed by Cynthia Shang.
The integration tests that were supposed to prevent this regression did not work as intended. They verified the contents of a table in the (supposedly) restored tablespace, deleted the table, and then deleted the tablespace. All of this was deemed sufficient to prove that the tablespace had been restored correctly and was valid.
However, PostgreSQL will happily recreate a tablespace on the basis of a single full-page write, at least in the affected versions. Since writes to the test table were replayed from WAL with each recovery, all the tests passed even though the tablespace was missing after the restore.
The tests have been updated to include direct comparisons against the file system and a new table that is not replayed after a restore because it is created before the backup and never modified again.
Versions ≥ 9.0 were not affected due to numerous synthetic integration tests that verify backups and restores file by file.
* More optimized container suite that greatly improves build time.
* Added static Debian packages for Devel::Cover to reduce build time.
* Add deprecated state for containers. Deprecated containers may only be used to build packages.
* Remove Debian 8 from CI because it does not provide additional coverage over Ubuntu 14.04 and Ubuntu 16.04.
The options accommodate systems where CAs are not automatically found by IO::Socket::SSL (e.g. RHEL7) and allow custom CAs to be loaded.
Suggested by Scott Frazer.
* Combine hardlink and non/compressed in synthetic tests to reduce test time and improve coverage.
* Change log level of hardlink logging to detail.
* Cast size in S3 manifest to integer.
Refactor storage layer to allow for new repository filesystems using drivers.
Reviewed by Cynthia Shang.
Refactor IO layer to allow for new compression formats, checksum types, and other capabilities using filters.
Reviewed by Cynthia Shang.