Bug Fixes:
* Fix resume after partial delete of backup by prior resume. (Reviewed by Cynthia Shang. Reported by Tom Swartz.)
Features:
* Add repo-ls command. (Reviewed by Cynthia Shang, Stefan Fercot.)
* Add repo-get command. (Contributed by Stefan Fercot, David Steele. Reviewed by Cynthia Shang.)
* Add archive-mode-check option. (Contributed by Stefan Fercot. Reviewed by David Steele, Michael Banck.)
Improvements:
* Improve archive-get performance. (Reviewed by Cynthia Shang.)
The stackTrace and memContext error handlers were hard-coded which made testing the error module in isolation impossible.
Making the error handlers configurable also makes adding new ones in the future easier.
This is useful for initialization that needs to be done for the test and all subsequent tests.
Use the new defines to implement initialization for sockets and statistics.
In preparation for multi-repo support, a repo tag is added in this commit to the expire command log and error messages. This change also affects the expect logs and the user-guide. The format of the tag is "repoX:" where X is the repo key used in the configuration.
Until multi-repo support has been completed, this tag will always be "repo1:".
When building tests only include files covered by the current test or by prior tests. This increases performance (less compilation and linking) and also helps detect cross-dependencies in the code. Since there are currently cross-dependencies the depend option is used to document them and allow compilation. The idea is to resolve them incrementally over time.
Add the harness option to include harness modules when the minimum requirements for compilation are met.
Add the feature option to indicate which features are now available in the harness (based on source modules already tested). This allows conditional compilation in harness modules when some features are not yet available.
This is required for coverage when the common/error module is run with just the source files required to make it run, rather than all source files as we do now.
Likely something in the harness is providing coverage, but cover it explicitly so the coverage won't be lost if the harness changes.
The original intention was to enclose complex code in braces but somehow braces got propagated almost everywhere.
Document the standard for braces in switch statements and update the code to reflect the standard.
At one time Minio had stability problems with latest but that appears to be resolved for the last year or so.
Use latest so we'll know if something breaks since Minio is frequently used in production.
This is phase 2 of verify command development (phase 1 was processing the archives and phase 3 will be reconciling the archives and backups). In this phase the backups are verified by verifying each file listed in the manifest for the backup and creating a result set with the list of invalid files, if any. A summary is then rendered.
Unit tests have been added and duplicate tests have been removed.
The info command provides total sizes for files in the backup on the database as well as the repository. The text output and associated user documentation has been updated to provide more clarity regarding the sizes being displayed.
In addition, the info command is updated to allow a user to optionally specify the repository when requesting a specific backup set. In this case, the text output will reflect the status of the stanza, the cipher types and archive min/max over all the repositories instead of a single repository when the repo option is specified.
The unit test Makefile generation was a hodge-podge of constants and rules based on distros/versions that easily got out of date and did not work on an unknown system. All of this dates from the mixed Perl/C unit test implementation.
Instead use configure to generate most of the important Makefile variables, which allows the unit tests to run on multiple platforms, e.g. MacOS and FreeBSD.
There is plenty of work to be done here and not all the unit tests work on MacOS and FreeBSD for various reasons.
As a POC update the MacOS and FreeBSD tests on Cirrus-CI to run a few command unit tests.
MacOS does not allow files to be removed recursively unless the owner has write and execute permissions on all the directories.
Some tests leave the permissions in a bad state so fix them up before trying to delete.
The exact message is platform dependent so get the platform error to use in the expect.
It doesn't matter what the message is as long as there is an error and it is logged.
YAML::XS requires libyaml so it not as portable as pure Perl versions of YAML.
Instead of using YAML:PP just use the general YAML::Any module which uses whatever is installed. We are not concerned about performance for YAML so whatever works is fine.
Messages on stderr were being lost due to the error suppression used to customize the error message.
Also update the formatting to be more informative and concise.
MacOS has a very old version of rsync that does not support this option.
Rather than require a newer version of rsync exclude the option since the plan is to remove the requirement for it.
This is a more appropriate place for the check and means test.pl can avoid loading any XML files if --no-gen is specified.
The XML::Checker::Parser module originally selected for XML in Perl is not very portable so the requirement reduces the number of platforms where tests can be run.
Clang justifiably complains about pointer arithmetic on a known NULL value during testing. We know this is fine but use uintptr_t to silence the warnings.
Found on MacOS M1.
The return value is not checked because we are happy with a truncated result in this case, which is guaranteed by passing the buffer size.
Found on MacOS M1.
Multi-repository implementations for the archive-push, check, info, stanza-create, stanza-upgrade, and stanza-delete commands.
Multi-repo configuration is disabled so there should be no behavioral changes between these commands and their current single-repo implementations.
Multi-repo documentation and integration tests are still in the multi-repo development branch. All unit tests work as multi-repo since they are able to bypass the configuration restrictions.
The option portion was not being capitalized or replacing - with _.
The parser does not care, but in cases where we have mixed hrnCfgEnv*()/setenv() calls the env variable might not get cleared, which can lead to funny test results.
The default lock path should fail since the test VM gives ownership of /tmp to root.
For some reason this was not working as expected under u18 but it fails under u20.
All unit tests now require full coverage so the "full" keyword is obsolete and has been removed.
The covered code modules are simply listed, with only "no code" modules annotated.
Check that archive files exist in the main process instead of the local process. This means that the archive.info file only needs to be loaded once per execution rather than once per file to get.
Stop looking when a file is missing or in error. PostgreSQL will never request anything past the missing file so there is no point in getting them. This also reduces "unable to find" logging in the async process.
Cache results of storageList() when looking for multiple files to reduce storage I/O.
Look for all requested archive files in the archive-id where the first file is found. They may not all be there, but this reduces the number of list calls. If subsequent files are in another archive id they will be found on the next archive-get call.
Append "asynchronously" to messages when the async process fetched the file (not in the actual async process log, though).
Add "repo1" to make it clear what archive we are talking about. This is not very useful by itself but soon we'll be able to add the archive id, which is very useful.
Add constants for messages that are used multiple times to ensure they stay consistent.
If files other than backup.manifest.copy were left in a backup path by a prior resume then the next resume would skip the backup rather than removing it. Since the backup path still existed, it would be found during backup label generation and cause an error if it appeared to be later than the new backup label. This occurred if the skipped backup was full.
The error was only likely on object stores such as S3 because of the order of file deletion. Posix file systems delete from the bottom up because directories containing files cannot be deleted. Object stores do not have directories so files are deleted in whatever order they are provided by the list command. However, the issue can be reproduced on a Posix file system by manually deleting backup.manifest.copy from a resumable backup path.
Fix the issue by removing the resumable backup if it has no manifest files. Also add a new warning message for this condition.
Note that this issue could be resolved by running expire or a new full backup.
These options specify the number of local worker job retries and the retry interval after one immediate retry.
There is some value in allowing retries to be specified by the user but for the most part these options are for suppressing retries during testing, which can save a lot of time. The bug introduced in d1d25c7 and fixed in 8b86d5e also suggests it is better not to use retries in tests.
Remove the default delayed retries for archive-get/archive-push, leaving only the immediate retry. These commands are retried by PostgreSQL so it doesn't make sense to do too many retries internally.
These options are currently internal.
The test was pretty old and written in stages during the migration, so storage use was a bit archaic and the organization was poor.
Update using the new storage macros and reorganize the tests to provide better coverage.
The macros should make it much easier to write complex tests, especially when compression and encryption are involved.
Update the command/archiveGet test to show how the new macros are used.
This avoids the need for strLstJoin() when testing lists.
Lists are \n delimited (rather than command or pipe) so that non-trivial lists can be more easily diff'd.
Add separation and some visual cues to help identify the start of a test.
Also add a counter which can be used to search for a specific test, which is useful if there is a lot of debug output to search through.
These were required to deal with the legacy Perl code being unable to load new options between tests.
The C code does not have this issue so remove the forks and update process ids in the log tests.
No timeout is expected here but the small timeout prevents errors from being thrown.
This is not a bug since the error would be thrown on the next archive-get call but it does make the tests harder to debug when there is an error.
It is not clear why there was a timeout here at all. It is likely cruft from a prior test or a copy/paste error.
Tests that are duplicated are being removed from the info command unit tests. Specifically tests where the only thing different was whether a lock was held or not which affects only the status display. Removing these tests will reduce churn in the upcoming multi-repo support.
The data returned by the protocol has not been sorted yet so it is vulnerable to differences in collation.
Multiple records are not needed for this test so limit it to one path to solve this issue.
The pg option only has one current usage, to let the backup local know which pg index it should copy files from.
There are other possible uses for this option, but they need thought, tests, and documentation.
This option was added in advance of the multi-repo functionality but it has no purpose and it is not clear what the validity rules should be.
The option will be added back when multi-repo functionality is committed.
There is an inconsistency when the JSON is output for the case when a stanza is requested and it does not exist in the repo. This was the only case where the archive array was not added to the JSON. Adding it will simplify the upcoming multi-repo support code.
Also, a redundant test was removed rather than updating it for this case.
This was a hack to prevent the remote from loading host settings, which is now handled by option validity for command roles.
These options are still useful so don't remove them, but do leave them internal for now.
Building on 23f5712, limit option validity by role. This is mostly for options that weren't needed for certain roles but were harmless. However, the upcoming multi repository functionality requires the granularity implemented here.
The remote role benefits since host options can automatically excluded when building the options. Also, many options that are only required for the default role (e.g. repo-retention-full) no longer need to be passed in tests for other roles.
Some tests used options in contexts that are currently valid but are not correct usage, i.e. usage of internal options for the default role.
Update these tests in advance of the option validity becoming stricter.
Validity by command was not granular enough so numerous options needed be marked internal so users would not stumble across them. Options were also needlessly being passed to roles that had no use for them.
Introduce per-role validity lists that depend on what roles are valid per command. Also add a check to ensure that only valid roles are used with a command.
This commit adds the functionality but does not introduce any new behavior, i.e. all options are valid for all roles that the command is valid for. A subsequent commit will introduce the new role restrictions to make the changes easier to audit.
Data required for parsing was spread between the config and defined modules, mostly for historical reasons because the same data was used by Perl.
Requiring all the parse rules to be accessed with function interfaces makes the code more complicated and new rules harder to implement.
Instead, move the data to the parse module so in the most complex cases no interface functions are needed. This reduces the total amount of code and paves the way for more complex parse rules.
The help data can be represented more compactly in a pack and this separates data needed for help from data needed for parsing, freeing each to have a more appropriate representation.
This was done in the internal versions but not the user-facing function. That meant the field had to be explicitly read after determining it was NULL, which is wasteful.
Since there is only one behavior now, remove pckReadDefaultNull() and move the logic to pckReadNullInternal().
Testing on Travis-CI has been getting slower (from ~18 minutes to 3-6 hours) and the travis-ci.org service will be terminated at the end of the year. Moving to travis-ci.com is an option but the quotas are too low for our purposes.
Instead use Github Actions, which does not currently have quotas, and runs our current tests with just a few tweaks.
This still leaves multi-architecture tests on Travis-CI but we may be able to run those and stay within the new quotas.
Also fix a minor bug in restoreTest.c exposed by Github Actions using a different name for the user and group.
The pack type is an architecture-independent format for serializing data compactly, inspired by ProtocolBuffers and Avro.
Also add ioReadSmall(), which is optimized for small binary reads, similar to ioReadLineParam().
The C code does not use doubles to represent seconds like the Perl code did so time can be represented as an integer which reduces the number of data types that config has to understand.
Also remove Variant doubles since they are no longer used.
Note that not all double code was removed since we still need to display times to the user in seconds and it is possible for the times to be fractional. In the future this will likely be simplified by storing the original user input and using that value when the time needs to be displayed.
Bug Fixes:
* Allow [, #, and space as the first character in database names. (Reviewed by Stefan Fercot, Cynthia Shang. Reported by Jefferson Alexandre.)
* Create standby.signal only on PostgreSQL 12 when restore type is standby. (Fixed by Stefan Fercot. Reviewed by David Steele. Reported by Keith Fiske.)
Features:
* Expire history files. (Contributed by Stefan Fercot. Reviewed by David Steele.)
* Report page checksum errors in info command text output. (Contributed by Stefan Fercot. Reviewed by Cynthia Shang.)
* Add repo-azure-endpoint option. (Reviewed by Cynthia Shang, Brian Peterson. Suggested by Brian Peterson.)
* Add pg-database option. (Reviewed by Cynthia Shang.)
Improvements:
* Improve info command output when a stanza is specified but missing. (Contributed by Stefan Fercot. Reviewed by Cynthia Shang, David Steele. Suggested by uspen.)
* Improve performance of large file lists in backup/restore commands. (Reviewed by Cynthia Shang, Oscar.)
* Add retries to PostgreSQL sleep when starting a backup. (Reviewed by Cynthia Shang. Suggested by Vitaliy Kukharik.)
Documentation Improvements:
* Replace RHEL/CentOS 6 documentation with RHEL/CentOS 8.
Update RHEL/CentOS 7 to cover the versions that were previously covered by RHEL/CentOS 6.
Since RHEL/CentOS 7/8 work the same update the documentation logic and labels to reflect this compatibility.
Inaccuracies in sleep time or clock skew might make a single sleep insufficient to reach the next second.
Add a few retries to make the process more reliable but still avoid an infinite loop if something is seriously wrong.
CentOS6 EOL'd and the mirrors were swiftly deleted, leading to failures in tests and documentation.
Remove CentOS 6 for now to get builds going again with the intention to replace it in the near future with CentOS 8.
There is not a lot to be done in this case since it looks like PostgreSQL disconnected while the query was running, but at least improve the error message and remove the assert, which indicates a coding error.
Refactor the code to allow a dynamic number of indexes for indexed options, e.g. pg-path. Our reliance on getopt_long() still limits the number of indexes we can have per group, but once this limitation is removed the rest of the code should be happy with dynamic numbers of indexes (with a reasonable maximum).
Add an option to set a default in each group. This was previously handled by the host-id option but now there is a specific option for each group, pg and repo. These remain internal until they can be fully tested with multi-repo support. They are fully tested for internal usage.
Remove the ConfigDefineOption enum and use the ConfigOption enum instead. They are now equal since the indexed options (e.g. cfgOptRepoHost2) have been removed from ConfigOption.
Remove the config/config test module and add required tests to the config/parse test module. Parsing is now the only way to load a config so this removes some redundancy.
Split new internal config structures and functions into a new header file, config.intern.h. More functions will need to be moved over from config.h but that will need to be done in a future commit to reduce churn.
Add repoIdx to repoIsLocal() and storageRepo*(). Multi-repository support requires that repo locality and storage be accessible by index. This allows, for example, multiple repos to be iterated in a loop. This could be done in a separate commit but doesn't seem worth it since the code is related.
Remove the type parameter from storageRepoGet(). This parameter existed solely to provide coverage for the case where the storage type was invalid. A better pattern is to check that the type is S3 once all other types have been ruled out.
Improve locking on remote processes by introducing an exec-id that is unique to the main process and passed to all remote processes. This allows the remote processes to determine if a lock is held by a remote from the same main process. If so, the lock is allowed.
The exec-id is also useful for associating remote logs with main logs for debugging purposes.
When restore type standby is provided, the recovery.signal isn't needed and may lead to some confusion (see #1236).
Lately, when using pg_basebackup --write-recovery-conf, only the standby.signal file is created. This change would then align with that behaviour.
If a user reset an option such as pg-default on the command-line then an override in the code would not take effect.
Ignore a reset when the code explicitly sets an option to prevent this.