This error was lost during the migration to C. The error that occurred instead (generally an SSH auth error) was hard to debug.
Restore the original behavior by throwing an error immediately if pg1-host is configured for any of these commands. reset-pg1-host can be used to suppress the error when required.
The main improvement is a double-fork to prevent zombie processes if the parent process exits after the (child) async process. This is a real possibility since the parent process sticks around to monitor the results of the async process.
In the first fork, ignore SIGCHLD in the very unlikely case that the async process exits before the first fork. This is probably only possible if the async process exits immediately, perhaps due to a chdir() failure. Set SIGCHLD back to default in the async process so waitpid() will work as expected.
Also update the comment on chdir() to more accurately reflect what is happening.
Finally, add a test in certain debug builds to ensure the first fork exits very quickly. This only works when valgrind is not in use because valgrind makes forking so slow that it is hard to tell if the async process performed work or not (in the case that the second fork goes missing and the async process is a direct child).
2a06df93 removed the error file so an old error would not be reported before the async process had a chance to try again. However, if the async process was already running this might lead to a timeout error before reporting the correct error.
Instead, remove the error files once we know that the async process will start, i.e. after the archive lock has been acquired.
This effectively reverts 2a06df93.
Auto-selection is performed only when --set is not specified. If a backup set for the given target time cannot not be found, the latest (default) backup set will be used.
Currently a limited number of date formats are recognized and timezone names are not allowed, only timezone offsets.
Validate that checksums exist for zero size files. This means that the checksums for zero size files are explicitly set by backup even though they'll always be the same. Also validate that zero length files have the correct checksum.
Validate that repo size is > 0 if size is > 0. No matter what compression type is used a non-zero amount of data cannot be stored in zero bytes.
This is a modest start but it addresses the specific issue that was caused by the bug fixed in 45ec694a. This validation will produce an immediate error rather than erroring out partway through the restore.
More validations are planned but this is the most important one and seems safest for this release.
If a file was removed by PostgreSQL during the backup (or was missing from the standby) then the next file might not be copied and updated in the manifest. If this happened then the backup would error when restored.
The issue was that removing files from the manifest invalidated the pointers stored in the processing queues. When a file was removed, all the pointers shifted to the next file in the list, causing a file to be unprocessed. Since the unprocessed file was still in the manifest it would be saved with no checksum, causing a failure on restore.
When process-max was > 1 then the bug would often not express since the file had already been pulled from the queue and updates to the manifest are done by name rather than by pointer.
The last queue was not being sorted when a primary queue was added first.
This did not affect the backup or integrity but could lead to slightly lower performance since large files were not always copied first.
Previously memNew() used memset() to initialize all struct members to 0, NULL, false, etc. While this appears to work in practice, it is a violation of the C specification. For instance, NULL == 0 must be true but neither NULL nor 0 must be represented with all zero bits.
Instead use designated initializers to initialize structs. These guarantee that struct members will be properly initialized even if they are not specified in the initializer. Note that due to a quirk in the C99 specification at least one member must be explicitly initialized even if it needs to be the default value.
Since pre-zeroed memory is no longer required, adjust memAllocInternal()/memReallocInternal() to return raw memory and update dependent functions accordingly. All instances of memset() have been removed except in debug/test code where needed.
Add memMewPtrArray() to allocate an array of pointers and automatically set all pointers to NULL.
Rename memGrowRaw() to the more logical memResize().
The timeline is required to verify WAL segments in the archive after a backup. The conversion was performed base 10 instead of 16, which led to errors when the timeline was ≥ 0xA.
This macro block encapsulates the common pattern of switching to the prior (formerly called old) mem context to return results from a function.
Also rename MEM_CONTEXT_OLD() to memContextPrior(). This violates our convention of macros being in all caps but memContextPrior() will become a function very soon so this will reduce churn.
The local, remote, archive-get-async, and archive-push-async commands were used to run functionality that was not directly available to the user. Unfortunately that meant they would not pick up options from the command that the user expected, e.g. backup, archive-get, etc.
Remove the internal commands and add roles which allow pgBackRest to determine what functionality is required without implementing special commands. This way the options are loaded from the expected command section.
Since remote is no longer a specific command with its own options, more manipulation is required when calling remote. This might be something we can improve in the config system but it may be worth leaving as is because it is a one-off, for now at least.
Time is supported in all drivers with the update to S3 at 61538f93, so it is now possible to add time to the ls command and have it work on all repo types.
Parameter lists that are passed directly to exec*() do not need quoting when spaces are present. Worse, the quotes will not be stripped and the option value will be garbled.
Unfortunately this still does not fix all issues with quoting since we don't know how it might need to be escaped to work with SSH command configuration. The answer seems to be to pass the options in the protocol layer but that's beyond the scope of this commit.
PostgreSQL >= 9.6 uses non-exclusive backup which has implicit stop-auto since the backup will stop when the connection is terminated.
The warning was made more verbose in 1f2ce45e but this now seems like a bad idea since there are likely users with mixed version environments where stop-auto is enabled globally. There's no reason to fill their logs with warnings over a harmless option. If anything we should warn when stop-auto is explicitly set to false but this doesn't seem very important either.
Revert to the prior behavior, which is to warn and reset when stop-auto is enabled on PostgreSQL < 9.3.
Remove embedded Perl from the distributed binary. This includes code, configure, Makefile, and packages. The distributed binary is now pure C.
Remove storagePathEnforceSet() from the C Storage object which allowed Perl to write outside of the storage base directory. Update mock/all and real/all integration tests to use storageLocal() where they were violating this rule.
Remove "c" option that allowed the remote to tell if it was being called from C or Perl.
Code to convert options to JSON for passing to Perl (perl/config.c) has been moved to LibC since it is still required for Perl integration tests.
Update build and installation instructions in the user guide.
Remove all Perl unit tests.
Remove obsolete Perl code. In particular this included all the Perl protocol code which required modifications to the Perl storage, manifest, and db objects that are still required for integration testing but only run locally. Any remaining Perl code is required for testing, documentation, or code generation.
Rename perlReq to binReq in define.yaml to indicate that the binary is required for a test. This had been the actual meaning for quite some time but the key was never renamed.
For the most part this is a direct migration of the Perl code into C except as noted below.
A backup can now be initiated from a linked directory. The link will not be stored in the manifest or recreated on restore. If a link or directory does not already exist in the restore location then a directory will be created.
The logic for creating backup labels has been improved and it should no longer be possible to get a backup label earlier than the latest backup even with timezone changes or clock skew. This has never been an issue in the field that we know of, but we found it in testing.
For online backups all times are fetched from the PostgreSQL primary host (before only copy start was). This doesn't affect backup integrity but it does prevent clock skew between hosts affecting backup duration reporting.
Archive copy now works as expected when the archive and backup have different compression settings, i.e. when one is compressed and the other is not. This was a long-standing bug in the Perl code.
Resume will now work even if hardlink settings have been changed.
Reviewed by Cynthia Shang.
Commit 7168e074 tried to use cwd() as PGDATA but this would disagree with the path configured in pgBackRest if PGDATA was symlinked.
If cwd() does not match the pgBackRest path then chdir() to the path and make sure the next cwd() matches the result from the first call.
This module will eventually contain various useful zero-terminated string functions.
For now, using NULL_Z instead of strPtr(NULL_STR) avoids a strict aliasing warning on RHEL 6. This is likely a compiler issue, but adding these constants seems like a good idea anyway and we are not going to get a fix in a gcc that old.
A recopy would occur if the size or checksum was invalid but on error the backup would terminate.
Instead, recopy the resumed file on any error. If the error is systemic (e.g. network failure) then it should show up again during the recopy.
Using the same macros for formatted and unformatted logging had several disadvantages.
First, the compiler was unable to verify the format string against the parameters.
Second, legitimate % characters in messages were being interpreted as format characters with garbage output ensuing.
Add _FMT() variants and update all call sites to use the correct variant.
Adding a dummy column which is always set by the P() macro allows a single macro to be used for parameters or no parameters without violating C's prohibition on the {} initializer.
-Wmissing-field-initializers remains disabled because it still gives wildly different results between versions of gcc.
Previously, options were being filtered based on what was currently valid. For chained commands (e.g. backup then expire) some options may be valid for the first command but not the second.
Filter based on the command definition rather than what is currently valid to avoid logging options that are not valid for subsequent commands. This reduces the number of options logged and will hopefully help avoid confusion and expect log churn.
Previously we were using int64_t to debug time_t but this may not be right depending on how the compiler represents time_t, e.g. it could be a float.
Since a mismatch would have caused a compiler error we are not worried that this has actually happened, and anyway the worst case is that the debug log would be wonky.
The primary benefit, aside from correctness, is that it makes choosing a parameter debug type for time_t obvious.
Using pg1-path, as we were doing previously, could lead to WAL being copied to/from unexpected places. PostgreSQL sets the current working directory to PGDATA so we can use that to resolve relative paths.
Note that building the manifest on each host has been temporarily removed.
This feature will likely be brought back as a non-default option (after the manifest code has been fully migrated to C) since it can be fairly expensive.