<p>GitHub has a good facility for reporting issues, but it does not work as well for feature requests, some of which may take a long time to implement or never be implemented at all. The result is a long list of open issues that makes the <backrest/> project look as if it does not handle problems in a timely fashion, when in fact the vast majority of those issues are not bugs.</p>
<p>Feature requests submitted on GitHub will be moved here (unless they can be satisfied immediately) and the original issue will be closed, but a link will be preserved so comments can still be added. This is not ideal, but it seems like the best compromise at this time.</p>
<p>The text (default) info output should include everything (or nearly everything) that is in the JSON output, nicely formatted for human consumption.</p>
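<p>A minimal sketch of such a formatter, assuming a simplified info structure (the field names below are illustrative, not the actual JSON schema):</p>

```python
import json

def format_info_text(info_json):
    # Render the machine-readable info output as indented text. The
    # field names used here are assumptions for illustration only.
    info = json.loads(info_json)
    lines = []
    for stanza in info:
        lines.append("stanza: %s" % stanza["name"])
        lines.append("    status: %s" % stanza["status"])
        for backup in stanza.get("backup", []):
            lines.append("    %s backup: %s" % (backup["type"], backup["label"]))
    return "\n".join(lines)

sample = json.dumps([
    {"name": "main", "status": "ok",
     "backup": [{"type": "full", "label": "20150101-000000F"}]}
])
print(format_info_text(sample))
```

<p>Driving the text output from the same structure as the JSON output would keep the two from drifting apart.</p>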
</section>
<section id="delete-backup-label">
<title>Delete backup_label as soon as possible</title>
<p>If <postgres/> crashes during a backup it may not be able to recover if the backup_label file is present. Copy and delete the file right after pg_start_backup() returns. pg_stop_backup() will want to delete it, so it may be necessary to copy it back, or at least touch a file that <postgres/> can delete. Check after the backup is complete to make sure it is really gone.</p>
</section>
<section id="processes-vs-threads">
<title>Abandon threads and go to processes</title>
<p>Even with the thread refactor, threads are not reliable on all platforms. Processes would be more compatible across platforms, and basic testing has shown no significant performance tradeoffs.</p>
<p>Create a setting that allows a time period to be set for retention. Currently only a fixed number of backups can be kept. If <setting>retention-full=2</setting> and two backups are taken back to back, the period of protection will be very short.</p>
<p>The new option <setting>retention-period</setting> will be expressed in hours, days, weeks, months, or years and will work with the current retention settings as follows: the time period will be honored, but middle backups will be pruned to match <setting>retention-full</setting>. For example, if <setting>retention-period=2 weeks</setting> and <setting>retention-full=2</setting> and a third full backup is taken, the first backup will be pruned if it is older than two weeks; otherwise the middle backup will be pruned. This gives two weeks of backup coverage while keeping the number of backups down.</p>
<p>As part of this feature, WAL should be expired according to <setting>retention-period</setting> if nothing else is set. This will support installations that use <backrest/> only for archiving.</p>
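<p>The pruning rule described above could be sketched like this (the function shape and manifest representation are assumptions for illustration, not the actual expiration code):</p>

```python
from datetime import datetime, timedelta

def full_backup_to_expire(backups, retention_full, retention_period, now=None):
    # backups: list of (label, timestamp) tuples, oldest first.
    # retention_period: a timedelta for the proposed retention-period
    # option. Returns the label of the backup to expire, or None.
    if len(backups) <= retention_full:
        return None
    now = now or datetime.now()
    oldest_label, oldest_time = backups[0]
    # Honor the time period: the oldest backup is only expired once it
    # has aged out of the window...
    if now - oldest_time > retention_period:
        return oldest_label
    # ...otherwise expire a middle backup to keep the count down.
    return backups[1][0]
```

<p>With <setting>retention-full=2</setting> and a two-week period, a third full backup expires the oldest one only once it is over two weeks old; before that, the middle backup goes instead.</p>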
</section>
<section id="fully-qualified-paths">
<title>Only use fully-qualified paths remotely when used locally</title>
<p>If <backrest/> is run locally without being fully qualified, it should also be run that way remotely. The binaries might be in different paths but still on the search path. The <setting>remote-cmd</setting> option can still be used to set the path explicitly.</p>
</section>
<section id="link-only-option">
<title>Implement --link-only option</title>
<p>This option would allow the user to restore the link but none of the data at the destination. This would be useful for linked configuration files, which are good to back up but not appropriate to restore in every situation, e.g. a standby that has its own custom configuration. Another example is a link to the log directory inside $PGDATA.</p>
<p>Certain parameters, like <setting>db-path</setting> and <setting>repo-path</setting>, must be configured on both sides when the backup server is remote. It would be better if these parameters were pulled from the remote side so they are not repeated.</p>
<p>Make async archiving work when pg_receivexlog is writing directly to the out spool directory on the backup server. Setting this up with a replication slot will make it more reliable.</p>
<p><backrest/> will not call pg_receivexlog directly, but it can work in concert with it for more reliable archiving.</p>
</section>
<section id="two-archivers-same-repo">
<title>Support two archivers on same repository</title>
<p>The <setting>archive_mode=always</setting> setting in 9.5 allows both the master and the standby to log to the same archive for redundancy. Currently <backrest/> will error if this mode is set because it is not supported.</p>
<p>Getting one archive file at a time can be tedious if the cluster is very far behind. An asynchronous get with some sort of prefetch would speed the process up a lot.</p>
<p>It should be possible to specify how many archive logs to prefetch.</p>
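<p>A sketch of how a prefetcher might compute the segments to fetch ahead of the one just requested (simplified: this assumes <postgres/> &gt;= 9.3 segment naming and ignores timeline switches):</p>

```python
def wal_prefetch_list(segment, count):
    # Compute the next WAL segment names to prefetch. A WAL segment
    # name is 24 hex digits: timeline (8), log number (8), segment (8).
    # In >= 9.3 the low field runs 00000000..000000FF before the log
    # number increments.
    tli = segment[:8]
    log = int(segment[8:16], 16)
    seg = int(segment[16:24], 16)
    names = []
    for _ in range(count):
        seg += 1
        if seg > 0xFF:
            seg = 0
            log += 1
        names.append("%s%08X%08X" % (tli, log, seg))
    return names
```

<p>The async getter would fetch these names in the background so that when recovery asks for the next segment it is already local.</p>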
</section>
<section id="multi-process-archive-push-get">
<title>Multi-processing for archive-get and archive-push</title>
<p>Multi-processing would improve performance for these operations, especially archive-push. However, even very large systems have been working well with asynchronous archiving, so this is not a high priority.</p>
<p>Checksums are calculated during the backup process, but the delta for diff/incr backups is still based on timestamp and size. Add a new option, checksum_delta (default n), that performs the delta using checksums. Of course, if the timestamp or size has changed, the checksum does not need to be calculated.</p>
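<p>The decision could look roughly like this (the manifest-entry shape is an assumption for illustration; <backrest/> does record SHA-1 checksums):</p>

```python
import hashlib
import os

def file_needs_copy(path, prior, checksum_delta=False):
    # prior: dict with 'size', 'timestamp', and 'checksum' as recorded
    # in the previous backup manifest (shape assumed for illustration).
    st = os.stat(path)
    if st.st_size != prior["size"] or int(st.st_mtime) != prior["timestamp"]:
        return True   # size/timestamp changed: no checksum needed
    if not checksum_delta:
        return False  # default behavior: trust size and timestamp
    with open(path, "rb") as fh:
        current = hashlib.sha1(fh.read()).hexdigest()
    return current != prior["checksum"]
```

<p>The short-circuit on size/timestamp keeps the common case cheap; the checksum is only computed when the metadata says the file is unchanged.</p>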
</section>
<section id="interactive-log-level">
<title>Set --log-level-console=info for interactive sessions</title>
<p>Suggested by <link url="{[github-url-root]}/terrorobe">Michael Renner</link>.</p>
<p>If an archive log is missing in the middle of an archive stream, <backrest/> will return a soft error (1), even though there is probably no chance of that archive log ever showing up.</p>
<p>If an archive log is missing, check whether the next one is present; if so, return a hard error. This is tricky because there is a question of how long to wait. With parallel async push it is very possible for WAL segments to arrive out of order.</p>
<p>Here's a possible solution: <backrest/> on the database server knows the oldest WAL segment that is currently on the db server and has not been pushed. If this is reported to the backup server, it can determine whether a hole in the archive stream may still be filled or is a permanent condition.</p>
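<p>The check on the backup server could then be as simple as the sketch below (the reporting protocol is the proposal above, not an existing feature; within a single timeline the 24-character hex names sort lexically in WAL order, so a plain string comparison works):</p>

```python
def archive_gap_is_permanent(missing_segment, oldest_unpushed):
    # oldest_unpushed: the oldest WAL segment still queued on the db
    # server as reported by the proposed protocol message, or None if
    # nothing is waiting to be pushed.
    if oldest_unpushed is None:
        return True   # nothing left to push: the hole will never fill
    # If the missing segment is older than everything still queued, it
    # can no longer arrive and the error should be hard.
    return missing_segment < oldest_unpushed
```

<p>A permanent gap would then produce a hard error immediately instead of a soft error that retries forever.</p>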
</section>
<section id="async-archive-sleep">
<title>Add configurable sleep to archiver process to reduce ssh connections</title>
<p>The async archiver exits as soon as there are no files left to transfer. A configurable sleep would be good because it would reduce the number of SSH connections made to the remote.</p>
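<p>The idea can be sketched as a polling loop (the function and parameter names are hypothetical, purely to illustrate the shape of the change):</p>

```python
import time

def archiver_loop(next_batch, push, sleep_seconds, max_idle_polls):
    # next_batch() returns the list of WAL files currently ready to
    # push. Rather than exiting the moment the queue is empty, poll a
    # few more times with a sleep so a single ssh connection to the
    # remote can serve several batches.
    idle = 0
    while idle < max_idle_polls:
        batch = next_batch()
        if batch:
            push(batch)
            idle = 0
        else:
            idle += 1
            time.sleep(sleep_seconds)
```

<p>Both the sleep and the number of idle polls before exit would be configurable, trading a slightly longer-lived process for fewer connections.</p>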
<p>Preserve exact WAL timestamps to make measurement of WAL rates more accurate in monitoring. The timestamp should be taken from the file before copying so that delays in archiving can also be measured.</p>
<p>Allow user to indicate that a backup is locked and should be preserved until unlocked. This could be handy for the last backup of a previous PG version or just to save data that is known to be important for any reason.</p>
</section>
<section id="archive-info-stanza">
<title>Write stanza name into archive.info and check it</title>
<p>Add a throttling feature to limit the amount of disk i/o and/or network bandwidth used by the backup.</p>
<p>Make this a per-thread limitation to start. That simplifies the problem quite a bit, and most users who throttle will probably be running single-threaded.</p>
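<p>A minimal per-thread form of the throttle, assuming a simple average-rate cap (names and structure are illustrative, not an actual <backrest/> interface):</p>

```python
import time

def throttled_copy(read_chunk, write_chunk, bytes_per_sec, chunk_size=8192):
    # Copy data while capping the average transfer rate: copy in
    # chunks, then sleep whenever the copy is running ahead of the
    # configured rate.
    start = time.monotonic()
    copied = 0
    while True:
        chunk = read_chunk(chunk_size)
        if not chunk:
            break
        write_chunk(chunk)
        copied += len(chunk)
        min_elapsed = copied / float(bytes_per_sec)
        elapsed = time.monotonic() - start
        if min_elapsed > elapsed:
            time.sleep(min_elapsed - elapsed)
    return copied
```

<p>A per-thread cap like this composes naturally if multi-threaded throttling is added later: each thread simply gets a share of the total budget.</p>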
<p>It is possible to notify users earlier if archiving is not working during a backup. Check in the main backup loop to see whether archiving is proceeding; if not, fail after a configurable amount of time.</p>
</section>
<section id="remote-sudo-user">
<title>Allow sudo user to be specified when calling remote via ssh</title>
<p>Add new options <setting>db-sudo-user</setting> and <setting>backup-sudo-user</setting> to allow the <backrest/> command to be run through sudo for security. This is especially important on the db side.</p>
</section>
<section id="db-user-different">
<title>Allow db user to be different than OS user for backup</title>
<p>Although the file system backup needs to run as <postgres/>, it can be advantageous to have the start/stop backup run as a less privileged database user with the REPLICATION role. This will need to be tested to see if it works.</p>
<p>Ideally the file system backup could be run as a user in the <postgres/> group rather than <postgres/> itself, but <postgres/> does not grant group permissions, so sadly it is not possible to back up as a user other than the database owner at the file-system level. This limitation will need to be addressed in core <postgres/>.</p>
<p><postgres/> >= 9.3 has the option to enable checksums on data files. <backrest/> should be able to test these checksums and report if it finds any issues. This could also be a stand-alone function.</p>
<p>For extra credit, test the checksums in WAL.</p>
<p>Consider alternative storage methods like S3. Ideally there would be an option to store a certain number of backups (at least the last one) locally for fast restores, while using S3 for long-term storage.</p>
<p>Some ssh options like ForceCommand can modify the command line passed to the remote. Also pass the command line in the protocol layer to ensure no destructive changes were made.</p>
</section>
<section id="diff-incr-checksum">
<title>Don't keep incr/diff files when the checksum matches</title>
<p>If a file is recopied in an incr/diff backup because of a timestamp change, there may still be cases where the file was not actually modified. Since checksums are calculated anyway, it is possible to check the file against the previous backup and create a reference when the checksums match.</p>
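<p>The store-or-reference decision might look like this (the manifest shape below is assumed for illustration, not <backrest/>'s actual manifest format):</p>

```python
def store_or_reference(checksum, filename, prior_manifest):
    # prior_manifest maps filename -> {'checksum': ..., 'reference': ...}
    # where 'reference' is the label of the backup holding the copy.
    # When the freshly computed checksum matches the prior backup's
    # copy, record a reference instead of keeping the duplicate file.
    prior = prior_manifest.get(filename)
    if prior is not None and prior["checksum"] == checksum:
        return ("reference", prior["reference"])
    return ("store", None)
```

<p>Since the checksum has already been computed during the copy, the comparison costs nothing extra and only saves space.</p>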
</section>
<section id="move-file-object-create">
<title>Move File object creation to Config.pm</title>
<p>File objects are created in many places, but it is all basically the same code. Move this to a common function modeled on protocolGet().</p>
</section>
</section>
<section id="testing">
<title>Testing</title>
<section id="test-recovery">
<title>Test to ensure recovery=none with backup_label replay before last checkpoint</title>
<p>Low-level regression tests to be sure locking works as expected locally and remotely. This should include tests on NFS since this is a popular scenario.</p>
</section>
<section id="test-debug-lines">
<title>Separate debug params onto separate lines</title>
<p>The debug params all end up on a single line, so if one value changes it is tough to tell which one changed. Separate them onto individual lines to aid debugging, even though this will add a lot of lines to the file.</p>
</section>
<section id="test-points-different-times">
<title>Allow test points to have different times</title>
<p>Currently it is possible to have multiple test points, but they must all have the same delay time. Make it so each test point can have its own delay.</p>
<p>A perfect test case would be adding keep alive testing to restore. The RESTORE_START test should have delay 1 while the KEEP_ALIVE test should have delay 0.</p>
</section>
</section>
<section id="documentation">
<title>Documentation</title>
<section id="doc-option-allowed-range">
<title>Automatically write option allowed range into docs</title>
<p>Options with an allowed range should have that range written into the reference documentation automatically rather than being maintained by hand.</p>
</section>
<section id="doc-option-multiple">
<title>Automatically document options that can be passed multiple times</title>
<p>Some options can be passed multiple times on the command line (or the config file) and this should be written into the reference guide automatically rather than being manually written per option.</p>
<p>Discuss how <backrest/> is different from other <postgres/> backup solutions, using this <link url="https://consul.io/intro/vs/zookeeper.html">consul comparison</link> as a model.</p>