Stop services while creating snapshots during backup?

Avid Amoeba · 1 year ago

Admiral Patrick · 1 year ago

Wouldn’t restoring from such a backup be equivalent to kill -9 or pulling the cable and restarting the service?

Disclaimer: Not familiar with Immich, but this is what I’ve experienced generally.

AFAIK, effectively yes. The only thing you might lose is anything in memory that hasn’t been written to disk at the time the snapshot was taken (which is still effectively equivalent to kill -9).

At work, we use Veeam which is snapshot based, and database server restores (or spinning up a test DB based off of production) work just fine. That said, we still take scheduled dumps/backups of the database servers just to have known-good states to roll back to if ever the need arises.

Avid Amoeba · 1 year ago

Thanks for validating my reasoning. And yeah, this isn’t Immich-specific, it would be valid for any process and its data.

@BCsven@lemmy.ca · 1 year ago

What I have seen on corporate servers is that when a backup starts, the database goes into a different mode: a temporary writable partition takes the new writes while the read-only database is backed up, and at the end of the backup the blob created there is stored as well.

Avid Amoeba · 1 year ago

Yeah, if you’re making a backup using the database system itself, then it would make sense for it to do something like that if it stays live while backing up. If you think about it, it’s kinda similar to taking a snapshot of the volume where an app’s data files are while it still runs. The app keeps writing as normal while you copy the data from the snapshot, which is read-only. Of course there’s no built-in way to get the newly written data without stopping the process. But you could get the downtime down to a small number. 😄
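
For anyone who wants to see what "a backup using the database system itself" can look like, here is a minimal sketch using Python's built-in sqlite3 module. SQLite is only a stand-in here, not necessarily what any of the services in this thread use, and the function name `live_backup` and the file paths are made up for illustration.

```python
import sqlite3


def live_backup(source_path: str, dest_path: str) -> None:
    """Copy a SQLite database to dest_path while other connections may still use it."""
    src = sqlite3.connect(source_path)
    dst = sqlite3.connect(dest_path)
    try:
        # Connection.backup uses SQLite's online backup API; with the default
        # pages=-1 the whole database is copied in a single step.
        src.backup(dst)
    finally:
        dst.close()
        src.close()
```

The service does not have to be stopped for this, which is the same trade-off as copying from a read-only volume snapshot: anything written after the copy starts is not in the backup.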

@gedhrel@lemmy.world · 1 year ago

The other thing to watch out for is if you’re splitting state between volumes, but I think you’ve already ruled that out.

Avid Amoeba · 1 year ago

Oh yeah, that would be a disaster if not handled correctly.

@gedhrel@lemmy.world · 1 year ago

I’d be cautious about the “kill -9” reasoning. It isn’t necessarily equivalent to yanking power.

Contents of application memory lost, yes. Contents of unflushed OS buffers, no. Your db will be fsyncing (or moral equivalent thereof) if it’s worth the name.
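
A rough sketch of the three layers involved, in Python and assuming a POSIX system. The helper name `durable_write` is hypothetical; the point is only where the data lives after each call.

```python
import os


def durable_write(path: str, data: bytes) -> None:
    """Write data so that it should survive both kill -9 and a power loss."""
    with open(path, "wb") as f:
        f.write(data)         # in the process's own buffer: lost on kill -9
        f.flush()             # in the OS page cache: survives kill -9, not a power cut
        os.fsync(f.fileno())  # asks the OS to push the data to stable storage
    # Also fsync the containing directory so the new directory entry is durable.
    # Opening a directory like this works on POSIX systems, not on Windows.
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

A process killed between `flush()` and `fsync()` leaves its data in the page cache, where the kernel will still write it out. A power cut at the same moment loses it, which is the gap between the two failure modes.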

This is an aside; backing up from a volume snapshot is half a reasonable idea. (The other half is ensuring that you can restore from the backup, regularly, automatically, and the third half is ensuring that your automated validation can be relied on.)
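
One way to start on the "can you actually restore" half is to compare checksums of the restored tree against the read-only snapshot it came from. This is a minimal sketch in Python; `manifest` and `restore_matches` are hypothetical helper names, and matching bytes only show that the copy is intact, not that the service starts cleanly from it.

```python
import hashlib
from pathlib import Path


def manifest(root: str) -> dict[str, str]:
    """Map each file's path (relative to root) to the SHA-256 of its contents."""
    base = Path(root)
    result = {}
    for p in sorted(base.rglob("*")):
        if p.is_file():
            digest = hashlib.sha256(p.read_bytes()).hexdigest()
            result[p.relative_to(base).as_posix()] = digest
    return result


def restore_matches(snapshot_dir: str, restored_dir: str) -> bool:
    """True if both trees contain the same files with identical contents."""
    return manifest(snapshot_dir) == manifest(restored_dir)
```

Compare against the snapshot rather than the live data directory, since the live files keep changing after the snapshot is taken. `read_bytes()` loads each file fully into memory, so large media files would want a chunked read instead.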

Avid Amoeba · 1 year ago

Contents of application memory lost, yes. Contents of unflushed OS buffers, no. Your db will be fsyncing (or moral equivalent thereof) if it’s worth the name.

Good point. I guess kill -9 is somewhat less catastrophic than a power-yank. If a service is written well enough to handle the latter it should be able to handle the former. Should, subject to very interesting bugs that can hide in the difference.

This is an aside; backing up from a volume snapshot is half a reasonable idea. (The other half is ensuring that you can restore from the backup, regularly, automatically, and the third half is ensuring that your automated validation can be relied on.)

I’m currently thinking of setting up automatic restore of these backups on the off-site backup machine. That is, the backups are transferred to the off-site machine, restored to the dirs of the services, then the services are started. This should cover the second half, I think. Of course those services can’t be used to store new data because they’ll be overwritten with every backup. In the event of a hard snafu where the main machine disappears, I could stop the auto restore on the off-site machine and start using the services from it, effectively making it the main machine. If this turns out to be reasonable and working, I might trash all of the file-based backup-and-transfer mechanisms and switch to ZFS send/recv. That should allow shrinking the data delta between main and off-site to minutes instead of hours or days. Does this make any sense?
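
A sketch of what the ZFS send/recv variant could look like, written in Python so it only builds the command lines and does not run them. The dataset, snapshot, and host names are all hypothetical, and `replication_commands` is a made-up helper; `zfs snapshot`, `zfs send -i`, and `zfs receive -F` are the real subcommands.

```python
import shlex


def replication_commands(dataset: str, prev_snap: str, new_snap: str,
                         remote_host: str, remote_dataset: str) -> list[str]:
    """Build the shell commands for one incremental replication round."""
    # Take a new snapshot of the dataset holding the services' data.
    snapshot = ["zfs", "snapshot", f"{dataset}@{new_snap}"]
    # Send only the blocks that changed between the previous and the new snapshot.
    send = ["zfs", "send", "-i", f"{dataset}@{prev_snap}", f"{dataset}@{new_snap}"]
    # -F rolls the target back to its latest snapshot first, discarding anything
    # the off-site copies of the services wrote since the last round.
    recv = ["ssh", remote_host, "zfs", "receive", "-F", remote_dataset]
    return [shlex.join(snapshot), f"{shlex.join(send)} | {shlex.join(recv)}"]


if __name__ == "__main__":
    for cmd in replication_commands("tank/services", "snap1", "snap2",
                                    "offsite", "backup/services"):
        print(cmd)
```

The off-site services would have to be stopped before each receive and started again afterwards, just like in the file-based restore plan, because `receive -F` rolls back the dataset underneath them.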