
Hi 🙂

 

I need to do the following:

 1) compress folders into (multi TB) archive files

 2) move the archive files onto a remote server

And then at some point, potentially:

 3) retrieve the archive files from the remote server

 4) extract the archive files back into folders to make the data available again.

 

I was told I should use MD5 checksums to check whether my archive files are corrupted after I retrieve them from the remote server. I am not familiar with MD5 checksums.

AFAIU this consists of additional steps after step 2) (let's call them 2b and 2c):

 2b) generate an MD5 checksum hash for each archive file

 2c) keep the MD5 hash somewhere safe

Then:

 3b) use the MD5 checksum hash to make sure the archive file is not corrupted in any way.
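
For illustration, steps 2b), 2c) and 3b) could look roughly like the Python sketch below. It assumes the digest is computed on the local copy before the transfer, and the function names and the `<archive>.md5` file naming are made up for this example, not part of any particular tool.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks so that
    a multi-TB archive never has to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_checksum(archive_path):
    """Steps 2b/2c: store the digest in a small text file next to the
    archive (hypothetical '<archive>.md5' naming). Keep a copy of that
    file somewhere safe, separate from the archive itself."""
    archive_path = Path(archive_path)
    checksum_path = archive_path.with_name(archive_path.name + ".md5")
    checksum_path.write_text(md5_of_file(archive_path) + "\n")
    return checksum_path


def verify_checksum(archive_path, checksum_path):
    """Step 3b: recompute the digest of the retrieved archive and compare
    it with the stored one. True means the bytes are unchanged."""
    expected = Path(checksum_path).read_text().strip()
    return md5_of_file(archive_path) == expected
```

A mismatch in `verify_checksum` only says that something changed; it does not say where, and it cannot repair anything.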

 

Does this sound about right?

 

I guess the step where the data is most likely to get corrupted is the transfer over the network, right? (I understand bit rot can occur on both local and remote drives, but I assume this is less likely. Maybe not...)

 

My question is the following:

I am considering saving time by combining steps 1) and 2). What I mean is that I would compress the data while copying it onto the remote server.

But then, if I run the MD5 checksum on the archive file on the remote server, the archive could already be corrupted, right?

Does it make any sense to still run an MD5 checksum in this case?

I could extract the archive file on the remote server to make sure it's not corrupted, though (immediately erasing the extracted files after the test)... Does that make any sense?
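
One way to keep a meaningful checksum when steps 1) and 2) are combined is to hash the bytes as they are produced locally, then compare that digest with one computed later on the remote copy. Below is a minimal Python sketch of that idea. It assumes the remote storage is reachable as a normal path (`destination_path`), and `HashingWriter` and `archive_with_checksum` are hypothetical helper names, not library classes.

```python
import hashlib
import tarfile


class HashingWriter:
    """File-like wrapper that forwards writes to `raw` and feeds the same
    bytes into an MD5 digest (hypothetical helper for this sketch)."""

    def __init__(self, raw):
        self.raw = raw
        self.digest = hashlib.md5()

    def write(self, data):
        self.digest.update(data)
        return self.raw.write(data)


def archive_with_checksum(source_dir, destination_path):
    """Write an uncompressed tar of source_dir to destination_path and
    return the MD5 of the byte stream as it was generated locally."""
    with open(destination_path, "wb") as raw:
        writer = HashingWriter(raw)
        # "w|" is tarfile's streaming mode: it only needs write(),
        # so the archive never has to exist as a separate local file.
        with tarfile.open(fileobj=writer, mode="w|") as tar:
            tar.add(source_dir, arcname=".")
        return writer.digest.hexdigest()
```

If the digest computed on the remote copy later differs from the value returned here, the archive changed somewhere between this machine and the remote storage.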

 

Thank you very much in advance for your advice and recommendations.

 

Best,

-a-

https://linustechtips.com/topic/1516329-at-what-step-do-data-get-corrupted/

Compressed archive formats usually store a checksum internally for each file. On decompression, the tool compares the extracted data against that checksum and flags any mismatch.
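
As a small illustration of that internal check, here is a Python sketch using the standard `zipfile` module. A .zip stores a CRC-32 per member even when nothing is compressed, and `testzip()` re-reads every member against it. The member name and payload are made up for the example. (Plain .tar is different: its header checksum covers only the headers, not the file contents.)

```python
import io
import zipfile


def build_stored_zip(name, payload):
    """Create an in-memory .zip holding one uncompressed (stored) member."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_STORED) as zf:
        zf.writestr(name, payload)
    return buffer.getvalue()


def first_bad_member(zip_bytes):
    """Return the name of the first member whose CRC-32 fails, or None."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return zf.testzip()


payload = b"cold data destined for tape " * 100
good = build_stored_zip("data.bin", payload)

# Simulate corruption: flip one bit inside the stored payload.
damaged = bytearray(good)
offset = good.find(payload) + 10
damaged[offset] ^= 0x01

print(first_bad_member(good))            # None
print(first_bad_member(bytes(damaged)))  # data.bin
```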

 

Network protocols also contain error detection. In case data is corrupted in transit, it can be managed, for example by re-transmitting the affected block.

 

IMO the highest-risk time is when the compressed archive is created. If an error occurred at that point, the archive could contain corrupted data, but because the internal checksum would be computed from that same corrupted data, it would still appear to match. As long as you don't manually overclock your CPU or run extreme RAM settings, the risk from this should be insignificant.
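
If you want to rule that case out, one option is to compare the archive against the source files right after creating it, while the originals still exist. A rough Python sketch for an uncompressed .tar follows. It assumes members were added with paths relative to `source_dir`, and `mismatched_members` is a hypothetical helper name. It only checks members that are present in the archive, so it will not notice a file that was never added.

```python
import hashlib
import tarfile
from pathlib import Path


def digest_of(fh, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of an open binary file object."""
    digest = hashlib.md5()
    for chunk in iter(lambda: fh.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def mismatched_members(source_dir, archive_path):
    """Compare every regular file in the tar against the file it was made
    from and return the member names whose contents differ."""
    source_dir = Path(source_dir)
    bad = []
    with tarfile.open(archive_path, "r") as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directories, links, etc.
            with tar.extractfile(member) as archived:
                archived_digest = digest_of(archived)
            with open(source_dir / member.name, "rb") as original:
                if digest_of(original) != archived_digest:
                    bad.append(member.name)
    return bad
```

An empty list means every archived file matched its source at the time of the check.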

 

One other option you might consider is the use of parity files. I last saw these in the Usenet era, so a very long time ago, and I have no idea what their current state is. The concept is that you break your file into blocks and generate parity files from those blocks. The parity files can check that the contents are correct and, if they are not, help recover them. More parity files = more chance of recovering missing or corrupted data. https://en.wikipedia.org/wiki/Parchive
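
To make the block/parity idea concrete, here is a deliberately simplified Python sketch that uses a single XOR parity block. This is not how PAR2 works internally (PAR2 uses Reed-Solomon codes and can repair several blocks); it only shows why extra parity data lets you rebuild a block you have lost. All names and the sample data are made up.

```python
def split_blocks(data, block_size):
    """Split data into equal-size blocks, zero-padding the last one."""
    padded = data + b"\x00" * (-len(data) % block_size)
    return [padded[i:i + block_size] for i in range(0, len(padded), block_size)]


def xor_blocks(blocks):
    """XOR a list of equal-size blocks together byte by byte."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            result[i] ^= byte
    return bytes(result)


def recover_block(blocks, parity, lost_index):
    """Rebuild the block at lost_index from the parity and the survivors."""
    survivors = [b for i, b in enumerate(blocks) if i != lost_index]
    return xor_blocks(survivors + [parity])


data = b"archive contents that we would like to survive a damaged block"
blocks = split_blocks(data, 16)
parity = xor_blocks(blocks)

# Pretend block 2 was lost or failed its checksum.
rebuilt = recover_block(blocks, parity, lost_index=2)
print(rebuilt == blocks[2])  # True
```

With one XOR parity block only one damaged block can be rebuilt, and you need to know which one it is.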



2 minutes ago, porina said:

One other option you might consider is the use of parity files. I last saw these in the Usenet era, so a very long time ago, and I have no idea what their current state is. The concept is that you break your file into blocks and generate parity files from those blocks. The parity files can check that the contents are correct and, if they are not, help recover them. More parity files = more chance of recovering missing or corrupted data. https://en.wikipedia.org/wiki/Parchive

Pretty much this is what I was going to say. I use MultiPar; it's open source, and it apparently supports GPU acceleration now (I've never really used that feature).

 

I like PAR2 files because they can easily be kept separate from the data, and since PAR2 is a standard you don't have to worry too much about it not working with other tools. And as porina said, they can actually recover from some bit flips (as long as there aren't too many).

 

To @asheenlevrai I would say it all depends on what your goal is for transferring and compressing like that.

 

As a question, what type of data are you compressing? If it's something like video, say, I would personally skip compression, as you won't gain much extra compression and you would be wasting CPU resources on it.

 

If it's something like ROM data, then it might make sense to compress (large file sizes, but lots of repeated data), which is why I ask. It's also about fault tolerance: if one bit flips in a movie file, you likely won't notice it (which is why I don't worry about it for my video files).

 

As porina said though, compressed files usually include checksums. You might also consider something like Veeam if this is corporate data, or data that you can't live without. I personally enjoyed using it at my workplace, and you can have it run health checks on the backed-up data.

3735928559 - Beware of the dead beef


17 hours ago, wanderingfool2 said:

As a question, what type of data are you compressing? If it's something like video, say, I would personally skip compression, as you won't gain much extra compression and you would be wasting CPU resources on it.

 


It's corporate cold data that will be archived on tape. I mentioned compression but actually meant just making an archive file (.tar or .zip without compression, to make it faster rather than save space). We need to store these files as archives rather than a collection of files since technical restrictions won't allow us to store smaller files.

 

So if I understood correctly, the highest risk of corrupting data is indeed during the "compression"/archiving step. AFAIK the various programs that make/extract archives (7-Zip and the like) will let me know if the archive is corrupted when I try to extract it, but I doubt I would receive any warning during the archiving step, right?


11 hours ago, asheenlevrai said:

It's corporate cold data that will be archived on tape. I mentioned compression but actually meant just making an archive file (.tar or .zip without compression, to make it faster rather than save space). We need to store these files as archives rather than a collection of files since technical restrictions won't allow us to store smaller files.

 


If it's corporate data being put onto tape, I would highly recommend one of the backup programs that can deal with that (for corporate tape backups, I don't think the licensing cost would be too high to justify, given the ease of use and consistency).

 

Veeam, for example, has the following:

https://helpcenter.veeam.com/docs/backup/vsphere/creating_backup_to_tape_jobs.html?ver=120

If you were also worried, they have ways to verify as well

 

There are also other backup solutions like

https://www.altaro.com/

but it's been like a decade since I used Altaro (but I remember liking it back when I did)

 

*edit

11 hours ago, asheenlevrai said:

So if I understood correctly, the highest risk of corrupting data is indeed during the "compression"/archiving step. AFAIK the various programs that make/extract archives (7-Zip and the like) will let me know if the archive is corrupted when I try to extract it, but I doubt I would receive any warning during the archiving step, right?

To answer this part: the highest risk isn't the compression step itself. That said, if the data is compressed and you lose a single bit, then depending on the compression the entire archive might be lost, whereas without compression a single flipped bit won't have as much of an effect.
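
A quick way to see the difference, sketched in Python with the standard `zlib` module (the sample data and the `flip_bit` helper are made up for the example): flip one bit in raw data and one bit in the compressed version of the same data, then look at what is left.

```python
import zlib


def flip_bit(data, byte_index, bit=0):
    """Return a copy of data with one bit inverted."""
    damaged = bytearray(data)
    damaged[byte_index] ^= 1 << bit
    return bytes(damaged)


original = b"the same sentence repeated many times over. " * 200

# Uncompressed: the damage stays confined to the byte that was hit.
raw_damaged = flip_bit(original, len(original) // 2)
changed = sum(a != b for a, b in zip(original, raw_damaged))
print(changed)  # 1

# Compressed: the same kind of damage can make the whole stream unreadable.
compressed = zlib.compress(original)
try:
    zlib.decompress(flip_bit(compressed, len(compressed) // 2))
    readable = True
except zlib.error:
    readable = False
print(readable)
```

How much of a damaged compressed archive can still be salvaged depends on the format and the tool, so treat this as an illustration of the principle only.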

 

I would say the biggest risk is just the storage medium. Eventually tapes degrade, magnetic bits flip, etc. Some storage media have built-in error correction to guard against this (CDs and DVDs, for example, carry a lot of error-correction data).

 

The biggest thing is to figure out why you want corruption detection. Something like MD5 won't do you much good beyond telling you that one or more bits flipped, so at that stage, if it's important to have the data intact, there's not much you can do. Again, you could use something like PAR2 files, which do provide error correction and so allow you to recover.

 

That's why I would still look into something like Veeam, as you can verify backups and it does support tape drives.



  • 3 weeks later...
On 6/30/2023 at 6:07 PM, wanderingfool2 said:

If it's corporate data being put onto tape, I would highly recommend one of the backup programs that can deal with that (for corporate tape backups, I don't think the licensing cost would be too high to justify, given the ease of use and consistency).

 


Thanks 🙂

 

Unfortunately, we're not free to use what we want. We need to use the service that is provided.

 

Thanks again 🙂
