Posted By: zian | Dec 22nd, 2007 @ 6:18 PM
page 1 of 1
Comments: 21 | Views: 3298
zian
Exploding heads since 1988

Let's say that I give you a USB drive with 10 Word documents. How will you convince me that the files are not corrupted?

Answer: You can't.

What if I told you that the files should not have been modified since January 1, gave you the MD5 hash for each file, and told you that the hash was generated at the end of the day on January 1?

Answer: It would be trivial to check.
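To show how trivial that check is, here is a minimal sketch. It assumes the recorded hashes are available as a dictionary of file name to hex digest; the function names and the manifest layout are illustrative and not from the thread.

```python
import hashlib
from pathlib import Path


def file_digest(path, algorithm="md5", chunk_size=65536):
    """Hash a file in chunks so large files need not fit in memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_against_manifest(folder, manifest, algorithm="md5"):
    """Return {file name: status}; status is 'ok', 'mismatch' or 'missing'.

    `manifest` (hypothetical layout) maps a file name to the hex digest
    that was recorded earlier, e.g. at the end of the day on January 1.
    """
    results = {}
    for name, expected in manifest.items():
        path = Path(folder) / name
        if not path.is_file():
            results[name] = "missing"
        elif file_digest(path, algorithm) == expected.lower():
            results[name] = "ok"
        else:
            results[name] = "mismatch"
    return results
```

Passing `algorithm="sha256"` switches to a stronger hash without changing anything else. A matching digest only says the bytes are the same as when the hash was taken; it says nothing about whether they were good at that time.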

Is there anything in between?

evildictaitor
if( !succeed( try() ) ) { while(true) try(); }
zian wrote:
gave you the MD5 hash for each file


I wouldn't be so sure... (SHA256 might convince me)
If the Word docs include the date in their content then you're OK, but hashes like those used by FTP clients cover only the file contents, not the file properties, so the date of the file is irrelevant...
This, of course, is why simply copying files to another drive is not really a backup and you're much better off with proper incremental backups.
ManipUni
Proving QQ for 5 years!
zian wrote:
Is there anything in between?


What's the point of this exercise? ...

What good is proving that a file is corrupt? If you are successful in proving that it is the case then you're stuck with a useless file which you'll try and use anyway; if it isn't the case then you haven't changed anything.

If you are going to go to this much trouble to prove that something is corrupt, you may as well go to the next logical step and just do a second backup, or a backup with built-in redundancy like RAID-3.

Nothing in this thread will increase security or prove how a file has been intentionally altered.

PS - If I created this USB drive with ten Word files on it I could make it VERY easy to prove if corruption has occurred. I mean you dump a hash + size into three text files. It isn't that hard.

ManipUni
Proving QQ for 5 years!
AndyC wrote:
This, of course, is why simply copying files to another drive is not really a backup and you're much better off with proper incremental backups.


Perhaps, but that is expensive and annoying to set up. Hot swapping two mirrors is cheap, dirty, and easy.

ManipUni wrote:


PS - If I created this USB drive with ten Word files on it I could make it VERY easy to prove if corruption has occurred. I mean you dump a hash + size into three text files. It isn't that hard.


Surely the last thing you want is valid .doc files that you think are corrupt when they are not, because the files containing the hashes are corrupt.
ManipUni
Proving QQ for 5 years!
Rossj wrote:
Surely the last thing you want are valid .doc files that you think are corrupt when they are not, because the files containing the hash are corrupt


All three?

Hell... Have File2 contain a hash/size for File1, and File3 to contain one for File2. Then you can follow the chain up to three levels to find one that you know is valid.

PS - I'm not saying I would do any of this. I really don't understand what the original question is trying to accomplish. I just mirror.



ManipUni wrote:


Perhaps, but that is expensive and annoying to set up. Hot swapping two mirrors is cheap, dirty, and easy.



Even incremental backups to one drive are better than a straight copy, especially if it's a RAID set. Multiple backup sets, kept in separate locations, are obviously much better and vital for really important data.
ManipUni wrote:
Rossj wrote:
Surely the last thing you want are valid .doc files that you think are corrupt when they are not, because the files containing the hash are corrupt


All three?

Hell... Have File2 contain a hash/size for File1, and File3 to contain one for File2.

Then you can follow the chain up to three levels to find one that you know is valid.


And if only File1 is corrupt? Or File2? We're just adding opportunities for things to go wrong here.

I think the original point was validating that a backup was successful and is still valid now. Or tomorrow.
ManipUni
Proving QQ for 5 years!
Rossj wrote:
And if only File 1 is corrupt? Or file 2? We're just adding opportunities for things to go wrong here.


If file1 is corrupt then it won't match the test contained in two. Or if two is corrupt then it won't pass the test in file3.

As long as all three aren't corrupt you're set. Hell; you could just have it stack, so three contains file1 and file2.

e.g.
File1

[Hash List] [Sizes]

File2

[Hash List] [Sizes]

{File1 [Hash] [Size]}

File3

[Hash List] [Sizes]

{File1 [Hash] [Size]
File2 [Hash] [Size]}

Then you can prove if any one single file is corrupt using any of the other two (and yes, you CAN prove if file3 is corrupt even without a hash in either file1 or file2 simply by looking at its data).
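The stacked layout above can be sketched roughly as follows. This is only an illustration under assumptions: the manifests are stored as JSON, SHA-256 is used as the hash, and the names `describe`, `build_manifests` and `cross_check` are made up for this example.

```python
import hashlib
import json


def describe(data: bytes):
    """Return the [hash, size] pair recorded for a blob of bytes."""
    return [hashlib.sha256(data).hexdigest(), len(data)]


def build_manifests(documents, copies=3):
    """Build File1..FileN; each later manifest also describes the earlier ones.

    `documents` maps a document name to its bytes.
    """
    listing = {name: describe(data) for name, data in documents.items()}
    manifests = {}
    for i in range(1, copies + 1):
        body = {
            "files": listing,
            "manifests": {n: describe(blob) for n, blob in manifests.items()},
        }
        manifests[f"File{i}"] = json.dumps(body, sort_keys=True).encode()
    return manifests


def cross_check(manifests):
    """Return (checker, target, matches) for every record one manifest
    holds about another. An unreadable manifest reports itself as bad."""
    results = []
    for checker, blob in manifests.items():
        try:
            records = list(json.loads(blob)["manifests"].items())
        except (ValueError, KeyError, TypeError, AttributeError):
            results.append((checker, checker, False))
            continue
        for target, recorded in records:
            actual = describe(manifests[target]) if target in manifests else None
            results.append((checker, target, actual == recorded))
    return results
```

If File1 is damaged, both File2 and File3 report a mismatch for it while File3 still vouches for File2, so the document listing in File2 can be used. File3 has nobody vouching for it, which in this sketch means it can only be judged by whether it still parses and agrees with the others.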

AndyC wrote:
 Even incremental backups to one drive are better than a straight copy, especially if it's a RAID set. Multiple backup sets, kept in separate locations, are obviously much better and vital for really important data.


I disagree.

If you have two complete backups you know you always have a minimum of two copies. With your way you could end up in situations where there is only one redundant copy on the drive (although you could also end up with an unlimited number of copies too).

So which is better: a two-copy system, or a one-to-unlimited-copy system?

ManipUni wrote:


So which is better: a two-copy system, or a one-to-unlimited-copy system?



Imagine the simple case, where you accidentally overwrite or delete one file. In a copy/mirror scenario (unless you are lucky enough to notice quickly enough) that accidental change is replicated to your backup and your original file is lost - tough luck. With incremental backups, you still have the previous version and can easily restore without issue.

Anything you can do to add resilience to a mirror solution (such as periodically replacing the drive you back up to) can also be done with incrementals, offering much more resilience in the process.
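The accidental-overwrite case can be illustrated with a toy in-memory model. This is a sketch only: the class names are invented, real backup tools work on disk and track far more (deletions, for instance, are not recorded here).

```python
class Mirror:
    """Keeps only the latest copy of each file."""

    def __init__(self):
        self.files = {}

    def backup(self, source):
        # A straight copy: overwrites and deletions propagate to the backup.
        self.files = dict(source)


class IncrementalBackup:
    """Keeps every version that differed from the previous backup run."""

    def __init__(self):
        self.history = {}  # name -> list of (run number, content)
        self.run = 0

    def backup(self, source):
        self.run += 1
        for name, content in source.items():
            versions = self.history.setdefault(name, [])
            if not versions or versions[-1][1] != content:
                versions.append((self.run, content))

    def restore(self, name, as_of_run):
        """Return the newest stored content at or before `as_of_run`, else None."""
        candidates = [c for run, c in self.history.get(name, []) if run <= as_of_run]
        return candidates[-1] if candidates else None
```

After a good backup run followed by a run that captures an accidental overwrite, the mirror holds only the bad version, while `restore(name, 1)` on the incremental set still returns the original.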
You could also hash file segments. That would allow you to narrow in on the location of the corruption. I'd also recommend the PAR2 redundancy/recovery system.
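A minimal sketch of segment hashing, assuming fixed-size segments and SHA-256; the function names and the 4096-byte segment size are arbitrary choices for illustration. It only locates damage and does not repair anything, which is the part a recovery system such as PAR2 adds.

```python
import hashlib


def segment_hashes(data: bytes, segment_size=4096):
    """Return one SHA-256 hex digest per fixed-size segment of `data`."""
    return [
        hashlib.sha256(data[i:i + segment_size]).hexdigest()
        for i in range(0, len(data), segment_size)
    ]


def damaged_ranges(data: bytes, recorded, segment_size=4096):
    """Return (start, end) byte ranges whose hash differs from `recorded`.

    `recorded` is the list produced by segment_hashes() when the file was
    known to be good. Missing or extra segments are reported as damaged.
    """
    current = segment_hashes(data, segment_size)
    bad = []
    for index in range(max(len(current), len(recorded))):
        ours = current[index] if index < len(current) else None
        theirs = recorded[index] if index < len(recorded) else None
        if ours != theirs:
            start = index * segment_size
            bad.append((start, start + segment_size))
    return bad
```

Smaller segments narrow the location further at the cost of a longer hash list.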
RichardRudek
So what do you expect for nothin'... :P
Um, I would just archive the files (7-Zip, WinZip, RAR, etc). I especially do this when sending email attachments. Always.

Note that you do not have to use the compression ability of these archivers, so speed shouldn't be an issue. Assuming, of course, you know how to use the archiver, e.g. archiving and copying in one step, not two.

The archives will minimally have some form of CRC, which is good enough for corruption detection, though you guys seem to be going off on a security/integrity tangent...
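As a sketch of the archive-without-compression idea, Python's standard `zipfile` module can store files uncompressed and later re-check the CRC-32 of every entry. The two wrapper function names are made up for this example.

```python
import zipfile
from pathlib import Path


def archive_without_compression(archive_path, file_paths):
    """Store files uncompressed; ZIP still records a CRC-32 for each entry."""
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_STORED) as zf:
        for path in file_paths:
            # Store under the bare file name rather than the full path.
            zf.write(path, arcname=Path(path).name)


def first_damaged_member(archive_path):
    """Return the name of the first entry that fails its check, or None."""
    with zipfile.ZipFile(archive_path) as zf:
        return zf.testzip()
```

A CRC-32 catches accidental corruption well enough, but it is not a defence against deliberate tampering.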
RichardRudek wrote:
Um, I would just archive the files (7-Zip, WinZip, RAR, etc). I especially do this when sending email attachments. Always.

Second that, and it saves space too.

And some compression formats even allow you to insert recovery information, so if the file is somehow physically corrupted, you have some chance of repairing it correctly. (Although in most cases I've seen, the repair only means removal of the bad parts.)
I learned a hard lesson 5+ years ago when I used WinAce to archive a large collection of data to CDs and lost pretty much everything: after the corruption point it wasn't able to recover the remaining data (90%). It was one solid archive using maximum compression settings. So PAR2 is a nice friend to have. But file size isn't that important anymore, and big files are often already compressed using a domain-specific algorithm.
zian wrote:

This was intended to be a thought experiment, actually. *ducks the incoming bricks from Manip*

I was reading about file backups and it started talking about verifying the integrity of a file. At that point, I ran into the problem that I used to start the entire thread.

Don't worry about it, this sort of stuff is fun, and it makes people think a little more about backups.


GoddersUK
I CAN has cheezburger and you CAN'T has stop me!
zian wrote:


Let's say that I give you a USB drive with 10 Word documents. How will you convince me that the files are not corrupted?

Answer: You can't.

What if I told you that the files should not have been modified since January 1, gave you the MD5 hash for each file, and told you that the hash was generated at the end of the day on January 1?

Answer: It would be trivial to check.

Is there anything in between?



Open it and see what happens?
evildictaitor
if( !succeed( try() ) ) { while(true) try(); }
esoteric wrote:
I learned a hard lesson 5+ years ago when I used WinAce to archive a large collection of data to CDs and lost pretty much everything: after the corruption point it wasn't able to recover the remaining data (90%). It was one solid archive using maximum compression settings. So PAR2 is a nice friend to have. But file size isn't that important anymore, and big files are often already compressed using a domain-specific algorithm.


If you were to do a maximum compression on the data and then overlay each block with an error-correcting code of minimum Hamming distance 5 (or 7), then you end up with a bigger file, but one that can have 2 (or 3) bits corrupted in every block before the corruption breaks the file.

Sadly I don't know of any standard compression libraries with this feature, but I'm sure you could make your own.
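For a feel of what "make your own" could look like: a code whose codewords differ in at least 5 (or 7) bit positions can correct 2 (or 3) flipped bits per block. The sketch below uses the simplest such code, a repetition code, which is far less space-efficient than what a real tool would use and is chosen here only for brevity. It works on lists of 0/1 values and the function names are illustrative.

```python
def encode(bits, repeat=5):
    """Repeat every bit `repeat` times (minimum distance equals `repeat`)."""
    return [copy for bit in bits for copy in [bit] * repeat]


def decode(coded, repeat=5):
    """Recover each bit by majority vote over its block of `repeat` copies.

    With repeat=5 up to 2 flipped bits per block are corrected;
    with repeat=7 up to 3.
    """
    return [
        1 if sum(coded[i:i + repeat]) * 2 > repeat else 0
        for i in range(0, len(coded), repeat)
    ]
```

The price is a fivefold (or sevenfold) increase in size, which is why practical schemes use more efficient codes.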
Tom Servo
W-hat?
I like my mirror that comes with cheap snapshots. Granted, snapshots don't prevent short term data loss due to mistakes (short term as the timespan between snapshot schedules), but beyond that, they do.

As for real backups, I don't even bother. I can't afford a tape drive with tapes even remotely big enough to do the periodic full backup, and you can't live solely on incremental ones.
Microsoft Communities