In this video we're looking at what happens when you want to transfer a file from one computer to another, and it's really important to know that it got there intact, in one piece.
You could send it multiple times and then compare them all, but what generally gets used is something called a hash algorithm.
What is a Hash Algorithm?
A hash algorithm is kind of like the check digit in a bar code or on a credit card. I think James Grime talked about this a long time ago on Numberphile. The last digit in a bar code or on a credit card is determined by all the other digits, and if you change one of those digits, then the last one changes as well.
So, as you type it into a computer, you can know instantly if you've missed a key somewhere. A hash algorithm is kind of like that, but for an entire file that might be megabytes or gigabytes in size.
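To make the check-digit idea concrete, here is a short sketch in Python. The video doesn't name a specific scheme, so this assumes the Luhn algorithm, which is the one actually used on credit card numbers:

```python
def luhn_check_digit(payload: str) -> int:
    """Compute the Luhn check digit for a string of digits (no check digit yet)."""
    total = 0
    # Working from the right, double every second digit of the payload
    # (those digits sit in the "doubled" positions of the full number).
    for i, ch in enumerate(reversed(payload)):
        d = int(ch)
        if i % 2 == 0:
            d *= 2
            if d > 9:
                d -= 9  # same as adding the two digits of the product
        total += d
    return (10 - total % 10) % 10

print(luhn_check_digit("7992739871"))  # 3, so the full number is 79927398713
```

Change any single digit of the payload and the check digit changes too, which is exactly how a mistyped digit gets caught.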
What it gives you is a code, 12 or 32 or 64 characters long, generally in hexadecimal (basically just one long number expressed that way), that is a 'sum up' of everything that is in that file.
You take the file, do all these manipulations to it, and crush it down, crush it down, crush it down, and what comes out is a thing that says: this is a summary of that file. You can never work it backwards; you can't pull that data back out. But it's like a signature.
The simplest hash algorithm I can think of would just be something like: add up all the digits in the file, keeping a running total as you go: 4, 9, 14, 23, and so on. That's not a good hash algorithm, for a few reasons.
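That "add up all the digits" idea fits in a few lines of Python. This toy function is purely for illustration and is not one of the real algorithms discussed here:

```python
def digit_sum_hash(text: str) -> int:
    # Toy hash: just add up every digit that appears in the input.
    return sum(int(ch) for ch in text if ch.isdigit())

print(digit_sum_hash("4559"))  # 4 + 5 + 5 + 9 = 23
```

One obvious weakness: reordering the digits doesn't change the sum, so "4559" and "9554" collide immediately.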
A hash algorithm has three main requirements. The first one is speed: it's got to be reasonably fast, able to churn through a big file in a second or two at most. But it also should not be too quick; if it is too quick, it's easy to break.
The second requirement is that if you change one byte, one bit, anywhere in the file, at the start or the middle or the end, then the whole hash should be completely different. This is something called the avalanche effect. If you're interested in how this is achieved, do look up the actual algorithms themselves.
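You can see the avalanche effect directly with Python's standard `hashlib`. The two inputs below differ by a single bit ('x' is 0x78, 'y' is 0x79), yet a large share of the output bits flip:

```python
import hashlib

a = hashlib.sha256(b"The quick brown fox").digest()
b = hashlib.sha256(b"The quick brown foy").digest()  # one bit different from above

# Count how many of the 256 output bits differ between the two hashes.
flipped = sum(bin(x ^ y).count("1") for x, y in zip(a, b))
print(flipped)  # typically close to half of the 256 bits
```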
The third requirement is that you've got to be able to avoid what are called hash collisions. This is where you have two documents which have the same hash. Obviously, there is a mathematical principle called the pigeonhole principle: there are an incredible number of possible documents out there, while the hash is just one fairly long number, so there will be files out there which naturally have the same hash. And that's fine, because the odds of hitting one are so small that we can say it's never going to happen naturally.
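You can watch a natural collision happen by shrinking the hash. This sketch truncates SHA-256 to 16 bits (my choice, purely to make the demo fast), so by the birthday paradox a collision turns up after only a few hundred inputs:

```python
import hashlib
from itertools import count

def tiny_hash(data: bytes) -> str:
    # Keep only the first 4 hex characters (16 bits) of SHA-256.
    return hashlib.sha256(data).hexdigest()[:4]

def find_collision():
    seen = {}
    for i in count():
        h = tiny_hash(str(i).encode())
        if h in seen:
            return seen[h], i, h  # two different inputs, same tiny hash
        seen[h] = i

a, b, h = find_collision()
print(f"{a} and {b} both hash to {h}")
```

With the full 256-bit output you would expect to need on the order of 2**128 inputs before a collision, which is why it's safe to treat natural collisions as never happening.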
But if you can artificially create a hash collision, if you can create a file and change it so the hashes match, then we have a problem, and that's where security comes into this. Because if I can make a file that sums to a certain hash, then I can fake documents; I can send different things and have the signature match.
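With a weak hash like the toy digit-sum from earlier, forging a document with a matching hash is trivial. The two messages here are made up for illustration:

```python
def digit_sum_hash(text: str) -> int:
    # Same toy hash as before: add up every digit in the input.
    return sum(int(ch) for ch in text if ch.isdigit())

original = "Pay Alice $100"
forged = "Pay Alice $001"  # hypothetical tampered message

# The digits are merely rearranged, so the toy "signature" still matches.
print(digit_sum_hash(original) == digit_sum_hash(forged))  # True
```

Real hash functions like SHA-256 are designed precisely so that constructing a second input with the same hash is computationally infeasible.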
If the hash is too fast, and you can compute new ones quickly enough, then you can fairly easily create documents that match a particular hash. It is, in a very real sense, an arms race. As I said, for many years MD5 was the accepted algorithm, and it's still used for a few things, but MD5 is now thoroughly broken, because computers are fast enough and there are a few sort of interesting tricks you can use to create hash collisions deliberately.
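In practice this means checking files with a modern hash. Python's `hashlib` exposes both the broken MD5 and the currently recommended SHA-256 (the file contents here are just an example):

```python
import hashlib

data = b"contents of the file being transferred"

# 32 hex characters; still fine as an accidental-corruption checksum,
# but broken against a deliberate attacker.
md5_hex = hashlib.md5(data).hexdigest()

# 64 hex characters; still considered collision-resistant.
sha256_hex = hashlib.sha256(data).hexdigest()

print(md5_hex)
print(sha256_hex)
```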
The other problem with MD5 is that it was used so much, everywhere on the web.