Several companies have followed EMC’s lead and are delivering content-addressable storage systems. These produce unique file references based on a file’s contents, hence content-addressable. These references are built using a mathematical method. In the case of EMC’s Centra a 27-character key is constructed for each file. The method used is termed a hash algorithm. What is it and why are CAS systems appearing?

CAS address the storage of fixed-content data; mortgage records, insurance policies, health records. Such data may be stored for a long time, many years, but be accessed rarely. Nevertheless it has to be kept. Other such data consists of e-mails, now having to be retained for regulatory reasons. There can be millions of such files and two problems exist; one is regulatory in nature and the other is simply finding the file of interest.

The regulatory problem is simply knowing that the file you are looking at is the file originally stored and not some corrupt copy. The number problem is the time taken to find one mortgage record out of 5 million. It’s easier to find one record out of a list of ten records than one in a list of ten million.

A way is needed to subdivide the list of ten million into many smaller lists or pots. The method used is called ‘hashing’ and involves the construction of a key or hash that represents some aspect of the file: its name; its date; its contents; a combination. It doesn’t matter so long as the hash uniquely represents the file and cuts down the search space for finding it.

The idea of hashing started in the earliest days of computing. The first true electronic computers began to run in 1949 and 1950. A proposal for hash search was described by Hans Peter Luhn in an IBM technical memorandum in 1953. What he wanted was a function that would deliberately abuse keys producing practically the equivalent of the mathematical concept of uniformly distributed random variables.[bashe86] The uniform distribution of hashes would cut down search time.

Searching can be made more efficient if one can quickly narrow one's search by locating a subset of items to search. This is a general search problem, one not specific to fixed-content data storage. Each item is characterized by its hash or key, which uniquely identifies it; such as an employee-number. Hashing creates a subset of items to search by computing from the key a characteristic value, called the hash, and then groups items with the same or similar hash; the group is called a bin. It may be termed a hash table.

There are several methods for grouping items; the most common makes a linked list of the items with the same hash. To retrieve an item the program computes the hash from the key that you want to search for, and it then searches for the item in the bin associated with the hash.

If item keys are the names of people and one uses the first character of the last name as the hash, then a hash search will work well for last names starting with X, because not many last names start with X. But the search works poorly for finding Mr. John Smith, because there are many names that begin with S.

A clean hash function, when given a series of keys, gives back a reasonable approximation of a uniform random distribution of hashes. (This is the problem Luhn, the IBM researcher, was focussed on.) This means that all bins will probably be about the same size, so that, on the average, searching will take about the same amount of time to find a record, any record, in any bin.

Hash searching is the most powerful method of retrieving data known. It works well even with rather poor hash functions, that is those that do not distribute the input cleanly. It is so powerful that most automated retrieval mechanisms today are based on hashing.

Spell checker example

The method of creating a hash from an element of a file or record is called a hash function or algorithm. A general hash function rule is simply that all of the bits in the key must be used.

From the design of spelling checkers, it is known that the erroneous form of a word almost always differs from the intended one by a single change of one of four kinds: insertion, deletion or alteration of a single letter, or transposition of two letters. A spell checker could be built by hashing the word to be checked, and comparing the hash to a set of valid word hashes. We must ensure that each letter in the word contributes to the hash value. This might be done, for example, by adding up their ASCII or Unicode values.

It’s not enough though. Because using this method ‘beta’ and ‘beat’ would have the same value. The uniqueness requirement is not met. A common solution to this problem is to multiply each ASCII code by its position; this trick is used, for example, to calculate check digits in ISBNs, bank account numbers, etc.

CD name example

For CD's, what software music players often like to do is have a world-wide CD database. When users put their disk in the CD player, they get a full table of contents on their own computer. These tables are not on the disks themselves. No information about the songs is stored on the CD. It must be downloaded from the database.

The problem is that CD's have no ID numbers stored on them. So how can the computer know which CD is in the player?

It can use the only information on the CD, the track length, and every CD is different. So a hash, a large number is constructed from the track lengths, a "signature" that identifies the CD in the drive. For example, a number of length of 8 or 10 hexadecimal digits is so constructed. It is then sent to the data base, and that data base just looks for the closest match. The reason being, that track length may not be measured exactly.

Encryption Hash algorithms are fundamental to many cryptographic applications. SHA-1 and MD5 are two of the most widely known, trusted and used cryptographic hash function specifications. The vital aspect of hashing here is the unique identifier and the quality that, without knowing how a hash was constructed, you can’t generate the original data.

The key used, for example, in public-key encryption, is based on a hash value. Here is an example of how one might be created:

- Input number: 10,667 - Hashing function: multiple input number by 143 - Hash value: 1,525,381

It is tremendously difficult to compute that the value 1,525,381 came from the multiplication of 10,667 and 143. If you knew that the multiplier was 143 it would be very easy. But, in the encryption situation, only the sender and receiver know this.

Public key encryption generally use complex algorithms and very large hash values for encrypting, up to 128-bit numbers which have a possible 3,402,823,669,209,384,634,633,746,074,300,000,000,000,000,000,000,000,000,000,000,000,000 different combinations! This would be like trying to find a single specific sand grain in the Sahara Desert. Computationally it is impractical to decrypt this.

Back to storage Hashing functions can be used in the fixed content storage case to uniquely identify files or records based on their content. This means that only one copy of the file need be stored with pointers to whomsoever has filed it. Thus multiply distributed e-mail attachments can be stored as one copy, saving on storage space.

Using hashing functions also speeds up search time for located fixed content files in stores like Centera.

And, by using the hashing function's uniqueness property and the ability to reconstruct the original data if you know the hashing algorithm, then data integrity can be reasonably ensured even though you store the data on inexpensive and less reliable media such as SATA and ATA drive arrays.