Thursday, January 31, 2008

Understanding RAID Systems: RAID 0, 1, 5, 0+1 Arrays

RAID implemented on motherboards is becoming more common. The SATA standard includes many of the features of the SCSI command set used in most high-end RAID systems, which has allowed simplified RAID to be built into low-priced motherboards selling for under $200. RAID 0, RAID 1, and RAID 5 are all supported on the motherboard in my workstation, which I bought for $98. Given how ubiquitous RAID systems are today, I thought it would be interesting to explain how they work, highlighting their advantages and disadvantages.

RAID started as a search for ways to build large storage systems from small, inexpensive drives. There were two goals: first, to increase the storage available, and second, to increase the speed of response by taking advantage of the asynchronous nature of separate drives. I don't want to go too deep into the history in this post, but I do want to point out that the first attempt, an array of drives now called RAID 0, wasn't really redundant at all. This is a trap for many users. RAID 0 is dangerous: it creates a single volume from all the drives, so if any one drive fails, the whole volume is lost. For that reason it was quickly set aside as too risky. Even so, RAID 0 is much more common today than it should be, because users believe the increase in speed is worth the increased risk of failure. We see many failed RAID 0 arrays.

RAID is an acronym for Redundant Array of Independent Drives. While there are many different configurations, most are combinations of RAID 0, RAID 1, and RAID 5. Most RAID configurations use a technique called "striping," which writes data in contiguous blocks in drive order: stripe 0 is written to drive 0, stripe 1 to drive 1, and so on. When the end of the array is reached, the write starts over at drive 0. For example, in a simple array with 3 drives, stripe 3 (the 4th stripe) would be written to drive 0, stripe 4 to drive 1, and so on. (Note: the actual placement of the data varies according to the type of RAID configuration implemented.)
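
To make that round-robin order concrete, here is a minimal Python sketch of how a striped array maps stripe numbers to drives (the drive count and stripe numbers are just illustrative):

NUM_DRIVES = 3   # illustrative drive count

def drive_for_stripe(stripe_number):
    # Stripes are written in drive order and wrap around:
    # stripe 0 -> drive 0, stripe 1 -> drive 1, stripe 2 -> drive 2,
    # then stripe 3 -> drive 0 again, and so on.
    return stripe_number % NUM_DRIVES

for stripe in range(6):
    print("stripe", stripe, "-> drive", drive_for_stripe(stripe))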

RAID 1 is called mirroring. In this configuration there are two drives that are written to at the same time, so they are exact copies of each other. The first drive in the set is the one that is read, but all write operations are performed on both drives. If for any reason the first drive fails to read, the second drive is read instead and an error condition is asserted on the array, which drops the failed drive. All subsequent write operations are performed only on the good drive. RAID 1 is used where storage space and speed are much less important than redundancy. We often see RAID 1 on the system drives in servers, where the first two drives of a big array are set up as RAID 1 (the boot volume) and the rest as RAID 5.
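
As a rough sketch of that read/write logic (the Drive and MirroredPair classes here are hypothetical toys, not a real controller API):

class Drive:
    def __init__(self):
        self.blocks = {}      # block number -> data
        self.failed = False

    def read(self, block):
        if self.failed:
            raise IOError("drive failure")
        return self.blocks[block]

class MirroredPair:
    # Toy RAID 1: every write goes to both drives; reads come from
    # the first healthy drive and fall back to the mirror on error.
    def __init__(self):
        self.drives = [Drive(), Drive()]

    def write(self, block, data):
        # All write operations are performed on both (healthy) drives.
        for drive in self.drives:
            if not drive.failed:
                drive.blocks[block] = data

    def read(self, block):
        # Read from the first drive; on failure, drop it from the
        # array and read the mirror instead.
        for drive in self.drives:
            try:
                return drive.read(block)
            except IOError:
                drive.failed = True
        raise IOError("both mirror members have failed")

pair = MirroredPair()
pair.write(0, b"boot sector")
pair.drives[0].failed = True     # simulate a failed first drive
print(pair.read(0))              # still served, from the mirror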

The second attempt to create a large storage system combined features of RAID 0 and RAID 1. It is called, appropriately, RAID 0+1. In this configuration striping and mirroring are combined so that each drive in the striped set has a mirror, eliminating the risk of cataclysmic failure when one drive dies. An even number of drives is required, since each member of the striped set has a mirror drive. This is very wasteful of storage capacity, but it does deliver both speed and redundancy. As in RAID 1, only the first set is read, but both sets are written to at the same time. There was still felt to be much room for improvement, which led to several more attempts to design a better system.
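
A quick calculation shows just how wasteful: half the raw space goes to mirrors (the drive size and count below are only example numbers):

drive_size_gb = 36
num_drives = 8                               # must be even: 4 striped + 4 mirrors

raw_capacity = num_drives * drive_size_gb    # 288 GB of raw space
usable_capacity = raw_capacity // 2          # 144 GB of unique data

print(usable_capacity, "GB usable out of", raw_capacity, "GB raw")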

Understanding RAID 5 and Striping with Parity

After testing several different configurations and methods, a configuration was found that protects against single-drive failure while providing significant increases in both speed and capacity at very low cost. This was the birth of RAID 5. The secret to RAID 5 is actually quite simple: it uses a technique called "parity." Parity looks at each bit stored on each of the drives and arranges them in columns and rows. To visualize this yourself, think of each drive as occupying a column and each bit as occupying a row. (A bit is the smallest amount of data stored on digital media, representing a binary number, either 0 or 1.) Parity is simply the binary sum of the bits in a row, keeping only the ones column and discarding any carry. Below is an example:


Parity Example:

Drive:   0   1   2   3   Parity
Bit 0:   0   1   0   1     0
Bit 1:   1   1   1   0     1
Bit 2:   0   0   1   1     0

Table 1

Note in the example above that where there is an even number of 1's the parity is 0, and where there is an odd number of 1's the parity is 1. Thus, in computer jargon, we say the parity is even when the bits add up to 0 and odd when they add up to 1. This rule holds true for any number of drives or bits. In case it wasn't clear, and to refresh our memory from prior posts: all data is stored on drives as little areas of magnetic polarity which, depending on their orientation, represent a binary 0 or 1. These are grouped into bytes (8 bits) and sectors (512 bytes) for ease of control and integrity testing. Each byte can be thought of in our table above as bits 1-8, and each sector as 512 collections of those 8 bits. On RAID systems, sectors are collected into "stripes," usually a power of 2 in size, such as 128 sectors per stripe (the most common size).
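
The "keep the ones column, drop the carry" rule is exactly the exclusive-OR (XOR) operation, which is why parity is so cheap to compute. Here is a small Python sketch that reproduces the parity column of Table 1 (the bit values are copied straight from the table):

# One list per drive, one entry per bit row, matching Table 1.
drives = [
    [0, 1, 0],   # drive 0: bit 0, bit 1, bit 2
    [1, 1, 0],   # drive 1
    [0, 1, 1],   # drive 2
    [1, 0, 1],   # drive 3
]

for bit in range(3):
    parity = 0
    for drive in drives:
        parity ^= drive[bit]     # XOR = binary sum without the carry
    print("bit", bit, "parity =", parity)   # prints 0, 1, 0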

I probably digressed into a little too much detail. To return to RAID 5: several drives are grouped together so that for every set of stripes written across the drives, one stripe holds parity, and for each bit on each of the data drives there is a corresponding parity bit on that stripe. This means that if there are n drives, the real data capacity equals (n - 1) times the capacity of each drive. So, if there are seven 36 GB drives in a RAID 5 array, you multiply the capacity (36 GB) by (7 - 1) = 6, giving 216 GB as the size of the RAID volume. As a side note, the parity stripe is actually spread out over all the drives; it turned out to be much slower to keep parity on a single designated parity drive.
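
Here is a brief sketch of both points: the (n - 1) capacity formula and one way the parity stripe can rotate across the drives (the rotation shown is a common left-symmetric layout, but actual controllers vary):

num_drives = 7
drive_size_gb = 36

# One drive's worth of space holds parity, so usable = (n - 1) * size.
usable_gb = (num_drives - 1) * drive_size_gb
print(usable_gb, "GB usable")                # 216 GB

# Distributed parity: the parity stripe moves to a different drive
# on each stripe row instead of living on one dedicated drive.
for row in range(num_drives):
    parity_drive = (num_drives - 1 - row) % num_drives
    print("stripe row", row, "-> parity on drive", parity_drive)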

So the big question is, "How does it continue to work when one drive fails?" It turns out to be a simple mathematical problem that computers can perform extremely quickly. The parity check and result are easily computed in memory within the time it takes to assemble the data packet for use by the system. By adding the remaining bits back together along with the parity bit, you reproduce the missing bit. This is the whole crux of the redundancy. To return to our example above...

Parity Example with Recovery Bit:

Drive:   0   1   2   3   Parity   Recovered Bit
Bit 0:   0   1   X   1     0            0
Bit 1:   1   1   X   0     1            1
Bit 2:   0   0   X   1     0            1

Table 2


If you compare the recovered bits to the missing drive 2 column in Table 1, you will see that they match. As a mental exercise, blank out any other column in Table 1 and work out the results.
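
The recovery itself is the same XOR trick. This sketch rebuilds the missing column of Table 2 from the surviving drives and the parity column (values copied from the tables):

# Bits from the surviving drives, one list per drive (drive 2 failed).
survivors = [
    [0, 1, 0],   # drive 0
    [1, 1, 0],   # drive 1
    [1, 0, 1],   # drive 3
]
parity = [0, 1, 0]

for bit in range(3):
    recovered = parity[bit]
    for drive in survivors:
        recovered ^= drive[bit]    # XOR survivors with parity
    print("bit", bit, "recovered =", recovered)   # prints 0, 1, 1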

The same mechanism lets a single parity stripe on a 14-drive array reproduce the missing data from a failed drive quickly and accurately.

We will continue with our discussion of RAID arrays in the next post.

Peter

Friday, January 11, 2008

Part 3: How Disks Fail

Now, we have to imagine how all the parts work together. The platters spin at reasonably high rates, anywhere from 3400 rpm (old drives) to 15000 rpm. The heads move at high speeds as well. The controller card maintains the speed of the platters and manages the movements of the heads. All the rest of the components are there to refine the signals and manage the communications.

The first and most obvious failure is the head crash. A head crash occurs when a head touches the platter and damages the media below. If you recall from the prior post, the media has several layers; the top two are a hardening layer and a lubricating layer. Once the head has abraded through those two layers, the data layer is easy to damage. Those layers exist for exactly that reason: to protect the magnetic layer. Heads can touch the media without damaging it. However, the heads are made out of the same materials that ICs are made from, glass and silicon. Glass is one of the hardest substances and can easily scratch most other materials; some sandpaper is made from glass. Considering how fast those heads are moving in relation to the platters, it doesn't take much to scratch the media. A strange consequence of a head crash is "stiction." Stiction is just what it sounds like: the head sticks to the platter. It occurs due to several factors, including magnetic attraction, smoothness, electrostatic attraction, and the stickiness of silicon.

The second common type of failure is electronic. Electronic failure is damage to any of the electrical components: the control circuits, the read and write circuits, and the communication circuits. These circuits can fail as a result of heat, cold solder joints, manufacturing defects, and external causes (surges, physical forces, etc.). It is interesting to note that an electronic failure can easily mimic any of the other failures, even a head crash.

The last type of failure I will talk about is firmware failure. Each hard disk has an EPROM that holds information and software managing the functions of the drive, and some manufacturers also put part of the drive's information and programming on the platters. Collectively this is called the "firmware." We separate this type of failure from electronic failure because it is addressed differently when we perform data recovery.

Here is how we describe the failures that can occur:

Head Crash with:

damaged heads
media damage
stiction

Electronic Failure with:

arc damage to the media
damage to the firmware

In the next post I will talk about software failures.


Peter