Monday, February 25, 2008

RAID Striping Algorithms: The Missing Link II

In the prior post I explained how physical stripes including parity are laid out in a RAID 5 configuration. I introduced data mapping as a separate layer from striping.

To continue... while the physical striping is ordered and easy to understand, the data mapping layer, or logical stripes, isn't always laid out in a simple contiguous pattern. When I first started writing my unstriping tools, this confounded me no end.


To understand the differences we need to return to my simplified notation for the 3 points of view of the stripes. We can use the model at the right to help. The first stripe set represented has reverse parity, with the data laid out contiguously from Disk 0 to Disk 2 and the parity stripe on Disk 3. A1 corresponds to S1D0 and PV1 as well as LV1. The second layer (the logical mapping) is well ordered here, simply eliminating the parity stripe from the logical volume.

So, you ask, "Where is all the complication?" The difficulty comes when you reach the parity stripe. In a simple schema you would expect the next logical stripe (LV4) to be B1 (S2D0: sectors 128 - 255), which would also correspond to PV5. However, this isn't always what happens. A common striping algorithm puts the next stripe (LV4) on the second stripe of the last drive (S2D3), as if it jumps up to the next level. Then, the second stripe on the first drive (S2D0) becomes LV5 and the second stripe on the second drive (S2D1) becomes LV6. At first glance it seems like an unnecessary complication.
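
To make the difference concrete, here is a minimal Python sketch (my own illustration, not any controller's actual firmware logic) that maps a logical stripe number to a (stripe, disk) location on a hypothetical 4-drive reverse parity array under both the simple scheme and the common scheme described above. Stripe rows and disks are counted from 0 here, so row 0 is S1 and row 1 is S2.

    def parity_disk(row, n_disks):
        # Reverse parity: row 0's parity is on the last disk, then it walks backward.
        return (n_disks - 1) - (row % n_disks)

    def map_simple(lv, n_disks):
        # Simple scheme: each row's data fills Disk 0 upward, skipping the parity disk.
        row, slot = divmod(lv, n_disks - 1)
        p = parity_disk(row, n_disks)
        disk = slot if slot < p else slot + 1      # hop over the parity stripe
        return row, disk

    def map_common(lv, n_disks):
        # Common scheme: each row's data starts on the disk right after the parity disk.
        row, slot = divmod(lv, n_disks - 1)
        p = parity_disk(row, n_disks)
        return row, (p + 1 + slot) % n_disks       # wrap around past the last disk

    # LV1..LV6 on 4 drives (lv is zero-based here):
    for lv in range(6):
        print("LV%d" % (lv + 1), map_simple(lv, 4), map_common(lv, 4))

Run against the example above, LV4 lands on S2D0 under the simple scheme and on S2D3 under the common one, which is exactly the jump that confused me.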

In the Compaq schema from my previous post it was even more complex. Compaq created an array using a factor of 16:1 for parity stripes to data stripes. For each 16 data stripes on the data drives, one drive was designated as the parity drive. Then, it would switch in reverse parity fashion to the next drive for the next 16 data stripes. To make things even more complicated, they still rotated the data stripes on the other drives. It wasn't just a simple big stripe of 2048 (16 * 128) sectors, which was my first attempt.
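
As a rough sketch of just the parity placement in that scheme (the data rotation is the part that still had to be worked out by hand, so I am leaving it out), the parity drive can be modeled as changing only every 16 stripe sets instead of every one. The 4-drive count below is only an example parameter; the group size of 16 is the factor described above.

    GROUP = 16   # stripe sets covered before the parity drive rotates (the 16:1 factor)

    def delayed_parity_disk(row, n_disks):
        # Parity stays on one drive for GROUP consecutive stripe sets, then moves
        # to the previous drive in reverse parity fashion.
        return (n_disks - 1) - ((row // GROUP) % n_disks)

    # Rows 0-15 keep parity on the last drive, rows 16-31 on the next drive down, etc.
    print([delayed_parity_disk(r, 4) for r in (0, 15, 16, 31, 32)])   # [3, 3, 2, 2, 1]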

There is no substitute for spending the time and doing a good analysis of the stripes. After making several unstripe attempts on that Compaq recovery, adjusting parameters and testing the outcome, I came close to recovering the data. I would get some files back but not others, and on inspection of the rebuilt volume there were "holes" in the file system. The array contained 14 drives and each needed to be inspected. The failed drives had both been successfully recovered, so there was no excuse for missing data. To check continuity, I look for easy-to-read files, usually text files. Long MS Word documents also contain sufficient text data. I want to find files that cross stripe boundaries. When an easily readable file crosses a stripe boundary, I can connect the stripes. This helps me create a map of the array laying out all the stripes in both logical order and physical order. Using the map I am able to construct the algorithm for unstriping.
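
A toy version of that boundary check in Python might look like the sketch below (not my actual tool). The image file names, the 512-byte sector, and the 128-sector stripe are assumptions for the example; the point is only to pull the bytes on either side of a suspected stripe boundary so a person can judge whether the text reads straight through.

    SECTOR = 512      # bytes per sector
    STRIPE = 128      # sectors per stripe (the common size)

    def stripe_tail(image_path, stripe_index, window=80):
        # Last `window` bytes of a physical stripe in a raw drive image (stripe_index is zero-based).
        end = (stripe_index + 1) * STRIPE * SECTOR
        with open(image_path, "rb") as f:
            f.seek(end - window)
            return f.read(window)

    def stripe_head(image_path, stripe_index, window=80):
        # First `window` bytes of a physical stripe in a raw drive image.
        with open(image_path, "rb") as f:
            f.seek(stripe_index * STRIPE * SECTOR)
            return f.read(window)

    # Hypothetical check: does the text at the end of S2D0 continue at the start of S2D1?
    # print(stripe_tail("disk0.img", 1) + b" | " + stripe_head("disk1.img", 1))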

It would be too long to go into further detail. Unstriping tools are a book unto themselves and go beyond the scope of this blog.

Next I will take a look at some of the NAS devices we have recovered and their issues.

Tuesday, February 12, 2008

RAID Striping Algorithms: The Missing Link

In the previous post we learned about RAID configurations and how parity works. The last piece of the puzzle is striping algorithms. Striping algorithms, or schemes, refer to how stripes are mapped on the disk array. Being a hands-on person, I like to drill down to the bit level to see how things work. This has helped me solve many problems with RAID systems. Striping schemes are complex layouts involving 2 layers: the physical stripes with parity and the logical striping where the data is mapped.

Our first step in working on a RAID recovery is to review each disk's raw data to determine stripe size and orientation.

Stripe size was mentioned previously but, to review, stripes are a collection of sectors. Each stripe contains a number of sectors equal to a power of 2. The smallest stripe size we have seen is 16 sectors (2^4). The most common size is 128 sectors (2^7). In my real-life example below, the parity stripes turned out to be 2048 sectors (2^11).

Stripe mapping can be very confusing. In an effort to lessen the confusion, I will talk about 3 different views of the same stripe, the disk view, the physical volume view and the logical volume view. The physical volume and logical volume correspond to the 2 layers mentioned above.

The physical stripes reside on the disks in order. I will use a simplified notation scheme based on stripe number and disk number, for example: S1D0 = first stripe on disk 0 (128 sectors from 0 - 127), S2D0 = next stripe on disk 0 (128 sectors from 128 - 255), etc. For the physical RAID volume representation I will use the letters 'PV'. So, the first stripe will be PV1, the second will be PV2, and so on. The last view of the data is the logical view. This is what the operating system sees. I will refer to the stripes of the logical volume as LV1, LV2, etc. For clarity, I will use the term stripe set to refer to a parity stripe and its contiguous stripes from each drive (S1D0, S1D1, S1D2, S1D3, etc).

In RAID 0 and RAID 0+1, stripes are mapped contiguously in order of the drives. The mapping for these arrays is simple. In the RAID 0 example at the right, A1, A2, etc. are the same as my volume representation (PV) above; just substitute PV for A. A1 is the first stripe and resides on Disk 0, A2 is the second stripe and resides on the first stripe of Disk 1, A3 (the second stripe on Disk 0) is the third stripe, etc. There are no parity stripes to worry about.
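
For what it is worth, the whole RAID 0 mapping fits in a couple of lines of Python. This is only my restatement of the arithmetic, using the two-drive layout in the image; stripe and disk numbers match my notation (stripes counted from 1, disks from 0).

    def raid0_location(pv, n_disks):
        # Map a volume stripe number (PV1, PV2, ...) to (stripe, disk); no parity involved.
        i = pv - 1
        return (i // n_disks) + 1, i % n_disks

    # Two-drive example: A1 -> S1D0, A2 -> S1D1, A3 -> S2D0, A4 -> S2D1
    print([raid0_location(pv, 2) for pv in (1, 2, 3, 4)])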

The most interesting stripe map is found in RAID 5 arrays. RAID 5 was originally defined in the paper "A Case for Redundant Arrays of Inexpensive Disks (RAID)," presented in June 1988 at the SIGMOD conference. The new idea in this paper was for one stripe out of each stripe set to be used for parity in a group of striped drives, while distributing the parity stripes over all the drives to minimize the impact on performance. This concept of data storage and redundancy revolutionized the way people looked at data storage. The concept of distributed data with redundancy has moved out of simple disk management and is used as the basis for distributed data in global networks. Newer schemes for redundancy in large arrays, such as RAID 6, designate 2 stripes for holding parity to address the vulnerability of a RAID array to data loss during the rebuild of a failed drive. We will focus on RAID 5 for the purposes of this post.

RAID 5 sets aside one physical stripe for parity for each stripe set. The parity stripe alternates from drive to drive. Where the parity stripe resides is determined by one of 2 standard schemes: reverse parity and forward parity. The image at the left shows an example of reverse parity. In reverse parity, the first parity stripe is on the last drive (Disk 3) and is designated A(p) in the image. In my notation this would be represented as S1D3, the first stripe of Disk 3. It is the 4th stripe of the RAID volume (PV4) and isn't included in the logical volume at all since it is a parity stripe. In reverse parity the first stripe of the first drive (S1D0) is also the first stripe of the RAID Volume (PV1) and the first stripe of the Logical Volume (LV1). This scheme makes it conceptually easy to picture the logical volume, since this information sits right where it would be on a single drive.

To continue, the second parity stripe is on Disk 2 and is designated B(p) in the example (in my notation it would be S2D2). The mapping of data stripes is simply in order from Disk 0 to Disk 3, skipping the parity stripe. I will discuss different data mapping schemes in a future post. In this scheme, the first drive holds the first stripe of the volume (LV1). It is conceptually easy because the first drive looks like it really is the first part of the volume: it holds the master boot record (MBR) and often the BIOS parameter block (BPB). The second stripe (LV2) is on the second drive (S1D1), etc. Note that the number of stripes in the logical volume is less than in the physical volume; it is simply (n - 1) times the number of stripes per drive, where n is the total number of drives in the array.

In the forward parity scheme, the first parity stripe resides on the first stripe of the first drive (S1D0) in the array. The second parity stripe resides on the second stripe of the second drive (S2D1). The third parity stripe resides on the third stripe of the third drive (S3D2). This continues until the last drive in the array; then the scheme starts over on the first drive. The first data stripe of the volume is the first stripe on the second hard disk (A2 in the image, S1D1 in my notation). It contains the MBR and, most likely, the BPB, depending on the size of the stripe.
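
Both placement rules reduce to one line of arithmetic each. The sketch below is only my restatement of them, with stripe sets and disks counted from 0 and a 4-drive array used as the example.

    def reverse_parity_disk(row, n_disks):
        # Reverse parity: stripe set 0's parity is on the last disk, then it walks backward.
        return (n_disks - 1) - (row % n_disks)

    def forward_parity_disk(row, n_disks):
        # Forward parity: stripe set 0's parity is on the first disk, then it walks forward.
        return row % n_disks

    # First 5 stripe sets on a 4-drive array:
    print([reverse_parity_disk(r, 4) for r in range(5)])   # [3, 2, 1, 0, 3]
    print([forward_parity_disk(r, 4) for r in range(5)])   # [0, 1, 2, 3, 0]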

I use the term logical volume to describe the actual look of the volume to the operating system. The RAID array is usually completely hidden from the operating system and is seen by the OS as a regular disk volume, like a single hard drive. I wanted to return to this representation to show how the size of the resulting volume is calculated. The formula above for determining the number of stripes works equally well for determining the size. The size of the RAID's logical volume is equal to the number of drives minus 1, times the size of the smallest drive, or in mathematical terms:

LV = (N - 1) * S

Where LV is the size of the logical volume, N is the number of hard drives and S is the size of the smallest drive.
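
A quick worked example, with the drive size picked only for illustration: four 500 GB drives in a RAID 5 set yield a 1,500 GB logical volume, and the remaining 500 GB worth of stripes holds the parity.

    def logical_volume_size(n_drives, smallest_drive):
        # LV = (N - 1) * S
        return (n_drives - 1) * smallest_drive

    print(logical_volume_size(4, 500))   # 1500 (GB)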

There are many other parity striping schemes, including proprietary schemes. Compaq created one that had me stumped for a while. Compaq, when they took over HP's Surestore division, created a parity mapping scheme that spanned several data stripes. Instead of having one parity stripe for each stripe set, they used 16 contiguous stripes on a drive for parity. The data stripes remained 128 sectors. Thus, within each group of 16 stripe sets it looks like RAID 4.

Using a standard algorithm to unstripe this set yielded a strange result. It resembled a reverse parity misconfiguration when unstriped with a forward parity algorithm and a forward parity misconfiguration when unstriped with a reverse parity algorithm. When I looked at the file system with a data recovery tool, many files would be visible but more than half wouldn't open. On inspection of the unstriped volume, parity stripes were evident in the place that data should be. After careful review of each drive I could see that the data was striped at 128 sectors but the parity was striped at 2048 sectors (16 x 128). I was able to rewrite my unstripe routine to account for this difference and get the data back.
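
One way to sanity check a candidate layout is to XOR the data stripes of a stripe set together and compare the result against the suspected parity stripe. The sketch below is only an illustration of that check, not the routine I actually used; the image file names, the offsets, and the 128-sector data stripe are assumptions.

    from functools import reduce

    SECTOR = 512
    DATA_STRIPE = 128 * SECTOR     # the data stripes were 128 sectors

    def read_block(image_path, offset, length=DATA_STRIPE):
        with open(image_path, "rb") as f:
            f.seek(offset)
            return f.read(length)

    def xor_blocks(blocks):
        # Byte-wise XOR; for a consistent RAID 5 stripe set this equals the parity stripe.
        return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

    # Hypothetical check of one stripe set: three data stripes vs. the suspected parity.
    # data = [read_block(p, 0) for p in ("disk0.img", "disk1.img", "disk2.img")]
    # parity = read_block("disk3.img", 0)
    # print(xor_blocks(data) == parity)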

In my next post I will address data mapping schemes.