Apr 25, 2015

EMC - Isilon - How Data Protection Works

1. Overview
In OneFS, protection is calculated on individual files. To calculate protection, each file is logically broken down into 128 KB stripe units, protection is calculated for the file, and protection stripe units are created. The data stripe units and the protection stripe units together form a stripe, and stripe width is the number of stripe units that make up a stripe. Stripe units are sent to individual nodes in the cluster. As a result, when a file is needed, multiple nodes in the cluster are able to deliver the data back to the requesting user or application. This dramatically improves overall performance, especially when hundreds or even thousands of these requests are made simultaneously from an application. Due to the way in which OneFS applies protection, files that are 128 KB in size or smaller are actually mirrored.
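As a rough sketch of this breakdown (illustrative Python, not OneFS internals; the function names and the list-of-chunks representation are assumptions for the example), the following splits a file buffer into 128 KB stripe units and applies the small-file rule:

    STRIPE_UNIT = 128 * 1024  # 128 KB, as described above

    def split_into_stripe_units(data: bytes) -> list:
        """Logically break a file into 128 KB stripe units."""
        return [data[i:i + STRIPE_UNIT] for i in range(0, len(data), STRIPE_UNIT)]

    def is_mirrored(file_size: int) -> bool:
        """Files of 128 KB or smaller are mirrored rather than FEC-protected."""
        return file_size <= STRIPE_UNIT

    print(len(split_into_stripe_units(b"\x00" * (1024 * 1024))))  # 1 MB file -> 8 data stripe units
    print(is_mirrored(64 * 1024))                                 # True: a 64 KB file is simply mirrored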
OneFS does not use RAID to protect cluster data. Instead, OneFS uses the Reed-Solomon algorithm for N+M protection, an industry-standard erasure code that enables very high protection levels. In the N+M data protection model, N represents the number of data stripe units and M represents the number of simultaneous failures of nodes or drives (or a combination of nodes and drives) that the cluster can withstand without incurring data loss. M also equals the number of protection stripe units that are created within a stripe, and N must be larger than M. For many N+M protection levels, there are no RAID equivalents. On an Isilon cluster you can enable N+1, N+2, N+3, or N+4 protection, which allows the cluster to sustain one, two, three, or four simultaneous failures, respectively, without data loss.
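For the simplest level, N+1, the single protection stripe unit behaves like XOR parity over the N data stripe units, so a lost unit can be recomputed from the survivors. The sketch below shows only this M=1 special case; it is a simplification, since OneFS uses Reed-Solomon coding to support M values up to 4:

    def xor_parity(units):
        """Compute one protection unit as the byte-wise XOR of equally sized stripe units."""
        parity = bytearray(len(units[0]))
        for unit in units:
            for i, b in enumerate(unit):
                parity[i] ^= b
        return bytes(parity)

    data_units = [b"\x01" * 4, b"\x02" * 4, b"\x04" * 4]  # N = 3 data stripe units
    protection = xor_parity(data_units)                   # M = 1 protection stripe unit

    # Simulate losing data unit 1 and rebuilding it from the survivors plus the parity unit.
    rebuilt = xor_parity([data_units[0], data_units[2], protection])
    assert rebuilt == data_units[1]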




OneFS is a distributed clustered file system that runs on all nodes in the cluster. As nodes are added, the file system grows dynamically and content is evenly distributed to every node. There is no master or controlling node in the cluster; all information is shared among nodes, so the entire file system is accessible by clients connecting to any node in the cluster.
OneFS uses advanced data layout algorithms to determine data layout for maximum efficiency and performance. Data is evenly distributed across nodes as it is written. Because data protection is handled by OneFS, the system can continuously reallocate data and make storage space more usable and efficient. As the cluster size increases, the system stores large files more efficiently.





All write operations from a client are spread across the nodes of a cluster in a process called striping. For example, a file is broken down into data stripe units and then striped across disks in the cluster along with protection information. Even though a client is connected to only one node, when that client saves data to the cluster, the write operation occurs on multiple nodes in the cluster. The same is true for read operations. A client is connected to only one node at a time; however, when that client requests a file from the cluster, the node to which the client is connected will not have the entire file locally on its drives. The client's node retrieves and rebuilds the file across the back-end InfiniBand network.
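A minimal sketch of the placement idea, assuming a naive one-unit-per-node assignment (the real OneFS layout algorithms are more sophisticated): each stripe unit of a stripe lands on a different node, so reading the file back means gathering units from several nodes.

    def place_stripe(stripe_units, node_count):
        """Assign each unit of a single stripe to a different node (naive sketch)."""
        if len(stripe_units) > node_count:
            raise ValueError("one stripe cannot put two of its units on the same node")
        return dict(enumerate(stripe_units))  # node number -> stripe unit

    # A 4-unit stripe (3 data units + 1 protection unit) spread over a 5-node cluster:
    layout = place_stripe([b"d0", b"d1", b"d2", b"p0"], node_count=5)

    # The node the client is connected to gathers the data units from the other
    # nodes over the back-end network and rebuilds the file.
    file_back = b"".join(layout[node] for node in (0, 1, 2))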




Data and metadata in OneFS are striped or mirrored across all nodes in a cluster or node pool, which provides redundancy, availability, and in some cases performance benefits. FlexProtect provides two types of protection: data may be protected with FEC (forward error correction) or with mirroring.
FEC is conceptually similar to parity in traditional storage environments, but FEC information is distributed across nodes and across drives. In the event of a drive or node failure, the missing data is either rebuilt from a mirror or recalculated from the FEC information and the remaining data.
Mirroring copies data to multiple locations, creating redundancy. With mirroring, you can keep up to eight copies of data in a single cluster, compared to the two copies of traditional RAID 1. Both mirroring and FEC protection generate additional storage overhead; the percentage of FEC overhead can decline as the cluster becomes larger, especially for larger files.
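A quick back-of-the-envelope comparison (simple arithmetic, not measured figures) shows why FEC overhead shrinks as stripes get wider while mirroring overhead stays fixed:

    def fec_overhead(n_data, m_protection):
        """Fraction of a stripe consumed by protection units under N+M FEC."""
        return m_protection / (n_data + m_protection)

    def mirror_overhead(copies):
        """Fraction of raw capacity consumed by the extra copies with X-way mirroring."""
        return (copies - 1) / copies

    print(f"2+1 FEC:      {fec_overhead(2, 1):.0%} overhead")   # small cluster: 33%
    print(f"16+1 FEC:     {fec_overhead(16, 1):.0%} overhead")  # large cluster: ~6%
    print(f"2x mirroring: {mirror_overhead(2):.0%} overhead")   # fixed at 50%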
EMC Isilon allows you to define protection on a node pool (group of similar nodes), a directory or even an individual file, as well as have multiple protection levels configured throughout the cluster. This feature makes it possible to match the protection level to the value of the data that is being protected.


2. N+M:B
In addition to the basic N+M model of data protection, OneFS also supports N+M:B, which separates drive-failure protection from node-failure protection. In the N+M:B notation, B is the number of node failures that can be tolerated. For example, N+2 allows the cluster to sustain the loss of two drives or two nodes, whereas N+2:1 allows the cluster to sustain the loss of two drives or one node without data loss. The important difference is the amount of overhead lost to protection data, which is greatly reduced with N+M:B. N+3:1 data protection allows the cluster to sustain the loss of three drives or one node without data loss. The default node pool protection level is N+2:1.
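The difference between the two notations can be summarized in a few lines (the parsing helper below is hypothetical, not a OneFS API):

    def failure_tolerance(level: str):
        """Return (drive_failures, node_failures) tolerated by a '+M' or '+M:B' level."""
        m, _, b = level.lstrip("+").partition(":")
        drives = int(m)
        nodes = int(b) if b else int(m)  # plain N+M tolerates M node failures as well
        return drives, nodes

    print(failure_tolerance("+2"))    # (2, 2): two drives or two nodes
    print(failure_tolerance("+2:1"))  # (2, 1): two drives or one node, with less overhead
    print(failure_tolerance("+3:1"))  # (3, 1): three drives or one node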
Protection is applied at the file level, and OneFS allows different protection levels on files, directories, and node pools. By default, a file inherits the protection level of its parent directory, but any protection level can be changed at any time. Metadata is protected at the same level as the corresponding file.
You can set a protection level that is higher than the cluster can support. For example, in a 4-node cluster you can set the protection level at N+2. OneFS will protect the data at 2x until a 5th node is added to the cluster, and then it will automatically re-protect the data at N+2.
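One way to see why a 4-node cluster cannot yet honor N+2 follows from the N > M rule stated earlier: with at most one stripe unit per node, N+M needs at least 2M+1 nodes. A small sketch of that check (an inference from the rules above, not an exact description of OneFS behavior):

    def can_apply_n_plus_m(nodes: int, m: int) -> bool:
        """N+M needs N > M with at most one stripe unit per node, so at least 2M + 1 nodes."""
        return nodes >= 2 * m + 1

    print(can_apply_n_plus_m(4, 2))  # False: the 4-node cluster mirrors the data for now
    print(can_apply_n_plus_m(5, 2))  # True: once a 5th node joins, data is re-protected at N+2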





When you protect data from hardware failure, you lose an amount of disk space to protection. The protection overhead depends on the protection setting, the file size, and the number of nodes in the cluster, and the percentage of overhead declines as nodes are added to the cluster.
The overhead declines as the number of nodes grows because the stripe width increases as the number of nodes increases. Stripe width is the number of stripe units that are in a stripe. The maximum stripe width in an Isilon cluster is 20 stripe units, up to 16 of which can be data stripe units and up to 4 of which can be protection stripe units. Since a large file is broken down into data and protection stripe units, files that are larger than 2 MB (16 x 128 KB) need more than one stripe.
For example:

  • N+1 at 3 nodes = 2+1 (max stripe width of 3)
  • N+1 at 17 nodes = 16+1 (max stripe width of 17)
  • N+4 at 9 nodes = 5+4 (max stripe width of 9)
  • N+4 at 20 nodes = 16+4 (max stripe width of 20)
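The examples above follow from two caps: at most 16 data units and 4 protection units per stripe, and no more than one stripe unit per node. A short sketch that reproduces those numbers and counts how many stripes a large file needs (illustrative helpers, not OneFS code):

    import math

    MAX_DATA_UNITS = 16
    MAX_PROTECTION_UNITS = 4
    STRIPE_UNIT = 128 * 1024  # 128 KB

    def stripe_layout(nodes: int, m: int):
        """Return (data_units, protection_units) for N+M on a cluster of the given size."""
        if not 1 <= m <= MAX_PROTECTION_UNITS:
            raise ValueError("M must be between 1 and 4")
        data = min(MAX_DATA_UNITS, nodes - m)  # at most one stripe unit per node
        return data, m

    for nodes, m in [(3, 1), (17, 1), (9, 4), (20, 4)]:
        data, prot = stripe_layout(nodes, m)
        print(f"N+{m} at {nodes} nodes = {data}+{prot} (stripe width {data + prot})")

    def stripes_for_file(file_size: int, data_units: int) -> int:
        """A file larger than data_units x 128 KB needs more than one stripe."""
        return math.ceil(file_size / (data_units * STRIPE_UNIT))

    print(stripes_for_file(5 * 1024 * 1024, 16))  # a 5 MB file at 16 data units -> 3 stripes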



The protection overhead for each protection level depends on the file size and the number of nodes in the cluster. The percentage of protection overhead declines as the cluster gets larger. In general, +1 protection has a protection overhead equal to one node's capacity, +2 protection has a protection overhead equal to two nodes' capacity, +3 is equal to three nodes' capacity, and so on.
OneFS also supports optional data mirroring from 2x-8x, allowing from two to eight mirrors of the specified content. Data mirroring requires significant storage overhead and may not always be the best data-protection method. For example, if you enable 3x mirroring, the specified content is explicitly duplicated three times on the cluster; depending on the amount of content being mirrored, this can require a significant amount of capacity.
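To put the mirroring cost in perspective, the arithmetic for each supported level (simple fractions, not measured capacity figures):

    for copies in range(2, 9):  # OneFS mirroring levels 2x through 8x
        usable = 1 / copies     # fraction of raw capacity holding unique data
        print(f"{copies}x mirroring: {usable:.1%} usable capacity, {1 - usable:.1%} overhead")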




3. Drives & Node Failures
A drive has three possible states:

  •  Healthy. The drive is in its normal operating condition.
  •  SmartFail. A restripe process is taking place.
  •  Not in use. The drive has either physically or logically been removed from the node.

Under normal operating conditions, all data on the cluster is protected against one or more failures of a node or drive. If a node or drive fails, the protection status is considered to be in a degraded state until the data has been reprotected to the configured protection level.
FlexProtect is responsible for restriping and reprotecting the data. Occasionally a drive may fail without the system preemptively detecting a problem; in this case, FlexProtect automatically starts rebuilding the data to available free space on the cluster. After confirming that the FlexProtect operation has completed with no errors, you can hot-swap the drive and then add the new drive to the cluster by using the web administration interface or the command-line interface.
You can identify a node with a failed drive by the alert light on the front panel. You must remove the front panel of a node to access the front drive bays. The individual drive bay of the failed hard drive is also indicated by a red light. If the failed drive is in a 4U node, the failed drive may be in the rear of the node. To replace a drive, you first release the locking handle by pulling it toward you until it releases the drive. You can then carefully remove the drive from the node.




A node loss is often a temporary issue, so FlexProtect does not automatically start reprotecting data when a node does not respond or goes offline. If a node reboots, the file system does not need to be rebuilt because it remained intact during the temporary failure. In an N+2:1 configuration, if one node fails, all data is still accessible from every other node in the cluster. If the node comes back online, it rejoins the cluster automatically without requiring a full rebuild.

To maintain an accurate cluster state, if you physically remove a node from the cluster, you must also logically remove the node from the cluster. After you logically remove a node, it automatically reformats its own drives, and resets itself to the factory default settings. The reset occurs only after OneFS has confirmed that all data has been protected again. You can logically remove a node using the SmartFail process.
During the SmartFail process, the node that is to be removed is placed in a read-only state while the cluster performs a FlexProtect process to logically move all data from the affected node. After all data migration is complete, the cluster logically changes its available maximum stripe width to the new configuration; at this point, it is safe to physically remove the node. It is important that you use the SmartFail process only when you want to permanently remove a node from the cluster.
It is more efficient to add a replacement node to the cluster before failing the old node, because FlexProtect can immediately use the replacement node to rebuild the failed node's data. If you remove the failed node first, FlexProtect must rebuild the node's data into available space in the cluster, and AutoBalance then transfers the data to the replacement node once it is added to the cluster.



