X-Git-Url: http://ftp.safe.ca/?a=blobdiff_plain;f=Documentation%2Fmd.txt;h=e4e893ef3e012e1fa2629b63e3054fcd15f5600d;hb=a33f32244d8550da8b4a26e277ce07d5c6d158b5;hp=b19978e035fca1f457ab8a39bca2fbb6191309a8;hpb=16f17b39f385212b73278a76d482cdcaaebe6c02;p=safe%2Fjmp%2Flinux-2.6 diff --git a/Documentation/md.txt b/Documentation/md.txt index b19978e..e4e893e 100644 --- a/Documentation/md.txt +++ b/Documentation/md.txt @@ -62,7 +62,7 @@ be reconstructed (due to no parity). For this reason, md will normally refuse to start such an array. This requires the sysadmin to take action to explicitly start the array -desipite possible corruption. This is normally done with +despite possible corruption. This is normally done with mdadm --assemble --force .... This option is not really available if the array has the root @@ -136,7 +136,7 @@ raid_disks != 0. Then uninitialized devices can be added with ADD_NEW_DISK. The structure passed to ADD_NEW_DISK must specify the state of the device -and it's role in the array. +and its role in the array. Once started with RUN_ARRAY, uninitialized spares can be added with HOT_ADD_DISK. @@ -154,29 +154,63 @@ contains further md-specific information about the device. All md devices contain: level - a text file indicating the 'raid level'. This may be a standard - numerical level prefixed by "RAID-" - e.g. "RAID-5", or some - other name such as "linear" or "multipath". + a text file indicating the 'raid level'. e.g. raid0, raid1, + raid5, linear, multipath, faulty. If no raid level has been set yet (array is still being - assembled), this file will be empty. + assembled), the value will reflect whatever has been written + to it, which may be a name like the above, or may be a number + such as '0', '5', etc. raid_disks a text file with a simple number indicating the number of devices in a fully functional array. If this is not yet known, the file - will be empty. If an array is being resized (not currently - possible) this will contain the larger of the old and new sizes. - Some raid level (RAID1) allow this value to be set while the - array is active. This will reconfigure the array. Otherwise - it can only be set while assembling an array. + will be empty. If an array is being resized this will contain + the new number of devices. + Some raid levels allow this value to be set while the array is + active. This will reconfigure the array. Otherwise it can only + be set while assembling an array. + A change to this attribute will not be permitted if it would + reduce the size of the array. To reduce the number of drives + in an e.g. raid5, the array size must first be reduced by + setting the 'array_size' attribute. chunk_size - This is the size if bytes for 'chunks' and is only relevant to - raid levels that involve striping (1,4,5,6,10). The address space + This is the size in bytes for 'chunks' and is only relevant to + raid levels that involve striping (0,4,5,6,10). The address space of the array is conceptually divided into chunks and consecutive chunks are striped onto neighbouring devices. - The size should be atleast PAGE_SIZE (4k) and should be a power + The size should be at least PAGE_SIZE (4k) and should be a power of 2. This can only be set while assembling an array + layout + The "layout" for the array for the particular level. This is + simply a number that is interpretted differently by different + levels. It can be written while assembling an array. + + array_size + This can be used to artificially constrain the available space in + the array to be less than is actually available on the combined + devices. Writing a number (in Kilobytes) which is less than + the available size will set the size. Any reconfiguration of the + array (e.g. adding devices) will not cause the size to change. + Writing the word 'default' will cause the effective size of the + array to be whatever size is actually available based on + 'level', 'chunk_size' and 'component_size'. + + This can be used to reduce the size of the array before reducing + the number of devices in a raid4/5/6, or to support external + metadata formats which mandate such clipping. + + reshape_position + This is either "none" or a sector number within the devices of + the array where "reshape" is up to. If this is set, the three + attributes mentioned above (raid_disks, chunk_size, layout) can + potentially have 2 values, an old and a new value. If these + values differ, reading the attribute returns + new (old) + and writing will effect the 'new' value, leaving the 'old' + unchanged. + component_size For arrays with data redundancy (i.e. not raid0, linear, faulty, multipath), all components must be the same size - or at least @@ -191,14 +225,17 @@ All md devices contain: about the array. It can be 0.90 (traditional format), 1.0, 1.1, 1.2 (newer format in varying locations) or "none" indicating that the kernel isn't managing metadata at all. - - level - The raid 'level' for this array. The name will often (but not - always) be the same as the name of the module that implements the - level. To be auto-loaded the module must have an alias - md-$LEVEL e.g. md-raid5 - This can be written only while the array is being assembled, not - after it is started. + Alternately it can be "external:" followed by a string which + is set by user-space. This indicates that metadata is managed + by a user-space program. Any device failure or other event that + requires a metadata update will cause array activity to be + suspended until the event is acknowledged. + + resync_start + The point at which resync should start. If no resync is needed, + this will be a very large number (or 'none' since 2.6.30-rc1). At + array creation it will default to 0, though starting the array as + 'clean' will set it much larger. new_dev This file can be written but not read. The value written should @@ -210,34 +247,100 @@ All md devices contain: safe_mode_delay When an md array has seen no write requests for a certain period of time, it will be marked as 'clean'. When another write - request arrive, the array is marked as 'dirty' before the write - commenses. This is known as 'safe_mode'. + request arrives, the array is marked as 'dirty' before the write + commences. This is known as 'safe_mode'. The 'certain period' is controlled by this file which stores the period as a number of seconds. The default is 200msec (0.200). Writing a value of 0 disables safemode. - sync_speed_min - sync_speed_max - This are similar to /proc/sys/dev/raid/speed_limit_{min,max} - however they only apply to the particular array. - If no value has been written to these, of if the word 'system' - is written, then the system-wide value is used. If a value, - in kibibytes-per-second is written, then it is used. - When the files are read, they show the currently active value - followed by "(local)" or "(system)" depending on whether it is - a locally set or system-wide value. - - sync_completed - This shows the number of sectors that have been completed of - whatever the current sync_action is, followed by the number of - sectors in total that could need to be processed. The two - numbers are separated by a '/' thus effectively showing one - value, a fraction of the process that is complete. - - sync_speed - This shows the current actual speed, in K/sec, of the current - sync_action. It is averaged over the last 30 seconds. - + array_state + This file contains a single word which describes the current + state of the array. In many cases, the state can be set by + writing the word for the desired state, however some states + cannot be explicitly set, and some transitions are not allowed. + + Select/poll works on this file. All changes except between + active_idle and active (which can be frequent and are not + very interesting) are notified. active->active_idle is + reported if the metadata is externally managed. + + clear + No devices, no size, no level + Writing is equivalent to STOP_ARRAY ioctl + inactive + May have some settings, but array is not active + all IO results in error + When written, doesn't tear down array, but just stops it + suspended (not supported yet) + All IO requests will block. The array can be reconfigured. + Writing this, if accepted, will block until array is quiessent + readonly + no resync can happen. no superblocks get written. + write requests fail + read-auto + like readonly, but behaves like 'clean' on a write request. + + clean - no pending writes, but otherwise active. + When written to inactive array, starts without resync + If a write request arrives then + if metadata is known, mark 'dirty' and switch to 'active'. + if not known, block and switch to write-pending + If written to an active array that has pending writes, then fails. + active + fully active: IO and resync can be happening. + When written to inactive array, starts with resync + + write-pending + clean, but writes are blocked waiting for 'active' to be written. + + active-idle + like active, but no writes have been seen for a while (safe_mode_delay). + + bitmap/location + This indicates where the write-intent bitmap for the array is + stored. + It can be one of "none", "file" or "[+-]N". + "file" may later be extended to "file:/file/name" + "[+-]N" means that many sectors from the start of the metadata. + This is replicated on all devices. For arrays with externally + managed metadata, the offset is from the beginning of the + device. + bitmap/chunksize + The size, in bytes, of the chunk which will be represented by a + single bit. For RAID456, it is a portion of an individual + device. For RAID10, it is a portion of the array. For RAID1, it + is both (they come to the same thing). + bitmap/time_base + The time, in seconds, between looking for bits in the bitmap to + be cleared. In the current implementation, a bit will be cleared + between 2 and 3 times "time_base" after all the covered blocks + are known to be in-sync. + bitmap/backlog + When write-mostly devices are active in a RAID1, write requests + to those devices proceed in the background - the filesystem (or + other user of the device) does not have to wait for them. + 'backlog' sets a limit on the number of concurrent background + writes. If there are more than this, new writes will by + synchronous. + bitmap/metadata + This can be either 'internal' or 'external'. + 'internal' is the default and means the metadata for the bitmap + is stored in the first 256 bytes of the allocated space and is + managed by the md module. + 'external' means that bitmap metadata is managed externally to + the kernel (i.e. by some userspace program) + bitmap/can_clear + This is either 'true' or 'false'. If 'true', then bits in the + bitmap will be cleared when the corresponding blocks are thought + to be in-sync. If 'false', bits will never be cleared. + This is automatically set to 'false' if a write happens on a + degraded array, or if the array becomes degraded during a write. + When metadata is managed externally, it should be set to true + once the array becomes non-degraded, and this fact has been + recorded in the metadata. + + + As component devices are added to an md array, they appear in the 'md' directory as new directories named @@ -259,10 +362,29 @@ Each directory contains: faulty - device has been kicked from active use due to a detected fault in_sync - device is a fully in-sync member of the array + writemostly - device will only be subject to read + requests if there are no other options. + This applies only to raid1 arrays. + blocked - device has failed, metadata is "external", + and the failure hasn't been acknowledged yet. + Writes that would write to this device if + it were not faulty are blocked. spare - device is working, but not a full member. This includes spares that are in the process - of being recoverred to - This list make grow in future. + of being recovered to + This list may grow in future. + This can be written to. + Writing "faulty" simulates a failure on the device. + Writing "remove" removes the device from the array. + Writing "writemostly" sets the writemostly flag. + Writing "-writemostly" clears the writemostly flag. + Writing "blocked" sets the "blocked" flag. + Writing "-blocked" clears the "blocked" flag and allows writes + to complete. + Writing "in_sync" sets the in_sync flag. + + This file responds to select/poll. Any change to 'faulty' + or 'blocked' causes an event. errors An approximate count of read errors that have been detected on @@ -279,7 +401,7 @@ Each directory contains: This gives the role that the device has in the array. It will either be 'none' if the device is not active in the array (i.e. is a spare or has failed) or an integer less than the - 'raid_disks' number for the array indicating which possition + 'raid_disks' number for the array indicating which position it currently fills. This can only be set while assembling an array. A device for which this is set is assumed to be working. @@ -294,7 +416,25 @@ Each directory contains: for storage of data. This will normally be the same as the component_size. This can be written while assembling an array. If a value less than the current component_size is - written, component_size will be reduced to this value. + written, it will be rejected. + + recovery_start + + When the device is not 'in_sync', this records the number of + sectors from the start of the device which are known to be + correct. This is normally zero, but during a recovery + operation is will steadily increase, and if the recovery is + interrupted, restoring this value can cause recovery to + avoid repeating the earlier blocks. With v1.x metadata, this + value is saved and restored automatically. + + This can be set whenever the device is not an active member of + the array, either before the array is activated, or before + the 'slot' is set. + + Setting this to 'none' is equivalent to setting 'in_sync'. + Setting to any other value also clears the 'in_sync' flag. + An active md device will also contain and entry for each active device @@ -302,7 +442,7 @@ in the array. These are named rdNN -where 'NN' is the possition in the array, starting from 0. +where 'NN' is the position in the array, starting from 0. So for a 3 drive array there will be rd0, rd1, rd2. These are symbolic links to the appropriate 'dev-XXX' entry. Thus, for example, @@ -343,6 +483,19 @@ also have 'check' and 'repair' will start the appropriate process providing the current state is 'idle'. + This file responds to select/poll. Any important change in the value + triggers a poll event. Sometimes the value will briefly be + "recover" if a recovery seems to be needed, but cannot be + achieved. In that case, the transition to "recover" isn't + notified, but the transition away is. + + degraded + This contains a count of the number of devices by which the + arrays is degraded. So an optimal array with show '0'. A + single failed/missing drive will show '1', etc. + This file responds to select/poll, any increase or decrease + in the count of missing devices will trigger an event. + mismatch_count When performing 'check' and 'repair', and possibly when performing 'resync', md will count the number of errors that are @@ -352,6 +505,54 @@ also have than sectors, this my be larger than the number of actual errors by a factor of the number of sectors in a page. + bitmap_set_bits + If the array has a write-intent bitmap, then writing to this + attribute can set bits in the bitmap, indicating that a resync + would need to check the corresponding blocks. Either individual + numbers or start-end pairs can be written. Multiple numbers + can be separated by a space. + Note that the numbers are 'bit' numbers, not 'block' numbers. + They should be scaled by the bitmap_chunksize. + + sync_speed_min + sync_speed_max + This are similar to /proc/sys/dev/raid/speed_limit_{min,max} + however they only apply to the particular array. + If no value has been written to these, of if the word 'system' + is written, then the system-wide value is used. If a value, + in kibibytes-per-second is written, then it is used. + When the files are read, they show the currently active value + followed by "(local)" or "(system)" depending on whether it is + a locally set or system-wide value. + + sync_completed + This shows the number of sectors that have been completed of + whatever the current sync_action is, followed by the number of + sectors in total that could need to be processed. The two + numbers are separated by a '/' thus effectively showing one + value, a fraction of the process that is complete. + A 'select' on this attribute will return when resync completes, + when it reaches the current sync_max (below) and possibly at + other times. + + sync_max + This is a number of sectors at which point a resync/recovery + process will pause. When a resync is active, the value can + only ever be increased, never decreased. The value of 'max' + effectively disables the limit. + + + sync_speed + This shows the current actual speed, in K/sec, of the current + sync_action. It is averaged over the last 30 seconds. + + suspend_lo + suspend_hi + The two values, given as numbers of sectors, indicate a range + within the array where IO will be blocked. This is currently + only supported for raid4/5/6. + + Each active md device may also have attributes specific to the personality module that manages it. These are specific to the implementation of the module and could @@ -364,3 +565,9 @@ These currently include there are upper and lower limits (32768, 16). Default is 128. strip_cache_active (currently raid5 only) number of active entries in the stripe cache + preread_bypass_threshold (currently raid5 only) + number of times a stripe requiring preread will be bypassed by + a stripe that does not require preread. For fairness defaults + to 1. Setting this to 0 disables bypass accounting and + requires preread stripes to wait until all full-width stripe- + writes are complete. Valid values are 0 to stripe_cache_size.