Ext4 Filesystem
===============
-This is a development version of the ext4 filesystem, an advanced level
-of the ext3 filesystem which incorporates scalability and reliability
-enhancements for supporting large filesystems (64 bit) in keeping with
-increasing disk capacities and state-of-the-art feature requirements.
+Ext4 is an an advanced level of the ext3 filesystem which incorporates
+scalability and reliability enhancements for supporting large filesystems
+(64 bit) in keeping with increasing disk capacities and state-of-the-art
+feature requirements.
-Mailing list: linux-ext4@vger.kernel.org
+Mailing list: linux-ext4@vger.kernel.org
+Web site: http://ext4.wiki.kernel.org
1. Quick usage instructions:
===========================
+Note: More extensive information for getting started with ext4 can be
+ found at the ext4 wiki site at the URL:
+ http://ext4.wiki.kernel.org/index.php/Ext4_Howto
+
- Compile and install the latest version of e2fsprogs (as of this
- writing version 1.41) from:
+ writing version 1.41.3) from:
http://sourceforge.net/project/showfiles.php?group_id=2406
# mke2fs -t ext4 /dev/hda1
- Or configure an existing ext3 filesystem to support extents and set
- the test_fs flag to indicate that it's ok for an in-development
- filesystem to touch this filesystem:
+ Or to configure an existing ext3 filesystem to support extents:
- # tune2fs -O extents -E test_fs /dev/hda1
+ # tune2fs -O extents /dev/hda1
If the filesystem was created with 128 byte inodes, it can be
converted to use 256 byte for greater efficiency via:
# mount -t ext4 /dev/hda1 /wherever
- - When comparing performance with other filesystems, remember that
- ext3/4 by default offers higher data integrity guarantees than most.
- So when comparing with a metadata-only journalling filesystem, such
- as ext3, use `mount -o data=writeback'. And you might as well use
- `mount -o nobh' too along with it. Making the journal larger than
- the mke2fs default often helps performance with metadata-intensive
- workloads.
+ - When comparing performance with other filesystems, it's always
+ important to try multiple workloads; very often a subtle change in a
+ workload parameter can completely change the ranking of which
+ filesystems do well compared to others. When comparing versus ext3,
+ note that ext4 enables write barriers by default, while ext3 does
+ not enable write barriers by default. So it is useful to use
+ explicitly specify whether barriers are enabled or not when via the
+ '-o barriers=[0|1]' mount option for both ext3 and ext4 filesystems
+ for a fair comparison. When tuning ext3 for best benchmark numbers,
+ it is often worthwhile to try changing the data journaling mode; '-o
+ data=writeback,nobh' can be faster for some workloads. (Note
+ however that running mounted with data=writeback can potentially
+ leave stale data exposed in recently written files in case of an
+ unclean shutdown, which could be a security exposure in some
+ situations.) Configuring the filesystem with a large journal can
+ also be helpful for metadata-intensive workloads.
2. Features
===========
* ability to use filesystems > 16TB (e2fsprogs support not available yet)
* extent format reduces metadata overhead (RAM, IO for access, transactions)
* extent format more robust in face of on-disk corruption due to magics,
-* internal redunancy in tree
+* internal redundancy in tree
* improved file allocation (multi-block alloc)
-* fix 32000 subdirectory limit
+* lift 32000 subdirectory limit imposed by i_links_count[1]
* nsec timestamps for mtime, atime, ctime, create time
* inode version field on disk (NFSv4, Lustre)
* reduced e2fsck time via uninit_bg feature
* efficent new ordered mode in JBD2 and ext4(avoid using buffer head to force
the ordering)
+[1] Filesystems with a block size of 1k may see a limit imposed by the
+directory hash tree having a maximum depth of two.
+
2.2 Candidate features for future inclusion
* Online defrag (patches available but not well tested)
The big performance win will come with mballoc, delalloc and flex_bg
grouping of bitmaps and inode tables. Some test results available here:
- - http://www.bullopensource.org/ext4/20080530/ffsb-write-2.6.26-rc2.html
- - http://www.bullopensource.org/ext4/20080530/ffsb-readwrite-2.6.26-rc2.html
+ - http://www.bullopensource.org/ext4/20080818-ffsb/ffsb-write-2.6.27-rc1.html
+ - http://www.bullopensource.org/ext4/20080818-ffsb/ffsb-readwrite-2.6.27-rc1.html
3. Options
==========
When mounting an ext4 filesystem, the following option are accepted:
(*) == default
-extents (*) ext4 will use extents to address file data. The
- file system will no longer be mountable by ext3.
-
-noextents ext4 will not use extents for newly created files
-
-journal_checksum Enable checksumming of the journal transactions.
- This will allow the recovery code in e2fsck and the
- kernel to detect corruption in the kernel. It is a
- compatible change and will be ignored by older kernels.
+ro Mount filesystem read only. Note that ext4 will
+ replay the journal (and thus write to the
+ partition) even when mounted "read only". The
+ mount options "ro,noload" can be used to prevent
+ writes to the filesystem.
journal_async_commit Commit block can be written to disk without waiting
for descriptor blocks. If enabled older kernels cannot
- mount the device. This will enable 'journal_checksum'
- internally.
+ mount the device.
journal=update Update the ext4 file system's journal to the current
format.
-journal=inum When a journal already exists, this option is ignored.
- Otherwise, it specifies the number of the inode which
- will represent the ext4 file system's journal file.
-
journal_dev=devnum When the external journal device's major/minor numbers
have changed, this option allows the user to specify
the new journal location. The journal device is
identified through its new major/minor numbers encoded
in devnum.
-noload Don't load the journal on mounting.
+noload Don't load the journal on mounting. Note that
+ if the filesystem was not unmounted cleanly,
+ skipping the journal replay will lead to the
+ filesystem containing inconsistencies that can
+ lead to any number of problems.
data=journal All data are committed into the journal prior to being
written into the main file system.
performance.
barrier=<0|1(*)> This enables/disables the use of write barriers in
- the jbd code. barrier=0 disables, barrier=1 enables.
- This also requires an IO stack which can support
+barrier(*) the jbd code. barrier=0 disables, barrier=1 enables.
+nobarrier This also requires an IO stack which can support
barriers, and if jbd gets an error on a barrier
write, it will disable again with a warning.
Write barriers enforce proper on-disk ordering
safe to use, at some performance penalty. If
your disks are battery-backed in one way or another,
disabling barriers may safely improve performance.
+ The mount options "barrier" and "nobarrier" can
+ also be used to enable or disable barriers, for
+ consistency with other ext4 mount options.
inode_readahead=n This tuning parameter controls the maximum
number of inode table blocks that ext4's inode
bsddf (*) Make 'df' act like BSD.
minixdf Make 'df' act like Minix.
-check=none Don't do extra checking of bitmaps on mount.
-nocheck
-
debug Extra debugging information is sent to syslog.
-errors=remount-ro(*) Remount the filesystem read-only on an error.
+abort Simulate the effects of calling ext4_abort() for
+ debugging purposes. This is normally used while
+ remounting a filesystem which is already mounted.
+
+errors=remount-ro Remount the filesystem read-only on an error.
errors=continue Keep going on a filesystem error.
errors=panic Panic and halt the machine if an error occurs.
+ (These mount options override the errors behavior
+ specified in the superblock, which can be configured
+ using tune2fs)
data_err=ignore(*) Just print an error message if an error occurs
in a file data buffer in ordered mode.
sb=n Use alternate superblock at this location.
-quota
-noquota
-grpquota
-usrquota
+quota These options are ignored by the filesystem. They
+noquota are used only by quota tools to recognize volumes
+grpquota where quota should be turned on. See documentation
+usrquota in the quota-tools package for more details
+ (http://sourceforge.net/projects/linuxquota).
+
+jqfmt=<quota type> These options tell filesystem details about quota
+usrjquota=<file> so that quota information can be properly updated
+grpjquota=<file> during journal replay. They replace the above
+ quota options. See documentation in the quota-tools
+ package for more details
+ (http://sourceforge.net/projects/linuxquota).
bh (*) ext4 associates buffer heads to data pages to
nobh (a) cache disk block mapping information
"nobh" option tries to avoid associating buffer
heads (supported only for "writeback" mode).
-mballoc (*) Use the multiple block allocator for block allocation
-nomballoc disabled multiple block allocator for block allocation.
stripe=n Number of filesystem blocks that mballoc will try
to use for allocation size and alignment. For RAID5/6
systems this should be the number of data
nodelalloc Disable delayed allocation. Blocks are allocation
when data is copied from user to page cache.
+max_batch_time=usec Maximum amount of time ext4 should wait for
+ additional filesystem operations to be batch
+ together with a synchronous write operation.
+ Since a synchronous write operation is going to
+ force a commit and then a wait for the I/O
+ complete, it doesn't cost much, and can be a
+ huge throughput win, we wait for a small amount
+ of time to see if any other transactions can
+ piggyback on the synchronous write. The
+ algorithm used is designed to automatically tune
+ for the speed of the disk, by measuring the
+ amount of time (on average) that it takes to
+ finish committing a transaction. Call this time
+ the "commit time". If the time that the
+ transaction has been running is less than the
+ commit time, ext4 will try sleeping for the
+ commit time to see if other operations will join
+ the transaction. The commit time is capped by
+ the max_batch_time, which defaults to 15000us
+ (15ms). This optimization can be turned off
+ entirely by setting max_batch_time to 0.
+
+min_batch_time=usec This parameter sets the commit time (as
+ described above) to be at least min_batch_time.
+ It defaults to zero microseconds. Increasing
+ this parameter may improve the throughput of
+ multi-threaded, synchronous workloads on very
+ fast disks, at the cost of increasing latency.
+
+journal_ioprio=prio The I/O priority (from 0 to 7, where 0 is the
+ highest priorty) which should be used for I/O
+ operations submitted by kjournald2 during a
+ commit operation. This defaults to 3, which is
+ a slightly higher priority than the default I/O
+ priority.
+
+auto_da_alloc(*) Many broken applications don't use fsync() when
+noauto_da_alloc replacing existing files via patterns such as
+ fd = open("foo.new")/write(fd,..)/close(fd)/
+ rename("foo.new", "foo"), or worse yet,
+ fd = open("foo", O_TRUNC)/write(fd,..)/close(fd).
+ If auto_da_alloc is enabled, ext4 will detect
+ the replace-via-rename and replace-via-truncate
+ patterns and force that any delayed allocation
+ blocks are allocated such that at the next
+ journal commit, in the default data=ordered
+ mode, the data blocks of the new file are forced
+ to disk before the rename() operation is
+ committed. This provides roughly the same level
+ of guarantees as ext3, and avoids the
+ "zero-length" problem that can happen when a
+ system crashes before the delayed allocation
+ blocks are forced to disk.
+
Data Mode
=========
There are 3 different data modes:
In the event of a crash, the journal can be replayed, bringing both data and
metadata into a consistent state. This mode is the slowest except when data
needs to be read from and written to disk at the same time where it
-outperforms all others modes. Curently ext4 does not have delayed
+outperforms all others modes. Currently ext4 does not have delayed
allocation support if this data journalling mode is selected.
References