SUSE Linux Enterprise Server ships with a number of different file systems from which to choose, including Ext3, Ext2, ReiserFS, and XFS. Each file system has its own advantages and disadvantages.
Professional high-performance setups might require highly available storage systems. To meet the requirements of high-performance clustering scenarios, SUSE Linux Enterprise Server includes OCFS2 (Oracle Cluster File System 2) and the Distributed Replicated Block Device (DRBD) in the SLES High-Availability Storage Infrastructure (HASI) release. These advanced storage systems are not covered in this guide. For information, see the SUSE Linux Enterprise 11 SP2 High Availability Extension Guide.
Metadata: A data structure that is internal to the file system. It assures that all of the on-disk data is properly organized and accessible. Essentially, it is “data about the data.” Almost every file system has its own structure of metadata, which is one reason why file systems show different performance characteristics. It is extremely important to keep metadata intact, because otherwise all data on the file system could become inaccessible.
Inode: A data structure on a file system that contains various information about a file, including size, number of links, pointers to the disk blocks where the file contents are actually stored, and date and time of creation, modification, and access.
Journal: In the context of a file system, a journal is an on-disk structure containing a type of log in which the file system stores what it is about to change in the file system’s metadata. Journaling greatly reduces the recovery time of a file system because it has no need for the lengthy search process that checks the entire file system at system startup. Instead, only the journal is replayed.
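The journaling idea can be illustrated with a toy model. The following Python sketch is not how any real file system implements its journal; the ToyJournalFS class and its methods are hypothetical names invented for this illustration. It records each intended metadata change in a log before applying it, so that after a simulated crash only the log needs to be replayed.

```python
class ToyJournalFS:
    """Toy model of metadata journaling (hypothetical, for illustration only)."""

    def __init__(self):
        self.metadata = {}   # stands in for the on-disk metadata
        self.journal = []    # stands in for the on-disk journal

    def log_change(self, key, value):
        # Step 1: record what is about to change before touching the metadata.
        self.journal.append((key, value))

    def apply_journal(self):
        # Step 2: apply the logged changes to the metadata, then clear the log.
        for key, value in self.journal:
            self.metadata[key] = value
        self.journal.clear()

    def recover(self):
        # After a crash, replay only the journal instead of scanning everything.
        replayed = len(self.journal)
        self.apply_journal()
        return replayed


fs = ToyJournalFS()
fs.log_change("/etc/motd", {"size": 42})
fs.apply_journal()                        # normal operation
fs.log_change("/var/log/a", {"size": 7})  # simulated crash before this is applied
replayed = fs.recover()                   # only the one pending entry is replayed
```

In this model, recovery time depends on the number of pending journal entries, not on the total amount of metadata.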
SUSE Linux Enterprise Server offers a variety of file systems from which to choose. This section contains an overview of how these file systems work and which advantages they offer.
It is very important to remember that no file system best suits all kinds of applications. Each file system has its particular strengths and weaknesses, which must be taken into account. In addition, even the most sophisticated file system cannot replace a reasonable backup strategy.
The terms data integrity and data consistency, when used in this section, do not refer to the consistency of the user space data (the data your application writes to its files). Whether this data is consistent must be controlled by the application itself.
Unless stated otherwise in this section, all the steps required to set up or change partitions and file systems can be performed by using YaST.
BtrFS (Better File System) is a copy-on-write (COW) file system developed by Chris Mason. It is based on COW-friendly B-trees developed by Ohad Rodeh. BtrFS is a logging-style file system. Instead of journaling the block changes, it writes them in a new location, then links the change in. Until the last write, the new changes are not committed.
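The copy-on-write principle can be sketched with a small in-memory model. The ToyCowStore class below is a hypothetical illustration and does not reflect the BtrFS on-disk format or its B-tree layout. New contents go to a fresh location, and a change becomes visible only when the root mapping is switched in the final step.

```python
class ToyCowStore:
    """Toy copy-on-write store (hypothetical; not the BtrFS on-disk format)."""

    def __init__(self):
        self.blocks = {}    # block id -> contents; existing blocks are never overwritten
        self.root = {}      # committed mapping: file name -> block id
        self.next_id = 0

    def write(self, name, data):
        # Write the new contents to a fresh location instead of overwriting.
        block_id = self.next_id
        self.next_id += 1
        self.blocks[block_id] = data
        # Build a new mapping that links the change in, but do not commit it yet.
        pending = dict(self.root)
        pending[name] = block_id
        return pending

    def commit(self, pending):
        # The last write: switching the root makes the change visible.
        self.root = pending

    def read(self, name):
        return self.blocks[self.root[name]]


store = ToyCowStore()
store.commit(store.write("a.txt", "v1"))
snapshot = store.root                 # keeping the old root acts like a snapshot
pending = store.write("a.txt", "v2")
before_commit = store.read("a.txt")   # the change is not committed yet
store.commit(pending)
after_commit = store.read("a.txt")
old_version = store.blocks[snapshot["a.txt"]]
```

Because old blocks are never overwritten, keeping a reference to an earlier root is enough to reach the earlier contents. The old blocks stay allocated for as long as a snapshot refers to them.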
Because BtrFS is capable of storing snapshots of the file system, it is advisable to reserve more disk space (double) than the standard storage proposal.
BtrFS provides fault tolerance, repair, and easy management features, such as the following:
Writable snapshots that allow you to easily roll back your system if needed after applying updates, or to back up files.
Multiple device support that allows you to grow or shrink the file system.
Compression to efficiently use storage space.
Different RAID levels for metadata and user data.
Different checksums for metadata and user data to improve error detection.
Integration with Linux Logical Volume Manager (LVM) storage objects.
Integration with the YaST Partitioner on SUSE Linux.
BtrFS creates a default subvolume in its assigned pool of space. It allows you to create additional subvolumes that act as individual file systems within the same pool of space. The number of subvolumes is limited only by the space allocated to the pool.
If BtrFS is used for the root (/) file system, you can cover any subdirectory as a subvolume as you might normally do. You should also consider covering the following subdirectories in separate subvolumes because they contain files that you might prefer not to snapshot for the reasons given:
The origins of Ext2 go back to the early days of Linux history. Its predecessor, the Extended File System, was implemented in April 1992 and integrated in Linux 0.96c. The Extended File System underwent a number of modifications and, as Ext2, became the most popular Linux file system for years. With the creation of journaling file systems and their short recovery times, Ext2 became less important.
A brief summary of Ext2’s strengths might help understand why it was—and in some areas still is—the favorite Linux file system of many Linux users.
Being quite an “old-timer,” Ext2 underwent many improvements and was heavily tested. This might be the reason why people often refer to it as rock-solid. After a system outage when the file system could not be cleanly unmounted, e2fsck starts to analyze the file system data. Metadata is brought into a consistent state and pending files or data blocks are written to a designated directory (called lost+found). In contrast to journaling file systems, e2fsck analyzes the entire file system and not just the recently modified bits of metadata. This takes significantly longer than checking the log data of a journaling file system. Depending on file system size, this procedure can take half an hour or more. Therefore, it is not desirable to choose Ext2 for any server that needs high availability. However, because Ext2 does not maintain a journal and uses significantly less memory, it is sometimes faster than other file systems.
Because Ext3 is based on the Ext2 code and shares its on-disk format as well as its metadata format, upgrades from Ext2 to Ext3 are very easy.
Ext3 was designed by Stephen Tweedie. Unlike all other next-generation file systems, Ext3 does not follow a completely new design principle. It is based on Ext2. These two file systems are very closely related to each other. An Ext3 file system can be easily built on top of an Ext2 file system. The most important difference between Ext2 and Ext3 is that Ext3 supports journaling. In summary, Ext3 has three major advantages to offer:
The code for Ext2 is the strong foundation on which Ext3 could become a highly acclaimed next-generation file system. Its reliability and solidity are elegantly combined in Ext3 with the advantages of a journaling file system. Unlike transitions to other journaling file systems, such as ReiserFS or XFS, which can be quite tedious (making backups of the entire file system and re-creating it from scratch), a transition to Ext3 is a matter of minutes. It is also safer, because re-creating an entire file system from scratch might not work flawlessly. Considering the number of existing Ext2 systems that await an upgrade to a journaling file system, you can easily see why Ext3 might be of some importance to many system administrators. Downgrading from Ext3 to Ext2 is as easy as the upgrade. Just perform a clean unmount of the Ext3 file system and remount it as an Ext2 file system.
Some other journaling file systems follow the “metadata-only” journaling approach. This means your metadata is always kept in a consistent state, but this cannot be automatically guaranteed for the file system data itself. Ext3 is designed to take care of both metadata and data. The degree of “care” can be customized. Enabling Ext3 in the data=journal mode offers maximum security (data integrity), but can slow down the system because both metadata and data are journaled. A relatively new approach is to use the data=ordered mode, which ensures both data and metadata integrity, but uses journaling only for metadata. The file system driver collects all data blocks that correspond to one metadata update. These data blocks are written to disk before the metadata is updated. As a result, consistency is achieved for metadata and data without sacrificing performance. A third option is data=writeback, which allows data to be written into the main file system after its metadata has been committed to the journal. This option is often considered the best in performance. It can, however, allow old data to reappear in files after a crash and recovery, while internal file system integrity is maintained. Ext3 uses the data=ordered option as the default.
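The journaling mode is selected with the data= mount option. As a minimal sketch, the following Python helper builds an /etc/fstab line for an Ext3 file system with a given mode. The helper name ext3_fstab_entry, the device /dev/sdb1, the mount point /data, and the trailing dump and pass values are assumptions chosen for illustration only.

```python
VALID_DATA_MODES = ("journal", "ordered", "writeback")


def ext3_fstab_entry(device, mount_point, data_mode="ordered"):
    """Build an fstab line for Ext3 with the given data= mode (illustrative helper)."""
    if data_mode not in VALID_DATA_MODES:
        raise ValueError(f"unknown Ext3 data mode: {data_mode!r}")
    # Fields: device, mount point, file system type, options, dump, fsck pass.
    return f"{device} {mount_point} ext3 data={data_mode} 0 2"


# Hypothetical device and mount point.
entry = ext3_fstab_entry("/dev/sdb1", "/data", "journal")
default_entry = ext3_fstab_entry("/dev/sdb1", "/data")
```

Omitting the mode yields data=ordered, which matches the Ext3 default described above.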
To convert an Ext2 file system to Ext3:
Create an Ext3 journal by running tune2fs -j as the root user.
This creates an Ext3 journal with the default parameters.
To specify how large the journal should be and on which device it should reside, run tune2fs -J instead together with the desired journal options size= and device=. More information about the tune2fs program is available in the tune2fs man page.
Edit the file /etc/fstab as the root user to change the file system type specified for the corresponding partition from ext2 to ext3, then save the changes.
This ensures that the Ext3 file system is recognized as such. The change takes effect after the next reboot.
To boot a root file system that is set up as an Ext3 partition, include the modules ext3 and jbd in the initrd.
Edit /etc/sysconfig/kernel as root, adding ext3 and jbd to the INITRD_MODULES variable, then save the changes.
Run the mkinitrd command.
This builds a new initrd and prepares it for use.
Reboot the system.
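The /etc/fstab edit in the procedure above can be sketched as a small text transformation. This is an illustration under stated assumptions: the function name convert_fstab_entry and the sample entries are hypothetical, and the code works on a string instead of the real file. It does not replace running tune2fs -j, which creates the journal.

```python
def convert_fstab_entry(fstab_text, mount_point):
    """Change the type field from ext2 to ext3 for one mount point.

    Note: the changed line is rejoined with single spaces, so any column
    alignment of that line is lost. Other lines are left untouched.
    """
    result = []
    for line in fstab_text.splitlines():
        fields = line.split()
        is_entry = len(fields) >= 3 and not line.lstrip().startswith("#")
        if is_entry and fields[1] == mount_point and fields[2] == "ext2":
            fields[2] = "ext3"
            line = " ".join(fields)
        result.append(line)
    return "\n".join(result)


# Hypothetical sample entries.
sample = "/dev/sda1 /boot ext2 defaults 1 2\n/dev/sda2 /home ext2 defaults 1 2"
converted = convert_fstab_entry(sample, "/home")
```

Only the entry for the given mount point is changed. Comment lines and entries with other file system types pass through unmodified.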
Officially one of the key features of the 2.4 kernel release, ReiserFS has been available as a kernel patch for 2.2.x SUSE kernels since version 6.4. ReiserFS was designed by Hans Reiser and the Namesys development team. It has proven itself to be a powerful alternative to Ext2. Its key assets are better disk space utilization, better disk access performance, faster crash recovery, and reliability through data journaling.
The ReiserFS file system is fully supported for the lifetime of SUSE Linux Enterprise Server 11 specifically for migration purposes. SUSE plans to remove support for creating new ReiserFS file systems starting with SUSE Linux Enterprise Server 12.
In ReiserFS, all data is organized in a structure called a B*-balanced tree. The tree structure contributes to better disk space utilization because small files can be stored directly in the B* tree leaf nodes instead of being stored elsewhere and just maintaining a pointer to the actual disk location. In addition to that, storage is not allocated in chunks of 1 or 4 KB, but in portions of the exact size needed. Another benefit lies in the dynamic allocation of inodes. This keeps the file system more flexible than traditional file systems, like Ext2, where the inode density must be specified at file system creation time.
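To see why block-based allocation wastes space for small files, consider the following arithmetic sketch. The file sizes are made-up sample values and the function name slack_bytes is hypothetical. The sketch only counts the unused remainder of the last block of each file. It does not model ReiserFS itself or any bookkeeping overhead.

```python
def slack_bytes(file_sizes, block_size):
    """Bytes allocated but unused when every file occupies whole blocks."""
    wasted = 0
    for size in file_sizes:
        blocks = -(-size // block_size)   # ceiling division
        wasted += blocks * block_size - size
    return wasted


small_files = [100, 700, 1500]            # sample sizes in bytes
slack_4k = slack_bytes(small_files, 4096)
slack_1k = slack_bytes(small_files, 1024)
```

For these three sample files, which hold 2300 bytes of data in total, 4 KB blocks leave 9988 bytes unused and 1 KB blocks leave 1796 bytes unused. Allocating portions of the exact size needed avoids this kind of slack.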
For small files, file data and “stat_data” (inode) information are often stored next to each other. They can be read with a single disk I/O operation, meaning that only one access to disk is required to retrieve all the information needed.
Using a journal to keep track of recent metadata changes makes a file system check a matter of seconds, even for huge file systems.
ReiserFS also supports data journaling and ordered data modes similar to the concepts outlined in Section 1.2.3, “Ext3”. The default mode is data=ordered, which ensures both data and metadata integrity, but uses journaling only for metadata.
Originally intended as the file system for their IRIX OS, SGI started XFS development in the early 1990s. The idea behind XFS was to create a high-performance 64-bit journaling file system to meet extreme computing challenges. XFS is very good at manipulating large files and performs well on high-end hardware. However, even XFS has a drawback. Like ReiserFS, XFS takes great care of metadata integrity, but less care of data integrity.
A quick review of XFS’s key features explains why it might prove to be a strong competitor for other journaling file systems in high-end computing.
At the creation time of an XFS file system, the block device underlying the file system is divided into eight or more linear regions of equal size. Those are referred to as allocation groups. Each allocation group manages its own inodes and free disk space. Practically, allocation groups can be seen as file systems in a file system. Because allocation groups are rather independent of each other, more than one of them can be addressed by the kernel simultaneously. This feature is the key to XFS’s great scalability. Naturally, the concept of independent allocation groups suits the needs of multiprocessor systems.
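The division into allocation groups can be pictured with a short calculation. The sketch below is a simplified model, not the XFS implementation: the function names are hypothetical, the device size is a sample value, and any remainder blocks at the end of the device are ignored.

```python
def split_into_allocation_groups(device_blocks, group_count=8):
    """Return (first_block, end_block) ranges of equal size; end_block is exclusive."""
    if group_count < 1:
        raise ValueError("group_count must be at least 1")
    group_size = device_blocks // group_count
    return [(i * group_size, (i + 1) * group_size) for i in range(group_count)]


def group_of_block(block, device_blocks, group_count=8):
    """Return the index of the allocation group that contains a block."""
    group_size = device_blocks // group_count
    return block // group_size


groups = split_into_allocation_groups(1_000_000)   # sample device size in blocks
owner = group_of_block(300_000, 1_000_000)
```

Because each group covers its own block range and manages its own inodes and free space, operations that fall into different groups do not need to share that bookkeeping.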
Free space and inodes are handled by B+ trees inside the allocation groups. The use of B+ trees greatly contributes to XFS’s performance and scalability. XFS uses delayed allocation, which handles allocation by breaking the process into two pieces. First, a pending transaction is stored in RAM and the appropriate amount of space is reserved. At this point, XFS has not yet decided where exactly (in file system blocks) the data should be stored. This decision is delayed until the last possible moment. Some short-lived temporary data might never make its way to disk, because it is obsolete by the time XFS decides where to save it. In this way, XFS increases write performance and reduces file system fragmentation. Because delayed allocation results in less frequent write events than in other file systems, it is likely that data loss after a crash during a write is more severe.
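The two pieces of delayed allocation can be illustrated with a toy model. The ToyDelayedAllocator class is hypothetical and greatly simplified: it tracks block counts only, places each file in one contiguous extent, and its delete method handles only data that has not been flushed yet.

```python
class ToyDelayedAllocator:
    """Toy model of delayed allocation (hypothetical, for illustration only)."""

    def __init__(self, free_blocks):
        self.free_blocks = free_blocks   # blocks that are not reserved yet
        self.pending = {}                # name -> blocks reserved, data still in RAM
        self.on_disk = {}                # name -> (start_block, length)
        self.next_start = 0

    def write(self, name, blocks):
        # Piece 1: reserve the space, but do not choose a location yet.
        if blocks > self.free_blocks:
            raise OSError("no space left on device")
        self.free_blocks -= blocks
        self.pending[name] = self.pending.get(name, 0) + blocks

    def delete(self, name):
        # Short-lived data never reaches the disk; its reservation is released.
        self.free_blocks += self.pending.pop(name, 0)

    def flush(self):
        # Piece 2: decide the location at the last moment, one extent per file.
        for name, blocks in self.pending.items():
            self.on_disk[name] = (self.next_start, blocks)
            self.next_start += blocks
        self.pending.clear()


alloc = ToyDelayedAllocator(free_blocks=100)
alloc.write("tmp", 10)
alloc.write("log", 4)
alloc.write("log", 6)    # a second write to the same file before the flush
alloc.delete("tmp")      # obsolete before any location was chosen
alloc.flush()
```

In this model, the two writes to "log" end up in a single extent, and "tmp" is never written at all. The flip side is also visible: everything in pending would be lost if the system crashed before flush.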
Before writing the data to the file system, XFS reserves (preallocates) the free space needed for a file. Thus, file system fragmentation is greatly reduced. Performance is increased because the contents of a file are not distributed all over the file system.
For a side-by-side feature comparison of the major file systems in SUSE Linux Enterprise Server, see File System Support and Sizes on the SUSE Linux Enterprise Server Technical Information Web site.
Table 1.1, “File System Types in Linux” summarizes some other file systems supported by Linux. They are supported mainly to ensure compatibility and interchange of data with different kinds of media or foreign operating systems.
Table 1.1. File System Types in Linux
Originally, Linux supported a maximum file size of 2 GiB (2^31 bytes). Currently all of our standard file systems have LFS (large file support), which gives a maximum file size of 2^63 bytes in theory. The numbers given in the following table assume that the file systems are using 4 KiB block size. When using different block sizes, the results are different, but 4 KiB reflects the most common standard.
In this document: 1024 Bytes = 1 KiB; 1024 KiB = 1 MiB; 1024 MiB = 1 GiB; 1024 GiB = 1 TiB; 1024 TiB = 1 PiB; 1024 PiB = 1 EiB (see also NIST: Prefixes for Binary Multiples).
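The size limits mentioned above can be checked with the unit definitions from this document. The helper to_binary_unit is a hypothetical name used for this sketch. It uses integer division, so it is exact only for sizes that are whole multiples of the resulting unit.

```python
UNITS = ["Bytes", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB"]


def to_binary_unit(size_in_bytes):
    """Express a byte count in the largest binary unit below 1024 (truncating)."""
    value = size_in_bytes
    for unit in UNITS:
        if value < 1024 or unit == UNITS[-1]:
            return value, unit
        value //= 1024


original_limit = to_binary_unit(2**31)   # limit without large file support
lfs_limit = to_binary_unit(2**63)        # theoretical limit with LFS
```

The results are (2, "GiB") for the original limit and (8, "EiB") for the theoretical limit with large file support.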
Table 1.2, “Maximum Sizes of Files and File Systems (On-Disk Format)” offers an overview of the current limitations of Linux files and file systems.
Table 1.2. Maximum Sizes of Files and File Systems (On-Disk Format)
Table 1.2, “Maximum Sizes of Files and File Systems (On-Disk Format)” describes the limitations regarding the on-disk format. The Linux kernel imposes its own limits on the size of files and file systems handled by it. These are as follows:
You can use the YaST2 Partitioner to create and manage file systems and RAID devices. For information, see “Advanced Disk Setup” in the SUSE Linux Enterprise Server 11 SP2 Deployment Guide.
Each of the file system projects described above maintains its own home page on which to find mailing list information, further documentation, and FAQs:
A comprehensive multipart tutorial about Linux file systems can be found at IBM developerWorks in the Advanced File System Implementor’s Guide.
An in-depth comparison of file systems (not only Linux file systems) is available from the Wikipedia project in Comparison of File Systems.