NTFS Directories and Files
Yes, NTFS volumes have directories and files. Isn’t that good to know? :^) Well, you probably want to learn a bit more about them than that, I am sure, and in this part of the NTFS guide I will endeavor to do just that. If you are experienced with the FAT file system used in other versions of Windows, then as a user of NTFS partitions, you will find much that is familiar in the way directories and files are used. However, internally, NTFS stores and manages directories and files in a rather different way than FAT does.
In this section I will explore the fundamentals of NTFS directories and files. I will begin with a look at directories and how they are stored on NTFS volumes. I will then discuss user data files in some detail, including a look at how files are stored and named, and what their maximum size can be. I will then describe some of the more common standard attributes associated with files. Finally, I will discuss reparse points, a special enhanced feature present in NTFS 5.0 under Windows 2000.
NTFS Directories (Folders)
From an external, structural perspective, NTFS generally employs the same methods for organizing files and directories as the FAT file system (and most other modern file systems as well). This is usually called the hierarchical or directory tree model. The “base” of the directory structure is the root directory, which is actually one of the key system metadata files on an NTFS volume. Within this root directory, references are stored to files, or to other directories. Each directory can in turn store any combination of files or more sub-directories, allowing you to create an arbitrary tree structure. I describe these general concepts in more detail on this page discussing the FAT file system.
Note: Directories are also often called folders.
While NTFS is similar to FAT in its hierarchical structuring of directories, it is very different in how they are managed internally. One of the key differences is that in FAT volumes, directories are responsible for storing most of the key information about files; the files themselves contain only data. In NTFS, files are collections of attributes, so they contain their own descriptive information, as well as their own data. An NTFS directory pretty much stores only information about the directory itself, not about the files within the directory.
Everything within NTFS is considered a file, and that applies to directories as well. Each directory has an entry in the Master File Table, which serves as the main repository of information for the directory. The MFT record for the directory contains the following information and NTFS attributes:
- Header (H): This is a set of low-level management data used by NTFS to manage the directory. It includes sequence numbers used internally by NTFS and pointers to the directory’s attributes and free space within the record. (Note that the header is part of the MFT record but not an attribute.)
- Standard Information Attribute (SI): This attribute contains “standard” information stored for all files and directories. This includes fundamental properties such as date/time-stamps for when the directory was created, modified and accessed. It also contains the “standard” attributes usually associated with a file (such as whether the file is read-only, hidden, and so on.)
- File Name Attribute (FN): This attribute stores the name associated with the directory. Note that a directory can have multiple file name attributes, to allow the storage of the “regular” name of the file, along with an MS-DOS short filename alias and also POSIX-like hard links from multiple directories. See here for more on NTFS file naming.
- Index Root Attribute: This attribute contains the actual index of files contained within the directory, or part of the index if it is large. If the directory is small, the entire index will fit within this attribute in the MFT; if it is too large, some of the information is here and the rest is stored in external index buffer attributes, as described below.
- Index Allocation Attribute: If a directory index is too large to fit in the index root attribute, the MFT record for the directory will contain an index allocation attribute, which contains pointers to index buffer entries containing the rest of the directory’s index information.
- Security Descriptor (SD) Attribute: This attribute contains security information that controls access to the directory and its contents. The directory’s Access Control Lists (ACLs) and related data are stored here.
So in a nutshell, small directories are stored entirely within their MFT entries, just like small files are. Larger ones have their information broken into multiple data records that are referenced from the root entry for the directory in the MFT. However, NTFS stores these index entries in a special way compared to traditional PC file systems. The FAT file system uses a simple linked-list arrangement for storing large directories: the first few files are listed in the first cluster of the directory, and then the next files go into the next cluster, which is linked to the first, and so on. This is simple to implement, but means that every time you look at the directory you must scan it from start to end and then sort it for presentation to the user. It also makes it time-consuming to locate individual files in the index, especially with very large directories.
To improve performance, NTFS directories use a special data management structure called a B-tree. This is a concept taken from relational database design. In brief terms, a B-tree is a balanced tree structure that keeps its entries in sorted order and keeps all of its branches the same depth, so that any entry can be reached in a small number of steps. It’s kind of hard to explain what B-trees are without getting far afield, so if you want to learn more about them, try this page. (Note that the “B-tree” concept here refers to a tree of storage units that hold the contents of an individual directory; it is a different concept entirely from that of the “directory tree”, a logical tree of directories themselves.)
From a practical standpoint, the use of B-trees means that the directories are essentially “self-sorting”. There is a bit more overhead involved when adding files to an NTFS directory, because they must be placed in this special structure. However, the payoff occurs when the directories are used. The time required to find a particular file under NTFS is dramatically reduced compared to an unsorted linked-list structure–especially for very large directories.
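To get a feel for the difference, here is a minimal sketch that counts comparisons for the two approaches. It is not a real B-tree: a sorted Python list with binary search stands in for the ordered index, and a plain front-to-back scan stands in for the unsorted linked-list directory. The file names and the directory size of 10,000 entries are invented for illustration.

```python
def linear_lookup(entries, name):
    """Scan an unsorted list front to back, counting comparisons."""
    comparisons = 0
    for entry in entries:
        comparisons += 1
        if entry == name:
            return True, comparisons
    return False, comparisons


def sorted_lookup(sorted_entries, name):
    """Binary search over a sorted list, counting probes.

    A stand-in for an ordered index; a real B-tree gets the same
    logarithmic behavior with nodes sized to disk blocks.
    """
    lo, hi = 0, len(sorted_entries)
    comparisons = 0
    while lo < hi:
        mid = (lo + hi) // 2
        comparisons += 1
        if sorted_entries[mid] < name:
            lo = mid + 1
        else:
            hi = mid
    found = lo < len(sorted_entries) and sorted_entries[lo] == name
    return found, comparisons


# Hypothetical directory of 10,000 files.
names = [f"file{i:05d}.txt" for i in range(10000)]
unsorted_dir = list(reversed(names))   # worst case for the scan below
target = "file00000.txt"

found_linear, linear_cost = linear_lookup(unsorted_dir, target)
found_sorted, sorted_cost = sorted_lookup(sorted(names), target)
print(linear_cost, sorted_cost)
```

The scan has to touch every entry in the worst case, while the ordered lookup needs only about log2(n) probes, which is the payoff described above.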
NTFS Files and Data Storage
As with most file systems, the fundamental unit of storage in NTFS from the user’s perspective is the file. A file is just a collection of any sort of data, and can contain anything: programs, text files, audio clips, database records–and thousands of other kinds of information. The operating system doesn’t distinguish between types of files. The use of a particular file depends on how it is interpreted by applications that use it.
Within NTFS, all files are stored in pretty much the same way: as a collection of attributes. This includes the data in the file itself, which is just another attribute: the “data attribute”, technically. Note that to understand how NTFS stores files, one must first understand the basics of NTFS architecture, and in particular, it’s good to comprehend what the Master File Table is and how it works. You may also wish to review the discussion of NTFS attributes, because understanding the difference between resident and non-resident attributes is important to making any sense at all of the rest of this page. ;^)
The way that data is stored in files in NTFS depends on the size of the file. The core structure of each file is based on the following information and attributes that are stored for each file:
- Header (H): The header in the MFT is a set of low-level management data used by NTFS to manage the file. It includes sequence numbers used internally by NTFS and pointers to the file’s other attributes and free space within the record. (Note that the header is part of the MFT record but not an attribute.)
- Standard Information Attribute (SI): This attribute contains “standard” information stored for all files and directories. This includes fundamental properties such as date/time-stamps for when the file was created, modified and accessed. It also contains the “standard” FAT-like attributes usually associated with a file (such as whether the file is read-only, hidden, and so on.)
- File Name Attribute (FN): This attribute stores the name associated with the file. Note that a file can have multiple file name attributes, to allow the storage of the “regular” name of the file, along with an MS-DOS short filename alias and also POSIX-like hard links from multiple directories. See here for more on NTFS file naming.
- Data (Data) Attribute: This attribute stores the actual contents of the file.
- Security Descriptor (SD) Attribute: This attribute contains security information that controls access to the file. The file’s Access Control Lists (ACLs) and related data are stored here.
These are the basic attributes; others may also be associated with a file (see this full discussion of attributes for details). If a file is small enough that all of its attributes can fit within the MFT record for the file, it is stored entirely within the MFT. Whether this happens or not depends largely on the size of the MFT records used on the volume. If the file is too large for all of the attributes to fit in the MFT, NTFS begins a series of “expansions” that move attributes out of the MFT and make them non-resident. The sequence of steps taken is something like this:
- First, NTFS will attempt to store the entire file in the MFT entry, if possible. This will generally happen only for rather small files.
- If the file is too large to fit in the MFT record, the data attribute is made non-resident. The entry for the data attribute in the MFT contains pointers to data runs (also called extents) which are blocks of data stored in contiguous sections of the volume, outside the MFT.
- The file may become so large that there isn’t even room in the MFT record for the list of pointers in the data attribute. If this happens, the list of data attribute pointers is itself made non-resident. Such a file will have no data attribute in its main MFT record; instead, a pointer is placed in the main MFT record to a second MFT record that contains the data attribute’s list of pointers to data runs.
- NTFS will continue to extend this flexible structure if very large files are created. It can create multiple non-resident MFT records if needed to store a great number of pointers to different data runs. Obviously, the larger the file, the more complex the file storage structure becomes.
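The sequence above can be sketched as a simple decision function. This is only a rough model under stated assumptions: the record size, the space reserved for the other attributes, and the size of one run pointer are all invented numbers, and the function name storage_form is hypothetical. The real on-disk encoding is more involved.

```python
# All three constants are assumptions chosen for illustration only.
RECORD_SIZE = 1024       # assumed size of one MFT record, in bytes
FIXED_OVERHEAD = 400     # assumed space taken by header, SI, FN and SD
RUN_POINTER_SIZE = 8     # assumed size of one (start, length) run pointer


def storage_form(data_size: int, run_count: int) -> str:
    """Pick the storage form for a file's data attribute.

    data_size: bytes of file data.
    run_count: number of data runs (extents) the data would occupy.
    """
    free = RECORD_SIZE - FIXED_OVERHEAD
    if data_size <= free:
        # Step 1: the whole file fits inside its MFT record.
        return "resident"
    if run_count * RUN_POINTER_SIZE <= free:
        # Step 2: data lives in runs; the run list stays in the record.
        return "non-resident, run list in base record"
    # Steps 3 and 4: even the run list has to move out of the base record.
    return "non-resident, run list in extra MFT record(s)"


print(storage_form(200, 0))
print(storage_form(1_000_000, 3))
print(storage_form(10**9, 500))
```

With these made-up numbers, a 200-byte file stays resident, a 1 MB file in three runs keeps its run list in the base record, and a heavily fragmented file with 500 runs spills its run list into additional records.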
The data runs (extents) are where most file data in an NTFS volume is stored. These runs consist of blocks of contiguous clusters on the disk. The pointers in the data attribute(s) for the file contain a reference to the start of the run, and also the number of clusters in the run. The start of each run on the volume is identified using a logical cluster number or LCN, while the run’s position within the file is expressed as a virtual cluster number or VCN. The use of a “pointer+length” scheme means that under NTFS, it is not necessary to read each cluster of the file in order to determine where the next one in the file is located. This method also reduces fragmentation of files compared to the FAT setup.
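As a small illustration of the “pointer+length” idea, the sketch below maps a cluster index within a file to a cluster number on the volume by walking a list of runs. The run list and the function name locate_cluster are made up for the example; each run is modeled as a (start cluster on the volume, number of clusters) pair.

```python
def locate_cluster(runs, file_cluster):
    """Map a cluster index within the file to a cluster number on the volume.

    runs: list of (start_cluster_on_volume, cluster_count) pairs,
          in file order.
    """
    if file_cluster < 0:
        raise IndexError("cluster index must not be negative")
    skipped = 0  # clusters of the file covered by earlier runs
    for start, count in runs:
        if file_cluster < skipped + count:
            return start + (file_cluster - skipped)
        skipped += count
    raise IndexError("cluster index is past the end of the file")


# Hypothetical file stored in three runs (16 clusters in total).
runs = [(1000, 4), (5200, 2), (300, 10)]

print(locate_cluster(runs, 0))   # first cluster of the first run
print(locate_cluster(runs, 5))   # second cluster of the second run
```

Note that only the short run list is consulted; none of the file's data clusters has to be read to find where the next one lives.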
NTFS File Size
One of the most important limiting issues for using serious business applications–especially databases–under consumer Windows operating systems and the FAT file system is the relatively small maximum file size. In some situations the maximum file size is 4 GiB, and for others it is 2 GiB. While this seems at first glance to be fairly large, in fact, neither is even close to being adequate for the needs of today’s business environment computing. Even on my own home PC I occasionally run up against this limit when doing backups to hard disk files.
In the page describing how data is stored in NTFS files, I explained the way that NTFS first attempts to store files entirely within the MFT record for the file. If the file is too big, it extends the file’s data using structures such as external attributes and data runs. This flexible system allows files to be extended in size virtually indefinitely. In fact, under NTFS, there is no practical maximum file size. A single file can be made to take up the entire contents of a volume (less the space used for the MFT itself and other internal structures and overhead.)
NTFS also includes some features that can be used to more efficiently store very large files. One is file-based compression, which can be used to let large files take up significantly less space. Another is support for sparse files, which is especially well-suited for certain applications that use large files that have non-zero data in only a few locations.
NTFS File Naming
Microsoft’s early operating systems were very inflexible when it came to naming files. The DOS convention of eight characters for the file name and three characters for the file extension–the so-called “8.3 standard”–was very restrictive. Compared to the naming abilities of competitors such as UNIX and the Apple Macintosh, 8.3 naming was simply unacceptable. To solve this problem, when NTFS was created, Microsoft gave it greatly expanded file naming capabilities.
The following are the characteristics of regular file names (and directory names as well) in the NTFS file system:
- Length: Regular file names can be up to 255 characters in NTFS.
- Case: Mixed case is allowed in NTFS file names, and NTFS will preserve the mixed case, but references to file names are case-insensitive. An example will make this much more clear. :^) Suppose you name a file “4Q Results.doc” on an NTFS volume. When you list the directory containing this file, you will see “4Q Results.doc”. However, you can refer to that file by both the name you gave, as well as “4q results.doc”, “4q ReSulTS.dOc”, and so on.
- Characters: Names can contain any characters, including spaces, except the following (which are reserved because they are generally used as file name or operating system delimiters or operators): ? " / \ < > * | :
- Unicode Storage: All NTFS file names are stored in a format called Unicode. Recall that conventional storage for characters in computers is based on the ASCII character set, which uses one byte (actually, 7 bits) to represent the hundred or so “regular” characters used in Western languages. However, a single byte can only hold a couple of hundred different values, which is insufficient for the needs of many languages, especially Asian ones. Unicode is an international, 16-bit character representation format that allows for thousands of different characters to be stored. Unicode is supported throughout NTFS.
Tip: For more information about Unicode, see this web site.
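The length, character and case rules listed above can be modeled in a few lines. This is a sketch of those three rules only, not of everything Windows checks when you create a file. The Directory class is hypothetical, and Python's str.upper() is used as a stand-in for whatever case folding the file system really applies.

```python
RESERVED = set('?"/\\<>*|:')   # the reserved characters listed above
MAX_NAME_LENGTH = 255


def is_valid_name(name):
    """Check a name against the length and reserved-character rules."""
    return 0 < len(name) <= MAX_NAME_LENGTH and not (set(name) & RESERVED)


class Directory:
    """Case-preserving, case-insensitive name table (illustrative only)."""

    def __init__(self):
        self._by_folded = {}   # folded name -> name as originally typed

    def add(self, name):
        if not is_valid_name(name):
            raise ValueError(f"invalid name: {name!r}")
        key = name.upper()     # assumption: simple uppercase folding
        if key in self._by_folded:
            raise FileExistsError(name)
        self._by_folded[key] = name

    def lookup(self, name):
        """Return the stored spelling, or None if there is no such name."""
        return self._by_folded.get(name.upper())

    def listing(self):
        return sorted(self._by_folded.values())


d = Directory()
d.add("4Q Results.doc")
print(d.listing())                    # shows the name as it was typed
print(d.lookup("4q ReSulTS.dOc"))     # any casing finds the same file
```

The table stores the name exactly as it was typed but indexes it by a folded form, which is why the listing preserves mixed case while lookups ignore it.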
You may recall that when Windows 95’s VFAT file system introduced long file names to Microsoft’s consumer operating systems, it provided for an aliasing feature. The file system automatically creates a short file name (“8.3”) alias of all long file names, for use by older software written before long file names were introduced. NTFS does something very similar. It also creates a short file name alias for all long file names, for compatibility with older software. (If the file name given to the file or directory is short enough to fit within the “8.3”, no alias is created, since it is not needed). It’s important to realize, however, that the similarities between VFAT and NTFS long file names are mostly superficial. Unlike the VFAT file system’s implementation of long file names, NTFS’s implementation is not a kludge added after the fact. NTFS was designed from the ground up to allow for long file names.
File names are stored in the file name attribute for every file (or directory), in the Master File Table. (No big surprise there!) In fact, NTFS supports the existence of multiple file name attributes within each file’s MFT record. One of these is used for the regular name of the file, and if a short MS-DOS alias file name is created, it goes in a second file name attribute. Further, NTFS supports the creation of hard links as part of its POSIX compliance. Hard links represent multiple names for a single file, in different directories. These links are each stored in separate file name attributes. (This is a limited implementation of the very flexible naming system used in UNIX file systems.)
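Hard links are easy to try out from a script. The sketch below uses Python's standard os.link call, which creates a hard link on any file system that supports them; the directory and file names are invented for the example, and it works in a temporary directory so nothing is left behind.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Two separate directories, each of which will hold a name for the file.
    dir_a = os.path.join(root, "a")
    dir_b = os.path.join(root, "b")
    os.mkdir(dir_a)
    os.mkdir(dir_b)

    original = os.path.join(dir_a, "report.txt")
    alias = os.path.join(dir_b, "same-report.txt")

    with open(original, "w") as f:
        f.write("quarterly numbers")

    # Create a second name, in a different directory, for the same file.
    os.link(original, alias)

    same_file = os.path.samefile(original, alias)
    link_count = os.stat(original).st_nlink
    with open(alias) as f:
        content_via_alias = f.read()

print(same_file, link_count, content_via_alias)
```

Both names refer to one file: the data is stored once, and the link count reported for the file reflects the two names.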
NTFS File Attributes
As I mention in many places in this discussion of NTFS, almost everything in NTFS is a file, and files are implemented as collections of attributes. Attributes are just chunks of information of various sorts–the meaning of the information in an attribute depends on how software interprets and uses the bits it contains. Directories are stored in the same general way as files; they just have different attributes that are used in a different manner by the file system.
All file (and directory) attributes are stored in one of two different ways, depending on the characteristics of the attribute–especially, its size. The following are the methods that NTFS will use to store attributes:
- Resident Attributes: Attributes that require a relatively small amount of storage space are stored directly within the file’s primary MFT record itself. These are called resident attributes. Many of the simplest and most common file attributes are stored resident in the MFT file. In fact, some are required by NTFS to be resident in the MFT record for proper operation. For example, the name of the file, and its creation, modification and access date/time-stamps are resident for every file.
- Non-Resident Attributes: If an attribute requires more space than is available within the MFT record, it is not stored in that record, obviously. Instead, the attribute is placed in a separate location. A pointer is placed within the MFT that leads to the location of the attribute. This is called non-resident attribute storage.
In practice, only the smallest attributes can fit into MFT records, since the records are rather small. Many other attributes will be stored non-resident, especially the data of the file, which is also an attribute. Non-resident storage can itself take two forms. If the attribute doesn’t fit in the MFT but pointers to the data do fit, then the data is placed in a data run, also called an extent, outside the MFT, and a pointer to the run is placed in the file’s MFT record. In fact, an attribute can be stored in many different runs, each with a separate pointer. If the file has so many extents that even the pointers to them won’t fit, the entire data attribute may be moved to an external attribute in a separate MFT record entry, or even multiple external attributes. See the discussion of file storage for more details on this expansion mechanism.
NTFS comes with a number of predefined attributes, sometimes called system defined attributes. Some are associated with only one type of structure, while others are associated with more than one. Here’s a list, in alphabetical order, of the most common NTFS system defined attributes:
- Attribute List: This is a “meta-attribute”: an attribute that describes other attributes. If it is necessary for an attribute to be made non-resident, this attribute is placed in the original MFT record to act as a pointer to the non-resident attribute.
- Bitmap: Contains the cluster allocation bitmap. Used by the $Bitmap metadata file.
- Data: Contains file data. By default, all the data in a file is stored in a single data attribute–even if that attribute is broken into many pieces due to size, it is still one attribute–but there can be multiple data attributes for special applications.
- Extended Attribute (EA) and Extended Attribute Information: These are special attributes that are implemented for compatibility with OS/2 use of NTFS partitions. They are not used by Windows NT/2000 to my knowledge.
- File Name (FN): This attribute stores a name associated with a file or directory. Note that a file or directory can have multiple file name attributes, to allow the storage of the “regular” name of the file, along with an MS-DOS short filename alias and also POSIX-like hard links from multiple directories. See here for more on NTFS file naming.
- Index Root Attribute: This attribute contains the actual index of files contained within a directory, or part of the index if it is large. If the directory is small, the entire index will fit within this attribute in the MFT; if it is too large, some of the information is here and the rest is stored in external index buffer attributes.
- Index Allocation Attribute: If a directory index is too large to fit in the index root attribute, the MFT record for the directory will contain an index allocation attribute, which contains pointers to index buffer entries containing the rest of the directory’s index information.
- Security Descriptor (SD): This attribute contains security information that controls access to a file or directory. Access Control Lists (ACLs) and related data are stored in this attribute. File ownership and auditing information is also stored here.
- Standard Information (SI): Contains “standard information” for all files and directories. This includes fundamental properties such as date/time-stamps for when the file was created, modified and accessed. It also contains the “standard” FAT-like attributes usually associated with a file (such as whether the file is read-only, hidden, and so on.)
- Volume Name, Volume Information, and Volume Version: These three attributes store key name, version and other information about the NTFS volume. Used by the $Volume metadata file.
Note: For more detail on how the attributes associated with files work, see the page on file storage; for directories, the page on directories.
In addition to these system defined attributes, NTFS also supports the creation of “user-defined” attributes. This name is a bit misleading, however, since the term “user” is really given from Microsoft’s perspective! A “user” in this context means an application developer–programs can create their own file attributes, but actual NTFS users generally cannot.
NTFS Reparse Points
One of the most interesting new capabilities added to NTFS version 5 with the release of Windows 2000 was the ability to create special file system functions and associate them with files or directories. This enables the functionality of the NTFS file system to be enhanced and extended dynamically. The feature is implemented using objects that are called reparse points.
The use of reparse points begins with applications. An application that wants to use the feature stores data specific to the application–which can be any sort of data at all–into a reparse point. The reparse point is tagged with an identifier specific to the application and stored with the file or directory. A special application-specific filter (a driver of sorts) is also associated with the reparse point tag type and made known to the file system. A given file or directory carries only one reparse point at a time, and its tag identifies which application’s filter should handle it. Microsoft themselves reserved several different tags for their own use.
Now, let’s suppose that the user decides to access a file that has been tagged with a reparse point. When the file system goes to open the file, it notices the reparse point associated with the file. It then “reparses” the original request for the file, by finding the appropriate filter associated with the application that stored the reparse point, and passing the reparse point data to that filter. The filter can then use the data in the reparse point to do whatever is appropriate based on the reparse point functionality intended by the application. It is a very flexible system; how exactly the reparse point works is left up to the application. The really nice thing about reparse points is that they operate transparently to the user. You simply access the reparse point and the instructions are carried out automatically. This creates seamless extensions to file system functionality.
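The flow just described can be modeled as a lookup table from tags to filters. This is a toy model, not the Windows driver interface: the tag name EXAMPLE_REDIRECT, the dictionary layout and the function names are all invented for illustration.

```python
# Registry of filters, keyed by reparse tag (hypothetical structure).
filters = {}


def register_filter(tag, handler):
    """Make a filter known to the toy file system for one tag."""
    filters[tag] = handler


def open_file(entry):
    """Open a file entry, handing off to a filter if it has a reparse point."""
    reparse = entry.get("reparse")
    if reparse is None:
        return entry["data"]          # ordinary file: just return its data
    tag, payload = reparse
    handler = filters.get(tag)
    if handler is None:
        raise LookupError(f"no filter registered for tag {tag!r}")
    # The filter decides what the application-specific payload means.
    return handler(payload)


# A toy volume with one ordinary file and one file carrying a reparse point.
volume = {
    "real.txt": {"data": "hello", "reparse": None},
    "shortcut.txt": {"data": "", "reparse": ("EXAMPLE_REDIRECT", "real.txt")},
}

# This example filter treats the payload as the name of another file.
register_filter("EXAMPLE_REDIRECT", lambda target: open_file(volume[target]))

print(open_file(volume["shortcut.txt"]))
```

The caller simply opens shortcut.txt and never sees the redirection, which mirrors the transparency described above.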
In addition to allowing reparse points to implement many types of custom capabilities, Microsoft itself uses them to implement several features within Windows 2000 itself, including the following:
- Symbolic Links: Symbolic linking allows you to create a pointer from one area of the directory structure to the actual location of the file elsewhere in the structure. NTFS does not implement “true” symbolic file linking as exists within UNIX file systems, but the functionality can be simulated by using reparse points. In essence, a symbolic link is a reparse point that redirects access from one file to another file.
- Junction Points: A junction point is similar to a symbolic link, but instead of redirecting access from one file to another, it redirects access from one directory to another.
- Volume Mount Points: A volume mount point is like a symbolic link or junction point, but taken to the next level: it is used to create dynamic access to entire disk volumes. For example, you can create volume mount points for removable hard disks or other storage media, or even use this feature to allow several different partitions (C:, D:, E: and so on) to appear to the user as if they were all in one logical volume. Windows 2000 can use this capability to break the traditional limit of 26 drive letters–using volume mount points, you can access volumes without the need for a drive letter for the volume. This is useful for large CD-ROM servers that would otherwise require a separate letter for each disk (and would also require the user to keep track of all these drive letters!)
- Remote Storage Server (RSS): This feature of Windows 2000 uses a set of rules to determine when to move infrequently used files on an NTFS volume to archive storage (such as CD-RW or tape). When it moves a file to “offline” or “near offline” storage in this manner, RSS leaves behind reparse points that contain the instructions necessary to access the archived files, if they are needed in the future.
These are just a few examples of how reparse points can be used. As you can see, the functionality is very flexible. Reparse points are a nice addition to NTFS: they allow the capabilities of the file system to be enhanced without requiring any changes to the file system itself.