A directory is a special type of file that contains a list of mappings between filenames (or subdirectory names) and inode numbers. Every file and directory in ext4 has an inode. A directory inode has the mode bits set to indicate it is a directory (S_IFDIR). Instead of storing user data (such as text in a .txt file), the data blocks referenced by a directory's inode contain directory entries (struct ext4_dir_entry_2).
In modern ext4 filesystems (where the "filetype" feature is enabled in the superblock, as it usually is), each directory entry follows the struct ext4_dir_entry2 layout. This structure can reach a maximum size of 263 bytes — that's 8 bytes of fixed header fields plus up to 255 bytes for the longest possible filename. However, entries aren't fixed-length on disk; their actual size varies and is always padded to a 4-byte boundary. To determine the precise length of any given entry and locate the next one in the chain, you have to read the rec_len field (a 16-bit little-endian value) from the entry itself. The official kernel documentation includes a clear table outlining this chained structure, highlighting the inode number, record length, name length, file type indicator, and the variable-length name that follows.
Offset | Size | Name | Description |
0x00 | 4 bytes | inode | The inode number this directory entry points to. |
0x04 | 2 bytes | rec_len | Length of this directory entry. This field may be interpreted as a pointer to the next valid directory entry. |
0x06 | 1 byte | name_len | Length of the file name. |
0x7 | 1 byte | file_type | File type code. In classic Unix file systems, the file name length field above was two bytes. But since Linux doesn't allow file names longer than 255 characters, the EXT developers decided to use only one byte for the file name length and to use the second byte to hold the file type. While the file type is also stored in the inode, having this information in the directory entry is more efficient for operations like "ls -F" where the file type is displayed along with the file name. Valid values include:
|
0x8 | Var | name[EXT4_NAME_LEN] | File name. |
Let us consider the hex dump of a small directory as shown below.
We see in bytes 0x00-0x03 that the inode corresponding to the first entry is tthe inode number 2,883,586 (0x2C0002), and bytes 0x04-0x05 show that the directory entry is 12 bytes (0x0c). Byte 6 shows that the name is 1 byte long, and byte 7 shows that the entry is for a directory (0x02). The name is given in byte 0x08-0x09, and we see that it is ‘.’. This corresponds to the directory entry for the current directory. To find the second entry, we add the length of the first entry to its start, which means that the second entry will start in byte 0x0C. We see in bytes 0x10-0x11 that the length of this entry is also 12 bytes, and it is for the ‘..’ directory. To find the third entry, we add the length of the second entry to its start and get byte 0x18. We see in bytes 0x1C-0x1D that the entry length is 20 bytes (0x14). Byte 0x1E shows the name length is 10 (0x0A). The name starts at byte 0x20 and extends until byte 0x29, and it contains the string readme.txt.
In the original ext2 filesystem (without any extensions), directories are stored as special files containing a sequence of variable-length directory entry structures. These entries are chained together linearly (effectively a singly linked list within and across blocks, using the rec_len field to point to the next entry). To perform operations like looking up a file (open), creating a new entry, or deleting one, the kernel must perform a linear scan through the directory blocks until it finds a matching name or an appropriate spot. This results in O(n) time per operation, where n is the number of entries in the directory. When many such operations (e.g., creating or deleting all files in a large directory) are performed, the total cost scales roughly quadratically (O(n²)) because each successive operation has to scan an increasingly larger number of entries on average.
The HTree directory indexing extension (introduced around 2001–2002 and enabled via the dir_index feature) was specifically designed to fix this scalability problem. It uses a hashed B-tree-like structure (constant depth, usually 1–2 levels) with filename hashes as keys, turning lookups into near-constant or logarithmic time in practice—dramatically faster for large directories (often 50–100× improvement). The leaf nodes remain standard ext2 directory blocks containing the actual entries referenced by the root or indirectly through intermediate h-tree index blocks. References within the directory file are by means of logical block offsets within the file. Backward compatibility is preserved: old kernels (pre-HTree support) see index (DX) blocks as "fake deleted" entries (special magic in the first entry hides the index), so they fall back to scanning all leaf blocks linearly. The directory remains fully readable and usable (though it's slower for large ones). Also, it is forward-compatible: new kernels can handle both plain ext2 directories and indexed ones seamlessly. This extension was first developed for ext2, later became standard in ext3 (often enabled by default), and continues in ext4. In ext4 (and modern ext3), HTree is enabled by default for directories once they grow beyond a single block (~4 KB.
HTree Directory Indexing
When the EXT4_INDEX_FL (0x1000) flag is set in the inode (default in modern ext3/ext4), directories use a hashed B-tree (HTree) to organize and find directory entries. Without it, directories fall back to linear (unsorted) linked lists of entries. The HTree is a constant-depth (usually 1–2 levels, max 3 in large ext4 directories) tree that uses hashed filename values as sort keys. The tree is a uniform-depth hashed B-tree variant (not a fully balanced B-tree with variable fanout splits like classic B-trees; splits happen when a leaf overflows, and the tree grows depth only when needed). It provides near-constant-time or logarithmic lookup performance for large directories, compared to the old linear scan in plain ext2 directories.The search for a file name in the H-tree begins with a binary search of the leaf in which the directory entry for the name is found. Directory entries within the leaf are not ordered, the leaf should be searched linearly. Filenames are hashed (using a 32-bit hash function like half-MD4 or TEA, with a per-filesystem hash seed for collision resistance and security). These hash values serve as the sort/search keys in the tree (not the actual strings or alphabetical order). There are basically two types of blocks in an HTree-indexed directory: dx-block and de-block.
Directory Index Block (DX-block)
These are the internal/index nodes of the HTree. The root is the first block (block 0) of the directory (a special dx_root structure). Intermediate nodes (if depth > 1) are dx_node structures. Each contains an array of dx_entry items, where each entry stores a hash value (or hash prefix bounding a range) and block ID pairs. Hash value is the hash value of the entry name. Block-ID is the logical block number pointing to either a lower-level DX-block (another index node) or a leaf block (DE-block containing actual entries).
Directory Entries Block (DE-block)
These are the leaf nodes of the tree. They store standard ext* directory entries (struct ext4_dir_entry_2 or legacy ext2_dir_entry), including inode number, record length (rec_len), name length, file type, and the filename itself referenced by the root or indirectly through intermediate HTree index blocks. Within a DE-block, entries remain linearly linked via rec_len (requiring a short linear scan inside the leaf after hash-based navigation reaches it).
![]() |
| Figure 1: HTree directory index |
The root of the HTree always resides in the first data block of the directory file. This block is accessed via the directory inode's extent tree (or legacy block map) as logical block 0. By longstanding ext2 convention (preserved for compatibility in ext3/ext4), every directory's first block must begin with the special entries for "." (current directory) and ".." (parent directory). These are stored as regular struct ext4_dir_entry_2 records right at the start of the block. Because of this legacy requirement, the "." and ".." entries are not part of the hashed tree itself—they're placed explicitly in the first block and excluded from the indexing structure. (The HTree only indexes the actual named entries beyond these two.) After the "." and ".." entries, the rest of the first block is occupied by the HTree root node (struct dx_root). This includes:
- A small header (struct dx_root_info) with fields like hash_version, info_length, indirect_levels (tree depth), and unused_flags.
- Then an array of struct dx_entry items forming the hash → block map. Each entry pairs a 32-bit hash lower-bound key with a 32-bit logical block pointer (offset within the directory file) to either an interior index node or a leaf block containing real directory entries.
Offset | Size (bytes) | Name | Description |
0x00 | 8 | dot | Directory entry for '.'. |
0x08 | 4 | dot_name[4] | ".\0\0\0" |
0x0C | 8 | dotdot | Directory entry for '..'. |
0x14 | 4 | dotdot_name[4] | "..\0\0" |
0x18 | 4 | struct dx_root_info.reserved_zero | Zero to make the rest of this directory data block seem empty |
0x1C | 1 | struct dx_root_info.hash_version | Hash version, can be one of the following:
|
0x1D | 1 | struct dx_root_info.info_length | Length of the tree information, 0x8. |
0x1E | 1 | struct dx_root_info.indirect_levels | Depth of the htree. |
0x1F | 1 | struct dx_root_info.unused_flags | Flags (unused) |
0x20 |
| entries[0] | As many 8-byte struct dx_entry as fits in the rest of the data block. |
The indirect_levels field indicates the number of additional index levels beyond the root. In classic ext4 HTree implementations (before the largedir feature), this is typically 1 (meaning one level of interior nodes + leaf nodes = two levels total). If non-zero, the pointers from the root's dx_entry map lead to interior nodes (not directly to leaves). With the INCOMPAT_LARGEDIR feature (added later), it can go up to 2 (three levels total).
The root (in the directory's first block) contains dx_entry structures with a 32-bit hash (major hash) and block pointer. When indirect_levels > 0, those point to interior nodes (struct dx_node), and lookups in those interior nodes use the minor hash (the lower bits of the full computed hash) to perform binary search or probe the index entries. An interior node (dx_node) starts with a fake/zeroed ext4_dir_entry_2 (inode=0, to masquerade as an "empty" directory block for legacy compatibility), followed by an array of dx_entry structures. Each entry pairs a minor hash (lower bound for the hashes in the subtree) with a logical block pointer to a leaf or deeper node.
Offset | Size (bytes) | Name | Description |
0x00 | 8 | fake | Zeroed out to make this data block seem empty of directory entries. |
0x08 | entries[0] | As many 8-byte struct dx_entry as fits in the rest of the data block. |
Leaf blocks are standard ext4 directory blocks containing real ext4_dir_entry_2 structures (with inode, name_len, file_type, name, etc.), sorted by their full hash within the leaf. Entries in one leaf share a common hash prefix (they fall within the hash range bounded by the parent index entry), but due to the 31-bit effective hash space, they don't necessarily have identical hashes—only hashes that sort into the same leaf range. The actual hash uses 31 bits; the LSB in each dx_entry hash field is a collision/continuation flag. When set (LSB=1), it indicates possible hash collisions or boundary overflow—meaning the target entry might be in this leaf or spill over into the immediately following leaf block. During lookup, if the entry isn't found in the expected leaf and the flag is set on the boundary hash, the code checks the next leaf block to resolve collisions.
The hash maps that exist in both struct dx_root and struct dx_node are recorded as struct dx_entry, which is 8 bytes long. Each index entry is a compact 8-byte record pairing a 32-bit hash key with a 32-bit logical block pointer. The logical block pointer refers to an offset within the directory file itself (not a filesystem-wide block number), pointing either to a child index node or directly to a leaf block. The hash key represents the lower bound—the smallest hash value present in the referenced subtree or leaf—so entries are kept in strictly ascending hash order to enable fast binary-search-style traversal.
Offset | Size (bytes) | Name | Description |
0x00 | 4 | hash | Hash code |
0x04 | 4 | block | Block number (within the directory file, not filesystem blocks) of the next node in the htree. |
The kernel computes a 32-bit hash of the target filename (using the filesystem's configured hash algorithm, like half_md4, and a per-superblock seed for better distribution). The upper bits (major hash) are used to binary-search the root node's dx_entry array (in the directory's first block) to locate the appropriate index entry, whose block pointer leads to the next level. If the tree is "flat" (dx_root->info.indirect_levels == 0, meaning a one-level tree), that pointer directly references a leaf block containing a sorted linear array of ext4_dir_entry_2 structures (sorted by their full hash values). The code then scans this array linearly for an exact name match, as hashes aren't unique.
If not flat (e.g., indirect_levels == 1 for a two-level tree), the root points to an interior index node (the "second block"). The lower bits (minor hash) of the filename's hash are then used to binary-search that interior node's dx_entry array, yielding a pointer to the leaf block (the "third block"). Again, the leaf is linearly scanned for the match. Hash collision handling (via the LSB flag in hash keys) may require checking adjacent leaves if the flag is set and the entry isn't found.
For tools or code that don't support HTree (e.g., old ext2 utilities or when dir_index is disabled), the directory is read as a flat sequence of blocks via the inode's block map or extents. Index blocks (root and interiors) are designed to look like "empty" directory blocks: the root starts with real "." and ".." entries but follows with index data; interiors begin with a fake zero-inode ext4_dir_entry_2 (rec_len spanning the block) to mimic emptiness. Thus, linear readers skip them as uninteresting. Only the leaf blocks contain actual file/directory entries, so the full contents are still accessible, albeit with O(n) performance for large directories.
Let us examine a directory with an inode number of 527258. Below is the inode dump of the directory.
If we look at the first two bytes of this inode, we can see the value 0x41ED. The four most significant bits are 0100, which means it is a directory. The extent tree starts in inode byte offset 0x28, starting with a header. The header includes the 0xF30A magic for extents, it contains four extents, and this extent is a leaf node (depth is 0). Directly after the extent header, we find the four extents (underlined above). The first extent starts from logical block 0, has a length of 1 block, and the location of the block is 0x202036. The second extent starts from logical block 1, has a length of 2 blocks, and the location of the block is 0x202195. The third extent starts from logical block 3, has a length of 1 block, and the location of the block is 0x20219C. The fourth extent starts from logical block 4, has a length of 1 block, and the location of the block is 0x20219F.
Let us examine the dump of the initial few bytes of the first block (i.e., block 0x202036).
The first 36 bytes are used by dx_root. Bytes 0x00-0x0B represent the "." entry. Bytes 0x0C-0x17 represent the ".." entry. Bytes 0x18-0x1B (0x00000000) are reserved. Byte 0x1C (0x01) records the hashing algorithm used (see table below). Byte 0x1D (0x08) records the size of the dx_entry records. This value, as discussed earlier, is always 8. Byte 0x1E (0x00) records the depth of the tree. The value here is 0, indicating that it is a leaf level in the hash tree structure. Byte 0x1F (0x00) represents flags. It is unused and typically has the value of 0. Bytes 0x20-0x21 (0x01FB) record the maximum number of dx_entry records possible. Bytes 0x22-0x23 (0x0004) record the actual number of dx_entry records to follow after the initial fields of the dx_root structure.
Value | Hash Algorithm |
0x00 | Legacy |
0x01 | Half MD4 |
0x02 | Tea |
0x03 | Legacy, unsigned |
0x04 | Half MD4, unsigned |
0x05 | Tea, unsigned |
0x06 | Siphash |
Bytes 0x24-0x27 (0x00000001) represent the relative block number for the "zero hash". The remaining bytes contain the actual dx_entry records, which have been underlined in the figure above.
The 2-byte value stored in bytes 0x22–0x23 indicates the total count of dx_entry records that follow after the initial fields of the dx_root structure. This count includes the special "zero hash" entry (located in bytes 0x24-0x27) as one of the records. Each dx_entry consists of:
- a 4-byte hash value.
- followed by a 4-byte relative block offset, measured from the start of the directory file.
In a dx_entry, the first 4 bytes represent the hash value, and the next 4 bytes represent the logical block number. To illustrate this structure more clearly, the four dx_entry records are presented in the table below:
Hash Value | Block offset |
Zero hash | 0x01 |
0x3E4A67BC | 0x03 |
0x843E3C3C | 0x02 |
0xC436C702 | 0x04 |
The dx_entry records form a sorted lookup table, ordered by hash value. The first entry — the special "zero hash" record — indicates that any file whose name produces a hash value below 0x3E4A67BC is located in directory block number 1. The next entry shows that files with hash values from 0x3E4A67BC (inclusive) up to but not including 0x843E3C3C are stored in directory block 3, and the pattern continues similarly for subsequent entries. Because the table is kept in ascending hash order, the ext filesystem code can perform a binary search over the dx_entry records. This allows it to quickly determine which directory block contains the desired file.
When the hash tree has a depth of 2, two layers of intermediate nodes exist between the leaf nodes (which contain the actual directory entries) and the dx_root block. These intermediate nodes contain only dx_entry structures.
Collision
The hash of a filename is mainly used to identify the specific leaf block that covers the corresponding portion of the hash space. If multiple directory entries (dirents) end up with identical hash values due to a collision, they are kept together in that same leaf block whenever feasible. Lookups then involve reading the targeted block and scanning its entries linearly to locate the exact matching name.
When inserting a new entry causes an overflow in a full leaf block (particularly in cases involving a hash collision), the filesystem allocates a fresh block. It then splits the original block's contents, distributing the dirents between the old and new blocks according to their hash values. Importantly, the split is performed in a way that ensures all dirents sharing the exact same hash remain in one block or the other — they are never divided across the two blocks during a normal split. This design choice maintains efficient linear searches within each block, even after redistribution, by avoiding the fragmentation of colliding entries. Only in rare, extreme cases (where collisions are so numerous that a single block cannot hold them all) would identical-hash entries spill over into additional consecutive blocks, with the collision flag in the index helping guide lookups to check neighboring blocks as needed.





Post a Comment