Modern filesystems commonly employ journaling to safeguard data integrity. A journal acts as a write-ahead log: it records pending changes to file-system metadata (and, depending on the mode, data) before those changes are applied to their final locations on disk. This design ensures that, after a sudden crash or power loss, the file system can be quickly returned to a consistent state by replaying the journal during the next mount. In ext4, the journal’s primary role is crash recovery. When the file system is found to need recovery at mount time, the kernel (via the JBD2 layer) replays every fully committed transaction that had not yet been written back to its final location and discards incomplete ones, avoiding the lengthy full fsck scan that would otherwise be needed to restore consistency.
Journaling itself is marked in the ext4 superblock as a compatible feature (EXT4_FEATURE_COMPAT_HAS_JOURNAL). Because the flag is in the compatible set, older kernels and tools that do not understand the journal can still mount a cleanly unmounted volume. The journal can also be located on another device, identified by its UUID. In that case the main file-system superblock sets s_journal_inum to zero and stores the UUID of the external journal device in s_journal_uuid. The external device itself carries its own superblock with the incompatible feature flag EXT4_FEATURE_INCOMPAT_JOURNAL_DEV set. This arrangement keeps the journal data entirely off the main volume, freeing space and allowing the journal to reside on faster storage such as an SSD or RAID array.
The internal journal is a regular (but hidden, system-reserved) file using inode number 8 in nearly all ext4 filesystems with journaling enabled. You can see this in tune2fs -l output under "Journal inode: 8". While inode 8 is the default and overwhelmingly common case, it is theoretically possible (via special tools or corruption) to have a journal inode placed elsewhere, as recorded in the superblock. The superblock holds s_journal_uuid (offset 0xD0), s_journal_inum (0xE0), s_journal_dev (0xE4), and a 68-byte backup array s_jnl_blocks[17] (0x10C). The array replicates the journal inode's block map (the i_block array and the inode size fields) for bootstrapping and recovery purposes when the actual inode might be inaccessible. The journal is not a special on-disk structure; it is just an ordinary inode plus data blocks, marked as reserved (no directory entries point to it, and it is protected by the kernel). The journal is often large enough (128 MB or more by default on bigger filesystems, and adjustable) to fill or nearly fill one block group. mke2fs (and mkfs.ext4) places it roughly in the middle of the device when possible to reduce average seek time for journal I/O on rotational (HDD) disks, which is less relevant on SSDs.
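The superblock fields described above can be read directly from a raw image. The following is a minimal Python sketch, not a definitive parser: the function name read_journal_fields is hypothetical, and the offsets (0x38 for the ext4 magic, 0x5C for the feature words, 0xD0, 0xE0, 0xE4, and 0x10C for the journal fields) and flag values are assumptions taken from the ext4 on-disk layout. Unlike the journal itself, the ext4 superblock is little-endian.

```python
import struct
import uuid

EXT4_MAGIC = 0xEF53             # s_magic, little-endian u16 at offset 0x38
COMPAT_HAS_JOURNAL = 0x0004     # EXT4_FEATURE_COMPAT_HAS_JOURNAL
INCOMPAT_JOURNAL_DEV = 0x0008   # EXT4_FEATURE_INCOMPAT_JOURNAL_DEV


def read_journal_fields(sb: bytes) -> dict:
    """Decode the journal-related fields of a raw 1024-byte ext4 superblock."""
    (magic,) = struct.unpack_from("<H", sb, 0x38)
    if magic != EXT4_MAGIC:
        raise ValueError("not an ext2/3/4 superblock")
    compat, incompat = struct.unpack_from("<II", sb, 0x5C)
    inum, dev = struct.unpack_from("<II", sb, 0xE0)
    return {
        "has_journal": bool(compat & COMPAT_HAS_JOURNAL),
        "is_journal_device": bool(incompat & INCOMPAT_JOURNAL_DEV),
        "journal_uuid": uuid.UUID(bytes=bytes(sb[0xD0:0xE0])),
        "journal_inum": inum,   # usually 8 for an internal journal, 0 if external
        "journal_dev": dev,
        # Backup copy of the journal inode's block map and size fields.
        "jnl_blocks": struct.unpack_from("<17I", sb, 0x10C),
    }
```

To use it on an image whose file system starts at offset 0, seek to byte 1024, read 1024 bytes, and pass them to the function.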
The journaling process follows a strict commit protocol. All pending updates are first written to the journal as descriptor blocks, data blocks, and revocation blocks, followed finally by a commit block. Only after the commit block is safely on disk does ext4 consider the transaction complete and begin writing the changes to their permanent locations. Ext4’s journal is a fixed-size, circular log, whether stored internally (as the hidden inode 8) or externally. Once the end of the journal is reached, new transactions wrap around and overwrite the oldest entries, whose changes have already been checkpointed to their final locations. Because journaling operates at the block level, even a tiny change (such as updating a single inode) causes the entire file-system block containing that inode (commonly 4 KB) to be copied into the journal. This design simplifies recovery but also creates a rich forensic artifact.
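The wrap-around behaviour of the circular log comes down to one small piece of arithmetic. The sketch below is illustrative only: next_log_block is a hypothetical helper, first and maxlen stand for the journal superblock fields s_first and s_maxlen, and it assumes the usable log area runs from block first up to block maxlen - 1 (ignoring any fast-commit area).

```python
def next_log_block(block: int, first: int, maxlen: int) -> int:
    """Return the journal block that follows `block`, wrapping at the end."""
    block += 1
    if block >= maxlen:   # ran past the last block of the journal
        block = first     # wrap to the first usable log block
    return block
```

For a 4,096-block journal whose log starts at block 1, the block after 4095 is block 1 again, which is why the oldest transaction is not necessarily found at the start of the journal file.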
For digital forensic analysts, the ext4 journal is a time machine. Because it retains copies of recently modified metadata blocks—often including inodes, directory entries, and extent trees—investigators can recover deleted files whose inode copies have not yet been overwritten, reconstruct earlier versions of files, and build precise timelines of system activity. Although the journal is relatively small (default sizes range from 128 MB to 1 GB) and older entries are eventually recycled, it frequently holds evidence that no longer exists anywhere else on the live file system. There are three journaling modes available in ext4:
- Writeback mode (data=writeback): Only metadata is journaled. File data can be written to disk at any time (before, during, or after the metadata is committed to the journal) with no guarantees about ordering. This is the least reliable mode for data integrity. It offers the best performance (no extra data flushes), but it is riskier and not recommended for most use cases. In a crash:
  - Metadata remains consistent (the filesystem structure is protected).
  - File contents can become corrupted or show stale data (e.g., a partially appended file might appear larger than what was actually written, with garbage at the end, or old file versions could unexpectedly reappear after recovery).
- Ordered mode (data=ordered, the default in ext4): Only metadata is journaled, but the filesystem ensures that the associated file data blocks are written to disk before the corresponding metadata is committed to the journal. This provides a logical ordering guarantee without journaling the data itself. The mode balances good performance (faster than full journaling) with reasonable protection against many common corruption scenarios: better than writeback, but not as safe as journal mode. In a crash:
  - If a file was being appended to, incomplete new data is typically discarded (the file reverts to its old size and content, avoiding garbage at the end).
  - For overwrites (replacing existing content), corruption is still possible: the file can end up in a half-updated intermediate state (some old blocks, some new, mixed unpredictably, especially if the disk hardware reorders writes). Neither the fully old nor the fully new version may be recoverable, because the old data is not preserved anywhere.
- Journal mode (data=journal, sometimes called full or data journaling): Both metadata and file data are written to the journal first, then (after commit) to their final locations on the filesystem. In a crash, journal replay ensures that files end up with either the old complete version or the new complete version, with no torn or intermediate states and no garbage. This provides the strongest data integrity. Because data is written twice (journal plus final location), it is usually the slowest mode, especially for write-heavy workloads. However, if the journal is on a fast device or has sufficient space, sequential journal writes can sometimes offer better throughput in specific scenarios.
Forensic examination of a system can be affected by journaling. In full data journaling mode (data=journal), both metadata and file data blocks are copied to the journal before being written to their final location. This creates a temporary "backup" of old versions or deleted/overwritten content in the journal, which forensic tools can carve or analyze for recovery—even after deletion or overwrite on the main filesystem. Many forensic papers and tools exploit this for recovering previous file versions or deleted data without scanning the entire disk. In the other two modes, file data is usually not in the journal (so content recovery is limited or impossible from there), but metadata copies (inodes, directory entries, timestamps like mtime/ctime/atime, file names via directory records) often persist in the journal. This provides evidentiary value—e.g., historical MAC times, evidence of deletion timestamps, or prior file existence—even if the actual file contents are gone or unrecoverable.
Journaled filesystems perform journal replay (applying pending transactions to reach a consistent state) during mount if the filesystem appears "dirty" (e.g., after an improper shutdown). This replay can involve writing changes to the main filesystem structures, in some cases even on a supposedly read-only (ro) mount. This is a well-known forensic pitfall, because it modifies the evidence. Forensic best practice is to mount with options such as mount -o ro,noload to prevent this. Without them, mounting can trigger unwanted writes as transactions are replayed, altering timestamps, superblocks, or other metadata.
If the journal isn't replayed (e.g., using noload to preserve evidence), the mounted view of the filesystem may show inconsistencies—missing recent changes, incomplete operations, or an "incorrect" state compared to what the live system would have after recovery. This can make files appear missing, truncated, or with outdated metadata until the journal is applied. In forensics, examiners often deliberately avoid replay to preserve the "as-found" state, accepting potential inconsistencies in exchange for no alteration.
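Before mounting anything, an examiner can check whether a replay would be attempted at all. The sketch below is a hedged illustration rather than a forensic tool: the helper names are hypothetical, it assumes the image contains a bare file system starting at offset 0 (not a partitioned disk), and it relies on the "needs recovery" flag EXT4_FEATURE_INCOMPAT_RECOVER (0x0004) in the little-endian s_feature_incompat word at superblock offset 0x60, which is taken from the ext4 on-disk layout and not from this article.

```python
import struct

EXT4_FEATURE_INCOMPAT_RECOVER = 0x0004  # set while the journal holds unreplayed work


def read_primary_superblock(image_path: str) -> bytes:
    """Read the 1024-byte primary superblock from a raw file-system image."""
    with open(image_path, "rb") as f:   # opened read-only; nothing is modified
        f.seek(1024)
        return f.read(1024)


def journal_needs_recovery(sb: bytes) -> bool:
    """True if mounting without noload would trigger a journal replay."""
    (incompat,) = struct.unpack_from("<I", sb, 0x60)
    return bool(incompat & EXT4_FEATURE_INCOMPAT_RECOVER)
```

If journal_needs_recovery(read_primary_superblock(path)) returns True, the mounted view obtained with noload is likely to differ from what the live system would have shown after recovery.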
Ext4 Journal Layout
All filesystem metadata updates (and, in data=journal mode, data updates) are grouped into atomic transactions. Every administrative journal block includes a transaction ID in its header (the h_sequence field), which serves as the sequence number for that transaction. Journal blocks are one of the following types:
- Administrative (descriptor blocks, commit blocks, revocation blocks) — contain control information, tags, checksums, etc.
- File system update data (data blocks) — contain the actual copies of filesystem blocks (usually metadata) being updated/journaled.
In total, a journal can contain five kinds of blocks. Four are administrative: superblock, descriptor, commit, and revoke blocks. Each holds information specific to its type, but all four share the same format in the first 12 bytes. The fifth kind stores the journaled content itself: in ordered and writeback modes these are copies of the metadata blocks being modified (for example, the blocks that hold inodes), and in journal mode they also include file content blocks.
A transaction typically begins with a descriptor block (block type 1), which has: (a) the common journal header including h_sequence (transaction ID); (b) an array of block tags listing the target on-disk block numbers (final locations) for each subsequent data block, plus flags/checksums/etc. Immediately after the descriptor come the actual metadata/data blocks (verbatim copies of the filesystem blocks being updated, as listed in the descriptor tags). The transaction ends with a commit block (block type 2), which also carries the same h_sequence value. This commit block acts as the "atomic seal" — its presence (and valid checksum) means the entire transaction is durable and can be replayed during recovery. The journal is a continuous circular log: after one transaction's commit block, the next transaction's descriptor block (or revocation block) follows immediately. Multiple transactions are chained in sequence.
Figure 1: Ext4 Journal transaction overview
The ext4 filesystem (via the JBD2 journaling layer) begins its journal with a superblock that stores essential metadata, including the current transaction sequence number, the starting block of the valid log tail, and pointers to locate active transactions in the circular journal log. Due to the journal's circular nature, new transactions wrap around and overwrite old checkpointed ones, so the first valid descriptor block may appear anywhere after the superblock rather than at the beginning.
Each transaction starts with one or more descriptor blocks (containing tags that point to the on-disk locations of modified filesystem blocks), followed by the actual metadata and/or data blocks being journaled, and concludes with a commit block that includes checksums and a timestamp to confirm completion. If a descriptor block fills up during a large transaction, additional descriptor blocks follow, all sharing the same sequence number.
Revoke blocks (or revocation records within them) are used during normal operation to list filesystem blocks that should not be replayed from earlier transactions (typically when a block is freed, overwritten, or reallocated), preventing stale data from corrupting the filesystem during recovery. These revokes carry a sequence number, and during replay, a block is skipped if it appears in a revoke record from a transaction with an equal or higher sequence number.
On crash or unclean shutdown, recovery scans the journal in multiple passes: first to find the log end, second to collect all revoke records into a table (mapping revoked blocks to transaction IDs), and third to replay only committed transactions (those with a valid commit block and matching checksums) while respecting the revoke table to avoid applying superseded changes. Incomplete transactions (those lacking a commit block or failing checksum validation) are simply discarded and not replayed, ensuring the filesystem returns to a consistent state without creating new revoke blocks during recovery for aborted transactions.
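The three recovery passes can be modelled in a few lines. This is a simplified sketch over an in-memory model, not the kernel's implementation: the Txn class and replay function are hypothetical, checksums are reduced to a single committed flag, and the "disk" is just a dictionary from block number to contents.

```python
from dataclasses import dataclass, field


@dataclass
class Txn:
    seq: int                                     # transaction ID (h_sequence)
    blocks: dict                                 # fs block number -> journaled bytes
    revoked: list = field(default_factory=list)  # fs blocks revoked by this txn
    committed: bool = True                       # False: commit block missing/invalid


def replay(txns: list, disk: dict) -> dict:
    # Pass 1 (scan): the log ends at the first transaction without a valid commit.
    valid = []
    for t in txns:
        if not t.committed:
            break
        valid.append(t)
    # Pass 2 (revoke): remember the highest sequence number that revoked each block.
    revoke_table = {}
    for t in valid:
        for blk in t.revoked:
            revoke_table[blk] = max(revoke_table.get(blk, t.seq), t.seq)
    # Pass 3 (replay): apply blocks in order, skipping superseded copies.
    for t in valid:
        for blk, data in t.blocks.items():
            if blk in revoke_table and t.seq <= revoke_table[blk]:
                continue
            disk[blk] = data
    return disk
```

In this model, a copy of block 100 journaled in transaction 5 is skipped if transaction 6 revokes block 100, while a fresh copy journaled in transaction 7 is still applied, and nothing from an uncommitted transaction 8 reaches the disk.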
The journal data is all written in big-endian ordering. This is quite unusual in ext (and in file systems in general!). Every superblock, commit, revoke and descriptor block begins with the same 12-byte journal header. The structure of this header is shown in the table below.
Offset | Size | Name | Description |
0x00 | 4 bytes | h_magic | jbd2 magic number, 0xC03B3998. |
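0x04 | 4 bytes | h_blocktype | Field describing the block type of the current block: 1 = descriptor block; 2 = commit block; 3 = journal superblock, version 1; 4 = journal superblock, version 2; 5 = revoke block. |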
0x08 | 4 bytes | h_sequence | The transaction ID that goes with this block. |
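Decoding the common header is a single big-endian unpack. The following minimal sketch uses a hypothetical parse_header function; the block-type names for values 3 and 5 follow the JBD2 definitions as I understand them, and a block without the magic number is treated as a data (or unused) block.

```python
import struct

JBD2_MAGIC = 0xC03B3998
BLOCK_TYPES = {
    1: "descriptor",
    2: "commit",
    3: "superblock v1",
    4: "superblock v2",
    5: "revoke",
}


def parse_header(block: bytes):
    """Return (type name, sequence) for an administrative block, else None."""
    magic, blocktype, sequence = struct.unpack_from(">III", block, 0)
    if magic != JBD2_MAGIC:
        return None  # no journal header: a data block or unused space
    return BLOCK_TYPES.get(blocktype, f"unknown ({blocktype})"), sequence
```

Calling parse_header on every block of the journal file gives a rough block-by-block listing of descriptors, commits, and revokes with their transaction IDs.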
The journal superblock is the first block in the journal log, and it holds important metadata information. It is recorded as struct journal_superblock_s, which is 1024 bytes long. The superblock has the fields given in the following table.
Offset | Size | Name | Description |
0x00 | 12 bytes | s_header | Common header identifying this as a superblock (See the preceding table above). |
0x0C | 4 bytes | s_blocksize | Journal device block size. |
0x10 | 4 bytes | s_maxlen | Total number of blocks in this journal. |
0x14 | 4 bytes | s_first | First block of log information. |
0x18 | 4 bytes | s_sequence | First commit ID expected in log. |
0x1C | 4 bytes | s_start | Block number of the start of the log. If zero, the journal is clean. |
0x20 | 4 bytes | s_errno | Error value, as set by jbd2_journal_abort(). |
The remaining fields are only valid in a version 2 superblock. |
0x24 | 4 bytes | s_feature_compat | Compatible feature set. Only one possible value, 0x01, meaning that checksums are enabled. |
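0x28 | 4 bytes | s_feature_incompat | Incompatible feature set. Possible values include: 0x01 = journal has block revocation records (JBD2_FEATURE_INCOMPAT_REVOKE); 0x02 = journal uses 64-bit block numbers (JBD2_FEATURE_INCOMPAT_64BIT); 0x04 = asynchronous commit (JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT); 0x08 = version 2 checksums (JBD2_FEATURE_INCOMPAT_CSUM_V2); 0x10 = version 3 checksums (JBD2_FEATURE_INCOMPAT_CSUM_V3); 0x20 = fast commits (JBD2_FEATURE_INCOMPAT_FAST_COMMIT). |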
0x2C | 4 bytes | s_feature_ro_compat | Read-only compatible feature set. There aren't any of these currently. |
0x30 | 16 bytes | s_uuid[16] | 128-bit UUID for journal. This is compared against the copy in the ext4 super block at mount time. |
0x40 | 4 bytes | s_nr_users | Number of file systems sharing this journal. |
0x44 | 4 bytes | s_dynsuper | Location of dynamic super block copy. (Not used) |
0x48 | 4 bytes | s_max_transaction | Limit of journal blocks per transaction. (Not used) |
0x4C | 4 bytes | s_max_trans_data | Limit of data blocks per transaction. (Not used) |
0x50 | 1 byte | s_checksum_type | Checksum algorithm type (e.g., CRC32C). |
0x51 | 3 bytes | s_padding2[3] | Padding for alignment. |
0x54 | 4 bytes | s_num_fc_blocks | Number of fast-commit blocks (newer feature). |
0x58 | 4 bytes | s_head | Current head of the journal (when empty). |
0x5C | 160 bytes | s_padding[40] | Padding (reserved). |
0xFC | 4 bytes | s_checksum | Checksum of the superblock (computed with this field zeroed). |
0x100 | 768 bytes | s_users[16*48] | An array of UUIDs for filesystems sharing the journal (up to 48 entries). |
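The table above maps directly onto a handful of big-endian unpack calls. The sketch below is a partial, illustrative decoder: parse_journal_superblock is a hypothetical name, only a subset of fields is extracted, and the feature-flag names other than 64BIT and CSUM_V3 are taken from the JBD2 definitions, not from this article.

```python
import struct
import uuid

JBD2_MAGIC = 0xC03B3998
INCOMPAT_FLAGS = {
    0x01: "REVOKE",
    0x02: "64BIT",
    0x04: "ASYNC_COMMIT",
    0x08: "CSUM_V2",
    0x10: "CSUM_V3",
    0x20: "FAST_COMMIT",
}


def parse_journal_superblock(sb: bytes) -> dict:
    """Decode selected fields of a 1024-byte JBD2 superblock (big-endian)."""
    magic, blocktype, _seq = struct.unpack_from(">III", sb, 0x00)
    if magic != JBD2_MAGIC or blocktype not in (3, 4):
        raise ValueError("not a JBD2 superblock")
    blocksize, maxlen, first, sequence, start, errno_ = struct.unpack_from(">6I", sb, 0x0C)
    info = {
        "version": 1 if blocktype == 3 else 2,
        "blocksize": blocksize,
        "maxlen": maxlen,
        "first": first,
        "sequence": sequence,
        "start": start,
        "errno": errno_,
        "clean": start == 0,  # s_start == 0 means nothing to replay
    }
    if blocktype == 4:  # the remaining fields exist only in version 2
        _compat, incompat, _ro = struct.unpack_from(">3I", sb, 0x24)
        info["incompat"] = [n for bit, n in INCOMPAT_FLAGS.items() if incompat & bit]
        info["uuid"] = uuid.UUID(bytes=bytes(sb[0x30:0x40]))
        (info["nr_users"],) = struct.unpack_from(">I", sb, 0x40)
    return info
```

The incompat list is what later decides how descriptor tags must be parsed, so it is worth decoding before looking at any other journal block.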
A descriptor block begins with journal_header_s, which contains the magic number and an h_blocktype field set to identify the block as a descriptor block. Following the header is an array of journal block tags that record the final on-disk location of each data block in the transaction.
Offset | Type | Description |
0x00 | journal_header_t | Common block header. |
0x0C | struct journal_block_tag_s or struct journal_block_tag3_s | Enough tags either to fill up the block or to describe all the data blocks that follow this descriptor block. |
The superblock must be consulted to process the descriptor blocks. Specifically, it is needed to determine whether the version 3 checksum and 64-bit block number features are set. If JBD2_FEATURE_INCOMPAT_CSUM_V3 is set in the incompatible feature set (s_feature_incompat), then journal_block_tag3_s is used; otherwise, journal_block_tag_s is used.
Old journal descriptor block tag structure (journal_block_tag_s)
Offset | Size | Name | Description |
0x00 | 4 bytes | t_blocknr | Lower 32 bits of the location of where the corresponding data block should end up on disk. |
0x04 | 2 bytes | t_checksum | Truncated checksum of the journal UUID, the sequence number, and the data block. |
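0x06 | 2 bytes | t_flags | Flags that go with the descriptor. Possible values include: 0x1 = the data block was escaped (its first four bytes matched the jbd2 magic number); 0x2 = same UUID as the previous tag, so the UUID field is omitted; 0x4 = the block was deleted by this transaction (not used); 0x8 = this is the last tag in this descriptor block. |
0x08 | 4 bytes | t_blocknr_high | Upper 32 bits of the target block number. Only present if the 64-bit feature is enabled. |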
Version 3 journal descriptor block tag structure (journal_block_tag3_s)
Offset | Size | Name | Description |
0x00 | 4 bytes | t_blocknr | Lower 32-bits of the location of where the corresponding data block should end up on disk. |
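0x04 | 4 bytes | t_flags | Flags that go with the descriptor. Possible values include: 0x1 = the data block was escaped (its first four bytes matched the jbd2 magic number); 0x2 = same UUID as the previous tag, so the UUID field is omitted; 0x4 = the block was deleted by this transaction (not used); 0x8 = this is the last tag in this descriptor block. |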
0x08 | 4 bytes | t_blocknr_high | Upper 32-bits of the location of where the corresponding data block should end up on disk. |
0x0C | 4 bytes | t_checksum | Checksum of the journal UUID, the sequence number, and the data block. |
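Walking the tag array takes a little care, because the tag size depends on the journal features. The sketch below is an approximation under stated assumptions: parse_descriptor_tags is a hypothetical name; tags are taken to be 16 bytes with CSUM_V3, otherwise 12 bytes with 64-bit block numbers or 8 bytes without (the CSUM_V2-only layout is not handled); and, following the JBD2 layout as I understand it, a 16-byte UUID follows any tag that lacks the "same UUID" flag (0x2), with the "last tag" flag (0x8) ending the array.

```python
import struct

FLAG_ESCAPE, FLAG_SAME_UUID, FLAG_DELETED, FLAG_LAST_TAG = 0x1, 0x2, 0x4, 0x8


def parse_descriptor_tags(block: bytes, csum_v3: bool, is_64bit: bool) -> list:
    """Return [(target fs block number, flags), ...] from a descriptor block."""
    tag_size = 16 if csum_v3 else (12 if is_64bit else 8)
    tags = []
    off = 0x0C  # tags start right after the 12-byte common header
    while off + tag_size <= len(block):
        if csum_v3:
            lo, flags, hi, _csum = struct.unpack_from(">IIII", block, off)
        else:
            lo, _csum, flags = struct.unpack_from(">IHH", block, off)
            hi = struct.unpack_from(">I", block, off + 8)[0] if is_64bit else 0
        if not is_64bit:
            hi = 0  # upper half is only meaningful with 64-bit block numbers
        off += tag_size
        if not flags & FLAG_SAME_UUID:
            off += 16  # skip the UUID that follows this tag
        tags.append(((hi << 32) | lo, flags))
        if flags & FLAG_LAST_TAG:
            break
    return tags
```

The n-th tag describes the n-th data block that follows the descriptor block in the journal, so the returned list can be zipped with the subsequent journal blocks to map each copy to its home location.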
Data blocks are written verbatim to the journal immediately following a descriptor block. To prevent accidental misinterpretation during recovery, if the first four bytes of a data block happen to equal the JBD2 magic number (0xC03B3998), they are replaced with zeros on disk, and the escape flag (0x0001) is enabled in the descriptor block’s tag for that entry. During journal replay, the original magic value is then restored.
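Undoing the escape when extracting a journaled block is a one-line operation. A minimal sketch, with unescape_block as a hypothetical helper that takes the raw journal copy and the flags from its tag:

```python
JBD2_MAGIC_BYTES = bytes.fromhex("C03B3998")
FLAG_ESCAPE = 0x1


def unescape_block(data: bytes, flags: int) -> bytes:
    """Restore the original first four bytes of an escaped data block."""
    if flags & FLAG_ESCAPE:
        return JBD2_MAGIC_BYTES + data[4:]
    return data
```

Skipping this step when carving blocks out of a journal would leave escaped blocks with four zero bytes where the original content began.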
The commit block serves as a marker confirming that an entire transaction has been fully written into the journal. Only after this commit block is successfully persisted to the journal can the associated blocks be copied (checkpointed) to their permanent locations on the filesystem. The commit block does not guarantee that the data has already been written to its final location; it only guarantees that the journal copy is complete and safe. In data=ordered mode (the default), file data is written to its final location before the metadata is journaled and the commit block is issued. In data=writeback mode, file data can be written at any time (no strict ordering). In data=journal mode, both data and metadata go through the journal first. The actual writing of journaled blocks to their home locations is called checkpointing, which can happen later (often lazily) to free up space in the journal. The commit block is described by struct commit_header, whose fields span 60 bytes (0x3C) according to the layout below, although the block itself occupies a full journal block.
Offset | Size | Name | Description |
0x00 | 12 bytes | journal_header_s | Common header |
0x0C | 1 byte | h_chksum_type | The type of checksum to use to verify the integrity of the data blocks in the transaction. One of: 1 = CRC32; 2 = MD5; 3 = SHA1; 4 = CRC32C. |
0x0D | 1 byte | h_chksum_size | The number of bytes used by the checksum. Most likely 4. |
0x0E | 2 bytes | h_padding[2] | Padding |
0x10 | 32 bytes | h_chksum[JBD2_CHECKSUM_BYTES] | 32 bytes of space to store checksums. |
0x30 | 8 bytes | h_commit_sec | The time that the transaction was committed, in seconds since the epoch. |
0x38 | 4 bytes | h_commit_nsec | Nanoseconds component of the above timestamp. |
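For timeline work, the commit timestamp is usually the most interesting field. The following sketch decodes it along with the checksum descriptor; parse_commit_block is a hypothetical name, and the timestamp is interpreted as UTC seconds since the Unix epoch.

```python
import struct
from datetime import datetime, timezone

JBD2_MAGIC = 0xC03B3998
JBD2_COMMIT_BLOCK = 2


def parse_commit_block(block: bytes) -> dict:
    """Decode the sequence number, checksum info, and commit time."""
    magic, blocktype, sequence = struct.unpack_from(">III", block, 0x00)
    if magic != JBD2_MAGIC or blocktype != JBD2_COMMIT_BLOCK:
        raise ValueError("not a JBD2 commit block")
    chksum_type, chksum_size = struct.unpack_from(">BB", block, 0x0C)
    sec, nsec = struct.unpack_from(">QI", block, 0x30)
    return {
        "sequence": sequence,
        "chksum_type": chksum_type,
        "chksum_size": chksum_size,
        "commit_time": datetime.fromtimestamp(sec, tz=timezone.utc),
        "commit_nsec": nsec,
    }
```

Pairing each commit time with the target block numbers from the matching descriptor block gives a per-transaction record of which metadata blocks changed and when.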
A revocation block starts with the standard journal_header_t (magic number, block type JBD2_REVOKE_BLOCK, and sequence number), followed by a 4-byte r_count field; together these form a 16-byte revoke header. A revocation block always occupies one full journal block (typically 4 KiB, matching the filesystem block size). After the 16-byte header comes a variable-length array of revoked block numbers (4 bytes or 8 bytes each, depending on whether 64-bit block numbers are enabled), and when journal checksums are enabled the very end of the block usually contains a checksum tail.
Offset | Size | Name | Description |
0x00 | 12 bytes | journal_header_t | Common block header |
0x0C | 4 bytes | r_count | Number of bytes used in this block. |
0x10 | 4 bytes or 8 bytes | blocks[0] | Blocks to revoke. |
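Reading the revoked block numbers follows directly from the layout above. A minimal sketch, assuming r_count counts bytes from the start of the block (so it includes the 16-byte header); parse_revoke_block is a hypothetical name, and is_64bit should come from the journal superblock's incompatible feature set.

```python
import struct

JBD2_MAGIC = 0xC03B3998
JBD2_REVOKE_BLOCK = 5


def parse_revoke_block(block: bytes, is_64bit: bool):
    """Return (sequence, [revoked fs block numbers]) from a revoke block."""
    magic, blocktype, sequence = struct.unpack_from(">III", block, 0x00)
    if magic != JBD2_MAGIC or blocktype != JBD2_REVOKE_BLOCK:
        raise ValueError("not a JBD2 revoke block")
    (r_count,) = struct.unpack_from(">I", block, 0x0C)
    width, fmt = (8, ">Q") if is_64bit else (4, ">I")
    end = min(r_count, len(block))  # never read past the block itself
    revoked = [
        struct.unpack_from(fmt, block, off)[0]
        for off in range(0x10, end - width + 1, width)
    ]
    return sequence, revoked
```

The returned sequence number and block list are exactly what a recovery pass needs to build its revoke table.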
A revoke applies to transactions whose sequence number is equal to or less than the sequence number of the revoke record. Let’s look at a journal from a file system. We can view the contents using debugfs as shown below.
We see the signature (0xC03B3998) in bytes 0x00-0x03, and bytes 0x04-0x07 show that this block has a type of 4, which is the version 2 superblock. Bytes 0x08-0x0B show the sequence number is 0, and bytes 0x0C-0x0F show the journal block size is 1,024 bytes (0x0400). Bytes 0x10-0x13 show that there are 4,096 blocks in the journal, and 0x14-0x17 show that the journal entries start in journal block 1. To identify the first transaction in the journal, we refer to bytes 0x18-0x1B to see that the first sequence number is 5 (0x0005), and bytes 0x1C-0x1F show that it is in block 1. If this start field were 0, it would mean that the file system was cleanly unmounted and all transactions are complete; hence, there would be no valid transactions in the journal. In bytes 0x28-0x2B, we see that the incompatible features value is 0x12 (0x02 + 0x10). This means that the JBD2_FEATURE_INCOMPAT_64BIT and JBD2_FEATURE_INCOMPAT_CSUM_V3 features are set. Thus, the journal_block_tag3_s tag structure will be used in descriptor blocks. At bytes 0x30-0x3F, we see the 128-bit UUID for the journal: 7B4747CF2AEB4961A9A9C58C7652701D. At bytes 0xFC-0xFF, we see the checksum of the superblock: 0x497D.
We now examine the contents of journal block 1. Keep in mind that this is not file system block 1; this is the block inside the journal file. We can view this with jcat as shown below.
We see from the type value in bytes 0x04-0x07 that this is a descriptor block, and its sequence number is 5 (0x05). The first descriptor entry (tag) starts at byte 0x0C, and we can observe that the filesystem block being modified, and hence copied to the journal, is 292 (0x124). The flags field in bytes 0x10-0x13 is 0x00, which means that no special flags are set for this tag: the journaled block is an ordinary copy without any additional conditions or modifiers. To examine the commit block next, we need to locate the commit block that carries the same sequence number, 5, as follows.
We can examine journal block 5 for the commit block as seen in the figure below.
Bytes 0x04-0x07 show us that it is a commit block (0x02), and bytes 0x08-0x0B show us that its sequence number is 5 (0x05). Having the same sequence number 5 in the journal superblock, descriptor, and commit block is normal and healthy. It indicates that transaction 5 is properly recorded as a complete, committed transaction. The jls tool in TSK will display the contents of the journal. Here is the output of our forensic image.






