File Signature And Hash Analysis

A signature analysis is a process where files headers and extensions are compared with a known database of file headers and extensions in an attempt to verify all files on the storage media and discover those which may be hidden. In order to fully understand the usefulness of signature analysis, this post gives an introduction to the structure of computer files and how such files can be hidden. Then, a demonstration would be articulated to show how signature analysis can be used to defeat such data hiding techniques

Understanding The Structure Of A File

Since data are stored on computers as files, all of these files must be searched and examined as if they were files in an office for the purpose to gather digital evidence. In order to understand the process of data hiding, one must first understand the structure of a computer file. The structure of a file normally consists of:

  • Filename
  • File header/footer
  • File content

 Filename 

The filename is a unique identifier which allows the computer to correctly identify each file stored on the disk. The first piece of information on the file format is given on the name of the file e.g. report.doc. Different applications use different scheme to encode data on files that describes how information is stored within the file. This schema is called the file format or file extension. The part “.doc” is the filename extension. It is used by many modern operating systems such as Mac OS X and Windows to determine the format of the file and associate a list of application of which the file is compatible with.

To see a list of common file extensions organized by file format, go to this site. If you come across a file type and you do not know how to open it, you can check here. This site lists most file extensions along with the needed free program to open each one.

 

File Header/Footer  

A file signature is a sequence of bytes in the header at the beginning of each file. It is not obligatory, for instance text files do not have it. Most common files used in computer systems may consist of a header, content data and a footer.  Customarily, the header stores information specific to the file format immediately after the signature. Metadata can indicate parameters such as file size, data format, software version that was used to generate the file, and the like.

The information which describes the type of the file, e.g. which application the file is associated with,  is stored in the header or footer (or both) within a file. Such information is called the signature of the file or file signature and they are most often unique to one another. The file extension and the file signature of each file should match each other in most cases but there are a few exceptions. There can be mismatches, no match, unknown types and anomalous results.

The file header is located within the first 20 bytes of the file while the footer is located with the last 20 bytes of the file. Files of a particular type can be searched for using the information stored in the file header alone. Such information can be easily obtained by opening files using a hex editor. A hex editor allows users to see and edit the raw and exact contents stored in a file. Below is a PDF file opened using a hex editor:

 

 

The header is highlighted in red in the above figure. In fact, the official signature of a PDF document is the first four hexadecimal values, 25 50 44 46. A file signature analysis makes use of a more extensive list of such file signatures to detect file tampering. This process is detailed later in this post.

To see a list of 518 file signature, you can go to this site, where you can use site’s search functionality to query the database for a particular file signature. Remember to omit the point when you type the extension letters in the search text box. If you did not find the signature you are searching for on this site, point your browser to Garry Kessler’s site and try searching for it again.

 

File Content

Since the purpose of the file is to store data for different applications, the major content of a file is the data from the associated application. Such data maybe encoded in various different ways so that only the eligible applications are compatible and competent to read and extract the content of the file.

 

Data Hiding Methods

The most common ways in which data can be hidden on computers are: 

  • Hiding document inside another by changing filename extension.
  • Deleting the data 

Changing Filename Extension

Some information may be of such value that criminals may not want to delete them. Instead, they opt to hide them in such a way that the files may appear to other users as another type of document and only the criminal can retrieve the original document.  A criminal can change the file extension and store it inside the C:\ Windows\System32 directory. For example, if we want to hide an MS Word® file called report.doc, we can change its name to MSDOS386.dll and store it inside the System32 directory. The System32 directory is full of DLL files and Windows® by default changes the icon and default associated program used to open the file according to its new extension, thus making it nearly impossible for a casual user to track it down.

Another way of hiding a file in Windows is to merge a file with another one of a different format. In Windows, one can easily merge a file (say, a document) with another (say, an image) by following the steps outlined below:

  • I will first create the file that I need to hide. For this experiment, I will create a DOC file and name it secret.doc.
  • Next, I will create a folder with a name secretFolder.doc. I will put my newly created DOC file inside it and then zip the folder (it is now named secretFolder.zip).
  • Now select an image and place it in the directory where the zip file is stored. Any picture will do the job; JPG and PNG are common types.
  • Use the copy command in DOS, combined with the /b switch for the purpose of handling the file as a binary file as shown below

copy /b joseph.jpg+secretFolder.zip instagram.jpg 

The + symbol is used to combine the two files. Here I have combined my picture named joseph.jpg with the compressed file I already created (SecretFolder.zip) and output the result with a new name, instagram.jpg. The last image file will contain my archived file.

 

  

In order to retrieve hidden information inside the zipped folder we need to rename the extension of our newly created picture to zip and then open it using any compressed utility we have. Alternatively we can simply right-click over the image then open it using WinRAR or 7-zip program without renaming it to view the hidden contents

Deleting Data

This is by far the most common way of hiding data. Truly speaking, the user’s original intention was probably to delete and destroy the data rather than hiding the data. However, deleted data are not “deleted” in a way that the data would be destroyed and irrecoverable at an instant.

So, how are files deleted? When a file is created, a directory entry for the file is also created. When that file is deleted (not deposited in a recycle bin like the one in Windows), the first character of the filename in the directory entry is changed to a special character (represented in hexadecimal as E5). Then, a search in the File Allocation Table for any entries with this filename is carried out and any entries found would be cleared. This process is simply a notification to the memory management unit that the memory allocated to this filename becomes available and can be reallocated to other processes. Until these memory slots are reallocated and overwritten by new data, the original data which was stored in the file would remain on disk and theoretically speaking, it can reside on the disk forever.  

                                 

File Signature Analysis

The comparison of a signature with file extension allows to quickly identify files with an invalid (changed) extension. For example, while downloading files a web browser first creates a temporary file, such files can go unnoticed if we take into account only file extensions.

In my case study, I have an image named joseph.jpg which I have renamed and had its extension changed to contactList.doc Please note that data hiding using this technique does not affect the integrity of the original document as the copy /directory command is not used. However, when the DOC file is double-clicked, no document is opened and a warning message “The file appears to be damaged or corrupted” is displayed instead.

To detect a mismatch between the file signature and the file extension, use a free tool called Hex Browser.

To use this tool, follow these simple steps:
  •  Go to this site and download the Hex Browser
  • Click the Open button in the main program menu, select the suspect file, and you are done.
  • See the results on the right pane of the program window.
 


This indicates that the file is suspicious.  The next step is to investigate what is hidden underneath the doc file. From the file signature found in the doc file, it can be seen that the original document could in fact be a JPEG file. The forensic examiner can simply rename the doc file from contactList to whatever the examiner wants, and change the file’s extension from .doc to .jpeg After changing the file’s extension, the examiner can open the image file and see its content.


Autopsy has the ability to discover file extension mismatches; to use this feature, you have to enable the “Extension Mismatch Detector” module. You can further configure file mismatch search options by going to the Tools menu Options File Extension Mismatch. From here, you can add or remove extensions based on your case need and the results are shown in the Results tree under “Extension Mismatch Detected”.


In case of deletion in NTFS partition and the loss of the corresponding $MFT information (Master File Table), the deleted file is no longer visible to the system. The file data can be 100% complete, however, they are stored in the unallocated space. One method for retrieving a file in this case is to look for its signature. This involves searching the entire unallocated space. In order to retrieve the file, the moment the signature is found, it should be extracted with some data following it. Unfortunately, in many cases the header files do not store information about the file size. Therefore, the amount of data to extract is determined arbitrarily by a specialist before starting the retrieval procedure. 


File fragmentation is yet another issue. If you find a fragmented file signature you will find its first block and the remaining parts will be lost (see Figure below). Since the size of the fragments is not known, we do not know how many clusters after signature need to be recovered. 



The analysis of the unallocated space for signature causes a number of other problems. The most serious problem are the so-called false positives. Since the signature is short, there is a high probability that during the search an array of bytes will be found which has the value of the signature in question, but which is not the signature itself.  

One solution in such a situation is to make use of the characteristic structure of a header and information stored in it (e.g. software version) and thus to "artificially" increase the header length. Such an operation requires tracking changes in each new version of the program, as later distributions may differ in information stored in the headers.


However, with millions of files stored on the hard disk and some of which are system files, a process can be carried out before File Signature Analysis is carried. This process is called hash analysis.


Hash Analysis

Hash analysis works by comparing hashes of files from disk with a list of predefined file hashes. Those which match can be of two categories, known and notable. This process simply automates the process of finding those files which can be ignored e.g. typical system files and those which can be of evidentiary value e.g. Internet browser history files

In doing so, we assume that we have a large number of files in custody, either good (trustworthy ones) or bad (inappropriate or harmful ones). Then, databases are created to include the hash values of all the known files, also known as hash databases. Afterwards, the forensic analyst would use the hash function (e.g. MD5 hash function) to generate a hash value of the file and compare it to a hash database.

A good hash database to reference is NIST National Software Reference Library (NSRL), which contains hashes of commercial software packages and files found in operating systems and software distributions. These files are known to be good since they came from trusted sources and are typically found on authorized systems. The analyst would calculate the MD5 hash of the given file and search the md5 hash databases. If there is a hit, it means the given file is known to investigator.

A cryptographically secure hash function h(x) must satisfy the following properties:

  • It takes arbitrary length of data and produces a fixed-length output;
  •  Given a message m of arbitrary length, there is a computationally efficient way to calculate h(m) but it’s infeasible to go the other way. It is also known as one-way or pre-image resistance property;
  • Given a hash value y, it is computationally infeasible to find a message m with h(m) = y. It is also known as second pre-image resistant or weak collision resistance property; 
  • It is computationally infeasible to find any pair of messages m1 and m2 such that h(m1) = h(m2). It is also known as collision resistant or strong collision resistance property;

Hash Analysis Process

The file signature search typically follows a process where an analyst acquires a questionable amount of files. This could be hundreds of thousands of files, and going through manually in order to deduce whether it’s good or bad would be impossible. Instead, we can simply calculate the hash values of these files being investigated.

Further, we assume that we have already created two hash databases, one containing the hash values of all the known good files, known as Known Database, and another including the hash values of all the known bad files, known as Notable Database. Then, we search both Known Database and Notable Database for the hash values we calculated for the files being investigated. If the search finds a hit in the Notable Database, there is a known bad file among the files under investigation. In other words, more attentions must be paid and further investigation is required. Nevertheless, if there is a hit in the Known Database, it means a good file which has been found so we simply ignore it. As a result, we can quickly filter many known files out of a large amount number of files being investigated. There are two major benefits of using hash analysis.

  1. It is deliberately difficult to reconstruct two different messages having the same hash
  2. We do not have to sift through millions of files found on the hard disk used by a suspect. Instead we can narrow down to a small number of files which are still unknown to us. There are many system files and the ones resulted from software installations, and these files can be quickly identified and excluded or ignored. By so doing, our precious time and effort can be put into investigating these unknown files.

There are two common hash functions used to generate hashes (or signatures) of files in forensic investigation, which are md5 and sha-1. The investigator would create two hash databases (common source is from NIST National Software Reference Library), one with repository of known software, file profile, and file signature dubbed “Known Database”; and a second one with repository of known bad software, file profile, and file signature dubbed “Notable Database”. The analyst would then create a hash value for the suspicious file being investigated, and traverse through the hash databases. If there is a hit from the Known database, then the analyst knows the file is good. If there is a hit from the Notable database, then the analyst knows the file is bad and further investigation and analysis is required on the suspect’s hard drive.


Create Hash Database Using md5sum

Suppose that all the files under /usr/bin (not including files in the subfolders) are good, create a hash database by navigating to the directory and using the following command.

md5sum * > /usr/known.db  

Where “known.db” is the created hash database file to be considered “Known Database” in our example.

Note that md5sum may complain by saying an entity in the specified folder is a directory and you can simply ignore these warnings. The figure below shows part of the file from line to line, each line containing a 32-hex-character MD5 hash and its corresponding file of which the hash is calculated.

 


Create An md5 Index File For Hash Database

Next we use a TSK tool, hfind, which uses a binary search algorithm to lookup hashes in a hash database, for example NIST NSRL, “known.db” created above. Thus, an index must be created to do faster searching through the database, which is important if using large databases. The following command will create a MD5 index using the newly created MD5 hash database.

hfind –i md5sum known.db



An index file called “known.db-md5.idx” will be created and can be found in the current folder. Now, we can perform file signature search through the indexed database “known. db”.


Search Database For A Given Hash Value

Suppose we have a file /home/joseph/file.c in custody and want to know whether it is good. First, we need to calculate the hash value of the file /home/joseph/file.c using the following command

md5sum /home/joseph/Desktop/file.c  


Then, we do hash analysis through the database “known.db” using the following command.

hfind known.db b7dce75c411b2324bb16f2a0c01d5558  

 


In the above example, we can conclude that the file /home/joseph/Desktop/file.c is not a known good file. In other words, a further investigation is needed for the file. Or, you may see the following output from the command below.



Now it means the file under investigation is a known good file, which is aircrack-ng.

The basic requirement for hash analysis is the availability of hash values for the file in question. You cannot find the file that is not known, which significantly narrows the usefulness of this method. If the file in question has been changed (e.g. a JPG image was scaled or the range of colours was changed), this method will not provide satisfactory results. Even if the change to a text in a MS Office 2007 document (these are compressed files) was very little, this method will fail.

1 Comments

  1. Great information! Thanks for working on this

    ReplyDelete

Post a Comment

Previous Post Next Post