A signature analysis is a process where files headers and extensions are compared with a known database of file headers and extensions in an attempt to verify all files on the storage media and discover those which may be hidden. In order to fully understand the usefulness of signature analysis, this post gives an introduction to the structure of computer files and how such files can be hidden. Then, a demonstration would be articulated to show how signature analysis can be used to defeat such data hiding techniques

Understanding The Structure Of A File

Since data are stored on computers as files, all of these files must be searched and examined as if they were files in an office for the purpose to gather digital evidence. In order to understand the process of data hiding, one must first understand the structure of a computer file. The structure of a file normally consists of:

Filename
File header/footer
File content

Filename

The filename is a unique identifier which allows the computer to correctly identify each file stored on the disk. The first piece of information on the file format is given on the name of the file e.g. report.doc. Different applications use different scheme to encode data on files that describes how information is stored within the file. This schema is called the file format or file extension. The part “.doc” is the filename extension. It is used by many modern operating systems such as Mac OS X and Windows to determine the format of the file and associate a list of application of which the file is compatible with.

To see a list of common file extensions organized by file format, go to this site. If you come across a file type and you do not know how to open it, you can check here. This site lists most file extensions along with the needed free program to open each one.

File Header/Footer

A file signature is a sequence of bytes in the header at the beginning of each file. It is not obligatory, for instance text files do not have it. Most common files used in computer systems may consist of a header, content data and a footer. Customarily, the header stores information specific to the file format immediately after the signature. Metadata can indicate parameters such as file size, data format, software version that was used to generate the file, and the like.

The information which describes the type of the file, e.g. which application the file is associated with, is stored in the header or footer (or both) within a file. Such information is called the signature of the file or file signature and they are most often unique to one another. The file extension and the file signature of each file should match each other in most cases but there are a few exceptions. There can be mismatches, no match, unknown types and anomalous results.

The file header is located within the first 20 bytes of the file while the footer is located with the last 20 bytes of the file. Files of a particular type can be searched for using the information stored in the file header alone. Such information can be easily obtained by opening files using a hex editor. A hex editor allows users to see and edit the raw and exact contents stored in a file. Below is a PDF file opened using a hex editor:

The header is highlighted in red in the above figure. In fact, the official signature of a PDF document is the first four hexadecimal values, 25 50 44 46. A file signature analysis makes use of a more extensive list of such file signatures to detect file tampering. This process is detailed later in this post.

To see a list of 518 file signature, you can go to this site, where you can use site’s search functionality to query the database for a particular file signature. Remember to omit the point when you type the extension letters in the search text box. If you did not find the signature you are searching for on this site, point your browser to Garry Kessler’s site and try searching for it again.

File Content

Since the purpose of the file is to store data for different applications, the major content of a file is the data from the associated application. Such data maybe encoded in various different ways so that only the eligible applications are compatible and competent to read and extract the content of the file.

Data Hiding Methods

The most common ways in which data can be hidden on computers are:

Hiding document inside another by changing filename extension.
Deleting the data

Changing Filename Extension

Some information may be of such value that criminals may not want to delete them. Instead, they opt to hide them in such a way that the files may appear to other users as another type of document and only the criminal can retrieve the original document. A criminal can change the file extension and store it inside the C:\ Windows\System32 directory. For example, if we want to hide an MS Word® file called report.doc, we can change its name to MSDOS386.dll and store it inside the System32 directory. The System32 directory is full of DLL files and Windows® by default changes the icon and default associated program used to open the file according to its new extension, thus making it nearly impossible for a casual user to track it down.

Another way of hiding a file in Windows is to merge a file with another one of a different format. In Windows, one can easily merge a file (say, a document) with another (say, an image) by following the steps outlined below:

I will first create the file that I need to hide. For this experiment, I will create a DOC file and name it secret.doc.
Next, I will create a folder with a name secretFolder.doc. I will put my newly created DOC file inside it and then zip the folder (it is now named secretFolder.zip).
Now select an image and place it in the directory where the zip file is stored. Any picture will do the job; JPG and PNG are common types.
Use the copy command in DOS, combined with the /b switch for the purpose of handling the file as a binary file as shown below

copy /b joseph.jpg+secretFolder.zip instagram.jpg

The + symbol is used to combine the two files. Here I have combined my picture named joseph.jpg with the compressed file I already created (SecretFolder.zip) and output the result with a new name, instagram.jpg. The last image file will contain my archived file.

In order to retrieve hidden information inside the zipped folder we need to rename the extension of our newly created picture to zip and then open it using any compressed utility we have. Alternatively we can simply right-click over the image then open it using WinRAR or 7-zip program without renaming it to view the hidden contents

Deleting Data

This is by far the most common way of hiding data. Truly speaking, the user’s original intention was probably to delete and destroy the data rather than hiding the data. However, deleted data are not “deleted” in a way that the data would be destroyed and irrecoverable at an instant.

So, how are files deleted? When a file is created, a directory entry for the file is also created. When that file is deleted (not deposited in a recycle bin like the one in Windows), the first character of the filename in the directory entry is changed to a special character (represented in hexadecimal as E5). Then, a search in the File Allocation Table for any entries with this filename is carried out and any entries found would be cleared. This process is simply a notification to the memory management unit that the memory allocated to this filename becomes available and can be reallocated to other processes. Until these memory slots are reallocated and overwritten by new data, the original data which was stored in the file would remain on disk and theoretically speaking, it can reside on the disk forever.

File Signature Analysis

The comparison of a signature with file extension allows to quickly identify files with an invalid (changed) extension. For example, while downloading files a web browser first creates a temporary file, such files can go unnoticed if we take into account only file extensions.

In my case study, I have an image named joseph.jpg which I have renamed and had its extension changed to contactList.doc Please note that data hiding using this technique does not affect the integrity of the original document as the copy /directory command is not used. However, when the DOC file is double-clicked, no document is opened and a warning message “The file appears to be damaged or corrupted” is displayed instead.

To detect a mismatch between the file signature and the file extension, use a free tool called Hex Browser.

To use this tool, follow these simple steps:

Go to this site and download the Hex Browser
Click the Open button in the main program menu, select the suspect file, and you are done.
See the results on the right pane of the program window.

This indicates that the file is suspicious. The next step is to investigate what is hidden underneath the doc file. From the file signature found in the doc file, it can be seen that the original document could in fact be a JPEG file. The forensic examiner can simply rename the doc file from contactList to whatever the examiner wants, and change the file’s extension from .doc to .jpeg After changing the file’s extension, the examiner can open the image file and see its content.

Autopsy has the ability to discover file extension mismatches; to use this feature, you have to enable the “Extension Mismatch Detector” module. You can further configure file mismatch search options by going to the Tools menu ➤ Options ➤ File Extension Mismatch. From here, you can add or remove extensions based on your case need and the results are shown in the Results tree under “Extension Mismatch Detected”.

In case of deletion in NTFS partition and the loss of the corresponding $MFT information (Master File Table), the deleted file is no longer visible to the system. The file data can be 100% complete, however, they are stored in the unallocated space. One method for retrieving a file in this case is to look for its signature. This involves searching the entire unallocated space. In order to retrieve the file, the moment the signature is found, it should be extracted with some data following it. Unfortunately, in many cases the header files do not store information about the file size. Therefore, the amount of data to extract is determined arbitrarily by a specialist before starting the retrieval procedure.

File fragmentation is yet another issue. If you find a fragmented file signature you will find its first block and the remaining parts will be lost (see Figure below). Since the size of the fragments is not known, we do not know how many clusters after signature need to be recovered.

The analysis of the unallocated space for signature causes a number of other problems. The most serious problem are the so-called false positives. Since the signature is short, there is a high probability that during the search an array of bytes will be found which has the value of the signature in question, but which is not the signature itself.

One solution in such a situation is to make use of the characteristic structure of a header and information stored in it (e.g. software version) and thus to "artificially" increase the header length. Such an operation requires tracking changes in each new version of the program, as later distributions may differ in information stored in the headers.

However, with millions of files stored on the hard disk and some of which are system files, a process can be carried out before File Signature Analysis is carried. This process is called hash analysis.

Hash Analysis

Hash analysis works by comparing hashes of files from disk with a list of predefined file hashes. Those which match can be of two categories, known and notable. This process simply automates the process of finding those files which can be ignored e.g. typical system files and those which can be of evidentiary value e.g. Internet browser history files

In doing so, we assume that we have a large number of ﬁles in custody, either good (trustworthy ones) or bad (inappropriate or harmful ones). Then, databases are created to include the hash values of all the known ﬁles, also known as hash databases. Afterwards, the forensic analyst would use the hash function (e.g. MD5 hash function) to generate a hash value of the ﬁle and compare it to a hash database.

A good hash database to reference is NIST National Software Reference Library (NSRL), which contains hashes of commercial software packages and ﬁles found in operating systems and software distributions. These ﬁles are known to be good since they came from trusted sources and are typically found on authorized systems. The analyst would calculate the MD5 hash of the given ﬁle and search the md5 hash databases. If there is a hit, it means the given ﬁle is known to investigator.

A cryptographically secure hash function h(x) must satisfy the following properties:

It takes arbitrary length of data and produces a ﬁxed-length output;
Given a message m of arbitrary length, there is a computationally efﬁcient way to calculate h(m) but it’s infeasible to go the other way. It is also known as one-way or pre-image resistance property;
Given a hash value y, it is computationally infeasible to ﬁnd a message m with h(m) = y. It is also known as second pre-image resistant or weak collision resistance property;
It is computationally infeasible to ﬁnd any pair of messages m1 and m2 such that h(m1) = h(m2). It is also known as collision resistant or strong collision resistance property;

Hash Analysis Process

The ﬁle signature search typically follows a process where an analyst acquires a questionable amount of ﬁles. This could be hundreds of thousands of ﬁles, and going through manually in order to deduce whether it’s good or bad would be impossible. Instead, we can simply calculate the hash values of these ﬁles being investigated.

Further, we assume that we have already created two hash databases, one containing the hash values of all the known good ﬁles, known as Known Database, and another including the hash values of all the known bad ﬁles, known as Notable Database. Then, we search both Known Database and Notable Database for the hash values we calculated for the ﬁles being investigated. If the search ﬁnds a hit in the Notable Database, there is a known bad ﬁle among the ﬁles under investigation. In other words, more attentions must be paid and further investigation is required. Nevertheless, if there is a hit in the Known Database, it means a good ﬁle which has been found so we simply ignore it. As a result, we can quickly ﬁlter many known ﬁles out of a large amount number of ﬁles being investigated. There are two major benefits of using hash analysis.

It is deliberately difficult to reconstruct two different messages having the same hash
We do not have to sift through millions of files found on the hard disk used by a suspect. Instead we can narrow down to a small number of files which are still unknown to us. There are many system files and the ones resulted from software installations, and these files can be quickly identified and excluded or ignored. By so doing, our precious time and effort can be put into investigating these unknown files.

There are two common hash functions used to generate hashes (or signatures) of ﬁles in forensic investigation, which are md5 and sha-1. The investigator would create two hash databases (common source is from NIST National Software Reference Library), one with repository of known software, ﬁle proﬁle, and ﬁle signature dubbed “Known Database”; and a second one with repository of known bad software, ﬁle proﬁle, and ﬁle signature dubbed “Notable Database”. The analyst would then create a hash value for the suspicious ﬁle being investigated, and traverse through the hash databases. If there is a hit from the Known database, then the analyst knows the ﬁle is good. If there is a hit from the Notable database, then the analyst knows the ﬁle is bad and further investigation and analysis is required on the suspect’s hard drive.

Create Hash Database Using md5sum

Suppose that all the ﬁles under /usr/bin (not including ﬁles in the subfolders) are good, create a hash database by navigating to the directory and using the following command.

md5sum * > /usr/known.db

Where “known.db” is the created hash database ﬁle to be considered “Known Database” in our example.

Note that md5sum may complain by saying an entity in the speciﬁed folder is a directory and you can simply ignore these warnings. The figure below shows part of the ﬁle from line to line, each line containing a 32-hex-character MD5 hash and its corresponding ﬁle of which the hash is calculated.

Create An md5 Index File For Hash Database

Next we use a TSK tool, hﬁnd, which uses a binary search algorithm to lookup hashes in a hash database, for example NIST NSRL, “known.db” created above. Thus, an index must be created to do faster searching through the database, which is important if using large databases. The following command will create a MD5 index using the newly created MD5 hash database.

hfind –i md5sum known.db

An index ﬁle called “known.db-md5.idx” will be created and can be found in the current folder. Now, we can perform ﬁle signature search through the indexed database “known. db”.

Search Database For A Given Hash Value

Suppose we have a ﬁle /home/joseph/file.c in custody and want to know whether it is good. First, we need to calculate the hash value of the ﬁle /home/joseph/file.c using the following command

md5sum /home/joseph/Desktop/file.c

Then, we do hash analysis through the database “known.db” using the following command.

hfind known.db b7dce75c411b2324bb16f2a0c01d5558

In the above example, we can conclude that the ﬁle /home/joseph/Desktop/file.c is not a known good ﬁle. In other words, a further investigation is needed for the ﬁle. Or, you may see the following output from the command below.

Now it means the ﬁle under investigation is a known good ﬁle, which is aircrack-ng.

The basic requirement for hash analysis is the availability of hash values for the file in question. You cannot find the file that is not known, which significantly narrows the usefulness of this method. If the file in question has been changed (e.g. a JPG image was scaled or the range of colours was changed), this method will not provide satisfactory results. Even if the change to a text in a MS Office 2007 document (these are compressed files) was very little, this method will fail.

Facebook SDK

File Signature And Hash Analysis