Regular Expressions for Digital Forensics

During a forensic examination, investigators use keywords to find exact string matches and regular expressions to find strings that match a pattern. This post presents a methodology for formulating regular expressions that find patterns of data.


Regular Expressions

Regular Expressions (RegEx) are used to search for data that matches a pattern. To use a regular expression to locate data during a forensic examination, the forensic tool must have a regular expression engine that performs the search. Some tools also refer to regular expressions as grep expressions. GREP stands for Globally search a Regular Expression and Print. The grep command is a Unix/Linux tool that searches input files (or standard input when no file is named) for lines that match a regular expression and prints those lines. Its basic usage is to search the specified file(s) for a string described by a regular expression.

Some terms commonly associated with regular expressions include:

  • Literal Characters - A character the regular expression sees exactly as it is typed. The regular expression engine is looking for that character.
  • Character Class - A set or range of characters written in square brackets, such as [a-z], [A-Z], or [0-9]. The regular expression engine looks for any single character in the specified set.
  • Group - A sub-expression enclosed in parentheses. The regular expression engine treats the group as a single unit, so a repetition or alternation can apply to the whole group.
  • Metacharacters or Special Characters - Characters with special meaning to the regular expression engine. The regular expression engine interprets the special character based on its special meaning.
  • Escaping - A means to instruct the regular expression engine to ignore the special meaning of a metacharacter and instead look for that character.

Combining literal characters with character classes and groups in conjunction with metacharacters is what creates a regular expression.
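As a minimal sketch of how these pieces combine, the Python snippet below builds one pattern from literal characters, a character class, a metacharacter, an escaped character, and a group. The file-name pattern and the sample names are made up for illustration, and Python's re module stands in for whatever regular expression engine the forensic tool provides.

```python
import re

# Hypothetical pattern: evidence file names such as "case-07.img" or "case-112.dd".
pattern = re.compile(
    r"case-"      # literal characters, matched exactly as typed
    r"[0-9]+"     # character class [0-9] with the + metacharacter (one or more)
    r"\."         # escaped dot: a literal period, not "any single character"
    r"(img|dd)"   # group holding two alternatives
)

samples = ["case-07.img", "case-112.dd", "case-x.img", "case-07_img"]

# fullmatch() requires the whole string to fit the pattern.
hits = [s for s in samples if pattern.fullmatch(s)]
print(hits)
```

Only the first two sample names satisfy every component; the third fails the character class and the fourth fails the escaped period.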

Regular expressions come in two standard POSIX syntaxes, basic and extended, for creating patterns that look up a set of strings in a list of elements or verify that a given string follows a particular arrangement (for example, a postcode, email address, phone number, and so on). Basic Regular Expressions (BRE) and Extended Regular Expressions (ERE) work in largely the same way. However, BRE requires the meta-characters ( ) and { } to be written as \( \) and \{ \}, whereas ERE does not. ERE also introduces more meta-characters, including ?, +, and |.


REGULAR EXPRESSION METACHARACTERS

Metacharacter    Description

^                Matches the following item at the beginning of a text line
$                Matches the preceding item at the end of a text line
.                Matches any single character
[...]            A bracket expression. Matches a single character in the bracketed list or range
[^...]           Matches a single character that is not contained within the brackets
()               Defines a marked sub-expression, also called a block or capturing group. BRE mode requires \(\)
*                Matches the preceding item zero or more times
{m}              Matches the preceding item exactly m times. BRE mode requires \{m\}
{m,}             Matches the preceding item m or more times. BRE mode requires \{m,\}
{m,n}            Matches the preceding item at least m and not more than n times. BRE mode requires \{m,n\}
\                Escapes the special meaning of the next character


The next three meta-characters are only for extended regular expressions:


Metacharacter    Description

?                Matches the preceding character, meta-character, or expression zero or one time
+                Matches the preceding character, meta-character, or expression one or more times. There is no limit to the number of times it can be matched.
|                Matches the character, meta-character, or expression on either side of it


Note that to use the grep command to search for meta-characters, you have to use a backslash (\) to escape the meta-character. For example, the regular expression “^\.” matches lines that start with a period.
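The effect of the backslash can be checked in any regular expression engine. The sketch below uses Python's re module rather than grep, with made-up sample lines, to compare the escaped and unescaped period; re.escape() is a standard-library helper that adds the backslashes when a keyword must be searched literally.

```python
import re

lines = [".hidden_file", "visible.txt", "..", "a.b"]

# "^\." anchors at the start of the line and matches a literal period.
starts_with_period = [ln for ln in lines if re.search(r"^\.", ln)]

# Without the backslash, "." matches any single character,
# so every non-empty line is a hit.
unescaped = [ln for ln in lines if re.search(r"^.", ln)]

# re.escape() escapes the metacharacters in a keyword (here + and ?).
keyword = "1+1=2?"
found = re.search(re.escape(keyword), "note: 1+1=2? maybe") is not None

print(starts_with_period, unescaped, found)
```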

 

Escaped Alphabetical Characters

By default, the regular expression engine treats alphabetical characters literally; a hit is returned when the letter, word, or phrase is located. But when a backslash (\) precedes certain alphabetical characters, the engine treats the pair non-literally, either as a control character such as a line break or as a shortcut for a character class. The letters below have a special meaning when escaped.



Escaped Letter   Search Results

\t               Find a tab
\r               Find a carriage return
\n               Find a new line
\s               Find any whitespace character: a space, tab, carriage return, or line break. The same as [ \t\r\n\f\v] in most engines
\v               Find a vertical tab
\f               Find a form feed
\d               Find any digit including zero. The same as [0-9]
\w               Find any capital letter, lower-case letter, digit (including zero), or the underscore character. The same as [A-Za-z0-9_]
\b               Find a word boundary, the position between a word character and a non-word character. \bword\b finds "word" only as a whole word
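The shortcuts can be compared with their bracket equivalents directly. The following is a small Python sketch on a made-up sample string; it assumes the re.ASCII flag, because Python's \d, \w, and \s otherwise also match non-ASCII digits, letters, and whitespace.

```python
import re

text = "PID_42\tuser=root\r\nlast login: 2021-03-04"

# With re.ASCII the shortcuts are limited to the ASCII ranges shown.
digits = re.findall(r"\d", text, re.ASCII)
assert digits == re.findall(r"[0-9]", text)
assert re.findall(r"\w", text, re.ASCII) == re.findall(r"[A-Za-z0-9_]", text)
assert re.findall(r"\s", text, re.ASCII) == re.findall(r"[ \t\r\n\f\v]", text)

# \b is zero-width: it marks the edge of a word instead of matching a character,
# so "log" inside "login" or "syslog" is not a hit.
whole_word = re.findall(r"\blog\b", "log login syslog log.txt")
print(whole_word)
```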



The table below shows how to use grep, with examples.


Command                           Description

grep hackers files                search files for lines with “hackers”
grep -E 'hackers?' files          search files for lines with “hackers” or “hacker” (the ? metacharacter needs extended mode, -E)
grep '^hackers' files             “hackers” at the start of a line
grep 'hackers$' files             “hackers” at the end of a line
grep '^hackers$' files            lines containing only “hackers”
grep '[Hh]ackers' files           search for “Hackers” or “hackers”
grep '\^h' files                  search files for lines with “^h”; “\” escapes the ^
grep '^$' files                   search for blank lines
grep '[0-9][0-9][0-9]' files      search for triples of numeric digits
grep -f hack.txt files            the -f option names a file from which grep reads patterns. In this example, the search patterns are contained in a file called hack.txt, one per line
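For readers who want to try these patterns without a shell, here is a rough Python sketch of what grep does with a pattern and a list of lines. The grep() helper and the sample lines are hypothetical stand-ins, not part of any library, and the helper ignores all of grep's options.

```python
import re

def grep(pattern, lines):
    """Return the lines that contain a match for pattern, like plain grep."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]

lines = ["hackers", "Hackers unite", "a hacker", "", "the hackers"]

print(grep(r"hackers?", lines))    # "hacker" or "hackers" anywhere in the line
print(grep(r"^hackers", lines))    # "hackers" at the start of a line
print(grep(r"^hackers$", lines))   # lines containing only "hackers"
print(grep(r"[Hh]ackers", lines))  # "Hackers" or "hackers"
print(grep(r"^$", lines))          # blank lines
```

Python's re module treats ?, +, and | as metacharacters without any extra switch, which corresponds to grep's extended mode.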


The following sections consider a few use cases of regular expressions in digital forensics examinations.


Regex To Find Social Security Number

The description for this regular expression is given as follows:


Find string patterns that include any single digit from 0-9, followed by any single digit from 0-9, followed by any single digit from 0-9, then a dash, followed by any single digit from 0-9...[etc.]

The expression can be written in longhand, then simplified using shortcuts.


Longhand:

[0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9]

Simplified:

[0-9]{3}-[0-9]{2}-[0-9]{4}

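Further simplified, with \d in place of [0-9] and \b word boundaries so that the hit is not part of a longer run of digits: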

\b\d{3}-\d{2}-\d{4}\b
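A quick way to compare the three forms is to run them over the same text. The Python sketch below uses a made-up sentence; the first two patterns behave identically, while the \b version also rejects a hit buried inside a longer number.

```python
import re

long_hand = r"[0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9]"
simplified = r"[0-9]{3}-[0-9]{2}-[0-9]{4}"
bounded = r"\b\d{3}-\d{2}-\d{4}\b"

# Made-up sample: one SSN-shaped string and one longer order number.
text = "SSN 123-45-6789 on file; ref 9123-45-67890 is an order number."

loose_hits = re.findall(long_hand, text)
same_hits = re.findall(simplified, text)
bounded_hits = re.findall(bounded, text, re.ASCII)

print(loose_hits)    # includes the digits found inside the order number
print(bounded_hits)  # only the standalone value
```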


However, a given string must meet certain criteria as follows in order to be a valid Social Security Number (SSN):

  • It should have nine (9) digits
  • It should be divided into three (3) parts by hyphens (-)
  • The first part should have 3 digits and should not be 000, 666, or between 900 and 999.
  • The second part should have 2 digits and it should be from 01 to 99.
  • The third part should have 4 digits and it should be from 0001 to 9999

You may want to build these conditions into your regular expression to significantly increase the chance of valid hits. Given these criteria, the regular expression is formulated as follows:


^(?!666|000|9\d{2})\d{3}-(?!00)\d{2}-(?!0{4})\d{4}$
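Each (?!...) in this expression is a negative lookahead: it rejects the match if the text at that point is one of the excluded values, without consuming any characters. A minimal Python check, using made-up candidate strings that each break one rule:

```python
import re

SSN_STRICT = re.compile(r"^(?!666|000|9\d{2})\d{3}-(?!00)\d{2}-(?!0{4})\d{4}$")

candidates = [
    "123-45-6789",  # satisfies every rule
    "000-12-3456",  # first part is 000
    "666-12-3456",  # first part is 666
    "900-12-3456",  # first part is in 900-999
    "123-00-4567",  # second part is 00
    "123-45-0000",  # third part is 0000
    "123456789",    # no hyphens
]

valid = [c for c in candidates if SSN_STRICT.match(c)]
print(valid)
```

Because of the ^ and $ anchors, the expression tests a whole string or line; to find such values inside longer text, the anchors would need to be replaced, for example with \b.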


A breakdown of the expression using Regexper is shown below; it confirms that the expression matches Social Security Numbers (SSNs).





Regex To Find Phone Number

The following regular expression locates phone numbers with an optional country code (a plus sign and one to three digits), an area code with or without parentheses, and digit groups that are run together or separated by a dash or a space. It optionally locates an extension: one or more digits preceded by ext, Ext, x, or X, where ext or Ext may be followed by a colon or a period.


(\+\d{1,3}\s?)?((\(\d{3}\)\s?)|(\d{3})(\s|-?))(\d{3}(\s|-?))(\d{4})(\s?(([Ee]xt[:.]?)|x|X)(\s?\d+))?


A breakdown of the expression is shown below; it confirms that the expression matches phone numbers.




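The following regular expression is also valid. It makes the area code optional, accepts a dash, dot, or space as a separator, and requires an extension, if present, to have two to five digits preceded by ext, Ext, x, or X, with an optional period: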


(((\(\d{3}\)|\d{3})[-.\s])|(\(\d{3}\)|\d{3}))?\d{3}[-.\s]?\d{4}([-.\s]?([Ee]xt|[Xx])[.]?[-.\s]?\d{2,5})?
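As a rough check of this second expression, the Python sketch below runs it over a made-up sentence containing three fictitious numbers. The sample text is an assumption for illustration only.

```python
import re

PHONE = re.compile(
    r"(((\(\d{3}\)|\d{3})[-.\s])|(\(\d{3}\)|\d{3}))?\d{3}[-.\s]?\d{4}"
    r"([-.\s]?([Ee]xt|[Xx])[.]?[-.\s]?\d{2,5})?"
)

text = "Call 555-123-4567 ext. 123, (555) 987-6543 or 555.222.3333x42."

# The pattern contains capturing groups, so finditer() with group(0) is used
# to get each whole match; findall() would return tuples of the groups.
numbers = [m.group(0) for m in PHONE.finditer(text)]
print(numbers)
```

The expression has no word boundaries, so on real evidence it can also hit inside longer runs of digits; wrapping it in \b or post-filtering the hits may be worth considering.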


A good regular expression to match international numbers is given below:


((?:\+|00)[17](?: |\-)?|(?:\+|00)[1-9]\d{0,2}(?: |\-)?|(?:\+|00)1\-\d{3}(?: |\-)?)?(0\d|\([0-9]{3}\)|[1-9]{0,3})(?:((?: |\-)[0-9]{2}){4}|((?:[09]{2}){4})|((?: |\-)[0-9]{3}(?: |\-)[0-9]{4})|([0-9]{7}))

 

Regex To Find Email Addresses

To find email addresses in either the allocated space or the unallocated space, a suitable regular expression is:


[A-Za-z0-9._%+-]+(%40|@)[A-Za-z0-9.-]+\.[A-Za-z]{2,4}


The @ character may appear URL-encoded as %40 (%20 is the encoding of a space) in data recovered from the unallocated space. Therefore, if the search for email addresses is intended to be made in the allocated space only, the (%40|) alternative can be safely omitted:


[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}


If the search is intended for URL-encoded addresses in the unallocated space only, the regular expression can be written as follows:


[A-Za-z0-9._%+-]+%40[A-Za-z0-9.-]+\.[A-Za-z]{2,4}


To search for email addresses of a specific domain (for example gmail.com) in either the allocated or unallocated space, the regular expression is:


[A-Za-z0-9._%+-]+(%40|@)gmail\.com


A more general regular expression for email addresses, which I use in my investigations of allocated and unallocated space, is given below:


^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$
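Because this expression is anchored with ^ and $, it tests a whole string rather than finding addresses embedded in longer text. One natural use is filtering candidate strings that have already been extracted one per line. A minimal Python sketch, assuming a made-up list of candidates:

```python
import re

EMAIL = re.compile(
    r"^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?"
    r"(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$"
)

# Hypothetical candidate strings, e.g. one per line from a strings-style tool.
candidates = [
    "alice@example.com",
    "bob.smith@mail.example.org",
    "no-at-sign.example.com",   # no @ character
    "carol@",                   # nothing after the @
    "@example.com",             # nothing before the @
]

emails = [c for c in candidates if EMAIL.match(c)]
print(emails)
```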


Regex To Find Web Addresses

To return a basic URL from the search result, the following regular expression might suffice:


www\.\w+\.com


Because web addresses do not always include www, consider grouping www\. in parentheses and placing the ? repetition metacharacter just after the group. This tells the search function to return a hit whether www. occurs zero times or one time. The regular expression for this will be:


(www\.)?\w+\.com
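The difference between the two expressions shows up as soon as an address without www appears. A short Python sketch on a made-up line of text:

```python
import re

text = "Visited www.example.com, then shop.com and ftp.example.org"

# finditer() with group(0) returns each whole match; findall() would return
# only the contents of the (www\.) group for the second pattern.
with_www = [m.group(0) for m in re.finditer(r"www\.\w+\.com", text)]
optional_www = [m.group(0) for m in re.finditer(r"(www\.)?\w+\.com", text)]

print(with_www)
print(optional_www)
```

Neither expression hits on ftp.example.org, since both end with a literal .com.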


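The following regular expression matches addresses that start with “http://”, “https://”, or neither of them, followed by a host name made up of letters, digits, dots, and hyphens, then a single dot and a top-level domain of two to six characters, and finally an optional path that may end with a single “/”. The leading and trailing / characters are pattern delimiters, as used in JavaScript and Perl, and are not part of the expression itself.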


/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/

A breakdown of the expression is shown below; it confirms that the expression matches URLs.



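The following regular expression will search for .onion URLs with 16-character (version 2) addresses on the suspect hard disk. Version 3 addresses are 56 characters long, so use {56} in place of {16} to find them:


^(https?:\/\/)?([a-z2-7]{16}\.onion)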


Regex To Find IP Addresses and Domain Name Web Addresses

Depending on the type of investigation at hand, locating IP addresses may be important. A regular expression can be built to look for IP addresses (such as https://197.128.20.4) and domain name web addresses. A good regular expression for this is given below:


(http(s?)|ftp(s?)):\/\/(((www\.)?\w+\.(com|org|net)|(\d{1,3}\.){3}\d{1,3}))


A good regular expression to search for IPv4 addresses only, with each of the four parts limited to the range 0 to 255, is given as follows:


^(([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])$  
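Each alternative inside the parentheses covers one slice of the 0 to 255 range: a single digit, 10 to 99, 100 to 199, 200 to 249, and 250 to 255. The Python sketch below checks the expression against a few made-up candidates; like the other anchored expressions, it tests a whole string.

```python
import re

IPV4 = re.compile(
    r"^(([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}"
    r"([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])$"
)

candidates = [
    "197.128.20.4",
    "10.0.0.255",
    "256.1.1.1",   # first part is above 255
    "192.168.1",   # only three parts
    "01.2.3.4",    # leading zero is not accepted
]

valid = [c for c in candidates if IPV4.match(c)]
print(valid)
```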


A good regular expression to search for IPv6 addresses written in full (all eight groups, without the :: abbreviation) is given as follows:


((([0-9a-fA-F]){1,4})\:){7}([0-9a-fA-F]){1,4}


Regex To Find Passwords

The following regular expression is used to search for strings of 8 to 15 characters with at least one upper-case letter, one lower-case letter, and one digit.


((?=.*\d)(?=.*[a-z])(?=.*[A-Z]).{8,15})
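Each (?=...) is a lookahead that checks one condition without consuming characters. The Python sketch below is written under the assumption that the intent is a password of 8 to 15 characters containing at least one digit, one lower-case letter, and one upper-case letter; the ^ and $ anchors and the candidate strings are additions for illustration, so that each candidate is tested as a whole.

```python
import re

# Assumed intent: 8 to 15 characters, with at least one digit,
# one lower-case letter, and one upper-case letter.
PASSWORD = re.compile(r"^(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).{8,15}$")

candidates = [
    "Passw0rd1",           # meets every condition
    "password1",           # no upper-case letter
    "PASSWORD1",           # no lower-case letter
    "Passwordx",           # no digit
    "Pw0rd",               # shorter than 8 characters
    "Abcdefgh1234567890",  # longer than 15 characters
]

strong = [c for c in candidates if PASSWORD.match(c)]
print(strong)
```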




To learn how to use grep and regular expressions in combination with The Sleuth Kit's srch_strings tool to conduct forensic keyword analysis, the reader is advised to view this post.
