DOMAIN NAME OSINT INVESTIGATION

Web browsers locate objects on the web through a Uniform Resource Locator (URL). The URL is a user-friendly address to a particular point on the internet. A typical URL is shown in the figure below.


In the above example, the first five letters before the colon (i.e., https) refer to the protocol (or scheme) used by the resource. The table below shows the most commonly used schemes on the internet.

PROTOCOL (SCHEME)    DESCRIPTION
http                 Hypertext Transfer Protocol
https                Hypertext Transfer Protocol Secure
ftp                  File Transfer Protocol
gopher               The Gopher Protocol
mailto               Redirects to an email address
news                 USENET news
nntp                 Network News Transfer Protocol
telnet               Telnet protocol
wais                 Wide Area Information Servers
file                 Host-specific file names
prospero             Prospero Directory Service

The letters www after the forward slashes indicate that the resource is located on the world wide web. You may sometimes encounter third-level domains such as www1, www2, or www3. These indicate that the browser is retrieving the resource from an alternative server to the one that usually serves this location; they are used for load balancing on domains with heavy user loads. The remaining parts are self-explanatory from the image above. The top-level domain indicates which group of name servers will be used when looking up the site. For a list of all designated top-level domains on the internet, you can visit this page.

INVESTIGATING THE TARGET WEBSITE

The first step in your investigation is to visit the web pages of your target company. A lot of valuable information can be obtained from a target's website, including but not limited to the following.

  • Physical address
  • Branch office locations
  • Key employees
  • Current job postings (these may reveal technologies used in the company)
  • Email schema
  • Phone numbers
  • Partner companies
  • Opening hours and holidays
  • News about the target organization (merger or acquisition news)
  • Technologies used in building the website
  • Email system used
  • IT technologies (hardware and software) used by the target organization
  • VPN provider (if any)
  • Digital files and metadata
  • Information about the organization's employees

Website Source Code

Webpages are often written using a combination of markup and scripting languages such as HTML (HyperText Markup Language) and JavaScript, among others. Together, these are referred to as the source code of the website, which includes both comments and the set of instructions, written by programmers, that makes sure the content is displayed as intended. This should be the starting point of a website investigation. View the source code to see whether the developers left any useful information in the HTML comments. You should also check the head section of the HTML source code for attached external files such as CSS and JavaScript files; they may also contain comments left by their developers.

To view the HTML source code of a website, press Ctrl + U after visiting the web page, or right-click on the web page and select View Page Source.

Figure 2: Sample HTML source code
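If you prefer to work from the command line, the following is a minimal sketch using curl and grep; www.example.com is a placeholder for the target, and the output file name is arbitrary:

curl -s https://www.example.com/ -o source.html        # download the page source
grep -n '<!--' source.html                             # list lines containing HTML comments
grep -oE 'src="[^"]*\.js"' source.html                 # list external JavaScript files referenced by the page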

Many companies outsource their website design to foreign companies. Discovering this from the HTML source code makes the outsourcing partner part of your investigation as well.


Investigate the robots.txt file

The robots.txt file is a publicly available file found on websites; it gives instructions to web robots (also known as crawlers or spiders) about which pages to include or exclude when crawling the site, using the Robots Exclusion Protocol. A Disallow: / statement tells a web robot not to crawl the specified page or directory. For investigation purposes, checking this file will reveal what the website owner wants to keep out of public search results. To view the robots.txt file, type the target domain name followed by a forward slash and then robots.txt, as shown below.

Figure 3: robots.txt file
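From the command line, a minimal sketch of the same check (www.example.com is a placeholder):

curl -s https://www.example.com/robots.txt | grep -i '^Disallow'     # show only the disallowed paths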

RobotsDisallowed is a GitHub project that harvests the Disallow: / directories from the robots.txt files of the world's top websites (taken from the Alexa global ranking).

View the Target Website Offline

Examining a website online is a great idea, but what if you could examine it offline on your own computer? Things would be a lot easier because you could search the files for text strings, patterns, different file extensions, and even content that was thought to be hidden in some cases. The applications that perform this function are commonly called website downloaders. Some of the popular ones include:

  • HTTrack - This is one of the most comprehensive website cloning tools. It allows you to clone an entire website for offline viewing in order to investigate it further (a command-line sketch follows this list).
  • GNU Wget - If you are on Linux, you can use the wget command to copy a web page locally. To download the whole website into a folder of the same name on your computer, use the command wget -m http://<website name>. The -m option stands for mirror, as in "mirror this website." Mirroring is another term for downloading a website. To download the pages one level deep together with everything needed to display them, use the command wget -r --level=1 -p http://<website name>. This command says, "Download all the pages (-r, recursive) on the website plus one level down (--level=1), and get all the components such as images that make up each page (-p)."
  • Website Ripper Copier - This is another great tool for copying a website. It has a few additional functions compared to HTTrack.
  • BlackWidow - Using this tool, you can download a complete site or part of it. You can also download any kind of file, including YouTube videos embedded within a site.
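For a quick command-line sketch of the same idea (assuming the httrack package is installed; www.example.com and ./example-mirror are placeholders):

httrack "https://www.example.com/" -O ./example-mirror               # mirror the site into a local folder
grep -rni "password" ./example-mirror --include="*.html"             # search the local copy for an interesting string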

Extract the Links

A target website can be linked with other applications, web technologies, and related websites. Dumping its links will reveal such connections and give the URLs of other resources (such as CSS and JavaScript files) and domains connected with it. There are many online services to extract the URLs, images, scripts, iframes, and embeds of target websites. The following are the most popular (use more than one service, as they do not return the same results):

To see where a target website URL redirects to, use Redirect Detective.
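You can also dump a page's links locally from the command line; a minimal sketch with curl and grep (www.example.com is a placeholder):

curl -s https://www.example.com/ | grep -oE 'href="[^"]+"' | sort -u     # list the unique link targets found on the page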

Check the Target Website Backlinks

Websites contain more links and pages than one might think. There are links within the website itself (called internal links) and links pointing to it from pages on other websites (called external links or backlinks). Checking these backlinks may reveal useful information about the target. To view all backlinks of a site, type the following in the browser address bar.

 site:* <domain name>

Please note that there must be a space between the asterisk * and the domain name as shown below.

site:* facebook.com

This will return all sites that link to https://facebook.com. To refine the search and return only results from other domain names, exclude all links to the target domain from itself, as shown below.

Figure 4: Finding website backlinks using Google advanced search operator
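The refined query typically takes a form like the following, where -site: excludes results from the named domain (a sketch of the operator combination the figure illustrates):

site:* facebook.com -site:facebook.com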

Monitor Website Updates

Websites change constantly. This means that there may be new relevant information tomorrow or in a week's time, so you should monitor the target website for updates regularly. Doing this manually might not be convenient, so you may want to automate the activity using appropriate tools. Some popular ones include:

  • Website-Watcher - This is a commercial program. The software will monitor web pages, forums, and RSS feeds for new postings and replies (even on password-protected pages) and report the changes.
  • VisualPing - This allows you to automatically monitor a website and be notified by email if any changes occur. You can also specify exactly what must change for you to be notified.

Viewing the Web History

We noted above that web pages change constantly and that these changes can be monitored either manually or with specialized tools. It is also useful to view historical snapshots of these web pages during an investigation. Interesting information can be found in an older version of a target website, such as organizational charts, phone numbers, customer intelligence, system information exposed in places such as the page source or robots.txt, older business partnerships, vulnerabilities fixed in later versions, and other useful data the target does not want on the current version of the website. It is important to understand that publicly available information is hard to remove completely, making historical sources a valuable place for investigation. To view previous versions of a website, use the following services.

 

Figure 5: Snapshot of Facebook as of 13 April 2007
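One widely used service of this kind is the Internet Archive's Wayback Machine. As a minimal sketch, you can query its availability API from the command line to find the archived snapshot closest to a given date (the domain and timestamp below are placeholders):

curl -s "https://archive.org/wayback/available?url=facebook.com&timestamp=20070413"     # returns JSON describing the closest archived snapshot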

Identify the Technologies Used

It is also important to determine the web technologies and infrastructure upon which the target website relies. There are many ways to do this. From job postings on the target organization's website, valuable information such as the skills needed, IT certifications, and past experience with specific products/vendors can help determine the type of IT infrastructure, the operating system, and other software used.

There are many online tools that can help determine the technologies upon which a target website relies. Some of the popular ones include:

Built With

A quick analysis of a target website may identify the technologies used to build and maintain it. Many pages built in an environment such as WordPress or Tumblr contain obvious evidence of these technologies. If you notice the YouTube logo within an embedded video, you know that the creator of the site likely has an account with the video service. However, the presence of various services is not always obvious. Built With takes the guesswork out of this important discovery.

Entering the domain sans.org into the Built With search immediately identifies the web server (Apache, Nginx), email provider (Imperva, Incapsula), web framework (Contentstack, Nuxt.js), website analytics, YouTube video services, mailing list provider, blog environment, and website code functions. While much of this is geek speak that may not add value to your investigation, some of it will assist in additional search options through other networks.

Nerdy Data

If you have located a Google Analytics ID, AdSense ID, or Amazon ID of a website using the previous methods, you should consider searching that number through Nerdy Data. A search of our target's Google AdSense number revealed five domains that possess the same data. A search of the Amazon number revealed three domains.

This strongly suggests that the extra domains discovered are attributable to the target.

If this service presents more results than you can manage, consider using their free file download option to generate a CSV spreadsheet.

Other options for this type of search include:

Identifying the key technologies used—both software and hardware—will help you do some focused research to identify any vulnerabilities in the target organization’s software, identify product-specific defects, and identify application-specific configuration problems.

Find Related Sites

There are numerous things you can uncover from a page's source code, but one good example is code that helps website owners and administrators monitor the traffic that a website is receiving. One of the most popular such services is Google Analytics.

Sites that are related often share a Google Analytics ID. Because Google Analytics allows multiple websites to be managed by one traffic-monitoring account, you can use their ID numbers to identify domains that may be connected by a shared ownership or administrator.

To find domains sharing the same Google Analytics ID, use DNSlytics, DomainIQ, and MoonSearch.
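To locate such an ID in the first place, you can pull it straight out of the page source; a minimal sketch with curl and grep (www.example.com is a placeholder; UA-... and G-... are the common Google Analytics and GA4 ID formats):

curl -s https://www.example.com/ | grep -oE 'UA-[0-9]+-[0-9]+|G-[A-Z0-9]+' | sort -u     # extract analytics IDs embedded in the page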

Web Scraping

There are automated tools that can help you collect various types of information from the target website easily. Such tools are known as web scraping tools or web data extraction tools. Imagine you want to collect e-mail addresses from a big website (with thousands of pages). Doing this manually would be a daunting task, but with automated tools you can do it with a single click. Some web scraping tools are discussed below.

theHarvester

theHarvester is an open source reconnaissance tool for obtaining email addresses, employee names, open ports, subdomains, host banners, and more from public sources such as Google, Bing, and other sites such as LinkedIn. theHarvester mainly makes use of passive techniques and sometimes active techniques as well. It comes preinstalled on Kali Linux, but you can install it on any Linux-based operating system.

Generally, we need to input a domain name or company name to collect relevant information such as email addresses, subdomains, or the other details mentioned above, but we can also use keywords to collect related information. We can narrow the search by specifying which particular public source we want to use for the information gathering.

To collect the details of the target organization, type the following command: theHarvester -d <domain name> -b all -l <limit_of_the_number_of_results> -f <output_name>. For example: theHarvester -d twitter.com -b all -l 500 -f result.txt.

theHarvester is the command used to execute the tool, and these are some of its options:

  • -d specifies the domain to search or the company name.
  • -b specifies a data source such as google, googleCSE, bing, bingapi, pgp, linkedin, google-profiles, jigsaw, twitter, googleplus, or all.
  • -l limits the number of results to work with.
  • -f saves the results into an HTML or XML file.

Figure 6: Harvesting website details with theHarvester

Web Data Extractor

Web Data Extractor is a commercial program that collects various types of data including URLs, phone and fax numbers, e-mail addresses, meta tag information, and body text.

Email Extractor

Email Extractor is a Chrome add-on that extracts all e-mail addresses from the currently visited web page.

Investigate Target Website's File Metadata

When browsing a target company’s website, you may encounter different types of files posted on it, such as files advertising products in JPEG or PDF format, spreadsheets containing product catalogs, and others. Analyzing the files of the target could also reveal some interesting information such as the metadata (data about data) of a particular target.

  • FOCA (Fingerprinting Organizations with Collected Archives) - This is a very effective tool that is capable of analyzing files without downloading them. It can search a wide variety of extensions from the three big search engines (Google, Yahoo, and Bing). It's also capable of finding some vulnerabilities such as directory listing and DNS cache snooping.
  • Metagoofil - This is an information gathering tool designed for extracting metadata from public documents (pdf, doc, xls, ppt, docx, pptx, xlsx) belonging to a target company.
  • OOMetaExtractor - With this, you can extract and clean OpenOffice document metadata.
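Once you have downloaded a file, a quick local alternative is a minimal sketch with exiftool (report.pdf is a placeholder file name):

exiftool report.pdf     # print all metadata tags, such as author, creation tool, and modification dates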

Website Certificate Search

Use these services to show cryptographic certificates associated with any domain name:
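As a quick local check of the certificate a domain is currently serving, here is a minimal sketch using openssl (example.com is a placeholder):

openssl s_client -connect example.com:443 -servername example.com </dev/null 2>/dev/null | openssl x509 -noout -subject -issuer -dates     # show the subject, issuer, and validity period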

Search Engine Marketing Tools

Search Engine Marketing (SEM) websites provide details valuable to those responsible for optimizing their own websites. SEM services usually provide the overall ranking of a website, the keywords most often searched for it, backlinks, and referrals from other websites. SEO specialists use this data to determine potential advertisement relationships and to study their competition. Online investigators can use it to collect important details that are never visible on the target websites. Three individual services will provide easily digestible data on any domain.

Similar Web

Similar Web is usually the most comprehensive of the free options. However, some of the details retrieved usually contradict other services, because much of this data is "guessed" based on many factors. A search of twitter.com produced the following partial information about the domain.

  • The majority of the traffic is from the USA.
  • There were 6.8 billion total visits last month.
  • There are ten main online competitors to the target, and the largest is Facebook.com.
  • "twitter" led more people to the site than any other search term, followed by ツイッター.
  • There are 1,000 referring sites to twitter.com, and Wikipedia referred the most.
  • The top social media networks directing traffic to twitter.com are YouTube, Reddit, and Facebook; YouTube tops the list with 47.01%.

These analytical pieces of data can be valuable to a researcher. Knowing similar websites can lead you to other potential targets. Viewing sources of traffic to a website can identify where people hear about the target. Global popularity can explain whether a target is geographically tied to a single area. Identifying the searches conducted before reaching the target domain can provide understanding about how people engage with the website. While none of this proves or disproves anything, the intelligence gathered can help give an overall view of the intent of the target. Other websites that render similar services are listed below.

  • Moon Search - Offers website analytics and a backlink checker service.
  • Spy on Web - Collects different information about a target domain name, such as its IP address and the DNS servers it uses.
  • W3bin - Here you can find out who hosts a specific website.
  • Visual Site Mapper - This tool shows outgoing and incoming links for a target website.
  • Site Liner - This tool shows duplicate content and related domain names.
  • Clear Web Stats - This tool shows detailed technical information about any domain name.
  • Website Outlook - Provides various website statistics such as social popularity, keyword analysis, and technical information.
  • Informer - This tool shows statistical information about websites.
  • Security Headers - Here you can analyze the HTTP response headers of target websites.

Sharedcount

This website provides one simple yet unique service. It searches your target domain and identifies its popularity on social networks such as Facebook and Twitter. A search of sans.org produced the following results. This information would lead me to focus on Facebook first.

Reddit Domains

The primary purpose of the service is to share links to online websites, photos, videos, and comments of interest. If your target website has ever been posted on Reddit, you can retrieve a listing of those posts. This is done through a specific address typed directly into your browser. If your target website were sans.org, you would navigate to https://www.reddit.com/search?q=sans.org. This example produced many Reddit posts mentioning the domain. These could be analyzed to document the discussions and usernames related to the posts.

Small SEO Backlinks

After you have determined the popularity of a website on social networks, you may want to identify any websites that link to your target domain. This will often identify associates and people with interests similar to the subject of your investigation. There are several online services that offer a check of any "backlinks" to a specific website. One of the best is the backlink checker at Small SEO Tools. A search of this blog produces 32 websites that link to it; these sites would also fall within the radar of your investigation if I were your target.

Visual Site Mapper

When researching a domain, I am always looking for a visual representation to give me an idea of how massive the website is. Conducting a "site" search on Google helps, but you are at the mercy of Google's indexing, which is not always accurate or recent. This service analyzes the domain in real time, looking for linked pages within that domain. It provides an interactive graph that shows whether a domain has a lot of internal links that you may have missed. Highlighting any page will display the internal pages that connect to the selected page. This helps identify pages that are most "linked" within a domain, and may lead a researcher toward those important pages. This visual representation helps me digest the magnitude of a target website.

Website Reputation Checker Tools

There are many organizations that offer free online services to check whether a specific website is malicious. Some of these sites also offer historical information about a target website. The following are some web reputation analysis services:

  • Threat Miner - This site offers domain threat intelligence analysis.
  • URLVoid - This is a website reputation checker tool.
  • Threat Crowd - This is a search engine for threats.
  • Sucuri SiteCheck - This is a website malware and security scanner. It will also show a list of the links and scripts included within the target website.
  • Joe Sandbox - This service detects and analyzes potentially malicious files and URLs.
  • Google Transparency Report - This examines billions of URLs per day looking for unsafe websites.
  • MalwareURL - You can check a suspicious website or IP address here.
  • Scumware - This is a list of malicious websites.

To see a list of websites that have been hacked before, go to Zone-H and search for the target domain name. If there is a previous hack, it will show you the hacked page (which replaced the original home page), the hacker team responsible for the hack if available, and the date/time when the hack took place.

Domain Search Tools

In this section, we'll use different kinds of reconnaissance techniques to enumerate domains, subdomains, and IP addresses. We will also do DNS footprinting and obtain WHOIS information.

Whois Lookup

WHOIS is a huge database that contains information about almost every website on the web. The most common pieces of information are who owns the website and the owner's e-mail address, along with other details such as the billing contact and technical contact addresses, which can be used for further investigation. This information is public and is required to be so by ICANN, the organization responsible for overseeing the domain name system. WHOIS information about each domain is stored in public central databases called WHOIS databases, which can be queried to fetch detailed information about any registered domain name. Please note that some domain registrants may opt to make their domain registration information private; in these cases, the personal information of the domain registrant will be hidden in the WHOIS databases.

Numerous sites offer WHOIS information. However, the main one responsible for delivering this service is ICANN. ICANN and its local regional Internet registries manage the allocation and registration of IP addresses and domain names for the entire world.

  • ICANN - This is the head organization responsible for coordinating the Internet DNS and IP addresses.
  • AFRINIC - This is responsible for the Africa region.
  • APNIC - This is responsible for the Asia-Pacific region.
  • ARIN - This is responsible for the North America region.
  • LACNIC - This is responsible for Latin America and the Caribbean.
  • RIPE NCC - This is responsible for Europe, the Middle East, and parts of Central Asia.

There are many other online services that provide information about registered domain names. Some of them include:

  • Whoisology - For finding deep connections between domain names and their owners.
  • Robtex - For researching IP addresses and domain names.
  • Who - Look up domain and IP owner information, and check out dozens of other statistics.
  • Operative Framework
  • URL Scan
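On Linux and macOS, you can also query WHOIS directly from the command line; a minimal sketch (example.com is a placeholder, and 8.8.8.8 is simply a well-known public IP used for illustration):

whois example.com     # registrant, registrar, and name server details for a domain
whois 8.8.8.8         # WHOIS also works on IP addresses, returning the owning network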

Scanning For Subdomains

A subdomain is a web address created under the current domain name. It is usually used by website administrators to organize their content online. Most webmasters put all their effort into securing their main domain, often ignoring their subdomains and leaving them vulnerable and prone to attack. Discovering such insecure subdomains can provide important information about the target company (for example, it may reveal the website code or leak documents forgotten on the server).

Using Google Dork

A very common way of searching for subdomains is by using a simple Google dork, such as the one shown for Apress below.

Figure 7: Using Google dork for subdomain name discovery
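Such a dork typically takes a form like the following, where site: restricts results to the target domain and -inurl:www excludes results whose URL contains www:

site:apress.com -inurl:www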
 
This query tells the search engine to return results without www, which are normally subdomains. However, it will not find subdomains that follow the pattern www.subdomain.apress.com, since we have already asked Google to return results without www.

DNSdumpster

With DNSdumpster, you can find subdomains, DNS records, and MX records.

Other subdomain discovery tools and services include:

  • DNSmap - This comes by default with Kali Linux. It performs subdomain name discovery and shows the associated IP addresses for each subdomain name found.
  • Certificate Search - This service also discovers subdomain names of the target domain.
  • Gobuster - This tool discovers subdomains and files/directories on target websites. It is used as an active reconnaissance technique to collect information.
  • Bluto - Here you can gather subdomain names passively via Netcraft.
  • PenTest Tools - Here you can discover subdomain names, find virtual hosts, and do website reconnaissance and metadata extraction on a target website.
  • Sublist3r - Here you can discover subdomain names using both passive and active reconnaissance techniques.

It is always good practice to use more than one service for subdomain discovery, because some services may return partial results depending on their discovery method.
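As a minimal command-line sketch of two of these approaches, using the public crt.sh certificate transparency search and Sublist3r (example.com is a placeholder; %25 is the URL-encoded % wildcard):

curl -s "https://crt.sh/?q=%25.example.com&output=json"     # passive discovery via certificate transparency logs
sublist3r -d example.com -o subdomains.txt                  # Sublist3r, assuming it is installed in your PATH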

DNS Enumeration

Without a domain name, Google.com would just be 216.58.223.206, its IP address. Imagine having to memorize the IPs of all the websites you visit; surfing the Internet would become really difficult. That's why the DNS protocol was developed. It is responsible for translating domain names into IP addresses (and back again). DNS is one of the most important sources of information on the public and private servers of the target.

Common DNS Record Types

Before collecting information from the target DNS, you need to know the main DNS record types. The domain name system has many records associated with it. Each one gives a different set of information about the related domain name. These are the most common DNS records:

  • A is usually used to map hostnames to an IP address of the host. It is used for IPv4 records.
  • AAAA is the same as record type A but used for IPv6 records.
  • CNAME is the canonical name record. This record is often called an alias record because it maps an alias to the canonical name.
  • MX is the mail exchange record. It maps domain names to its mail server responsible for delivering messages for that domain.
  • NS is the name server record. It identifies the authoritative name servers that handle queries for the domain.
  • TXT is the text record. It associates arbitrary text with a domain name.


nslookup command

nslookup is available on both Windows and Linux. This command helps you discover various DNS information about the target domain name in addition to its resolved IP address. Let's say we want the DNS servers to return all the A records of an organization; we would do the following:

Figure 8: Finding the A record of the target domain name using nslookup


To see the MX records (mail server records) associated with the target domain name, do the following:

Figure 9: Showing MX records for a target domain name
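As a minimal sketch of equivalent one-line commands (example.com is a placeholder):

nslookup -type=A example.com      # return the A (IPv4 address) records
nslookup -type=MX example.com     # return the MX (mail server) records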

The following are useful websites that offer DNS and web search tools:

Domain Information Groper (DIG)

We can run the same queries with dig as we did with nslookup. However, dig is very handy and has more functionality than nslookup. So let's ask dig to return the MX records for Wikipedia.org. We will use the following command:
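dig wikipedia.org MX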


Similarly, you can use NS, A, CNAME, and so on in place of MX to return the corresponding records.

 

IP Address Tracking

Below are popular, free-of-charge tools that can help you find more information about any IP address or domain name.

Here are tools for IP geolocation information:

  • IPVerse - This shows the IPv4 and IPv6 address block lists by country code.
  • IP2Location - Identify geographical location and proxy use by IP address.
  • IP Fingerprints - This is an IP address geographical location finder.
  • DB-IP - This shows IP geolocation and network intelligence.
  • IP Location - This shows IP geolocation data.
  • UTrace - Locate IP addresses and domain names.

Here are tools to gain information about the Internet Protocol (IP):

Here are tools to find out information about the Border Gateway Protocol (BGP):

Here are tools to find out information about blacklist IP addresses:

  • Block List - Here you can report abusive IP addresses to their server operators in order to stop attacks from compromised systems.
  • FireHOL - Here you can collect cybercrime IP feeds to create a blacklist of IP addresses that can be used on various networking devices to block malicious access/websites.
  • Directory of Malicious IPs

A reference guide that I have found useful in domain name OSINT investigations is the image below, created by inteltechniques.com.

