How to See What’s Behind a Website

by Brad Murray, Wael Eskandar



In Short: A practical overview of tools and techniques to investigate the ownership of websites and uncover hidden information online, as well as essential tips on how to do it securely.

On the surface, websites look like they’re designed to make information available to the public. However, there is plenty of valuable information hiding behind what you are able to see in your web browser.

Sometimes it is important to research hidden data: to identify the individuals or companies that own a domain name or maintain a website, to determine where that site was registered or to dig up content that it once contained but that has since been removed.

Doing so is not always straightforward. For example, people who do not want to be associated with a website’s content, or with the affiliated business, sometimes try to hide their connection to the site by using intermediaries when they register its domain name.

A website investigator is sometimes like a mechanic. Just as a mechanic might need to poke around inside a car’s engine to diagnose a problem, an investigator might need to look into the inner workings of a website to find out who and what is behind it.

Finding hidden content and connections is not an exact science, but a combination of acquired skills, a set of tools and a dose of perseverance. We’ll explore some useful tools and methods, which can help a determined investigator to unearth clues buried within a website – from registration details and metadata to source code and server configurations.

A website and its elements


To investigate a website effectively, you will need to know what goes into one. This includes elements that are immediately apparent to visitors and others that lurk beneath the surface.

Website and webpage

A website is made up of webpages that display information. That information might include the profile of a company, a list of social media posts, a description of a product, a collection of photographs, a database of legal information or just about anything else.

These webpages can typically be viewed by anyone with internet access and a web browser. Considered from another perspective, however, a webpage is really just a digital file that is stored on a disk that is attached to a computer that is plugged into power and connected to a network cable somewhere in the physical world. It is sometimes helpful to keep this in mind when investigating a website.

IP address

To visit a website, your device needs to know the Internet Protocol address, or IP address, of the computer that hosts it. Hosting a website means making it available to the world; the computers responsible for doing so are often called servers.

An IP address is typically written as a series of four numbers, separated by periods, each of which ranges from 0 to 255.

For example: 172.217.16.174 is the IP address of one of the servers that hosts the “google.com” website, at which visitors can access Google’s search engine.
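As a rough illustration of this notation, the short Python sketch below uses the standard library's ipaddress module to split the example address into its four numbers, and shows that a value outside the 0 to 255 range is rejected. The helper name octets is made up for this example.

```python
import ipaddress

def octets(address: str) -> list[int]:
    """Return the four numbers that make up an IPv4 address."""
    # IPv4Address raises a ValueError if a part is missing or out of range.
    ip = ipaddress.IPv4Address(address)
    return list(ip.packed)

print(octets("172.217.16.174"))

try:
    octets("172.217.16.300")
except ValueError as err:
    print("Not a valid IPv4 address:", err)
```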

At any given time, each device that is directly connected to the internet - be it a webserver, an email service or a home WiFi router - is identified by a particular IP address. This allows other devices to find it, to request access to whatever it is hosting and, in some cases, to send it content like search terms, passwords or email messages.

Many devices, including most mobile phones, laptops and desktop computers, connect to the internet indirectly. They can reach out to websites and other services - and they can receive replies - but most other devices cannot reach out to them. In a sense, they are not listening for connections. Many of these devices have what are called “internal IP addresses.” This means that devices on the same local network can connect to them directly, but others cannot. If you look up the IP address of your phone or laptop, you will likely find an internal IP address, but you will rarely find one associated with a website.
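If you want to check whether an address you have found is an internal one, Python's ipaddress module can help. The sketch below is only a rough test: the helper name is_internal is hypothetical, and the is_private property it relies on also returns True for a few other reserved ranges (such as the loopback range), not only for home or office networks.

```python
import ipaddress

def is_internal(address: str) -> bool:
    """True if the address falls in a range reserved for private use."""
    return ipaddress.ip_address(address).is_private

for candidate in ["192.168.1.20", "10.0.0.5", "172.217.16.174"]:
    label = "internal" if is_internal(candidate) else "public"
    print(candidate, "->", label)
```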

Domain name

Like most long numbers, IP addresses are difficult to remember, so we tend to use domain names instead. Each domain name points to one or more IP addresses. In the example above, the domain name “google.com” points to 172.217.16.174 and is far easier for most people to remember.
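The translation from a domain name to its IP addresses can be reproduced in a few lines of Python. This is a minimal sketch: resolve and offline_stub are names invented here, and the stub simply returns the example address from the text so that the code runs without a network connection. A real lookup sends a DNS query from your own connection, and the addresses returned for a large site change over time.

```python
import socket

def resolve(domain: str, lookup=socket.getaddrinfo) -> list[str]:
    """Return the distinct IP addresses a domain name points to."""
    # Each result is a tuple whose last element is the socket address;
    # the first item of that socket address is the IP address as text.
    results = lookup(domain, None)
    return sorted({info[4][0] for info in results})

def offline_stub(host, port):
    """Stand-in for a DNS lookup that returns the address used in the text."""
    return [(socket.AF_INET, socket.SOCK_STREAM, 6, "", ("172.217.16.174", 0))]

print(resolve("google.com", lookup=offline_stub))

# A real lookup needs a working internet connection:
# print(resolve("google.com"))
```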

Domain registrar, domain registrants & domain registration

Domain names are unique. There can only be one “google.com,” for example. The process of purchasing a domain name is called domain registration.

This process ensures that domain names remain unique and makes it more difficult for someone to impersonate a website they do not control. When someone registers a domain name, a record is created to keep track of that domain’s official owner and administrator (or their representatives).

A person who registers a domain is called a domain registrant. That registrant - or someone to whom they give access - can then point their domain to a particular IP address. If a webserver is listening at that IP address, a website is born.

The companies that handle the registration process are called domain registrars, and they almost always charge a fee for their services. Example registrars include GoDaddy.com, Domain.com and Bluehost.com, among many others. These companies are required to keep track of certain information about each of their registrants.

A non-profit organisation called the Internet Corporation for Assigned Names and Numbers (ICANN) governs the domain registration process for every website in the world.

Web host

We know that a website has a domain name and that a domain name is translated into an IP address. We also know that every website is actually stored on a computer somewhere in the physical world. The computer that hosts the website is called a web host.

There is an entire industry of companies that store and serve websites. They are called web hosting companies. They have buildings filled with computers that store websites, and they can be located anywhere in the world. While it is most common for websites to be hosted in “data centres” like these, they can actually be hosted from almost any device with an internet connection.


Safety first!

There are many ways to describe using and researching on the internet. Many of these descriptions involve “traveling” somewhere, for example “surfing” the internet or “going to” a website.

The fact is, a better description would be opening a door or dialing a phone number. When you dial a phone number, the person on the other end can see your phone number. When you visit a website’s IP address, the website can see your IP address. When you open a door to look out, someone on the other side can look in. It is important to understand that when you visit a website you are sending hidden information about yourself to that website.

That information includes what kind of device or computer you have (iPhone 6, Samsung Galaxy, MacBook etc.), which operating system you are running (Windows, macOS, Linux), and even what fonts you have installed.

All of this information can be used to figure out who you are, where you are, and even what other websites you have been on.
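One of the channels through which this happens is the set of headers that any web client sends along with each request. As a small illustration, the Python sketch below prints the header that the standard urllib client attaches by default, before any page has been requested. A full web browser typically sends considerably more than this, so treat the output as a lower bound rather than a picture of what your browser reveals.

```python
import urllib.request

# An opener is the object urllib uses to send requests. Even before we ask
# for a page, it already carries a default header that identifies the client.
opener = urllib.request.build_opener()
default_headers = dict(opener.addheaders)
print(default_headers)
```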

There are tools you can use to see some of the data you are sharing with the websites you visit. Using your current web browser, visit the online tools below to see what information you might be leaking to the websites you visit and the companies that own them.

  • Panopticlick (since renamed Cover Your Tracks) – analyses how well your browser and its add-ons protect you against online tracking techniques. This site also works on Tor Browser.
  • Browser Leaks – displays a list of web-browser security-testing tools that tell you what personal data you may be leaking to others, without your knowledge or permission, when you surf the Internet. This site also works on Tor Browser.

Be sure to check for leaks related to the Web Real-Time Communication (WebRTC) protocol – a technology that supports video and audio chat – and for DNS leaks – which allow third parties like your internet service provider (ISP) to see what websites you visit and what apps you use. The sites above also indicate whether or not your real IP address is visible to the websites you visit.

Having seen some of your weaknesses and formulated some concerns about how your online research might expose your information or threaten your safety, you can now take the next step. In the final section - How to stay safe when investigating websites - we go through a few tools and techniques you can use to protect yourself and your data when investigating online.

Basic WHOIS Look-up

When researching a website, one of the most useful sources of data can be found in its domain registration details.

Over the course of your investigation, it might be relevant to know who – whether it is an organisation or an individual – owns a particular domain, when it was registered and by which registrar, as well as other details. In many cases, this information can be accessed through third-party services that are detailed below.

Yet, as mentioned earlier, sometimes the owner of a domain would not want to appear as linked to the site. Whatever the reason - be it not wanting to be associated with the site’s content or just wishing to maintain a degree of privacy - it’s worth noting that domains can be registered through proxy or intermediary organisations that conceal the full details of the registration.

The information collected from domain registrants is called WHOIS data, and it includes contact details for the technical staff assigned to manage the site, as well as contact details of the actual site owner or their proxy.

This data has long been publicly available on sites like ICANN’s WHOIS Lookup. However, there are currently other free or partially-free services (some have fees for advanced searches and extended results) that also aggregate WHOIS information and which often provide more details and more accurate and up-to-date information than ICANN.

Note that if you are making many requests for information in a short period of time, on most of these sites you may receive an error and need to wait or switch to a different service to continue your searches. Similarly, many of these sites require you to complete CAPTCHAs (selecting various items from images) to make sure you are not a robot.
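Behind these web forms sits a very simple protocol: a client connects to a WHOIS server on TCP port 43, sends the domain name followed by a line break, and reads the text reply until the server closes the connection. The Python sketch below shows the idea. The function names are invented for this example, and the server name in the final comment is an assumption about which server answers for “.com” domains. Note that a direct query like this is made from your own IP address, and that the same rate limits apply.

```python
import socket

def build_query(domain: str) -> bytes:
    """A WHOIS request is just the domain name followed by CRLF."""
    return domain.strip().lower().encode("ascii") + b"\r\n"

def whois(domain: str, server: str, port: int = 43, timeout: float = 10.0) -> str:
    """Send a WHOIS query to `server` and return the raw text reply."""
    with socket.create_connection((server, port), timeout=timeout) as conn:
        conn.sendall(build_query(domain))
        chunks = []
        # The server closes the connection once it has sent everything.
        while True:
            data = conn.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")

print(build_query("usps.com"))

# To run a real lookup (this contacts the server directly):
# print(whois("usps.com", "whois.verisign-grs.com"))
```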

These are some of the sites providing useful WHOIS data for free:

As mentioned above, many registrars offer the ability to act as proxy contacts on the domain registration forms, a service known as “WHOIS privacy”. In such cases, domains registered with WHOIS privacy will not list the actual names, phone numbers, postal and email addresses of the true registrant and owner of the site, but rather the details of the proxy service. While this can frustrate some WHOIS queries, the lookup tool is nonetheless a powerful resource for investigating a domain.

As different search engines return different results for the same query depending on their indexes and algorithms, it may be that searching with different WHOIS query services returns varying amounts of detail about your domain of interest. Checking with multiple sources whenever possible is therefore a good way to make sure you collect as much information as possible, as is standard in any part of an investigation.

To illustrate this, let’s look at what a search for “usps.com” (the website of the United States Postal Service) on several WHOIS services leads to.

A query for WHOIS data for “usps.com” using the ICANN WHOIS Lookup returns:

[Image: ICANN WHOIS data for "usps.com" on 19 February 2019]

The information we get about the registrant is limited – we can only see the domain’s creation and expiry dates – and the registrar’s details appear in place of those of the registrant.

To show how the information returned from these services may differ, a search for “usps.com” on https://who.is/ returns more information about the Postal Service, including an address, email contact, and phone number.

[Image: Who.is WHOIS data for "usps.com" on 19 February 2019]
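When you collect results from more than one service, it can help to compare them field by field rather than by eye. The Python sketch below is one way to do that, assuming the replies use the common “Key: value” layout. The two sample replies are made up for illustration and do not come from any real service.

```python
def parse_whois(text: str) -> dict[str, list[str]]:
    """Collect 'Key: value' lines from a WHOIS reply into a dictionary."""
    fields: dict[str, list[str]] = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        key, value = key.strip(), value.strip()
        # Skip comments, blank lines and keys that have no value.
        if not sep or not key or not value or key.startswith(("%", "#", ">")):
            continue
        fields.setdefault(key, []).append(value)
    return fields

# Made-up replies from two hypothetical services, for illustration only.
source_a = """Domain Name: EXAMPLE.COM
Registrar: Example Registrar, Inc.
Creation Date: 1995-08-14T04:00:00Z
"""
source_b = """Domain Name: example.com
Registrar: Example Registrar, Inc.
Registrant Email: hostmaster@example.com
Creation Date: 1995-08-14T04:00:00Z
"""

a, b = parse_whois(source_a), parse_whois(source_b)
only_in_b = sorted(set(b) - set(a))
print("Fields only the second source returned:", only_in_b)
```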


Tip:

In addition to the WHOIS search tools above, IntelTechniques – the website of Michael Bazzell, an open source intelligence consultant – provides an aggregated list of domain search tools that allow you to compare search results from several sources of WHOIS data. Just check the Domain Name search menu on the left-hand side. Also note that IntelTechniques has a rich offering of other tools you can use in your investigations, such as image metadata search and social media search tools.


GDPR implications

The European Union’s (EU) General Data Protection Regulation (GDPR) has led to a lot of uncertainty for the status of public WHOIS registries in the EU because in theory, WHOIS data of owners and administrators of EU-registered domains should not be collected and published by registrars. Under the GDPR, it is considered to be private information.

However, ICANN, whose more relaxed interpretation of the GDPR permits limited access to WHOIS data, has sued several European registrars for deviating from that interpretation. Even after the GDPR’s implementation, ICANN continued to demand that EU registrars at least collect data about site owners and administrators, even if they do not make it publicly available. ICANN’s interpretation has been repeatedly rejected by the courts, but its insistence that its policy for EU registrants is GDPR-compliant leaves a lot of questions unanswered. Most likely, collection of and access to WHOIS data for EU-based registrants will be restricted.

Even in these conditions, some researchers are finding ways to work around the restrictions that make some registrants’ data inaccessible. This post by GigaLaw - a US law firm specialising in domain name disputes - provides some tips and techniques that can prove successful at times.

Historic WHOIS

Historic data can be a useful tool when investigating websites, because it can track the transfer of a domain’s ownership. It can also help identify owners of websites who have not consistently chosen to obscure their registration data using a WHOIS privacy service.


Example:

One example where this historic data proved useful was the investigation of a cybercrime gang known as Carbanak, who were believed to have stolen over a billion dollars from banks. Using the historical data provided by DomainTools, a researcher was able to link multiple sites together by going through their historical records and finding hundreds of domains that were initially registered with the same phone number and Yahoo email address. These contact details were later used to establish a link between Carbanak and a Russian security company.

For your own investigations, several companies offer access to historic WHOIS records, though these records may often be restricted to non-EU countries due to the GDPR, as mentioned above.

DomainTools

It is perhaps the best-known of these companies that offer historic hosting and WHOIS data. Unfortunately, this data is not free and DomainTools requires you to register for a membership in order to access it.

Whoisology

An alternative to DomainTools that also provides historical WHOIS data. It requires you to create an account for both its basic free services and its advanced fee-based ones. There is a limit to the number of free basic searches per day and this option only provides you with the latest historical data archive of a website (not full history). The full historical archives require payment and there are several annual fee rates depending on the number of searches and other features the service provides. Whoisology doesn’t work via the Tor Browser, and it may also use CAPTCHAs to verify that you are a real person searching for information.


Safety First!

If you decide to set up an account with these services, it may be a good idea to create a new email address that you can use for this purpose only. This way you avoid sharing your regular contact data and other personal details.

Reverse WHOIS Look-up

Reverse phone directories, which allowed you to look up a phone number to determine who it belonged to, were a staple of investigative work for years. These directories contained the same information as a phone book, but they organised it differently: entries were sorted by phone numbers rather than by names. This allowed investigators to cross-reference phone numbers back to the names of the people to whom those numbers belonged. While printed reverse directories have long since been replaced by online databases (such as White Pages Reverse Phone), the need to cross-reference information has expanded into many other applications.

Investigators often need to look up residents by home address, to get names from email addresses or find businesses by officer or incorporation agent (a person or business that carries out company formation services on behalf of real owners). Reverse directories should be part of any investigator’s toolkit. The notion of tracing little pieces of information back to their sources is central to the investigative mindset.

When you look up the domain names registered to a certain email address, phone number or name, it is called a “reverse WHOIS lookup”. Several sites offer these kinds of searches.

To identify the owner of a domain – especially when that owner has taken some steps to obscure their identity – you will need to locate all the information about the website that can be reverse searched. The tools available to cross-reference information from a website will change, and the information available will vary for each site, but the general principle is consistent. When trying to locate the owner of a domain name, focus on locating information that can help you “reverse” back to an ultimate owner.
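The principle behind a reverse lookup is simply a change of index: instead of going from a domain to its contact details, you go from a contact detail to every domain that lists it. The Python sketch below illustrates this with a handful of made-up registration records. The domains, email addresses and phone numbers are all invented, and the helper name reverse_index is hypothetical.

```python
from collections import defaultdict

# Made-up registration records; in practice these would come from WHOIS
# lookups or from a historic WHOIS provider.
records = [
    {"domain": "example-one.test", "email": "admin@example.test", "phone": "+1.5550100"},
    {"domain": "example-two.test", "email": "admin@example.test", "phone": "+1.5550199"},
    {"domain": "unrelated.test", "email": "someone@else.test", "phone": "+1.5550100"},
]

def reverse_index(records, field):
    """Map each value of `field` to the set of domains registered with it."""
    index = defaultdict(set)
    for record in records:
        value = record.get(field)
        if value:
            index[value.lower()].add(record["domain"])
    return index

by_email = reverse_index(records, "email")
by_phone = reverse_index(records, "phone")
print(sorted(by_email["admin@example.test"]))
print(sorted(by_phone["+1.5550100"]))
```

Note how the phone number links a domain that the email address alone would have missed. A shared detail like this is a lead to follow up, not proof of common ownership.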

Here are some tools you can use for reverse searches:

ViewDNSinfo

It is free and allows searches by email or phone number. ViewDNSinfo also provides other useful options such as searching by an individual or company, historical IP address search (historical list of IP addresses a given domain name has been hosted on as well as where that IP address is geographically located) etc. Note that IP address owners are sometimes marked as ‘unknown’ so it helps to use several websites for your searches and combine the results for a fuller picture. It works via Tor Browser and doesn’t have CAPTCHA.

Domain Eye

You can register on Domain Eye to get 10 free searches per day. It works via Tor Browser and doesn’t have CAPTCHA.

DomainTools

A paid service with no free demos available for reverse WHOIS at the moment. It works via Tor Browser and doesn’t have CAPTCHA.

[Image: ViewDNSinfo example of reverse WHOIS search based on email address info@archive.org (used by the Internet Archive), date searched 11 January 2019]

Discovering useful information in a webpage’s source code

A webpage that you see in your browser is a graphical translation of code.

Webpages are often written in plain text using a combination of languages such as HTML (HyperText Markup Language), which is a markup language, and JavaScript, which is a scripting language, among others.

Together, these are referred to as a website’s source code, which includes both content and a set of instructions, written by programmers, that makes sure the content is displayed as intended.

Your browser processes these instructions behind the scenes and produces the combination of text and images you see when accessing a website. With a simple extra step, your browser will let you view the source code of any page you visit.

Give it a try. Open up your browser and take a look at the source code of a website that interests you. You can usually right-click on the page and select “View page source”. On most Windows and Linux browsers, you can also press CTRL+U. For Mac instructions and additional tips, check out this guide on how to read source code (also accessible via Tor Browser).


For example:

Part of the source code for the White House website https://www.whitehouse.gov, which you can reveal by right-clicking on the page and selecting “View page source”, looks like this:

[Image: Example of source code]

If you’ve never looked at a site’s source code before, you might be struck by how much of the information that is transmitted to your computer does not appear when you view the page in your browser.

For instance, there may be comments left by whoever wrote the source code. These comments are only visible when you view the source – they are never displayed in the rendered page (that is, the page that has been translated into graphics and text). Comments begin with <!--, which indicates that what comes next is a comment and should not be displayed on the page. They end with -->, which signals the end of the comment.

Comments are often written in plain language and sometimes provide hints about who maintains a website. They may also include personal notes or reveal information such as a street address or copyright designation.
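On a long page, picking out comments by eye is tedious. The Python sketch below collects them automatically using the standard library's HTML parser. The class and function names are invented for this example, and the sample page is made up. To use it on a real site, you would first save the page source to a file and pass its contents to extract_comments.

```python
from html.parser import HTMLParser

class CommentCollector(HTMLParser):
    """Gather the text of every <!-- ... --> comment in a page."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        # Called by the parser once for each comment it encounters.
        self.comments.append(data.strip())

def extract_comments(html: str) -> list[str]:
    parser = CommentCollector()
    parser.feed(html)
    parser.close()
    return parser.comments

sample = """<html><body>
<!-- Template last edited by the web team -->
<p>Welcome</p>
<!-- TODO: update office address -->
</body></html>"""

for comment in extract_comments(sample):
    print(comment)
```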

Finding connections with reverse Google Analytics ID

There are numerous things you can uncover from a page’s source code, but one good example is code that helps website owners and administrators monitor the traffic that a website is receiving. One of the most popular such services is Google Analytics - https://analytics.google.com.

Sites that are related often share a Google Analytics ID. Because Google Analytics allows multiple websites to be managed by one traffic-monitoring account, you can use their ID numbers to identify domains that may be connected by a shared ownership or administrator.

Sites that use Google Analytics embed an ID number into their source code. The Google Analytics IDs discussed here begin with “UA-” and are followed by an account number. They look a bit like this: “UA-12345678-2”. (Newer Google Analytics 4 properties use IDs that begin with “G-” instead.)


For example:

To follow on the White House example above, the Google Analytics ID for www.whitehouse.gov is “UA-12099831-10”. You can find this out yourself by following these steps while on the website:

  • go to the website’s source code by right-clicking and selecting “View page source”, as indicated above,
  • open a search box with “Ctrl-F” or “Command-F” while you are on the page’s source code,
  • search for “UA-” by typing it in the search box; you will find the site’s Google Analytics code “UA-12099831-10”.

[Image: White House Analytics code]

The number after the first dash (12099831) is the White House’s Google Analytics account number. The number at the end (10, in this case) identifies one particular property, typically a website, among those tracked under that same account.

Because multiple websites can be managed on one Google Analytics account, you can use Google Analytics ID numbers to identify domains that may be connected by a shared ownership or administrator.
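Instead of searching the page source by hand, you can pull out any IDs of this shape with a short pattern match. The Python sketch below is a minimal version: the digit counts allowed by the pattern are an assumption, the sample line of source code is made up for illustration, and the function name is hypothetical.

```python
import re

# Matches IDs shaped like UA-12345678-2: "UA-", an account number,
# a dash, then a trailing number.
UA_PATTERN = re.compile(r"\bUA-(\d{4,10})-(\d{1,4})\b")

def find_analytics_ids(source: str) -> list[tuple[str, str, str]]:
    """Return (full_id, account_number, suffix) for each distinct ID found."""
    found = []
    for match in UA_PATTERN.finditer(source):
        entry = (match.group(0), match.group(1), match.group(2))
        if entry not in found:
            found.append(entry)
    return found

sample_source = "<script>ga('create', 'UA-12099831-10', 'auto');</script>"
print(find_analytics_ids(sample_source))
```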

There are several reverse search tools that allow you to locate sites that share a given analytics ID. Examples include:

  • DNSLytics – searchable by domain name, IP address, or Analytics ID. It also works via the Tor Browser.
  • DomainIQ - searchable by domain name or analytics ID. It doesn’t work via the Tor Browser.
  • Moonsearch – searchable by Analytics ID, IP address, etc. It doesn’t work via the Tor Browser.

As usual, it’s advisable to search the same Google Analytics ID on several of these websites, as their results tend to vary.


Note:

Sometimes one website may copy the source code of another even if they are not actually related. This will lead to misleading results when looking up the Google Analytics ID. Reverse lookups of Google Analytics IDs must always be treated as possible leads and not as hard evidence. The technique is useful, but it underlines the importance of checking multiple sources before drawing conclusions.

For instance, in the case above, searching for the White House ID (UA-12099831-10) with any of these services will return a list of sites sharing the same Google Analytics ID with the White House website. (Also note that results tend to differ from service to service; some will return more sites, others fewer, so search on several to compile a thorough list of findings.) If you do this exercise, you will notice that several websites that are most likely unrelated to the official White House site also appear on the list. Some are parody sites, others are gaming sites, and so on. Although this looks bizarre at first, the explanation is rather simple – the White House source code has been copied and replicated without deleting the Google Analytics ID. Therefore, not all the listed sites are related in this case. It is also worth noting that the unrelated websites are not actually using the Google Analytics ID of the White House and its genuinely related sites; they are merely displaying it.

How can these searches help an investigation?

If a website owner or administrator is obscuring their identity on one site, they may not have taken similar measures on every site they manage or own. Enumerating these sites by reverse searching the Google Analytics IDs can help you locate related websites that may be easier to identify.
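Once you have noted the IDs found on several sites, grouping them by account number makes shared accounts easy to spot. The Python sketch below does this for a few made-up observations. The domains and IDs are invented, and the function name is hypothetical. As with any reverse lookup, a shared account number is a lead to verify, not proof of common ownership.

```python
from collections import defaultdict

# Made-up observations: the analytics ID found in each site's source code.
observed = {
    "site-a.test": "UA-11111111-1",
    "site-b.test": "UA-11111111-2",
    "site-c.test": "UA-22222222-1",
}

def group_by_account(observed: dict[str, str]) -> dict[str, list[str]]:
    """Group domains by the account number between the two dashes."""
    groups = defaultdict(list)
    for domain, analytics_id in observed.items():
        account = analytics_id.split("-")[1]
        groups[account].append(domain)
    return {account: sorted(domains) for account, domains in groups.items()}

clusters = group_by_account(observed)
# Keep only the accounts that appear on more than one site.
shared = {acct: doms for acct, doms in clusters.items() if len(doms) > 1}
print(shared)
```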


Example:

In a 2011 article, Wired columnist Andy Baio revealed that out of a sample of 50 anonymous or pseudonymous blogs he researched, 15 percent were sharing their Google Analytics ID with another website. This finding proved fruitful for unmasking anonymous sites. Out of the sample of 50, Baio claimed to have identified seven of the bloggers in 30 minutes of searching. The full story is available here.

Let’s try an exercise and see if the website Our Revolution uses Google Analytics to monitor traffic.


[Image: Screenshot of "Ourrevolution.com"]

To determine whether “Our Revolution” has a Google Analytics ID we have to view the source code as described above.

[Image: Source code of "ourrevolution.com"]

We can then use one of the reverse search tools mentioned above to see if other sites are using that same Google Analytics ID. On DNSlytics, for instance, choose Reverse Analytics from the Reverse Tools top navigation menu.

[Image: Searching by Google Analytics ID on DNSlytics]

In addition to the “Our Revolution” domain where we found the Analytics ID, the search returns another domain name: “Summer for Progress” - https://summerforprogress.com/.

[Image: Results of Google Analytics ID search on dnslytics.com]

Metadata analysis


When someone creates a file (such as a document, PDF or spreadsheet) on their computer, the programs they use automatically embed information in that file.

We can consider “data” to be the contents you see in a file: the words in a document, the charts in a PDF, the numbers in a spreadsheet or the elements of a photograph.

On the other hand, the automatically embedded information is called “metadata”.

Examples of metadata might include the size of the file, the date when the file was created, or the date when it was last changed or accessed. Metadata might also include the name of the file’s author or the name of the person who owns the computer used to create it.

There are many types of metadata. Here, we look at how to find and make sense of several examples that are useful for investigations.

With documents, even if metadata doesn’t always identify the author or creator of a file (if they take steps to keep this identity hidden, for example, by deleting metadata such as name or dates), it often still provides clues to their identity or other significant facts about them or the devices and software they used to work on those files.
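To make this concrete: modern Microsoft Office files (.docx, .xlsx, .pptx) are ZIP archives, and their author-related metadata is stored in a small XML file inside the archive. The Python sketch below reads a few of those fields using only the standard library. The function name office_metadata is hypothetical, and the sample file is built in memory with invented values so that the code runs without a real document. Reading a file this way avoids opening it in an office program, but untrusted files should still be handled with the precautions described in the safety sections of this guide.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

def office_metadata(file) -> dict[str, str]:
    """Read author-related fields from a .docx, .xlsx or .pptx file."""
    with zipfile.ZipFile(file) as archive:
        root = ET.fromstring(archive.read("docProps/core.xml"))
    wanted = {
        "author": "dc:creator",
        "last_modified_by": "cp:lastModifiedBy",
        "created": "dcterms:created",
        "modified": "dcterms:modified",
    }
    result = {}
    for label, tag in wanted.items():
        element = root.find(tag, NS)
        if element is not None and element.text:
            result[label] = element.text
    return result

# Build a tiny stand-in file in memory, with invented values.
core_xml = (
    '<cp:coreProperties xmlns:cp="%(cp)s" xmlns:dc="%(dc)s" xmlns:dcterms="%(dcterms)s">'
    "<dc:creator>Jane Example</dc:creator>"
    "<cp:lastModifiedBy>Office PC</cp:lastModifiedBy>"
    "</cp:coreProperties>"
) % NS
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("docProps/core.xml", core_xml)
buffer.seek(0)

print(office_metadata(buffer))
```

For a file on disk, you would pass its path instead of the in-memory buffer.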

A similar situation happens when we take photos: the image files our cameras produce often contain a type of metadata called EXIF (Exchangeable image file format). EXIF metadata can reveal information related to when and where the photo was taken: time, date, GPS (Global Positioning System) location, etc.

Users can manually remove this potentially identifying information, and many apps and websites clear metadata from uploaded files in order to protect their users. In some cases, however, EXIF metadata that remains in the final version of a photograph may end up revealing clues about the identity of the photographer, locations, dates and other information that can help you connect the missing links in your investigation.
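When EXIF metadata does include a location, it is usually recorded as degrees, minutes and seconds together with a hemisphere letter (N, S, E or W), whereas most online maps expect decimal degrees. The conversion is simple arithmetic, sketched below in Python. The function name and the coordinates are made up for illustration.

```python
def dms_to_decimal(degrees: float, minutes: float, seconds: float, ref: str) -> float:
    """Convert degrees/minutes/seconds plus a hemisphere letter to decimal degrees."""
    value = degrees + minutes / 60 + seconds / 3600
    # South and West are expressed as negative numbers in decimal notation.
    return -value if ref.upper() in ("S", "W") else value

# Made-up coordinates for illustration.
latitude = dms_to_decimal(52, 30, 0, "N")
longitude = dms_to_decimal(13, 24, 0, "E")
print(latitude, longitude)
```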


Example:

American serial killer Dennis Rader was arrested after mailing a disk containing documents from his church to a news organisation. The documents contained metadata that identified their author. Here is an article in The Atlantic showing how it happened.

With this in mind, if you can’t find the owner of a domain name through the means and tools presented above, it can be useful to download all text documents, spreadsheets, PDFs and other files hosted by the site. From there, you can analyse the documents’ metadata and look for an author name or other identifying details. You can do this by checking the properties of the documents after you download them. Keep in mind, however, that documents like these sometimes contain malware that can put you and those with whom you work at risk. To avoid this, you should not open them with a device that you use for any other purposes (work or personal) or that is connected to the internet.


Safety First! - Opening downloaded files from unknown sources

Some investigators maintain a separate laptop that they use only to open untrusted files. These devices are often called ‘air gapped’ computers because, once they are set up, they are never connected to the internet.

As an alternative, you can restart your computer from a USB stick that contains the Tails operating system when you need to analyse suspicious documents. Even if a document contains malware that affects Tails, any damage it might do will become irrelevant once you reboot back into your normal operating system. And the next time you restart into Tails, you will have a clean system once again. Tails is based on the GNU/Linux operating system, however, so it comes with a bit of a learning curve.

To use either of these techniques, you will need a USB stick or an external hard drive so you can transfer the files in question.

Finally, if you are not worried about associating yourself with the documents or about exposing their contents to Google (or to anyone with the authority to access other people’s Google accounts), you can upload them to Google Drive, and search for metadata using Google Docs. Don’t worry, Google is pretty good at protecting their servers from malware!

Not all documents will contain metadata. It’s not always embedded in the first place, and the creator can easily delete or modify it, as can anyone else with the ability to edit the document. Moreover, not all metadata relates to the original author. Documents change hands and are sometimes created on devices that belong to people other than the author.

Again, any piece of information you find needs to be verified and corroborated from multiple sources. Despite that, metadata could provide you with additional leads or help to confirm other evidence you have already found.


Case Study

In addition to helping you identify the true owner of a document or website, metadata can also provide clues about employment contracts and other affiliations and connections. For example, a Slate writer analysed the PDFs found on a conservative policy website run by former American media personality Campbell Brown and discovered that all of them were written by staff working for a separate right-leaning policy group. The link between these two groups was not known until the metadata analysis was conducted. The full story is available here.

Let’s look at how this finding can be replicated.

The PDF described in this article was originally found at the following web address on the commonsensecontract.com website: http://commonsensecontract.com/assets/downloads/Rewards_for_Great_Teachers.pdf.

It has since been taken down and, indeed, that domain name now points to a completely different website: http://commonsensecontract.com. You can still find the original one archived on the Internet Archive’s Wayback Machine.

(To learn more about the Wayback Machine, see our resource on “Retrieving and Archiving Information From Websites”.)
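As a rough sketch of how such a lookup could be automated, the Python snippet below builds a query for the Wayback Machine's availability endpoint and reads the address of the closest snapshot from its JSON reply. The function names are our own, and the sample reply is hand-written to match the general shape of the service's responses; it was not captured from a real request.

```python
import json
from urllib.parse import urlencode

AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_query(page_url, timestamp=None):
    """Build the lookup address for a page; timestamp is YYYYMMDDhhmmss or a prefix of it."""
    params = {"url": page_url}
    if timestamp:
        params["timestamp"] = timestamp
    return AVAILABILITY_ENDPOINT + "?" + urlencode(params)


def closest_snapshot(reply_text):
    """Return the archived address from a JSON reply, or None if nothing was archived."""
    reply = json.loads(reply_text)
    closest = reply.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None


# Hand-written sample reply (an assumption, for illustration only).
SAMPLE_REPLY = """
{"archived_snapshots": {"closest": {"available": true,
  "url": "http://web.archive.org/web/20140501000000/http://example.org/report.pdf",
  "timestamp": "20140501000000", "status": "200"}}}
"""

query = availability_query("http://example.org/report.pdf", timestamp="2014")
snapshot = closest_snapshot(SAMPLE_REPLY)
print(query)
print(snapshot)
```

To try this against the live service, you would fetch the query address (for example with `urllib.request.urlopen`) and pass the response body to `closest_snapshot`. Keep in mind that doing so reveals your IP address to the Internet Archive unless you route the request through Tor or a VPN.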

../_images/web-archived-commonsense.png
Archived webpage from "commonsensecontract.com"

You can follow the steps below to examine the metadata in question. But first:

  • We recommend using an online document viewer to avoid exposing yourself to any malware that might be lurking within the online documents you download. (We did not find any malware in this document, nor is it particularly sensitive, but it’s best to plan for the worst.)
  • If you are using an online document viewer that requires you to sign in, such as Google Docs, we recommend creating a separate account on that service. This will help you avoid associating your investigative activities with your personal online profile. In the example below, we will use a simple online service that does not require an account.
  • Keep in mind that you are showing this document, and its metadata, to whoever runs the service you use. They, in turn, could share or publish it. If that is not acceptable, you might have to use one of the other techniques mentioned in the “Safety First” sections of this Kit.

To view the metadata in this PDF:

  1. Browse to the Wayback Machine - https://archive.org/web/
  2. Search for the original web address: http://commonsensecontract.com/assets/downloads/Rewards_for_Great_Teachers.pdf
  3. Click on the year 2014
  4. Click on one of the blue dots in the calendar (the one in May or one of the two in September)
  5. Click the download link toward the upper, right-hand corner of the screen
  6. Save the PDF somewhere on your device, but do not open it yet
  7. Browse to the Online PDF Reader (it also works on Tor Browser and does not have CAPTCHA)
  8. Click the “Start Online PDF Read” button
  9. Upload the Rewards_for_Great_Teachers.pdf file
  10. Click the “Rewards_for_Great_Teachers.pdf” Document > Properties link toward the upper, left-hand corner of the screen
  11. Note that the author is listed as Elizabeth Vidyarthi.
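If you are comfortable with a little scripting, the same kind of check can be approximated without a viewer. The sketch below reads a PDF as raw bytes, without rendering it, and looks for plain-text entries such as `/Author (...)`. It is a deliberately crude illustration: the function name and the sample fragment are our own assumptions, and it will miss metadata stored in compressed or encrypted parts of a file, in XMP blocks or in non-Latin encodings. A dedicated PDF library is more reliable for real work.

```python
import re

# Matches entries such as "/Author (Jane Doe)" in a PDF's document information dictionary.
INFO_ENTRY = re.compile(
    rb"/(Author|Creator|Producer|Title|CreationDate|ModDate)\s*\(((?:\\.|[^\\()])*)\)"
)


def read_info_fields(pdf_bytes):
    """Collect plain-text metadata fields found in the raw bytes of a PDF."""
    fields = {}
    for key, value in INFO_ENTRY.findall(pdf_bytes):
        # Drop the backslash in front of escaped characters such as "\(";
        # other escape forms are not interpreted in this sketch.
        cleaned = re.sub(rb"\\(.)", rb"\1", value)
        # Keep only the first occurrence of each field.
        fields.setdefault(key.decode("ascii"), cleaned.decode("latin-1"))
    return fields


# Tiny hand-made fragment standing in for a real file (an assumption, for illustration).
SAMPLE = b"1 0 obj\n<< /Author (A. N. Example) /Producer (Some Writer 1.0) >>\nendobj\n"

info = read_info_fields(SAMPLE)
print(info)
```

For a file saved on disk, you would pass in its contents instead of the sample, for instance `read_info_fields(open("Rewards_for_Great_Teachers.pdf", "rb").read())`. The precautions listed above still apply to any file you download.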

Exposing Hidden Web Content

../_images/Investigating-Websites_Breakdown_02-cik-illustration_2.png

Nearly every site on the internet hides something (and often, many things) from visitors, intentionally or not. For example, the content management systems employed by most sites hide the internal files used to generate posts and maintain the website. Databases that store data for sites and applications are usually hidden from public access. Cookies and other client-side data, while accessible and legible to a knowledgeable user, are concealed from the view of the average user, stored and processed automatically in the background.

There are simple tools and techniques that allow anyone to access such information without doing anything shady. These are just small tricks that let you see what a website is made of and what additional data it might reveal to you. Accessing such information can be helpful when investigating a website to determine its owners or to identify connections to other sites. It can also help turn up contact details or further leads for your research.

Robots.txt

Websites indicate how scrapers and search engines should interact with their content by using a file called “robots.txt”. This file allows site administrators to request that scrapers, indexers, and crawlers limit their activities in certain ways (for instance, some do not want information and files from their websites to be scraped).

Robots.txt files list particular files or subdirectories - or entire websites - that are off-limits to “robots”. As an example, this could be used to prevent the Wayback Machine crawlers from archiving all or part of a website’s content.

Some administrators may add sensitive web addresses to a robots.txt file in an attempt to keep them hidden. This approach can backfire, as the file itself is easy to access, usually by appending “/robots.txt” to the domain name.

Be sure to check the robots.txt file of the websites you investigate, just in case they list files or directories that the sites’ administrators want to hide. If a server is securely configured, the listed web addresses might be blocked. If they are accessible, however, they might contain valuable information.

Each subdomain is managed by its own robots.txt file. Subdomains have web addresses that include at least one additional word in front of the domain name. For example, the Internet Archive itself has at least two robots.txt files: one for its main site, at https://archive.org/robots.txt, and one for its blog, at https://blog.archive.org/robots.txt.

It is worth noting that robots.txt files are not meant to restrict access by humans using web browsers. Also, websites rarely enforce these restrictions, so email harvesters, spambots, and malicious crawlers often ignore them. If you are scraping a website using automated tools, however, it is considered polite to comply with any directives you might find in a robots.txt file.
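For those who do scrape with automated tools, the following minimal sketch shows one way to honour these directives using the `urllib.robotparser` module from Python's standard library. The rules, the example.org addresses and the "research-bot" name are hypothetical; the `robots_address` helper is our own and simply reflects the point above that each subdomain has its own file.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def robots_address(page_url):
    """Return the robots.txt address for the host (subdomain included) of a page."""
    parts = urlsplit(page_url)
    return f"{parts.scheme}://{parts.netloc}/robots.txt"


# Hand-written rules standing in for a downloaded file (an assumption).
SAMPLE_RULES = """\
User-agent: *
Disallow: /private/
Disallow: /drafts/report.docx
"""

parser = RobotFileParser()
parser.parse(SAMPLE_RULES.splitlines())

# "research-bot" is a hypothetical name for your own scraper.
allowed_home = parser.can_fetch("research-bot", "https://example.org/index.html")
allowed_private = parser.can_fetch("research-bot", "https://example.org/private/notes.html")

print(robots_address("https://blog.archive.org/2019/04/some-post/"))
print(allowed_home, allowed_private)
```

A scraper built this way would check `can_fetch` before each request and skip any address for which it returns `False`.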


Example:

As a test, we can access the robots.txt file for the Payment Card Industry Security Standards Council.

This is an interesting example not because the Council is trying to hide anything but because their robots.txt file - pcisecuritystandards.org/robots.txt - lists a number of digital files, including Word documents, PDFs and spreadsheets, none of which would turn up in regular search results:

../_images/web-robots-txt.png
Screenshot of robots.txt

In order to visit a webpage or download a document that you find this way, just copy the partial web address on the right-hand side of a “Disallow:” restriction and paste it into your browser’s address bar after the domain name. In this case, you can download the “SAQ_C_V3.docx” file you see in the image, for example, using the following web address: https://www.pcisecuritystandards.org/SAQ_C_v3.docx.
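The copy-and-paste step described above can also be sketched in a few lines of Python. The function name, the sample rules and the example.org domain below are assumptions made for illustration; the idea is simply to join each "Disallow:" path to the site's address.

```python
from urllib.parse import urljoin


def disallowed_addresses(site_root, robots_text):
    """Turn each "Disallow:" path in a robots.txt file into a full web address."""
    addresses = []
    for line in robots_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and padding
        if line.lower().startswith("disallow:"):
            path = line.split(":", 1)[1].strip()
            # An empty path means nothing is off-limits, and paths containing
            # "*" are patterns rather than addresses, so both are skipped.
            if path and "*" not in path:
                addresses.append(urljoin(site_root, path))
    return addresses


SAMPLE = """\
User-agent: *
Disallow: /documents/form_v1.docx   # hypothetical entry
Disallow: /internal/
Disallow:
"""

found = disallowed_addresses("https://example.org/", SAMPLE)
for address in found:
    print(address)
```

The resulting list is only a set of candidate addresses. Whether any of them can actually be opened depends on how the server is configured.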

Often, such files will be accessible through the website itself, so this might just be a shortcut. In some cases, however, you might stumble upon pages or files that a website administrator was trying to hide.

Remember - digital files can contain malware, so take care when opening them. Consider using an online document viewer unless you are concerned about sharing the content of those documents with whoever operates your document viewing service.

Sitemap.xml

Sitemap files are more or less the opposite of robots.txt files. Site administrators use them to inform search engines about pages on their site that are available for crawling. Websites often use sitemap files to list all of the parts of the site they want to be indexed, and how often they want search engine indexes to be updated.

Like robots.txt files, sitemaps live in the topmost folder or directory of the website (sometimes called the ‘root’ directory).

For large and complex websites, the sitemap often links to other Extensible Markup Language (XML) files, which are sometimes compressed, or ‘zipped’. Where these files are accessible, they sometimes point to sections of the website that might be interesting.

This sometimes turns up URLs that do not typically appear in search results. You can explore those manually.

To access sitemaps, you need to add “/sitemap.xml” to the domain name. Not all sites will have an accessible sitemap.xml file.

The UK-based open-source investigations site Bellingcat has one that you can reach by typing https://www.bellingcat.com/sitemap.xml into your browser's address bar. You will get a list of XML files, as seen below.

../_images/web-sitemap.png
Sitemap examples for bellingcat.com

You can click any of the addresses listed to see what they contain. In this example we can access https://www.bellingcat.com/attachment-sitemap1.xml

../_images/web-sitemap2.png
Sitemap examples for www.bellingcat.com/attachment-sitemap1.xml
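Clicking through long sitemaps by hand can be tedious. As a minimal sketch, the Python snippet below lists the addresses found in a sitemap file and reports whether the file is an index pointing to other sitemaps or a list of pages. The function name and the sample XML are our own assumptions; a real file would first need to be downloaded (and unzipped, if compressed).

```python
import xml.etree.ElementTree as ET

# Namespace used by files that follow the sitemap protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_links(xml_text):
    """Return (kind, addresses): kind is "index" for a list of sitemaps, "pages" otherwise."""
    root = ET.fromstring(xml_text)
    kind = "index" if root.tag.endswith("sitemapindex") else "pages"
    addresses = [loc.text.strip() for loc in root.findall(".//sm:loc", NS) if loc.text]
    return kind, addresses


# Hand-written sample (an assumption, for illustration only).
SAMPLE_INDEX = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.org/post-sitemap1.xml</loc></sitemap>
  <sitemap><loc>https://example.org/attachment-sitemap1.xml</loc></sitemap>
</sitemapindex>"""

kind, addresses = sitemap_links(SAMPLE_INDEX)
print(kind)
for address in addresses:
    print(address)
```

When the result is an index, each listed address is itself a sitemap that can be fetched and passed through the same function.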

Subdomain enumeration

A subdomain is an extra identifier, typically added before a domain name, that represents a subcategory of content. For example, “google.com” is a domain name whereas “translate.google.com” is a subdomain.

Websites often have unlisted subdomains that their administrators believe are private. These subdomains occasionally point to unfinished content or content that is intended for an internal audience. This might include development subdomains used by programmers to test new content, event pages with links to materials distributed at conferences, or login pages for internal webmail.

Many subdomains are uninteresting from an investigative standpoint, but some can reveal hidden details about your research subject that are not easily accessible through basic online searching.
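To make the distinction between a domain and its subdomains concrete, here is a small Python sketch that takes a list of hostnames and reports which part of each one sits in front of a given domain name. The function name and the list of hostnames are hypothetical; in practice, such a list might be copied from the results of a lookup tool.

```python
def subdomain_of(hostname, domain):
    """Return the part of hostname in front of domain, "" for the bare domain, None if unrelated."""
    hostname = hostname.lower().rstrip(".")
    domain = domain.lower().rstrip(".")
    if hostname == domain:
        return ""
    if hostname.endswith("." + domain):
        return hostname[: -len(domain) - 1]
    return None


# Hypothetical hostnames, e.g. copied from a lookup tool's results.
hostnames = ["translate.google.com", "google.com", "mail.dev.google.com", "notgoogle.com"]
labels = {name: subdomain_of(name, "google.com") for name in hostnames}
for name, label in labels.items():
    print(name, "->", repr(label))
```

Note that "notgoogle.com" is treated as unrelated: it merely ends with the same letters and is a different domain name altogether.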

Here are some tools and techniques you can use when researching website subdomains:

FindSubdomains

Probably the best starting point is FindSubdomains.com, which boasts more than 890 million indexed subdomains. One of the best things about this tool is that it doesn’t actively scan the website, so you don’t risk alerting the site administrator that you’re investigating their activities. The tool also works via the Tor Browser, which adds a layer of privacy to your searches.

In addition to subdomains, FindSubdomains.com also lists IPs, countries where a site is hosted and other potentially interesting and useful data. Here is a fragment of what comes up when searching for “tacticaltech.org”:

../_images/web-findsub.png
Subdomain example for tacticaltech.org via FindSubdomains.com

DNSDumpster

DNSDumpster provides similar data about subdomains, server locations and other domain information. Like FindSubdomains.com, it does not actively scan the website as you request this information, which means that your searches cannot be tracked by the website you are investigating. It also works via the Tor Browser.

../_images/web-dnsdumpster.png
Subdomain example for tacticaltech.org via DNSDumpster.com

Although we reviewed quite a lot of tools and methods already, there is much more out there for those of you passionate about online investigations. For more tips and techniques related to uncovering hidden website content, have a look at another Kit resource: “Search Smarter by Dorking”.



Safety First!

HOW TO STAY SAFE WHEN INVESTIGATING WEBSITES

Searching for and collecting information about domain ownership, history, website source code, metadata and the many other elements that can help you build evidence when investigating websites involves navigating a large number of online tools and services. Some of these work with the Tor Browser, which allows you to protect your privacy to some extent. Others not only fail to work over Tor but also require you to sign up with an email address, name and other personal details.

Here are some suggestions for digital safety tools and techniques you can use to protect your privacy as well as the security of your devices and data when investigating online.

ACCOUNTS

Some services require users to create an account, to choose a username, to provide payment information, to verify an email address or to associate a social media profile in order to gain access to information on their platforms.

You should consider establishing a separate set of accounts, for use with services like these, in order to compartmentalise (separate) your investigative work from your personal online identity.

In some cases, you might even want to create a single use “identity” for a particular investigation, and dispose of it once research is done.

Either way, your first step will be to create a relatively secure, compartmentalised email account, which you can do quite easily with Tutanota (tutanota.de) or Protonmail (protonmail.com).

BROWSERS

As someone who is looking to uncover hidden truths, you probably already use the internet for personal communication and for some of your research.

It’s a good idea to use different browsers for your research and for casual web browsing. By doing so, you are practicing “compartmentalisation”, marking one browser for research and another for everything else. It’s like sorting things into two different boxes or compartments.

We recommend you choose a “privacy aware” browser for your research and avoid logging in to web-based email and social media on that browser. Using a privacy aware browser will prevent a lot of your personal data from being sent to the sites you visit.

Before using any of the online tools we talk about here or in the overall Kit, it’s a good idea to download and install one of these browsers. Then, add an extra layer of certainty by testing the browser with a tool like Browserleaks or Panopticlick (shown in the “Safety First!” section above). The results should look different from those you get when you visit Browserleaks or Panopticlick with a normal browser, which would usually reveal more weaknesses.

These are some examples of tools that can help protect your privacy while researching online, with some pros and cons of using them.

Tor Browser

Pros: This is the best privacy aware browser. The code is published openly so anyone can see how it works. It has a built-in way of changing your IP address and encrypting your traffic.

Cons: There are places in the world where Tor Browser usage is blocked or banned. While there are ways around these blocks, such as Tor Bridges, using Tor may also flag your traffic as suspicious in such places.

What if I can’t use Tor Browser? There are cases when Tor browser might not be the best for you. Here are some other options. These other browsers are not on the same level as Tor but they can be considered. Be sure to always test the browser you choose in Browserleaks, Panopticlick or other such tools.

Firefox

Pros: Firefox blocks trackers and cookies with a setting called “Enhanced Tracking Protection”, which is automatically turned on when you set “Content Blocking” to “strict”.

Cons: This option is off by default, so you need to turn it on. When you use Firefox, it’s important to remember that your IP address is still visible to the sites you visit. WebRTC is enabled by default and can leak your real IP address, even if you are using a VPN, for instance.

Brave

Pros: Brave tries to protect privacy without the need for turning options on or adding add-ons or extensions. Brave has a security setting to erase all Private Data when the browser is closed. It has a feature called ‘Shields’ where you can block ads and trackers. Brave also allows you to create a new “Private Tab with Tor”, which uses the Tor network to protect your IP address (regular use won’t protect it). This even allows you to visit Tor hidden service sites - which are sites that end in .onion and are configured to be securely accessed only by Tor-enabled browsers. If you encounter a webpage that blocks Tor you can decide whether or not to visit it with Tor turned off.

Cons: Brave has a feature called “payments” or “Brave payments” – this is for those wishing to donate to content creators or websites they access via Brave (a portion of the payments goes to the browser to sustain its operations). It’s important to keep this option off as it sends data that could be used to identify you. When you use Brave, you should use the ‘Private Tab with Tor’ feature to protect your IP address.

Epic Browser

Pros: Epic browser has a built-in technology to hide your IP address called an encrypted proxy.

Cons: Epic is only for Mac and PC, not Linux.

Waterfox and Palemoon

Pros: These are two different projects based on Firefox but they have removed code that can send information to Mozilla, the owner of Firefox.

Cons: These browsers are based on older versions of Firefox code. Palemoon is not available for Apple computers. When you use Waterfox or Palemoon, it’s important to remember that your IP address is still leaked to the sites you visit.

DuckDuckGo

Pros: This is a privacy-aware search engine (not a browser) that claims not to collect any personal data about its users. You can use DuckDuckGo in combination with the Tor Browser to further preserve your privacy.

Cons: DuckDuckGo does save your search queries but it doesn’t collect data that can identify you personally.

VIRTUAL PRIVATE NETWORKS (VPNs)

Unless you are using Tor Browser, we recommend you always use a Virtual Private Network (VPN) when conducting your research.

We have explained that visiting a website is like making a phone call. The website you are visiting can see your “number” - your IP address - which can be used to map where you are coming from.

To illustrate, if you are researching a corporation and frequently visit its board of directors page – a page that typically gets very little traffic - your repeated visits from your specific location might make the company aware of your research.
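To see why this matters, consider what the other side can do with very little effort. The sketch below shows how a site administrator might count requests for a single page by IP address. The log lines are hypothetical: they imitate the widely used "common log format", and the IP addresses come from ranges reserved for documentation. The function name is our own.

```python
from collections import Counter

# Hypothetical log lines in the "common log format" (an assumption, for illustration).
SAMPLE_LOG = """\
203.0.113.7 - - [02/Apr/2019:10:01:12 +0000] "GET /about/board HTTP/1.1" 200 5120
198.51.100.23 - - [02/Apr/2019:10:05:40 +0000] "GET / HTTP/1.1" 200 10240
203.0.113.7 - - [03/Apr/2019:09:15:03 +0000] "GET /about/board HTTP/1.1" 200 5120
203.0.113.7 - - [04/Apr/2019:09:47:55 +0000] "GET /about/board HTTP/1.1" 200 5120
"""


def visits_by_address(log_text, page):
    """Count how many requests for one page came from each IP address."""
    counts = Counter()
    for line in log_text.splitlines():
        fields = line.split()
        # In this format the IP is the first field and the requested path the seventh.
        if len(fields) > 6 and fields[6] == page:
            counts[fields[0]] += 1
    return counts


board_visits = visits_by_address(SAMPLE_LOG, "/about/board")
print(board_visits.most_common())
```

In this made-up log, a single address accounts for every visit to the board of directors page, which is exactly the kind of pattern that could draw attention.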

One way you can work against being identified in this situation is by disguising your IP address. This is what a VPN does: rather than seeing your real IP address, sites you visit will see the IP of the VPN provider.

You can think of the VPN as a concrete tunnel between you and the site you want to visit. The VPN creates a tunnel around your traffic so it can’t be observed from the outside, and routes it through an intermediary server owned by your provider, so your traffic looks to any site you visit like it is coming from a different location than where you actually are. Your internet service provider will see that you are connected to a VPN, but not which sites you visit through it. The sites you visit will not see your real IP address; they will only see that your traffic is coming from the IP address of your VPN provider.

There are many VPN options and it can be confusing to decide which one to pick. To add to the confusion, most VPN reviews and listings are not independent, and some are heavily biased. ThatOnePrivacySite is a VPN review site we can endorse.

We recommend choosing a VPN company that states it does not keep logs of your traffic.

While most free VPNs should be avoided because they are often funding their operation by selling their log data (records of what sites users visit via the VPN), there are some reputable ones we can endorse, such as:

Published April 2019

Resources and tools

RESOURCES

Articles and Guides

Tools and Databases

  • IntelTechniques by Michael Bazzell. An open source intelligence and digital forensics resource with tools, guides and tips useful for investigating websites and people online.
  • ICANN Whois, from the Internet Corporation for Assigned Names and Numbers. The official ICANN Whois search tool for websites registered around the world.
  • Panopticlick, from the Electronic Frontier Foundation. An online tool that analyses how well your browser and add-ons protect you against online tracking techniques.

Glossary

Algorithm – an established sequence of steps to solve a particular problem.

API – stands for application programming interface, by which a platform can make its data accessible to external developers for free or under some conditions or fees.

Bandwidth – in computing, the maximum rate of information transfer per unit of time, across a given path.

Bot – also called web robot or internet bot, is a software application that runs automated tasks over the internet. For example, a Twitter bot that posts automated messages and news feeds.

Browser extension – also called an add-on, a small piece of software used to extend the functionality of a web browser. Extensions range from tools that take screenshots of the webpages you visit to ones that check and correct your spelling or block unwanted ads on websites.

Brute force - a password cracking technique that involves trying every possible combination.

CAPTCHA – an automated test used by websites and online services to determine whether a user is human or robot. For example, a test asking users to identify all traffic lights in a series of nine pictures.

Cloud storage – a data storage model whereby information is kept on remote servers that users can access via the internet.

Content Management System (CMS) - software used to manage content that is later rendered into pages on the internet.

Crawler – also called a spider, is an internet robot that systematically browses the internet, typically for the purpose of Web indexing (Wikipedia)

Database – a system used to store and organize collections of data with a particular focus or purpose. For example, a database of land ownership in country Z.

Dataset – a collection of data sharing some common attributes and that is usually organized in rows and columns for easier processing. For example, a dataset of the foreign owners of land and properties in country Z.

Directory – a container used to categorise files or other containers of files and data.

Domain name - a name that is commonly used to access a website (e.g. tacticaltech.org). Domain names are translated into IP addresses.

Domain Name Service (DNS) - the distributed service that converts domain names into IP addresses like 213.108.108.217.

Domain Name System (DNS) – a naming system used by computers to turn domain names into IP addresses in order to connect to websites.

DNS leak – when requests to visit a certain site or domain are exposed to an internet provider despite efforts to conceal them using a VPN.

DNS query – the process of asking to translate a domain name into an IP address.

Full-disk encryption (FDE) – encryption that happens at a device or hardware level. For example, encrypting an entire computer’s disk would also automatically encrypt all the data saved on it.

Encryption - a way of using clever mathematics to encode a message or information so that it can only be decoded and read by someone who has a particular password or an encryption key.

Internet Protocol (IP) address – a set of numbers used to identify a computer or data location you are connecting to. Example: 213.108.108.217

JSON – stands for JavaScript Object Notation, a data-interchange format.

Metadata – information about information. E.g.: the content of a sound file is the recording, but the duration of the recording is a property of the file that can be described as metadata.

Public (web) feed – an online data providing service that gives updated information on a regular basis to its users or the general public. It can be set up via subscription to the feed of a website/media or it can be publicly available to everyone.

Registrar - a company that provides domain registration services.

Registrant - a person who registers a domain.

Robots.txt – a file on a website that instructs automated programs (bots/robots/crawlers) on how to behave with data on the website.

Root Directory – the topmost level folder or directory, which may or may not contain other subdirectories.

Script – a list of commands that are executed by a certain program to automate processes, e.g. visit a URL every two seconds and save the data that is returned.

Server - a computer that remains on and connected to the Internet in order to provide some service, such as hosting a webpage or sending and receiving email to/from other computers.

Server configuration – a combination of settings that determine the behavior of the server.

Sitemap protocol - a set of guidelines that enables site administrators to inform search engines about pages on their site that are available for crawling.

Subdomain – an extra identifier, typically added before a domain name, that represents a subcategory of content (e.g. google.com is a domain name whereas translate.google.com is a subdomain).

Source code - The underlying code, written by computer programmers, that allows software or websites to be created. The source code for a given tool or website will reveal how it works and whether it may be insecure or malicious.

Targeted advertising – a form of advertising that aims to reach only certain selected groups or individuals with particular characteristics or from specific geographic areas. For example, placing bicycle sale ads on the Facebook accounts of young people in Amsterdam.

Subdirectory – a directory within a directory.

Tor Browser – a browser that keeps your online activities private. It disguises your identity and protects your web traffic from many forms of internet surveillance. It can also be used to bypass internet filters.

Web tracker – tool or software used by websites in order to trace their visitors and how they interact with the site.

Uniform Resource Locator (URL) – a web address used to retrieve a page or data on a network or the internet.

Virtual Private Network (VPN) - software that creates an encrypted “tunnel” from your device to a server run by your VPN service provider. Websites and other online services will receive your requests from - and return their responses to - the IP address of that server rather than your actual IP address.

Virtual private server (VPS) - a virtual machine, rented out as a service, by an Internet hosting company.

Web domain – a name commonly used to access a website which translates into an IP address.

Web interface – a graphical user interface in the form of a web page that is accessed through the internet browser.

Website log – a file that records every view of a website and of the documents, images and other digital objects on that website.

Webpage – a document that is accessible via the internet, displayed in a web browser.

Web server – also known as internet server, is a system that hosts websites and delivers their content and services to end users over the internet. It includes hardware (physical server machines that store the information) and software that facilitates users’ access to the content.

Website – a set of pages or data that is available remotely, typically to visitors with internet or network access.