Retrieving and Archiving Information from Websites¶
by Wael Eskandar, Brad Murray
In Short: You will explore ways to find and retrieve historical and ‘lost’ information from websites, to serve as evidence that something existed online, and ways to archive and preserve your own copies of webpages for future reference.
Sometimes, when you want to verify online information, you’ll end up following a trail that leads to broken links or to websites that are no longer available.
Other times, you’ll come across websites with vital information that could add great value to a story, but you won’t realize its value until later.
When you revisit that website to document it, you may find that it no longer exists, that the specific webpage you remember has been removed or that the information you need is no longer accessible and has been replaced with new content.
You’re likely to face all of these challenges at some point during the course of your investigations.
One notorious example of a webpage being removed, which would later prove to contain crucial evidence for investigators, was Facebook’s list of success stories in political campaigns around the world.
Originally, Facebook’s website championed several “Government and Politics” projects whereby political parties and candidates used the social network’s tools and services to target voters online and influence electoral outcomes. That page was available here: https://www.facebook.com/business/success/categories/government-politics. The link was valid until, all of a sudden, it wasn’t.
Facebook removed the page from its list of success stories in early 2018, after the Cambridge Analytica data harvesting scandal broke out and attracted intense scrutiny to the company’s practice of allowing third parties to access its userbase for commercial and political purposes.
The Intercept provides background on this case here.
What if there were some way to travel back in time and get a copy of that webpage, or even a portion of it, before it was altered or taken down?
Luckily, there are some easy ways to retrieve old content and deleted pages so you can still reference them in your investigation. You can also save currently accessible pages so that you can use them later, even if they are modified or deleted in the meantime.
There are several such services that automatically archive prior versions of websites. Apart from content, these digital archives often contain information that can help you identify other important data such as the owner of a website, useful names, contact details, documents and links to other sites. Some of these services allow you to contribute to the list of websites they archive by manually saving webpages at times of your choice. You (and others) can then retrieve snapshots of those websites later on.
Returning to our case above, with the help of one such service - the Internet Archive’s Wayback Machine (explored in detail below) - we can find an archived list of the political projects that Facebook once featured under the now defunct “Government and Politics” section of its “Success Stories” webpage https://www.facebook.com/business/success. A search for “https://www.facebook.com/business/success/categories/government-politics” in the Wayback Machine reveals that these “Government and Politics” examples were still online in 2017, as saved in the Internet Archive here.
Screenshot of Wayback copy of Facebook's now-removed webpage on "Success Stories - Government and Politics".
More importantly, some of the links on the archived page still work, so you can read the details of the political campaign projects that Facebook once featured.
Archived versions of websites like these preserve information that can be incredibly valuable for investigators.
Journalist and security researcher Brian Krebs used archived material from a website that sold malware in order to identify the likely authors of that malware. An archived version of the site contained an account number for WebMoney (a global payment system for online businesses) that was linked to a username belonging to someone who had been promoting the malware on an underground forum. Following this lead, Krebs was able to trace usernames from that forum back to real identities of the individuals who allegedly created and distributed the malware kit.
When you direct an archive service to a webpage that interests you, it will crawl that webpage and store a copy of it. When it does so, the server hosting that webpage will automatically add a record to its “access log” (which most websites keep) of when, and from which IP addresses, it has been visited.
Either an attentive website administrator or an automated process might then realise that a portion of their site has been archived by the Wayback Machine.
This, in turn, might give them clues that someone is investigating a particular piece of content or a person relevant to them. In some cases this alone could diminish the impact of your investigation if what you are working on is sensitive and must be kept out of the public eye for at least a while.
At a minimum, the website administrator could have the archived material removed from the Wayback Machine. (This is one reason why it’s a good idea to make your own offline copy of anything that’s crucial to your investigation.) That administrator could also remove or modify similar content that you have not yet found.
Most archiving services keep access logs as well.
Furthermore, some services require each user to create an account, to choose a username, to provide payment information, to verify an email address or to associate a social media profile.
You should consider establishing a separate set of accounts, for use with services like this, in order to compartmentalise (separate) your investigative work from your personal online identity.
In some cases, you might even want to create a single-use “identity” for a particular investigation, and dispose of it once your research is done.
Paying for commercial services in a way that does not link back to your personal identity is much more difficult. If you live in a region where you can buy a prepaid credit card with cash, that may be your best option.
In the potential situation above - the website administrator who observes a sudden interest from the Wayback Machine - it is worth noting that the subject of your investigation cannot necessarily trace that interest back to you. If your archiving service is trustworthy, and if nobody has access to both the website’s logs and the archiving service’s logs, that administrator may have a difficult time connecting the dots.
That said, it is better to take the precautions recommended above than to rely on this assumption. Suppose, for example, that only a handful of IP addresses viewed the archived page on the same day that it was added to the Wayback Machine. It would be easy for anyone to figure out that they are being watched from a particular place.
Any small investment of time, before you begin your investigation, can help you limit these kinds of risks.
Archiving and retrieving content with the Wayback Machine¶
The Wayback Machine is a project of the San Francisco-based non-profit Internet Archive, a digital library that has been dedicated to preserving websites since 1996, as part of an effort to archive the internet and provide universal access to all knowledge. As of early 2019, it has archived approximately 345 billion webpages.
The Wayback Machine
The Wayback Machine is an essential tool for researchers, historians, investigators and scholars. It is freely available to the public and can help you access archival snapshots of webpages taken at various points in time.
The Wayback Machine’s automated crawlers (also referred to as spiders) can access and archive virtually any public website. However, crawlers don’t have a fixed pattern of deciding which websites they visit and how often they do so, as they are subject to resource constraints and policy decisions that influence their operation.
As a result, you may not always find an archived version from a specific day, month, or even year. Furthermore, websites can opt out of being archived by services like the Wayback Machine. By publishing a set of restrictions in a text file called ‘robots.txt,’ a website can instruct crawlers to exclude some or all of its content from archiving or indexing. Nevertheless, the Wayback Machine’s vast trove of data will likely be indispensable in many of your investigations.
Robots.txt is a file that sits on a website and lists portions of the site that should or should not be accessed by crawlers. If a website has a robots.txt file, you can view it by adding “/robots.txt” to its domain or subdomain. For example: https://google.com/robots.txt.
Websites can use this file to block crawlers from the Wayback Machine, from search engines like Google or from any other indexing or archiving service. There are a number of reasons why some website administrators opt for restrictive robots.txt files: to limit bandwidth costs, to reduce strain on overloaded servers, to protect trademarked images or to keep unfinished websites from showing up in search results, for example. In some cases, however, they do so in order to obscure potentially sensitive content.
While the Wayback Machine does not always comply with these restrictions, there are still many websites that its crawlers refuse to archive as a result of robots.txt directives. If you have trouble using the Wayback Machine to view or archive some but not all of the pages on a website, you can check its robots.txt file to see if any portions of the site are “disallowed.”
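The check described above can also be sketched in a few lines of Python using the standard library's `urllib.robotparser`. This is only an illustration: the robots.txt content is made up for the example, and the user-agent name `ia_archiver` is an assumption about how a site might address the Internet Archive's crawler (the text above does not name it).

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, written for this example only.
SAMPLE_ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /private/

User-agent: *
Disallow: /drafts/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# "ia_archiver" is an assumed crawler name, not taken from the text.
for page in ("https://example.org/private/report.html",
             "https://example.org/about.html"):
    print(page, "allowed:", is_allowed(SAMPLE_ROBOTS_TXT, "ia_archiver", page))
```

To test a live site instead, you could download its robots.txt (for example with `RobotFileParser.set_url()` followed by `read()`), keeping in mind that this request will also appear in the site's access log.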
Apart from offering a simple interface for retrieving automatically archived websites, the Wayback Machine also allows you to manually store snapshots of webpages so you can make sure they do not suddenly disappear.
Not only can this service archive webpages that are relevant to your investigation, but it also provides an easy way for you to cite research and link to content as your investigation takes shape.
While it is often a good idea to save HTML or PDF copies of important webpages to your own devices to make sure that you have multiple back-ups, archiving them with the Wayback Machine can add an element of neutrality and trust if you end up sharing those archives with others. It is also far more convenient, for most people, than maintaining an offline library of digital files.
Looking up pages with Wayback Machine¶
In order to find a page that is no longer accessible, or to view an older version of a webpage, simply go to https://web.archive.org and enter the web address that you are searching for.
If the page was previously archived, the dates when it was saved will appear on a calendar of the current year. You can navigate to previous years using the timeline, which also displays a graph of how often the page was archived each year. After clicking on the year in which you are interested, archives from that year will be marked on the calendar with color-coded dots.
Here, we are using the example of https://cambridgeanalytica.org/, a website that was taken down in 2018 due to the closure of the company (see above example of the Cambridge Analytica scandal).
Screenshot of the Wayback Machine calendar to access Cambridge Analytica's website
A blue dot indicates that a full webpage capture took place on that date. These are usually the archives you are looking for. A green dot indicates that, when the crawler accessed that web address, it was automatically redirected to another page on the same website. These archives might not contain the content for which you are searching. Orange and red dots indicate that an error occurred during the archiving process, possibly due to a fault with the crawler or the website’s server. A large dot indicates that multiple archives were stored on that day. You can hover over a dot to select a specific archive based on the time of day.
After you select an archived version of the page, the Wayback Machine’s navigation bar is displayed at the top of the screen. This allows you to browse between different archives of that page by using the timeline or by clicking on the “next” and “previous” buttons.
Archived Cambridge Analytica page in Wayback Machine
In order to help establish the validity of your online evidence, you might need to verify the exact date and time when the Wayback Machine crawled and archived a webpage. You can do this by checking the ‘time stamp’ that is embedded in the web address of the archive. This time stamp is formatted as a four-digit year followed by two-digit representations of the month, day, hour, minute and second when the archive was captured. You can find it between “https://web.archive.org/web/” and the web address of the archived page. For example, the following archive was captured on 31 August 2017 at 06:00:27: https://web.archive.org/web/20170831060027/https://cambridgeanalytica.org.
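If you need to read time stamps from many archive links, a short script can do it for you. The sketch below is one possible approach in Python, using the example address from the paragraph above. The function name is hypothetical, and the pattern only handles addresses in exactly this form.

```python
import re
from datetime import datetime

# Matches addresses of the form
# https://web.archive.org/web/<14-digit time stamp>/<original address>
WAYBACK_URL = re.compile(r"^https?://web\.archive\.org/web/(\d{14})/(.+)$")


def parse_wayback_url(archive_url):
    """Split a Wayback Machine address into (capture time, original address)."""
    match = WAYBACK_URL.match(archive_url)
    if match is None:
        raise ValueError("not a recognised Wayback Machine address: " + archive_url)
    stamp, original = match.groups()
    # Four-digit year, then two digits each for month, day, hour, minute, second.
    return datetime.strptime(stamp, "%Y%m%d%H%M%S"), original


captured_at, original = parse_wayback_url(
    "https://web.archive.org/web/20170831060027/https://cambridgeanalytica.org"
)
print(captured_at.isoformat(), original)
# prints: 2017-08-31T06:00:27 https://cambridgeanalytica.org
```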
Quick look-up techniques using your browser¶
The Wayback Machine also allows you to request a particular website archive that it stores without going through its search interface. Instead, you can do so from your own browser by going to a correctly formatted web address.
Just add the website’s address to the end of the Wayback Machine address:
- “https://web.archive.org/www.yoursite.com/” (where “www.yoursite.com/” is any site you wish to search): your browser will display the latest archived version of the site you wish to view.
- If you separate the two addresses with an asterisk (*), your browser will load the archive’s calendar view: “https://web.archive.org/*/www.yoursite.com/”
- If you add an asterisk to the end as well, the Wayback Machine will show you all of the archives under that domain, not just the homepage: “https://web.archive.org/*/www.yoursite.com/*”
For example, browsing to https://web.archive.org/web/*/cambridgeanalytica.org/* will display a page-by-page listing of all cambridgeanalytica.org pages archived by the Wayback Machine.
Cambridge Analytica page listing in Wayback Machine
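The three address patterns above are easy to generate with a script, which can be handy if you are checking a long list of sites. The following Python sketch simply assembles the strings; the function names are hypothetical, and it follows the shorter pattern shown in the list (the cambridgeanalytica.org example uses a variant with “/web/” in it).

```python
WAYBACK = "https://web.archive.org"


def _normalise(site):
    """Make sure the site address ends with exactly one slash."""
    return site.rstrip("/") + "/"


def latest_capture_url(site):
    """Address that loads the latest archived version of the site."""
    return f"{WAYBACK}/{_normalise(site)}"


def calendar_url(site):
    """Address that loads the calendar view for the site."""
    return f"{WAYBACK}/*/{_normalise(site)}"


def all_pages_url(site):
    """Address that lists every archived page under the site."""
    return f"{WAYBACK}/*/{_normalise(site)}*"


for build in (latest_capture_url, calendar_url, all_pages_url):
    print(build("www.yoursite.com"))
```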
Using the Wayback Machine to archive webpages¶
Another key feature of the Wayback Machine is its ability to archive webpages on demand.
Whether you are looking to save and preserve information for an investigation or ensure the accessibility of your own published work, you can navigate to https://archive.org/web and find the “Save Page Now” form toward the lower, right-hand corner of the page. Simply enter a web address (say “http://www.yoursite.com/projects”) and click the “SAVE PAGE” button.
Unless the website you enter has denied access to the Internet Archive’s crawlers, as discussed in the robots.txt section above, the Wayback Machine will begin archiving it. You will see a progress bar that will let you know when the page has been saved. At that point, you will be able to view the page’s archive, and a timeline will display any previous captures from that site.
Saving Guardian webpage on Cambridge Analytica in Wayback Machine
Saved Guardian webpage on Cambridge Analytica in Wayback Machine
The above steps will only archive the page you submitted (“http://www.yoursite.com/projects”, in this case) not all of the content on that website. If you want to archive an entire website using this method, you will need to submit each page separately.
Furthermore, this feature does not guarantee that regular archives of the page will be captured in the future, so you might want to revisit the Wayback Machine from time to time to request additional snapshots.
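Because each page has to be submitted separately, you may want to script the submissions. The sketch below assumes that the Wayback Machine accepts save requests at an address of the form “https://web.archive.org/save/” followed by the page address. That endpoint is not described in this guide, so treat it as an assumption and check the Internet Archive's current documentation and rate limits before relying on it. The function names, the example page list and the user-agent string are all hypothetical.

```python
import time
from urllib.request import Request, urlopen

# Assumption: save requests are accepted at this address (not stated in the text).
SAVE_ENDPOINT = "https://web.archive.org/save/"


def save_request_url(page_url):
    """Build the address that asks the Wayback Machine to capture page_url."""
    return SAVE_ENDPOINT + page_url


def request_snapshot(page_url, timeout=60):
    """Send one save request and return the final address after redirects.

    This makes a real network request from your own IP address, so the
    precautions discussed earlier in this guide apply here as well.
    """
    request = Request(save_request_url(page_url),
                      headers={"User-Agent": "archive-sketch/0.1"})
    with urlopen(request, timeout=timeout) as response:
        return response.geturl()


def request_snapshots(page_urls, pause_seconds=10):
    """Submit pages one at a time, pausing between requests."""
    results = {}
    for page_url in page_urls:
        results[page_url] = request_snapshot(page_url)
        time.sleep(pause_seconds)
    return results


# Hypothetical list of pages on one site; nothing is sent until you call
# request_snapshots(pages) yourself.
pages = ["http://www.yoursite.com/projects", "http://www.yoursite.com/about"]
print([save_request_url(page) for page in pages])
```

It is worth opening each returned address in a browser afterwards to confirm that the capture actually contains the content you expected.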
Downloading archive content¶
Unfortunately, the Internet Archive does not allow you to search the full text of all the websites in its vast archive. While it does offer a search function for the main pages of certain archives, it does not currently index all of its 345 billion pages. If you want to search through archived content from a particular domain, however, there is a way to do it.
If you install the Ruby programming language on your computer (version 1.9.2 or higher), you can use the Wayback Machine Downloader script to download all of the archived files under a given domain. This script lets you specify the date range you want to download, which can be helpful if you are working with sites that have been archived for several years.
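Once the archived files are on your computer, you can search them yourself. The Python sketch below is one simple way to do this and is not part of the downloader script: it walks through a folder of downloaded files and reports every line that contains a keyword. The function name, the folder layout and the list of file extensions are assumptions made for this example.

```python
from pathlib import Path


def search_downloaded_archive(root, keyword, suffixes=(".html", ".htm", ".txt")):
    """Return (file, line number, line) for every line that contains keyword.

    The search is case-insensitive and only looks at files whose extension
    is listed in suffixes.
    """
    needle = keyword.lower()
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix.lower() not in suffixes:
            continue
        # errors="replace" avoids crashes on pages saved in other encodings.
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if needle in line.lower():
                hits.append((str(path), number, line.strip()))
    return hits
```

For example, `search_downloaded_archive("downloaded_site", "webmoney")` would list each matching line under a hypothetical folder called “downloaded_site”.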
Limitations of the Wayback Machine¶
As mentioned above, not all websites are automatically or regularly archived by the Wayback Machine.
Sites are chosen based on algorithms that use criteria such as how often people visit them and how often other websites link to them (which is also an indicator of credibility). Some of this data comes from the rankings produced by Alexa, a leading web traffic, statistics, and analytics company.
In addition, the Internet Archive runs its own crawlers and works with hundreds of volunteers who execute searches and archive websites to preserve the internet’s abundance of information.
While you can archive certain pages manually, as shown above, you cannot influence the set of websites that the Wayback Machine will automatically and regularly archive.
The Wayback Machine has other limitations as well. Examples include:
- Password-protected websites are not archived.
- Website administrators can explicitly request that their sites not be archived, either by publishing a restrictive robots.txt file, as seen above, or by sending a direct request to the Internet Archive.
- Website administrators can request that previously archived content be removed from the Wayback Machine.
- There is currently no full-text search available on the Internet Archive.
To illustrate how archives can also disappear sometimes, the Internet Archive was recently at the centre of a debate over a blog run by journalist Joy-Ann Reid. Reid’s attorneys reached out to the Internet Archive and attempted to have archived versions of her blog removed, claiming that some of her articles had been manipulated by an unknown party who inserted fraudulent content in her writings – content that was then archived with the blog.
When that didn’t work, Reid’s blog simply changed its robots.txt file to restrict the access of the Wayback Machine’s crawlers. When the crawlers picked up the change, they automatically removed the blog’s archive altogether. This case illustrates how people and organisations can use both legal and technical means to remove content from these third-party archives.
In the European Union and a few other regions, the Right to Be Forgotten gives individuals the option to request that search engines and digital archives remove indexed content related to them that they deem harmful or libelous. This right has limitations, so not everything can or will be removed upon request. Still, it is worth keeping in mind that some subjects of your investigation (politicians, criminals and other controversial figures) could use it to take down internet content that is relevant to your investigation.
Keep in mind that domain names can be sold and that abandoned domain names can be re-registered. As a result, a single domain is sometimes managed, over time, by multiple owners. In such cases, a website’s archive history might not be continuous, and older material might not be relevant to your investigation.
Other ways to retrieve and archive webpages¶
Archive.today (formerly archive.is) archives web pages much like the Wayback Machine.
Archive.today differs, however, by only storing individual pages, rather than entire websites, and it does so only at the request of its users, not automatically.
Here is an example of archived pages from https://cambridgeanalytica.org/:
Cambridge Analytica accessed in Archive.today
Since it doesn’t crawl sites, it doesn’t have nearly the breadth of information you can find on the Wayback Machine.
It does provide three key features, however:
- First, unlike the Wayback Machine, it allows you to search the full text of its archives.
- Second, it ignores any restrictions that might be specified in the robots.txt files of the websites that it archives. As a result, it can save snapshots of some pages that the Wayback Machine cannot, such as public Facebook profiles and Twitter posts.
- Third, it also saves both a text copy and a graphical screenshot of the archived pages. This sometimes provides greater accuracy than saving the page itself, especially when archiving content that changes rapidly (such as rolling images or snapshots of forum messages, etc.).
You can look up a webpage archive by typing its exact web address (such as “https://cambridgeanalytica.org”) or you can use a wildcard (*) to find archived subdomains or subdirectories of the website (for example, “*.cambridgeanalytica.org”). Here is a search for *.cambridgeanalytica.org in archive.today:
Search for Cambridge Analytica in Archive.today
Like the Wayback Machine, archive.today provides you with direct links to the archived content using web addresses with embedded date stamps, like the following: http://archive.today/2018.01.01-042001/https://ocean.cambridgeanalytica.org/
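The date stamp in such an address can be read with a short script as well. The Python sketch below uses the example address above. It assumes that the six digits after the hyphen stand for hours, minutes and seconds, which the example suggests but this guide does not state explicitly. The function name is hypothetical.

```python
import re
from datetime import datetime

# Matches addresses of the form
# http://archive.today/<YYYY.MM.DD-HHMMSS>/<original address>
ARCHIVE_TODAY_URL = re.compile(
    r"^https?://archive\.today/(\d{4}\.\d{2}\.\d{2}-\d{6})/(.+)$"
)


def parse_archive_today_url(archive_url):
    """Split an archive.today address into (capture time, original address)."""
    match = ARCHIVE_TODAY_URL.match(archive_url)
    if match is None:
        raise ValueError("not a recognised archive.today address: " + archive_url)
    stamp, original = match.groups()
    # Assumption: the digits after the hyphen are hours, minutes and seconds.
    return datetime.strptime(stamp, "%Y.%m.%d-%H%M%S"), original


captured_at, original = parse_archive_today_url(
    "http://archive.today/2018.01.01-042001/https://ocean.cambridgeanalytica.org/"
)
print(captured_at.isoformat(), original)
```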
Archive.today also offers a Tor onion service at archivecaslytosk.onion. Onion services can only be accessed through the Tor Browser, but they make it easier for you to keep your interaction with the service anonymous. This is particularly useful if you are researching a sensitive topic or you suspect that your online activities may be tracked.
Google Cache is another way to find a page that has recently been taken down or is otherwise inaccessible.
When Google accesses a webpage, it creates a cached version, or a copy, of that page as a backup. It often makes these copies available in its search results.
In order to access Google’s cached version of a page, use Google’s search engine to search for the page you want to find, click on the small arrow to the right of the search result’s web address and select “cached”. This will load a cached version of the website that was backed up by Google when its crawlers previously indexed the site.
Google Cache screenshot
In the case above, we tried searching for a cache of the now defunct website http://cambridgeanalytica.org/ but as of 28 February 2019 that is no longer available in a Google search (we could only find a webform instead). However, a cached version of it was still available on 26 February 2019 and, as seen below, we were able to capture it with archive.today.
Cambridge Analytica on Archive.today
Unlike the archiving services mentioned above, Google’s cache does not provide historical records of the pages it stores.
Instead, it displays the contents of those pages the last time its crawlers accessed them, so it might reveal content that is missing from the current version of a webpage or give you access to a page that has since been taken down.
Finding a cached webpage indicates that it once existed, but caches are frequently overwritten with updated content or disappear altogether (as in our case above). Furthermore, website administrators can request that Google remove pages from its cache.
For one reason or another, Google might not preserve a cached page long enough for you to use it as evidence in your investigation, so it is often a good idea to back up the cached page itself using an additional service, such as archive.today, and to make your own offline copy as a backup. Screenshots and PDFs are useful for documenting how you found a particular version of a page and can help you later on if you need to demonstrate that the information is accurate.
When you archive a webpage with a service like the Wayback Machine or archive.today - especially if it has a long, complicated web address like an archived copy of a Google Cache entry - be sure to record that link somewhere in a file on your computer, in a secure cloud folder or elsewhere. Relying on your browser history to find such things is a recipe for disaster.
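One low-effort way to keep such a record is a simple spreadsheet-style log file. The Python sketch below appends one row per archived link to a CSV file, together with the time you recorded it. The function name, the column names and the file name are hypothetical choices for this example, not a prescribed format.

```python
import csv
from datetime import datetime, timezone
from pathlib import Path

FIELDS = ["recorded_at_utc", "original_url", "archive_url", "note"]


def record_archive_link(log_path, original_url, archive_url, note=""):
    """Append one archived link to a CSV log, creating the file if needed."""
    path = Path(log_path)
    is_new = not path.exists() or path.stat().st_size == 0
    row = {
        "recorded_at_utc": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "original_url": original_url,
        "archive_url": archive_url,
        "note": note,
    }
    with path.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()  # write the column names only once
        writer.writerow(row)
    return row
```

A call such as `record_archive_link("archive_links.csv", "https://cambridgeanalytica.org", "https://web.archive.org/web/20170831060027/https://cambridgeanalytica.org", "homepage capture")` adds one line to the log. If the log itself is sensitive, store it wherever you keep the rest of your investigation files.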
WebCite is a free service that offers a way to preserve links that have been cited in articles or journals, including webpages or other digital content on the internet.
This service is generally used by authors, editors, researchers and publishers who wish to preserve the online citations in their work.
WebCite allows for quick, manual preservation of individual web addresses. It also has a service that automatically ‘combs’ through uploaded text documents to preserve all citations that originate from online sources.
WebCite supports several different ways to retrieve cited material. In addition to readable and shortened web addresses, WebCite also provides citations with more advanced reference formats, such as DOI (Digital Object Identifier) and cryptographic hashes.
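To give a sense of how a cryptographic hash can act as a reference, the Python sketch below computes a fingerprint of a file you have saved yourself. It uses SHA-256 purely as an illustration; this guide does not say which algorithm WebCite uses, and the function name is hypothetical.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the SHA-256 fingerprint of a file as a hexadecimal string."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so that large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

If you note the fingerprint when you save a copy, anyone who later receives the file can recompute it and confirm that the copy has not changed since.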
Note: Visual site monitors
Another option to retrieve website contents and to stay updated if any changes occur is to use visual site monitors. These are services that can track and monitor visual changes in webpages, whether they happen in code, images, text, etc. They can be very useful for researchers and help automate some of the work if you need to monitor many websites that are useful in your investigation.
Visual site monitors archive webpages in a different way than the tools and services we explored above. You give the service a particular section of a webpage to watch, and it takes a snapshot, then monitors the page for visible changes.
If there are any changes, big or small, the site monitor will send you an email to let you know.
The email will include a link to a website where you can see more details. Some site monitors attach screenshots from before and after the change.
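The underlying idea of comparing a snapshot with a later version can be sketched in a few lines of Python. This is a deliberately simplified illustration: it compares two pieces of plain text with the standard library's `difflib`, whereas real site monitors compare rendered pages and screenshots. The function name and the sample text are made up for the example.

```python
import difflib


def describe_changes(before, after):
    """List the lines removed ("- ") and added ("+ ") between two snapshots."""
    diff = difflib.ndiff(before.splitlines(), after.splitlines())
    return [line for line in diff if line.startswith(("+ ", "- "))]


# Hypothetical text of a page at two points in time.
snapshot_before = "Success Stories\nGovernment and Politics\nContact"
snapshot_after = "Success Stories\nContact"

changes = describe_changes(snapshot_before, snapshot_after)
if changes:
    print("Page changed:")
    for line in changes:
        print(" ", line)
else:
    print("No visible change.")
```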
As an investigator, you can use a site monitor in conjunction with an archiving service to stay abreast of important website updates.
In order to notify you of changes, these tools require you to set up an account and to provide them with access to an email address or a phone number. You can avoid exposing your true identity and contact details by creating a separate email address, especially if you work on sensitive investigations.
Visualping offers a free plan that allows you to monitor up to 62 webpages a month. In practice, this works as a monthly allowance of checks that you can divide as you like: two webpages checked once a day, several webpages checked once a week, 62 webpages checked once a month, or any other combination that suits you. The free version can run checks hourly, daily, weekly or monthly to compare a webpage with its previous versions and alert you by email when modifications in text, images, keywords or any selected page areas take place. The service also works via the Tor Browser, and we recommend using this option for an extra layer of privacy and security.
ChangeTower offers a free plan that monitors up to three websites and conducts up to six checks per day (in this case, it can scan each website for changes twice a day). It can monitor a specific URL (webpage), an entire website or different variations (you can select which pages of a website you wish to monitor). It can search for changes in content (text), visual content, HTML, keywords, etc. The free plan stores your monitoring results for up to a month. The service also works via the Tor Browser, and we recommend using this option for an extra layer of privacy and security.
Published April 2019
Resources and tools¶
Articles and Guides¶
- Archive Today FAQs. A list of useful tips on how to preserve information and how to use already archived material in Archive Today.
- Wayback Machine and Internet Archive FAQs. A list of useful tips on how to preserve information and how to use already archived material in the Wayback Machine. Legal FAQs also available here.
- WebCite FAQs. A list of useful tips on how to preserve information and how to use already archived material in WebCite.
Tools and Databases¶
- Archive Today. A web archiving tool and database of archived web content.
- Wayback Machine. A web archiving tool and database of archived web content, run by the Internet Archive.
- WebCite. An on-demand archiving service and database that digitally preserves scientific and educationally important material on the web.
Access log - a file that records every view of a website and of the documents, images and other digital objects on that website. It includes information such as who visited the site, where from, for how long and what content they accessed
Algorithm – an established sequence of steps to solve a particular problem.
Bandwidth – in computing, the maximum rate of information transfer per unit of time, across a given path.
Bookmarklet – a complex web address that you can add to your list of browser ‘bookmarks’ or ‘favourites’. When you click on a bookmarklet, it typically sends information about the page you are currently visiting to a third party service.
Browser extension – also called add-ons, they are small pieces of software used to extend the functionalities of a web browser. These can be anything from extensions that allow you to take screenshots of webpages you visit to the ones checking and correcting your spelling or blocking unwanted ads from websites.
Cache – a temporary, high-speed storage for data that has been used or processed and may be retrieved again quickly rather than visiting the original source or redoing computing associated with the requested data.
Crawlers – software that automatically traverses internet pages to perform typically exploratory functions.
Cryptographic hash - a way of fingerprinting data by sending a file or other piece of information through an algorithm that summarizes it with a fixed-length alphanumeric string (a combination of letters and numbers, under 100 characters). This string is very hard to break mathematically, which means that you can give it to someone to help them determine if a larger file is the right one or is intact.
Directory – a container that is used to categorise files or other containers of files and data.
Digital Object Identifier (DOI) - a unique identifier that refers to published work, similar to ISBN, but for digitally published works. Allocation and administration of DOIs is coordinated by DOI Foundation https://www.doi.org/.
Domain name - also called a web domain, is a name commonly used to access a website which translates into an IP address.
Internet Protocol (IP) address - a set of numbers used to identify a computer or data location you are connecting to (e.g. 188.8.131.52)
Malware - software that has malicious behaviour that is typically hidden from users.
Robots.txt - a file on a website that instructs automated programs (bots/robots/crawlers) on how to behave with data on the website.
Web server - also known as “internet server”, is a system that hosts websites and delivers their content and services to end users over the internet.
Screenshot - an image of the device screen captured in a digital format.
Script – a list of commands executed by a program.
Subdomain - an extra identifier typically added before a domain name to indicate a subcategory of data or pages, e.g. google.com is a domain name, translate.google.com is a subdomain.
Third party - a person or entity that is not directly part of a contract but may have a function related to it nevertheless.
Tor Browser - a browser that keeps your online activities private by disguising your identity and protecting your web traffic from many forms of internet surveillance
Userbase - a list of users associated with a particular platform or system.
VPN - software that creates an encrypted ‘tunnel’ from your device to a server run by your VPN service provider, masking your actual IP address when you visit websites
Website - a set of pages or data made available remotely, usually to people with internet or network access.
Webpage - a document (page) that is accessible via the internet, displayed in a web browser.
Wildcard – in this technical context, it is a symbol such as “*” or “?” which is used in some computing commands or searches in order to represent any character or range of characters. (https://www.collinsdictionary.com/dictionary/english/wild-card)