How much does the Internet really know about us?
We’ve seen plenty of sci-fi movies exploring target tracking, privacy hacking and information breaches. However, our computers in the real world may be tracking way more of our information than we know.
Why Track Users?
Well, the foremost reason websites track users is advertising. Have you ever done a Google search for something and later, while browsing a completely unrelated website, seen advertisements there for the very thing you searched for? That is tracking in action.
Advertising companies make use of web tracking to detect which websites you have visited in order to deliver a personalized advertising experience. As you’re browsing the web, you are more likely to click on an advertisement that aligns with your interests than an ad about something you would never buy.
Apart from its use in ad networks, some websites also track their users to collect data about them. This can be for a variety of reasons, from security purposes to more nefarious schemes such as selling that data for profit.
So, how much of our information is really out there in the network?
Every device that is currently using the Internet has an IP address. Your IP address is the most basic way of identifying you online. It does not contain your location as such, but it can be looked up in geolocation databases to estimate where you are, often down to the city.
Tracking an IP address is not the greatest way to keep track of a user over time, because IP addresses can be changed or masked and are usually shared with the other devices connected to the same network. As such, if you connect to public Wi-Fi at a café or use the Internet at your workplace, you probably have the same public IP address as everyone else who is using that network.
However, almost every website you visit logs your IP address. It can be used together with other collected information to narrow down your geographical location.
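To make the point about shared addresses concrete, here is a minimal sketch of what a site could see in its logs. The log entries, the addresses (taken from ranges reserved for documentation) and the function name are all made up for illustration.

```python
from collections import defaultdict

# Made-up log entries: (IP address, short browser description).
requests = [
    ("203.0.113.7", "Firefox on Windows"),
    ("203.0.113.7", "Safari on iPhone"),
    ("203.0.113.7", "Firefox on Windows"),
    ("198.51.100.20", "Chrome on Android"),
]

def browsers_per_ip(log):
    """Count how many distinct browsers were seen behind each IP address."""
    seen = defaultdict(set)
    for ip, browser in log:
        seen[ip].add(browser)
    return {ip: len(browsers) for ip, browsers in seen.items()}

print(browsers_per_ip(requests))
```

In this invented log, one address shows up with two different browsers, which is why an IP address alone is a weak identifier but becomes more telling once it is paired with other details.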
An HTTP referrer header (spelled “Referer” in the HTTP specification) is sent when you click a link in your browser window, telling the new webpage where you came from. For example, if you searched for something in Google and clicked one of the websites in the results, that website could see the Google search page you came from, which historically included the exact phrase you searched for. Search engines and browsers now often trim this information, but where it is available, it is one way for websites to track which keywords are bringing in the most visitors, so that they can publish more content about similar topics. Combined with your IP address, it lets websites find patterns linking visitors’ countries with the search terms most of them used to reach the site.
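As a rough sketch of how a website could pull a search phrase out of a referrer, the snippet below parses a hypothetical header value with Python's standard library. The URL and the function name are invented for illustration, and the assumption that the phrase sits in a "q" parameter will not hold for every search engine. Many modern browsers and search engines also shorten the referrer to just the site's address, so this may return nothing in practice.

```python
from typing import Optional
from urllib.parse import parse_qs, urlsplit

def search_terms_from_referrer(referrer: str) -> Optional[str]:
    """Return the search phrase carried in a referrer URL, if there is one."""
    query = urlsplit(referrer).query
    # Assumption: the search engine stores the phrase in a "q" parameter.
    values = parse_qs(query).get("q")
    return values[0] if values else None

# Hypothetical Referer value, as a web server might have logged it.
referrer = "https://www.google.com/search?q=cheap+running+shoes&hl=en"

print(urlsplit(referrer).netloc)             # www.google.com
print(search_terms_from_referrer(referrer))  # cheap running shoes
```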
Even if you don’t click links on a webpage, the HTTP referrer can still be sent. If the webpage you are currently viewing contains advertisements or some form of media – images, videos or widgets – from a third party, your browser will tell that third party what webpage you are currently viewing.
If you’ve ever noticed that your email client warns you about loading images inside an email, this kind of tracking is the reason. Images in an email are often not stored in the message itself but fetched from the sender’s server when you open it. If you load them, your email client requests each image, frequently from a web address unique to you, and that request tells whoever sent the email that you have read it.
Some websites make use of “web bugs”, which are tiny transparent images, typically just one pixel by one pixel in size. These images are invisible when embedded inside a webpage or an email, but loading them still sends a request to the tracker, complete with headers such as the HTTP referrer, as a means of tracking viewers. Remember that just because you don’t see any visible media doesn’t mean there is none. You can still be invisibly tracked.
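The idea behind a web bug can be sketched in a few lines. This is only an illustration under stated assumptions: the tracker address "tracker.example", the "id" parameter and both function names are hypothetical, and a real tracker would run this logic on a web server instead of in a single script.

```python
import secrets
from typing import Dict, Optional
from urllib.parse import parse_qs, urlsplit

# Hypothetical tracking endpoint; not a real service.
PIXEL_ENDPOINT = "https://tracker.example/pixel.gif"

def make_pixel_tag(recipient: str, sent: Dict[str, str]) -> str:
    """Build a 1x1 image tag whose address carries a token unique to one recipient."""
    token = secrets.token_urlsafe(8)
    sent[token] = recipient  # remember which token went to whom
    return f'<img src="{PIXEL_ENDPOINT}?id={token}" width="1" height="1" alt="">'

def who_opened(requested_url: str, sent: Dict[str, str]) -> Optional[str]:
    """Map an incoming image request back to the recipient it was generated for."""
    ids = parse_qs(urlsplit(requested_url).query).get("id")
    return sent.get(ids[0]) if ids else None

sent_tokens: Dict[str, str] = {}
tag = make_pixel_tag("alice@example.com", sent_tokens)
print(tag)

# When the image is loaded, the tracker sees a request for the address in the tag.
requested = tag.split('"')[1]
print(who_opened(requested, sent_tokens))  # alice@example.com
```

Because every recipient gets a different image address, a single request for the image is enough to tell the sender who opened the message.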
User agents don’t contain any personally identifiable information, but they are unique enough to identify a user when coupled with their IP address and a few other browser settings. A user agent string normally looks something like this:
User-Agent: Mozilla/<version> (<system-information>) <platform> (<platform-details>) <extensions>
A user agent is sent to a website when you load the page. In essence, it tells the website your browser type, browser version, operating system and more.
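To show how little effort it takes to read these details, here is a minimal sketch that splits a user agent string into its parts. The sample string is an illustrative value in the style a Firefox browser on Windows might send, and the pattern and function name are assumptions made for this example. Real-world user agents vary a lot, so production code usually relies on a dedicated parsing library.

```python
import re

# Product token, then system information in parentheses, then everything else.
UA_PATTERN = re.compile(r"^(?P<product>\S+) \((?P<system>[^)]*)\)\s*(?P<rest>.*)$")

def parse_user_agent(ua: str) -> dict:
    """Split a user agent string into product, system details and remaining tokens."""
    match = UA_PATTERN.match(ua)
    if match is None:
        return {"raw": ua}  # does not follow the expected layout
    return {
        "product": match.group("product"),
        "system": [part.strip() for part in match.group("system").split(";")],
        "rest": match.group("rest").split(),
    }

# Illustrative value only.
sample = "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/115.0"
info = parse_user_agent(sample)
print(info["system"])  # ['Windows NT 10.0', 'Win64', 'x64', 'rv:109.0']
print(info["rest"])    # ['Gecko/20100101', 'Firefox/115.0']
```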
There are millions of distinct user agent strings out there. Popular browser versions share theirs with plenty of other people, but the chance of someone else having the exact same user agent together with the same IP address and browser settings is pretty small. As such, if a website stores cookies, logs your IP address and tracks your browser extensions, it can easily use your user agent to check whether you are the same user who visited earlier.
The user agent is just one piece of information used in browser fingerprinting, which detects additional data such as your installed plugins and their versions, your screen resolution, your installed fonts, your time zone, your browser’s language, whether you allow cookies and other details. Each piece of data may not be unique in itself, but once all this data is put together, it forms quite a clear profile of your browser, which websites can then use to tell whether you are that “same visitor”.
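One simple way to turn such a profile into a single identifier is to hash all the collected values together. The sketch below assumes the attributes have already been gathered. The attribute names and values are invented for illustration, and real fingerprinting scripts collect far more signals, usually in the browser itself.

```python
import hashlib
import json

def fingerprint(attributes: dict) -> str:
    """Hash a set of browser attributes into one stable identifier."""
    # Sorting the keys makes the result independent of collection order.
    canonical = json.dumps(attributes, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Made-up attribute values for two visits by the same browser.
first_visit = {
    "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/115.0",
    "screen": "1920x1080",
    "timezone": "Europe/Berlin",
    "language": "en-GB",
    "fonts": ["Arial", "Calibri", "Verdana"],
}
second_visit = dict(reversed(list(first_visit.items())))  # same data, different order

# The same browser with one setting changed.
other_browser = {**first_visit, "timezone": "Asia/Singapore"}

print(fingerprint(first_visit) == fingerprint(second_visit))   # True
print(fingerprint(first_visit) == fingerprint(other_browser))  # False
```

No cookie is involved here: as long as the browser reports the same details, the website can recompute the same identifier on every visit.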
There are browser fingerprinting tests out there that show you just how unique your browser information is. One such tool is Panopticlick by the Electronic Frontier Foundation, which has since been renamed Cover Your Tracks.
AOL Search Data Leaked
Data breaches are very common, and many large ones have occurred this century. In fact, Wikipedia maintains a list of data breaches. Let’s look at one of the most infamous search data leaks, from the American web portal AOL.
In August 2006, AOL Research released, in a move the company later called a mistake, a compressed text file containing three months’ worth of search data: about 20 million search queries from some 657,000 different users. AOL removed the file from its website soon after, but not before the dataset had been mirrored and distributed all over the Internet.
To protect users’ privacy, their usernames had been replaced by numeric IDs. However, although the search data was nominally anonymous, the things people search for tend to be related to themselves, their family or their friends. As such, by closely analyzing the search terms, people began to piece together the identities of some of those AOL users. In particular, The New York Times identified user #4417749 as Thelma Arnold, a 62-year-old widow from Lilburn, Georgia, and published her name with her permission.
While the searches of a 62-year-old woman were fairly harmless, other users’ search histories were considerably darker. There were numerous searches for adult and explicit material, as well as searches related to terrorism and violence. The unidentified user #927 was discovered to have a rather disturbing search history regarding child pornography and zoophilia.
Of course, the data leak raised numerous questions about whether it was even ethical to use this information for research, as AOL had intended. Public outcry over this privacy leak was huge and led to the resignation of AOL’s CTO, Maureen Govern. Insider information also alleged that two AOL employees had been fired due to this incident: the researcher who released the search data and their immediate supervisor, who reported to Govern.
So, we know that websites and companies can collect this information about us, which can be dangerous in the event that they are hacked or accidentally release data, as in the case of AOL. How then can we minimize the amount of information websites gather about us? Find out more in the next article.