What is fingerprinting?
Fingerprinting is a technique, outlined in research by the Electronic Frontier Foundation, for anonymously identifying a web browser with an accuracy of up to 94%.
The browser is queried for its user agent string, screen color depth, language, installed plugins with their supported MIME types, timezone offset, and other capabilities such as local storage and session storage. These values are then passed through a hash function to produce a fingerprint that gives only weak guarantees of uniqueness.
No cookies are stored to identify a browser.
It’s worth noting that mobile browsers are much more uniform than desktop ones, so on mobile, fingerprinting should be used only as a supplementary identification mechanism.
In this post I’m going to explain how it works in detail and give you real-life statistics accumulated over 4 months of production usage.
I was given an experimental task: implement fingerprinting for both anonymous and logged-in users of one of our web sites. We wanted to see whether it was possible at all to rely on identifying someone this way without leaving cookies. The idea was to accumulate fingerprints and their associated preferences, and then pre-filter the information on the front page based on what was known about a user.
So I got to work and started making a basic outline in my head. What is it that identifies a browser? I figured it would be: user agent, browser language, screen color depth, installed plugins and their MIME types, timezone offset, local storage, and session storage.
Initially I added the screen resolution as well, but a colleague pointed out that one laptop can be used with multiple monitors, for example with an external monitor connected when working in the office, so I removed it.
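A minimal sketch of how such values could be gathered is below. It is not the original script: the function name `collectCapabilities`, the helper `hasStorage`, the shape of the returned object, and the choice to sort the plugin list are all assumptions made for illustration. The environment is passed in as a parameter (in a browser that would be `window`) so the function can also be exercised with a mock object.

```javascript
// Sketch only: `collectCapabilities`, `hasStorage` and the result shape are hypothetical.
// In a browser you would call collectCapabilities(window).
function collectCapabilities(env) {
  const nav = env.navigator || {};
  const scr = env.screen || {};

  // Each plugin becomes "name::description::type~suffixes,..." so that both the
  // plugin and its supported MIME types contribute to the result.
  const plugins = Array.from(nav.plugins || [], function (plugin) {
    const mimeTypes = Array.from(plugin, function (mime) {
      return mime.type + "~" + mime.suffixes;
    });
    return [plugin.name, plugin.description, mimeTypes.join(",")].join("::");
  });

  return {
    userAgent: nav.userAgent || "",
    language: nav.language || "",
    colorDepth: scr.colorDepth || 0,
    timezoneOffset: new Date().getTimezoneOffset(),
    hasLocalStorage: hasStorage(env, "localStorage"),
    hasSessionStorage: hasStorage(env, "sessionStorage"),
    // Assumption: sort so that plugin enumeration order does not affect the result.
    plugins: plugins.sort()
  };
}

// Accessing storage can throw (for example when it is disabled), so probe defensively.
function hasStorage(env, name) {
  try {
    return Boolean(env[name]);
  } catch (e) {
    return false;
  }
}
```

Screen resolution is deliberately left out, for the multi-monitor reason given above.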
On my laptop browser the values are:
So now I knew everything my browser exposed, and I needed to produce the fingerprint itself. For that I wanted a fast, non-cryptographic hash function such as MurmurHash.
MurmurHash produces a 32-bit integer as a result and works really well. Compared with other popular hash functions, MurmurHash performs well on a random distribution of regular keys.
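For reference, here is a compact sketch of the 32-bit MurmurHash3 variant (x86_32). It is not necessarily the implementation used in this post: the function name `murmurhash3_32` is my own, and hashing the UTF-8 bytes of the string (via `TextEncoder`) is an assumption; other JavaScript ports treat characters differently and will give different values for non-ASCII input.

```javascript
// MurmurHash3 (x86, 32-bit variant) over the UTF-8 bytes of a string.
// Returns an unsigned 32-bit integer. The name and the UTF-8 choice are assumptions.
function murmurhash3_32(key, seed = 0) {
  const data = new TextEncoder().encode(key);
  const c1 = 0xcc9e2d51;
  const c2 = 0x1b873593;
  const blockCount = data.length >>> 2;
  let h = seed >>> 0;

  // Body: mix in the input four bytes (little-endian) at a time.
  for (let i = 0; i < blockCount; i++) {
    const o = i * 4;
    let k = data[o] | (data[o + 1] << 8) | (data[o + 2] << 16) | (data[o + 3] << 24);
    k = Math.imul(k, c1);
    k = (k << 15) | (k >>> 17);
    k = Math.imul(k, c2);
    h ^= k;
    h = (h << 13) | (h >>> 19);
    h = (Math.imul(h, 5) + 0xe6546b64) | 0;
  }

  // Tail: the remaining 1-3 bytes, if any.
  const tail = blockCount * 4;
  let k = 0;
  switch (data.length & 3) {
    case 3: k ^= data[tail + 2] << 16; // falls through
    case 2: k ^= data[tail + 1] << 8;  // falls through
    case 1:
      k ^= data[tail];
      k = Math.imul(k, c1);
      k = (k << 15) | (k >>> 17);
      k = Math.imul(k, c2);
      h ^= k;
  }

  // Finalization: fold in the length and force the bits to avalanche.
  h ^= data.length;
  h ^= h >>> 16;
  h = Math.imul(h, 0x85ebca6b);
  h ^= h >>> 13;
  h = Math.imul(h, 0xc2b2ae35);
  h ^= h >>> 16;
  return h >>> 0;
}
```

`Math.imul` keeps every multiplication within 32 bits, which is what lets the algorithm run at integer speed in JavaScript instead of drifting into floating point.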
I picked this implementation and added it to the code.
The last step was to combine all of the browser’s capabilities into one long string and pass it through the hash function.
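The combining step could look roughly like the sketch below. The function name `fingerprint`, the `key=value` layout, and the `###` separator are assumptions chosen for illustration; the hash is passed in as `hashFn` so that any 32-bit string hash (MurmurHash in this post) can be plugged in.

```javascript
// Hypothetical helper: flattens a capabilities object into one string and hashes it.
// `hashFn` stands in for whichever string hash is used.
function fingerprint(capabilities, hashFn) {
  // Sort the keys so the same capabilities always serialize to the same string.
  const parts = Object.keys(capabilities).sort().map(function (key) {
    const value = capabilities[key];
    return key + "=" + (Array.isArray(value) ? value.join(";") : String(value));
  });
  return hashFn(parts.join("###"));
}
```

Whatever layout is chosen, it has to be deterministic: if the same browser serializes its capabilities in a different order on the next visit, it will get a different fingerprint.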
The end result on my laptop was:
As a finishing touch, I wanted to get rid of jQuery, so I implemented the map methods and got a no-dependencies script.
How to improve accuracy?
The research mentioned above states that the identification accuracy is surprisingly high. To improve it even further, Flash or Java integration is required to get the list of installed fonts, which makes each browser’s fingerprint even more distinctive.
What about hash collisions?
My tests show that for random strings MurmurHash does indeed produce collisions, but their number is negligible for my purposes: 5-7 collisions per ~200K capability strings.
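As a sanity check, that figure can be compared with the birthday approximation for an ideal 32-bit hash: with n keys spread uniformly over 2^32 values, the expected number of colliding pairs is about n(n-1) / (2 * 2^32). The sketch below computes that estimate and also counts repeats in a list of hash values; the function names are hypothetical and this is not the test I actually ran.

```javascript
// Birthday approximation: expected number of colliding pairs when n keys are
// hashed uniformly into 2^bits possible values.
function expectedCollisions(n, bits) {
  return (n * (n - 1)) / (2 * Math.pow(2, bits));
}

// Counts how many values in `hashes` repeat a value seen earlier in the list.
function countCollisions(hashes) {
  const seen = new Set();
  let collisions = 0;
  for (const h of hashes) {
    if (seen.has(h)) collisions++;
    else seen.add(h);
  }
  return collisions;
}

console.log(expectedCollisions(200000, 32).toFixed(2)); // about 4.66
```

So 5-7 collisions per ~200K strings is roughly what any well-behaved 32-bit hash would give; the limit comes from the 32-bit output size rather than from a weakness specific to MurmurHash.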
What about mobile browsers?
It’s simple: browser fingerprinting does not work well with mobile browsers, unless all you want is to distinguish Android users from iPhone users.
After running fingerprinting in production for 4 months, I have some data to analyze. First of all, I’m not at liberty to disclose the exact number of visitors to the web site, but I can say it is several million a month, so we have some data to play with. All numbers below reflect our usage and may not match what you would see on your own site.
89% of fingerprints are unique.
20% of our users have more than one fingerprint, i.e. several browsers or devices.
Very few users have a staggering number of fingerprints, for example 20-25. I don’t know whether they have a lot of devices, use many different browsers, or something else.
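For logged-in users, figures of this kind can be derived from a list of (user, fingerprint) pairs. The sketch below is an illustration rather than the analysis we actually ran: the record shape `{ userId, fingerprint }` and the function name `summarize` are hypothetical, and it assumes that "unique" means a fingerprint that was seen for exactly one user.

```javascript
// Hypothetical analysis over records of the form { userId, fingerprint }.
// Assumes a non-empty list; an empty list would yield NaN shares.
function summarize(records) {
  const usersByFingerprint = new Map();
  const fingerprintsByUser = new Map();

  for (const { userId, fingerprint } of records) {
    if (!usersByFingerprint.has(fingerprint)) usersByFingerprint.set(fingerprint, new Set());
    usersByFingerprint.get(fingerprint).add(userId);
    if (!fingerprintsByUser.has(userId)) fingerprintsByUser.set(userId, new Set());
    fingerprintsByUser.get(userId).add(fingerprint);
  }

  // Fingerprints that map to exactly one user.
  let uniqueFingerprints = 0;
  for (const users of usersByFingerprint.values()) {
    if (users.size === 1) uniqueFingerprints++;
  }

  // Users seen with more than one fingerprint (several browsers or devices).
  let multiFingerprintUsers = 0;
  for (const fps of fingerprintsByUser.values()) {
    if (fps.size > 1) multiFingerprintUsers++;
  }

  return {
    uniqueFingerprintShare: uniqueFingerprints / usersByFingerprint.size,
    multiFingerprintUserShare: multiFingerprintUsers / fingerprintsByUser.size
  };
}
```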
After reviewing the results, we removed fingerprinting because of poor identification, especially on mobile devices. If your traffic comes mostly from desktops and you’re OK with 10-12% false identifications, you might want to try it.