Using DuckDuckGo’s Tracker Radar domains as a piHole AdList


TL;DR
If you want to consume the DuckDuckGo Tracker Radar domains into your piHole from an online source, I have a hosted list right here. The list uses a fingerprinting threshold of 2 (it includes every domain with a fingerprinting score of 2 or above) and is updated daily. If you want to tweak the script and run it locally, read on.

Introduction

If you, like me, run piHole as your DNS provider, you are probably always on the lookout for good ‘AdLists’. Unfortunately, this post is not about blocking ads. What got me to write the script and this post was the need to strike a balance between DNS blocking and usability. Over time I have used several blocklists from various sources, and while they are good at their respective jobs of blocking trackers or ads, what I really wanted was to protect the users of my network from being fingerprinted. Everything else was fair game in the name of usability.

That led me to the DuckDuckGo Tracker Radar project. You probably know DuckDuckGo as the search engine that cares about your privacy. Their Tracker Radar project crawls and analyzes domains known for injecting tracking code into websites, gathers metadata about them, categorizes them into behaviour buckets and assigns each a prevalence score. Because the pipeline is automated, the data is fresher and more accurate than the hand-curated block lists usually used with piHole.

DuckDuckGo Tracker Radar for piHole

Now that I had the data source, I wanted a simple way to consume the data into piHole as an AdList. nGrande had a script that could parse the JSON files produced by the DuckDuckGo web crawler and produce a piHole-compatible AdList. That script acted as my baseline. On top of it, I wanted to add the related sub-domains that could be used for tracking, and I wanted to make a shell script that I could set up a cron job with. My repository, based on nGrande’s code, does both.

How does the script work?

The schema of a domain in DuckDuckGo’s web crawler output is:

{    "domain": "adspeed.net",    "owner": {        "name": "ADSPEED.COM",        "displayName": "ADSPEED.COM"    },    "source": [        "DuckDuckGo-CA"    ],    "prevalence": 0.000402,    "sites": 4,    "subdomains": [        "g"    ],    "cnames": [],    "fingerprinting": 2,    "resources": [        {            "rule": "adspeed\\.net\\/ad\\.php",            "cookies": 0.000101,            "fingerprinting": 2,            "subdomains": [                "g"            ],            "apis": {                "Date.prototype.getTimezoneOffset": 1,                "Navigator.prototype.cookieEnabled": 1,                "Navigator.prototype.javaEnabled": 1,                "Screen.prototype.width": 1,                "Screen.prototype.height": 1,                "Screen.prototype.colorDepth": 1            },            "sites": 4,            "prevalence": 0.000402,            "cnames": [],            "responseHashes": [                "760a78e19be43414089886f8a258cccad72de784bb3ea9707842c96773e8c680",                "78c69832cf4d323e035a7a54634c13886b72253da7d8f28486a790a4f374d5c0",                "286d0fc3aaf3ed819f89084b70fe2f6c853fcbdf0654e2a7f25571a04c150ebe"            ],            "type": "Script",            "firstPartyCookies": {},            "firstPartyCookiesSent": {},            "exampleSites": [                "thecse.com",                "www.sciencenews.org",                "www.thecse.com",                "www.thealternativedaily.com"            ]        },        {            "rule": "adspeed\\.net\\/ad\\.php",            "cookies": 0.000402,            "fingerprinting": 0,            "subdomains": [                "g"            ],            "apis": {},            "sites": 4,            "prevalence": 0.000402,            "cnames": [],            "responseHashes": [                "5704a2e9f2f7ce43a79f9b407f1aedcfd50223cbe8bd2f71ff8c5c819e469cbc"            ],            "type": "Image",            "firstPartyCookies": {},            "firstPartyCookiesSent": {},            "exampleSites": [                "thecse.com",                "www.sciencenews.org",                "www.thecse.com",                "www.thealternativedaily.com"            ]        }    ],    "categories": [],    "performance": {        "time": 0,        "size": 0,        "cpu": 0,        "cache": 3    },    "cookies": 0.000402,    "types": {        "Script": 4,        "Image": 4    },    "nameservers": [        "ns2.dnstag.net",        "ns0.dnstag.net",        "ns1.dnstag.net",        "ns3.dnstag.net",        "ns4.dnstag.net"    ]}

There’s a ton of data for each domain, but the fields we care about are:

  1. domain – the domain name the tracker is served from.
  2. subdomains – all the subdomains the tracker could be served from.
  3. fingerprinting – the likelihood that this domain is fingerprinting users: 0 – no use of browser APIs; 1 – some use of browser APIs, but not obviously for tracking purposes; 2 – use of many browser APIs, possibly for tracking purposes; 3 – excessive use of browser APIs, certainly for tracking purposes.

The idea for the script is fairly simple:

  1. Iterate through all domains in all regions.
  2. For each domain, if the fingerprinting score is greater than or equal to the given threshold, add that domain and all its related sub-domains to the AdList (see the sketch after this list).
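
For the example record above, a threshold of 2 would add adspeed.net and the subdomain g.adspeed.net to the list. A minimal Python sketch of that loop could look like this (the directory layout, function name and output format are illustrative, not the exact code in my repository):

import json
from pathlib import Path

def build_adlist(domains_dir, threshold=2):
    """Collect domains (and their subdomains) whose fingerprinting score meets the threshold."""
    entries = set()
    # Tracker Radar ships one JSON file per domain, grouped into per-region folders.
    for json_file in Path(domains_dir).glob("**/*.json"):
        data = json.loads(json_file.read_text())
        if data.get("fingerprinting", 0) >= threshold:
            domain = data["domain"]
            entries.add(domain)
            for sub in data.get("subdomains", []):
                entries.add(f"{sub}.{domain}")  # e.g. "g" + "adspeed.net" -> "g.adspeed.net"
    return entries

if __name__ == "__main__":
    # Assumes the Tracker Radar repository has been cloned next to this script.
    for host in sorted(build_adlist("tracker-radar/domains")):
        print(host)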

To run it as a Python script, you will need Python 3.6 or above, and you will need to execute it with the required parameters. You will also need to clone the DuckDuckGo Tracker Radar project first and pass the absolute path to the domains data to the script.
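
An invocation could look something like this (the script and flag names here are placeholders; check the repository’s README for the exact parameters):

# Clone the Tracker Radar data; the script needs an absolute path to the domains folder.
git clone https://github.com/duckduckgo/tracker-radar.git
python3 generate.py --domains-dir "$(pwd)/tracker-radar/domains" --threshold 2 > adlist.txt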

Running it as a shell script, either manually or through cron, is very similar, except that the shell script can clone the DuckDuckGo Tracker Radar project automatically; you just need to provide the path to the generate script, among other parameters.
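
A daily cron job could then look like this (the install path and script name are placeholders for wherever you cloned my repository):

# Run daily at 4 AM; the shell script clones/updates Tracker Radar and regenerates the AdList.
0 4 * * * /home/pi/tracker-radar-pihole/update_adlist.sh > /var/log/update_adlist.log 2>&1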

Conclusion and future work

This is my first attempt at using DuckDuckGo’s web crawler data to protect my home network. In future iterations, I plan to investigate whether the prevalence score could be used to fine-tune the script. I would also like to gather metrics on usability and capture rate. If you have other ideas, please leave them in the comments section below. Till then, happy hacking!

Cover image attributed to jcomp.

