|Abstract:||Blacklists are collections of origins (URLs, domains, or IP addresses) associated with malicious activities such as disseminating malware, facilitating command and control (C&C) communications, and delivering spam. They are a simple and convenient way to protect users from known malicious websites. Although blacklists are not designed for security research, research projects commonly use them as scan target lists for retrieving malicious web content. However, domain blacklist scans are noisy: a large portion of the scanned websites do not perform malicious activities at the time they are visited. Many blacklisted websites are offline or parked, even though some of them may have hosted malicious content in the past. Consequently, blacklists cannot be used out of the box for security research. Kuhrer et al. evaluated the effectiveness of major blacklists in 2014 and proposed a heuristic mechanism for collecting training data to build a machine learning classifier that identifies parked domains in blacklists. They found that up to 10.9% of entries are parked. In this work, we reproduced their approach and found that most of the heuristics and features they used have become stale after five years.
We modernized the prior approach to identify offline and parked domains in blacklists. First, we built and open-sourced an efficient blacklist filter, MGRAB, that filters domains at several layers: domains with no DNS resolution, closed TCP ports, or HTTP error response codes. Using MGRAB, we found that only 43% of all domains in our blacklist (aggregated from 27 domain blacklists) have a valid IPv4 address, and only 40% of the total domains can be successfully grabbed. Second, we implemented an updated mechanism to detect parked domains using new heuristic strings and trained a new random forest classifier. Using the updated mechanism, we found that around 4% of the successfully grabbed domains are parked. Overall, only 33% of the total domains contain meaningful content. Researchers can use MGRAB and the parked domain detection methodology to filter blacklisted website scans in future security research.
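
The layered filtering idea (DNS resolution, then TCP port, then HTTP response code) can be illustrated with a minimal Python sketch. This is not MGRAB's actual implementation or interface, which the abstract does not describe. The function names, the category labels, the choice of port 80, the timeouts, and the rule that any status of 400 or above counts as an error are all assumptions made for illustration.

```python
import http.client
import socket
import urllib.error
import urllib.request
from typing import Callable, Iterable, List, Optional


def resolve_ipv4(domain: str) -> Optional[str]:
    """Layer 1 (hypothetical helper): IPv4 address, or None if DNS fails."""
    try:
        return socket.gethostbyname(domain)
    except (OSError, UnicodeError):
        return None


def port_open(ip: str, port: int = 80, timeout: float = 3.0) -> bool:
    """Layer 2 (hypothetical helper): True if a TCP connection succeeds."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False


def http_status(domain: str, timeout: float = 5.0) -> Optional[int]:
    """Layer 3 (hypothetical helper): HTTP status code, or None on failure."""
    try:
        with urllib.request.urlopen(f"http://{domain}/", timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses still carry a status code we want to record.
        return err.code
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        return None


def classify(
    domain: str,
    resolve: Callable[[str], Optional[str]] = resolve_ipv4,
    connect: Callable[[str], bool] = port_open,
    fetch: Callable[[str], Optional[int]] = http_status,
) -> str:
    """Run the layers in order and stop at the first one that fails."""
    ip = resolve(domain)
    if ip is None:
        return "no_dns"
    if not connect(ip):
        return "port_closed"
    status = fetch(domain)
    # Assumption: treat a failed request or any status >= 400 as an error.
    if status is None or status >= 400:
        return "http_error"
    return "ok"


def filter_blacklist(domains: Iterable[str], **checks) -> List[str]:
    """Keep only the domains that pass every layer."""
    return [d for d in domains if classify(d, **checks) == "ok"]
```

Calling `filter_blacklist(domains)` with the default checks performs live network lookups. The three checks are passed in as parameters so that each layer can be replaced with a stub or a faster batch implementation. Ordering the layers from cheapest to most expensive means a domain that fails DNS resolution never triggers a TCP or HTTP request.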