Sounds like a nightmare? Actually, if you’re able to detect this early on, that’s good news, because it’s usually difficult to be aware of this kind of situation.
Imagine a malicious remote party impersonating a company by registering domain names to confuse you and make you believe you’re in the right place. Then they start sending specially crafted messages (phishing) to your coworkers, to deceive them and lure them into the attackers’ trap.
This scenario actually happened to a few companies over the last few weeks, and has been happening for quite some time. It often relies on techniques such as registering similar-looking domain names and IDN domain name impersonation.
But how can you stay aware? How do you realize you’re being tricked, or at least get promptly alerted to any strange activity regarding domain or subdomain name registrations?
Let’s dive into what this post is all about:
- We’re going to find out how many new domain names matching %zoom% were registered this year (2020), using the SecurityTrails domain feeds API endpoint
- Then we’ll do the same thing for subdomain feeds and visualize the obtained data
- We’ll check how “normal” a day can be by looking at the number of domain and subdomain registrations, using machine learning techniques to help us determine a baseline and detect anomalies
- We’ll also submit our findings for analysis, to see if we find a malicious website
- Finally, we’ll briefly review a few assessment tools that can help us defend ourselves
Finding new Zoom-based domains using domain feeds
While we covered this topic in our article about newly registered domain feeds, we’d like to share a historic overview on the popularity of certain words, and check for any temporary spikes of interest that could indicate suspicious situations.
The following script will use dateseq (part of dateutils) to list a range of dates from which we’ll get our data. For this example, we’ll choose to get all activity from January 1, 2020 up to today.
#!/bin/sh
SEARCH_PATTERN="zoom"
START_DATE="2020-01-01"          # Date format YYYY-MM-DD
END_DATE=$(date +"%Y-%m-%d")     # Date format YYYY-MM-DD
API_KEY="YOUR_API_KEY"
FOLDER="domain-feeds-new"

# Create the output folder if it does not exist yet
mkdir -p "$FOLDER"

for DATE in $(dateseq "$START_DATE" "$END_DATE")
do
    URL="https://api.securitytrails.com/v1/feeds/domains/new?apikey=$API_KEY&date=$DATE&ns=false"
    echo "[+] Checking date $DATE at -> $URL"
    # Download the gzipped feed, keep only matching names, and store them sorted
    curl --silent --output - "$URL" | gzip -d | grep -i "$SEARCH_PATTERN" | sort > "$FOLDER/$DATE.txt"
done
Once executed, you should see the working script output like this:
When finished downloading, all files will be placed individually by date within the folders. Let’s see what’s inside:
Great! The results correspond to all newly registered domain names since January 1, 2020 that contain the word “zoom” anywhere in the name.
Once this is ready, we need to create a comma-separated values (CSV) file with the date in one column and the number of domains registered that day in the other. The columns are: date, registered_domains (we’ll call the latter “hits” from now on, for simplicity).
The resulting CSV file will look like this:
$ head -n 25 domains-by-date-stats.csv
date,hits
2020-01-01,23
2020-01-02,22
2020-01-03,34
2020-01-04,31
2020-01-05,13
2020-01-06,16
2020-01-07,26
2020-01-08,44
2020-01-09,31
2020-01-10,26
2020-01-11,32
2020-01-12,310
2020-01-13,20
2020-01-14,27
2020-01-15,32
2020-01-16,35
2020-01-17,14
2020-01-18,26
2020-01-19,23
2020-01-20,14
2020-01-21,22
2020-01-22,28
2020-01-23,48
2020-01-24,30
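The post does not show how this CSV was produced. One possible way, sketched here in Python, is to count the non-empty lines in each per-date file written by the download script. The function names are hypothetical, and the sketch assumes the files are named YYYY-MM-DD.txt as in the script above.

```python
import csv
from pathlib import Path


def count_hits_per_date(folder):
    """Return (date, hits) pairs, one per YYYY-MM-DD.txt file in `folder`."""
    rows = []
    for path in sorted(Path(folder).glob("*.txt")):
        with path.open() as handle:
            # Each non-empty line is one matching domain name.
            hits = sum(1 for line in handle if line.strip())
        # The file name without its extension is the date.
        rows.append((path.stem, hits))
    return rows


def write_stats_csv(rows, csv_path):
    """Write the rows under a `date,hits` header."""
    with open(csv_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["date", "hits"])
        writer.writerows(rows)
```

With the folder from the script, this could be called as `write_stats_csv(count_hits_per_date("domain-feeds-new"), "domains-by-date-stats.csv")`.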
Let’s visualize this by parsing this CSV file, and placing the “date” in the X axis and the “hits” in the Y axis. See the resulting chart below:
Nice. We can see some obviously odd behaviours in the form of big spikes: particular dates have a suspicious number of registrations containing “zoom”. But are those dates the only ones? Are there other, not-so-apparent anomalies?
Let’s do the same checks with subdomains.
Analyzing newly discovered subdomain names
Subdomains can also show us any unexpected behavior regarding our company’s name. Here, we’ll ask the API for new records since March 1, 2020.
#!/bin/sh
SEARCH_PATTERN="zoom"
START_DATE="2020-03-01"          # Date format YYYY-MM-DD
END_DATE=$(date +"%Y-%m-%d")     # Date format YYYY-MM-DD
API_KEY="YOUR_API_KEY"
FOLDER="subdomain-feeds-new"

# Create the output folder if it does not exist yet
mkdir -p "$FOLDER"

for DATE in $(dateseq "$START_DATE" "$END_DATE")
do
    URL="https://api.securitytrails.com/v1/feeds/subdomains/new?apikey=$API_KEY&date=$DATE&ns=false"
    echo "[+] Checking date $DATE -> $URL"
    # Download the gzipped feed, keep only matching names, and store them sorted
    curl --silent --output - "$URL" | gzip -d | grep "$SEARCH_PATTERN" | sort > "$FOLDER/$DATE.txt"
done
The script output will look like this:
The file output will look like this:
Now we need to create a CSV file like the one we built for domain names, so we can process the information and check the number of new subdomains per domain on each day.
In this case, because we know that Zoom has a subdomain registry service using their own domains (zoom.us, zoom.com and zoom.com.cn), we’ll remove them from the listing, so they don’t affect our measurements.
This will give us a CSV file with an output like this:
You’ll see three columns: date, domain and hits. This means that on the date shown above (2020-03-07), the domain name “agenciazoome.com.br” had four subdomains registered, which we can verify by searching for this domain name in the original text file for that date:
$ grep agenciazoome.com.br subdomain-feeds-new/2020-03-07.txt
agenciazoome.com.br,clubedasideias.com.agenciazoome.com.br
agenciazoome.com.br,clubedasideias.com.br.agenciazoome.com.br
agenciazoome.com.br,www.clubedasideias.com.agenciazoome.com.br
agenciazoome.com.br,www.clubedasideias.com.br.agenciazoome.com.br
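The aggregation step itself is not shown in the post. A minimal sketch, assuming every feed line has the form `<apex domain>,<subdomain>` as in the grep output above, could count subdomains per apex domain and skip Zoom’s own domains. The function names and the `EXCLUDED` constant are hypothetical.

```python
import csv
from collections import Counter
from pathlib import Path

# Zoom's own domains, excluded as described in the text.
EXCLUDED = {"zoom.us", "zoom.com", "zoom.com.cn"}


def count_subdomains_per_domain(folder, excluded=EXCLUDED):
    """Return (date, domain, hits) rows from YYYY-MM-DD.txt feed files."""
    rows = []
    for path in sorted(Path(folder).glob("*.txt")):
        counts = Counter()
        with path.open() as handle:
            for line in handle:
                # Assumed line format: "<apex domain>,<subdomain>"
                apex, sep, _subdomain = line.strip().partition(",")
                if sep and apex not in excluded:
                    counts[apex] += 1
        rows.extend(
            (path.stem, domain, hits) for domain, hits in sorted(counts.items())
        )
    return rows


def write_subdomain_csv(rows, csv_path):
    """Write the rows under a `date,domain,hits` header."""
    with open(csv_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["date", "domain", "hits"])
        writer.writerows(rows)
```

Lines without a comma are ignored, so stray blank lines in a feed file do not affect the counts.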
Detecting anomalies using machine learning
As we saw earlier, we now have two CSV files: one with the number of hits per date, and one with the number of hits per date and domain name, which lets us trace an anomaly back to a specific domain. Let’s visualize both datasets and the related Python code in a Jupyter notebook:
- First we import needed libraries and algorithms
from sklearn.ensemble import IsolationForest
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
- Then we read the CSV file corresponding to the domain names data
Here we’ll see the date of the dataset and the amount of domains registered that day.
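The notebook cell for this step is not reproduced in the text. A plausible version, assuming the file is read with pandas into the `dn` DataFrame used by the later snippets, might look like the sketch below. The `load_hits` helper is hypothetical, and the inline sample only mirrors the shape of domains-by-date-stats.csv.

```python
import io

import pandas as pd


def load_hits(source):
    """Load a date,hits CSV into a DataFrame with a parsed `date` column."""
    return pd.read_csv(source, parse_dates=["date"])


# Small inline sample in the same shape as domains-by-date-stats.csv.
sample = io.StringIO("date,hits\n2020-01-11,32\n2020-01-12,310\n2020-01-13,20\n")
dn = load_hits(sample)
print(dn.head())
```

For the real data, the call would be `dn = load_hits("domains-by-date-stats.csv")`.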
Next, we need to initialize our Isolation Forest algorithm so it can analyze the provided dataset and automatically learn our acceptable baseline.
cln = IsolationForest(n_estimators=100, max_samples='auto', contamination=0.1,
                      max_features=1.0, bootstrap=False, n_jobs=-1,
                      random_state=42, verbose=0)
cln.fit(dn[['hits']])
To explain a little further, “Isolation Forest” is an unsupervised learning algorithm designed specifically for anomaly detection. It isolates observations using random splits, and outliers tend to be separated in fewer splits than normal points. For our purposes, anomalies are the daily registration counts (hits) that fall outside normal behaviour.
We chose this method so the algorithm can train itself and produce a model that filters out normal registration profiles, leaving us to work with the odd cases.
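To build some intuition for why outliers are easy to isolate, here is a toy, one-dimensional illustration in plain Python. It is not scikit-learn’s implementation (which builds many trees on subsamples); it only counts how many random splits are needed before a value is left alone, using the first 24 daily hit counts from the CSV sample shown earlier. The function names are hypothetical.

```python
import random

# Daily hit counts for 2020-01-01 .. 2020-01-24, from the CSV sample above.
hits = [23, 22, 34, 31, 13, 16, 26, 44, 31, 26, 32, 310,
        20, 27, 32, 35, 14, 26, 23, 14, 22, 28, 48, 30]


def isolation_depth(values, target, rng):
    """Count random splits until `target` is the only distinct value left."""
    subset = list(values)
    depth = 0
    while len(set(subset)) > 1:
        split = rng.uniform(min(subset), max(subset))
        # Keep only the side of the split that still contains the target.
        if target < split:
            subset = [v for v in subset if v < split]
        else:
            subset = [v for v in subset if v >= split]
        depth += 1
    return depth


def average_depth(values, target, trials=500, seed=42):
    """Average the isolation depth of `target` over many random trials."""
    rng = random.Random(seed)
    total = sum(isolation_depth(values, target, rng) for _ in range(trials))
    return total / trials


print("spike (310):", average_depth(hits, 310))
print("typical (26):", average_depth(hits, 26))
```

The spike needs fewer splits on average than a typical day, which is the property Isolation Forest turns into an anomaly score.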
- Once the model is trained by calling fit() on the hits data, we can ask it to make predictions on any dataset we choose. Here, we’ll ask it to spot anomalies in the same hits dataset.
prediction_domain = cln.predict(dn[['hits']])
dn['anomaly'] = prediction_domain
- We also added a column called “anomaly” to the DataFrame to store the algorithm’s verdict on whether each day’s hit count is an outlier. Normal behaviour is labeled “1” and an anomaly is labeled “-1”.
outliers_domains = dn.loc[dn['anomaly'] == -1]
outlier_domains_index = list(outliers_domains.index)
- From this check we can tell that this configuration spotted 10 instances where the number of registrations falls outside the normal daily profile.
- To improve visualization of the results, below is the corresponding chart showing the normal behaviour in yellow and the anomalies in black.
- In the case of subdomains, the code is almost the same, but in the case of an anomaly we can see the date side-by-side with the associated domain name.
- In the subdomains case, visualization will show spikes that also involve an odd amount of registrations.
Once we have the results, we can easily go to the retrieved feeds data and dig deeper into the analysis:
In the domains case, we can see that the file for 2020-02-29 contains 1115 domain registrations matching %zoom%. This is highly interesting and calls for further analysis, which we’ll show in the next section.
$ wc -l 2020-02-29.txt
1115 2020-02-29.txt
In the subdomains case, we can begin filtering for strange cases like the following:
$ grep cas.ms zoom-feeds/2020-04-15.txt
1 cas.ms,zoomappdownload.com.admin-eu.cas.ms
1 cas.ms,zoomappdownload.com.admin-eu2.cas.ms
1 cas.ms,zoomappdownload.com.admin-us.cas.ms
1 cas.ms,zoomappdownload.com.admin-us2.cas.ms
1 cas.ms,zoomappdownload.com.admin-us3.cas.ms
1 cas.ms,zoomappdownload.com.eu.cas.ms
1 cas.ms,zoomappdownload.com.eu2.cas.ms
1 cas.ms,zoomappdownload.com.us.cas.ms
To go one step further (and as an exercise for the reader), you can train the algorithm to model the normal baseline activity per domain name, rather than only the overall number of subdomains. This gives a better picture of what anomalous behaviour means for each domain, as a value may be normal for some domains but abnormal for others.
Analyzing the results
As a check, we’re going to conduct website verification using urlscan.io. Their bulk submit URL capability allows us to upload one of the domain name date files and take a look at the findings:
Once logged in as a user of urlscan.io, simply go to your profile name, hit the “Bulk Submit” button and place your domain list.
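The post uses the web form, but for longer lists it may be more convenient to script the submissions. The sketch below only builds the HTTP requests and does not send them. The endpoint, the `API-Key` header and the `visibility` field are written as I understand urlscan.io’s public submission API, so treat them as assumptions and check the current API documentation (including rate limits) before use. The function names are hypothetical.

```python
import json
import urllib.request

# Assumed submission endpoint; verify against urlscan.io's API documentation.
URLSCAN_SUBMIT_URL = "https://urlscan.io/api/v1/scan/"


def build_scan_request(domain, api_key, visibility="public"):
    """Build (but do not send) a submission request for one domain."""
    payload = {"url": f"http://{domain}/", "visibility": visibility}
    return urllib.request.Request(
        URLSCAN_SUBMIT_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


def build_requests_from_lines(lines, api_key):
    """Build one request per non-empty line of a domain list."""
    return [
        build_scan_request(line.strip(), api_key)
        for line in lines
        if line.strip()
    ]
```

Each request could then be sent with `urllib.request.urlopen(request)`, ideally with a pause between calls.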
After the scan, we verified that one domain on the list was deemed malicious. This was probably due to an attack, which unfortunately compromised this website with phishing material.
Despite that, the script actually found too many subdomains, relying only on the statistics retrieved from the SecurityTrails API. So should this undermine our efforts? Not at all!
One oft-forgotten technique of checking your own infrastructure is to actually test it against different attack vectors. This lets you discover new ways of gaining information about your company.
Another method of testing is to think like a perpetrator, and test odd, even obscure ways to take advantage of unwitting victims.
To accomplish this, we’ll check out two different security tools that show us what can be done to trick users into thinking they’re actually accessing our domains.
EvilURL allows you to swap characters (mostly vowels) for visually similar characters from other scripts. The resulting names can be legitimately registered as IDNs (Internationalized Domain Names), but here they serve criminal purposes instead.
Once you’ve downloaded the script from the following link, execute it and enter the desired domain name and domain level. EvilURL will then suggest different ways to visually reproduce the exact domain name using imperceptibly different Unicode characters.
Downloading and executing EvilURL is really easy:
git clone https://github.com/UndeadSec/EvilURL.git && \
cd EvilURL && \
python3 evilurl.py
We enter zoom as the name and .com as the domain level to check all possible combinations. The output should look like this (note that although Zoom has several working domain names, we’re focusing on .com):
In this result for our case study, the script suggests replacing the “o” characters with the lowercase “о” corresponding to the Cyrillic script. If while reading this you found them to be visually the same, you now get an idea why it’s used in these attacks.
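A quick way to see the difference that the eye cannot is to ask Python for the Unicode name of each character. This is only an illustrative check using the standard library, not part of EvilURL. The helper names are hypothetical.

```python
import unicodedata


def describe_characters(label):
    """Return (character, Unicode name) pairs for every character in `label`."""
    return [(ch, unicodedata.name(ch, "UNKNOWN")) for ch in label]


def non_ascii_characters(label):
    """Return the characters in `label` that fall outside plain ASCII."""
    return [ch for ch in label if ord(ch) > 127]


latin = "zoom"
lookalike = "z\u043e\u043em"  # both "o" replaced with CYRILLIC SMALL LETTER O

print(latin == lookalike)  # False
for ch, name in describe_characters(lookalike):
    print(repr(ch), name)
```

The two strings render almost identically, yet they compare as different and the lookalike contains two non-ASCII characters.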
You can test this theory and try to purchase this domain name on your favourite registrar:
This tool allows us to look more closely at the different choices that can be used to examine similar domain names, both those that are already registered and those that are not. You can browse the documentation in the following link and review the different outputs it can produce.
Below, you can see different yet similar combinations that could trick users into believing they’re entering the correct website.
It combines different attack approaches like bitsquatting
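Bitsquatting relies on a single flipped bit (for example, from a memory error) turning one character of a domain name into another. As a rough sketch of the idea, and not the tool’s actual implementation, the function below lists every label that differs from the input by one bit and is still a valid hostname label. The names are hypothetical.

```python
import string

# Characters allowed in a hostname label (letters, digits, hyphen).
ALLOWED = set(string.ascii_lowercase + string.digits + "-")


def bitsquat_variants(label):
    """Return valid labels that differ from `label` by one flipped bit."""
    variants = set()
    for index, char in enumerate(label):
        for bit in range(8):
            # DNS is case-insensitive, so a case-only flip is not a new name.
            flipped = chr(ord(char) ^ (1 << bit)).lower()
            if flipped not in ALLOWED or flipped == char:
                continue
            candidate = label[:index] + flipped + label[index + 1:]
            # A label may not start or end with a hyphen.
            if not candidate.startswith("-") and not candidate.endswith("-"):
                variants.add(candidate)
    return sorted(variants)


print(bitsquat_variants("zoom"))
```

Each returned label can be combined with a domain level such as .com and checked against the registration feeds.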
As an example, if we were to register the last record of the above image (zööm.com), the equivalent domain name in IDN format would be xn--zm-fkaa.com, a perfectly valid domain name, ready to be purchased.
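The conversion can be reproduced with Python’s built-in `idna` codec, which implements the older IDNA 2003 rules and is sufficient for this example. The `to_ascii` helper name is hypothetical.

```python
def to_ascii(domain):
    """Convert a Unicode domain name to its ASCII-compatible (Punycode) form."""
    return domain.encode("idna").decode("ascii")


print(to_ascii("zööm.com"))  # xn--zm-fkaa.com
```

Plain ASCII names pass through unchanged, so the same helper can be applied to a whole list of candidates.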
While there are multiple options, another tool that can assist the defensive side of this attack is “Punycode Alert”.
This browser extension notifies you when you visit a website whose domain name is Unicode-encoded. It won’t save you, but it can give you a useful heads-up if you reached the site unintentionally.
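The same kind of check is easy to add to your own scripts. The sketch below is not the extension’s code; it simply flags hostname labels that carry the `xn--` Punycode prefix and decodes them with Python’s built-in `idna` codec so the real characters become visible. The function names are hypothetical.

```python
def punycode_labels(hostname):
    """Return the labels of `hostname` that use the Punycode `xn--` prefix."""
    return [
        label
        for label in hostname.lower().split(".")
        if label.startswith("xn--")
    ]


def to_unicode(hostname):
    """Decode an ASCII hostname so Punycode labels show their real characters."""
    return hostname.encode("ascii").decode("idna")


for host in ("zoom.us", "xn--zm-fkaa.com"):
    flagged = punycode_labels(host)
    if flagged:
        print(f"[!] {host} contains Punycode labels {flagged}: {to_unicode(host)}")
    else:
        print(f"[ok] {host}")
```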
On the offensive side, URLCrazy is a tool included in the Kali Linux security distribution that lets you examine different domain name combinations and possible user typos, along with additional features such as domain popularity on Google’s search engine.
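To give a feel for the kind of typo candidates such tools enumerate, here is a small sketch covering three common mistakes: a missing character, a repeated character and two adjacent characters swapped. It is a simplified illustration and not URLCrazy’s algorithm. The function name is hypothetical.

```python
def typo_variants(label):
    """Return simple typo candidates: omissions, repetitions, adjacent swaps."""
    variants = set()
    for i in range(len(label)):
        # Character omission, e.g. "zoom" -> "zom"
        variants.add(label[:i] + label[i + 1:])
        # Character repetition, e.g. "zoom" -> "zzoom"
        variants.add(label[:i] + label[i] + label[i:])
        # Adjacent character swap, e.g. "zoom" -> "ozom"
        if i < len(label) - 1:
            variants.add(label[:i] + label[i + 1] + label[i] + label[i + 2:])
    # Drop the original name and the empty string if they slipped in.
    variants.discard(label)
    variants.discard("")
    return sorted(variants)


print(typo_variants("zoom"))
```

The resulting labels can be matched against the per-date feed files to see whether any of them were recently registered.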
Fake domain names are a real issue. They can attract the wrong audience and even make your project look bad. To preserve your online reputation, being aware of these threats is crucial.
In this post, we saw how any person could use machine learning algorithms to do this work automatically. From this point on it should be easy to integrate this kind of logic into your dashboard of choice.
Also, IP intelligence gathering tools like domain and subdomain feeds are a great asset to monitoring scripts, as we explored today.
These kinds of resources can help your company stay ahead of the game—and promptly identify malicious attempts to deceive you or your customers. Combining your current, traditional security tools with anomalous behavior detection software will get you far in processing the incredible amount of data generated by network and application protection equipment.
Our API and Feeds make awesome assets to your enterprise monitoring system, allowing you to easily integrate scripts and rapidly parse accurate information. Get ready to base your defensive decisions on reliable data sets—contact our sales team and schedule a call today!