SecurityTrails Blog · Apr 28 · by Esteban Borges

theHarvester: a Classic Open Source Intelligence Tool

Have you ever wished you could retrieve data from multiple sources in a quick and easy manner as part of your reconnaissance of a threat?

Although there might be a handful of proprietary tools out there with this capability, the spirit of open source is on full display in the tool we are reviewing today. theHarvester is simple to install and obtains its data from more than a dozen sources, paid and free. The good news is that it includes a native integration with the SecurityTrails API™.

What is theHarvester?

theHarvester (purposely spelt with a lower-case ‘t’ at the beginning) is a command-line tool made by the team at Edge-Security. It is written in Python and is meant to be used in the initial stages of an investigation, leveraging open source intelligence (OSINT) to help determine a company’s external threat landscape on the internet.

The tool was originally designed to be used in the early stages of a penetration test or red team engagement. However, the passive reconnaissance abilities of theHarvester also make it suitable for blue or purple teams, depending on the situation.

We will list a few of the passive and active reconnaissance data-sources used by theHarvester below.

Passive:

  • Baidu
  • Bing
  • dnsdumpster
  • Duckduckgo
  • Google
  • Hunter
  • Qwant
  • SecurityTrails
  • Shodan
  • Trello
  • Twitter

Active:

  • DNS brute force: dictionary brute force enumeration
  • Screenshots: Take screenshots of subdomains that were found
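To make the idea of dictionary-based DNS brute forcing concrete, here is a minimal Python sketch. It is not theHarvester’s own implementation: the function names, the wordlist and the injectable `resolver` parameter are all assumptions made for illustration. Each word in the list is prepended to the domain, and only the names that resolve are kept.

```python
import socket
from typing import Callable, Dict, Iterable, Optional


def resolve(hostname: str) -> Optional[str]:
    """Return one IPv4 address for hostname, or None if it does not resolve."""
    try:
        return socket.gethostbyname(hostname)
    except OSError:  # socket.gaierror is a subclass of OSError
        return None


def brute_force_subdomains(
    domain: str,
    wordlist: Iterable[str],
    resolver: Callable[[str], Optional[str]] = resolve,
) -> Dict[str, str]:
    """Try <word>.<domain> for every word and keep the names that resolve.

    Hypothetical helper for illustration; not part of theHarvester.
    """
    found: Dict[str, str] = {}
    for word in wordlist:
        word = word.strip().lower()
        if not word:
            continue  # skip blank lines in the wordlist
        candidate = f"{word}.{domain}"
        ip = resolver(candidate)
        if ip is not None:
            found[candidate] = ip
    return found
```

Calling `brute_force_subdomains("example.com", ["www", "mail", "dev"])` sends real DNS queries, which is why this counts as active reconnaissance. Passing a different `resolver` makes it possible to try the logic without touching the network.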

Installing

There are 4 different installation methods for theHarvester. These are:

  • Kali Linux – already comes installed
  • Docker
  • From Source (without using Pipenv)
  • From Source (with Pipenv)

We will install the software using the third option (from source, without Pipenv). As always, it’s important to use some type of sandboxing environment when installing new software. You could opt for a virtual machine (VM), a container or a remote test server. We are using Ubuntu 20.04 for this review, and any commands used here should apply to Debian-based distros (and with a few minor tweaks, to other distros as well).

First we should update our sandbox and install the software we will need:

sudo apt update
sudo apt upgrade
sudo apt install git python3-venv

Now we can make a Python virtual environment to install the necessary Python packages:

python3 -m venv harvest
cd harvest/
source bin/activate
git clone https://github.com/laramies/theHarvester
cd theHarvester/
pip install wheel
pip install -r requirements/base.txt

We found a minor bug during the installation and had to install the pip package ‘wheel’ before installing the other packages from ‘base.txt’.

If you followed our installation instructions, you can run:

python theHarvester.py -h

and you will see the following:

[Screenshot: theHarvester tool help output]

Lastly, we need to add some API keys to the api-keys.yaml file. We will just use free/freemium APIs for running the tests. You can edit the file with any editor, such as nano or vim:

nano -w api-keys.yaml

Usage

We will use some existing data from our Recon Safari #3 to see what else we can uncover using theHarvester.

We will do a basic search of rpfront[.]com, limiting our results to 50 and using Google as the source:

python theHarvester.py -d rpfront.com -l 50 -b google

[Screenshot: theHarvester tool usage]

The results are not very interesting, as we don’t expect Google to have the OSINT data that different APIs might have. Let us try again, but this time using different APIs:

1. SecurityTrails

Run the following:

python theHarvester.py -d rpfront.com -l 50 -b securityTrails

[Screenshot: SecurityTrails API results]

There is a lot more data here. We found 3 related IPs and 2 hosts.

2. ThreatCrowd

We found no results when using ThreatCrowd as the source.

3. UrlScan

Running:

python theHarvester.py -d rpfront.com -l 50 -b urlscan

[Screenshot: UrlScan API results]

We were able to find 5 IPs and 1 host.
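The per-source runs above differ only in the value passed to -b, so they are easy to script. The following is a minimal Python sketch that assumes it is run from the cloned theHarvester directory inside the activated virtual environment. `build_command` and `run_sources` are hypothetical helper names, and only the -d, -l and -b flags shown in this article are used.

```python
import subprocess
import sys
from typing import List


def build_command(domain: str, source: str, limit: int = 50,
                  script: str = "theHarvester.py") -> List[str]:
    """Build the argument list for one theHarvester run against one source."""
    # sys.executable is the Python interpreter of the active virtual environment.
    return [sys.executable, script, "-d", domain, "-l", str(limit), "-b", source]


def run_sources(domain: str, sources: List[str], limit: int = 50) -> None:
    """Run theHarvester once per source, one after another (hypothetical helper)."""
    for source in sources:
        cmd = build_command(domain, source, limit)
        print("Running:", " ".join(cmd))
        # check=False: a failing source should not stop the remaining runs.
        subprocess.run(cmd, check=False)
```

For example, `run_sources("rpfront.com", ["google", "securityTrails", "urlscan"])` would repeat the searches shown so far in a single call.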

Primary OSINT data sources are great for finding IPs and hosts, and the tool offers a large set of sources to choose from. We will now run more tests using other third-party services.

Strangely, we found nothing on Twitter for either ‘rpfront’ or rpfront[.]com. We then tested another active domain, but found nothing on Twitter for ‘moslempress’ either:

python theHarvester.py -d moslempress -l 100 -b twitter

We used DuckDuckGo to search for moslempress[.]com and came back with no results as well:

[Screenshot: DuckDuckGo search]

Given the lack of results, we were a bit concerned that we might be doing something wrong, so we decided to test the Hunter API to see if it had any data on ‘moslempress’. Unfortunately, we found nothing there either:

python theHarvester.py -d moslempress.com -l 10 -b hunter

In order to verify that Hunter was indeed working, we tested it out on a more common domain:

[Screenshot: Hunter test on a more common domain]

Fortunately we obtained some email results, as shown above. The discrepancy here is most likely due to the lack of data available for this niche domain on Hunter and other sources.

ThreatMiner and RapidDNS helped us uncover the following:

[Screenshots: ThreatMiner and RapidDNS results]

The 2 data sources corroborate the current IP address of www[.]moslempress[.]com.

In order to get a full view of the moslempress[.]com domain, we will now run a scan on all sources and output the result to an HTML file to see what the results look like. We can do so by running the following:

python theHarvester.py -d moslempress.com -l 50 -b all -f moslempress.html

This was the most useful scan so far. We uncovered lots of data, as shown in the screenshots below:

[Screenshots: results of the scan on all sources; emails found]

The number of data points here showcases the value of theHarvester. The only drawback appeared when we ran the scan a second time. We suspect that most of the sources deal with spam-like scans all the time, so a repeat scan from the same IP so soon afterwards was probably flagged by their firewall or protection services. This was definitely the case for LinkedIn, and we also noticed that the number of hosts dropped from 47 to 31, which indicates blocking or rate-limiting by 1 or more of the services. One way around this issue is theHarvester’s --proxies option, for which multiple proxy IPs can be added to proxies.yaml.
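When a repeat scan returns fewer hosts, it helps to know exactly which ones went missing. The sketch below compares the host lists of two runs. It assumes you have already collected each run’s hostnames into a plain Python list. The hostnames in the usage note are placeholders, and `compare_runs` is a hypothetical helper, not something theHarvester provides.

```python
from typing import Dict, Iterable, List, Set


def normalize(hosts: Iterable[str]) -> Set[str]:
    """Lower-case names, trim whitespace and trailing dots, drop blanks."""
    return {h.strip().lower().rstrip(".") for h in hosts if h.strip()}


def compare_runs(first: Iterable[str], second: Iterable[str]) -> Dict[str, List[str]]:
    """Report hosts missing from, new in, and common to the second run."""
    a, b = normalize(first), normalize(second)
    return {
        "missing": sorted(a - b),  # seen in the first run only
        "new": sorted(b - a),      # seen in the second run only
        "common": sorted(a & b),   # seen in both runs
    }
```

For instance, `compare_runs(["www.example.com", "dev.example.com"], ["www.example.com"])["missing"]` lists `dev.example.com`. If the missing hosts all come from the same source, that source is a likely candidate for blocking or rate-limiting.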

The output from the HTML is also very useful:

[Screenshot: theHarvester HTML report table]

The table above has multiple filtering options, so we can see which results came from each source. The plot graphs (not shown) were not very useful for us. Based on their functionality, they seem better suited to repeat scans and investigations, where they can show changes in the results over time. At the end of the HTML output we also got a summary of the results from each source:

[Screenshot: theHarvester statistics]

What we found a bit confusing was the --shodan option, which is not a data source and has to be included as a flag on a regular domain search. After a bit of tinkering, we arrived at the following usage:

python theHarvester.py -d moslempress.com -l 50 -b securityTrails --shodan

and the output is as follows:

[Screenshot: output of the --shodan run]

What we gathered is that in order for Shodan to work, an IP address has to be found so that it can be looked up by the service.
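The dependency on an IP address can be illustrated with a short Python sketch that separates harvested hosts into those that have an IP (and could therefore be looked up) and those that do not. It assumes the hosts are available as strings in a `host:ip` form, which is an assumption about the output format rather than something confirmed here. `split_by_ip` is a hypothetical helper and does not call Shodan.

```python
import ipaddress
from typing import Dict, Iterable, List, Tuple


def split_by_ip(entries: Iterable[str]) -> Tuple[Dict[str, str], List[str]]:
    """Split 'host:ip' entries into hosts with a valid IP and hosts without one."""
    with_ip: Dict[str, str] = {}
    without_ip: List[str] = []
    for entry in entries:
        # Split on the first colon only, so IPv6 addresses stay intact.
        host, _, ip_text = entry.strip().partition(":")
        if not host:
            continue  # skip blank entries
        try:
            ip = ipaddress.ip_address(ip_text.strip())
        except ValueError:
            without_ip.append(host)  # no IP found, so nothing to look up
        else:
            with_ip[host] = str(ip)
    return with_ip, without_ip
```

Only the hosts in the first returned mapping would be candidates for an IP-based lookup. The rest would need to be resolved first.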

Our final scan was run as follows:

python theHarvester.py -d moslempress.com -l 50 -b securityTrails -c --screenshot ~/harvest/theHarvester/sc/

We attempted to brute force the DNS (using -c) and to take screenshots and save them. The DNS brute forcing was not useful because we had already obtained these results from multiple sources. The screenshot feature did not work for us, and we got the same error after attempting it a few times:

[Screenshot: error output from the final scan]

Summary

Overall, we would rate theHarvester as an excellent tool for any basic OSINT research. The passive reconnaissance from multiple sources is a huge enabler for starting any investigation.

One of the drawbacks of this tool is the lack of documentation, but an extensive review like ours covers many different uses of the tool that the bulk of other tutorials don’t. We also found 1 or 2 bugs, but this is expected with any open source tool and does not decrease the value of the tool in the slightest.

ESTEBAN BORGES

Esteban is a seasoned security researcher and cybersecurity specialist with over 15 years of experience. Since joining SecurityTrails in 2017 he’s been our go-to for technical server security and source intelligence info.