For security practitioners pinned to the certitude that the internet still is, and will remain for the foreseeable future, a technological wild west, investigating the evolution of large-scale scanning activity is of paramount importance in determining potential countermeasures. Among these techniques, internet-wide scanning reigns supreme: searching for publicly available records on the internet is as widespread as it's ever been, fostered by a growing number of tools and platforms that can be queried at the push of a button. And despite its simplistic overtones, internet scanning can be a powerful ally in the search for the now-hackneyed low-hanging fruit that miscreants are still so willing to pursue.
In this blog post, we'll explore the benefits of internet scanning, regardless of purpose or degree of interaction with the intended target, reviewing the most relevant milestones of the technique. Finally, we'll account for a handful of the most popular tools, including our very own SurfaceBrowser™ and API framework, to understand how they can be leveraged to collect data at unprecedented speed and scale.
We have a few things to unpack, so let’s get started!
What is internet scanning?
Internet scanning, sometimes referred to as internet intelligence, is the composite set of data collection techniques used to detect online devices and identify their individual footprints. This form of horizontal reconnaissance has opened an important window into the current state of the internet at large, not only by allowing vulnerable services and endpoints to be discovered in a matter of minutes, but by empowering academia to track the evolution of the worldwide network and its emergent behaviors.
For example, using a growing number of publicly available search engines and projects, researchers can quickly assemble and transform a seemingly insignificant collection of disparate resources into a wealth of information. This can result in the identification of malicious entities, such as botnets bent on disrupting normal internet activity via spamming or denial-of-service campaigns, or in the ability to empirically analyze interacting protocols in ways that lead to performance improvements. From this perspective, internet scanning can target transient (or more permanent) issues and other fundamental shortcomings by combining the knowledge of services such as geo-referenced applications to isolate the problems at hand.
However, the most common form of internet scanning takes place at the hands of attackers, as the first in a long sequence of steps to find unsecured endpoints or even entire networks. A few generalizations can be made about the manner in which malicious actors approach the challenge: generally speaking, the attacker will have a specific target in mind before engaging in any sort of scanning, whether the activity calls for a large-scale or a more localized approach. Devices and services exposed to the internet are identified on an open-port basis; that is, the scanning in question is aimed at finding listening processes as a precursor to intrusion attempts, and possibly exploitation.
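To make the open-port idea concrete, here's a minimal sketch of the simplest form of probing, a TCP connect scan, written in Python. The function name and port range are ours, purely for illustration; real scanners use far more sophisticated (and stealthier) probes:

```python
import socket

def scan_ports(host, ports, timeout=0.5):
    """Return the subset of `ports` that accept a TCP connection on `host`."""
    open_ports = []
    for port in ports:
        # A completed three-way handshake means a process is listening.
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.settimeout(timeout)
            if sock.connect_ex((host, port)) == 0:
                open_ports.append(port)
    return open_ports

if __name__ == "__main__":
    # Probe a small local range; adjust host/ports to taste.
    print(scan_ports("127.0.0.1", range(8000, 8010)))
```

Because each probe here completes a full handshake, the scan is noisy and easily logged, which is precisely why attackers prefer half-open (SYN) probes and distributed sources.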
Internet-scanning dynamics—benefits & challenges
From an attacker's vantage point, there are statistical benefits to blending one's traffic into the onslaught of scanning activity already present on the network, which by some estimates accounts for as much as 67 percent of overall unsolicited one-way traffic. After all, millions of compromised hosts in botnet fashion scour the internet day and night looking for additional targets of opportunity, aggressively scanning the entire IPv4 address space in a matter of minutes while making heavy demands on mitigation systems.
Some of the challenges associated with internet scanning predate the technique itself and are closely related to the inherent lack of a comprehensive taxonomy for determining which type of probing packet elicits the best (most accurate) response. In general, traditional internet scanning is based on TCP, UDP, and ICMP probes, leveraging a composite set of protocols across the various network layers to obtain echo responses from a "live" host and yield important insights about the target.
Given what we know about certain ambiguities in the language of a number of RFCs, it is possible for a skilled attacker to evade perimeter defenses by crafting bogus packets that take advantage of interoperability issues, and potentially unknown security flaws, introduced by such uncertainties across different operating systems. There is also the recent case of improperly generated ISNs (initial sequence numbers) in custom TCP/IP implementations, generating new CVEs that range from denial-of-service conditions to authentication bypasses.
In turn, when it comes to properly scanning and detecting devices across the IP stack, understanding your tool of choice’s fingerprinting capabilities and limitations is paramount to obtaining an accurate representation of the target of interest.
Brief history and future prospects of internet-wide scanning
But, how did it all get started?
Well, if we go back to the late 1990s, we begin to see the onset of host classification attempts in the shape of IP/port value-pair identification by prominent scanning platforms such as Nmap (discussed later on) and others. However, internet-wide scans were not limited to standalone applications; in fact, when Google Dorks arrived on the scene in the early 2000s, googledorking quickly became synonymous with internet hacking, a trend that has stood the test of time: you can still leverage relatively unsophisticated search engine (Google) queries from a browser to explore vulnerable applications and services across the globe.
Crawling the world wide web using techniques like googledorking helped raise cybersecurity awareness on a number of fronts. For instance, using a handful of simple operators and some crafty lookup combinations, bad actors could uncover a trove of sensitive information, including default user accounts and settings; similar approaches would also yield significant database exposures and everyone's favorite: leaked credentials. Despite multiple attempts to cushion the impact of these techniques on vulnerable web infrastructure using anti-automation measures, the criminal element swiftly turned to bots that impersonated the search process as if it were coming from individual users.
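To illustrate the kind of operator combinations involved, here are a few classic, widely documented dork patterns of the sort described above, shown for awareness only:

```
intitle:"index of" "parent directory"   # open directory listings
filetype:sql "INSERT INTO"              # leaked database dumps
inurl:admin intitle:login               # exposed administrative panels
```

Each relies on nothing more than standard search operators (`intitle:`, `inurl:`, `filetype:`) combined with strings that tend to appear only on misconfigured or leaked resources.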
Founded in 2004, Shadowserver was conceived as a volunteer-driven watchdog organization amidst an internet already teeming with egregious activity. The key to its success hinged on the efforts of the entire security community, including governments, law enforcement, and major corporations, to pool resources and research to assist in tracking down and reporting emerging threats. For years now, Shadowserver has remained an unpretentious source of valuable threat intel, scanning the internet every single day and collecting over a billion malware samples, along with related offensive literature, in the process.
Modern-day internet scanning has seen an explosion in the number of services and engines dedicated to collecting cyber intelligence, making them a much-needed component of OSINT deployments across the board. They include projects such as Censys, ZMap, Shodan, ZoomEye, GreyNoise, and our own SecurityTrails portfolio of services. Combined, they embody a collective framework referred to as internet intelligence, whose capacity for aggregating and enriching disparate web resources and data is central to the idea that the more you limit the interaction between attacker and victim, the more successful any reconnaissance attempt will be.
At the end of 2020, a dynamic cloud infrastructure framework known as Axiom was launched, taking the internet-scanning game to the next level. Although not an internet-scanning tool per se, Axiom takes a repeatable approach to the challenge of creating distributed pentesting instances, which in turn can be used to parallelize your scanning efforts in a predictable and intuitive manner. Axiom works by replicating a pre-installed base image of your favorite tools across several cloud providers to perform the desired parallel scan.
In all, as the internet landscape continues to expand beyond the client-server paradigm to encompass newer forms of connectivity, such as IoT (internet of things) and IoB (internet of bodies), chances are that internet scanning will be kicked into overdrive, fueling the transformation of the entire industry while raising obvious privacy concerns and the social implications of tracking vulnerable systems and users. But for now, let's see what some of these tools can accomplish, and in what sort of fine-grained manner they can purposefully reveal the data we seek.
Popular internet-scanning tools
Once again, the security implications of internet-wide scanning cannot be overstated. As mentioned, these tools can serve a multitude of purposes, including searching for and discovering new CVEs (public references to disclosed vulnerabilities), examining the adoption of defense mechanisms, or cataloging running services and software components, insofar as these can be discovered from the public side of things in a contactless fashion.
Nmap, short for "Network Mapper", is perhaps the most popular network mapping tool there is, with an extensive list of features and optimization techniques that translate into highly accurate results. Although not particularly suited to internet-scanning activities, due to the somewhat invasive nature of its probes and the high number of connections it needs to maintain, Nmap remains the go-to open source port-scanning alternative for its ability to create and manipulate raw packets and sockets, as well as for its firewall subversion opportunities.
Belonging to the more traditional TCP/IP arsenal of fingerprinting tools, Nmap can also make use of a variety of ICMP ping types to solicit replies from active hosts. Consequently, Nmap is frequently used by large companies, as well as smaller organizations, for port auditing, host monitoring, penetration testing, and similar tasks. As explored in one of our recent articles detailing the top Nmap use cases and examples, the tool contains over 600 predefined scripts as part of its Scripting Engine (NSE), which can extend its capabilities well beyond the command-line domain.
If higher speeds and agility are required, Masscan can certainly get the job done; in fact, the tool boasts the ability to scan the entire IPv4 address space in less than five minutes, exhibiting rates close to ten million packets per second, according to its creator. Its use of arbitrary addresses and port ranges allows for an asynchronous approach to the challenge of connectionless scanning: a procedure that does not rely on the completion of the entire probe before reporting results, despite its potential for generating erroneous data due to dropped packets.
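Masscan achieves its speed with a custom userland TCP/IP stack and raw SYN probes, which is beyond a short example. Still, the asynchronous idea itself (firing many probes concurrently rather than completing them one at a time) can be sketched in Python with asyncio; the helper names and port range here are illustrative only:

```python
import asyncio

async def probe(host, port, timeout=0.5):
    """Attempt one TCP connection; return the port if it's open, else None."""
    try:
        _, writer = await asyncio.wait_for(
            asyncio.open_connection(host, port), timeout)
        writer.close()
        await writer.wait_closed()
        return port
    except (OSError, asyncio.TimeoutError):
        return None

async def sweep(host, ports):
    # All probes are launched at once; we wait for them collectively,
    # instead of blocking on each handshake in sequence.
    results = await asyncio.gather(*(probe(host, p) for p in ports))
    return [p for p in results if p is not None]

if __name__ == "__main__":
    print(asyncio.run(sweep("127.0.0.1", range(8000, 8010))))
```

Unlike Masscan, this sketch still completes full handshakes and keeps per-connection state, so it only approximates the concurrency model, not the raw-packet transmission rates.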
Masscan produces results similar to Nmap's, with the added bonus of being able to output files in XML, JSON, grepable, and plain-text formats, the grepable derivative being readily consumable by other command-line tools.
Dubbed "the modern port scanner," RustScan features an adaptive engine that greatly accelerates port discovery (usually a matter of seconds) while piping its results into Nmap for additional processing. Empowering Nmap this way translates into better scripting opportunities as well: the tool allows for the addition of common scripting languages, like Python or Perl, in the form of custom files that extend RustScan's functionality beyond cursory port scanning.
The origins of modern-day searching for online devices can be traced back to 2009, when John Matherly, a software developer and freelancer, launched what would become the most popular search engine exclusively dedicated to finding publicly exposed infrastructure: Shodan. The progression to search engines for internet-connected devices meant that, all of a sudden, companies were able to quickly locate and assess their competitors' web presence, a concept known as market intelligence, by simply entering the domain name in question on Shodan's home page using a browser of their choice.
Interested in finding out whether a specific technology has a web presence you can explore? No problem. Let's take the case of PrismView, the software stack behind many of the billboard displays we commonly see out there, and use Shodan to run a quick search for any devices running PrismView:
As you can see below, the results can be quite surprising; not only did we find visible endpoints, but we can also pivot to the destination to confirm our findings.
The idea behind Shodan was rather simple: explore the metadata (the collection of underlying information and context about a particular piece of software or service) running on the endpoint by means of banner grabbing, and make it searchable. Shodan also exposed the entire community, including those lacking any creditable intentions, to a myriad of devices ranging from industrial control systems to baby monitors. Fortunately, the tool's generic usability also extended to enterprise security, allowing defenders to round up and close potential gaps and vulnerabilities before they could be discovered by attackers.
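A toy version of the banner-grabbing step is straightforward to sketch: connect, read whatever the service volunteers first, and record it. The helper below is illustrative, not how Shodan is actually implemented:

```python
import socket

def grab_banner(host, port, timeout=2.0, max_bytes=1024):
    """Connect and read whatever the service sends first (its banner)."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.settimeout(timeout)
        try:
            return sock.recv(max_bytes).decode(errors="replace").strip()
        except socket.timeout:
            return ""  # the service expects the client to speak first

if __name__ == "__main__":
    # Many SSH and FTP daemons announce themselves unprompted.
    try:
        print(grab_banner("127.0.0.1", 22))
    except OSError as exc:
        print(f"no banner: {exc}")
```

The returned string (for instance, a version line such as `SSH-2.0-OpenSSH_x.y`) is exactly the kind of metadata a search engine can index and make queryable at scale.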
In 2013, ZMap introduced the academic practice of single-packet network scanning, using a modular architecture to achieve internet-wide survey speeds exceeding those of Nmap by a factor of a thousand or more, with comparable accuracy. This kind of optimized probing forgoes keeping connection state in favor of faster results; it also prevents network saturation by randomizing the scanning order using a permutation technique built on cyclic multiplicative groups.
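The permutation trick deserves a quick illustration. The idea is to iterate over a cyclic multiplicative group modulo a prime just above the size of the address space, so every address is visited exactly once, in a shuffled order, without storing any per-address state. A toy version over a tiny "address space" (p = 11, with primitive root g = 2 standing in for the much larger real parameters):

```python
def permuted_order(p=11, g=2):
    """Visit 1..p-1 exactly once, in a pseudorandom order.

    Toy ZMap-style randomization: repeatedly multiplying by a primitive
    root g of the prime p walks the entire multiplicative group mod p,
    so every element appears exactly once before the cycle closes.
    """
    x = start = g
    while True:
        yield x
        x = (x * g) % p
        if x == start:
            break

if __name__ == "__main__":
    order = list(permuted_order())
    print(order)                                # [2, 4, 8, 5, 10, 9, 7, 3, 6, 1]
    print(sorted(order) == list(range(1, 11)))  # full coverage: True
```

Scaled up to the IPv4 space, the same construction spreads probes across destination networks instead of hammering them in sequence, which is what keeps individual networks from saturating.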
ZMap runs well on commodity hardware and supports a list of operating systems that includes many flavors of GNU/Linux as well as BSD. ZMap's output modules allow application-specific results to be delivered to a number of destinations, including databases or additional processes that integrate the piped data. Output modules can also serve as triggers for testing and validating certain packet exchanges, such as SYN/ACK conditions, without resorting to the local system's TCP stack for session support.
ZoomEye is yet another web-based search engine for security researchers, capable of fingerprinting internet-connected devices and services using two distributed crawling and detection engines known as Xmap and Wmap. Using ZoomEye, security practitioners can quickly discover specific vulnerabilities by narrowing their search to a particular version of the impacted application or product, which can cut down on research time.
Here’s an example of ZoomEye being leveraged to find FTP services running on a given host:
As with many of these web-based platforms, ZoomEye provides programmatic access via its API, which boils down to automating queries and provisioning additional application snippets and add-ons for integration purposes. Its library of special topics is an excellent resource for those looking to perform vulnerability impact assessments based on clustered components, providing filtered searches instantly transferable to the backend engine with a click of the mouse.
Alternative ways to access internet-wide data
If you're looking for an alternative to conducting a mass scan, in favor of a more passive approach whereby data are already collected and indexed for consumption, you have a few choices. In fact, we've built this company with that mission in mind: letting people access all the OSINT data they need in mere seconds, with no need to perform any manual or even automated scans.
SurfaceBrowser™ is none other than SecurityTrails' very own curated, passive-intelligence engine for OSINT-related activities, providing massive amounts of data from the internet.
When we launched SurfaceBrowser™ almost three years ago, the tool became an instant success for its ability to cull and augment IPs, domain names, and several other feed-based data points into a one-stop application for browsing corporate attack surfaces. As of today, SurfaceBrowser™ provides the most accurate technical representation of internet-facing assets, and does so within guardrails that keep users clear of the legal issues and ethical considerations surrounding active scanning.
Our algorithm prioritizes findings and helps you uncover hidden data and services you may have thought lost or simply forgotten: an always-welcome proposition for system administrators and security teams seeking to identify unknown areas of exposure across the cyber spectrum. The tool also enhances visibility in areas such as SSL certificate transparency, associated domains, and reverse DNS records, in sharp contrast with other platforms that can only supply a subset of this information.
As explained earlier, API usage imparts a new dimension to internet scanning. On the one hand, APIs can expose a larger set of function calls and attributes than those usually accessible via a web browser, allowing data to be consumed in discrete yet very adaptable ways. On the other, API-centric approaches act as a soft boundary between interacting components, allowing users to transform manual processes into more agile versions tailored to the needs of the organization.
To this effect, the SecurityTrails API™ allows you to programmatically access all IP, DNS, WHOIS, and other company-related information available in the SecurityTrails Web Platform and beyond. In a nutshell, the API uses a form of internal domain-specific language (DSL) capable of building flexible yet complex queries across all datasets, with a syntax similar to that of SQL WHERE predicates. When it comes to internet scanning, API integration can always be bundled with additional SDKs and wrappers for extra validation and augmentation.
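As a quick illustration of programmatic access, the sketch below builds (but deliberately doesn't send) a request against the SecurityTrails API; the endpoint path and `APIKEY` header reflect the public documentation at the time of writing, while the helper name and placeholder key are ours:

```python
import json
import urllib.request

API_KEY = "your-api-key-here"  # placeholder: substitute your real key
BASE = "https://api.securitytrails.com/v1"

def build_domain_request(domain):
    """Build (but don't send) a GET request for a domain's DNS overview."""
    return urllib.request.Request(
        f"{BASE}/domain/{domain}",
        headers={"APIKEY": API_KEY, "Accept": "application/json"},
    )

if __name__ == "__main__":
    req = build_domain_request("example.com")
    print(req.full_url)
    # To actually send it (network access and a valid key required):
    # with urllib.request.urlopen(req) as resp:
    #     data = json.loads(resp.read())
```

From here, wrapping such calls in an SDK-style module with retries, pagination, and response validation is the natural next step for integration work.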
Internet scanning, as a discipline, will continue to evolve as the number of connected devices and computer networks continues to grow in both size and complexity. In the race to understand and mitigate the risks associated with this expansion, companies and cyber practitioners alike are turning to more novel approaches to proactively identify the targeted scanning campaigns that precede a potential cyberattack. After all, as history would have it, there is a positive correlation between scanning and malicious activity, with internet scanning becoming more and more an invariable component of any meaningful reconnaissance project.
But none of today's results are deterministic: analyzing endpoints on the internet at large remains a challenging task, with certain subtleties and ambiguities in protocol design specifications leading to inconsistent interpretations and erroneous reporting. Unsolicited traffic, often in the form of pervasive background noise, also adds to the problem; largely uncategorized, it is nevertheless dominated by worms and similar activity in sufficient volume as to require adequate baselining.
Such a heterogeneous mix of technologies, computing resources, services, and applications calls for a robust tool capable of transforming publicly available information into a multipurpose taxonomy of consumable security signals. SurfaceBrowser™ belongs to one such class of asset-discovery applications that can boost your productivity a hundredfold by providing the insight you need in a one-of-a-kind presentation format. Give SurfaceBrowser™ a try today and let us know what you think.
You’ll be pleasantly surprised.