Through a Data Scientist's Lens: Interview with Ilija Subašić, PhD
Reading time: 12 minutes
For many, the job title of data scientist can seem a bit perplexing. What do they actually do? Is this really the sexiest job title of the 21st century? What makes their perspective on data so different than our non-data perspective? Today, we ask SecurityTrails' Lead Data Scientist to find out.
A good data scientist is someone who can look for and find patterns in the most unexpected places, and make sense of it all, with the goal of creating systems that benefit an entire organization and function within real life.
By studying the thought processes of data scientists, we can then learn and understand the benefits they bring to organizations and companies.
In this interview, we took a different route by going to one of our own experts working here at SecurityTrails. Ilija Subašić is a data scientist who left his home in Serbia to pursue a career in academia and to help startups discover and share knowledge through data analysis. He is currently our Lead Data Scientist. We’re exceptionally glad to have a man of his caliber working in our ranks at SecurityTrails, enriching our company’s culture and employing his expertise as a PhD in data science.
After finishing his PhD, pursuing a career in academia, and publishing many papers, he is now travelling the world as a digital nomad and has worked with different companies on real-life systems based on big data. He agreed to show us the true lens through which data scientists observe and analyze data, so we can all learn something from their approach and solve data problems more efficiently.
SecurityTrails: You are in Hawaii at the moment, but you travel all around the world while working as a data scientist. What drove you to start this type of lifestyle?
Ilija Subašić: I worked remotely for a number of years before I started traveling more often. I never liked working in an office, and after a spell of working from home, I decided to try a bit of traveling and divide my time between several locations. Then I got into traveling, and have been on the road for 5 or so years. Now it is hard to stop, even though the marginal excitement from new locations has long passed.
Traveling while working comes with its own set of challenges. What would you say were the biggest ones for you?
Ilija: The biggest challenge of working remotely is staying productive without any formal structure of hours or environment. It is not that people in offices look over your shoulder, but there is still a sense of a panopticon organization. It is different when you have to maintain a level of discipline and productivity that is driven by yourself. With the logistics of traveling with a 10 kilo bag for years, there is an additional layer of self-discipline that is kind of location oriented. It requires an extra effort to stay productive when you can go to the beach and leave your job tasks for later. This can easily turn into a cycle of night work, or high-intensity periods of catching up, and that can't last for long. But once you learn to balance tasks with the environment, productivity just explodes, at least for me. However, working remotely, and especially on the road, is not for everyone. Even though people often think you are on an endless summer holiday, remote workers find themselves in a whole different situation while traveling. Especially if your work is not specifically geared towards traveling, as it is for photographers, for example, it requires a certain mindset to be able to do it while contributing to the organizations you are a part of.
You are actually from Serbia. When did you leave your native country and how did you come to that decision?
Ilija: I had left some time before I started working remotely. The main reason I first left in 2007 was to do my PhD. I wanted to do higher-level research than what was available back home. Once I left I never really went back. While working remotely, I briefly went back for 6 months to a year, to be with my family, but I never really officially went back.
You've worked in academia. What was the thing that attracted you to start your career there?
Ilija: When I started my academic career, machine learning wasn't really accessible to learn unless you were part of one of a few large tech companies or high-end military research facilities in a few countries. That was my main reason for getting into academia.
When I first started getting into research in 2005, what is now called data science was called data mining or KDD (knowledge discovery in databases). It wasn't a field you could just go find a job and start working in, outside of academia. There were no undergrad courses, few postgrad courses, and literally no MOOCs. There were a few companies that used data science/AI/machine learning, but they were highly specialized and difficult to get into at an entry level. At the time, the entry level for any data science position was a PhD, and it had to come from a university known to be good in the field. So, earning your PhD or master's degree was pretty much the only way to get your foot in the door.
How did your transition from academia to data science go?
Ilija: It was more of a transition from academia to industry. At the time I was doing my PhD, data science wasn't really as defined as it is today; it was way more fuzzy. There were a bunch of people and companies doing different stuff using their own tools and infrastructure, but the understanding and willingness to invest wasn't really there. It was mostly a research thing or POC stuff.
It was toward the end of my PhD (around 2009) when big data really started to come through. At that time, I got a Marie Curie scholarship. It was an academia-industry partnership that took people with a mostly academic background and embedded them into a real-life organization. That's how I transitioned into the industry, and I just stayed in it. I did go back to academia a couple of times to teach, but only as a side gig, never again as a full-time job.
There are many definitions out there, but what is data science to you?
Ilija: I still stick with the old KDD definition. Basically, discovering things that are not trivial in your data, with the addition of building systems around them. That is what data science does slightly differently from other approaches to the same problem. Data scientists solve problems with an emphasis, or partial emphasis, on building a working system and doing some level of software engineering. Previously this wasn't really the case, mostly because people did ad hoc stuff.
From your experience, what's been driving the noticeable growth in organizations having more data-driven culture?
Ilija: There are a couple of things. One is that hardware became much cheaper. You can get your own processing cluster for a fraction of the price without investing in the infrastructure. Second, a bunch of software became more available - people have standard libraries that have now been proven to work well in many cases. The third thing is that the data load just increased. You are able to collect and store more data than ever before, and today it's much cheaper to do so.
These are the main driving forces. Now, there is a bit of a gold rush, with so many people pushing data science and AI, and with AI becoming a new buzzword in the last couple of years. You can't hear data science without AI anymore, even though both were around for many years. This has certainly pushed many people to invest in data science.
There is a big difference between companies that claim to be data driven and companies that actually invest in building a data-driven culture. That can't be done overnight, and it requires understanding that data science is not magic, but a kind of scientific process.
What are the skills and techniques that separate a great data scientist from an average one?
Ilija: For me, it's being able to recognize and tie a real life problem to a data science solution, to talk with everyone involved - with the product team building the backbone, software engineers, and infrastructure people, and be able to hold a conversation with everybody. That is, I think, the main difference, the level of interdisciplinary knowledge built on top of data science knowledge.
You can find all these people that are really good in data science competitions or research problems, in which you get a set problem and have to go on and test different things, and come up with the best solution (that is usually measured at the 3rd decimal place)... I've worked with people who are really good with coding and extremely knowledgeable in different libraries and algorithm implementations, but have a harder time handling a problem that requires understanding the data. For me that's the difference between people that are just good, and those that actually understand every part of a data science project and are able to work with all the people involved in building a system.
What are some of the non-technical skills you found very useful in your field of work?
Ilija: It really depends on the type of data you work with. I don't come from a strictly technical background. I actually started with economics, so I'm able to understand how businesses come together. I have found having that background is useful.
I've worked with a lot of news articles, so I have a bit of an understanding of media studies. That means I'm able to read and understand stuff from that niche, and I have a sense of intuition about the stuff I'm working with. I also had a couple of projects that were based on biomedicine, and there I just didn't have any feeling for what the results should be.
So, I would say having a higher than layman understanding of the domain you are working on is a kind of essential non-technical skill for data scientists.
What can we learn from data scientists in approaching data problems, so we can be more effective in resolving them?
Ilija: For us, data is something we build upon, whether by getting insights from the data or by building products on top of it. That's different from people who think of data in a strictly database sense: how to organize it, how to store it, and how to retrieve it in the fastest way.
I think the best thing that data science brings to organizations is a scientific approach to problems. This is why it's called data science. Basically, you take your data and put it through a scientific process of problem definition, solution hypothesis, experiments, and evaluation. You have these companies that build something using machine learning without knowing how good it is, and basically hope for a home run, but essentially have no idea how good the system is before users start using it. There is a reason why it's called data science and not data alchemy or data magic. We have a certain set of procedures to follow, and for anything that is produced there needs to be some proof of how good it is or what we expect. Think of data science as applying scientific methods to data. That's why you'll see a lot of biologists, physicists, or other "hard" scientists going into data science, since it is just a different problem they are solving in a way they are already used to.
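As a rough illustration of the process Ilija describes (hypothesis, experiment, evaluation), the sketch below measures a simple model against a naive baseline on data it has not seen before. Everything in it is an assumption made for illustration: the synthetic dataset, the threshold model, and the function names are hypothetical and are not taken from any SecurityTrails system.

```python
import random


def make_dataset(n, seed):
    """Synthetic data (an assumption): label is 1 when x > 0.5, with 10% label noise."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.uniform(0, 1)
        y = int(x > 0.5)
        if rng.random() < 0.1:
            y = 1 - y  # flip the label to simulate noisy real-world data
        rows.append((x, y))
    return rows


def accuracy(predict, rows):
    """Fraction of rows where the prediction matches the label."""
    return sum(predict(x) == y for x, y in rows) / len(rows)


def fit_threshold(rows):
    """Hypothesis: a single cut-off on x separates the classes.
    Pick the cut-off from a small grid that maximizes training accuracy."""
    grid = [i / 20 for i in range(1, 20)]
    return max(grid, key=lambda t: accuracy(lambda x: int(x > t), rows))


def fit_majority(rows):
    """Baseline: always predict the most common training label."""
    ones = sum(y for _, y in rows)
    return int(ones * 2 >= len(rows))


# Experiment: fit on one sample, evaluate on a separate held-out sample.
train = make_dataset(400, seed=1)
test = make_dataset(200, seed=2)

threshold = fit_threshold(train)
majority = fit_majority(train)

model_acc = accuracy(lambda x: int(x > threshold), test)
baseline_acc = accuracy(lambda x: majority, test)

print(f"threshold model: {model_acc:.3f}  majority baseline: {baseline_acc:.3f}")
```

The point of the sketch is the shape of the workflow rather than the model itself: there is a number for how good the system is, and a baseline to compare it with, before any user sees it.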
You run the data team at SecurityTrails as the Lead Data Scientist. Please tell our readers about your first project here.
Ilija: One thing you need to be able to do is discover domains, subdomains, and, basically, hostnames that are on the Internet. In order to do that, we need to figure out the hostnames that we didn't already see in the data streams we process. You can basically go and think of any combination of letters that forms a valid hostname. Since this is a more or less endless space, we had to figure out a reasonable guess of domains that exist so that we can physically process them together with explicit data streams.
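To make the size of that search space concrete, here is a minimal sketch that counts how many letter-only labels exist up to a given length and contrasts brute force with a much smaller guessed candidate set. It is not the method used at SecurityTrails, which the interview does not describe. The prefix list, the domain list, and all function names are hypothetical, and the validity check is purely syntactic.

```python
import re
from itertools import product

# One hostname label: 1-63 characters, letters, digits or hyphens,
# and no hyphen at the start or end.
LABEL_RE = re.compile(r"(?!-)[a-z0-9-]{1,63}(?<!-)")


def is_valid_hostname(name: str) -> bool:
    """Syntactic check only; it says nothing about whether the name resolves."""
    name = name.rstrip(".").lower()
    if not name or len(name) > 253:
        return False
    return all(LABEL_RE.fullmatch(label) for label in name.split("."))


def brute_force_space(max_len: int, alphabet_size: int = 26) -> int:
    """Number of letter-only labels with length 1..max_len."""
    return sum(alphabet_size ** n for n in range(1, max_len + 1))


def candidate_hostnames(domains, prefixes):
    """Combine already-seen domains with guessed prefixes, skipping invalid names."""
    seen = set()
    for prefix, domain in product(prefixes, domains):
        host = f"{prefix}.{domain}"
        if is_valid_hostname(host) and host not in seen:
            seen.add(host)
            yield host


# Hypothetical inputs for illustration, not real data.
KNOWN_DOMAINS = ["example.com", "example.org"]
COMMON_PREFIXES = ["www", "mail", "api", "-bad"]  # "-bad" is syntactically invalid

candidates = list(candidate_hostnames(KNOWN_DOMAINS, COMMON_PREFIXES))

print("labels of up to 3 letters:", brute_force_space(3))
print("labels of up to 10 letters:", brute_force_space(10))
print("guessed candidates:", candidates)
```

Enumerating every possible label grows exponentially with length, while a guessed list stays proportional to the number of known domains times the number of prefixes, which is why some form of informed guessing is needed.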
What attracted you to SecurityTrails?
Ilija: First of all, the quantity of data available. Although many companies work with a vast amount of data, it is not often that you get to work on billions of data points being processed every day. Having a chance to “data science” these huge silos is a challenge most people in my line of work would jump at.
What is the thing you are looking for the most moving forward at SecurityTrails?
Ilija: Building a system which allows for a sort of commodity approach to DNS security, or a commodity price level - where small companies can have a simple, readable dashboard that is going to tell them about the state of their domains and the security issues they are likely or unlikely to be facing.
I think that is something that could be useful to a huge number of companies that can't afford to do this kind of analysis themselves. To allow companies, with just one push of a button, to figure out what they should focus on when it comes to their security.
What is your vision for the future of data at SecurityTrails?
Ilija: Provide a ready-made, customisable overview of DNS security, driven by processing large data streams, and move from reporting analytics to discovering insights. One goal is to be able to provide “smart” oversight of possible issues surrounding a domain for small and medium companies that can’t invest resources in building security solutions. Coupled with that, I see securitytrails.com becoming a go-to source of data and insights on a large scale for companies building security solutions. We will be able to provide both raw data and “assisted-intelligence”, as in pre-trained models or aggregated statistics for security researchers and analysts to enhance their products.
For the last question: What advice would you give to aspiring data scientists?
Ilija: Try to solve a problem. It is easy to get stuck in a loop of books, online courses, GitHub examples, and library demos. For me these are all references to go back to, but as soon as you pass one introductory level course, there is no better way to learn than to try to solve a problem. Once you do that you have something to showcase, and in most cases employers want to see an understanding of procedures and methods coupled with a project showcase. Working on a self-defined problem will help you progress on both of those. I gave courses on data science intensively over the past five years, and this is the best advice I have. Some of my students built quite interesting demo projects, like prediction of daily calorie intake, self-sentiment detection based on emails and social media feeds, office lunch recommendation, friends clustering… It is just a great thing to show when you are looking for a new (and especially a first) data science job.
Have you learned something new about data science from this interview? Are there any other professional careers you think we need to dissect and find value in? Let us know by sending a tip to [email protected] and follow our blog for new additions to our interview series!
If you are also passionate about data and interested in joining our team, check out our [Careers][careers] page for more info on open positions.
