Notes from Industry

How Data Science Could Make Cybersecurity Troubleshooting Easier: Firewall Logs Analysis

Improving incident response operations with data analysis techniques and tools.

Luis Francisco Monge Martinez
Towards Data Science
Aug 19, 2021


Photo by Pixabay from Pexels

Without a doubt, we are in the information age: there have never been so many sources of information, and the cybersecurity sector is no different.

Anyone who starts working in data analysis constantly hears that most of the time is spent processing, correcting and standardizing the data before working with it. This could be a trivial task if the different manufacturers used standardized logs, if they were consistent across their different products, or even if the log format were easily processable… how nice that would be…😍

In the real world this is not the case: in incident response every customer is different, and it is very common to have to fight with logs of all kinds.

One of the most useful log sources in a large-scale incident is the firewall log, which provides information on network connections to and from the outside, internal organizational traffic and, in some cases, even VPN access by users.

In my experience, one of the manufacturers I encounter most often is Fortinet. This company's firewalls are very common and produce a type of log that, although widespread, is terrifying to process.

Fortinet logs example

This kind of log is a nightmare for bash and awk lovers, and for those of us who like to go further, it is an opportunity to explore and to demonstrate the importance of data analysis tools 😉.

In this case, the VPN traffic of a single site of a medium-sized organization will be analyzed. During one week, 14 million VPN traffic events were generated.

First step — transformation to CSV

Since the mission is to work with the logs from Pandas in order to obtain evidence, we need them in a format that Pandas can read.

This function allows us to extract the titles and fields of each line of the file as follows:

  • Extraction of attribute groups by separately capturing the data key, the separator(s) and the data. In this way, for each line, we can tell the keys apart from the values.
  • Assignment of the same index to all values.
  • Since some fields contain separators, three groups are generated; the results need to be grouped into a single column, eliminating the null results.
  • Generation of a single row with the data, using column 0 for titles and column 1 for values.
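The original gist is not reproduced here, but a minimal sketch of the same idea could look like this (the regex, the helper name and the quoting rules are assumptions):

import re
import pandas as pd

# Fortinet logs are key=value pairs; values may be unquoted or wrapped in quotes.
LOG_PATTERN = re.compile(r'(\w+)=(?:"([^"]*)"|(\S*))')

def parse_log_line(line: str) -> pd.DataFrame:
    """Turn one key=value log line into a single-row DataFrame."""
    matches = LOG_PATTERN.findall(line)
    # Each match yields the key plus two value groups (quoted / unquoted);
    # keep whichever is non-empty, mirroring the "group and drop nulls" step.
    record = {key: quoted if quoted else plain for key, quoted, plain in matches}
    return pd.DataFrame([record])

# Example:
# parse_log_line('date=2021-08-19 devname="FG100" action="tunnel-up" sentbyte=1024')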

The following function can be used to iterate over the files of an entire folder, reading each one and creating a Pandas DataFrame line by line:
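The original gist is not reproduced here; a minimal sketch of the same idea, reusing the parse_log_line helper sketched above (the output naming scheme and the .log extension are assumptions), might look like this:

import os
import glob
import pandas as pd

def logfile_to_csv(log_path: str, output_dir: str) -> None:
    """Parse a Fortinet log file line by line and write the result as CSV."""
    rows = []
    with open(log_path, "r", errors="ignore") as handle:
        for line in handle:
            if line.strip():
                rows.append(parse_log_line(line.strip()))  # one-row DataFrame per line
    df = pd.concat(rows, ignore_index=True)
    out_name = os.path.splitext(os.path.basename(log_path))[0] + ".csv"
    df.to_csv(os.path.join(output_dir, out_name), index=False)
    del df  # explicitly free the DataFrame before moving on to the next file

def folder_to_csv(input_dir: str, output_dir: str) -> None:
    """Apply the conversion to every log file in a folder."""
    for log_path in glob.glob(os.path.join(input_dir, "*.log")):
        logfile_to_csv(log_path, output_dir)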

This function does nothing special: it builds a DataFrame from the log, adding each line once it has been parsed.
Once it has finished with all the lines, it writes a CSV file to the indicated folder and deletes the DataFrame to free memory.
It can easily be modified so that the generated DataFrame is kept while all the files are processed, leaving the data in memory at the end.

These last two functions can be executed in parallel to achieve greater processing efficiency. For this, they need to be stored in a separate file so that the multiprocessing library can be used correctly from Jupyter.
After that, we can execute the following code to carry out the task in a parallelized way.
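A hedged sketch of that parallel run, assuming the conversion function lives in a separate module (here called fortinet_parser, a hypothetical name, so the worker processes can import it) and that the input folder is C:\logs_vpn:

import glob
from multiprocessing import Pool

# Hypothetical module containing the per-file conversion function sketched earlier.
from fortinet_parser import logfile_to_csv

log_files = glob.glob(r"C:\logs_vpn\*.log")  # input path is an assumption
output_dir = r"C:\logs_vpn_out"

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        # Each worker converts one log file to CSV at a time.
        pool.starmap(logfile_to_csv, [(path, output_dir) for path in log_files])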

Before continuing, there are two main obstacles. First, the number of columns in each file: due to the nature of these logs, different files contain different columns. Second, the amount of memory needed to process all the records. Let's go step by step 🏃🏻.

Adjusting columns with d6tstack

Due to the nature of these log formats, it is very common for the files to contain a different number of columns, which is not compatible with the data analysis libraries.
To fix this there is a library called d6tstack.
This library analyzes the CSV files and tells us which columns are present in each file and which are not.

import glob
import d6tstack.combine_csv
cfg_fnames = list(glob.glob(r'C:\logs_vpn_out\*.csv'))
c = d6tstack.combine_csv.CombinerCSV(cfg_fnames)
# check columns
print('all equal',c.is_all_equal())
print('')
d = c.is_column_present()

And the output will be something like this.

d6tstack columns inspection

To correct this, the library itself offers the possibility of rewriting the data into new files with aligned columns.

fnames = d6tstack.combine_csv.CombinerCSV(cfg_fnames).to_csv_align(output_dir=r'C:\logs_vpn_out_fix')

Once fixed we can move on to the next phase, reading the files.

Fighting the need for memory 💪💻

This is a very common problem in data analysis: we have a lot of data and not enough RAM to process it.
When our mission is to store standardized information on a recurring basis, we usually choose to load it into a platform such as Spark or Elasticsearch, but in incident response it is common to have to quickly analyze a type of event that we will very likely never see again.

In order to deal with these situations there is Dask.

https://dask.org/

This data analysis library offers us the possibility to perform queries on log files in a parallelized way using the resources of our computer.
In other words, it allows us to work efficiently with large amounts of data without having to load them into memory.

When you are used to working with Pandas, where reading files is straightforward but time consuming, it is surprising that reading files with Dask is practically instantaneous.

Dask DataFrame creation

It is important to note that at this point Dask has not read the files; it has only partitioned the data and will read it when it needs to, which is why this stage is so fast.
The df object is not a pandas.DataFrame but a dask.dataframe.DataFrame; it is important not to forget this.
When creating the DataFrame, it is very useful to specify the data type of each column to speed up queries and avoid problems.
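The creation itself is only shown as a screenshot above; a minimal sketch, assuming the fixed CSVs from the previous step and typing only the columns used later, could be:

import dask.dataframe as dd

df = dd.read_csv(
    r"C:\logs_vpn_out_fix\*.csv",
    dtype={"remip": "object", "tunnelid": "object",
           "sentbyte": "float64", "duration": "float64"},
    blocksize="64MB",  # each file is split into partitions Dask can process in parallel
)
df_dask = df  # the same lazy Dask DataFrame, referred to by both names later on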

One of the most common use cases in log analysis is the search for exfiltration of information during an incident. To do this, we will perform a query with Dask to obtain a small subset of data that we can work with comfortably in memory.

df_exf = df[["remip", "tunnelid", "sentbyte"]].groupby(["remip","tunnelid"]).max().compute()

This query took 1 minute and 28 seconds to run, and this is where the big difference between Pandas and Dask lies: in Pandas this query would have been much faster, had the data fit in memory 😅.
The .compute() method materializes the Dask result as a Pandas DataFrame.

In 1 minute and 28 seconds we have queried 14 million logs, grouped them and loaded the result into a DataFrame. Dask is awesome!

The query does not make much sense unless you know how these logs are recorded. Each time a user opens a session to another IP address over the VPN, a unique identifier called tunnelid is assigned. The sentbyte field records the amount of information leaving the organization during that session, but instead of recording the number of bytes of each individual connection, the value is cumulative, so only the highest value of all the connections made during a session needs to be taken into account.

To facilitate the reading of the information in bytes, I will use this code.
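The snippet itself is not shown here; a minimal sketch of such a byte-formatting helper (the helper and column names are assumptions) could be:

def human_bytes(num_bytes: float) -> str:
    """Format a byte count using binary units."""
    for unit in ["B", "KB", "MB", "GB", "TB"]:
        if num_bytes < 1024:
            return f"{num_bytes:.2f} {unit}"
        num_bytes /= 1024
    return f"{num_bytes:.2f} PB"

df_exf = df_exf.reset_index()  # bring remip and tunnelid back as columns
df_exf["sent_human"] = df_exf["sentbyte"].map(human_bytes)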

The first two IP addresses are legitimate and are not relevant to the investigation. This is the graph of the first five IP addresses to investigate.

Information sent to external IPs

At this point it is necessary to enrich the data we have. For this, we are going to use several free services: MaxMind GeoIP, MaxMind ASN, AbuseIPDB and the Dan Tor List.

MaxMind ASN

https://www.maxmind.com

This service will allow us to obtain the AS (autonomous system) of the IP addresses we need. MaxMind offers a database that we can download and query offline very quickly.

IP address ASN

With this function we will be able to obtain the name of the company to which the IP addresses of the events belong.
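A hedged sketch of such a lookup, using MaxMind's geoip2 Python library and the offline GeoLite2-ASN database (the database path and column names are assumptions):

import geoip2.database
import geoip2.errors

asn_reader = geoip2.database.Reader(r"C:\maxmind\GeoLite2-ASN.mmdb")  # path is an assumption

def ip_to_asn(ip: str) -> str:
    """Return the organization that announces the IP address, or 'unknown'."""
    try:
        return asn_reader.asn(ip).autonomous_system_organization
    except geoip2.errors.AddressNotFoundError:
        return "unknown"

df_exf["asn_org"] = df_exf["remip"].map(ip_to_asn)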

MaxMind GeoIP

The same company offers an IP geolocation database that allows us to know the location of an IP address quite reliably.

In the same way we will be able to know the locations of the IPs of our connections.
On a map you can always see everything better 😏.
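A hedged sketch of the country lookup and the map, again with the geoip2 library and the GeoLite2-City database plus plotly for the choropleth (the path and the plotting choices are assumptions):

import geoip2.database
import geoip2.errors
import plotly.express as px

city_reader = geoip2.database.Reader(r"C:\maxmind\GeoLite2-City.mmdb")  # path is an assumption

def ip_to_country(ip: str) -> str:
    """Return the country name for an IP address, or 'unknown'."""
    try:
        return city_reader.city(ip).country.name or "unknown"
    except geoip2.errors.AddressNotFoundError:
        return "unknown"

df_exf["country"] = df_exf["remip"].map(ip_to_country)

# Choropleth of the amount of data sent per country
by_country = df_exf.groupby("country", as_index=False)["sentbyte"].sum()
fig = px.choropleth(by_country, locations="country",
                    locationmode="country names", color="sentbyte")
fig.show()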

Amount of data by country

Dan Tor Node List

It is quite common for attackers to use Tor network nodes to anonymize their communications, and checking for them is a quick test that can give us a good lead to follow.
An updated list of Tor nodes can be found at the website dan.me.uk.
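A hedged sketch of that check (the exact URL is an assumption, and dan.me.uk rate-limits downloads, so the list should be cached locally):

import requests

TOR_LIST_URL = "https://www.dan.me.uk/torlist/"  # assumed endpoint; cache the response
tor_nodes = set(requests.get(TOR_LIST_URL, timeout=30).text.splitlines())

df_exf["is_tor"] = df_exf["remip"].isin(tor_nodes)
print(df_exf[df_exf["is_tor"]][["remip", "sentbyte"]])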

Tor node communications

Although they are not related to the data exfiltration, these are connections whose legitimacy will have to be checked.

AbuseIPDB

AbuseIPDB is a well-known domain and IP address reputation service. Through its API it allows us to perform 1,000 queries per day for free.

AbuseIPDB code
AbuseIPDB IP address
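A hedged sketch of the kind of check shown above, using the AbuseIPDB v2 API (the API key is a placeholder and the column names are assumptions):

import requests

API_KEY = "YOUR_ABUSEIPDB_API_KEY"  # placeholder

def abuse_score(ip: str) -> int:
    """Return the AbuseIPDB abuse confidence score (0-100) for an IP address."""
    response = requests.get(
        "https://api.abuseipdb.com/api/v2/check",
        headers={"Key": API_KEY, "Accept": "application/json"},
        params={"ipAddress": ip, "maxAgeInDays": 90},
        timeout=30,
    )
    return response.json()["data"]["abuseConfidenceScore"]

df_exf["abuse_score"] = df_exf["remip"].map(abuse_score)  # mind the 1,000 queries/day limit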

In this case, we see multiple connections to IP addresses with a very bad reputation, which will have to be analyzed to verify their legitimacy.

At this point we are going to make a new query in order to perform new visualizations of the data to try to identify anomalies.

df = df_dask[~df_dask["tunnelid"].isnull()][["remip", "duration", "tunnelid", "sentbyte"]].compute()

This query will return a subset of all the data which, through the following code, we will aggregate per IP address.
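That snippet is not shown here; a hedged sketch of the per-IP aggregation it performs (column names and the exact grouping are assumptions based on the four variables described below):

# First aggregate per tunnel, since sentbyte is a cumulative counter per session
per_tunnel = df.groupby(["remip", "tunnelid"]).agg(
    sentbyte=("sentbyte", "max"),
    duration=("duration", "max"),
    events=("sentbyte", "count"),
).reset_index()

# Then aggregate per remote IP address
df_ips = per_tunnel.groupby("remip").agg(
    tunnels=("tunnelid", "nunique"),    # tunnels created to each IP
    connections=("events", "sum"),      # total connections made to each IP
    duration=("duration", "sum"),       # total duration of those connections
    sentbyte=("sentbyte", "sum"),       # total bytes sent to each IP
).reset_index()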

Now we have four variables to plot: the number of tunnels created to each IP address, the total number of connections made to each IP address, the total duration of the connections made to each IP address, and the number of bytes sent.
First, if we plot the number of connections against the total duration, the resulting graph is this one.

Connections vs Duration

In red I have highlighted the IP addresses that were used in the exfiltration of the data and that would be one of our targets.
Looking at the data displayed, the malicious IPs are hidden among most of the other connections, but this graph can help us detect two types of anomalies that are very common in cybersecurity.
On the one hand, the upper left part shows IP addresses that receive many short-duration connections, as would be the case with Cobalt Strike beacons, for example; in the lower right part, long connections such as SSH or RDP sessions will appear.

In the following graph we are going to relate the data sent with the duration of the connections.

Sent data vs Duration

As you can see, this is the perfect graph for detecting the exfiltration of a large amount of information. The upper left corner shows the IP addresses that received short connections but were sent a lot of information.
Finally, we will visualize the ratio between information sent and number of connections.

Sent data vs connections

This graph can help us visualize both data exfiltration over few connections, such as uploading files to the cloud or sending files via SSH, and data exfiltration over covert channels such as DNS or ICMP, where we would see many connections and a fairly high amount of information sent.
To finish the article, we will perform a more advanced analysis to try to detect anomalies in the connections using the four variables at the same time.

PCA — Principal Component Analysis

Simplifying a lot, this machine learning technique allows us to reduce the number of dimensions while preserving as much of the distinctiveness of our data as possible.
We will go from four dimensions to two, so that we can represent the data graphically and try to spot anomalies.
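A hedged sketch of that reduction with scikit-learn (scaling beforehand and the column names are assumptions):

import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

features = ["tunnels", "connections", "duration", "sentbyte"]
X = StandardScaler().fit_transform(df_ips[features])  # PCA is sensitive to scale

pca = PCA(n_components=2)
components = pca.fit_transform(X)

df_pca = pd.DataFrame(components, columns=["pc1", "pc2"])
df_pca["remip"] = df_ips["remip"].values  # keep the remote IP to label points in the plot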

As you can see in the code, a reduction to two dimensions is performed and the remote IP column is then added back to make it easier to identify the data in the graph. The result will look like this.

PCA results

And once displayed on a graph, we can identify anomalies in the connections.

PCA data graph

In this way, we could quickly identify anomalies in the connections that would give us a clue to start investigating.

Although a multitude of further analyses could be made on these data, it is time to end the article and continue in the next one. The complete notebook can be obtained at this link.

All in all, it is clear that data analytics is one of the keys to the future in virtually every industry, but in the case of cybersecurity it is and will be a core skill for analysts.
Knowing how to use the tools at our disposal will make the difference in time and results during the investigation of an incident or during monitoring.

I will continue to write about notebooks that are useful in my daily work, and I hope you find them helpful.

Happy research!
