Using LLMs and GPT to streamline data analysis in cybersecurity incidents.

Or another way to forget programming 😁

Luis Francisco Monge Martinez
11 min read · May 10, 2023
Image: Reddit user MsElsaMars

Introduction

It is difficult to write an article about ChatGPT without feeling like one of those people who jump on a trend and talk about a topic simply because everyone else is doing it.

In all my working life I cannot think of a technological advance that has generated as much noise as this one, and of course I had to try it out on some of the problems that come up in the day-to-day of incident response.

For this article I am going to revisit one of my previous articles and try to find a different way of obtaining the same results.

In this article I talked about how to analyse firewall logs during a ransomware incident using Python and my good friend the Pandas library.

Although Pandas was one of my greatest discoveries, its learning curve is not easy, and some of my colleagues never got to use the tools I had built for the simple reason that understanding, modifying or generating Python/Pandas code is not easy.

In this article I am going to attempt a complicated goal: to abstract the analyst from the code so that he or she can focus only on asking the right questions.

I hope this idea excites you as much as it does me, and that I can keep your attention throughout the article, because it will be worth it.

What is ChatGPT and how does it work?

Just kidding; I'm not going to explain what ChatGPT is or how it works. I'm sure you've already read that too many times, and if not, there are people far better prepared than me to do it 😊.

The only thing I would like to mention in this section is that, although since its inception the tool has been used as a source of truthful information, that is not its purpose. If we do not keep this in mind, we are very likely to be disappointed and to miss the benefit we can get from it.

A few weeks ago, OpenAI together with DeepLearning.AI released a free course called "ChatGPT Prompt Engineering for Developers", and I cannot recommend it enough to anyone interested in this technology. It really changed my understanding of Large Language Models (LLMs) and of the enormous work OpenAI has done.

Getting started with Jupyter Notebooks 🪐

The OpenAI team has developed a very intuitive Python SDK that we are going to use throughout this article; it is easily installed with pip.

pip install openai

To use the OpenAI API you need an API key: register at https://openai.com and generate one at https://platform.openai.com/account/api-keys.
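As a side note, rather than pasting the key directly into the notebook, a safer pattern is to read it from an environment variable; a minimal sketch:

import os
import openai

# Read the API key from the environment instead of hard-coding it.
openai.api_key = os.environ["OPENAI_API_KEY"]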

With the library installed and an API Key in our possession we can get started!

import openai

openai.api_key = "YOUR_API_KEY"

# Send a single user message to the chat model and return its text reply.
def get_completion(prompt, model="gpt-3.5-turbo", temperature=0):
    messages = [{"role": "user", "content": prompt}]
    response = openai.ChatCompletion.create(
        model=model,
        messages=messages,
        temperature=temperature,
    )
    return response.choices[0].message["content"]

To quickly explain the previous code: after giving the openai library our API key, we define a function called get_completion that we will use to interact with ChatGPT and obtain its answers.

We pass our information and questions to the function as text, and it returns the answers as text. The parameters we can modify are the model to use and the temperature.
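For instance, a quick test call might look like this (the prompt here is just an illustration):

# The default temperature=0 keeps the answers as deterministic as possible.
answer = get_completion("What fields does a firewall traffic log usually contain?")
print(answer)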

On the OpenAI website they explain the different models they have available right now and their functionalities.

DESCRIPTION OF MODELS

GPT-4 — Limited beta — Set of models that improve GPT-3.5 and can understand and generate natural language or code.

GPT-3.5 — Set of models that improve on GPT-3 and can understand and generate natural language or code.

DALL-E — Beta — Model capable of generating and editing images from a natural language instruction.

Whisper — Beta — Model capable of converting audio to text.

Embeddings — Set of models capable of converting text into numeric form.

Using GPT as intended 🤖

One of the most common mistakes made when using ChatGPT is to attribute special powers to it that we do not usually attribute to any other type of programme, and this is perhaps the fault of all the noise that has been generated around it.

It is true that ChatGPT works really well with generic questions and helps us by providing solutions to quite complex problems if the situations are not extremely particular.

Things change when we try to delve into complex problems: problems where experts have to spend time reasoning, understand the context and have enough information to be able to move forward.

Just like us, ChatGPT needs context and as much information as we are able to give it to understand the problem, and even, where necessary, to be guided through the logical process so that it can reach the goal we are looking for.

Let’s start with a simple example to take the problem I mentioned above to the absurd.

prompt = f""" How do I know the IP address of the command and control console? """
response = get_completion(prompt)
print(response)

As expected, the answer is not very useful for our research.

Let us list what went wrong with our approach:

  • Lack of context.
  • Lack of necessary information.
  • Lack of indications of what we expect from it.
  • Lack of necessary instructions to give us what we want.

Let’s continue little by little to solve these problems.

The importance of context 👨‍🏫

We often forget how particular our own way of thinking is, and we tend to assume that others, or in this case artificial intelligences, should understand how we think, rather than the other way around, as would be logical.

When working with LLMs, context is perhaps the most important factor in improving the results we obtain.

Following the example above, let’s provide context to our question.

prompt = f""" Context: You have to provide Python code to be used in a Jupyter Notebook.
Your expected output is Python code to assist a cybersecurity analyst in an investigation during a ransomware attack on an enterprise network.
We are going to analyse a file containing traffic events from a Palo Alto brand firewall. The content is already in Pandas DataFrame format in a variable called 'data'.
Question: How do I know the IP address of the command and control console? """
response = get_completion(prompt)
print(response)

As we can see, the result is taking shape, but it is certainly far from what we are looking for; if only it were as easy as simply asking for the addresses of the command and control consoles.

If you have information, tell it 📓

Although one of the reasons for using this technology is to obtain information, it is important to understand that this is not its main purpose. If we provide as much information as possible in our queries, we will see a great improvement in the answers.

It is obvious that in this kind of situation, when you have GBs of logs and even a 200 MB $MFT file, it is not realistic to send all that information to the AI for processing: besides the data protection problems, the API costs might turn out higher than expected 😉. Instead, we are going to tell the AI what our data looks like, giving it as much contextual information as possible so that the data itself becomes irrelevant.
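For reference, a minimal sketch of how the logs could be loaded; the filename and the CSV format are assumptions about your environment:

import pandas as pd

# Hypothetical export of the Palo Alto traffic logs; adjust the filename
# and separator to your own case.
data = pd.read_csv("palo_alto_traffic.csv")
print(data.shape)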

In this case we are going to provide GPT with the columns contained in the logs and a description of each field, and of course, to write the descriptions we will use GPT 😊

prompt = f"""
Create a list of log fields from a Palo Alto brand firewall.
The fields to list are: {data.columns.tolist()} and the expected format is:
Field name: Field description in less than 10 words.
"""
fields = get_completion(prompt)
print(fields)

With this data and the private IP range information we can try again.

prompt = f"""Context: You are a Python code generator to be used in a Jupyter Notebook.
Your expected output is Python code to assist a cybersecurity analyst in an investigation during a ransomware attack on an enterprise network.
We are going to analyse a file containing traffic events from a Palo Alto brand firewall. The content is already in Pandas DataFrame format in a variable called 'data'.
The columns in this DataFrame are: {fields}
And the private IP address ranges are:
- 10.0.0.0 to 10.255.255.255
- 172.16.0.0 to 172.31.255.255
- 192.168.0.0 to 192.168.255.255
Question: How do I know the IP address of the command and control console?"""
response = get_completion(prompt)
print(response)

Asking the right question 🙋

After multiple tests with GPT, it is clear to me that the role of the analyst is still essential: not just someone who understands how the technology works, but someone who understands the problem and knows which questions to ask, even when abstracting away the intermediate process.

In this case we are going to replace the generic question used above with much more precise questions that can guide our investigation.

We will use the same context as before but change the question.

Question: List for me the external IP addresses to which the most data has been sent.

Although some things are not well implemented, such as the filtering of private IP addresses, the proposed code works and returns a result that could be used to look for signs of data exfiltration.
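The code itself is not reproduced here, but a corrected sketch of that kind of answer could look like this; the column names 'dst_ip' and 'bytes_sent' are assumptions about the log schema:

import ipaddress

# Decide whether an address is private (RFC 1918 and similar ranges).
def is_private(ip):
    try:
        return ipaddress.ip_address(ip).is_private
    except ValueError:
        return True  # treat unparsable values as not external

# Keep traffic towards external destinations and rank it by bytes sent.
external = data[~data["dst_ip"].apply(is_private)]
top_talkers = (external.groupby("dst_ip")["bytes_sent"]
               .sum()
               .sort_values(ascending=False)
               .head(10))
print(top_talkers)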

Now that we have obtained a satisfactory result, let's increase the complexity of our questions to get something more interesting and realistic. Here is a new question.

Question: List the external IP addresses to which the most web-browsing connections were made between 18:00 and 8:00 and show me in a column the number of unique source IP addresses that made connections to each of the external IP addresses.

In this case we received only one IP address, and that is quite suspicious. In an incident of this type, the fact that only two source IP addresses are making HTTP connections after hours is something we need to analyse in depth.
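As an illustration, the Pandas logic behind such an answer could be sketched as follows, under the same assumed schema (hypothetical columns 'receive_time', 'app', 'src_ip' and 'dst_ip'):

import ipaddress
import pandas as pd

def is_private(ip):
    try:
        return ipaddress.ip_address(ip).is_private
    except ValueError:
        return True

# Web-browsing traffic to external IPs between 18:00 and 08:00.
hours = pd.to_datetime(data["receive_time"]).dt.hour
after_hours = data[((hours >= 18) | (hours < 8)) & (data["app"] == "web-browsing")]
external = after_hours[~after_hours["dst_ip"].apply(is_private)]

# For each external IP, count connections and unique internal sources.
result = (external.groupby("dst_ip")
          .agg(connections=("src_ip", "size"),
               unique_sources=("src_ip", "nunique"))
          .sort_values("connections", ascending=False))
print(result.head(10))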

Guiding AI in the right direction 🔍

Another surprising thing we can do is to tell the AI the right way to get to the result we want.

In some cases we are clear about the path to follow to reach the goal, but we do not want to spend time looking for the Python functions that would get us there. In other words, we know the what but not the how, and for that we can use GPT to help us.

Question: We want to find external IP addresses suspected of being command and control consoles, so we are going to look for repetitive connections throughout the day between two IP addresses. Follow these steps:
- Filter for connections to external IP addresses.
- Extract the hour of each connection, without taking the minutes into account.
- Group the connections by source and destination IP.
- Within that grouping, count how many unique connection hours exist, sum the bytes sent to the destination IP, and count the number of connections made between the two IP addresses.
- Display the results in a table sorted in descending order by the number of connection hours.
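A sketch of Pandas code following those steps, under the same hypothetical schema as before, could look like this:

import ipaddress
import pandas as pd

def is_private(ip):
    try:
        return ipaddress.ip_address(ip).is_private
    except ValueError:
        return True

# Step 1: keep only connections to external IP addresses.
ext = data[~data["dst_ip"].apply(is_private)].copy()

# Step 2: hour of each connection, ignoring the minutes.
ext["hour"] = pd.to_datetime(ext["receive_time"]).dt.hour

# Steps 3-5: group by source/destination pair, aggregate and sort.
beacons = (ext.groupby(["src_ip", "dst_ip"])
           .agg(unique_hours=("hour", "nunique"),
                bytes_sent=("bytes_sent", "sum"),
                connections=("hour", "size"))
           .sort_values("unique_hours", ascending=False))
print(beacons.head(10))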

When we run the code GPT returns, we get the following result.

In the result we have IP addresses that we should analyse: thousands of connections between two IP addresses spread across 13 different hours of the day (all of the hours present in the log) is suspicious and could correspond to a beacon contacting a command and control console.

Finally, as a last example, we are going to look for long connections in the traffic. These connections are usually interesting to examine, as more and more attackers are using remote assistance tools like TeamViewer to evade detection and maintain access to the organisation.

To do this we will use the following question.

Question: We want to create a graph of long connections between an internal IP address and an external IP address by following the steps below.
- Filter the communications to external IP addresses.
- Sum the connection time to each destination IP and store it in a separate column called 'sum_time'.
- Count the number of connections made to each destination IP and store it in a separate column called 'sum_conn'.
- Filter and keep the results with 'sum_conn' greater than 10.
- Keep only one row for each destination IP address.
- Divide the total connection time by the number of connections and store the result in a column called 'avg_conn'.
- Keep the ten results with the highest 'avg_conn'.
- Create a graph with matplotlib where the 'x' axis is the number of connections and the 'y' axis is the total connection time.
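A sketch following those steps, assuming a hypothetical 'elapsed_time' column that holds each session's duration in seconds:

import ipaddress
import pandas as pd
import matplotlib.pyplot as plt

def is_private(ip):
    try:
        return ipaddress.ip_address(ip).is_private
    except ValueError:
        return True

# Communications towards external IP addresses.
ext = data[~data["dst_ip"].apply(is_private)]

# Total connection time and number of connections per destination IP.
agg = (ext.groupby("dst_ip")
       .agg(sum_time=("elapsed_time", "sum"),
            sum_conn=("elapsed_time", "size"))
       .reset_index())

# Keep destinations with more than 10 connections and compute the average.
agg = agg[agg["sum_conn"] > 10]
agg["avg_conn"] = agg["sum_time"] / agg["sum_conn"]
top = agg.nlargest(10, "avg_conn")

# Scatter plot: number of connections vs total connection time.
plt.scatter(top["sum_conn"], top["sum_time"])
plt.xlabel("Number of connections")
plt.ylabel("Total connection time (s)")
plt.title("Long connections to external IP addresses")
plt.show()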

When we run GPT's response, we get the resulting graph.

In that graph, the first result, with an average duration of 840 seconds per connection, is certainly something we should analyse.

Bonus

Although working with Jupyter Notebooks already makes the job much easier and very comfortable, we can create a small interface to abstract the analyst even further from the code and make everything more intuitive.
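As an illustration of the idea, here is a minimal sketch of such an interface built with ipywidgets; this is one possible approach of my own, reusing the get_completion function and a 'context' string built as in the previous sections:

import ipywidgets as widgets
from IPython.display import display

question = widgets.Textarea(placeholder="Ask a question about the logs...")
button = widgets.Button(description="Ask GPT")
output = widgets.Output()

def on_click(_):
    with output:
        output.clear_output()
        # 'context' is the prompt preamble built earlier in the notebook.
        print(get_completion(f"{context}\nQuestion: {question.value}"))

button.on_click(on_click)
display(question, button, output)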

I have left the notebook on my GitHub, and I hope you feel encouraged to try GPT in your investigations and that you enjoy it as much as I did.


Luis Francisco Monge Martinez

Cyber incident response analyst obsessed with data analysis.