
Beating GPT-4: Fine-Tuning Llama-3 for Software Security


In this blog post, I will walk you through the process of fine-tuning and evaluating Llama-3 and DeepSeek-Coder for detecting software vulnerabilities, and share the key insights and takeaways gained from this work.

Large Language Models (LLMs) have become ubiquitous on the internet, known for their versatility in performing a wide range of tasks. At inovex, we leverage LLMs for various internal and external projects. In my thesis, I explored the potential of LLMs to detect software vulnerabilities and how fine-tuning can enhance their performance.

The Problem

Software vulnerabilities can emerge in any project, no matter its size. Detecting them quickly is important, as undetected vulnerabilities can result in significant financial losses and erode user trust. [1]
To address this issue, static application security testing (SAST) tools are commonly used, relying on pattern matching, data and control flow analysis, and more. However, traditional SAST tools struggle to identify more complex vulnerabilities and cannot adapt to new ones. This is where advanced approaches, such as machine learning, come into play: they are able to generalize to unseen cases and detect emerging vulnerabilities. By fine-tuning LLMs, I aimed to improve the detection of intricate vulnerabilities that conventional tools might miss.

Experimental Evaluation

In the following, I will introduce the dataset and the prompts used for the experiments. Figure 1 shows the fine-tuning and evaluation process, offering a structured overview.

Figure 1: Fine-tuning and evaluation pipeline.

Dataset

I chose to use the Draper VDISC dataset, as it provides 1.27 million synthetic and real-world function-level samples of C and C++ source code. Each data point is labeled with its respective Common Weakness Enumeration (CWE), summarized in Table 1.

| CWE | Frequency | Description |
| --- | --- | --- |
| CWE-120 | 3.70% | Buffer Copy without Checking Size of Input ("Classic Buffer Overflow") |
| CWE-119 | 1.90% | Improper Restriction of Operations within the Bounds of a Memory Buffer |
| CWE-469 | 0.95% | Use of Pointer Subtraction to Determine Size |
| CWE-476 | 0.21% | NULL Pointer Dereference |
| CWE-other | 2.70% | Improper Input Validation, Use of Uninitialized Variables, Buffer Access with Incorrect Length Value, etc. |

Table 1: Types of vulnerabilities in the Draper VDISC dataset and their frequencies across the entire dataset.

This dataset was chosen as most system-level vulnerabilities emerge from C and C++ code. Since both C and C++ allow for manual memory handling, they provide programmers with a high degree of control and flexibility over system resources. However, this flexibility comes at the cost of increased risk of memory management errors, such as buffer overflows, use-after-free vulnerabilities, and dangling pointers.
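To make the later steps concrete, here is a minimal sketch of how the dataset can be loaded and reduced to a binary label. It assumes the public HDF5 release of VDISC with one boolean flag per CWE class next to the function source; the exact keys (functionSource, CWE-119, …) are taken from that release and may differ in your copy, and this is not the exact code used in the thesis.

```python
import h5py
import pandas as pd

# CWE flags present in the public VDISC release (see Table 1).
CWE_COLUMNS = ["CWE-119", "CWE-120", "CWE-469", "CWE-476", "CWE-other"]

def load_vdisc_split(path: str) -> pd.DataFrame:
    """Load one VDISC split and derive a binary 'vulnerable' label."""
    with h5py.File(path, "r") as f:
        data = {"function": [code.decode("utf-8", errors="replace")
                             for code in f["functionSource"][:]]}
        for cwe in CWE_COLUMNS:
            data[cwe] = f[cwe][:].astype(bool)
    df = pd.DataFrame(data)
    # A function counts as vulnerable if any of its CWE flags is set.
    df["vulnerable"] = df[CWE_COLUMNS].any(axis=1)
    return df

train_df = load_vdisc_split("VDISC_train.hdf5")
print(train_df["vulnerable"].value_counts(normalize=True))
```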

Prompt design

As prompt design has a great impact on the performance of LLMs, I decided to use well-performing prompts from related work instead of creating new ones. The prompts used are shown in Table 2. Both prompts ask for a binary response, allowing for easy automatic classification.

| Prompt ID | Prompt |
| --- | --- |
| PD-1 [2] | Is this code vulnerable? Answer in only Yes or No. |
| PD-2 [3] | I want you to act as a vulnerability detection system. Is the following function buggy? Please answer Yes or No. |

Table 2: Prompts used for evaluation and fine-tuning.
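For illustration, this is roughly how a prompt from Table 2 can be combined with a function under test into a chat message for the instruct models. The single-user-turn message layout is an assumption of this sketch, not taken from the thesis.

```python
PD_1 = "Is this code vulnerable? Answer in only Yes or No."
PD_2 = ("I want you to act as a vulnerability detection system. "
        "Is the following function buggy? Please answer Yes or No.")

def build_messages(prompt: str, function_source: str) -> list[dict]:
    """Wrap the prompt and the C/C++ function into a single user turn
    (the message layout is an assumption of this sketch)."""
    return [{"role": "user", "content": f"{prompt}\n\n{function_source}"}]

messages = build_messages(
    PD_1,
    "void copy(char *dst, const char *src) { strcpy(dst, src); }",
)
```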

Results of pre-trained LLMs

To compare the effectiveness of the pre-trained and fine-tuned models, I conducted an evaluation of several pre-trained large language models. Among these, GPT-4-0613 was used as the baseline for comparison. This approach allowed for a clear assessment of the improvements gained through fine-tuning.

Since most models do not strictly answer with "yes" or "no", I initially reviewed some outputs manually and identified specific keywords for pattern matching the results. Examples of keywords for positive responses include "yes" and "is vulnerable". Responses that could not be classified automatically using these keywords were then reviewed and classified manually.
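A minimal sketch of this keyword-based post-processing is shown below; the keyword lists are illustrative examples, not the complete set used in the thesis.

```python
import re

def classify_response(response: str) -> bool | None:
    """Map a free-form model answer to True (vulnerable), False
    (not vulnerable) or None (unclassified, left for manual review)."""
    text = response.strip().lower()
    # Strict case: the answer starts with "yes" or "no".
    if re.match(r"^yes\b", text):
        return True
    if re.match(r"^no\b", text):
        return False
    # Fallback: phrases observed during manual review (illustrative).
    if "is vulnerable" in text:
        return True
    if "not vulnerable" in text:
        return False
    return None

print(classify_response("Yes, the function is vulnerable to a buffer overflow."))  # True
print(classify_response("The snippet looks fine to me."))  # None
```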

| Model | Prompt | Unclassified | Precision | Recall | F1-score |
| --- | --- | --- | --- | --- | --- |
| Deepseek-coder-6.7b-instruct | PD-1 | 143 | 0.040 | 0.378 | 0.072 |
| GPT-4-0613 | PD-1 | 1 | 0.164 | 0.442 | 0.239 |
| Llama-2-7b-chat | PD-1 | 129 | 0.052 | 0.957 | 0.099 |
| Llama-2-7b-chat | PD-2 | 394 | 0.051 | 1.000 | 0.098 |
| Llama-3-8B-Instruct | PD-1 | 138 | 0.073 | 0.638 | 0.131 |
| Mistral-7B-Instruct-v0.2 | PD-1 | 360 | 0.055 | 0.513 | 0.100 |
| Mistral-7B-Instruct-v0.2 | PD-2 | 941 | 0.103 | 0.600 | 0.177 |

Table 3: Comparing pre-trained models on vulnerability detection.
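The precision, recall, and F1-score columns are standard binary-classification metrics. As a minimal sketch, they can be computed with scikit-learn once every response has been mapped to a binary prediction; excluding the remaining unclassified samples, as done here, is a simplification of mine and not necessarily how they were scored in the thesis.

```python
from sklearn.metrics import precision_recall_fscore_support

# Toy values: ground-truth labels and keyword-classified predictions
# (None = unclassified). In the real evaluation these come from the
# dataset labels and from classify_response().
y_true = [True, False, False, True, False, True]
y_pred = [True, False, True, None, False, True]

# Drop unclassified responses before scoring (simplification).
pairs = [(t, p) for t, p in zip(y_true, y_pred) if p is not None]
truth, preds = zip(*pairs)

precision, recall, f1, _ = precision_recall_fscore_support(
    truth, preds, average="binary", pos_label=True
)
print(f"Precision {precision:.3f}  Recall {recall:.3f}  F1 {f1:.3f}")
```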

The results indicated that prompt PD-1 performed considerably better than PD-2, especially in producing a reasonable, classifiable output.
GPT-4 performed best overall and almost always returned only "yes" or "no", with just one sample remaining unclassified.

For the fine-tuning process, I chose Llama-3 and DeepSeek-Coder: Llama-3 showed the best performance among the open-weight models, and DeepSeek-Coder was trained mainly on source code, which could help it understand the given functions.

Fine-Tuning Llama-3 and DeepSeek-Coder

For fine-tuning both models, I used the script provided by Phil Schmid in his blog post about fine-tuning Llama-3-70B. Below, I share insights gained from the experiments that could be valuable for further work in this direction. The insights are very similar for Llama-3 and DeepSeek-Coder, and some of them might also apply to other tasks where an LLM has to perform binary classification. They concern the chat template for Llama-3 and the hyperparameters.

Chat template:
Phil Schmid's script uses an Anthropic-/Vicuna-like chat template. In my experiments, I discovered that the Llama-3-specific chat template greatly improved performance: since the model was trained with this template, it can better leverage its prior knowledge. The number of unclassified samples from the fine-tuned models dropped from around 20% to 0.
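As a minimal sketch (not the exact training code from the thesis), formatting a sample with the tokenizer's built-in Llama-3 template via apply_chat_template could look like this:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

def format_sample(function_source: str, vulnerable: bool) -> str:
    """Render one training sample with the model's own chat template
    instead of a hand-written Vicuna-style template."""
    messages = [
        {"role": "user",
         "content": "Is this code vulnerable? Answer in only Yes or No.\n\n"
                    + function_source},
        {"role": "assistant", "content": "Yes" if vulnerable else "No"},
    ]
    # tokenize=False returns the formatted string including the Llama-3
    # special tokens (<|begin_of_text|>, <|start_header_id|>, ...).
    return tokenizer.apply_chat_template(messages, tokenize=False)
```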

Batch size:
Typically, a larger batch size stabilizes the fine-tuning process. In this setting, however, a batch size greater than 4 led to almost no samples being classified as positive: because the dataset is imbalanced, each batch contains more negative samples than positive ones, which pushes the training towards negative responses.
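With Hugging Face TrainingArguments, the effective batch size is per_device_train_batch_size times gradient_accumulation_steps (times the number of GPUs). A sketch with placeholder values that keeps the effective batch size at 4; these are not the exact hyperparameters from the thesis.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="llama3-vuln-detection",  # placeholder path
    per_device_train_batch_size=2,
    gradient_accumulation_steps=2,       # effective batch size of 4
    learning_rate=2e-4,                  # placeholder, not the thesis value
    num_train_epochs=1,
    bf16=True,
    logging_steps=50,
)
```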

Training split:
There are several approaches to constructing a training dataset from imbalanced data. Most sources suggest a 50:50 split [4] to reduce overfitting on the majority class. In my tests, however, a training dataset in which 25% of the samples are vulnerable worked best: the dataset does not contain enough vulnerable samples for a large 50:50 split, and a 50:50 ratio led to too many false positives.
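A minimal sketch of such a split, assuming the DataFrame with the binary vulnerable column from the loading sketch above; the sampling strategy is kept deliberately simple and is not the exact procedure from the thesis.

```python
import pandas as pd

def make_training_split(df: pd.DataFrame, vulnerable_ratio: float = 0.25,
                        seed: int = 42) -> pd.DataFrame:
    """Downsample the non-vulnerable majority class so that the requested
    fraction of the resulting training set is vulnerable."""
    vulnerable = df[df["vulnerable"]]
    n_negative = int(len(vulnerable) * (1 - vulnerable_ratio) / vulnerable_ratio)
    negative = df[~df["vulnerable"]].sample(n=n_negative, random_state=seed)
    # Shuffle so that positive and negative samples are interleaved.
    return pd.concat([vulnerable, negative]).sample(frac=1, random_state=seed)

# 25% vulnerable samples worked best in my experiments.
train_df = make_training_split(train_df, vulnerable_ratio=0.25)
```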

Fine-Tuned Results

Table 4 shows the results from the best fine-tuned models as well as the best pre-trained model. Fine-tuning the models greatly improves their performance and allows them to outperform GPT-4-0613, with an F1-score almost twice as high.

| Model | Unclassified | Precision | Recall | F1-score |
| --- | --- | --- | --- | --- |
| Llama-3 | 0 | 0.391 | 0.457 | 0.422 |
| Deepseek-Coder | 0 | 0.277 | 0.440 | 0.340 |
| GPT-4-0613 | 1 | 0.164 | 0.442 | 0.239 |

Table 4: Comparison between fine-tuned models and the best pre-trained model.

Performance per CWE

An interesting measurement is the performance per CWE class. Table 5 shows the accuracy of the best-performing fine-tuned Llama-3 model. Its performance is noticeably low for CWE-476, the NULL pointer dereference. This kind of vulnerability is hard to detect from the function body alone; the model would likely need more context about how the function is used. The performance for the CWE-other class is also considerably low, which could stem from the small number of examples per specific CWE in the training data.

| CWE | Correct | Incorrect | Accuracy |
| --- | --- | --- | --- |
| CWE-119 | 60 | 42 | 0.588 |
| CWE-120 | 56 | 36 | 0.609 |
| CWE-469 | 5 | 3 | 0.625 |
| CWE-476 | 5 | 51 | 0.089 |
| CWE-other | 34 | 58 | 0.370 |
| Non-vulnerable | 4401 | 249 | 0.946 |

Table 5: Performance for each CWE class achieved by the best-performing Llama-3 model.
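The per-class numbers in Table 5 can be derived by grouping the test predictions by their CWE label. A minimal sketch with made-up toy rows; the column names are assumptions of this sketch, not taken from the evaluation code.

```python
import pandas as pd

# One row per test sample: its CWE label ("Non-vulnerable" for negative
# samples), the ground truth, and the model's prediction (toy values).
results = pd.DataFrame({
    "cwe": ["CWE-119", "CWE-476", "Non-vulnerable", "Non-vulnerable"],
    "label": [True, True, False, False],
    "prediction": [True, False, False, True],
})

results["correct"] = results["label"] == results["prediction"]
per_cwe = results.groupby("cwe")["correct"].agg(correct="sum", total="count")
per_cwe["accuracy"] = per_cwe["correct"] / per_cwe["total"]
print(per_cwe)
```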

Conclusion

In this blog post, LLMs were used for software vulnerability detection, both in their pre-trained and fine-tuned state. Fine-tuning greatly helps the model understand the given task and reduces the number of false positives, allowing it to beat GPT-4.

Nevertheless, this experiment shows that there is still a lot of work to be done before LLMs can be used for reliable vulnerability detection. One important step is creating a large training dataset with reliable labels. Another step could include providing additional context to improve the classification of complex vulnerabilities.

A different work by Feras Zaher-Alnaem, which used abstract syntax trees, achieved better results than the fine-tuned LLMs.

References

[1] A. Anwar et al.: "Measuring the Cost of Software Vulnerabilities", 2020.
[2] M. D. Purba et al.: "Software Vulnerability Detection Using Large Language Models", 2023.
[3] B. Steenhoek et al.: "A Comprehensive Study of the Capabilities of Large Language Models for Vulnerability Detection", 2024.
[4] S. Susan and A. Kumar: "The Balancing Trick: Optimized Sampling of Imbalanced Datasets - A Brief Survey of the Recent State of the Art", 2021.
