
A hybrid supervised/unsupervised approach to network anomaly detection


The previous two posts gave a short introduction to network anomaly detection in general. We also introduced the k-means algorithm as a simple clustering technique and discussed some advantages and drawbacks of the algorithm. Furthermore, we gave some general information about techniques other than clustering which can be used for anomaly detection. In this post we want to introduce a hybrid unsupervised/supervised approach: we are going to use Balanced Iterative Reducing and Clustering using Hierarchies, better known as BIRCH, as a pre-clustering step for a subsequent Support Vector Machine (SVM) classifier.

Unfortunately we do not have a labeled data set of the inovex network traffic to train the classifier on. So we will first introduce the general structure of the hybrid approach and the results of analyzing the NSL-KDD data set. Afterwards we are going to briefly introduce an adaptation of the hybrid approach which can also handle unlabeled data.

The NSL-KDD data set

For training and testing purposes we used the NSL-KDD data set. It is based on the KDD 99 data set, which is a popular choice for intrusion detection tasks. The training data set consists of 125,973 records, of which 53.46% are normal records and 46.54% represent attacks. The test data set consists of 22,543 records, of which 43.08% are normal records and 56.92% represent attacks. We used the complete training and test sets, respectively. The data was preprocessed according to the following steps:

  • Categorical features: One Hot Encoding
  • Continuous features: No adjustments were made
  • Binary features: No adjustments were made

In a second step, all features were scaled to ensure that the clustering algorithm is not biased by the different value ranges of the features.
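A minimal sketch of these preprocessing steps in Python/scikit-learn (the file path, the column positions and the choice of a min-max scaler are assumptions rather than our exact setup):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Load the training file; the path and the column layout (41 features
# followed by the class label) are assumptions based on the usual
# NSL-KDD distribution.
df = pd.read_csv("KDDTrain+.txt", header=None)
labels = df.iloc[:, 41]      # 'normal' or the name of an attack
features = df.iloc[:, :41]

# One-hot encode the categorical features; protocol_type, service and
# flag sit at positions 1-3 in the standard layout. Binary and
# continuous features are left untouched.
features = pd.get_dummies(features, columns=[1, 2, 3])

# Scale all features so the clustering is not biased by value ranges.
X = MinMaxScaler().fit_transform(features)
```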

Clustering & classification

In order to understand the functionality of the hybrid approach you should first get a grasp of the underlying clustering and classification algorithms. Below we are going to introduce them briefly.

BIRCH

BIRCH was originally introduced by Zhang et al. in 1996. The main idea of BIRCH is to scan a data set only once and thereby build a so-called Clustering Feature tree (CF tree) in memory. The CF tree is a rooted tree consisting of non-leaf nodes, leaf nodes and undirected edges. It stores information about the clusters in so-called Clustering Features. Clustering Features are vectors which store information in the following manner:

\(CF_x = (N, LS, SS)\)

with N as the number of elements in the cluster, LS the linear sum and SS the sum of squares of all elements. The cluster features are a compressed way of representing the points of each cluster.
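To see why this representation is so convenient, here is a small illustration of our own (a sketch, not library code): two clusters can be merged by simply adding their CF vectors, and statistics such as the centroid and the radius can be derived without ever touching the raw points.

```python
import numpy as np

def merge(cf_a, cf_b):
    """Merging two clusters only requires adding their CF vectors."""
    (n_a, ls_a, ss_a), (n_b, ls_b, ss_b) = cf_a, cf_b
    return (n_a + n_b, ls_a + ls_b, ss_a + ss_b)

def centroid(cf):
    """Mean of all points in the cluster: LS / N."""
    n, ls, _ = cf
    return ls / n

def radius(cf):
    """Root mean squared distance of the points from the centroid."""
    n, ls, ss = cf
    return np.sqrt(np.maximum(ss / n - (ls / n) ** 2, 0.0).sum())

# CF of the three 2-d points (1,1), (2,2) and (3,3):
points = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
cf = (len(points), points.sum(axis=0), (points ** 2).sum(axis=0))
print(centroid(cf), radius(cf))  # [2. 2.] 1.1547...
```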

The size and the shape of the tree are mainly influenced by three parameters: the branching factor B, the threshold T and the maximum number of entries in a leaf node L. B and T are hyperparameters; by changing their values one can influence the size and the shape of the tree. Every non-leaf node can have a maximum of B child nodes, and every leaf node contains up to L entries whose points are no further than T away from each other.

In BIRCH, data points are added to the CF tree one by one. This makes it perfectly suitable for handling streaming data as well as large amounts of data. Inserting a new data point works as follows:

  1. Identify the closest leaf entry by traversing the CF tree from the root
  2. Update the leaf according to the following rules
    • If the new point lies within the threshold T of the closest entry, absorb it into that entry's cluster feature
    • Otherwise create a new entry for the point; if the leaf then exceeds L entries, split the leaf and redistribute its entries (a split can propagate upwards if a non-leaf node exceeds B children)
  3. Update the cluster features of all non-leaf nodes that are affected by the insertion

Besides the fact that new data points can be inserted dynamically, BIRCH comes with some additional, very handy properties.

Time complexity: BIRCH scans the data set once. One scan of the data set implies a linear time complexity of \(O(n)\).

Hyperparameters: In clustering, the user often has to specify k – the expected number of clusters (e.g. in k-means). In BIRCH there is no need to specify k beforehand. This can be pretty handy because often the user doesn’t have enough domain knowledge to specify k. Instead the user specifies some shape parameters for the CF tree: a threshold T, which defines the biggest distance from an observation to its cluster center, and a branching factor B, which limits the number of possible child nodes in the tree.

Dynamic: The CF tree can be built dynamically. Inserting new observations into an already existing tree is possible.

When using BIRCH one should always keep the following things in mind. First, it can only handle numerical data, so non-numerical features need an additional pre-processing step. Second, BIRCH is sensitive to the order of the elements in the data stream: supplying BIRCH with a permuted data set might result in a different tree and hence different clusters.

For our purposes we used the scikit-learn library, which makes BIRCH fairly easy to use. One first defines the model specifications:
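For example (the parameter values are the ones our final configuration uses; setting `n_clusters=None` makes scikit-learn skip its global clustering step and return the CF tree's subclusters as they are, which is what the hybrid approach needs):

```python
from sklearn.cluster import Birch

# T and B from the text map to `threshold` and `branching_factor`.
model = Birch(threshold=0.5, branching_factor=160, n_clusters=None)
```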

After specifying the model one can train the algorithm with either streaming or batch data:
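A minimal sketch of both modes, assuming `X` is the preprocessed feature matrix from above:

```python
import numpy as np

# Batch mode: a single pass over the full data set.
model.fit(X)
centers = model.subcluster_centers_  # compressed representation of X

# Streaming mode: an equivalent model fed chunk by chunk.
stream_model = Birch(threshold=0.5, branching_factor=160, n_clusters=None)
for chunk in np.array_split(X, 100):
    stream_model.partial_fit(chunk)
```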

Support Vector Machine

SVMs are very common in today’s machine learning. An SVM maps the observations of the training data (vectors) into a higher-dimensional space via a function \(\theta\). It then tries to find a hyperplane in that space which separates the data points with the highest margin. The shape of the separating hyperplane depends on the chosen kernel function. There are several kernel functions to choose from, e.g. a linear kernel:

\(k(x_i, x_j) = \theta (x_i)^T\theta (x_j)\)

Usually SVMs perform very well, but they also have some drawbacks. SVMs have a time complexity somewhere between \(O(n^2)\) and \(O(n^3)\), which makes them unsuitable for the analysis of huge amounts of data. The computational complexity is highly dependent on the number of support vectors.
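If you want to see this scaling behaviour yourself, a quick synthetic benchmark like the following sketch makes it visible (timings depend on your machine, but doubling the sample size typically more than doubles the training time):

```python
import time
from sklearn.datasets import make_classification
from sklearn.svm import SVC

for n in [2000, 4000, 8000]:
    X_syn, y_syn = make_classification(n_samples=n, n_features=20,
                                       random_state=0)
    start = time.perf_counter()
    SVC(kernel="rbf").fit(X_syn, y_syn)
    print(n, round(time.perf_counter() - start, 2), "s")
```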

Hybrid approach

In this section we are going to show you how we combined the two algorithms to get a scalable SVM classifier for network intrusion detection. We use BIRCH as a pre-processing step for the SVM, mainly for two reasons: First, with BIRCH we can generalize from the data and thereby remove outliers. Second, we reduce the amount of input data for the SVM classifiers dramatically. The hybrid approach has the following structure (a code sketch of the full pipeline follows after the list):

  1. Preprocessing of the data (transforming features, setting class labels, …).
  2. Apply BIRCH
    1. Construct CF trees, one for each attack type and one for the normal records
    2. The cluster centers of the newly generated CF trees are used as new input data
  3. Train four SVM classifiers, one for each attack type. So for each attack type do the following:
    1. Set the class labels of all observations which don’t belong to the current attack type to ‘normal’.
    2. Train the SVM classifier with the adjusted data set
  4. Combine the four SVM classifiers to build an IDS. If at least one of the four classifiers predicts a record as an anomaly, the record will be classified as an anomaly.
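As announced above, here is a condensed sketch of the whole pipeline. It is an illustration rather than our exact code: it assumes `X` and `labels` from the preprocessing step, assumes the fine-grained NSL-KDD attack labels have already been mapped onto the four categories, and the RBF kernel is a placeholder choice.

```python
import numpy as np
from sklearn.cluster import Birch
from sklearn.svm import SVC

ATTACK_TYPES = ["dos", "probe", "r2l", "u2r"]

# Step 2: one CF tree per traffic category; its subcluster centers
# replace the raw observations of that category.
centers, center_labels = [], []
for traffic_type in ATTACK_TYPES + ["normal"]:
    birch = Birch(threshold=0.5, branching_factor=160, n_clusters=None)
    birch.fit(X[labels == traffic_type])
    centers.append(birch.subcluster_centers_)
    center_labels += [traffic_type] * len(birch.subcluster_centers_)

X_small = np.vstack(centers)
y_small = np.array(center_labels)

# Step 3: one binary SVM per attack type; all centers that do not
# belong to the current attack type are relabeled as 'normal'.
classifiers = {}
for attack in ATTACK_TYPES:
    y_binary = np.where(y_small == attack, attack, "normal")
    classifiers[attack] = SVC(kernel="rbf").fit(X_small, y_binary)

# Step 4: a record counts as an anomaly as soon as at least one of the
# four classifiers predicts its attack type.
def is_anomaly(record):
    record = record.reshape(1, -1)
    return any(clf.predict(record)[0] != "normal"
               for clf in classifiers.values())
```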

Results

Unfortunately we do not have a labeled network traffic data set of the inovex network, so for our analysis we use the NSL-KDD data set introduced above.

In a first step we measured the performance of the hybrid approach with respect to its accuracy and false-positive rate on the test data set. The results are shown in figure 1. As we want to misclassify as few attacks as possible as normal records, we chose the configuration of the algorithm with the lowest false-positive rate, although this reduces the overall accuracy. We were able to achieve a false-positive rate of 1.4% and a total accuracy of 87.83% on the test data set. This result was obtained with a threshold of T = 0.5 and a branching factor of B = 160.

Figure 1: Anomaly detection results

The detailed results per attack type for this configuration are shown in the following table.

Table 1: Detailed results per attack type (T = 0.5, B = 160)

| Type of traffic | Accuracy on training data (%) | Accuracy on testing data (%) |
| --- | --- | --- |
| Normal | 91.31 | 77.45 |
| Denial of Service (dos) | 92.26 | 88.86 |
| Probe | 88.94 | 87.56 |
| U2R | 99.96 | 99.83 |
| R2L | 99.21 | 97.48 |
| Overall | 91.32 | 87.17 |

With an overall accuracy of 87.17%, the hybrid approach performs quite well. For the attack types U2R and R2L we obtain really good results, detecting 99.83% and 97.48% of all attacks in the test set, respectively.

Besides its accuracy there is an even more interesting aspect. Table 2 shows the number of cluster centers for each category after applying BIRCH (B was set to 160). The original data set contains ~125,000 observations. Gradually increasing the threshold T from 0.05 to 0.5 decreases the number of clusters dramatically. But why? If we increase the value of the threshold parameter T, more and more points fall within the range T of an existing cluster. Thus there is no need for splitting the nodes, and hence we end up with fewer clusters.

Table 2: Number of cluster centers per category after applying BIRCH (B = 160)

| Data type | normal | dos | probe | R2L | U2R | Rate of compression (%) |
| --- | --- | --- | --- | --- | --- | --- |
| Original data set | 67,343 | 45,927 | 11,656 | 995 | 52 | - |
| T = 0.05 | 20,444 | 7,391 | 3,680 | 325 | 49 | 74.69 |
| T = 0.1 | 9,750 | 2,049 | 1,831 | 159 | 41 | 89.02 |
| T = 0.2 | 3,498 | 588 | 765 | 73 | 32 | 96.06 |
| T = 0.3 | 1,654 | 254 | 421 | 43 | 24 | 98.01 |
| T = 0.4 | 859 | 114 | 214 | 26 | 17 | 99.02 |
| T = 0.5 | 538 | 57 | 141 | 18 | 14 | 99.40 |

Fewer clusters imply less training data for the SVM classifiers used later on. We already mentioned that SVMs have a very poor computational complexity, so decreasing the number of training points leads to a heavy decrease in training time. Table 3 shows the decrease in training time for different threshold values. In our final approach we use a threshold of T = 0.5. This decreases the number of data points from ~125,000 to 768 in total and the training time for the SVM classifier from 1117 seconds to only 0.11 seconds!

Table 3: SVM training time for different threshold values

| Threshold | Training time (s) | Training time (%) |
| --- | --- | --- |
| original data | 1117 | - |
| 0.05 | 147 | 13.16 |
| 0.1 | 32 | 2.86 |
| 0.2 | 3.9 | 0.35 |
| 0.3 | 0.9 | 0.08 |
| 0.4 | 0.25 | 0.02 |
| 0.5 | 0.11 | 0.01 |

So using BIRCH as a pre-processing step speeds things up dramatically and makes the SVM classifier applicable to even bigger data sets.

Conclusions & further reading

We introduced a hybrid approach for network intrusion detection using BIRCH in combination with SVM classifiers. Using BIRCH we were able to handle the huge amount of data needed to train the SVM classifiers, which showed a reasonable performance on both the training and the test data set.

One drawback of the approach is that we can only apply it to labeled (or partially labeled) data. Often we do not have a labeled network data set, and labeling it would be too expensive, maybe even impossible. In this case we can use a One-Class SVM (OCSVM) instead of the proposed standard SVM classifier. OCSVMs are particularly useful when you have a lot of ‘normal’ observations and only a few ‘abnormal’ observations, which is the case when it comes to network traffic data.
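A minimal sketch of this unsupervised variant (`X_unlabeled` and `X_new` are placeholder names, and the value of `nu` is an assumption):

```python
from sklearn.svm import OneClassSVM

# Train on unlabeled traffic that is assumed to be mostly normal; nu
# bounds the fraction of training points that may be treated as outliers.
ocsvm = OneClassSVM(kernel="rbf", nu=0.01, gamma="scale")
ocsvm.fit(X_unlabeled)

# predict() returns +1 for inliers (normal) and -1 for outliers (anomalies).
predictions = ocsvm.predict(X_new)
```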

For further reading regarding BIRCH and SVMs have a look at the following links:
