# Fingerprinting with Destination Context, using a Weighted Naive Bayes Classifier

Mercury's Fingerprinting with Destination Context (FDC) system identifies the software process that created a TLS client_hello, and can indicate when that process is malware or other unwanted software (when given a fingerprint knowledge base that contains malware data).  A major component of that system is a weighted naive Bayes classifier.  This note documents the function, design, and implementation of FDC and the classifier.



## Function

FDC takes as input a characteristic fingerprint string (such as "(0303)(1302)((0033)(002b00020304))") and a destination context, which consists of a destination IP address, a destination port, and the server_name field from the TLS client_hello extensions.  These fields represent the destination to which the client_hello was sent.

FDC returns several types of data about the fingerprint and the process that generated the client_hello: the fingerprint status, information about the most probable process, probable attributes of the process, and a list of operating systems that were associated with the process in the ground truth data.

The fingerprint status is **labeled** if the fingerprint appears in the fingerprint database, **unlabeled** if it appears only in the fingerprint prevalence file, and **randomized** otherwise.  In the C interface, these states are represented with the fingerprint_status enumeration:
```C
enum fingerprint_status {
    fingerprint_status_no_info_available = 0,  // fingerprint status is unknown
    fingerprint_status_labeled           = 1,  // fingerprint is in FPDB
    fingerprint_status_randomized        = 2,  // fingerprint is in randomized FP set
    fingerprint_status_unlabled          = 3   // fingerprint is not in FPDB or randomized set
};
```
The fingerprint database contains detailed information about fingerprints, processes, operating systems, and destinations.  The fingerprint prevalence file only has information about fingerprints.  Randomized fingerprints are generated by evasive applications and TLS scanners.

The **probable process** is the process that the classifier rates as most likely to have generated the observed fingerprint and destination context.  The information about the most probable process is:

* the **name** of the most probable process (as a const pointer to a NULL-terminated string),
* a boolean **probable_process_is_malware** attribute that indicates whether or not the most probable process is malware (as a C99 bool); in mercury's JSON output, this attribute is simply called **malware**,
* a probability **score** that represents the classifier's confidence that the information about the probable process is correct (as a floating-point value between 0.0 and 1.0 inclusive).  More precisely, the score is the probability, as computed by the classifier, that the probable process is the actual process.

The probable_process_is_malware boolean attribute is **true** whenever the most probable process has been flagged as malicious by threat intelligence services.  This attribute describes only the probable process, not the other candidate processes.

In the fingerprint database, there is a special process name "generic dmz process" that is used whenever there is no ground truth available for a fingerprint.  If the most probable process is "generic dmz process", then the classifier reports the second most probable process instead.

The information about probable attributes is:

* the probability **p_malware** that the process that generated the client_hello is malware, regardless of what the probable process is (as a floating-point value between 0.0 and 1.0 inclusive), and
* in the future, there may be an attribute associated with evasive behavior.

To understand the difference between probable_process_is_malware and p_malware, consider the case in which the fingerprint is used by both a benign process and a malware process.  If the classifier computes a probability of 0.51 for the benign process and 0.49 for the malware process, then the probable process will be the benign one and probable_process_is_malware will be **false**, but p_malware will be 0.49.
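
The distinction can be illustrated with a short sketch.  It is not mercury's code: the `process_estimate` and `malware_summary` structs and the `summarize_malware` function are hypothetical names, and the sketch assumes that p_malware is the sum of the normalized probabilities of all processes flagged as malware.

```C
#include <stdbool.h>
#include <stddef.h>

// Hypothetical per-process record; not mercury's actual data structure.
struct process_estimate {
    const char *name;    // process name
    bool is_malware;     // flagged as malware in the knowledge base
    double probability;  // normalized probability from the classifier
};

// Hypothetical container for the two malware indicators.
struct malware_summary {
    bool probable_process_is_malware;  // flag of the single most probable process
    double p_malware;                  // ASSUMPTION: total probability of malware processes
};

// Computes both indicators from an array of normalized estimates.
struct malware_summary summarize_malware(const struct process_estimate *p, size_t n) {
    struct malware_summary s = { false, 0.0 };
    size_t best = 0;
    for (size_t i = 0; i < n; i++) {
        if (p[i].probability > p[best].probability) {
            best = i;  // track the most probable process
        }
        if (p[i].is_malware) {
            s.p_malware += p[i].probability;  // accumulate malware probability
        }
    }
    if (n > 0) {
        s.probable_process_is_malware = p[best].is_malware;
    }
    return s;
}
```

With one benign process at 0.51 and one malware process at 0.49, this yields a false flag and a p_malware of 0.49.  With one benign process at 0.4 and three malware variants at 0.2 each, the flag is still false, but p_malware is about 0.6.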

The p_malware field is especially important when there is a multitude of similar processes, such as polymorphic malware.   In that case, there are many distinct processes that behave similarly, and the classifier may not be able to identify the exact process with high confidence, but it can still accurately estimate the probability that the process is malware.  When using the classifier to report malware, the p_malware field is more important than the malware boolean associated with the probable process.

The OS information lists each operating system on which the most probable process was observed in the ground truth data, along with its prevalence.  The analysis object contains the number of OSes, os_info_len, along with an array of structs, where each struct represents a single OS and contains:

* an observed OS (as a const pointer to a NULL-terminated string), and
* a uint64_t count of the number of times the process was seen with the observed OS.

All of the above information can be accessed through these C99 functions:
```C
enum fingerprint_status analysis_context_get_fingerprint_status(const struct analysis_context *ac);

const char *analysis_context_get_fingerprint_string(const struct analysis_context *ac);

const char *analysis_context_get_server_name(const struct analysis_context *ac);

bool analysis_context_get_process_info(const struct analysis_context *ac, // input
                                       const char **probable_process,     // output
                                       double *probability_score          // output
                                       );

bool analysis_context_get_malware_info(const struct analysis_context *ac, // input
                                       bool *probable_process_is_malware, // output
                                       double *probability_malware        // output
                                       );

struct os_information {
    char *os_name;
    uint64_t os_prevalence;
};

bool analysis_context_get_os_info(const struct analysis_context *ac,     // input
                                  const struct os_information **os_info, // output
                                  size_t *os_info_len                    // output
                                  );

```
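
The calling pattern for the process and malware getters might look like the following sketch.  The `summarize_analysis` helper is hypothetical.  So that the example compiles on its own, it defines a stand-in `struct analysis_context` and stand-in getter bodies; real code would link against mercury's library instead.  It also assumes that a **true** return value means the output parameters were filled in, which the listing above does not state.

```C
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

// ASSUMPTION: stand-in for mercury's analysis_context, defined here only so
// that the sketch compiles by itself; the real layout belongs to the library.
struct analysis_context {
    const char *process;
    double score;
    bool is_malware;
    double p_malware;
};

// ASSUMPTION: stand-in getter bodies using the signatures listed above; they
// return true when the outputs were set.
bool analysis_context_get_process_info(const struct analysis_context *ac,
                                       const char **probable_process,
                                       double *probability_score) {
    if (ac == NULL || ac->process == NULL) { return false; }
    *probable_process = ac->process;
    *probability_score = ac->score;
    return true;
}

bool analysis_context_get_malware_info(const struct analysis_context *ac,
                                       bool *probable_process_is_malware,
                                       double *probability_malware) {
    if (ac == NULL) { return false; }
    *probable_process_is_malware = ac->is_malware;
    *probability_malware = ac->p_malware;
    return true;
}

// Hypothetical helper: writes a one-line summary into buf, or returns false
// if either getter reports that no result is available.
bool summarize_analysis(const struct analysis_context *ac, char *buf, size_t buflen) {
    const char *process = NULL;
    double score = 0.0;
    bool is_malware = false;
    double p_malware = 0.0;

    if (!analysis_context_get_process_info(ac, &process, &score)) { return false; }
    if (!analysis_context_get_malware_info(ac, &is_malware, &p_malware)) { return false; }
    snprintf(buf, buflen, "process=%s score=%.2f malware=%d p_malware=%.2f",
             process, score, is_malware ? 1 : 0, p_malware);
    return true;
}
```

Checking each return value before reading the outputs keeps the caller from using uninitialized results when no analysis is available.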

The FDC system reads a set of resource files at initialization time, from which it obtains all of the information that it uses in its analysis (other than the inputs listed above). The resource files include:

* fingerprint_db.json.gz: knowledge base that maps processes and their destinations to characteristic fingerprint strings. Each fingerprint entry is represented as a JSON object, and only fingerprints with associated process ground truth are included.
* fp_prevalence_tls.txt.gz: lists all observed characteristic fingerprint strings, whether or not process labels exist for them.
* pyasn.db: maps IPv4 and IPv6 subnets to autonomous systems.

If a fingerprint is not in fingerprint_db.json.gz, then the analysis JSON object will only contain a special "status" key. The "status" key's value is "unlabeled_fingerprint" if the fingerprint was in fp_prevalence_tls.txt.gz and "randomized_fingerprint" otherwise.

The following is an example of mercury's JSON output:

``` json
  "analysis": {
    "process": "microsoft internet explorer",
    "score": 0.969323,
    "malware": 0,
    "p_malware": 0,
    "os_info": {
      "cpe:2.3:o:microsoft:windows_10:1703:*:*:*:*:*:*:*": 53,
      "cpe:2.3:o:microsoft:windows_10:1803:*:*:*:*:*:*:*": 602617,
      "cpe:2.3:o:microsoft:windows_10:1809:*:*:*:*:*:*:*": 1845,
      "cpe:2.3:o:microsoft:windows_10:1903:*:*:*:*:*:*:*": 2493,
      "cpe:2.3:o:microsoft:windows_10:1909:*:*:*:*:*:*:*": 5999554,
      "cpe:2.3:o:microsoft:windows_10:2004:*:*:*:*:*:*:*": 1211,
      "cpe:2.3:o:microsoft:windows_10:20H2:*:*:*:*:*:*:*": 908
    }
  }
```


## Design

Mercury's FDC implementation uses a Weighted Naive Bayes (WNB) classifier to analyze destination context, and it uses the characteristic fingerprint string to select the WNB classifier.   Informally, when a string and a destination context are input, the string is used to select a classifier that is then applied to the destination context.   More formally, the classifier uses probabilities that are conditioned on the fingerprint.  The mathematical details are presented in [*Accurate TLS Fingerprinting using Destination Context and Knowledge Bases*](https://arxiv.org/abs/2009.01939).

A [Naive Bayes classifier](https://en.wikipedia.org/wiki/Naive_Bayes_classifier) is a simple, robust, and easily interpretable machine learning technique.  It estimates probabilities by applying Bayes' theorem, and makes the "naive" simplifying assumption that the data features are [conditionally independent](https://en.wikipedia.org/wiki/Conditional_independence), so that it is possible to estimate the model's probabilities from empirical data.  A WNB classifier assigns a different weight to each data feature, and tunes the weights during training to improve its accuracy.

Before the WNB classifier is run, the destination context is analyzed to find the destination's equivalence classes[^equivalence classes].   These classes are an important part of the data model; they enable the classifier to generalize IP addresses using [Autonomous System Numbers](https://en.wikipedia.org/wiki/Autonomous_system_(Internet)) and generalize server_names using [DNS domains](https://en.wikipedia.org/wiki/Domain_Name_System).  The equivalence classes currently used are:

* Autonomous System Numbers (ASNs) for IPv4 addresses[^IPv6]
* The second level DNS domains (or first level only, if there is just one) for TLS server_names.
* ...
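
As an illustration of the server_name equivalence class, the sketch below reduces a name to its last two DNS labels.  The function name `second_level_domain` is hypothetical, and this is a deliberately naive approach rather than mercury's implementation: it does not consult a public suffix list (so "a.b.co.uk" maps to "co.uk"), and it does not treat a trailing dot or an IP address literal specially.

```C
#include <stddef.h>
#include <string.h>

// Returns a pointer into server_name at the start of its last two labels,
// or server_name itself if it contains fewer than two dots.
const char *second_level_domain(const char *server_name) {
    size_t len = strlen(server_name);
    int dots = 0;
    // Scan backwards; the character after the second dot from the end
    // starts the second level domain.
    for (size_t i = len; i > 0; i--) {
        if (server_name[i - 1] == '.') {
            dots++;
            if (dots == 2) {
                return server_name + i;
            }
        }
    }
    return server_name;  // zero or one dot: the whole name is the class
}
```

Under these assumptions, "www.example.com" and "mail.example.com" both map to "example.com", so the classifier can treat them as the same destination feature.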



The classifier can easily be extended to understand new equivalence classes.   Currently, new classes can only be added at compile time, but it may be possible to extend the implementation so that it can learn new equivalence classes from resource files at run time.



## Implementation

To apply a naive Bayes classifier (weighted or otherwise), it is necessary to compute an estimate of the probability of each process from all of the data features, then find the maximum of those probabilities.   The straightforward way to do this would be with two nested loops, one over the data features and one over the processes.  Mercury's WNB classifier avoids this double loop; it only loops over the data features.   For each feature, there is a lookup that returns a set of probability updates.   Each probability update identifies a process to be updated (with an index into a vector of processes) and the amount by which it should be updated (with a long double number).
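
A minimal sketch of this update scheme follows.  The `update` and `update_list` structs and the `apply_updates` function are hypothetical names, not mercury's, and the sketch assumes that the scores are log-probabilities, so that an update is applied by addition.  The lookup itself is omitted; each `update_list` stands for the result of one feature lookup.

```C
#include <stddef.h>

// Hypothetical update record: add `value` to the score of process `index`.
struct update {
    size_t index;       // index into the vector of processes
    long double value;  // amount to add (assumed to be a log-domain term)
};

// Hypothetical result of one feature lookup: the updates for that feature.
struct update_list {
    const struct update *updates;
    size_t count;
};

// Applies the updates for each observed feature to the score vector.
// The inner loop visits only the processes named in an update set,
// not every process known to the classifier.
void apply_updates(long double *scores, size_t num_processes,
                   const struct update_list *features, size_t num_features) {
    for (size_t f = 0; f < num_features; f++) {
        for (size_t u = 0; u < features[f].count; u++) {
            const struct update *up = &features[f].updates[u];
            if (up->index < num_processes) {  // ignore out-of-range indices
                scores[up->index] += up->value;
            }
        }
    }
}
```

The cost of this approach grows with the number of updates returned by the lookups rather than with the number of features times the number of processes, which matters when most processes have never been observed with a given destination.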



The classification algorithm can be summarized as:

1. Use the fingerprint string as an index into the table of WNB classifiers.
   1. If the fingerprint string could not be found in the table, return "unknown fingerprint".
   2. If a WNB classifier corresponding to the fingerprint was found, proceed to Step 2.
2. Find the equivalence classes for each of the data features in the destination context.
   1. Find the ASN of the destination IP address.
   2. Find the second level DNS domain (or first level only, if there is just one) of the server_name.
   3. ...
3. Initialize a vector of long double numbers to the prior probabilities of each process, by dividing the count for that process by the total count.
4. For each equivalence class, perform a lookup to find the updates to be performed on the process probability vector.
5. Find the maximum value of the process probability vector and its corresponding index, and the second highest value and its corresponding index.
6. Normalize the probability of the most probable process by dividing it by the sum of all process probabilities.
7. Return the name of the most probable process and its normalized score.



### Numerical Stability and Accuracy

To improve numerical stability and accuracy, a WNB classifier implementation should use [log probabilities](https://en.wikipedia.org/wiki/Log_probability) to represent the process probabilities and probability updates, and should use long doubles to hold those values.   Instead of working directly with a probability p, we work with its logarithm log(p); instead of multiplying two probability values to get a third (p3 = p2 * p1), we add their logarithms (log(p3) = log(p2) + log(p1)).  This is straightforward, except that the normalization in Step 6 must apply the exp() function to each log-probability, because normalization must be applied to probabilities, not log-probabilities.  For computational efficiency, each log-probability should be cast to a float before exp() is applied to it, because the extra precision of long doubles is not needed at that point, and reducing the size of the input to exp() significantly reduces its computational cost.  The [log-sum-exp trick](https://gasstationwithoutpumps.wordpress.com/2014/05/06/sum-of-probabilities-in-log-prob-space/) can be used to improve the accuracy of the normalization step.

The logic that determines the most probable process should be performed before normalization or exponentiation, to keep the loss of numerical accuracy inherent in those steps away from the process-selection logic.

Most of the algorithm's loss of accuracy occurs while the process probabilities are being updated.   It may be worthwhile to use a [compensated summation algorithm](https://en.wikipedia.org/wiki/Kahan_summation_algorithm) to improve the accuracy of those computations.





# Footnotes



[^equivalence classes]:   Every set can be partitioned into non-overlapping [equivalence classes](https://en.wikipedia.org/wiki/Equivalence_class).  Each element of the set belongs to a single equivalence class.
[^IPv6]: IPv6 addresses are not yet fully supported, pending a rewrite of the lctrie library.
