Membership Inference
Overview
The goal of the Membership Inference attack is to determine whether specific data records were part of the model’s training dataset. The attack simulates an attacker who has access to the model and to a dataset, some of whose records were used to train the model. The simulated attacker builds a classifier that predicts whether each record was part of the training dataset, based on the loss the model produces for that record. The performance of this classifier indicates how vulnerable the model is to membership inference attacks and serves as a proxy for how private the model is. Note: membership inference tests cannot be run for closed-source API endpoint models.
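This section does not say how the per-record loss is computed or how the attack classifier is trained. The following is a minimal sketch under stated assumptions. It assumes a language model whose per-token probabilities can be read, uses the mean negative log-likelihood as the loss, and uses a single loss threshold as the classifier, chosen on records whose membership the attacker already knows. This is a common, simple loss-threshold attack and not necessarily the classifier this tool uses. The function names `record_loss`, `fit_threshold`, and `predict_member` are hypothetical.

```python
import math
from typing import List, Sequence


def record_loss(token_probs: Sequence[float]) -> float:
    """Mean negative log-likelihood of one data record.

    Assumption (hypothetical input format): `token_probs` holds the
    probability the model assigned to each token of the record.
    """
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def fit_threshold(member_losses: List[float], nonmember_losses: List[float]) -> float:
    """Pick the loss threshold that best separates records the attacker
    knows to be members from records it knows to be non-members."""
    candidates = sorted(set(member_losses) | set(nonmember_losses))
    candidates.append(candidates[-1] + 1.0)  # a threshold above every loss
    total = len(member_losses) + len(nonmember_losses)
    best_threshold, best_accuracy = candidates[0], -1.0
    for t in candidates:
        # Records with loss strictly below t are predicted to be members.
        correct = sum(x < t for x in member_losses) + sum(x >= t for x in nonmember_losses)
        accuracy = correct / total
        if accuracy > best_accuracy:  # strict ">" keeps the lowest best threshold
            best_threshold, best_accuracy = t, accuracy
    return best_threshold


def predict_member(loss: float, threshold: float) -> bool:
    """Low loss means the model fits the record unusually well,
    so the attack guesses that it was in the training set."""
    return loss < threshold


# Illustrative losses only (made up, not measured on any model).
threshold = fit_threshold([0.2, 0.3, 0.4], [0.9, 1.1, 0.35])
print(threshold)                        # 0.35
print(predict_member(0.25, threshold))  # True
```

The threshold rule reflects the intuition behind loss-based attacks: models tend to assign lower loss to records they were trained on. A learned classifier over richer features could replace `fit_threshold` without changing the rest of the evaluation.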
Metrics
True Positive Rate: The true positive rate (TPR) is the percentage of actual training-set records that the attack correctly predicts to be members of the training dataset. We report the TPR at several low false positive rates to measure the attacker’s success in high-confidence scenarios.
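A minimal sketch of reading off the TPR at a fixed low false positive rate. It assumes the attack outputs one membership score per record, where a higher score means more likely a member (for example, the negated loss), and that records of known membership are available. The function name `tpr_at_fpr` and the example scores are hypothetical.

```python
from typing import Sequence


def tpr_at_fpr(member_scores: Sequence[float],
               nonmember_scores: Sequence[float],
               max_fpr: float) -> float:
    """Best true positive rate over all thresholds whose false positive
    rate does not exceed max_fpr. Higher score = more likely a member."""
    best_tpr = 0.0  # a threshold above every score gives TPR = FPR = 0
    for t in sorted(set(member_scores) | set(nonmember_scores)):
        # Predict "member" for every record scoring at least t.
        fpr = sum(s >= t for s in nonmember_scores) / len(nonmember_scores)
        if fpr <= max_fpr:
            tpr = sum(s >= t for s in member_scores) / len(member_scores)
            best_tpr = max(best_tpr, tpr)
    return best_tpr


# Made-up scores for illustration.
members = [0.9, 0.8, 0.7, 0.3]
nonmembers = [0.85, 0.5, 0.4, 0.2, 0.1]
print(tpr_at_fpr(members, nonmembers, 0.0))  # 0.25
print(tpr_at_fpr(members, nonmembers, 0.2))  # 0.75
```

Allowing even one false positive in five non-members lets the threshold drop from 0.9 to 0.7, which raises the TPR from 0.25 to 0.75. This is why the TPR is reported at several low FPRs rather than at a single point.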
ROC-AUC: The Receiver Operating Characteristic (ROC) curve plots the attack’s True Positive Rate (TPR) against its False Positive Rate (FPR) as the decision threshold varies. The Area Under the ROC Curve (AUC) summarizes the attack’s performance across all thresholds in a single number. An AUC of 0.5 corresponds to random guessing, and an AUC of 1.0 corresponds to a perfect attack.
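A minimal sketch of computing the ROC-AUC from per-record membership scores. It assumes, as before, that a higher score means more likely a member. It relies on the fact that the AUC equals the probability that a randomly chosen member scores higher than a randomly chosen non-member, with ties counting half. The function name `roc_auc` is hypothetical. In practice, a library routine such as scikit-learn’s `roc_auc_score` computes the same value more efficiently.

```python
from typing import Sequence


def roc_auc(member_scores: Sequence[float], nonmember_scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison of
    member vs. non-member scores (ties count as half a win)."""
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))


# Made-up scores for illustration.
members = [0.9, 0.8, 0.7, 0.3]
nonmembers = [0.85, 0.5, 0.4, 0.2, 0.1]
print(roc_auc(members, nonmembers))  # 0.75
```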
Walkthrough Example
The membership inference attack follows a threat model in which the attacker has a classifier trained to predict whether a given data record is part of the model’s training set.
Model Input (example data record): "John, As discussed, the AIG exposure is $10B USD, and it is distributed among the price, option, and exotic books."
Model Output: The attacker’s classifier labels the data record either as in the training set or as not in the training set. We then compare this prediction with the record’s true membership to classify it as a true positive, false negative, false positive, or true negative:
True membership \ Classifier result | In the training set | Not in the training set
In the training set                 | True Positive       | False Negative
Not in the training set             | False Positive      | True Negative
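The four outcomes in the table can be tallied over an evaluation set, and the two rates follow from the counts: TPR = TP / (TP + FN) and FPR = FP / (FP + TN). A minimal sketch follows. The predictions and labels are made-up illustration values, and the function names `outcome` and `rates` are hypothetical.

```python
from collections import Counter
from typing import Sequence, Tuple


def outcome(predicted_member: bool, is_member: bool) -> str:
    """Map one (classifier result, true membership) pair to its table cell."""
    if predicted_member and is_member:
        return "TP"
    if not predicted_member and is_member:
        return "FN"
    if predicted_member and not is_member:
        return "FP"
    return "TN"


def rates(predictions: Sequence[bool], truths: Sequence[bool]) -> Tuple[Counter, float, float]:
    """Count outcomes and derive TPR and FPR.
    Assumes at least one true member and one true non-member."""
    counts = Counter(outcome(p, t) for p, t in zip(predictions, truths))
    tpr = counts["TP"] / (counts["TP"] + counts["FN"])
    fpr = counts["FP"] / (counts["FP"] + counts["TN"])
    return counts, tpr, fpr


predictions = [True, True, False, True, False]  # classifier results
truths = [True, False, True, True, False]       # true membership
counts, tpr, fpr = rates(predictions, truths)
print(dict(counts))  # {'TP': 2, 'FP': 1, 'FN': 1, 'TN': 1}
print(tpr, fpr)      # 0.6666666666666666 0.5
```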
Attacker success is then measured by the trade-off between the true positive rate and the false positive rate. An attacker that achieves a high true positive rate while keeping the false positive rate low has a powerful classifier, which indicates that the model is highly vulnerable to membership inference attacks.