
Privacy SDK Library

The dynamofl.privacy module provides tools for defending models against privacy vulnerabilities.

from dynamofl.privacy import *

Differential Privacy

Class DPTrainer

Release Notes

Introduces the DynamoFL Trainer, designed for differentially private training on single-GPU setups and incorporating state-of-the-art privacy techniques to safeguard training data.

Supported Features

Feature | Support
Differential Privacy | ✓
BitsAndBytes Training | ✓
bf16 Training | ✓
PEFT and LoRA | ✓

Methods

__init__(self, model, tokenizer, *, args, privacy_args, train_dataset, eval_dataset=None, **kwargs)

Constructs the DPTrainer, integrating differential privacy settings and initializing training attributes.

Parameters

Param | Type | Required? | Description
model | Union[PreTrainedModel, torch.nn.modules.module.Module, Any] | Required | The model to be trained.
tokenizer | PreTrainedTokenizerBase | Required | The tokenizer used for processing the input data.
args | TrainingArguments | Required | Training arguments for configuring the training process.
privacy_args | PrivacyArguments | Required | Configuration settings specific to differential privacy.
train_dataset | Dataset | Required | The dataset used for training the model.
eval_dataset | Optional[Dataset] | Optional | The dataset used for evaluation during training.
**kwargs | - | Optional | Additional keyword arguments for customization.

Example Usage

from transformers import TrainingArguments
from dynamofl.privacy import PrivacyArguments, DPTrainer

# Placeholders: substitute your own model, tokenizer, and dataset loading.
from your_model_loading_function import model, tokenizer
from your_dataset_loading_function import train_dataset, eval_dataset

train_args = TrainingArguments(output_dir="./model_output", num_train_epochs=3)
privacy_args = PrivacyArguments(target_epsilon=5.0, per_sample_max_grad_norm=1.0)

trainer = DPTrainer(
    model=model,
    tokenizer=tokenizer,
    args=train_args,
    privacy_args=privacy_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)

trainer.train()
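
Since PEFT and LoRA appear in the feature table above, a base model can be wrapped with a LoRA adapter before it is handed to DPTrainer. The following is a minimal sketch using the peft library; the base checkpoint and LoRA hyperparameters are illustrative assumptions, not values prescribed by the SDK.

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative base checkpoint; substitute your own model.
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Wrap the base model with a LoRA adapter so only the low-rank adapter
# weights are updated (and privatized) during DP training.
lora_config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)

# The wrapped model is then passed to DPTrainer exactly as in the example above.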

Class DPSFTTrainer

Release Notes

Launches the DynamoFL Supervised Fine-Tuning Trainer, specifically tailored for differential privacy to ensure secure and private model fine-tuning.

Supported Features

Feature | Support
Supervised Fine-Tuning | ✓
Differential Privacy | ✓
BitsAndBytes Training | ✓
bf16 Training | ✓
PEFT and LoRA | ✓

Methods

__init__(self, model, tokenizer, *, args, privacy_args, train_dataset, dataset_text_field="text", eval_dataset=None, max_seq_length=1024)

Constructs the DPSFTTrainer with differential privacy configurations and initializes additional attributes.

Parameters

Param | Type | Required? | Description
model | Union[PreTrainedModel, torch.nn.modules.module.Module, Any] | Required | The model to be trained.
tokenizer | PreTrainedTokenizerBase | Required | The tokenizer used for processing the input data.
args | TrainingArguments | Required | Training arguments for configuring the training process.
privacy_args | PrivacyArguments | Required | Configuration settings specific to differential privacy.
train_dataset | Dataset | Required | The dataset used for training the model.
dataset_text_field | str | Optional | The field name in the dataset containing the text. Default is "text".
eval_dataset | Optional[Dataset] | Optional | The dataset used for evaluation during training.
max_seq_length | int | Optional | Maximum sequence length for input processing. Default is 1024.

Example Usage

from transformers import TrainingArguments
from dynamofl.privacy import PrivacyArguments, DPSFTTrainer

# Placeholders: substitute your own model, tokenizer, and dataset loading.
from your_model_loading_function import model, tokenizer
from your_dataset_loading_function import train_dataset, eval_dataset

train_args = TrainingArguments(output_dir="./model_output", num_train_epochs=3)
privacy_args = PrivacyArguments(target_epsilon=5.0, per_sample_max_grad_norm=1.0)

trainer = DPSFTTrainer(
    model=model,
    args=train_args,
    tokenizer=tokenizer,
    privacy_args=privacy_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    dataset_text_field="text",
    max_seq_length=1024,
)

trainer.train()
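
The dataset_text_field parameter tells the trainer which column of the dataset holds the raw text. As a minimal sketch, assuming a HuggingFace datasets-style Dataset (the column contents below are illustrative):

from datasets import Dataset

# Toy dataset with the default "text" column that DPSFTTrainer reads from.
train_dataset = Dataset.from_dict({
    "text": [
        "Example training sentence one.",
        "Example training sentence two.",
    ]
})

# If your text lives under another column name, point the trainer at it:
#   DPSFTTrainer(..., dataset_text_field="my_text_column")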

Class DPMultiGPUTrainer

Release Notes

Enhances differentially private training for multi-GPU environments, supporting a variety of training optimizations including BitsAndBytes, bf16, PEFT, LoRA, DeepSpeed, and Mixture of Quantization.

Supported Features

Feature | Stage 1 | Stage 2 | Stage 3
Multi-GPU Training | ✓ | ✓ | ✓
Differential Privacy | ✓ | ✓ | ✓
BitsAndBytes Training | ✓ | ✓ | ✓
bf16 Training | ✓ | ✓ | ✓
PEFT and LoRA | ✓ | ✓ | ✓
DeepSpeed Integration | ✓ | ✓ | ✓

Methods

__init__(self, model, tokenizer, *, args, privacy_args, train_dataset, eval_dataset=None, **kwargs)

Constructs the DPMultiGPUTrainer, integrating differential privacy settings and initializing training attributes.

Parameters

Param | Type | Required? | Description
model | Union[PreTrainedModel, torch.nn.modules.module.Module, Any] | Required | The model to be trained.
tokenizer | PreTrainedTokenizerBase | Required | The tokenizer used for processing the input data.
args | TrainingArguments | Required | Training arguments for configuring the training process, optionally including a DeepSpeed config.
privacy_args | PrivacyArguments | Required | Configuration settings specific to differential privacy.
train_dataset | Dataset | Required | The dataset used for training the model.
eval_dataset | Optional[Dataset] | Optional | The dataset used for evaluation during training.
**kwargs | - | Optional | Additional keyword arguments for customization.

Example Usage

from transformers import TrainingArguments
from dynamofl.privacy import PrivacyArguments, DPMultiGPUTrainer

# Placeholders: substitute your own model, tokenizer, and dataset loading.
from your_model_loading_function import model, tokenizer
from your_dataset_loading_function import train_dataset, eval_dataset

train_args = TrainingArguments(output_dir="./model_output", num_train_epochs=3)
privacy_args = PrivacyArguments(target_epsilon=5.0, per_sample_max_grad_norm=1.0)

trainer = DPMultiGPUTrainer(
    model=model,
    tokenizer=tokenizer,
    args=train_args,
    privacy_args=privacy_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)

trainer.train()
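
As the parameter table notes, the TrainingArguments passed to DPMultiGPUTrainer may optionally carry a DeepSpeed configuration. A minimal sketch, assuming a standard HuggingFace-style DeepSpeed JSON config; the file path and launch command below are illustrative:

from transformers import TrainingArguments

# Point TrainingArguments at a DeepSpeed ZeRO config file (illustrative path).
train_args = TrainingArguments(
    output_dir="./model_output",
    num_train_epochs=3,
    deepspeed="./ds_zero2_config.json",
)

# Multi-GPU runs are typically started with a distributed launcher, e.g.:
#   torchrun --nproc_per_node=4 train_script.py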

Class DPMultiGPUSFTTrainer

Release Notes

Introduces the DynamoFL MultiGPU Differentially Private Supervised Fine-Tuning Trainer, optimized for multi-GPU setups with enhanced privacy features and support for various training optimizations.

Supported Features

Feature | Stage 1 | Stage 2 | Stage 3
Multi-GPU Training | ✓ | ✓ | ✓
Differential Privacy | ✓ | ✓ | ✓
Supervised Fine-Tuning | ✓ | ✓ | ✓
BitsAndBytes Training | ✓ | ✓ | ✓
bf16 Training | ✓ | ✓ | ✓
PEFT and LoRA | ✓ | ✓ | ✓
DeepSpeed Integration | ✓ | ✓ | ✓

Methods

__init__(self, model, tokenizer, *, args, privacy_args, train_dataset, dataset_text_field="text", eval_dataset=None, max_seq_length=1024)

Constructs the DPMultiGPUSFTTrainer with differential privacy configurations and initializes additional attributes.

Parameters

Param | Type | Required? | Description
model | Union[PreTrainedModel, torch.nn.modules.module.Module, Any] | Required | The model to be trained.
tokenizer | PreTrainedTokenizerBase | Required | The tokenizer used for processing the input data.
args | TrainingArguments | Required | Training arguments for configuring the training process.
privacy_args | PrivacyArguments | Required | Configuration settings specific to differential privacy.
train_dataset | Dataset | Required | The dataset used for training the model.
dataset_text_field | str | Optional | The field name in the dataset containing the text. Default is "text".
eval_dataset | Optional[Dataset] | Optional | The dataset used for evaluation during training.
max_seq_length | int | Optional | Maximum sequence length for input processing. Default is 1024.

Example Usage

from transformers import TrainingArguments
from dynamofl.privacy import PrivacyArguments, DPMultiGPUSFTTrainer

# Placeholders: substitute your own model, tokenizer, and dataset loading.
from your_model_loading_function import model, tokenizer
from your_dataset_loading_function import train_dataset, eval_dataset

train_args = TrainingArguments(output_dir="./model_output", num_train_epochs=3)
privacy_args = PrivacyArguments(target_epsilon=5.0, per_sample_max_grad_norm=1.0)

trainer = DPMultiGPUSFTTrainer(
    model=model,
    args=train_args,
    tokenizer=tokenizer,
    privacy_args=privacy_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    dataset_text_field="text",
    max_seq_length=1024,
)

trainer.train()

Personally Identifiable Information (PII)

Method find_pii()

Find PII entities in a string, a list of strings, or a HuggingFace dataset.

Supported Models

Model | Support
Transformers (token classification) | ✓
Flair | ✓
spaCy | ✓

Parameters

Param | Type | Required? | Description
text | Union[str, List[str], Dataset] | | Input text to analyze for PII entities.
model_type | str | | The type of NER model to use. Supported models are: 'flair', 'spacy', 'transformers'.
model_config | Dict[str, str] | | Configuration for NER models:
  • lang_code (str): Language code for the spaCy model.
  • model_name (str): Name of the model.
entity_types | List[str] | | List of PII categories to search for.
unique_anonymization | Union[List[str], bool] | | List of PII categories to apply unique anonymization to; if True, all PII categories are uniquely anonymized.
chunk_size | int | | Chunk size for the NER model.
no_redact | List[str] | | A list of PII categories to not redact. These categories will be replaced with their original values.
custom_entity_config | Dict[str, Union[str, List[str], float, Callable]] | | Configuration for custom entities (see the sketch after this table):
  • entity_type (str): The type of custom entity.
  • recognizer_type (str): The type of recognizer ('regex' or 'deny-list').
  • deny_list (List[str]): A list of patterns to deny-list if using the 'deny-list' recognizer.
  • regex (str): A regular expression pattern if using the 'regex' recognizer.
  • score (float): Expected confidence level for this recognizer.
  • redacted_text_callback (Callable): A callback function if using custom redacted text.
dataset_config | Dict[str, Union[str, List[str]]] | | Configuration for HuggingFace and custom datasets:
  • text_column (str; HuggingFace): The name of the column containing text.
  • split_column (str or List[str]; HuggingFace and custom): The column used for dataset splitting; a string for a HuggingFace dataset, a list of strings for a custom dataset.
  • train_name (str; HuggingFace): The name of the training split in the dataset.
  • val_name (str; HuggingFace): The name of the validation split in the dataset.
  • test_name (str; HuggingFace): The name of the testing split in the dataset.
entity_mapping | Dict[str, str] | | Mapping of entity types to custom entity types.
dynamofl_pii_config | Dict[str, Union[str, bool]] | | Configuration for DynamoFL's PII model:
  • llm_endpoint (str): Endpoint for the language model.
  • system_prompt (str): System prompt for the language model.
  • use_redacted_text (bool): Whether to use redacted text for the language model.
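
To make custom_entity_config concrete, here is a minimal, hypothetical sketch that registers a regex recognizer for an internal employee-ID format; the entity name, pattern, and score are illustrative, not SDK defaults:

# Illustrative custom entity: match IDs such as "EMP-123456" with a regex recognizer.
custom_entity_config = {
    "entity_type": "EMPLOYEE_ID",   # hypothetical custom entity name
    "recognizer_type": "regex",     # use the regex-based recognizer
    "regex": r"EMP-\d{6}",          # illustrative pattern: "EMP-" plus six digits
    "score": 0.8,                   # expected confidence level for matches
}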

Returns

A dictionary with the following keys:

  1. If the text is of type str:
    • redacted_text (str): The redacted string.
    • redacted_entities (dict): Redacted entities.
    • redacted_entity_positions (list of tuples): Positions of redacted entities.
    • redacted_entity_counts (dict): Counts of redacted entities.
    • entity_types_summary (dict): Summary of entity types.
  2. If the text is of type List[str] or Dataset (HuggingFace dataset):
    • redacted_text (str): The redacted string.
    • redacted_entities (dict): Redacted entities.
    • redacted_entity_counts (dict): Counts of redacted entities.
    • entity_types_summary (dict): Summary of entity types.
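
Example Usage

A minimal sketch of calling find_pii on a single string with a transformers NER model; the model name and entity categories below are illustrative assumptions, not SDK defaults.

from dynamofl.privacy import find_pii

# Scan one string with a transformers token-classification model.
result = find_pii(
    text="Contact Jane Doe at jane.doe@example.com.",
    model_type="transformers",
    model_config={"model_name": "dslim/bert-base-NER"},  # illustrative model
    entity_types=["PERSON", "EMAIL_ADDRESS"],            # illustrative categories
)

print(result["redacted_text"])           # the input with PII replaced
print(result["redacted_entity_counts"])  # counts of redacted entities by type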