Privacy SDK Library
The dynamofl.privacy SDK provides methods to defend models against privacy vulnerabilities.
from dynamofl.privacy import *
Differential Privacy
Class DPTrainer
Release Notes
Introduces the DynamoFL Trainer, designed for differentially private training on single-GPU setups and incorporating state-of-the-art privacy techniques to safeguard training data.
Supported Features
Feature | Support |
---|---|
Differential Privacy | ✅ |
BitsAndBytes Training | ✅ |
bf16 Training | ✅ |
PEFT and LoRA | ✅ |
Methods
__init__(self, model, tokenizer, *args, privacy_args, train_dataset, eval_dataset=None, **kwargs)
Constructs the DPTrainer, integrating differential privacy settings and initializing training attributes.
Parameters
Param | Type | Required? | Description |
---|---|---|---|
model | Union[PreTrainedModel, torch.nn.modules.module.Module, Any] | ✅ | The model to be trained. |
tokenizer | PreTrainedTokenizerBase | ✅ | The tokenizer used for processing the input data. |
args | TrainingArguments | ✅ | Training arguments for configuring the training process. |
privacy_args | PrivacyArguments | ✅ | Configuration settings specific to differential privacy. |
train_dataset | Dataset | ✅ | The dataset used for training the model. |
eval_dataset | Optional[Dataset] | Optional | The dataset used for evaluation during training. |
**kwargs | - | Optional | Additional keyword arguments for customization. |
Example Usage
from transformers import TrainingArguments
from dynamofl.privacy import PrivacyArguments, DPTrainer
from your_model_loading_function import model, tokenizer
train_args = TrainingArguments(output_dir="./model_output", num_train_epochs=3)
privacy_args = PrivacyArguments(target_epsilon=5.0, per_sample_max_grad_norm=1.0)
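# train_dataset and eval_dataset are assumed to be prepared HuggingFace Dataset objects (see the Parameters table above)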
trainer = DPTrainer(
model=model,
tokenizer=tokenizer,
args=train_args,
privacy_args=privacy_args,
train_dataset=train_dataset,
eval_dataset=eval_dataset
)
trainer.train()
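Since the feature table above lists PEFT and LoRA support, a LoRA-wrapped model can be passed to DPTrainer in place of the base model. The sketch below uses the peft library; the rank, alpha, dropout, and target module names are illustrative assumptions rather than values required by the SDK.
from peft import LoraConfig, get_peft_model
# Illustrative LoRA settings; r, lora_alpha, lora_dropout, and target_modules are assumptions.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
)
model = get_peft_model(model, lora_config)
# The wrapped model is then passed to DPTrainer exactly as in the example above.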
Class DPSFTTrainer
Release Notes
Introduces the DynamoFL Supervised Fine-Tuning Trainer, tailored for differential privacy to enable secure and private model fine-tuning.
Supported Features
Feature | Support |
---|---|
Supervised Fine-Tuning | ✅ |
Differential Privacy | ✅ |
BitsAndBytes Training | ✅ |
bf16 Training | ✅ |
PEFT and LoRA | ✅ |
Methods
__init__(self, model, tokenizer, *, args, privacy_args, train_dataset, dataset_text_field="text", eval_dataset=None, max_seq_length=1024)
Constructs the DPSFTTrainer with differential privacy configurations and initializes additional attributes.
Parameters
Param | Type | Required? | Description |
---|---|---|---|
model | Union[PreTrainedModel, torch.nn.modules.module.Module, Any] | ✅ | The model to be trained. |
tokenizer | PreTrainedTokenizerBase | ✅ | The tokenizer used for processing the input data. |
args | TrainingArguments | ✅ | Training arguments for configuring the training process. |
privacy_args | PrivacyArguments | ✅ | Configuration settings specific to differential privacy. |
train_dataset | Dataset | ✅ | The dataset used for training the model. |
dataset_text_field | str | Optional | The field name in the dataset containing the text. Default is "text". |
eval_dataset | Optional[Dataset] | Optional | The dataset used for evaluation during training. |
max_seq_length | int | Optional | Maximum sequence length for input processing. Default is 1024. |
Example Usage
from transformers import TrainingArguments
from dynamofl.privacy import PrivacyArguments, DPSFTTrainer
from your_model_loading_function import model, tokenizer
train_args = TrainingArguments(output_dir="./model_output", num_train_epochs=3)
privacy_args = PrivacyArguments(target_epsilon=5.0, per_sample_max_grad_norm=1.0)
trainer = DPSFTTrainer(
model=model,
args=train_args,
tokenizer=tokenizer,
privacy_args=privacy_args,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
dataset_text_field="text",
max_seq_length=1024,
)
trainer.train()
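The dataset_text_field parameter expects the training dataset to expose a text column. The sketch below builds a toy dataset with the HuggingFace datasets library; the column values are placeholders, and Dataset.from_dict is just one convenient way to construct such a dataset.
from datasets import Dataset
# Toy dataset whose "text" column matches dataset_text_field="text".
train_dataset = Dataset.from_dict({
    "text": [
        "The quick brown fox jumps over the lazy dog.",
        "Differential privacy limits what a model can memorize about any single record.",
    ]
})
eval_dataset = Dataset.from_dict({"text": ["A short held-out sentence for evaluation."]})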
Class DPMultiGPUTrainer
Release Notes
Introduces the DynamoFL Multi-GPU Differentially Private Trainer, enhanced for multi-GPU environments and supporting a variety of training optimizations including BitsAndBytes, bf16, PEFT, LoRA, DeepSpeed, and Mixture of Quantization.
Supported Features
Feature | ZeRO Stage 1 | ZeRO Stage 2 | ZeRO Stage 3 |
---|---|---|---|
Multi-GPU Training | ✅ | ✅ | ✅ |
Differential Privacy | ✅ | ✅ | ✅ |
BitsAndBytes Training | ✅ | ✅ | ❌ |
bf16 Training | ✅ | ✅ | ✅ |
PEFT and LoRA | ✅ | ✅ | ✅ |
DeepSpeed Integration | ✅ | ✅ | ✅ |
Methods
__init__(self, model, tokenizer, *args, privacy_args, train_dataset, eval_dataset=None, **kwargs)
Constructs the DPMultiGPUTrainer, integrating differential privacy settings and initializing training attributes.
Parameters
Param | Type | Required? | Description |
---|---|---|---|
model | Union[PreTrainedModel, torch.nn.modules.module.Module, Any] | ✅ | The model to be trained. |
tokenizer | PreTrainedTokenizerBase | ✅ | The tokenizer used for processing the input data. |
args | TrainingArguments | ✅ | Training arguments for configuring the training process, optionally including a DeepSpeed configuration. |
privacy_args | PrivacyArguments | ✅ | Configuration settings specific to differential privacy. |
train_dataset | Dataset | ✅ | The dataset used for training the model. |
eval_dataset | Optional[Dataset] | Optional | The dataset used for evaluation during training. |
**kwargs | - | Optional | Additional keyword arguments for customization. |
Example Usage
from transformers import TrainingArguments
from dynamofl.privacy import PrivacyArguments, DPMultiGPUTrainer
from your_model_loading_function import model, tokenizer
train_args = TrainingArguments(output_dir="./model_output", num_train_epochs=3)
privacy_args = PrivacyArguments(target_epsilon=5.0, per_sample_max_grad_norm=1.0)
trainer = DPMultiGPUTrainer(
model=model,
tokenizer=tokenizer,
args=train_args,
privacy_args=privacy_args,
train_dataset=train_dataset,
eval_dataset=eval_dataset
)
trainer.train()
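As noted in the Parameters table, the args passed to DPMultiGPUTrainer can optionally carry a DeepSpeed configuration; with transformers.TrainingArguments this is done through its standard deepspeed field. The config file name and launcher commands below are assumptions for illustration, not paths or flags mandated by the SDK.
from transformers import TrainingArguments
# "ds_config_zero2.json" is a placeholder path to a DeepSpeed ZeRO config file.
train_args = TrainingArguments(
    output_dir="./model_output",
    num_train_epochs=3,
    bf16=True,
    deepspeed="ds_config_zero2.json",
)
# Launch the training script with a distributed launcher, for example:
#   deepspeed --num_gpus=4 train.py
#   torchrun --nproc_per_node=4 train.py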
Class DPMultiGPUSFTTrainer
Release Notes
Introduces the DynamoFL Multi-GPU Differentially Private Supervised Fine-Tuning Trainer, optimized for multi-GPU setups with enhanced privacy features and support for various training optimizations.
Supported Features
Feature | ZeRO Stage 1 | ZeRO Stage 2 | ZeRO Stage 3 |
---|---|---|---|
Multi-GPU Training | ✅ | ✅ | ✅ |
Differential Privacy | ✅ | ✅ | ✅ |
Supervised Fine-Tuning | ✅ | ✅ | ✅ |
BitsAndBytes Training | ✅ | ✅ | ❌ |
bf16 Training | ✅ | ✅ | ✅ |
PEFT and LoRA | ✅ | ✅ | ✅ |
DeepSpeed Integration | ✅ | ✅ | ✅ |
Methods
__init__(self, model, tokenizer, *, args, privacy_args, train_dataset, dataset_text_field="text", eval_dataset=None, max_seq_length=1024)
Constructs the DPMultiGPUSFTTrainer with differential privacy configurations and initializes additional attributes.
Parameters
Param | Type | Required? | Description |
---|---|---|---|
model | Union[PreTrainedModel, torch.nn.modules.module.Module, Any] | ✅ | The model to be trained. |
tokenizer | PreTrainedTokenizerBase | ✅ | The tokenizer used for processing the input data. |
args | TrainingArguments | ✅ | Training arguments for configuring the training process. |
privacy_args | PrivacyArguments | ✅ | Configuration settings specific to differential privacy. |
train_dataset | Dataset | ✅ | The dataset used for training the model. |
dataset_text_field | str | Optional | The field name in the dataset containing the text. Default is "text". |
eval_dataset | Optional[Dataset] | Optional | The dataset used for evaluation during training. |
max_seq_length | int | Optional | Maximum sequence length for input processing. Default is 1024. |
Example Usage
from transformers import TrainingArguments
from dynamofl.privacy import PrivacyArguments, DPMultiGPUSFTTrainer
from your_model_loading_function import model, tokenizer
train_args = TrainingArguments(output_dir="./model_output", num_train_epochs=3)
privacy_args = PrivacyArguments(target_epsilon=5.0, per_sample_max_grad_norm=1.0)
trainer = DPMultiGPUSFTTrainer(
model=model,
args=train_args,
tokenizer=tokenizer,
privacy_args=privacy_args,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
dataset_text_field="text",
max_seq_length=1024,
)
trainer.train()
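BitsAndBytes training (supported at ZeRO Stage 1 and 2 in the table above, but not Stage 3) generally means loading the base model in quantized form before handing it to the trainer. The sketch below uses the transformers BitsAndBytesConfig; the model name is a placeholder and the quantization settings are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "your-base-model"  # placeholder checkpoint name
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit quantization via bitsandbytes
    bnb_4bit_compute_dtype=torch.bfloat16,  # pairs with bf16 training support
)
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# model and tokenizer are then passed to DPMultiGPUSFTTrainer as in the example above.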
Personally Identifiable Information (PII)
Method find_pii()
Finds PII entities in a string, a list of strings, or a HuggingFace dataset.
Supported Models
Model | Support |
---|---|
Transformers (token classification) | ✅ |
Flair | ✅ |
spaCy | ✅ |
Parameters
Param | Type | Required? | Description |
---|---|---|---|
text | Union[str, List[str], Dataset] | ✅ | Input text to analyze for PII entities. |
model_type | str | ✅ | The type of NER model to use. Supported models are: 'flair', 'spacy', 'transformers'. |
model_config | Dict[str, str] | ❌ | Configuration for NER models. |
entity_types | List[str] | ❌ | List of PII categories to search for. |
unique_anonymization | Union[bool, List[str]] | ❌ | List of PII categories to apply unique anonymization to; if set to True, all PII categories are uniquely anonymized. |
chunk_size | int | ❌ | Chunk size for NER model. |
no_redact | List[str] | ❌ | A list of PII categories not to redact; entities in these categories are kept as their original values. |
custom_entity_config | Dict[str, Union[str, List[str], float, Callable]] | ❌ | Configuration for custom entities. |
dataset_config | Dict[str, Union[str, List[str]]] | ❌ | Configuration for HuggingFace and custom datasets. |
entity_mapping | Dict[str, str] | ❌ | Mapping of entity types to custom entity types. |
dynamofl_pii_config | Dict[str, Union[str, bool]] | ❌ | Configuration for DynamoFL's PII model. |
Returns
A dictionary with the following keys:
- If the text is of type str:
  - redacted_text (str): The redacted string.
  - redacted_entities (dict): Redacted entities.
  - redacted_entity_positions (list of tuples): Positions of redacted entities.
  - redacted_entity_counts (dict): Counts of redacted entities.
  - entity_types_summary (dict): Summary of entity types.
- If the text is of type List[str] or Dataset (HuggingFace dataset):
  - redacted_text (str): The redacted string.
  - redacted_entities (dict): Redacted entities.
  - redacted_entity_counts (dict): Counts of redacted entities.
  - entity_types_summary (dict): Summary of entity types.
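Example Usage
A minimal sketch of calling find_pii on a single string. It assumes find_pii is importable directly from dynamofl.privacy (as the wildcard import at the top of this page suggests); the choice of the spaCy backend and the entity type names are illustrative assumptions.
from dynamofl.privacy import find_pii
text = "John Smith emailed jane.doe@example.com from his office in Berlin."
result = find_pii(
    text=text,
    model_type="spacy",                            # one of: 'flair', 'spacy', 'transformers'
    entity_types=["PERSON", "EMAIL", "LOCATION"],  # assumed category names
)
print(result["redacted_text"])           # the redacted string
print(result["redacted_entity_counts"])  # counts of redacted entities per category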