Privacy SDK Library
The dynamofl.privacy SDK provides methods to defend models against privacy vulnerabilities.
from dynamofl.privacy import *
Differential Privacy
Class DPTrainer
Release Notes
Introduces the DynamoFL Trainer, designed for differentially private training on single-GPU setups and incorporating state-of-the-art privacy techniques to safeguard training data.
Supported Features
Feature | Support |
---|---|
Differential Privacy | ✅ |
BitsAndBytes Training | ✅ |
bf16 Training | ✅ |
PEFT and LoRA | ✅ |
Methods
__init__(self, model, tokenizer, *args, privacy_args, train_dataset, eval_dataset=None, **kwargs)
Constructs the DPTrainer, integrating differential privacy settings and initializing training attributes.
Parameters
Param | Type | Required? | Description |
---|---|---|---|
model | Union[PreTrainedModel, torch.nn.modules.module.Module, Any] | ✅ | The model to be trained. |
tokenizer | PreTrainedTokenizerBase | ✅ | The tokenizer used for processing the input data. |
args | TrainingArguments | ✅ | Training arguments for configuring the training process. |
privacy_args | PrivacyArguments | ✅ | Configuration settings specific to differential privacy. |
train_dataset | Dataset | ✅ | The dataset used for training the model. |
eval_dataset | Optional[Dataset] | Optional | The dataset used for evaluation during training. |
**kwargs | - | Optional | Additional keyword arguments for customization. |
Example Usage
from transformers import TrainingArguments
from dynamofl.privacy import PrivacyArguments, DPTrainer
from your_model_loading_function import model, tokenizer
train_args = TrainingArguments(output_dir="./model_output", num_train_epochs=3)
privacy_args = PrivacyArguments(target_epsilon=5.0, per_sample_max_grad_norm=1.0)
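# train_dataset and eval_dataset are assumed to be prepared HuggingFace Dataset objects (see the Parameters table above)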
trainer = DPTrainer(
model=model,
tokenizer=tokenizer,
args=train_args,
privacy_args=privacy_args,
train_dataset=train_dataset,
eval_dataset=eval_dataset
)
trainer.train()
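Since the feature table above lists PEFT and LoRA support, a LoRA-wrapped model can be passed to DPTrainer in place of the base model. The sketch below uses the peft library; the rank, alpha, dropout, and target module names are illustrative assumptions rather than values required by the SDK.
from peft import LoraConfig, get_peft_model
# Illustrative LoRA settings; r, lora_alpha, lora_dropout, and target_modules are assumptions.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
)
model = get_peft_model(model, lora_config)
# The wrapped model is then passed to DPTrainer exactly as in the example above.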
Class DPSFTTrainer
Release Notes
Introduces the DynamoFL Supervised Fine-Tuning Trainer, tailored for differential privacy to enable secure and private model fine-tuning.
Supported Features
Feature | Support |
---|---|
Supervised Fine-Tuning | ✅ |
Differential Privacy | ✅ |
BitsAndBytes Training | ✅ |
bf16 Training | ✅ |
PEFT and LoRA | ✅ |
Methods
__init__(self, model, tokenizer, *, args, privacy_args, train_dataset, dataset_text_field="text", eval_dataset=None, max_seq_length=1024)
Constructs the DPSFTTrainer with differential privacy configurations and initializes additional attributes.
Parameters
Param | Type | Required? | Description |
---|---|---|---|
model | Union[PreTrainedModel, torch.nn.modules.module.Module, Any] | ✅ | The model to be trained. |
tokenizer | PreTrainedTokenizerBase | ✅ | The tokenizer used for processing the input data. |
args | TrainingArguments | ✅ | Training arguments for configuring the training process. |
privacy_args | PrivacyArguments | ✅ | Configuration settings specific to differential privacy. |
train_dataset | Dataset | ✅ | The dataset used for training the model. |
dataset_text_field | str | Optional | The field name in the dataset containing the text. Default is "text". |
eval_dataset | Optional[Dataset] | Optional | The dataset used for evaluation during training. |
max_seq_length | int | Optional | Maximum sequence length for input processing. Default is 1024. |
Example Usage
from transformers import TrainingArguments
from dynamofl.privacy import PrivacyArguments, DPSFTTrainer
from your_model_loading_function import model, tokenizer
train_args = TrainingArguments(output_dir="./model_output", num_train_epochs=3)
privacy_args = PrivacyArguments(target_epsilon=5.0, per_sample_max_grad_norm=1.0)
trainer = DPSFTTrainer(
model=model,
args=train_args,
tokenizer=tokenizer,
privacy_args=privacy_args,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
dataset_text_field="text",
max_seq_length=1024,
)
trainer.train()
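The dataset_text_field parameter expects the training dataset to expose a text column. The sketch below builds a toy dataset with the HuggingFace datasets library; the column values are placeholders, and Dataset.from_dict is just one convenient way to construct such a dataset.
from datasets import Dataset
# Toy dataset whose "text" column matches dataset_text_field="text".
train_dataset = Dataset.from_dict({
    "text": [
        "The quick brown fox jumps over the lazy dog.",
        "Differential privacy limits what a model can memorize about any single record.",
    ]
})
eval_dataset = Dataset.from_dict({"text": ["A short held-out sentence for evaluation."]})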
Class DPMultiGPUTrainer
Release Notes
Introduces the DynamoFL Multi-GPU Differentially Private Trainer, enhanced for multi-GPU environments and supporting a variety of training optimizations including BitsAndBytes, bf16, PEFT, LoRA, DeepSpeed, and Mixture of Quantization.
Supported Features
Feature | ZeRO Stage 1 | ZeRO Stage 2 | ZeRO Stage 3 |
---|---|---|---|
Multi-GPU Training | ✅ | ✅ | ✅ |
Differential Privacy | ✅ | ✅ | ✅ |
BitsAndBytes Training | ✅ | ✅ | ❌ |
bf16 Training | ✅ | ✅ | ✅ |
PEFT and LoRA | ✅ | ✅ | ✅ |
DeepSpeed Integration | ✅ | ✅ | ✅ |
Methods
__init__(self, model, tokenizer, *args, privacy_args, train_dataset, eval_dataset=None, **kwargs)
Constructs the DPMultiGPUTrainer, integrating differential privacy settings and initializing training attributes.
Parameters
Param | Type | Required? | Description |
---|---|---|---|
model | Union[PreTrainedModel, torch.nn.modules.module.Module, Any] | ✅ | The model to be trained. |
tokenizer | PreTrainedTokenizerBase | ✅ | The tokenizer used for processing the input data. |
args | TrainingArguments | ✅ | Training arguments for configuring the training process, optionally including a DeepSpeed configuration. |
privacy_args | PrivacyArguments | ✅ | Configuration settings specific to differential privacy. |
train_dataset | Dataset | ✅ | The dataset used for training the model. |
eval_dataset | Optional[Dataset] | Optional | The dataset used for evaluation during training. |
**kwargs | - | Optional | Additional keyword arguments for customization. |
Example Usage
from transformers import TrainingArguments
from dynamofl.privacy import PrivacyArguments, DPMultiGPUTrainer
from your_model_loading_function import model, tokenizer
train_args = TrainingArguments(output_dir="./model_output", num_train_epochs=3)
privacy_args = PrivacyArguments(target_epsilon=5.0, per_sample_max_grad_norm=1.0)
trainer = DPMultiGPUTrainer(
model=model,
tokenizer=tokenizer,
args=train_args,
privacy_args=privacy_args,
train_dataset=train_dataset,
eval_dataset=eval_dataset
)
trainer.train()
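As noted in the Parameters table, the args passed to DPMultiGPUTrainer can optionally carry a DeepSpeed configuration; with transformers.TrainingArguments this is done through its standard deepspeed field. The config file name and launcher commands below are assumptions for illustration, not paths or flags mandated by the SDK.
from transformers import TrainingArguments
# "ds_config_zero2.json" is a placeholder path to a DeepSpeed ZeRO config file.
train_args = TrainingArguments(
    output_dir="./model_output",
    num_train_epochs=3,
    bf16=True,
    deepspeed="ds_config_zero2.json",
)
# Launch the training script with a distributed launcher, for example:
#   deepspeed --num_gpus=4 train.py
#   torchrun --nproc_per_node=4 train.py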
Class DPMultiGPUSFTTrainer
Release Notes
Introduces the DynamoFL Multi-GPU Differentially Private Supervised Fine-Tuning Trainer, optimized for multi-GPU setups with enhanced privacy features and support for various training optimizations.
Supported Features
Feature | ZeRO Stage 1 | ZeRO Stage 2 | ZeRO Stage 3 |
---|---|---|---|
Multi-GPU Training | ✅ | ✅ | ✅ |
Differential Privacy | ✅ | ✅ | ✅ |
Supervised Fine-Tuning | ✅ | ✅ | ✅ |
BitsAndBytes Training | ✅ | ✅ | ❌ |
bf16 Training | ✅ | ✅ | ✅ |
PEFT and LoRA | ✅ | ✅ | ✅ |
DeepSpeed Integration | ✅ | ✅ | ✅ |
Methods
__init__(self, model, tokenizer, *, args, privacy_args, train_dataset, dataset_text_field="text", eval_dataset=None, max_seq_length=1024)
Constructs the DPMultiGPUSFTTrainer with differential privacy configurations and initializes additional attributes.
Parameters
Param | Type | Required? | Description |
---|---|---|---|
model | Union[PreTrainedModel, torch.nn.modules.module.Module, Any] | ✅ | The model to be trained. |
tokenizer | PreTrainedTokenizerBase | ✅ | The tokenizer used for processing the input data. |
args | TrainingArguments | ✅ | Training arguments for configuring the training process. |
privacy_args | PrivacyArguments | ✅ | Configuration settings specific to differential privacy. |
train_dataset | Dataset | ✅ | The dataset used for training the model. |
dataset_text_field | str | Optional | The field name in the dataset containing the text. Default is "text". |
eval_dataset | Optional[Dataset] | Optional | The dataset used for evaluation during training. |
max_seq_length | int | Optional | Maximum sequence length for input processing. Default is 1024. |
Example Usage
from transformers import TrainingArguments
from dynamofl.privacy import PrivacyArguments, DPMultiGPUSFTTrainer
from your_model_loading_function import model, tokenizer
train_args = TrainingArguments(output_dir="./model_output", num_train_epochs=3)
privacy_args = PrivacyArguments(target_epsilon=5.0, per_sample_max_grad_norm=1.0)
trainer = DPMultiGPUSFTTrainer(
model=model,
args=train_args,
tokenizer=tokenizer,
privacy_args=privacy_args,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
dataset_text_field="text",
max_seq_length=1024,
)
trainer.train()
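BitsAndBytes training (supported at ZeRO Stage 1 and 2 in the table above, but not Stage 3) generally means loading the base model in quantized form before handing it to the trainer. The sketch below uses the transformers BitsAndBytesConfig; the model name is a placeholder and the quantization settings are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "your-base-model"  # placeholder checkpoint name
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit quantization via bitsandbytes
    bnb_4bit_compute_dtype=torch.bfloat16,  # pairs with bf16 training support
)
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# model and tokenizer are then passed to DPMultiGPUSFTTrainer as in the example above.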
Personally Identifiable Information (PII)
Method find_pii()
Finds PII entities in a string, a list of strings, or a HuggingFace dataset.
Supported Models
Model | Support |
---|---|
Transformers (token classification) | ✅ |
Flair | ✅ |
spaCy | ✅ |
Parameters
Param | Type | Required? | Description |
---|---|---|---|
text | Union[str, List[str], Dataset] | ✅ | Input text to analyze for PII entities. |
model_type | str | ✅ | The type of NER model to use. Supported models are: 'flair', 'spacy', 'transformers'. |
model_config | Dict[str, str] | ❌ | Configuration for NER models. |
entity_types | List[str] | ❌ | List of PII categories to search for. |
unique_anonymization | Union[bool, List[str]] | ❌ | List of PII categories to apply unique anonymization to; if set to True, all PII categories are uniquely anonymized. |
chunk_size | int | ❌ | Chunk size for NER model. |
no_redact | List[str] | ❌ | A list of PII categories not to redact; entities in these categories are kept as their original values. |
custom_entity_config | Dict[str, Union[str, List[str], float, Callable]] | ❌ | Configuration for custom entities. |
dataset_config | Dict[str, Union[str, List[str]]] | ❌ | Configuration for HuggingFace and custom datasets. |
entity_mapping | Dict[str, str] | ❌ | Mapping of entity types to custom entity types. |
dynamofl_pii_config | Dict[str, Union[str, bool]] | ❌ | Configuration for DynamoFL's PII model. |
Returns
A dictionary with the following keys:
- If the text is of type str:
  - redacted_text (str): The redacted string.
  - redacted_entities (dict): Redacted entities.
  - redacted_entity_positions (list of tuples): Positions of redacted entities.
  - redacted_entity_counts (dict): Counts of redacted entities.
  - entity_types_summary (dict): Summary of entity types.
- If the text is of type List[str] or Dataset (HuggingFace dataset):
  - redacted_text (str): The redacted string.
  - redacted_entities (dict): Redacted entities.
  - redacted_entity_counts (dict): Counts of redacted entities.
  - entity_types_summary (dict): Summary of entity types.
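Example Usage
A minimal sketch of calling find_pii on a single string. It assumes find_pii is importable directly from dynamofl.privacy (as the wildcard import at the top of this page suggests); the choice of the spaCy backend and the entity type names are illustrative assumptions.
from dynamofl.privacy import find_pii
text = "John Smith emailed jane.doe@example.com from his office in Berlin."
result = find_pii(
    text=text,
    model_type="spacy",                            # one of: 'flair', 'spacy', 'transformers'
    entity_types=["PERSON", "EMAIL", "LOCATION"],  # assumed category names
)
print(result["redacted_text"])           # the redacted string
print(result["redacted_entity_counts"])  # counts of redacted entities per category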