PIRATES

Data Protection 

Data Protection.jpg

Every feature we collect data for will be classified as one of the following four types:


Personally Identifiable Information (PII)

PII 3.jpg

Such features can directly identify a volunteer, e.g. phone numbers, email addresses, etc.

During the data anonymization process, we will drop all PII features. In some cases we might keep one or two PII fields to uniquely identify a user, but these will never be stored ‘as is’ in the backend data stores; if we use them at all, they will be securely hashed before storage.
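
As a rough sketch of what hashing a retained PII field before storage could look like (assuming Python and a keyed SHA-256 hash; the key handling and function name here are illustrative, not our actual implementation):

import hashlib
import hmac
import os

# Illustrative only: in practice the key would come from a secrets manager,
# never from code or a default environment value.
PII_HASH_KEY = os.environ.get("PII_HASH_KEY", "change-me").encode()

def hash_pii(value: str) -> str:
    # Keyed hash of a PII value (e.g. an email address) so the raw value
    # is never written to the backend data stores.
    return hmac.new(PII_HASH_KEY, value.strip().lower().encode(), hashlib.sha256).hexdigest()

# The hashed value can still act as a stable, unique key for a user.
user_key = hash_pii("volunteer@example.org")

The keyed hash gives us a stable identifier for joining records about the same user without ever keeping the raw email or phone number around.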


Quasi Identifiers (QI)

QIs cannot identify a user directly, but several QIs combined with each other and with external data sources can uniquely identify a user. Most demographic fields, such as zip code, gender, and age, are examples of QIs.

During the data anonymization process we will bucket QIs into broader groups so that combinations of QIs can no longer single out a user. For example, instead of recording a user’s exact age as 38, we might record an age range of 30 - 40.

QII.jpg
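
A minimal sketch of this kind of QI generalization, assuming Python/pandas and illustrative column names:

import pandas as pd

# Hypothetical volunteer records; the column names are illustrative.
df = pd.DataFrame({"age": [38, 24, 61], "zip_code": ["95014", "10001", "60601"]})

# Generalize exact ages into 10-year buckets, e.g. 38 -> "30 - 40".
df["age_range"] = pd.cut(
    df["age"],
    bins=range(0, 101, 10),
    labels=[f"{lo} - {lo + 10}" for lo in range(0, 100, 10)],
)

# Coarsen zip codes to the first three digits so they cover a wider area.
df["zip_prefix"] = df["zip_code"].str[:3]

# Drop the precise QI columns before anything reaches the backend stores.
df = df.drop(columns=["age", "zip_code"])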

Sensitive Identifiers (SIs)

While SIs don’t help in identifying a user, they do expose sensitive information about them, e.g. a user being HIV+. Such features are rarely useful and will never be stored in our backend data stores.

Non-Sensitive Features

Such features carry no sensitive or identifying information about a user and can be safely used in an application.
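
Pulling the four types together, here is a simplified sketch of the per-feature anonymization rules (the feature names and mapping below are made up for illustration; the real list would come from our data dictionary):

# Illustrative mapping of features to the four types above.
FEATURE_TYPES = {
    "email": "PII",
    "phone_number": "PII",
    "age": "QI",
    "zip_code": "QI",
    "medical_condition": "SI",
    "preferred_language": "non_sensitive",
}

def anonymize_record(record: dict) -> dict:
    # Drop PII and SIs, generalize QIs, keep non-sensitive features as-is.
    out = {}
    for name, value in record.items():
        kind = FEATURE_TYPES.get(name, "SI")  # unknown features treated conservatively
        if kind == "non_sensitive":
            out[name] = value
        elif kind == "QI" and name == "age":
            lo = (int(value) // 10) * 10
            out["age_range"] = f"{lo} - {lo + 10}"   # e.g. 38 -> "30 - 40"
        elif kind == "QI" and name == "zip_code":
            out["zip_prefix"] = str(value)[:3]       # coarser location
        # PII and SI features are simply not copied into the output.
    return out

anonymize_record({"email": "v@example.org", "age": 38, "zip_code": "95014", "preferred_language": "en"})
# -> {"age_range": "30 - 40", "zip_prefix": "950", "preferred_language": "en"}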


Aggregate Data Protection

In addition to how we treat individual features, there are techniques we can apply to the data set as a whole to protect it from hacks and leaks.


Randomize Ordering

randomize data.jpg

We will randomize the ordering of data records so that record order cannot be used to draw inferences about users.
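
As a small sketch (assuming Python; the record structure is illustrative), the shuffle itself is straightforward:

import random

def randomize_order(records: list) -> list:
    # Shuffle a copy so that storage order reveals nothing about when,
    # where, or in what grouping a record was collected.
    shuffled = list(records)
    random.shuffle(shuffled)
    return shuffled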


Differential Privacy

We will introduce a small amount of noise into the data set. This helps throw off attackers trying to reverse engineer our predictions in order to identify users.

Care will be taken to ensure that the noise we introduce does not meaningfully affect our ML models.

Differencial Privacy.png
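
A rough sketch of the kind of noise injection meant here, assuming Python/NumPy and Laplace noise (the standard mechanism in differential privacy); the sensitivity and epsilon values are illustrative and would be tuned so the impact on our ML stays negligible:

import numpy as np

def add_laplace_noise(values, sensitivity=1.0, epsilon=1.0):
    # Laplace noise with scale sensitivity/epsilon; a smaller epsilon
    # means stronger privacy but more noise.
    values = np.asarray(values, dtype=float)
    return values + np.random.laplace(loc=0.0, scale=sensitivity / epsilon, size=values.shape)

# Example: perturb an aggregate count before it is released or stored.
noisy_count = add_laplace_noise([1523], sensitivity=1.0, epsilon=0.5)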

Federated Learning

FederatedLearning_FinalFiles_Flow+Chart1.jpg

Instead of fitting an ML model to our entire data set, we can fit models on small batches of data and stitch the results back together to get an overall prediction.

The process that combines results from the smaller data subsets will also introduce some noise, similar to Differential Privacy.
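
A toy sketch of this "fit on subsets, then combine" idea, assuming Python/NumPy and a simple least-squares model; the data split, model choice, and noise scale are illustrative rather than our actual training setup:

import numpy as np

def fit_subset(X, y):
    # Fit a simple least-squares model on one subset of the data.
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def combine(coefs, noise_scale=0.01):
    # Average the per-subset models and add a little noise, as described above.
    avg = np.mean(coefs, axis=0)
    return avg + np.random.normal(scale=noise_scale, size=avg.shape)

# Hypothetical data already split into subsets (e.g. per region or per batch).
rng = np.random.default_rng(0)
subsets = [(rng.normal(size=(50, 3)), rng.normal(size=50)) for _ in range(4)]
global_coef = combine([fit_subset(X, y) for X, y in subsets])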