How to use Python Faker Module for Data Privacy (Data Masking/Redaction)
Faker is a Python package that generates fake data such as names, addresses, phone numbers, dates, SSN and other personally identifiable information (PII).
This module can be very useful in generating realistic data for various use cases, including cyber deception. Deception means steering attackers toward the wrong targets, and planting Faker-generated data can reduce the value of what they collect through open-source intelligence (OSINT).
In this blog post, I will show you how to fake different types of PII using Faker. All you need is a basic understanding of Python classes.
Most examples you will find change one particular piece of data at a time. In my example, I will show how to replace all PII within a piece of text using Faker.
Installation
You can install it in your environment using pip.
(env)$ pip install Faker
Initiate a Class
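import datetime
import random
import re

import nltk
import spacy
from faker import Faker
from nltk import ne_chunk
from nltk.tokenize import sent_tokenize

# datetime, random, re and nltk are used by the methods added later in this post
fake = Faker()

class DataRedaction:
    def __init__(self, text):
        self.text = text
        # Load the small English model
        self.nlp = spacy.load("en_core_web_sm")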
Here, we import the module, create an instance of the Faker() class, and create a new class to which we pass the text to redact. We also load en_core_web_sm, which is a small English language model for spaCy, an open-source natural language processing library.
Create Methods for Faking PII
Before building the class methods, here are two simple examples you can use to test how Faker works.
import re
from faker import Faker

# create an instance of Faker
fake = Faker()

# the paragraph containing sensitive data
paragraph = "My name is John Smith and my address is 123 Main St. My email is user@example.com"

# replace the name
paragraph = re.sub(r"John Smith", fake.name(), paragraph)
# replace the address
paragraph = re.sub(r"123 Main St", fake.address(), paragraph)
# replace the email address
paragraph = re.sub(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}", fake.email(), paragraph)
print(paragraph)
This code uses the Python package faker to generate fake personal data and then replaces the sensitive data in a given paragraph with the fake data.
First, an instance of the Faker class is created to generate fake data. Then, the paragraph variable is defined with some sensitive data.
Next, the code uses regular expressions and the re module to replace specific sensitive data. The sub() function is used to replace occurrences of the name “John Smith” with a fake name generated by the fake.name() method, the address “123 Main St” with a fake address generated by the fake.address() method, and the email address with a fake email generated by the fake.email() method.
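One detail worth knowing about sub(): when the replacement is a plain string such as the result of fake.email(), that single value is reused for every match in the text. Passing a function instead produces a fresh value per match. Below is a minimal sketch of that idea. It uses a hypothetical stand-in generator, next_fake_email, in place of Faker so that it runs with the standard library alone; the function names are my own.

```python
import itertools
import re

EMAIL_RE = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

# Hypothetical stand-in for fake.email(): user1@example.org, user2@example.org, ...
_counter = itertools.count(1)

def next_fake_email():
    return f"user{next(_counter)}@example.org"

def redact_emails(text, make_email=next_fake_email):
    # A callable replacement is invoked once per match,
    # so every email in the text gets its own fake value.
    return EMAIL_RE.sub(lambda match: make_email(), text)

text = "Write to alice@corp.com or bob@corp.com."
redacted = redact_emails(text)
print(redacted)
```

With Faker installed, calling redact_emails(text, fake.email) should behave the same way, with Faker-generated addresses instead of the numbered ones.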
import re
from faker import Faker

# create an instance of Faker
fake = Faker()

# the paragraph containing sensitive data
paragraph = "My credit card number is 1234-5678-1234-5678 and my phone number is 555-555-5555. My email is user@example.com"

# replace the credit card number
paragraph = re.sub(r"\d{4}-\d{4}-\d{4}-\d{4}", fake.credit_card_number(card_type=None), paragraph)
# replace the phone number
paragraph = re.sub(r"\d{3}-\d{3}-\d{4}", fake.phone_number(), paragraph)
# replace the email address
paragraph = re.sub(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}", fake.email(), paragraph)
print(paragraph)
Here, we used Faker to change the credit card number and the phone number.
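The credit card pattern above matches any four groups of four digits, including order numbers or other identifiers that happen to have that shape. Real card numbers satisfy the Luhn checksum, so one optional refinement is to replace only matches that pass it. This is a sketch under that assumption, with function names of my own; note that the sample number used in this post does not pass the checksum and would be left alone by this filter.

```python
import re

CARD_RE = re.compile(r"\b\d{4}-\d{4}-\d{4}-\d{4}\b")

def luhn_valid(number):
    digits = [int(ch) for ch in number if ch.isdigit()]
    total = 0
    # Walk from the rightmost digit and double every second one
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0

def find_card_numbers(text):
    # Keep only the matches that pass the Luhn checksum
    return [m.group(0) for m in CARD_RE.finditer(text) if luhn_valid(m.group(0))]

sample = "Card 4111-1111-1111-1111, reference 1234-5678-1234-5678"
cards = find_card_numbers(sample)
print(cards)
```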
Okay, now let’s get back to our target code.
Customized Target Code
Our primary intention is not only to develop code that replaces sensitive data with fake data, but also to make the result more believable.
Change Date in Text
    # Requires "import re" and "import datetime" at the top of the module
    @staticmethod
    def find_dates(text):
        date_formats = [
            r"\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b",  # dd/mm/yyyy or dd-mm-yyyy
            r"\b\d{1,2} \b\w{3} \d{2,4}\b",  # dd MMM yy or dd MMM yyyy
            r"\b\d{1,2} \b\w{3} \d{4}\b",  # dd MMM yyyy
            r"\b\w{3} \d{1,2}, \d{4}\b",  # MMM dd, yyyy
        ]
        dates = []
        for pattern in date_formats:
            dates.extend(re.findall(pattern, text, re.IGNORECASE))
        return dates

    # change date within text
    def changeDate(self, given_start_date=datetime.datetime(1980, 1, 1), given_end_date=datetime.datetime(2023, 12, 31)):
        dates = self.find_dates(self.text)
        # Replace each date with a new date generated by Faker
        for date_string in dates:
            old_date = date_string
            new_date = fake.date_between(start_date=given_start_date, end_date=given_end_date).strftime('%Y-%m-%d')
            if 'T' in old_date:
                new_date = new_date + 'T12:00:00'
            self.text = self.text.replace(old_date, new_date)
        return self.text
This code defines a static method called find_dates, which uses regular expressions to search for various date formats in the given input text. The class also has a method called changeDate, which takes two optional arguments for the start and end date and replaces all dates found in the text attribute of the class instance with randomly generated dates using the Faker library.
The find_dates method first creates a list of regular expressions to match various date formats, including dd/mm/yyyy, dd-mm-yyyy, dd MMM yyyy, and MMM dd, yyyy. It then searches for all matches of these formats in the input text using re.findall and returns a list of all the matches. Year-first dates such as 2022-08-23 are not matched by these patterns.
The changeDate method uses find_dates to find all dates in the text attribute and replaces each date with a new date generated by the Faker library within the given start and end date range. strftime('%Y-%m-%d') formats the generated date as yyyy-mm-dd. If the original date contains a ‘T’ (i.e., it’s in ISO format), T12:00:00 is appended to the generated date so that it is also in ISO format; handling such timestamps also requires adding a pattern for them to date_formats.
Finally, the method returns the modified text attribute of the class instance.
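The method above always writes the new date as yyyy-mm-dd, whatever the original looked like. If keeping the original layout matters for believability, one option is to do the replacement inside a callback and reuse the matched separator. The sketch below handles only the numeric dd/mm/yyyy and dd-mm-yyyy forms and uses the random module as a stand-in for fake.date_between(); the names replace_numeric_dates and random_date are my own.

```python
import datetime
import random
import re

# dd/mm/yyyy or dd-mm-yyyy, with the same separator used twice
NUMERIC_DATE_RE = re.compile(r"\b(\d{1,2})([/-])(\d{1,2})\2(\d{4})\b")

def random_date(start, end, rng):
    # Stand-in for fake.date_between(): a random day between start and end
    span_days = (end - start).days
    return start + datetime.timedelta(days=rng.randint(0, span_days))

def replace_numeric_dates(text, start, end, rng=random):
    def _swap(match):
        sep = match.group(2)
        new = random_date(start, end, rng)
        # Reuse the original separator so the layout stays the same
        return f"{new.day:02d}{sep}{new.month:02d}{sep}{new.year}"
    return NUMERIC_DATE_RE.sub(_swap, text)

rng = random.Random(0)
sample = "Signed on 24-12-2023 and renewed on 05/01/2024."
result = replace_numeric_dates(
    sample, datetime.date(1980, 1, 1), datetime.date(2023, 12, 31), rng
)
print(result)
```

Month-name formats would need their own pattern and a matching strftime() format string.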
Change Name
    # change name entities -> person name, company name
    def changeName(self):
        # to not mistake important terms as person or company names
        excludewordlist = ["SSN"]
        # Process the text with spacy
        doc = self.nlp(self.text)
        # Anonymize people and company names
        for ent in doc.ents:
            if ent.label_ == "PERSON":
                anonymized_name = fake.name()
            elif ent.label_ in ["ORG", "FAC"]:
                anonymized_name = fake.company() + " Inc."
            else:
                # Other labels (including GPE) have no replacement rule, so skip them
                # instead of reusing a value left over from an earlier entity
                continue
            if ent.text.upper() not in excludewordlist:
                self.text = self.text.replace(ent.text, anonymized_name)
        return self.text
Here, the code defines a method called changeName(), which uses the spaCy library to identify and anonymize named entities such as people’s names and company names in a given text.
First, the method initializes a list of words to be excluded from anonymization: terms such as “SSN” that should stay in the text and could otherwise be mistaken for a person or company name.
The spaCy library is then used to process the input text and identify named entities with the labels “PERSON”, “ORG”, “GPE” and “FAC”. For each entity, the method generates a replacement using the fake instance of the Faker class. If the entity is labeled as “PERSON”, a random name is generated. If it is labeled as “ORG” or “FAC”, a company name is generated along with the “Inc.” suffix. No replacement is defined for “GPE” entities.
The method then checks that the named entity text is not in the exclusion list and replaces it with the generated anonymized name. Finally, the method returns the updated text with anonymized named entities.
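Two refinements may make the result more consistent: reuse one fake name for every occurrence of the same original, and replace by character offsets (working from the end of the text backwards) rather than with str.replace(), which can also hit unrelated substrings. The sketch below assumes the entity spans are already known, for example from spaCy's ent.start_char and ent.end_char. Here they are found with a fixed regex purely for illustration, and make_name is a hypothetical stand-in for fake.name().

```python
import itertools
import re

_ids = itertools.count(1)

def make_name():
    # Hypothetical stand-in for fake.name()
    return f"Person {next(_ids)}"

def replace_spans(text, spans, make_replacement=make_name):
    """spans: list of (start, end) character offsets of entities in text."""
    mapping = {}  # original string -> fake value, so repeats stay consistent
    # Assign replacements in reading order
    for start, end in sorted(spans):
        original = text[start:end]
        if original not in mapping:
            mapping[original] = make_replacement()
    # Apply from the end so that earlier offsets remain valid
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + mapping[text[start:end]] + text[end:]
    return text, mapping

text = "Matt Henry met Ann Lee. Later Matt Henry left."
# In the real method these offsets would come from the spaCy entities
spans = [(m.start(), m.end()) for m in re.finditer(r"Matt Henry|Ann Lee", text)]
redacted, mapping = replace_spans(text, spans)
print(redacted)
```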
Change Address
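    # change address
    def changeAddress(self):
        # parse(text, country='US') matches the pyap API (from pyap import parse),
        # which returns address objects; str() yields the matched address text
        addresses = parse(self.text, country='US')
        for address in addresses:
            anonymized_address = fake.address()
            self.text = self.text.replace(str(address), anonymized_address)
        return self.text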
This code defines a method changeAddress that is responsible for changing any addresses found in the text. It first calls parse to extract any addresses found in the text. Note that the country='US' argument matches the parse function of the pyap library; usaddress.parse takes only a single address string. Then, for each address found, it replaces it with a new address generated by fake.address() from the Faker library. Finally, the method returns the modified text.
Change Email
The changeEmail() method changes the email to a believable one. You may be surprised by the length of the code.
We can generate a fake email with a couple of lines of code. However, within a text, it should be believable. For example, Mike should not have Emily293@example.com as his email.
What I did here is track the last person mentioned in the same or an earlier sentence and use that person’s name in the email address.
    # revised version of change emails
    def changeEmail(self):
        # Split the text into sentences
        sentences = sent_tokenize(self.text)
        names = []
        new_sentence_list = []
        # regex for email addresses
        email_pattern = re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b')
        for sentence in sentences:
            # Tokenize the sentence into words
            tokens = nltk.word_tokenize(sentence)
            # Tag the tokens with their part-of-speech
            tagged = nltk.pos_tag(tokens)
            # Use the named entity recognizer to extract entities from the tagged tokens
            entities = ne_chunk(tagged)
            # Collect the person names, keeping names from earlier sentences too
            for entity in entities:
                if hasattr(entity, 'label') and entity.label() == 'PERSON':
                    names.append(' '.join(c[0] for c in entity.leaves()))
            for email in email_pattern.findall(sentence):
                # Take the most recent name; if it contains "USA" it is a
                # misdetected place, so fall back to the name before it
                candidate = None
                if names and "USA" not in names[-1]:
                    candidate = names.pop()
                elif len(names) >= 2:
                    names.pop()
                    candidate = names.pop()
                if candidate and " " in candidate:
                    # First name plus a random number as the local part
                    firstname = candidate.split()[0]
                    any_num = random.randint(0, 1000)
                    fake_email = f"{firstname}{any_num}@{fake.free_email_domain()}"
                else:
                    # No usable name: fall back to a fully random email
                    fake_email = fake.email()
                sentence = sentence.replace(email, fake_email)
            new_sentence_list.append(sentence)
        self.text = " ".join(new_sentence_list)
        return self.text
This code is a method named changeEmail that belongs to a larger class. The purpose of this method is to replace any email addresses in the text with fake email addresses, while also trying to infer the name of the person associated with the email address and including that name in the fake email address if possible.
Here is an explanation of how the code works:
- The method starts by splitting the input text into individual sentences using the sent_tokenize function from the nltk library.
- Next, the method initializes two empty lists: names and new_sentence_list. names will be used to keep track of any person names that are identified in the text, while new_sentence_list will store the updated sentences with fake email addresses.
- A regular expression pattern is defined to match email addresses. This pattern is then used to find all email addresses in each sentence.
- The method then iterates through each sentence, and for each sentence, it tokenizes the text into individual words using the word_tokenize function from nltk, and then tags each word with its part of speech using the pos_tag function from nltk.
- The ne_chunk function from nltk is then used to extract named entities from the tagged tokens. In particular, the method looks for any entities that are labeled as “PERSON”, which would correspond to person names.
- For each identified person name, the method adds it to the names list.
- If any email addresses were found in the sentence, the method checks if the names list is non-empty. If it is, the method takes the most recent entry in the names list, assumes it is the person associated with the email address, and generates a fake email address using that person’s first name and a random number as a prefix. If the most recent entry contains the string “USA”, the method assumes that it was not correctly identified as a person and tries to use the second-to-last entry in the list instead.
- If the names list is empty, or if the email address could not be associated with a person name for some reason, then the method generates a completely random fake email address using the fake.email() function from the faker library.
- The updated sentence with the fake email address is then added to new_sentence_list.
- After iterating through all sentences, the method joins the updated sentences back into a single string and returns it.
Overall, this method uses a combination of regular expressions and natural language processing techniques to identify email addresses and person names in the input text, and then generates fake email addresses using a combination of the identified names and random numbers.
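The core idea, pairing each email with the most recently seen person name, can be isolated from the NLP steps. The sketch below assumes the names per sentence have already been extracted (they are hard-coded here). The helper names, the example.org domain, and the fixed fallback address are illustrative stand-ins rather than part of the original code.

```python
import random
import re

EMAIL_RE = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")

def email_from_name(full_name, domain="example.org", rng=random):
    # First name plus a random number as the local part
    first = full_name.split()[0].lower()
    return f"{first}{rng.randint(0, 1000)}@{domain}"

def redact_with_names(sentences_with_names, rng=random):
    """sentences_with_names: list of (sentence, [person names found in it])."""
    last_name_seen = None
    out = []
    for sentence, names in sentences_with_names:
        if names:
            last_name_seen = names[-1]

        def _swap(match):
            if last_name_seen is None:
                # Stand-in for fake.email() when there is no name to borrow
                return "user@example.org"
            return email_from_name(last_name_seen, rng=rng)

        out.append(EMAIL_RE.sub(_swap, sentence))
    return " ".join(out)

data = [
    ("Mike Ross joined in May.", ["Mike Ross"]),
    ("His email is emily293@corp.com.", []),
]
result = redact_with_names(data, random.Random(1))
print(result)
```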
Change Mobile Number
    # change mobile numbers
    def changeMobileNumber(self):
        # Search for phone number patterns in the text
        phone_number_pattern = re.compile(r'\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}')
        phone_numbers = re.findall(phone_number_pattern, self.text)
        # Replace the phone numbers with fake ones generated by Faker
        for phone_number in phone_numbers:
            fake_phone_number = fake.phone_number()
            self.text = self.text.replace(phone_number, fake_phone_number)
        return self.text
This code defines a method named changeMobileNumber, which replaces the existing mobile numbers in the input text with fake ones generated by the fake object from the faker library.
The first step is to search for phone number patterns in the input text using the regular expression pattern r'\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}'. This pattern matches phone numbers in the format (xxx) xxx-xxxx, xxx-xxx-xxxx, xxx.xxx.xxxx, or xxx xxx xxxx. The re.findall() method is then used to find all phone numbers in the text that match this pattern.
Next, the method iterates over all the found phone numbers and replaces them with fake phone numbers generated by the fake.phone_number() method from the faker library.
Finally, the updated text is returned by the method.
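fake.phone_number() can return numbers in a different layout from the original, which may stand out in the redacted text. If the layout should be kept, one alternative is to swap only the digits and leave brackets, dashes, dots and spaces in place. This sketch uses the standard library instead of Faker, and the function names are my own.

```python
import random
import re

PHONE_RE = re.compile(r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}")

def scramble_digits(match, rng=random):
    # Replace every digit with a random one; keep all other characters
    return "".join(
        str(rng.randint(0, 9)) if ch.isdigit() else ch
        for ch in match.group(0)
    )

def redact_phones(text, rng=random):
    return PHONE_RE.sub(lambda m: scramble_digits(m, rng), text)

rng = random.Random(42)
sample = "Call (123) 456-7890 or 555.555.5555."
result = redact_phones(sample, rng)
print(result)
```

Purely random digits can produce area codes that do not exist, so this trades realism of the number itself for consistency of format.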
Change SSN
    # change SSN
    def changeSSN(self):
        ssn_regex = re.compile(r"\d{3}-\d{2}-\d{4}")
        self.text = ssn_regex.sub(lambda x: fake.ssn(), self.text)
        return self.text
Here, the changeSSN() method changes SSN within the text.
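The pattern \d{3}-\d{2}-\d{4} also matches when those digits sit inside a longer number. Adding digit lookarounds is one way to tighten it. In the sketch below, stand_in_ssn is a hypothetical replacement for fake.ssn() so that the snippet has no third-party dependency.

```python
import random
import re

# Lookarounds reject matches that are part of a longer digit run
SSN_RE = re.compile(r"(?<!\d)\d{3}-\d{2}-\d{4}(?!\d)")

def stand_in_ssn(rng=random):
    # Stand-in for fake.ssn(): random digits in the ###-##-#### layout
    return f"{rng.randint(0, 999):03d}-{rng.randint(0, 99):02d}-{rng.randint(0, 9999):04d}"

def redact_ssn(text, rng=random):
    return SSN_RE.sub(lambda m: stand_in_ssn(rng), text)

rng = random.Random(7)
sample = "SSN is 555-55-5555, order 1234-56-78901."
result = redact_ssn(sample, rng)
print(result)
```

The order number is left untouched because a digit precedes and follows the part that looks like an SSN.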
Example Usage
You can access the whole working code here.
Now, save the class as redaction.py, create another Python file in the same directory, copy the following code, and run it.
from redaction import DataRedaction

# the paragraph containing sensitive data
paragraph = "My name is John Smith and my address is 123 Main St, Anytown USA. My email is johnsmith@example.com. \
My credit card number is 1234-5678-1234-5678 and SSN is 555-55-5555. \
My email is user@example.com. Today's date is 24-12-2023. Another date 2022-08-23. \
Also 2022/06/22 and 12/07/23 and 12/07/21. His mobile number is (123) 456-7890. \
Matt Henry is also responsible for the Uber Company. And Google, too."

if __name__ == '__main__':
    mod = DataRedaction(paragraph)
    modDate = mod.changeDate()
    modName = mod.changeName()
    modAddress = mod.changeAddress()
    modEmail = mod.changeEmail()
    modMobileNum = mod.changeMobileNumber()
    modSSN = mod.changeSSN()
    print(modSSN)
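After running the pipeline, a quick sanity check is to confirm that none of the original values survived in the output. A minimal sketch of such a check follows; the redacted_text string is a made-up example output, not something the class above is guaranteed to produce, and find_leaks is a name of my own.

```python
def find_leaks(original_values, redacted_text):
    # Return every original value that still appears verbatim in the output
    return [value for value in original_values if value in redacted_text]

known_pii = ["John Smith", "johnsmith@example.com", "555-55-5555"]
# Hypothetical output in which the SSN was missed
redacted_text = "My name is Dana Cole and my email is dana12@example.org. SSN is 555-55-5555."

leaks = find_leaks(known_pii, redacted_text)
print(leaks)
```

An empty list means none of the listed values is still present. It does not prove that the text is free of PII that was never listed.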
Concluding Remarks
One of the key benefits of using Faker is that it can help protect users’ privacy by generating realistic but fake data. In many cases, developers need to test their applications with realistic data but cannot use actual user data due to privacy concerns. Using Faker to generate fake data can help to ensure that no real user data is being used in testing.
For example, if a developer needs to test a feature that displays a user’s name on a webpage, they could use Faker to generate a random name instead of using a real name from their database. This way, they can test the functionality of the feature without exposing any real user data.