How to use Python Faker Module for Data Privacy (Data Masking/Redaction)
Faker is a Python package that generates fake data such as names, addresses, phone numbers, dates, Social Security numbers (SSNs), and other personally identifiable information (PII).
It is useful for generating realistic data in many use cases, including cyber deception. Deception means steering attackers toward the wrong targets, and plausible fake data can help blunt open-source intelligence (OSINT) gathering.
In this blog post, I will show you how to fake different types of PII using Faker. All you need is a basic understanding of Python classes.
Most examples you will find replace one specific piece of data at a time. In my example, I will show how to replace all PII within a piece of text using Faker.
Installation
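You can install it in your environment using pip. The class built later in this post also relies on spaCy (with the en_core_web_sm model) and NLTK, which are installed separately.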
(env)$ pip install Faker
Initialize the Class
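import datetime
import random
import re
import nltk
import spacy
from faker import Faker
from nltk import ne_chunk, sent_tokenize
from pyap import parse  # assumption: the parse() used in changeAddress() comes from pyap

fake = Faker()

class DataRedaction:
    def __init__(self, text):
        self.text = text
        # Load the small English model
        self.nlp = spacy.load("en_core_web_sm")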
Here, we import the libraries the class needs, create an instance of the Faker() class, and define a new class that takes the text to redact. We also load en_core_web_sm, a small English language model for spaCy, an open-source natural language processing library.
Create Methods for Faking PII
Here are two simple examples you can use to test how Faker works.
import re
from faker import Faker
# create an instance of Faker
fake = Faker()
# the paragraph containing sensitive data
paragraph = "My name is John Smith and my address is 123 Main St. My email is user@example.com"
# replace the name
paragraph = re.sub(r"John Smith", fake.name(), paragraph)
# replace the address
paragraph = re.sub(r"123 Main St", fake.address(), paragraph)
# replace the email address
paragraph = re.sub(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}", fake.email(), paragraph)
print(paragraph)
import re
from faker import Faker
# create an instance of Faker
fake = Faker()
# the paragraph containing sensitive data
paragraph = "My credit card number is 1234-5678-1234-5678 and my phone number is 555-555-5555. My email is user@example.com"
# replace the credit card number
paragraph = re.sub(r"\d{4}-\d{4}-\d{4}-\d{4}", fake.credit_card_number(card_type=None), paragraph)
# replace the phone number
paragraph = re.sub(r"\d{3}-\d{3}-\d{4}", fake.phone_number(), paragraph)
# replace the email address
paragraph = re.sub(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}", fake.email(), paragraph)
print(paragraph)
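One caveat with the snippets above: fake.email() is evaluated once, before re.sub() runs, so every address in the paragraph is replaced with the same fake value. Passing a function as the replacement generates a value per match instead. The sketch below uses a hypothetical stand-in generator, make_fake_email, so it runs without Faker; in practice you would call fake.email() inside it. The cache dictionary is also an assumption of this sketch: it keeps repeated occurrences of the same original address consistent.

```python
import itertools
import re

EMAIL_RE = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

_counter = itertools.count(1)


def make_fake_email():
    # Hypothetical stand-in for fake.email(); returns a new address on each call.
    return f"user{next(_counter)}@example.org"


def redact_emails(text, generator=make_fake_email):
    cache = {}  # original address -> fake address

    def replace(match):
        original = match.group(0)
        if original not in cache:
            cache[original] = generator()
        return cache[original]

    return EMAIL_RE.sub(replace, text)


sample = "Write to a@example.com or b@example.com. Again: a@example.com"
redacted = redact_emails(sample)
print(redacted)
# Write to user1@example.org or user2@example.org. Again: user1@example.org
```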
Okay, now let’s get back to our target code.
Change Date in Text
    @staticmethod
    def find_dates(text):
        # Match real month abbreviations only, so "23 and 12" is not read as a date
        months = r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)"
        date_formats = [
            r"\b\d{4}[/-]\d{1,2}[/-]\d{1,2}\b",      # yyyy-mm-dd or yyyy/mm/dd
            r"\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b",    # dd/mm/yyyy or dd-mm-yyyy
            r"\b\d{1,2} " + months + r" \d{2,4}\b",  # dd MMM yyyy
            r"\b" + months + r" \d{1,2}, \d{4}\b",   # MMM dd, yyyy
        ]
        dates = []
        for pattern in date_formats:
            dates.extend(re.findall(pattern, text, re.IGNORECASE))
        return dates

    # change date within text
    def changeDate(self, given_start_date=datetime.datetime(1980, 1, 1),
                   given_end_date=datetime.datetime(2023, 12, 31)):
        dates = self.find_dates(self.text)
        # Replace each date with a new date generated by Faker
        for old_date in dates:
            new_date = fake.date_between(start_date=given_start_date, end_date=given_end_date).strftime('%Y-%m-%d')
            if 'T' in old_date:
                new_date = new_date + 'T12:00:00'
            self.text = self.text.replace(old_date, new_date)
        return self.text
In this example, we first define a static method that finds dates in the text. The changeDate() method then replaces each of those dates.
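Note that changeDate() writes every replacement as YYYY-MM-DD, whatever the original looked like. If the redacted text should keep the original layout, one option is to pair each pattern with the format used to write its replacement. The sketch below is a standalone illustration under that assumption: it uses Python's random module instead of Faker's date_between() so it runs on its own, and the names DATE_FORMATS, random_date, and shift_dates are hypothetical.

```python
import datetime
import random
import re

# Each pattern is paired with the strftime format used to write the replacement.
DATE_FORMATS = [
    (re.compile(r"\b\d{4}-\d{2}-\d{2}\b"), "%Y-%m-%d"),
    (re.compile(r"\b\d{2}/\d{2}/\d{4}\b"), "%d/%m/%Y"),
    (re.compile(r"\b[A-Z][a-z]{2} \d{1,2}, \d{4}\b"), "%b %d, %Y"),
]


def random_date(rng, start, end):
    # Pick a uniformly random day between start and end (inclusive).
    span_days = (end - start).days
    return start + datetime.timedelta(days=rng.randint(0, span_days))


def shift_dates(text, rng, start=datetime.date(1980, 1, 1), end=datetime.date(2023, 12, 31)):
    for pattern, fmt in DATE_FORMATS:
        # fmt is bound as a default argument so each lambda keeps its own format.
        text = pattern.sub(lambda m, fmt=fmt: random_date(rng, start, end).strftime(fmt), text)
    return text


rng = random.Random(0)
result = shift_dates("Signed on 24/12/2023, filed 2022-08-23, due Jan 5, 2024.", rng)
print(result)
```

Each date comes back in the style it was written in, so the surrounding sentence still reads naturally.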
Change Name
    # change name entities -> person name, company name
    def changeName(self):
        # to not mistake important terms as person or company names
        excludewordlist = ["SSN"]
        # Process the text with spacy
        doc = self.nlp(self.text)
        # Anonymize people and company names
        for ent in doc.ents:
            if ent.text.upper() in excludewordlist:
                continue
            if ent.label_ == "PERSON":
                anonymized_name = fake.name()
            elif ent.label_ in ["ORG", "FAC"]:
                anonymized_name = fake.company() + " Inc."
            else:
                # other entity types (e.g. GPE) are left unchanged
                continue
            self.text = self.text.replace(ent.text, anonymized_name)
        return self.text
The changeName() method replaces all person and organization names within the text.
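One caveat of str.replace() here is that it replaces every occurrence of the entity text, including occurrences inside longer words. spaCy entities expose start_char and end_char, so an alternative is to splice in replacements by character offset, working from the end of the text backwards so that earlier offsets stay valid. The sketch below uses hand-written (start, end, replacement) tuples as stand-ins for spaCy output, and replace_spans is a hypothetical helper.

```python
def replace_spans(text, spans):
    """Replace (start, end, replacement) spans, rightmost first."""
    for start, end, replacement in sorted(spans, reverse=True):
        text = text[:start] + replacement + text[end:]
    return text


text = "Ann met Annabel at Acme."
spans = [
    (0, 3, "Maria Lopez"),  # "Ann"
    (19, 23, "Initech"),    # "Acme"
]

by_offset = replace_spans(text, spans)
naive = text.replace("Ann", "Maria Lopez")

print(by_offset)  # Maria Lopez met Annabel at Initech.
print(naive)      # Maria Lopez met Maria Lopezabel at Acme.
```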
Change Address
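    # change address
    def changeAddress(self):
        # parse() is assumed to be pyap's address parser (from pyap import parse)
        addresses = parse(self.text, country='US')
        for address in addresses:
            anonymized_address = fake.address()
            # pyap returns address objects, so convert each to its matched string first
            self.text = self.text.replace(str(address), anonymized_address)
        return self.text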
This method changes addresses within the text.
Change Email
    # revised version of change emails
    def changeEmail(self):
        # Split the text into sentences
        sentences = sent_tokenize(self.text)
        names = []
        new_sentence_list = []
        # get the regex of email
        email_pattern = re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b')
        for sentence in sentences:
            # Tokenize the sentence, tag parts of speech, and run the named entity chunker
            tokens = nltk.word_tokenize(sentence)
            tagged = nltk.pos_tag(tokens)
            entities = ne_chunk(tagged)
            # Collect the person names seen so far
            for entity in entities:
                if hasattr(entity, 'label') and entity.label() == 'PERSON':
                    names.append(' '.join(c[0] for c in entity.leaves()))
            for email in email_pattern.findall(sentence):
                # Skip a trailing "USA" entity that was tagged as a person
                if names and "USA" in names[-1]:
                    names.pop()
                if names and " " in names[-1]:
                    # Build the address from the latest person's first name plus a random number
                    firstname = names.pop().split()[0]
                    any_num = random.randint(0, 1000)
                    fake_email = f"{firstname}{any_num}@{fake.free_email_domain()}"
                else:
                    # No usable name: fall back to a fully random address
                    fake_email = fake.email()
                sentence = sentence.replace(email, fake_email)
            new_sentence_list.append(sentence)
        self.text = " ".join(new_sentence_list)
        return self.text
The changeEmail() method replaces each email address with a believable one. You may be surprised by the length of the code: we could generate a fake email with a couple of lines of code, but within a text the result has to stay believable. For example, Mike should not have the address Emily293@example.com.
What I did here is track the most recent person named in the same or an earlier sentence and use that person's first name in the email address.
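The tracking idea can be shown in isolation. The sketch below is a simplified illustration, not the method above: the person names per sentence are supplied by hand as a stand-in for the NLTK output, the domain is fixed, and personalize_emails is a hypothetical name.

```python
import random
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def personalize_emails(tagged_sentences, rng, domain="example.com"):
    """tagged_sentences: list of (sentence, [person names found in that sentence])."""
    last_person = None
    output = []
    for sentence, names in tagged_sentences:
        if names:
            last_person = names[-1]  # remember the most recent person
        # Use that person's first name, or a generic prefix if nobody was seen yet.
        prefix = last_person.split()[0] if last_person else "user"
        output.append(
            EMAIL_RE.sub(lambda m, p=prefix: f"{p}{rng.randint(0, 1000)}@{domain}", sentence)
        )
    return " ".join(output)


tagged = [
    ("Mike Ross joined in May.", ["Mike Ross"]),
    ("His email is mr@corp.io.", []),
]
result = personalize_emails(tagged, random.Random(7))
print(result)
```

The second sentence names nobody, so the address is built from Mike, the last person mentioned before it.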
Change Mobile Number
    # change mobile numbers
    def changeMobileNumber(self):
        # Search for phone number patterns in the text
        phone_number_pattern = re.compile(r'\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}')
        phone_numbers = re.findall(phone_number_pattern, self.text)
        # Replace the phone numbers with fake ones generated by Faker
        for phone_number in phone_numbers:
            fake_phone_number = fake.phone_number()
            self.text = self.text.replace(phone_number, fake_phone_number)
        return self.text
Here, the changeMobileNumber() method replaces any mobile numbers within the given text.
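fake.phone_number() can return numbers in a different layout from the original (for example with an extension), which may look out of place in the redacted text. If keeping the layout matters, one alternative is to scramble only the digits and leave the punctuation alone. The sketch below assumes that approach; it uses Python's random module rather than Faker, and scramble_digits and mask_phones are hypothetical names.

```python
import random
import re

PHONE_RE = re.compile(r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}")


def scramble_digits(match, rng):
    # Replace every digit, leaving parentheses, spaces, dots, and dashes in place.
    return re.sub(r"\d", lambda _: str(rng.randint(0, 9)), match.group(0))


def mask_phones(text, rng):
    return PHONE_RE.sub(lambda m: scramble_digits(m, rng), text)


rng = random.Random(1)
masked = mask_phones("Call (123) 456-7890 or 555.555.5555.", rng)
print(masked)
```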
Change SSN
    # change SSN
    def changeSSN(self):
        ssn_regex = re.compile(r"\d{3}-\d{2}-\d{4}")
        self.text = ssn_regex.sub(lambda x: fake.ssn(), self.text)
        return self.text
Here, the changeSSN() method replaces SSNs within the text.
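The pattern above will also match an SSN-shaped run inside a longer number, such as an order ID. A tighter variant, shown here as a sketch rather than a drop-in fix, adds lookarounds so a match cannot be preceded or followed by another digit or dash. The sample text and the name STRICT_SSN_RE are made up for illustration.

```python
import re

LOOSE_SSN_RE = re.compile(r"\d{3}-\d{2}-\d{4}")
# Lookarounds stop the pattern from matching inside a longer run of digits and dashes.
STRICT_SSN_RE = re.compile(r"(?<![\d-])\d{3}-\d{2}-\d{4}(?![\d-])")

text = "SSN 555-55-5555, order 9123-45-67890, card 1234-5678-1234-5678."

loose = LOOSE_SSN_RE.findall(text)
strict = STRICT_SSN_RE.findall(text)

print(loose)   # ['555-55-5555', '123-45-6789']
print(strict)  # ['555-55-5555']
```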
Example Usage
You can access the whole working code here.
Now, create another Python file in the same directory as redaction.py (the file containing the class), copy the following code, and run it.
from redaction import DataRedaction

# the paragraph containing sensitive data
paragraph = (
    "My name is John Smith and my address is 123 Main St, Anytown USA. My email is johnsmith@example.com. "
    "My credit card number is 1234-5678-1234-5678 and SSN is 555-55-5555. "
    "My email is user@example.com. Today's date is 24-12-2023. Another date 2022-08-23. "
    "Also 2022/06/22 and 12/07/23 and 12/07/21. His mobile number is (123) 456-7890. "
    "Matt Henry is also responsible for the Uber Company. And Google, too."
)

if __name__ == '__main__':
    mod = DataRedaction(paragraph)
    # Each method updates mod.text and returns the updated text
    modDate = mod.changeDate()
    modName = mod.changeName()
    modAddress = mod.changeAddress()
    modEmail = mod.changeEmail()
    modMobileNum = mod.changeMobileNumber()
    modSSN = mod.changeSSN()
    print(modSSN)
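Because every step relies on regular expressions or named entity recognition, some values can slip through. A simple sanity check is to search the output for the original values you know about. The sketch below uses a hand-written redacted string in place of real output, and find_leaks is a hypothetical helper.

```python
def find_leaks(redacted_text, known_pii):
    """Return the original PII values that still appear in the redacted text."""
    return [value for value in known_pii if value in redacted_text]


known_pii = ["John Smith", "johnsmith@example.com", "555-55-5555", "(123) 456-7890"]
# Hand-written sample standing in for the output of the redaction class.
redacted = "My name is Dana Reyes. My email is Dana412@example.net and SSN is 555-55-5555."

leaks = find_leaks(redacted, known_pii)
print(leaks)  # ['555-55-5555']
```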
Concluding Remarks
One of the key benefits of using Faker is that it can help protect users’ privacy by generating realistic but fake data. In many cases, developers need to test their applications with realistic data but cannot use actual user data due to privacy concerns. Generating that data with Faker helps ensure that no real user data is used in testing.
For example, if a developer needs to test a feature that displays a user’s name on a webpage, they could use Faker to generate a random name instead of using a real name from their database. This way, they can test the functionality of the feature without exposing any real user data.