How to use Python Faker Module for Data Privacy (Data Masking/Redaction)

Faker is a Python package that generates fake data such as names, addresses, phone numbers, dates, SSN and other personally identifiable information (PII).

This module is useful for generating realistic data for various use cases, including cyber deception. Deception involves steering attackers toward the wrong targets, and Faker can help limit what attackers learn through open-source intelligence (OSINT).

In this blog post, I will show you how to fake different types of PII using Faker. All you need is basic familiarity with Python classes.

Most examples you will find replace one particular piece of data at a time. In my example, I will show how to replace all PII within a piece of text using Faker.

Installation

You can install it in your environment using pip.

(env)$ pip install Faker

Initiate a Class

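import datetime
import random
import re

import nltk
import spacy
from faker import Faker
from nltk import ne_chunk, sent_tokenize
from pyap import parse  # parse(text, country='US') matches pyap's API

fake = Faker()

class DataRedaction:
    def __init__(self, text):
        self.text = text

        # Load the small English model
        # (install it first: python -m spacy download en_core_web_sm)
        self.nlp = spacy.load("en_core_web_sm")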

Here, we import the modules the class relies on, create an instance of the Faker() class, and define a new class that receives the text to redact. We also load en_core_web_sm, which is a small English language model for spaCy, an open-source natural language processing library.

Create Methods for Faking PII

Here are some simple examples you can use to test how it works.

import re

from faker import Faker

# create an instance of Faker
fake = Faker()

# the paragraph containing sensitive data
paragraph = "My name is John Smith and my address is 123 Main St. My email is user@example.com"

# replace the name
paragraph = re.sub(r"John Smith", fake.name(), paragraph)

# replace the address
paragraph = re.sub(r"123 Main St", fake.address(), paragraph)

# replace the email address
paragraph = re.sub(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}", fake.email(), paragraph)

print(paragraph)

This code uses the Python package faker to generate fake personal data and then replaces the sensitive data in a given paragraph with the fake data.

First, an instance of the Faker class is created to generate fake data. Then, the paragraph variable is defined with some sensitive data.

Next, the code uses regular expressions and the re module to replace specific sensitive data. The sub() function is used to replace occurrences of the name “John Smith” with a fake name generated by the fake.name() method, the address “123 Main St” with a fake address generated by the fake.address() method, and the email address with a fake email generated by the fake.email() method.
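
One detail in the snippet above is worth a closer look: fake.email() is evaluated once, before re.sub runs, so every matched address receives the same replacement string. The sketch below shows the alternative of passing a callable, so that each match gets a fresh value. To keep it runnable without extra packages, make_fake_email is a hypothetical stand-in for fake.email; it is not part of Faker.

```python
import itertools
import re

EMAIL_PATTERN = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

# Hypothetical stand-in for fake.email(): returns a new address on every call.
_counter = itertools.count(1)

def make_fake_email():
    return f"user{next(_counter)}@example.org"

text = "Contact alice@corp.com or bob@corp.com."

# A string replacement is computed once and reused for every match.
same = EMAIL_PATTERN.sub(make_fake_email(), text)

# A callable is invoked once per match, so each email gets its own replacement.
distinct = EMAIL_PATTERN.sub(lambda match: make_fake_email(), text)

print(same)      # Contact user1@example.org or user1@example.org.
print(distinct)  # Contact user2@example.org or user3@example.org.
```

With Faker installed, swapping make_fake_email for fake.email should give the same per-match behavior.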

import re

from faker import Faker

# create an instance of Faker
fake = Faker()

# the paragraph containing sensitive data
paragraph = "My credit card number is 1234-5678-1234-5678 and my phone number is 555-555-5555. My email is user@example.com"

# replace the credit card number
paragraph = re.sub(r"\d{4}-\d{4}-\d{4}-\d{4}", fake.credit_card_number(card_type=None), paragraph)

# replace the phone number
paragraph = re.sub(r"\d{3}-\d{3}-\d{4}", fake.phone_number(), paragraph)

# replace the email address
paragraph = re.sub(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}", fake.email(), paragraph)

print(paragraph)

Here, we used Faker to replace the credit card number, the phone number, and the email address.

Okay, now let’s get back to our target code.

Customized Target Code

Our primary intention is not only to develop code that replaces sensitive data with fake data, but also to make the result more believable.

Change Date in Text

    @staticmethod
    def find_dates(text):
        date_formats = [
            r"\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b",  # dd/mm/yyyy or dd-mm-yyyy
            r"\b\d{1,2} \w{3} \d{2,4}\b",  # dd MMM yy or dd MMM yyyy
            r"\b\w{3} \d{1,2}, \d{4}\b",  # MMM dd, yyyy
        ]

        dates = []
        for pattern in date_formats:
            dates.extend(re.findall(pattern, text, re.IGNORECASE))

        return dates

    # change dates within the text (needs `import datetime` and `import re`)
    def changeDate(self, given_start_date=datetime.datetime(1980, 1, 1), given_end_date=datetime.datetime(2023, 12, 31)):

        dates = self.find_dates(self.text)

        # Replace each date with a new date generated by Faker
        for old_date in dates:
            new_date = fake.date_between(start_date=given_start_date, end_date=given_end_date).strftime('%Y-%m-%d')

            # Keep a time component if the matched string carried one
            if 'T' in old_date:
                new_date = new_date + 'T12:00:00'

            self.text = self.text.replace(old_date, new_date)

        return self.text

This code defines a static method called find_dates which uses regular expressions to search for various date formats in the given input text. The class also has a method called changeDate which takes two optional arguments for the start and end date and replaces all dates found in the text attribute of the class instance with randomly generated dates using the Faker library.

The find_dates method first creates a list of regular expressions to match various date formats, including dd/mm/yyyy, dd-mm-yyyy, dd MMM yyyy, and MMM dd, yyyy. It then searches for all matches of these formats in the input text using the re.findall method (case-insensitively) and returns a list of all the matches.

The changeDate method uses the find_dates method to find all dates in the text attribute and replaces each date with a new date generated by the Faker library within the given start and end date range. The strftime('%Y-%m-%d') method is used to format the generated date to the yyyy-mm-dd format. If the original date contains a ‘T’ (i.e., it’s an ISO date-time), T12:00:00 is appended to the generated date so that it stays in ISO format. Note that none of the patterns in find_dates matches an ISO date-time string, so this branch needs an additional pattern before it can handle such input.

Finally, the method returns the modified text attribute of the class instance.
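
To see what these patterns do and do not catch, here is a small standalone check. It is only a sketch: the function is pulled out of the class so that it runs on its own, and the sample sentence is made up for illustration.

```python
import re

DATE_PATTERNS = [
    r"\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b",  # dd/mm/yyyy or dd-mm-yyyy
    r"\b\d{1,2} \w{3} \d{2,4}\b",          # dd MMM yyyy
    r"\b\w{3} \d{1,2}, \d{4}\b",           # MMM dd, yyyy
]

def find_dates(text):
    """Return every substring matched by any of the date patterns."""
    dates = []
    for pattern in DATE_PATTERNS:
        dates.extend(re.findall(pattern, text, re.IGNORECASE))
    return dates

sample = "Due 24-12-2023, filed 3 Jan 2021, signed Mar 5, 2020, logged 2022-08-23."
dates = find_dates(sample)
print(dates)  # ['24-12-2023', '3 Jan 2021', 'Mar 5, 2020']
```

The year-first date 2022-08-23 is not in the result: every pattern expects the day or the month name first, so yyyy-mm-dd and yyyy/mm/dd strings stay unchanged unless another pattern is added.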

Change Name

    # change name entities -> person name, company name
    def changeName(self):

        # to not mistake important terms as person or company names
        excludewordlist = ["SSN"]

        # Process the text with spacy
        doc = self.nlp(self.text)

        # Anonymize people and company names
        for ent in doc.ents:
            if ent.text.upper() in excludewordlist:
                continue
            if ent.label_ == "PERSON":
                anonymized_name = fake.name()
            elif ent.label_ in ["ORG", "FAC"]:
                anonymized_name = fake.company() + " Inc."
            else:
                # GPE and all other labels: no replacement is defined, so skip
                continue
            self.text = self.text.replace(ent.text, anonymized_name)
        return self.text

Here, the code defines a method called changeName() which uses the spacy library to identify and anonymize named entities such as people’s names and company names in a given text.

First, the method initializes a list of words to be excluded from anonymization. These are important terms that should not be changed, for example the label “SSN”, which could otherwise be mistaken for a person or company name.

The spacy library is then used to process the input text and identify named entities with labels such as “PERSON”, “ORG”, “GPE” and “FAC”. If the entity is labeled as “PERSON”, a random name is generated using the fake instance of the Faker class. If it is labeled as “ORG” or “FAC”, a company name is generated along with the “Inc.” suffix. No replacement is generated for “GPE” entities (locations such as cities or countries).

The method then checks if the named entity text is not in the exclusion list and replaces it with the generated anonymized name. Finally, the method returns the updated text with anonymized named entities.
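
One caveat with plain str.replace is that it also rewrites the entity wherever it appears inside a longer word. A minimal sketch of a whole-word alternative follows; replace_whole_word is a hypothetical helper, not part of the class above, and the word-boundary anchors assume the entity starts and ends with a letter or digit.

```python
import re

def replace_whole_word(text, original, replacement):
    """Replace `original` only where it stands as a whole word or phrase."""
    pattern = re.compile(r"\b" + re.escape(original) + r"\b")
    # A callable replacement keeps any backslashes in `replacement` literal.
    return pattern.sub(lambda match: replacement, text)

text = "Ann met Annabel at Google."
print(text.replace("Ann", "Zoe"))              # Zoe met Zoeabel at Google.
print(replace_whole_word(text, "Ann", "Zoe"))  # Zoe met Annabel at Google.
```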

Change Address

    # change address
    def changeAddress(self):

        # parse() is assumed to be pyap.parse, which returns Address objects
        addresses = parse(self.text, country='US')
        for address in addresses:
            anonymized_address = fake.address()
            # str.replace needs a string, so convert the match first
            self.text = self.text.replace(str(address), anonymized_address)

        return self.text

This code defines a method changeAddress that is responsible for changing any addresses found in the text. It first uses a parse function to extract the addresses; the call parse(text, country='US') matches the signature of the pyap library, whereas usaddress.parse takes no country argument. Then, each address found is replaced with a new address generated by the fake.address() function from the Faker library. Finally, the method returns the modified text.

Change Email

The changeEmail() method changes each email to a believable one. You may be surprised by the length of the code.

We can generate a fake email with a couple of lines of code. However, within a text, it should be believable. For example, Mike should not have the email address Emily293@example.com.

What I did here is track the last person mentioned in the same or an earlier sentence and use that person's name in the email address.

    # revised version of change emails
    def changeEmail(self):

        # Split the text into sentences
        sentences = sent_tokenize(self.text)

        names = []
        new_sentence_list = []

        # get the regex of email
        email_pattern = re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b')

        for sentence in sentences:
            # Tokenize the text into words
            tokens = nltk.word_tokenize(sentence)

            # Tag the tokens with their part-of-speech
            tagged = nltk.pos_tag(tokens)

            # Use the named entity recognizer to extract entities from the tagged tokens
            entities = ne_chunk(tagged)

            # Iterate through the entities and collect the person names
            for entity in entities:
                if hasattr(entity, 'label') and entity.label() == 'PERSON':
                    names.append(' '.join(c[0] for c in entity.leaves()))

            for email in email_pattern.findall(sentence):
                # Default: a completely random address
                fake_email = fake.email()

                if names:
                    # Skip a trailing entry such as "Anytown USA" that was mis-tagged as a person
                    skip = 1 if "USA" in names[-1] else 0
                    if len(names) > skip and " " in names[-1 - skip]:
                        firstname = names[-1 - skip].split()[0]
                        any_num = random.randint(0, 1000)
                        # "@" (not ".") separates the local part from the domain
                        fake_email = f"{firstname}{any_num}@{fake.free_email_domain()}"
                        # Drop the names that have been used up
                        del names[-1 - skip:]

                sentence = sentence.replace(email, fake_email)
            new_sentence_list.append(sentence)

        self.text = " ".join(new_sentence_list)

        return self.text

This code is a method named changeEmail that belongs to a larger class. The purpose of this method is to replace any email addresses in the text with fake email addresses, while also trying to infer the name of the person associated with the email address and including that name in the fake email address if possible.

Here is an explanation of how the code works:

The method starts by splitting the input text into individual sentences using the sent_tokenize function from the nltk library.
Next, the method initializes two empty lists: names and new_sentence_list. names will be used to keep track of any person names that are identified in the text, while new_sentence_list will store the updated sentences with fake email addresses.
A regular expression pattern is defined to match email addresses. This pattern is then used to find all email addresses in each sentence.
The method then iterates through each sentence, and for each sentence, it tokenizes the text into individual words using the word_tokenize function from nltk, and then tags each word with its part of speech using the pos_tag function from nltk.
The ne_chunk function from nltk is then used to extract named entities from the tagged tokens. In particular, the method looks for any entities that are labeled as “PERSON”, which would correspond to person names.
For each identified person name, the method adds it to the names list.
If any email addresses were found in the sentence, the method checks if the names list is non-empty. If it is, then the method infers the first name of the person associated with the email address (by taking the most recent entry in the names list and using its first word), and generates a fake email address using that first name and a random number as a prefix. If the most recent entry in the names list contains the string “USA”, then the method assumes that it was not a correctly identified person name and uses the second-to-last entry in the list instead.
If the names list is empty, or if the email address could not be associated with a person name for some reason, then the method generates a completely random fake email address using the fake.email() function from the faker library.
The updated sentence with the fake email address is then added to the new_sentence_list.
After iterating through all sentences, the method joins the updated sentences back into a single string and returns it.

Overall, this method uses a combination of regular expressions and natural language processing techniques to identify email addresses and person names in the input text, and then generates fake email addresses using a combination of the identified names and random numbers.
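
The name-based address construction can be isolated into a few lines. This is only a sketch: build_email is a hypothetical helper, the lowercasing is an added assumption, and in the method above the number would come from random.randint and the domain from fake.free_email_domain().

```python
def build_email(full_name, number, domain):
    """Build an address such as 'mike42@example.com' from a person's name."""
    # Use only the first word of the name as the local-part prefix.
    first_name = full_name.split()[0].lower()
    return f"{first_name}{number}@{domain}"

email = build_email("Mike Ross", 42, "example.com")
print(email)  # mike42@example.com
```

Keeping this step separate makes it easy to check that the local part and the domain are joined with "@", which is what makes the result a plausible email address.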

Change Mobile Number

    # change mobile numbers
    def changeMobileNumber(self):
        # Search for phone number patterns in the text
        phone_number_pattern = re.compile(r'\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}')
        phone_numbers = re.findall(phone_number_pattern, self.text)

        # Replace the phone numbers with fake ones generated by Faker
        for phone_number in phone_numbers:
            fake_phone_number = fake.phone_number()
            self.text = self.text.replace(phone_number, fake_phone_number)
            
        return self.text

This code defines a method named changeMobileNumber which replaces the existing mobile numbers in the input text with fake ones generated by the fake object from the faker library.

The first step is to search for phone number patterns in the input text using the regular expression pattern r'\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}'. This pattern matches phone numbers in the format (xxx) xxx-xxxx, xxx-xxx-xxxx, xxx.xxx.xxxx, or xxx xxx xxxx. The re.findall() method is then used to find all phone numbers in the text that match this pattern.

Next, the method iterates over all the found phone numbers and replaces them with fake phone numbers generated by the fake.phone_number() method from the faker library.

Finally, the updated text is returned by the method.
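
A quick way to confirm which formats the pattern accepts is to run it against a few sample strings. The samples below are made up for illustration.

```python
import re

PHONE_PATTERN = re.compile(r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}")

samples = [
    "(123) 456-7890",
    "123-456-7890",
    "123.456.7890",
    "123 456 7890",
    "1234567890",   # separators are optional, so ten plain digits match too
    "555-55-5555",  # SSN-shaped string: the 3-2-4 grouping does not match
]

# fullmatch checks whether the entire sample fits the pattern.
results = {sample: bool(PHONE_PATTERN.fullmatch(sample)) for sample in samples}
for sample, matched in results.items():
    print(f"{sample!r}: {matched}")

# Because separators are optional, the pattern also fires inside longer digit runs.
inside_long_number = PHONE_PATTERN.findall("4111111111111111")
print(inside_long_number)  # ['4111111111']
```

The last line shows a side effect to keep in mind: a 16-digit number written without separators contains a ten-digit run that the phone pattern will pick up.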

Change SSN

    # change SSN
    def changeSSN(self):
        # \b keeps the pattern from matching inside a longer run of digits
        ssn_regex = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
        self.text = ssn_regex.sub(lambda x: fake.ssn(), self.text)
        return self.text

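Here, the changeSSN() method replaces every SSN in the text that follows the xxx-xx-xxxx format. Because the replacement is a lambda, each match gets a freshly generated value.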
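
Whether the pattern is anchored with word boundaries makes a visible difference. The comparison below is a small sketch on a made-up sentence; LOOSE_SSN and STRICT_SSN are names chosen for this illustration.

```python
import re

LOOSE_SSN = re.compile(r"\d{3}-\d{2}-\d{4}")
STRICT_SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

text = "SSN 555-55-5555, order 1234-56-78901."

loose_matches = LOOSE_SSN.findall(text)
strict_matches = STRICT_SSN.findall(text)

print(loose_matches)   # ['555-55-5555', '234-56-7890']
print(strict_matches)  # ['555-55-5555']
```

Without the boundaries, part of the order number is treated as an SSN and would be overwritten; with them, only the standalone SSN is matched.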

Example Usage

You can access the whole working code here.

Now, create another Python file in the same directory as the class (saved as redaction.py, as the import below expects), copy the following code, and run it.

from redaction import DataRedaction

# the paragraph containing sensitive data
paragraph = (
    "My name is John Smith and my address is 123 Main St, Anytown USA. My email is johnsmith@example.com. "
    "My credit card number is 1234-5678-1234-5678 and SSN is 555-55-5555. "
    "My email is user@example.com. Today's date is 24-12-2023. Another date 2022-08-23. "
    "Also 2022/06/22 and 12/07/23 and 12/07/21. His mobile number is (123) 456-7890. "
    "Matt Henry is also responsible for the Uber Company. And Google, too."
)

if __name__ == '__main__':
    mod = DataRedaction(paragraph)
    modDate = mod.changeDate()
    modName = mod.changeName()
    modAddress = mod.changeAddress()
    modEmail = mod.changeEmail()
    modMobileNum = mod.changeMobileNumber()
    modSSN = mod.changeSSN()
    print(modSSN)

Concluding Remarks

One of the key benefits of using Faker is that it can help protect users’ privacy by generating realistic but fake data. In many cases, developers need to test their applications with realistic data but cannot use actual user data due to privacy concerns. Using Faker to generate fake data helps ensure that no real user data is used in testing.

For example, if a developer needs to test a feature that displays a user’s name on a webpage, they could use Faker to generate a random name instead of using a real name from their database. This way, they can test the functionality of the feature without exposing any real user data.


Shanto Roy