You’ve heard a lot about writing secure code from us (and rightly so!), but it’s only one part of application security. Keeping your software safe also involves other aspects, such as making sure the datasets you use are managed securely.
After all, even the most secure code can be put at risk by unsafe data. Your defenses might be robust, but if sensitive datasets are mismanaged, they can undermine your organization’s entire security posture.
Developers often need access to data that mimics production environments to verify application behavior and debug issues.
In practice, however, that need for “realistic” data can be risky. For example, copying production data into development or staging environments without proper controls is not a good idea.
Sure, it makes the development process quicker, but it introduces major security and privacy risks, especially if personally identifiable information (PII) is involved.
If you’re working with datasets that include PII, you need to ask questions such as:
Do I have permission to use this data?
Was it collected and stored lawfully?
What rules govern how I can use, share, or retain it?
Is it necessary to access the full dataset, or can I work with a sanitized version?
Depending on where your users are and where your application operates, you may be subject to one or more data protection laws. Below are a few you’ve probably heard of and should take seriously (otherwise, huge fines could come your way):
GDPR (EU): Requires organizations to process personal data lawfully, transparently, and for a specific purpose, collecting only what’s needed and embedding privacy from the outset.
HIPAA (US): Applies to healthcare data and mandates strict controls over who can access, share, and store protected health information (PHI).
CCPA/CPRA (California): Gives consumers the right to know what data is collected about them, to request its deletion, and to opt out of the sale of their data.
POPIA (South Africa), LGPD (Brazil), PIPEDA (Canada): Each of these applies similar principles around lawful processing, consent, and data protection.
You don’t need to be a legal expert, but you should be aware of how data is classified, what protections are required, and whether proper permissions exist to use a dataset in a given context.
Developers need realistic datasets to test features and debug effectively, but, as mentioned, you should avoid using actual production data directly. Otherwise, private information can be exposed, even in internal environments like dev or staging.
The safer approach is to work with datasets that look and behave like production data but don’t contain personal or confidential information. This way, you can still develop and test features without putting anyone’s data at risk.
Datasets need to be carefully cleaned, which often means removing sensitive details and replacing identifiers with safer alternatives before they’re used.
For example, if you’re testing a patient dashboard feature in the healthcare industry, the original dataset might include full names, patient IDs, medical histories, and more.
Before that data is used in development, it needs to be cleaned: personally identifiable details must be removed or anonymized, and possibly replaced with realistic but fake values.
When we talk about anonymization, we’re referring to the process of removing identifying information from a dataset so that individuals can’t be re-identified. How is this done? Usually through a combination of techniques like these (see the sketch after the list):
Removing or replacing names, emails, or ID numbers.
Randomizing dates of birth or zip codes.
Masking fields like credit card numbers or phone numbers.
Obscuring or removing sensitive fields entirely.
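To make this concrete, here’s a minimal sketch of those techniques applied to the patient-record example above. It assumes a simple dictionary-shaped record, and the field names (patient_id, dob, phone) are illustrative:

```python
import random

# An illustrative record; the field names are hypothetical.
patient = {
    "name": "Jane Doe",
    "patient_id": "P-10042",
    "dob": "1984-03-17",
    "phone": "555-867-5309",
    "diagnosis": "hypertension",
}

def anonymize(record):
    """Return a copy safe for dev/test, with direct identifiers removed."""
    safe = dict(record)
    safe.pop("name", None)  # drop direct identifiers entirely
    safe["patient_id"] = f"TEST-{random.randint(100000, 999999)}"  # random stand-in
    safe["dob"] = record["dob"][:4] + "-01-01"  # coarsen date of birth to the year
    safe["phone"] = "***-***-" + record["phone"][-4:]  # mask all but the last 4 digits
    return safe

print(anonymize(patient))
```

Keep in mind that coarse fields like birth year and zip code can still act as quasi-identifiers when combined, so review the whole record, not just the obviously sensitive fields.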
There’s also pseudonymization, where identifiers like names are replaced with codes, such as “user1234”. This removes obvious identifying information, but the data can still be linked back to the original users if you have the mapping key.
It’s important to note that pseudonymized data usually still falls under regulations like GDPR, whereas properly anonymized data often does not. Even in development, using pseudonymized data may require the same compliance checks as the original production data.
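As a rough illustration of the difference, here’s a minimal pseudonymization sketch (the code format and storage details are assumptions; in practice, the mapping key must be stored separately and tightly access-controlled):

```python
import itertools

_counter = itertools.count(1)
_mapping = {}  # the key linking codes back to real users -- protect it!

def pseudonymize(name):
    """Replace a name with a stable code like 'user0001'.

    Reversible for anyone holding _mapping, which is why the output
    is generally still personal data under GDPR.
    """
    if name not in _mapping:
        _mapping[name] = f"user{next(_counter):04d}"
    return _mapping[name]

print(pseudonymize("Jane Doe"))    # user0001
print(pseudonymize("John Smith"))  # user0002
print(pseudonymize("Jane Doe"))    # user0001 (same input, same code)
```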
Generating synthetic data is another approach that can be used. It resembles real datasets but doesn’t expose any personal information, and allows developers to test and debug applications with realistic patterns while still keeping user information safe.
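For instance, here’s a minimal sketch using the third-party Faker library, one common way to do this in Python (the record shape is illustrative):

```python
from faker import Faker  # third-party: pip install Faker

fake = Faker()

def synthetic_patient():
    """Generate a realistic-looking record that belongs to no real person."""
    return {
        "name": fake.name(),
        "patient_id": f"P-{fake.random_int(10000, 99999)}",
        "dob": fake.date_of_birth(minimum_age=18, maximum_age=90).isoformat(),
        "email": fake.email(),
    }

print([synthetic_patient() for _ in range(3)])
```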
These techniques allow developers to build, test, and debug effectively, without putting user privacy or regulatory compliance at risk. If your organization provides anonymized dataset versions for development and UAT, use those as your default.
The 2016 Uber breach is a good example of how a simple development misstep can put massive amounts of user data at risk.
Attackers gained access to 57 million records after AWS credentials were accidentally committed to a private GitHub repo, which they then used to access S3 buckets containing confidential data.
Below are some other typical security issues related to dataset misuse in development.
Accidental exposure: Sensitive data ends up in logs, screenshots, or demo environments.
Unsecured environments: Dev or UAT systems don’t have the same access controls and encryption as production.
Lack of oversight: Developers are given access to datasets without proper documentation on data origin, permissions, or restrictions.
We always hear about “shifting security left” to catch problems earlier in the software development life cycle. The same idea goes for datasets. It’s not enough to protect data only in production, as you want to keep it safe at every step of development.
Developers can make data security a part of their everyday work when they:
Question the source: If you’re handed a dataset, ask where it came from, who owns it, and what it contains.
Avoid shortcuts: Don’t pull production data into dev, even if “everyone else does it.”
Automate sanitization: Build data masking and anonymization steps into your CI/CD pipeline if needed.
Log responsibly: Be cautious about what gets logged, as personal data should never appear in logs, even temporarily (see the redaction sketch after this list).
Secure your environments: Even if your dev machine or staging server isn’t customer-facing, it still needs to be secured as if it were.
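On the logging point, one way to enforce this is to scrub known PII patterns before messages ever reach a handler. Here’s a minimal sketch using Python’s standard logging module; the regex patterns are illustrative and should be extended for the identifiers your application actually handles:

```python
import logging
import re

# Illustrative patterns; extend for the identifiers your app handles.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

class RedactingFilter(logging.Filter):
    """Scrub PII from log records before they reach any handler."""
    def filter(self, record):
        msg = record.getMessage()
        msg = EMAIL.sub("[REDACTED-EMAIL]", msg)
        msg = SSN.sub("[REDACTED-SSN]", msg)
        record.msg, record.args = msg, None  # replace the formatted message
        return True

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("app")
log.addFilter(RedactingFilter())
log.info("Password reset requested by jane.doe@example.com")
# INFO:app:Password reset requested by [REDACTED-EMAIL]
```

A filter like this is a safety net, not a substitute for avoiding logging personal data in the first place.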
As you can see, secure development isn’t only about code, but also about how you work with data throughout the software lifecycle.
We’re expanding our training content on the SecureFlag platform to help developers strengthen their data security practices, covering topics such as:
How to prepare data safely without risking exposure.
Techniques for anonymizing data to protect privacy.
Best practices for cleaning sensitive datasets.
These learning paths and labs are on their way to help you learn secure data handling, so your applications stay safe, compliant, and trustworthy from the first commit to production.
Want to learn more about how we can help? Get in touch today!