Test data management (TDM) is often overlooked despite its undeniable contribution to the overall success of the testing process. In complex testing projects with many test scenarios to cover, test data management is a highly valuable area to optimize.
QA teams need diverse and comprehensive test data to achieve higher test coverage, which creates the need for a dedicated place where that data is properly stored, managed, maintained, and prepared for future testing. That is exactly what test data management provides.
In this article, we will explore the concept of test data management in depth, along with test data management best practices, strategies, and tools that you can use for this activity.
Test data management (TDM) is the process of planning, creating, and maintaining the datasets used in testing activities, ensuring that they are the right data for the right test case, in the right format, and available at the right time.
Test data is the set of input values used during the testing of an application (software, web or mobile application, API, etc.). These values represent what a user would enter into the system in a real-world scenario. Testers can usually write a test script to automatically and dynamically feed the right type of values into the system and observe how it responds to that data.
For example, test data for testing a login page usually has two columns: a Username column and a Password column. A test script or automation testing tool can open the login page, identify the Username and Password fields, and then input the values:
| Username | Password |
| --- | --- |
| user_123 | Pass123! |
| testuser@email | Secret@321 |
| admin_user | AdminPass# |
| jane_doe | JaneDoePass |
You can have hundreds to thousands of such credential pairs representing unique test scenarios.
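To make this concrete, here is a minimal data-driven login test sketch in Python with Selenium. The page URL and the element IDs (`username`, `password`, `login-btn`) are assumptions for illustration only; swap in your application's actual locators and assertions.

```python
# Minimal data-driven login test sketch (Selenium, Python).
# URL and element IDs below are hypothetical placeholders.
from selenium import webdriver
from selenium.webdriver.common.by import By

credentials = [
    ("user_123", "Pass123!"),
    ("testuser@email", "Secret@321"),
    ("admin_user", "AdminPass#"),
    ("jane_doe", "JaneDoePass"),
]

driver = webdriver.Chrome()
for username, password in credentials:
    driver.get("https://example.com/login")  # assumed login page URL
    driver.find_element(By.ID, "username").send_keys(username)
    driver.find_element(By.ID, "password").send_keys(password)
    driver.find_element(By.ID, "login-btn").click()
    # Assert on whatever your application shows next, e.g. a dashboard
    # heading for valid credentials or an error banner for invalid ones.
driver.quit()
```

The same loop scales from four credential pairs to thousands: the test logic stays fixed while the dataset grows, which is exactly the separation TDM is meant to support.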
But having a huge database does not automatically mean all of it is high-quality: test data should also be evaluated against several major quality criteria.
Read More: A Guide To Data-driven Testing using Selenium and Katalon
There are several good reasons to put a proper test data management process in place.
Read More: Database Testing: A Complete Guide
Data masking is a technique used to protect sensitive information in non-production environments by replacing, encrypting, or otherwise “masking” confidential data while retaining the original data's format and functionality. Data masking creates a sanitized version of the data for testing and development purposes without exposing sensitive information.
How the data is masked depends on the algorithms QA teams choose. After cloning the data, there are quite a few ways to “play” with it and turn it into a completely new dataset in which the original identity of the users is protected. For example, you can use any of the techniques below:
| Data Masking Technique | Definition and Example |
| --- | --- |
| Substitution | Definition: Replace actual sensitive data with fictional or anonymized values. You can leverage generative AI for this approach; however, note that creating entirely new data is resource-intensive. Example: Replace real names with randomly generated names (e.g., John Doe). |
| Shuffling | Definition: Randomly shuffle the order of data records to break associations between sensitive information and other data elements. This approach is faster and easier to implement than substitution. Example: Shuffle the order of employee records, disconnecting salary information from individuals. |
| Encryption | Definition: Use encryption algorithms to transform sensitive data into unreadable ciphertext. Only authorized users with decryption keys can access the original data. This is a highly secure approach. Example: Encrypt customer card numbers so that only systems holding the decryption key can read them. |
| Tokenization | Definition: Replace sensitive data with randomly generated tokens. Tokens map to the original data, allowing reversible access by authorized users. Example: Replace social security numbers with unique tokens (e.g., Token123). |
| Character Masking | Definition: Mask specific characters within sensitive data, revealing only a portion of the information. Example: Mask all but the last four digits of a social security number (e.g., XXX-XX-1234). |
| Dynamic Data Masking | Definition: Dynamically control and limit the exposure of confidential data in real time during query execution. In other words, sensitive data is masked at the moment of retrieval, just before being presented to the user, usually based on the user's role. Example: Mask salary information in query results for users without financial access rights. |
| Randomization | Definition: Introduce randomness into the values of sensitive data to create diverse test datasets. Example: Randomly adjust salary values within a specified percentage range for a group of employees. |
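To illustrate a few of these techniques outside of any particular tool, here is a small Python sketch applying substitution, character masking, tokenization, and shuffling to an in-memory dataset. The record layout and field names are invented for illustration; real TDM tools apply equivalent rules directly against database columns.

```python
# Illustrative sketches of a few masking techniques from the table above.
# The record layout ("name", "ssn", "salary") is hypothetical.
import random
import uuid

def substitute_name(_real_name: str) -> str:
    """Substitution: swap a real name for a fictional one."""
    return random.choice(["John Doe", "Jane Roe", "Sam Poe"])

def mask_ssn(ssn: str) -> str:
    """Character masking: reveal only the last four digits."""
    return "XXX-XX-" + ssn[-4:]

token_vault: dict[str, str] = {}

def tokenize(value: str) -> str:
    """Tokenization: replace a value with a token, keeping a reversible map
    so authorized users can recover the original."""
    token = "tok_" + uuid.uuid4().hex[:8]
    token_vault[token] = value
    return token

def shuffle_column(records: list, column: str) -> None:
    """Shuffling: reorder one column to break its link to each record."""
    values = [r[column] for r in records]
    random.shuffle(values)
    for record, value in zip(records, values):
        record[column] = value

records = [
    {"name": "Alice Smith", "ssn": "123-45-6789", "salary": 52000},
    {"name": "Bob Jones", "ssn": "987-65-4321", "salary": 61000},
]
for r in records:
    r["name"] = substitute_name(r["name"])
    r["ssn"] = mask_ssn(r["ssn"])
shuffle_column(records, "salary")
print(records)
```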
Data subsetting is a technique for creating a smaller yet representative subset of a production database for use in testing and development environments. Working with a subset reduces storage and infrastructure costs, speeds up environment provisioning and test runs, and keeps the dataset small enough to manage, while still preserving the relationships that make the data realistic.
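As a minimal illustration of subsetting while keeping referential integrity, the sketch below uses SQLite to copy roughly 10% of a `customers` table, plus only the orders that belong to those sampled customers. The file names and schema (`customers`, `orders`, `orders.customer_id`) are assumptions for the example.

```python
# Subsetting sketch: sample parent rows, then keep only matching child
# rows so the subset stays referentially intact. Assumes a (masked)
# production clone with "customers" and "orders" tables.
import sqlite3

con = sqlite3.connect("production_copy.db")  # assumed source database
con.execute("ATTACH DATABASE 'test_subset.db' AS subset")

# Sample roughly 10% of customers at random.
con.execute("""
    CREATE TABLE subset.customers AS
    SELECT * FROM customers
    WHERE id IN (SELECT id FROM customers ORDER BY RANDOM()
                 LIMIT (SELECT COUNT(*) / 10 FROM customers))
""")

# Keep only the orders belonging to sampled customers.
con.execute("""
    CREATE TABLE subset.orders AS
    SELECT o.* FROM orders AS o
    WHERE o.customer_id IN (SELECT id FROM subset.customers)
""")
con.commit()
con.close()
```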
Synthetic data generation is the process of creating artificial datasets that simulate real-world data without containing any sensitive or confidential information. This approach is usually reserved for cases where obtaining real data is challenging (e.g., financial, medical, or legal data) or risky (e.g., employee personal information).
In such cases, generating entirely new sets of data for testing purposes is a more practical approach. These synthetic datasets aim to simulate the original dataset as closely as possible, which means capturing its statistical properties, patterns, and relationships.
To create new test data, you can leverage generative AI. Simply provide the AI with clear, specific prompts describing the dataset you want. If you want to go above and beyond, you can custom-train an AI on real-world data samples (making sure to tell it which statistical properties you want to preserve).
Of course, do not expect instant results when training an AI. With enough dedication, however, you can create a powerful engine fine-tuned to the specific test data needs of your organization.
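Even without an AI service, a small script can generate synthetic records that mimic measured statistics. The sketch below uses only Python's standard library; the field names and target statistics (a roughly normal order value, a 60/40 weekday/weekend split) are hypothetical stand-ins for the properties you would measure from your real dataset.

```python
# Synthetic-data sketch: generate fake customer records that follow
# assumed statistical properties of the real data.
import csv
import random
import string

def synthetic_customer(i: int) -> dict:
    name = "user_" + "".join(random.choices(string.ascii_lowercase, k=6))
    return {
        "id": i,
        "name": name,
        "email": f"{name}@example.test",
        # Hypothetical target distribution: order values roughly normal
        # with mean 80 and stdev 25, floored at 1.
        "avg_order_value": round(max(1.0, random.gauss(80, 25)), 2),
        # Hypothetical categorical pattern: ~40% of signups on weekends.
        "signup_day": random.choices(
            ["weekday", "weekend"], weights=[60, 40])[0],
    }

with open("synthetic_customers.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(synthetic_customer(0)))
    writer.writeheader()
    writer.writerows(synthetic_customer(i) for i in range(1000))
```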
Read More: A Guide To Do Synthetic Data Generation
Katalon is a well-known automation testing platform that comes with test data management features you can leverage right away. As a comprehensive platform, Katalon lets you do test planning, management, execution, and analysis for web, desktop, mobile, and API testing, with TDM best practices already baked in!
The first step of TDM is to generate data. Here’s how you can achieve synthetic data generation with Katalon. Make sure that you have the latest version of Katalon installed, which you can download using the link below:
Download Katalon and Witness its Power
Once you are in Katalon, open any test case, or create one if you are starting from scratch. Then write a clear prompt instructing StudioAssist, Katalon's GPT-powered assistant, on the test script you want it to create. Use actionable language, provide the necessary context, and specify the expected results. For instance, a prompt along the lines of “Generate a test that opens the login page, enters a username and password, clicks Sign In, and verifies that the dashboard is displayed” gives it enough context to work with.
After that, select the prompt, right-click, choose StudioAssist, and then select “Generate Code.” The code will be generated based on your instructions, and you can freely adjust it as needed.
Tricentis Tosca is a comprehensive enterprise-grade automation testing tool for web, API, mobile, and desktop applications. It has a distinctive model-based testing methodology, enabling users to scan an application’s UI or APIs to create a business-oriented model for test development and maintenance.
Tricentis comes with a Test Data Management web application that allows you to view, modify, or delete records in your test data repositories. The TDM module is automatically installed as part of the Test Data Service component in the Tricentis Tosca Server setup.
With the IBM Test Data Management Solution, you can browse, edit, and compare data, ensuring that test results align with the original data. With support for complex data models and heterogeneous relationships, it maintains data integrity for application testing and migration.
Additionally, IBM TDM provides data privacy features to mask sensitive information while maintaining its validity for testing purposes. There are also interfaces for designing, testing, and automating test data management processes, enabling self-service for end users.