Automating Data Mapping In Clinical Trials With SDTM

Pavlo Chaikivskyi

Senior Engineer

April 29, 2025 7 min read

Innovative data standards hold the key to modern clinical trials.

The automation of clinical trials has been a goal of the pharmaceutical industry for many years. The use of Electronic Data Capture (EDC) and other technologies has streamlined the process of collecting and managing clinical trial data. However, the process of tabulating and analyzing data from clinical trials remains a manually intensive and error-prone task. As a flexible model that can be used for a variety of clinical trial designs and data types, the Study Data Tabulation Model (SDTM) lends a helping hand and provides a standardized way to organize clinical trial data. In this article, we will discuss the intricacies of clinical trial automation within the SDTM standards.

What Is SDTM?

SDTM is a data standard developed by the Clinical Data Interchange Standards Consortium (CDISC), a standards development organization that focuses on medical research data and strives to maximize its impact. SDTM provides a common structure for organizing and tabulating clinical trial data, thereby giving pharmaceutical companies a unified data standard. As a proven format for submitting clinical trial data, SDTM gives regulatory reviewers at the US Food and Drug Administration (FDA) a clear description of the structure, attributes, and contents of each dataset in the model. This common structure also allows data to be exchanged between different clinical trial software applications and databases. Once you have obtained all the relevant data for your research, it must be transformed into the specific table format suitable for FDA review.


What Are The Benefits Of SDTM Automation In Clinical Trials?

The advantages of SDTM automation in clinical trials are many and varied. Here are some of them:

Increased efficiency – cut down on the overall time and cost of clinical trials by streamlining data management.

Improved data quality – raise the quality of clinical trial data by ensuring its integrity and traceability.

Reduced risk of human error – lower the risk of human error in data collection and management, and avoid errors in the analysis of trial data.

Higher transparency – increase transparency in clinical trials by providing a clear audit trail of data collection and management processes.

Enhanced patient safety – safeguard patients by ensuring that critical trial data is managed accurately.

Last but not least, SDTM automation in clinical trials can facilitate communication between different sponsors, sites, and Contract Research Organizations (CROs), leading to better coordination and overall management of clinical trials. Given these strengths of SDTM automation, the question is how to implement it. Below are three approaches to SDTM data mapping.

SDTM Automation: A Hybrid Mapping Method

Raw source variables can be mapped into SDTM variables using a manual method, a hybrid process that combines application and manual parts, or a complete application technique. In the past, requirements were written into an Excel or Word file. Then, as part of an all-manual procedure, a programmer transferred the relevant specifications by hand into Statistical Analysis System (SAS) or Structured Query Language (SQL) code. The hybrid method reads specifications from an Excel file into a dataset, which a program then uses to dynamically build the code that maps the source variables. With a complete application method, such as SAS Clinical Data Integration, the application stores all of the requirements, reads them, and develops all of the SDTM mappings and derivations.
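The hybrid method described above can be illustrated with a minimal Python sketch. It assumes the Excel specifications have already been read into a list of rows; the column names (`domain`, `target`, `source_table`, `expression`), the raw table `raw_demog`, and the function `build_select` are hypothetical and chosen only for illustration. SQLite is used here simply because it ships with Python, not because the article prescribes it.

```python
import sqlite3

# Hypothetical specification rows, standing in for one worksheet of the
# mapping workbook after it has been read into a dataset.
SPEC_ROWS = [
    {"domain": "DM", "target": "STUDYID", "source_table": "raw_demog",
     "expression": "'STUDY01'"},
    {"domain": "DM", "target": "USUBJID", "source_table": "raw_demog",
     "expression": "'STUDY01-' || subject_id"},
    {"domain": "DM", "target": "SEX", "source_table": "raw_demog",
     "expression": "UPPER(gender)"},
]


def build_select(spec_rows, domain):
    """Build one SELECT statement for a domain from its specification rows."""
    rows = [r for r in spec_rows if r["domain"] == domain]
    if not rows:
        raise ValueError(f"no specification rows for domain {domain!r}")
    tables = {r["source_table"] for r in rows}
    if len(tables) != 1:
        # Joins across several source tables are out of scope for this sketch.
        raise ValueError("this sketch supports one source table per domain")
    columns = ",\n    ".join(f"{r['expression']} AS {r['target']}" for r in rows)
    return f"SELECT\n    {columns}\nFROM {tables.pop()}"


def run_demo():
    """Run the generated SQL against a tiny in-memory raw table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE raw_demog (subject_id TEXT, gender TEXT)")
    conn.execute("INSERT INTO raw_demog VALUES ('001', 'f')")
    return conn.execute(build_select(SPEC_ROWS, "DM")).fetchall()
```

Note that the expressions are pasted into the SQL text verbatim, so a sketch like this is only appropriate when the specification file comes from a trusted source and is reviewed before the generated script is run.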

Mapping clinical data manually lengthens data standardization and therefore increases the resources spent on a project. Automation seeks to reduce the time needed to create high-quality, CDISC-compliant data for FDA submission. That is why it is worth investing in tools that cut the effort spent on SDTM creation.

What Are The Main Difficulties Of Automating SQL Script Creation?

The SDTM specifications are usually entered into a standardized source mapping Excel workbook, with one worksheet per domain. Creating this standardized Excel file may be the most difficult part of the data flow because the user must map numerous scenarios correctly. Below, we consider some ways of writing SDTM mappings so that they can be transformed into an executable script.

The ability to create SQL scripts from plain language statements may appeal to users who are unfamiliar with query languages such as SQL. Text-to-SQL mapping is a Semantic Parsing problem, that is, the task of converting natural language input into a machine-interpretable representation. Semantic Parsing is a well-studied subject in Natural Language Processing (NLP) with a long history.

For this reason, Semantic Parsing attracts the interest of people who want to make SDTM creation less time-consuming and more effective. Different approaches have been developed to address the Semantic Parsing problem, and they might eventually be integrated into the broader task of translating natural language into a fully functional application. Generating SQL, however, is harder than the typical Semantic Parsing problem: a brief natural language question may require combining several tables or applying multiple filtering criteria. That is why more context-based techniques are required.


NLP in SQL Query Generation

Datasets for the semantic parsing of natural language into SQL consist of annotated questions paired with SQL queries. The questions concern a particular domain, and the answers are drawn from existing databases. Each question is thus linked to a SQL query, which is executed to retrieve the answer from the existing data.

Various Semantic Parsing datasets are now available for SQL query mapping, and they differ in numerous ways. For example, WikiSQL and Spider, two widely used datasets, are cross-domain and context-independent, and both contain a large number of questions and queries. Dataset size is critical for reliable model assessment, and unexpectedly complicated questions in the test sets can be used to assess a model's generalization capacity. Although the WikiSQL dataset comprises a huge number of questions and SQL queries, these queries are basic and each one targets a single table. The Spider dataset has fewer questions and SQL queries than WikiSQL. Its questions, however, are more complicated, and its SQL queries incorporate various SQL constructs, such as table joins and nested queries.
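To make the contrast concrete, here is a small Python sketch of what such annotated question and SQL pairs could look like. The two pairs below are invented for illustration and are not taken from WikiSQL or Spider; the tables `dm` and `ae`, their columns, and the helper names are likewise assumptions. The first pair touches a single table, while the second needs a nested query over two tables.

```python
import sqlite3

# Hypothetical annotated examples: each natural language question is
# paired with the SQL query that answers it.
EXAMPLES = [
    {
        "style": "single-table",
        "question": "How many subjects are female?",
        "sql": "SELECT COUNT(*) FROM dm WHERE sex = 'F'",
    },
    {
        "style": "nested query over two tables",
        "question": "Which subjects had at least one serious adverse event?",
        "sql": (
            "SELECT usubjid FROM dm "
            "WHERE usubjid IN (SELECT usubjid FROM ae WHERE aeser = 'Y') "
            "ORDER BY usubjid"
        ),
    },
]


def build_demo_db():
    """Create a tiny in-memory database with made-up demo records."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE dm (usubjid TEXT, sex TEXT)")
    conn.execute("CREATE TABLE ae (usubjid TEXT, aeser TEXT)")
    conn.executemany("INSERT INTO dm VALUES (?, ?)",
                     [("S1", "F"), ("S2", "M"), ("S3", "F")])
    conn.executemany("INSERT INTO ae VALUES (?, ?)",
                     [("S1", "Y"), ("S2", "N"), ("S3", "Y"), ("S3", "Y")])
    return conn


def answer(conn, example):
    """Execute the annotated SQL query and return the rows it retrieves."""
    return conn.execute(example["sql"]).fetchall()
```

A text-to-SQL model would be trained to produce the `sql` field from the `question` field; executing the query, as `answer` does, is how the response is retrieved from the existing data.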

The Rule-Based Approach To Automatic Code Creation

Another way to map raw source variables to SDTM variables relies on the following steps:

Reading SDTM mappings

Merging relationships in Excel worksheets

Converting the Excel worksheets to datasets

Deriving functions and standard expressions to read the datasets

Creating the SDTM mappings

Merging the data

Appending the data

The advantage of using functions is that they map the Excel requirements directly, which should result in fewer conflicts and faster development. With this approach, it would be possible to generate 50 to 70% of the variables for different SDTM domains. Collecting the derivations for the remaining output variables would require post-processing of the SDTM dataset. The provided metadata should contain all of the important information about the source variable, the format, and the new variable, and it might also be used to validate the mappings.
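The rule-based idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the tooling described above: the rule names (`constant`, `copy`, `upper`), the mapping rows, and the helper functions are all hypothetical, and the mapping rows stand in for a worksheet that has already been converted to a dataset. Variables whose rule is not recognized are reported as pending, which mirrors the portion that would be left for post-processing.

```python
# Hypothetical rule functions; each takes a raw record and one argument.
RULES = {
    "constant": lambda record, arg: arg,
    "copy": lambda record, arg: record[arg],
    "upper": lambda record, arg: str(record[arg]).upper(),
}

# Hypothetical mapping rows for one domain worksheet.
MAPPINGS = [
    {"target": "STUDYID", "rule": "constant", "arg": "STUDY01"},
    {"target": "SUBJID", "rule": "copy", "arg": "subject_id"},
    {"target": "SEX", "rule": "upper", "arg": "gender"},
    {"target": "RFSTDTC", "rule": "manual", "arg": ""},  # left for post-processing
]


def validate_mappings(mappings, source_columns):
    """Use the metadata to check that referenced source columns exist."""
    problems = []
    for m in mappings:
        if m["rule"] in ("copy", "upper") and m["arg"] not in source_columns:
            problems.append(f"{m['target']}: unknown source column {m['arg']!r}")
    return problems


def apply_mappings(raw_records, mappings):
    """Apply every automatable rule; return mapped rows and pending targets."""
    automated = [m for m in mappings if m["rule"] in RULES]
    pending = [m["target"] for m in mappings if m["rule"] not in RULES]
    mapped = [
        {m["target"]: RULES[m["rule"]](record, m["arg"]) for m in automated}
        for record in raw_records
    ]
    return mapped, pending
```

For example, `apply_mappings([{"subject_id": "001", "gender": "f"}], MAPPINGS)` maps three of the four variables and lists `RFSTDTC` as pending, while `validate_mappings` flags any rule that points at a source column missing from the raw data.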


Closing Remarks

The process of producing the final SDTM mappings can be automated. Several approaches can be used for that purpose, and the share of the work each one automates depends on the initial conditions of the mapping and its complexity. Even so, automating data mapping in clinical trials with SDTM can speed up data analysis and help ensure more accurate data management.
