Automation with R: Nine Use Cases

Data is one of the most valuable assets in business today. Companies large and small have developed new and innovative ways to gather, analyze, and even monetize data. While managing and analyzing data can be complex, there are tools that simplify the process. One such tool we’ve implemented at A-G Associates is R.

R is a free, open-source statistical programming language. As such, it is often associated with statisticians running calculations, testing hypotheses, or even creating new algorithms for machine learning. However, R can do more. At A-G Associates we’ve used R to develop increasingly efficient ways to complete our work, sometimes before running any statistical tests at all. Repetitive data-related tasks are easily handled in R, freeing time and attention for where they’re truly needed. Below are some examples of how R has been integrated into workflows to improve everything from data collection and management to quality control and automation.

  1. Automating compilation of data from many sources. R can read similarly formatted data from multiple files and output the relevant data into the format needed. For example, during one Substance Abuse and Mental Health Services Administration (SAMHSA) project, we used R to scan public comments from hundreds of HTML files, outputting all comments and helpful metadata fields into an Excel file for thematic analysis. This allowed us to work with data from hundreds of files without manually opening a single one (see sketch 1 after this list).
  2. Matching data from various sources. Suppose a Certified Community Behavioral Health Clinic grantee has patient data from its electronic health record and additional data from a regional health information exchange, and wants to connect the two datasets to examine patient outcomes. If a field is present in both datasets, R can match individuals almost instantaneously with a join. If exact matching is not possible because no perfect, unique identifier exists, R can instead match probabilistically, finding likely matches from whatever identifying fields are available, however imperfect (sketch 2 below).
  3. Scheduling automations. For clients who require a task to be completed or updated on a cadence, an R script can be written and set to run automatically at a specified time; once set up, these scripts require only requested or needed edits, saving many hours of labor down the road. The script can also email its results to a pre-specified list of addresses, allowing the drafted product to be reviewed before the update goes to the client (sketch 3 below).
  4. Outputting a formatted, data-driven PowerPoint presentation. A Department of Defense client required a PowerPoint presentation to be updated biweekly with new data. We wrote an R script to perform the necessary calculations and output a formatted PowerPoint deck with counts, tables, graphs, and other output. While the initial draft of the script was somewhat time-intensive, producing an updated deck is now just a matter of plugging in the new data and re-running the code: over 60 data-heavy slides in under 20 minutes (sketch 4 below).
  5. Outputting formatted data request letters. Like Word, R can output formatted letters (PDF or .docx) with data piped into tables or other output. For example, I previously wrote a script that generated a case request letter, with the distinct list of requested cases in a table, for each of more than one hundred counties; the letters could then be emailed or mailed (sketch 5 below).
  6. Writing and executing SAS code within R. Many government clients who have historically used SAS for their data processes are migrating to R. During the transition, some processes remain in SAS, so working in SAS is sometimes necessary. For one Department of Defense data process, we used R to automate the creation and execution of tailored SAS code (sketch 6 below).
  7. Outputting interactive data quality reports. Many clients require regular ingestion and processing of new data, including quality control to identify and address any data quality concerns. For this purpose, we’ve written R code that reads in, prepares, and examines the new data, then outputs the results to an HTML file (or another format, such as PDF, if preferred) with a tab for each quality control step or section, at whatever level of granularity is needed. With HTML output, the report can also include tables, data visualizations, and even buttons to download the underlying data (sketch 7 below).
  8. Comparing data sets. Sometimes quality control includes comparing a new or updated data set to an old one and finding any discrepancies between the two. R has many ways to do this almost instantaneously, including different types of joins and functions such as comparedf() from the arsenal package, which can list every single difference between two data sets for review (sketch 8 below).
  9. Reading data into R with APIs. As more and more clients collect data throughout their processes, they often use software such as REDCap or Greenspace to store it. Given the appropriate authorization, we can query that software through an application programming interface (API) call within R for use in reporting and analysis. This allows new or updated data to be imported directly into R by running a few lines of code, rather than manually downloading it every time it’s needed. Government sources such as the Centers for Disease Control and Prevention also make public-use data sets available via API (sketch 9 below).
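
To make these use cases more concrete, the numbered sketches below correspond to the list above. They are minimal illustrations rather than production code: file paths, object names, selectors, and credentials are all placeholders, and in most cases other packages would work just as well.

Sketch 1, compiling data from many HTML files into one Excel file. It assumes the comments sit in a hypothetical .comment-text element; rvest, purrr, and writexl are one possible toolset:

    library(rvest)      # HTML parsing
    library(purrr)      # iteration over files
    library(writexl)    # Excel output

    html_files <- list.files("public_comments", pattern = "\\.html?$", full.names = TRUE)

    comments <- map_dfr(html_files, function(path) {
      page <- read_html(path)
      tibble::tibble(
        source_file = basename(path),
        # ".comment-text" is a placeholder selector; the real one depends on the page structure
        comment     = html_text2(html_elements(page, ".comment-text"))
      )
    })

    write_xlsx(comments, "all_comments.xlsx")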
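
Sketch 2, matching two datasets. The exact match is a dplyr join; the probabilistic match uses the fastLink package, one of several record-linkage options. The data frames and column names are hypothetical:

    library(dplyr)
    library(fastLink)

    # Exact matching on a shared identifier
    matched <- inner_join(ehr_patients, hie_records, by = "patient_id")

    # Probabilistic matching on imperfect identifying fields
    link <- fastLink(
      dfA = ehr_patients, dfB = hie_records,
      varnames         = c("first_name", "last_name", "dob", "zip"),
      stringdist.match = c("first_name", "last_name")
    )
    likely_matches <- getMatches(dfA = ehr_patients, dfB = hie_records, fl.out = link)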
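
Sketch 3, scheduling a script and emailing the draft output. taskscheduleR works with the Windows Task Scheduler (cronR is the Linux/macOS equivalent) and blastula handles the email; paths, addresses, and credentials are placeholders:

    library(taskscheduleR)   # Windows Task Scheduler; use cronR on Linux/macOS

    # Run an existing script every Monday at 6:00 AM
    taskscheduler_create(
      taskname  = "weekly_report_refresh",
      rscript   = "C:/projects/weekly_report.R",
      schedule  = "WEEKLY",
      days      = "MON",
      starttime = "06:00"
    )

    # Inside weekly_report.R, the drafted output can be emailed for review with blastula
    # (SMTP credentials stored beforehand with blastula's key helpers)
    library(blastula)
    compose_email(body = md("This week's draft report is attached for review.")) |>
      add_attachment("weekly_report.xlsx") |>
      smtp_send(to = "reviewer@example.com", from = "reports@example.com",
                subject = "Weekly draft for review", credentials = creds_key("smtp"))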
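
Sketch 4, building a PowerPoint deck from data with the officer and flextable packages; the template file, slide layout, and site_counts data frame are illustrative:

    library(officer)
    library(flextable)

    deck <- read_pptx("branded_template.pptx")   # assumed corporate template

    deck <- deck |>
      add_slide(layout = "Title and Content", master = "Office Theme") |>
      ph_with(value = "Enrollment by Site", location = ph_location_type("title")) |>
      ph_with(value = flextable(site_counts), location = ph_location_type("body"))

    print(deck, target = "biweekly_update.pptx")

In practice, a loop over the report’s sections adds one slide per table or graph before the deck is written out.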
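
Sketch 5, rendering one formatted letter per county from a parameterized R Markdown template; the template name and county_requests data frame are assumed:

    library(rmarkdown)

    # county_requests: an assumed data frame of requested cases, one row per case
    for (cty in unique(county_requests$county)) {
      render(
        "request_letter_template.Rmd",   # template declaring params: county, cases
        params      = list(county = cty,
                           cases  = county_requests[county_requests$county == cty, ]),
        output_file = paste0("request_letter_", cty, ".pdf")
      )
    }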
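
Sketch 6, one common pattern for driving SAS from R: write the tailored SAS program to disk, then call the SAS executable in batch mode. The library path, dataset name, and sas.exe location all depend on the local installation:

    # Build a tailored SAS program as a string and write it out
    sas_code <- sprintf(
      "libname proj '%s';\nproc freq data=proj.%s;\n  tables status;\nrun;",
      "C:/data", "new_extract"    # illustrative library path and dataset name
    )
    writeLines(sas_code, "run_freq.sas")

    # Execute it in batch mode (requires a local SAS installation)
    system2("C:/Program Files/SASHome/SASFoundation/9.4/sas.exe",
            args = c("-sysin", "run_freq.sas", "-log", "run_freq.log"))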
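
Sketch 7, the skeleton of an R Markdown quality report that renders to tabbed HTML. The missing_summary and range_checks data frames are assumed to be built in earlier chunks, and the DT package supplies the interactive tables and download buttons:

    ---
    title: "Data Quality Report"
    output: html_document
    ---

    ## Quality checks {.tabset}

    ### Missingness

    ```{r, echo=FALSE}
    # The Buttons extension adds CSV/Excel download buttons to the rendered table
    DT::datatable(missing_summary, extensions = "Buttons",
                  options = list(dom = "Bfrtip", buttons = c("csv", "excel")))
    ```

    ### Out-of-range values

    ```{r, echo=FALSE}
    DT::datatable(range_checks)
    ```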
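
Sketch 8, comparing two versions of a dataset with comparedf() from the arsenal package; the data frame and key names are placeholders:

    library(arsenal)

    # Compare last month's extract with this month's, keyed on a shared ID
    comparison <- comparedf(old_extract, new_extract, by = "case_id")
    summary(comparison)   # counts of differing rows, columns, and values
    diffs(comparison)     # one row per value-level discrepancy, for review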
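
Sketch 9, reading data from a REDCap project through its API with the REDCapR package. The URL is a placeholder, and the token is assumed to be stored as an environment variable rather than typed into the script:

    library(REDCapR)

    pull <- redcap_read(
      redcap_uri = "https://redcap.example.org/api/",   # project-specific API endpoint
      token      = Sys.getenv("REDCAP_API_TOKEN")       # keep credentials out of code
    )
    new_data <- pull$data   # a data frame ready for reporting and analysis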

We hope this list inspires optimized business processes. Some of my favorite projects combine several of the strategies above to greatly reduce the labor hours required for repetitive processes. For example, a few years ago I wrote an R script to pull data from hundreds of PDF toxicology reports, organize it into the format needed, match it to cases present in a REDCap database, and write it to REDCap via the API (a condensed sketch follows below). To validate the code, we ran it on toxicology reports that had previously been entered by hand, comparing its results to the abstractors' manual data entry. The program produced fewer errors and reduced the labor hours required for this process by about 95%, freeing up abstractors’ time for more complex tasks moving forward. I like to think it improved the team’s morale, as well!
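
Here is that condensed sketch, with the PDF parsing reduced to a placeholder because it depends entirely on the report layout; pdftools for text extraction and REDCapR for the write-back are assumptions, not the only options:

    library(pdftools)
    library(REDCapR)

    # Extract raw text from each toxicology PDF
    pdf_files   <- list.files("tox_reports", pattern = "\\.pdf$", full.names = TRUE)
    report_text <- lapply(pdf_files, pdf_text)

    # ...parse report_text into a data frame keyed by case ID (parsed_results, hypothetical)...

    # Push the organized results into the matching REDCap records
    redcap_write(
      ds_to_write = parsed_results,
      redcap_uri  = "https://redcap.example.org/api/",
      token       = Sys.getenv("REDCAP_API_TOKEN")
    )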

Madison Merzke is a data scientist at A-G Associates with over four years of innovative experience in data collection, abstraction, management, analysis, visualization, and reporting. Her analytics skillset includes advanced statistical and machine learning methods to draw insights from biomedical data sets. She received her Bachelor of Public Health and Master of Applied Statistics degrees from the University of Kentucky.
