The problem in genomics for clinicians and researchers is not data storage (thanks in large part to Amazon Web Services). It’s data analysis.
There is no shortage of data today, and there will be even more tomorrow. The organizations that survive this data tsunami will be those that manage it well and extract key insights from it. Amazon S3, AWS’ object storage service, and the services built around it are key components of how we quantify and manage our data and the processes around it, which translates into more efficient operations and better insights.
For a steward of large and varied datasets, any efficiency gained from quantifying and categorizing data can pay huge dividends. In my previous roles, I started by accounting for what we had and what we didn’t have. That may sound simple, but with large volumes of data and files it can be difficult and time-consuming. Even smaller datasets can be a challenge, especially when the data is constantly in motion, as ours is. It is heartache I have experienced many times over the course of my career.
One of the first things I undertook when I started my current job at Inova was a data-quantification and accounting mission. Our institute had been in existence for nearly three years before I was hired. We enroll participants in our research studies and generate large volumes of genomic and biological data. The AWS API was the best way to manage and quantify that data: it let us build an inventory of everything we stored on AWS.
Although our AWS skill sets were not yet fully developed, we were quickly able to quantify our data using a small set of Python-based scripts. The scripts reported useful attributes of each data object: its location, its size in bytes, and when it was last modified. We could quickly see how our data was moving, what was coming in from outside vendors and how much was derived data, as well as the variability in file sizes and file counts. Our data was “alive,” constantly growing and moving.
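For illustration, here is a minimal sketch of what such an inventory script can look like with boto3. The bucket name and output file are placeholders, not our actual configuration.

```python
# inventory_s3.py -- minimal sketch of an S3 inventory script.
# The bucket name and output file are placeholders.
import csv
import boto3

BUCKET = "example-genomics-bucket"  # placeholder

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

with open("s3_inventory.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["key", "size_bytes", "last_modified"])
    # Page through every object in the bucket and record the three attributes
    # we cared about: location (key), size in bytes, and last-modified time.
    for page in paginator.paginate(Bucket=BUCKET):
        for obj in page.get("Contents", []):
            writer.writerow([obj["Key"], obj["Size"], obj["LastModified"].isoformat()])
```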
Initially this was a manual process: we ran the scripts only at specific times, and a run could take hours, with some times of day clearly more efficient than others. As our data grew and more people began working with it, we decided to automate and scheduled the scripts to run at 2 AM.
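There are several ways to schedule a run like that on AWS. The sketch below assumes the inventory script has been wrapped in a Lambda function (the ARN is a placeholder) and uses an EventBridge cron rule to trigger it nightly; it illustrates the idea rather than our exact setup.

```python
# Sketch: schedule a nightly 2 AM (UTC) run with an EventBridge cron rule.
# Assumes the inventory script is wrapped in a Lambda function; the ARN below
# is a placeholder. A lambda add-permission call allowing events.amazonaws.com
# to invoke the function would also be needed, and is omitted here.
import boto3

events = boto3.client("events")

events.put_rule(
    Name="nightly-s3-inventory",
    ScheduleExpression="cron(0 2 * * ? *)",  # every day at 02:00 UTC
    State="ENABLED",
)
events.put_targets(
    Rule="nightly-s3-inventory",
    Targets=[{
        "Id": "inventory-lambda",
        "Arn": "arn:aws:lambda:us-east-1:123456789012:function:s3-inventory",  # placeholder
    }],
)
```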
Accounting was Step 1; context was Step 2. We knew what files we had, but not the context around them. Once our processes and data requirements were normalized, we began integrating the inventory into our internal data warehouse, because we needed more information about the objects we were storing than the AWS object store itself could provide.
We began by adding internal metadata for our AWS data objects. We wanted to answer the question, “From which study participant was a given biological data file derived?” To accomplish this, we added columns to our data warehouse tables to represent these objects and created custom Extract, Transform, Load (ETL) jobs to match our objects with participants.
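As a simplified illustration of that matching step, the sketch below assumes a hypothetical key convention of vendor/participant_id/filename and a participant list exported from the warehouse; our actual naming convention and schema differ.

```python
# Sketch of the ETL matching step. Assumes a hypothetical key convention of
# <vendor>/<participant_id>/<filename>; treat file names and layout as illustrative.
import csv

def participant_id_from_key(key):
    """Pull the participant ID out of a key like 'vendorA/PID-0042/sample.bam'."""
    parts = key.split("/")
    return parts[1] if len(parts) >= 3 else None

# Load the known participants (e.g., exported from the data warehouse).
with open("participants.csv") as f:
    known_ids = {row["participant_id"] for row in csv.DictReader(f)}

# Annotate each inventoried object with its participant, leaving orphans blank.
with open("s3_inventory.csv") as f, open("objects_with_participants.csv", "w", newline="") as out:
    reader = csv.DictReader(f)
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames + ["participant_id"])
    writer.writeheader()
    for row in reader:
        pid = participant_id_from_key(row["key"])
        row["participant_id"] = pid if pid in known_ids else ""
        writer.writerow(row)
```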
Our team and the health system were new to genomic data of this size and scale. However, we could now answer questions about our data files in relation to an individual participant or an entire family, and we could measure the size of our data objects broken down by vendor.
Step 3 was using that information to drive business decisions. By viewing our data in the context of our participants, we could make better estimates of data storage costs and better decisions about how to move and share data. We could, for example, estimate the data storage cost for each study.
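As a rough illustration of that kind of estimate, the sketch below groups the inventory by a hypothetical vendor prefix in the object key and applies an example S3 Standard rate; the price shown is illustrative only, so check current pricing before relying on a figure like this.

```python
# Back-of-the-envelope storage-cost estimate from the inventory, grouped by
# vendor prefix. The $/GB-month rate is an example, not an official quote.
import csv
from collections import defaultdict

PRICE_PER_GB_MONTH = 0.023  # illustrative S3 Standard rate

bytes_by_vendor = defaultdict(int)
with open("s3_inventory.csv") as f:
    for row in csv.DictReader(f):
        vendor = row["key"].split("/")[0]  # assumes <vendor>/... key layout
        bytes_by_vendor[vendor] += int(row["size_bytes"])

for vendor, total_bytes in sorted(bytes_by_vendor.items()):
    gb = total_bytes / 1024 ** 3
    print(f"{vendor}: {gb:,.1f} GB, about ${gb * PRICE_PER_GB_MONTH:,.2f}/month")
```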