October 27, 2015
Source: CIMMYT
by Gideon Kruseman
Both private and public sector research organizations must adopt data management strategies that keep pace with the advent of big data if we hope to conduct research effectively and accurately. CIMMYT and many other donor-dependent research organizations operate in environments of declining funding and need to make the most of available resources. Data management strategies based on the data lake concept are essential for improved research analysis and greater impact.
We create 2.5 quintillion bytes of data daily, so much that 90% of the data in the world today has been created in the last two years alone. This data comes from everywhere: sensors used to gather climate information, drones taking images of breeding trials, posts on social media sites, cell phone GPS signals, and more, along with traditional data sources such as surveys and field trial records. This is big data: data characterized by volume, velocity, and variety.
Twentieth century data management strategies focused on making data available in standard formats and structures in databases and/or data warehouses (a data warehouse being a combination of many different databases across an entire enterprise). The major drawback of the data warehouse concept is the perception that putting data into the storage system is too much trouble for too little direct benefit, which acts as a disincentive to building corporate-level data repositories. The result is that within many organizations, including CIMMYT, not all data is accessible.
Today’s technology and processing tools, such as cloud computing and open-source software (for example, R and Hadoop), have enabled us to harness big data to answer questions that were previously out of reach. However, with this opportunity comes the challenge of developing alternatives to traditional database systems, because big data is too big, too fast, or doesn’t fit the old structures.
Diagram courtesy of Gideon Kruseman
One alternative storage and retrieval system that can handle big data is the data lake. A data lake is a store-everything approach to big data: a massive, easily accessible, centralized repository of large volumes of structured and unstructured data.
Advocates of the data lake concept believe any and all data can be captured and stored in a data lake. In their view, new information technologies let a data lake support more questions and better answers while ensuring flexibility and agility. However, without metadata (data that describes the data we are collecting) and a mechanism to maintain it, data lakes can become data swamps where data is murky, unnavigable, has unknown origins, and is ultimately unreliable. In a data swamp, every subsequent use of the data means scientists and researchers start from scratch. Metadata also allows extraction, transformation, and loading (ETL) processes to be developed that retrieve data from operational systems and process it for further analysis.
Metadata and well-defined ETL procedures are essential for a successful data lake. Much of CIMMYT’s data is still in a data swamp because these two essential structures are lacking. Genomic data is an exception: the sheer amount CIMMYT collects requires intense data management. In addition, gene sequencing data is well structured and easier to manage than other CIMMYT data, such as field trial and survey data.
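To make the link between metadata and ETL concrete, here is a minimal Python sketch of an ETL step that is driven entirely by a metadata record. It is an illustration under stated assumptions, not a description of any CIMMYT system: the dataset name, column names, units, the ColumnMeta class, and the run_etl function are all hypothetical, and the sample values are invented for the example.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnMeta:
    """Hypothetical metadata entry describing one column of a raw file."""
    source_name: str    # header as it appears in the raw file
    target_name: str    # standardized name used in analysis
    dtype: type         # Python type the values are cast to
    scale: float = 1.0  # multiplier that converts to the standard unit


# Hypothetical metadata record: where the data came from and what it means.
TRIAL_METADATA = {
    "dataset": "example_maize_trial",
    "origin": "illustrative values, not a real trial",
    "columns": [
        ColumnMeta("plot", "plot_id", str),
        ColumnMeta("variety", "variety", str),
        ColumnMeta("yield_kg_ha", "yield_t_ha", float, scale=0.001),
    ],
}

# Invented raw file content standing in for an operational system export.
RAW_CSV = """plot,variety,yield_kg_ha
P1,A,5200
P2,B,4800
"""


def run_etl(raw_text, metadata):
    """Extract rows from CSV text, transform them per the metadata, return them."""
    reader = csv.DictReader(io.StringIO(raw_text))  # extract
    rows = []
    for record in reader:
        clean = {}
        for col in metadata["columns"]:             # transform
            value = col.dtype(record[col.source_name])
            if col.dtype is float:
                value = value * col.scale           # unit conversion for numeric columns
            clean[col.target_name] = value
        clean["dataset"] = metadata["dataset"]      # keep provenance with each row
        rows.append(clean)
    return rows                                     # "load" is a plain list here


trial_rows = run_etl(RAW_CSV, TRIAL_METADATA)
```

Because the column names, types, and unit conversions live in the metadata rather than in the function, the same run_etl could in principle be reused for another dataset by supplying a different metadata record, which is the kind of re-use the article argues for.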
In rare cases when the structure and meaning of the data are obvious, the fact that data is still in the swamp is not a major issue because we can define its structure easily, creating a schema-on-read once there is an analytical purpose for the data. Unlike conventional ETL, schema-on-read applies a plan or schema to data as it is pulled out of a stored location, rather than as it goes in. The downside is that creating a schema-on-read is often time-consuming and difficult if done separately from data accession, because one must separately define both the metadata and the relevant ETL procedure to ensure the data can be used in a variety of analyses. Although data lake advocates support schema-on-read, the effort and information required to create it often make the process impractical.
A data lake strategy with metadata and ETL procedures as its cornerstone is essential for CIMMYT to maximize data use and re-use and to conduct accurate and impactful analyses. Next steps include educating scientists and management and developing structures and procedures that support a CIMMYT data lake. Ideally, this should result in a flexible metadata database with a user-friendly interface for perusing the metadata and the related ETL procedures, making the use, re-use, and integration of data easier and quicker, and research for development more efficient and effective.
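For readers who want to see what schema-on-read, as discussed above, looks like in practice, the following Python sketch stores records exactly as they arrive and applies a schema only when an analysis needs them. Everything in it is an assumption made for illustration: the survey fields, the AREA_SCHEMA mapping, and the read_with_schema function are hypothetical, and the records are invented.

```python
import json

# Raw records kept exactly as they arrived; fields vary from record to record.
RAW_LAKE = [
    '{"farmer": "F1", "area_ha": "1.5", "crop": "maize"}',
    '{"farmer": "F2", "area_ha": "0.8", "crop": "wheat", "note": "late planting"}',
    '{"farmer": "F3", "crop": "maize"}',
]

# Schema defined at read time for one analysis: field -> (cast, default).
AREA_SCHEMA = {
    "farmer": (str, None),
    "area_ha": (float, None),
}


def read_with_schema(raw_records, schema):
    """Yield one row per raw record, keeping only the fields the schema names."""
    for line in raw_records:
        record = json.loads(line)
        row = {}
        for field, (cast, default) in schema.items():
            raw_value = record.get(field)
            # Missing fields fall back to the default instead of failing.
            row[field] = cast(raw_value) if raw_value is not None else default
        yield row


area_rows = list(read_with_schema(RAW_LAKE, AREA_SCHEMA))
total_area = sum(r["area_ha"] for r in area_rows if r["area_ha"] is not None)
```

Note that the schema here had to be written by someone who already knew that area_ha holds hectares stored as text and that some records lack it. Without metadata recorded at accession, that knowledge has to be rediscovered for each analysis, which is the cost of schema-on-read that the article describes.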