Integrating Long-Tail Data: How Far Are We?
September 5, 2021
EDI’s Kristin Vanderbilt and Corinna Gries co-edited a Special Issue of Ecological Informatics “Integrating Long-Tail Data: How Far Are We?” that explores how far the informatics community has come toward lessening the time researchers must spend integrating small, heterogeneous datasets prior to analyzing them.
Such datasets, often viewed as being from the “long-tail of science”, may differ by temporal and spatial scales, units, semantics, data collection methodology, sampling design, and data organization. Harmonizing data across these dimensions can be daunting, and the Special Issue (SI) describes some recently implemented or proposed mechanisms for accelerating this process. Below we briefly introduce each paper and the particular challenge it sought to ameliorate. These papers clearly illustrate the important role that repositories have in facilitating data synthesis efforts.
Semantics are often a challenge when data are integrated. When terms used in different datasets to describe variables or methods have not been drawn from a controlled vocabulary, they must be mapped to a common set of terms so that data can be understood. In the SI, Lenters et al. (2020) demonstrate how providing a metadata template to plant trait data providers, allowing them to link their idiosyncratic variable names to terms in standardized plant trait ontologies, facilitated the integration of 15 palm datasets with 140,000 plant traits.
Use of data standards, which provide semantic and structural consistency among datasets, is a promising way to make data easier to integrate. Ely et al. (2021) and Bond-Lamberty et al. (2021), describe how they developed standardized data formats for leaf-level gas exchange data and soil-to-atmosphere carbon flux data. By engaging many community data providers and data users while defining the standard, consensus was reached on reporting formats that include guidance for variable names and definitions, file structure, units and metadata content.
Data newly contributed to a data repository can take advantage of existing data standards, but what about datasets that have been contributed to repositories over the last decades? Is there any way to make them easier to integrate? O’Connell et al. (2021) describe the labor intensive process needed to standardize sage grouse monitoring data sourced from several state agencies and how they developed a database tool to make incorporation of data updates easier. O’Brien et al. (2021) describe a different approach, where two collaborating repositories, EDI and NEON, developed an analysis-ready design pattern into which community survey data can be refactored. The workflow to transform the data into the design pattern utilizes repository infrastructure and can be automated to include new data.
Machine-actionability and potential for automated processing of data also influences how readily data can be harmonized. Huber et al. (2021) surveyed multiple repositories and reported that each has developed its own custom tools to support data ingestion into the various analytical platforms scientists use. The authors argue that repositories should cooperate to develop generic tools that are independent of the coding language used. They offer a roadmap to reach this goal with specific implementation suggestions.
Finally, any data synthesis is limited by the availability of data. One barrier to submission of data to repositories is a scientist’s concern about getting credit. Agarwal et al. (2021) discuss options for citing datasets when the number to be cited outstrips the space available in the journal. Approaches are discussed that allow the data provider to receive credit while offering the data user a compact citation for a group of datasets.
It is clear from this set of papers that repositories have a significant role to play in making data easier to harmonize and integrate. Repositories can:
- Recommend terminologies for data contributors to use.
- Require that data conform to particular data standards.
- Develop tools to support use of their machine-actionable data and workflows.
- Reformat data for a particular domain to make it easier to reuse and integrate.
- Make recommendations regarding data citation to encourage deposition and reuse of data.
Papers in Special Issue
Agarwal, D.A., Damerow, J., Varadharajan, C., Christianson, D.S., Pastorello, G.Z., Cheah, Y.-W., Ramakrishnan, L., 2021. Balancing the needs of consumers and producers for scientific data collections. Ecological Informatics 62, 101251. https://doi.org/10.1016/j.ecoinf.2021.101251
Bond-Lamberty, B., Christianson, D.S., Crystal-Ornelas, R., Mathes, K., Pennington, S.C., 2021. A reporting format for field measurements of soil respiration. Ecological Informatics 62, 101280. https://doi.org/10.1016/j.ecoinf.2021.101280
Ely, K.S., Rogers, A., Agarwal, D.A., Ainsworth, E.A., Albert, L.P., Ali, A., Anderson, J., Aspinwall, M.J., Bellasio, C., Bernacchi, C., Bonnage, S., Buckley, T.N., Bunce, J., Burnett, A.C., Busch, F.A., Cavanagh, A., Cernusak, L.A., Crystal-Ornelas, R., Damerow, J., Davidson, K.J., De Kauwe, M.G., Dietze, M.C., Domingues, T.F., Dusenge, M.E., Ellsworth, D.S., Evans, J.R., Gauthier, P.P.G., Gimenez, B.O., Gordon, E.P., Gough, C.M., Halbritter, A.H., Hanson, D.T., Heskel, M., Hogan, J.A., Hupp, J.R., Jardine, K., Kattge, J., Keenan, T., Kromdijk, J., Kumarathunge, D.P., Lamour, J., Leakey, A.D.B., LeBauer, D.S., Li, Q., Lundgren, M.R., McDowell, N., Meacham-Hensold, K., Medlyn, B.E., Moore, D.J.P., Negrón-Juárez, R., Niinemets, Ü., Osborne, C.P., Pivovaroff, A.L., Poorter, H., Reed, S.C., Ryu, Y., Sanz-Saez, A., Schmiege, S.C., Serbin, S.P., Sharkey, T.D., Slot, M., Smith, N.G., Sonawane, B.V., South, P.F., Souza, D.C., Stinziano, J.R., Stuart-Haëntjens, E., Taylor, S.H., Tejera, M.D., Uddling, J., Vandvik, V., Varadharajan, C., Walker, A.P., Walker, B.J., Warren, J.M., Way, D.A., Wolfe, B.T., Wu, J., Wullschleger, S.D., Xu, C., Yan, Z., Yang, D., 2021. A reporting format for leaf-level gas exchange data and metadata. Ecological Informatics 61, 101232. https://doi.org/10.1016/j.ecoinf.2021.101232
Huber, R., D’Onofrio, C., Devaraju, A., Klump, J., Loescher, H.W., Kindermann, S., Guru, S., Grant, M., Morris, B., Wyborn, L., Evans, B., Goldfarb, D., Genazzio, M.A., Ren, X., Magagna, B., Thiemann, H., Stocker, M., 2021. Integrating data and analysis technologies within leading environmental research infrastructures: Challenges and approaches. Ecological Informatics 61, 101245. https://doi.org/10.1016/j.ecoinf.2021.101245
Lenters, T.P., Henderson, A., Dracxler, C.M., Elias, G.A., Kamga, S.M., Couvreur, T.L.P., Kissling, W.D., 2021. Integration and harmonization of trait data from plant individuals across heterogeneous sources. Ecological Informatics 62, 101206. https://doi.org/10.1016/j.ecoinf.2020.101206
O’Brien, M., Smith, C.A., Sokol, E.R., Gries, C., Lany, N., Record, S., Castorani, M.C.N., 2021. ecocomDP: A flexible data design pattern for ecological community survey data. Ecological Informatics 64, 101374. https://doi.org/10.1016/j.ecoinf.2021.101374
O’Donnell, M.S., Edmunds, D.R., Aldridge, C.L., Heinrichs, J.A., Monroe, A.P., Coates, P.S., Prochazka, B.G., Hanser, S.E., Wiechman, L.A., Christiansen, T.J., Cook, A.A., Espinosa, S.P., Foster, L.J., Griffin, K.A., Kolar, J.L., Miller, K.S., Moser, A.M., Remington, T.E., Runia, T.J., Schreiber, L.A., Schroeder, M.A., Stiver, S.J., Whitford, N.I., Wightman, C.S., 2021. Synthesizing and analyzing long-term monitoring data: A greater sage-grouse case study. Ecological Informatics 63, 101327. https://doi.org/10.1016/j.ecoinf.2021.101327
Vanderbilt, K., Gries, C., 2021. Integrating long-tail data: How far are we? Ecological Informatics 64, 101372. https://doi.org/10.1016/j.ecoinf.2021.101372