Creating Metadata for a Data Package

Metadata describe the structure and context of other data. They are vital to the discovery and reuse of data, and are a required element of a data package.

This document introduces the concept of metadata and the Ecological Metadata Language (EML) format, used by the EDI Data Repository. It also details the tools available for creating and editing EML, and provides a brief introduction to best practices for creating EML.

What is metadata?

Generally, metadata for observational data focuses on:

  • WHAT is the content of the data?
  • WHO collected the data?
  • WHEN were the data collected?
  • WHERE were they collected?
  • HOW were the data collected?

It is best to begin compiling metadata at the start of the research life cycle and to have complete metadata for each dataset by the end of it. Information meant to be retained in metadata degrades quickly if it is not recorded.

Figure and caption from Michener et al. 1997[2]: Example of the normal degradation in information content associated with data and metadata over time ("information entropy"). Accidents or changes in technology (dashed line) may eliminate access to remaining raw data and metadata at any time.

The Ecological Metadata Language (EML)

The Ecological Metadata Language (EML) is an XML metadata standard: a metadata schema that is maintained by a group of interested parties and optimized for the ecological and environmental sciences. The EML specification defines the content ("elements"), characteristics ("attributes"), and relationships ("hierarchy") that compose an EML file.[1]

Some key elements of an EML file are presented here, organized by the five questions listed above:

WHAT is the content of the data?

In the EML document, the <title>, <abstract>, and <keywordSet> elements are used to describe the content of a dataset. Further, <attribute> and <unit> elements are used in conjunction with <dataTable>, <spatialVector>, and <otherEntity> elements to describe specific data objects in detail.
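As a minimal sketch of how these dataset-level elements nest, the following Python uses only the standard library's xml.etree.ElementTree. The function name and all sample values are hypothetical, keywords are written as <keyword> children of a <keywordSet>, and the output is an illustrative fragment rather than a complete, schema-valid EML file.

```python
import xml.etree.ElementTree as ET


def build_what_elements(title, abstract_text, keywords):
    """Return a <dataset> fragment holding title, abstract, and keywords."""
    dataset = ET.Element("dataset")
    ET.SubElement(dataset, "title").text = title
    # The abstract text is wrapped in a <para> child.
    abstract = ET.SubElement(dataset, "abstract")
    ET.SubElement(abstract, "para").text = abstract_text
    # Each keyword becomes its own <keyword> element inside <keywordSet>.
    keyword_set = ET.SubElement(dataset, "keywordSet")
    for word in keywords:
        ET.SubElement(keyword_set, "keyword").text = word
    return dataset


# Hypothetical values, for illustration only.
dataset = build_what_elements(
    "Hypothetical lake temperature profiles, 2015-2020",
    "Weekly water temperature profiles from an example lake.",
    ["temperature", "lakes"],
)
print(ET.tostring(dataset, encoding="unicode"))
```

In practice, ezEML or EMLassemblyline would generate these elements; the sketch only shows where the content ends up.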

WHO collected the data?

The <creator> element describes the personnel who contributed intellectually to the creation of the data package. The <associatedParty> element can be used for personnel who made some contribution but will not receive attribution in a citation (e.g., field crew, lab technicians, temporary help).
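The distinction between the two elements can be sketched in Python as follows. The helper add_party and the names used are hypothetical; the nesting (<individualName> with <givenName> and <surName>, plus a <role> on <associatedParty>) follows the EML schema as far as this fragment goes, but the result is not a complete EML document.

```python
import xml.etree.ElementTree as ET


def add_party(dataset, tag, given_name, surname, role=None):
    """Append a <creator> or <associatedParty> element to dataset."""
    party = ET.SubElement(dataset, tag)
    name = ET.SubElement(party, "individualName")
    ET.SubElement(name, "givenName").text = given_name
    ET.SubElement(name, "surName").text = surname
    if role is not None:
        # In EML, <role> belongs to <associatedParty>; pass it only for that tag.
        ET.SubElement(party, "role").text = role
    return party


# Hypothetical people, for illustration only.
dataset = ET.Element("dataset")
add_party(dataset, "creator", "Ada", "Example")
add_party(dataset, "associatedParty", "Sam", "Sample", role="field technician")
print(ET.tostring(dataset, encoding="unicode"))
```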

WHEN were the data collected?

The <temporalCoverage> element stores the start and end dates of the period covered by the data.
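A date range can be sketched as below, assuming the common case of a single begin and end date written as ISO 8601 calendar dates. The function name and dates are hypothetical, and the ordering check is an added convenience rather than something the schema itself enforces.

```python
import xml.etree.ElementTree as ET
from datetime import date


def temporal_coverage(begin, end):
    """Return a <temporalCoverage> element for a begin/end date range."""
    if end < begin:
        raise ValueError("end date precedes begin date")
    coverage = ET.Element("temporalCoverage")
    date_range = ET.SubElement(coverage, "rangeOfDates")
    for tag, value in (("beginDate", begin), ("endDate", end)):
        boundary = ET.SubElement(date_range, tag)
        # isoformat() yields YYYY-MM-DD.
        ET.SubElement(boundary, "calendarDate").text = value.isoformat()
    return coverage


# Hypothetical sampling period, for illustration only.
coverage = temporal_coverage(date(2015, 5, 1), date(2020, 9, 30))
print(ET.tostring(coverage, encoding="unicode"))
```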

WHERE were the data collected?

The <geographicCoverage> element is used to explicitly describe the location where the research occurred. This element allows for the verbal description of locations as well as bounding coordinates.
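The pairing of a verbal description with bounding coordinates might look like the following sketch. The function name, site description, and coordinates are hypothetical; coordinates are assumed to be decimal degrees, with west and south expressed as negative values where applicable.

```python
import xml.etree.ElementTree as ET


def geographic_coverage(description, west, east, north, south):
    """Return a <geographicCoverage> element with a description and bounding box."""
    if not (-180 <= west <= 180 and -180 <= east <= 180):
        raise ValueError("longitude must be between -180 and 180")
    if not (-90 <= south <= north <= 90):
        raise ValueError("latitude must be between -90 and 90, with south <= north")
    coverage = ET.Element("geographicCoverage")
    ET.SubElement(coverage, "geographicDescription").text = description
    box = ET.SubElement(coverage, "boundingCoordinates")
    for tag, value in (
        ("westBoundingCoordinate", west),
        ("eastBoundingCoordinate", east),
        ("northBoundingCoordinate", north),
        ("southBoundingCoordinate", south),
    ):
        ET.SubElement(box, tag).text = str(value)
    return coverage


# Hypothetical site and coordinates, for illustration only.
site = geographic_coverage(
    "Hypothetical study lake and its shoreline",
    west=-100.5, east=-100.25, north=45.5, south=45.25,
)
print(ET.tostring(site, encoding="unicode"))
```

For a single sampling point, the same value can be given for west and east, and for north and south.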

HOW were the data collected?

The <methods> element should be used to describe how data were collected. The information on "how" or "provenance" may be supplemented with data processing scripts and documentation of other data used.

EML best practices

For more on EML best practices see the EDI Data Package Best Practices.

Creating EML

Since the EML standard was designed to handle an enormous variety of data scenarios, EML is complex and the learning curve for creating it can be steep. EDI develops and maintains two tools, ezEML and EMLassemblyline, to make this process easier and allow data providers to focus on the content of their metadata. Each tool serves a slightly different use case, and data providers can transfer metadata between tools to meet their needs. Both ezEML and EMLassemblyline allow users to work on metadata incrementally and return to a saved state at a later time.

See the resources section at the bottom of this page for more tools to create EML metadata.

ezEML

ezEML is a form-based online application designed to streamline the creation of EML-formatted metadata. Despite the complexity of EML, many data scenarios require only a relatively small subset of fields to be filled out. In such scenarios, ezEML can greatly simplify the process of creating EML, especially for users who are new to EML or use it only infrequently.

ezEML can be used as a "wizard" leading the user through EML document creation step by step, or it can be used in a more user-directed fashion. Among other things, ezEML supports:

There are a variety of options to help new users learn ezEML:

EMLassemblyline

EMLassemblyline is an R package for automating the creation of EML metadata within programmatic workflows. EMLassemblyline supports the same set of metadata elements as ezEML and is extensible through the EML R package. While it is optimized for automation, it also works well for creating a single EML metadata file.

Among other things, EMLassemblyline supports:

  • Importing an existing EML file to work with it in EMLassemblyline
  • Checking the EML for correctness and completeness

Learn more about using EMLassemblyline on the project website.

Editing EML

EML documents can be updated or edited using the software with which they were created (ezEML, EMLassemblyline). While this is likely the easiest way to make most edits to an EML file, it is sometimes necessary to edit EML manually in order to incorporate elements outside the applications' scopes. Manual editing can also be faster for minor changes. While any text editor can be used to edit an EML file, XML-specific editors have added capabilities that streamline the editing process and support schema validation.
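After a manual edit, a quick sanity check can catch broken markup before the file goes back into a tool. The sketch below is a hypothetical helper, not part of ezEML or EMLassemblyline: it confirms that the text is well-formed XML and reports which of a chosen set of elements are absent. It is not schema validation, which requires the EML schema files and a validating parser or XML editor.

```python
import xml.etree.ElementTree as ET

# Elements discussed in this document; adjust the list as needed.
EXPECTED = ("title", "creator", "temporalCoverage", "geographicCoverage", "methods")


def missing_elements(eml_text, expected=EXPECTED):
    """Parse EML text and return expected element names that never appear."""
    root = ET.fromstring(eml_text)  # raises ET.ParseError if not well-formed
    # Drop any "{namespace}" prefix so lookups use local element names.
    present = {element.tag.rsplit("}", 1)[-1] for element in root.iter()}
    return [name for name in expected if name not in present]


# Hypothetical, deliberately incomplete document for illustration.
sample = """<eml:eml xmlns:eml="https://eml.ecoinformatics.org/eml-2.2.0">
  <dataset>
    <title>Hypothetical dataset</title>
    <creator><organizationName>Example Lab</organizationName></creator>
  </dataset>
</eml:eml>"""

print(missing_elements(sample))
```

A mismatched or unclosed tag in the edited file surfaces as an ET.ParseError that includes the line and column of the problem.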

oXygen XML Editor

The oXygen XML Editor is a commercial product that provides a comprehensive suite of XML authoring and development features:

jEdit

jEdit is a free, open-source editor that provides the basics most users need:

  • Good general text editor
  • XML plug-in
  • Schema validation
  • Auto-complete

Resources

Additional resources for creating and managing EML metadata:

  • EML - An R package for constructing EML. EMLassemblyline is a wrapper to this package.
  • metapype-eml - A Python package for constructing EML. ezEML is built on this package.
  • LTER Core-Metabase - A PostgreSQL RDB model for research groups managing large volumes of EML metadata.

References

[1] Matthew B. Jones, Margaret O'Brien, Bryce Mecum, Carl Boettiger, Mark Schildhauer, Mitchell Maier, Timothy Whiteaker, Stevan Earl, Steven Chong. 2019. Ecological Metadata Language version 2.2.0. KNB Data Repository. doi:10.5063/F11834T2

[2] Michener, W.K., Brunt, J.W., Helly, J.J., Kirchner, T.B. and Stafford, S.G. 1997. Nongeospatial metadata for the ecological sciences. Ecological Applications, 7: 330-342. https://doi.org/10.1890/1051-0761(1997)007[0330:NMFTES]2.0.CO;2