A Proposed Enhancement to Data Package Provenance

June 28, 2022

Kyle Zollo-Venecek


Create more descriptive and machine-actionable provenance linkages between data packages by leveraging the EML schema's support for semantic annotations in conjunction with the currently used methods for documenting provenance.

Provenance metadata is used to track the origin or history of data. In EML, representing the provenance for derived datasets (i.e. datasets compiled from other, pre-established datasets) involves creating a list of the source datasets.

data package pointing to two data packages in brackets

Model of current provenance methods: a data package can broadly link back to other data packages from which it is derived

The individual source datasets are represented within the EML document of the derived dataset as a series of <dataSource> elements embedded within <methodStep> elements, that are all within the <methods> element. A <distribution> element is used to create a direct link to the source dataset. Source dataset creator and contacts can also be included in their respective elements:

      <para>This method step describes provenance-based metadata as specified in the LTER EML Best Practices.</para>
      <para>This provenance metadata does not contain entity specific information.</para>
      <title>LAGOS-NE-LIMNO v1.087.1: A module for LAGOS-NE, a multi-scaled geospatial and temporal database of lake ecological context and water quality for thousands of U.S. Lakes: 1925-2013</title>
          <givenName>Patricia A</givenName>
        <organizationName>Michigan State University</organizationName>
        <userId directory="">0000-0003-1668-9271</userId>
          <onlineDescription>This online link references an EML document that describes data used in the creation of this derivative data package.</onlineDescription>
          <url function="information"></url>

While this generally answers the question of what a derived dataset is compiled from, it lacks precision and provides no information regarding how source datasets are used to create a derived dataset. While quite a bit of detail could go into an explanation of precisely how source datasets are used to create a derived dataset, we propose minor supplements to the standing methods for recording provenance that allow links between individual data entities to be drawn. By linking specific data entities between datasets, the origin of derived datasets can be made more explicit (Table A is derived from Table C and D) and replicated data can be accurately documented (Table B is the same as Table F).

advanced provenance model

Model of proposed provenance methods: preserves the broad linkages back to other data packages, but also allows for more specific entity-entity relationships

Within the current EML 2.2.0 schema, these links can be represented by taking advantage of <annotation> elements to form Semantic Triples. Annotations in EML (and in general) are used to represent relations between a Subject and an Object using a well-defined Predicate (property). The Subject-Predicate-Object entity is known as a Semantic Triple. Every <annotation> element must contain a <propertyURI> and <valueURI> element, representing a Predicate (property) and an Object, respectively. The Subject of the triple is determined by the placement of the <annotation> element. There are three places within the EML document where an <annotation> element can be while still referring to the same object: (1) as a child of the Subject, (2) in the <annotations> element, (3) in the <additionalMetadata> element.

<dataTable id="table1">
    <propertyURI label="isBasedOn"></propertyURI>
    <valueURI label="Soil chemistry and nutrients table 1991"></valueURI>

  <annotation references="table1">
    <propertyURI label="isBasedOn"></propertyURI>
    <valueURI label="Soil chemistry and nutrients table 1991"></valueURI>

    <propertyURI label="isBasedOn"></propertyURI>
    <valueURI label="Soil chemistry and nutrients table 1991"></valueURI>

The <annotation> element references the Subject (data entity) differently depending on where it is located in the document. If the <annotation> element is the direct child of the <dataTable>, than the reference is implicit. If the element is within the <annotations> or <additionalMetadata> element, the relationship must be made explicit by referencing the Subject's id in the references attribute or as the content of the <describes> element, respectively.


We envision these enhancements to the provenance linking methods to allow for better records of a data package's history. While we currently do not have any tools to support the implementation of this new method of provenance, we welcome all to test it out by manually editing EML metadata. Your feedback, general or specific, would be very appreciated. Please reach out to us at