Provenance is the origin or history of data, and it is an important piece of metadata for making research transparent and reproducible. Several metadata elements are involved in documenting data provenance. First and foremost is a detailed description of the methods by which the data were collected and created. Data processing scripts (R, Python, etc.) provide very detailed provenance information and can be published in a data package. For the special case of a derived dataset, i.e. data that are compiled from one or more other 'original' datasets, a detailed list of those original datasets should be included. Listing these datasets gives the original data creators proper credit, even when more datasets are used than can reasonably be cited in a resulting paper.
Creating provenance metadata
Provenance metadata for data sources both internal and external to the EDI Data Repository can be created using ezEML and EMLassemblyline. The EDI Data Portal and EDIutils only support creating provenance metadata for data sources internal to the EDI Repository.
To create provenance metadata using ezEML:
- From the Methods tab, click the Add Method Step button.
- In the Data Sources textbox, enter as much information about the source dataset(s) as possible. At minimum, provide the DOI or a URL linking to the data source. The name and email for the data creator and contact are valuable information that should be provided, if available.
- Click Save and Continue.
Provenance metadata submitted via ezEML is converted by data curators to an EML-formatted provenance <methodStep> before the data package is published in the EDI Data Repository. Data curators processing an ezEML submission should see the EDI Data Portal section below for how to obtain the EML provenance format.
To create provenance metadata using EMLassemblyline:
- Run the template_provenance() function to create an empty provenance template.
- For data sources originating from the EDI Data Repository, populate the template dataPackageID field with the EDI data package identifier and specify "EDI" in the systemID field. Use the other fields of this template when creating provenance for data sources external to the EDI Repository.
- Run the make_eml() function to add the provenance metadata to the EML for the derived data.
EDI Data Portal
To create provenance metadata from the EDI Data Portal for an original dataset that already resides in the EDI Repository:
- Navigate to the Provenance section at the bottom of a data package landing page. This section displays provenance information and includes a link to generate provenance metadata for the data package.
- Follow the link to the Provenance Generator. The Provenance Metadata XML tab contains the text of the <methodStep> element. Copy the entire <methodStep> element.
- Open the EML for the derived data package in an XML editor and navigate to the <methods> element.
- Paste the copied provenance <methodStep> element after the last existing <methodStep> element. Repeat for all data sources.
See Editing EML for more on XML editors.
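For those who prefer to script the paste step rather than use an XML editor, the following is a minimal Python sketch using only the standard library. The EML document and the provenance step shown here are hypothetical, heavily trimmed stand-ins (namespaces and most required elements are omitted), and the helper name append_method_step is an illustration, not part of any EDI tool. A real EML file declares namespaces on its root element, which would need extra handling when serializing.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed EML for a derived data package (namespaces omitted).
DERIVED_EML = """<eml>
  <dataset>
    <title>Derived dataset</title>
    <methods>
      <methodStep>
        <description><para>Existing processing step.</para></description>
      </methodStep>
    </methods>
  </dataset>
</eml>"""

# Hypothetical stand-in for the <methodStep> copied from the Provenance Generator.
PROVENANCE_STEP = """<methodStep>
  <description><para>Provenance for one source data package.</para></description>
  <dataSource>
    <title>Original source dataset</title>
  </dataSource>
</methodStep>"""


def append_method_step(eml_text, step_text):
    """Insert a <methodStep> after the last existing <methodStep> in <methods>."""
    root = ET.fromstring(eml_text)
    methods = root.find("./dataset/methods")
    if methods is None:
        raise ValueError("no <methods> element found under <dataset>")
    # Locate existing steps so the new one lands after them, not after
    # any other child elements that may follow the steps in <methods>.
    positions = [i for i, child in enumerate(methods) if child.tag == "methodStep"]
    insert_at = positions[-1] + 1 if positions else 0
    methods.insert(insert_at, ET.fromstring(step_text))
    return ET.tostring(root, encoding="unicode")


updated = append_method_step(DERIVED_EML, PROVENANCE_STEP)
```

Calling append_method_step once per data source mirrors the "repeat for all data sources" instruction; each call adds one more step after the existing ones.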
To create provenance metadata from the EDIutils R Package:
- Run the get_provenance_metadata() function with the corresponding source data package identifier.
- Add the returned <methodStep> element into an EML R object of a derived data package or write it to file for other use cases.
For a language-agnostic solution, see the REST API documentation for Get Provenance Metadata.
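As an illustration of the REST route, here is a small Python sketch that builds a request URL from a data package identifier. The base URL and path pattern are assumptions about how the Get Provenance Metadata method is addressed and should be confirmed against the REST API documentation. The function names and the package identifier "edi.100.1" are placeholders used only for this example.

```python
import urllib.request

# Assumed base URL for the Get Provenance Metadata method;
# verify against the REST API documentation before relying on it.
BASE_URL = "https://pasta.lternet.edu/package/provenance/eml"


def provenance_url(package_id, base_url=BASE_URL):
    """Build a request URL from an ID of the form scope.identifier.revision."""
    parts = package_id.split(".")
    if len(parts) != 3 or not parts[0] or not all(p.isdigit() for p in parts[1:]):
        raise ValueError(f"expected 'scope.identifier.revision', got {package_id!r}")
    scope, identifier, revision = parts
    return f"{base_url}/{scope}/{identifier}/{revision}"


def fetch_provenance(package_id, timeout=30):
    """Request the provenance <methodStep> XML as text (requires network access)."""
    with urllib.request.urlopen(provenance_url(package_id), timeout=timeout) as response:
        return response.read().decode("utf-8")


# Placeholder identifier, used only to show the URL shape.
print(provenance_url("edi.100.1"))
```

The returned text, if the request succeeds, would be a <methodStep> element that can be added to the derived data package's EML in the same way as one copied from the Provenance Generator.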
- End-to-End Provenance - Software tools to collect and use provenance information.