"Robot" Based Events are No Longer Recorded in the Audit Event Log
December 6, 2022
As of 23 November 2022, the EDI data repository no longer registers "robot"-based events in the Audit "event log" database table. This planned omission aims to reduce the growth of the "event log" table, recover disk storage (from about 30 million records), and improve query performance. Not registering robot events will also affect the tallies counted in the "resource reads" table. From an end-user’s perspective, the "total reads" returned in the Audit Manager's docid and packageid REST API read methods use the data stored in the "resource reads" table to return the number of "read" (download) events for either a data package series or a specific data package revision. The "total reads" value includes the sum of both robot and non-robot read events. With this change, the "total reads" value for a data package will now only increase if a non-robot event occurs (along with the "non-robot reads" value). We do not plan to remove the "total reads" value from this REST API method at this time due to concerns about breaking backward compatibility. However, we believe that the more important value is that of the "non-robot reads" - the value of user downloads.
Alignment of the LTER Controlled Vocabulary and Environment Ontology
November 17, 2022
In a step towards improved data search and discovery, EDI has mapped the US LTER Controlled Vocabulary to the Environment Ontology (ENVO). This alignment enables LTER keywords to be translated into equivalent ENVO concepts of greater semantic expressivity. These improved semantics can help the EDI search engine better understand the meaning and intent of user queries and suggest related data of potential interest (i.e., semantic search). The complete mapping is available at the LTER Vocabulary GitHub.
EDI at the LTER All Scientists' Meeting (19-23 September 2022)
September 1, 2022
We have organized, and will lead and present, information management and science sessions at the 2022 LTER All Scientists’ Meeting. Topics include new developments at EDI, synthesis science through harmonized data, annotating ecological data for reuse, and more. For more information, please check the ASM meeting schedule.
New ezEML feature: Check Data Tables
August 31, 2022
We’ve added an important new ezEML feature: Check Data Tables. ezEML has long had the Check Metadata feature that helps you determine if you’ve completed all of the required and recommended EML metadata for a data package. Check Data Tables drills down further, checking not just the metadata specifications for data tables, but the data tables’ contents, as well. The "Check Data Tables" feature examines your data tables’ CSV files to check if the data tables’ contents match up with their descriptions in the metadata (e.g., do they have the expected numbers and types of columns, do the entries in the columns have the expected type and form, are the categorical codes correct, etc.). All errors are displayed, letting you fix any problems before you submit your data package to the EDI data repository.
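The kinds of checks described above can be illustrated with a minimal Python sketch. This is not ezEML's implementation: the function name `check_table`, the three column kinds ("text", "numeric", "categorical"), and the error messages are assumptions made up for this example. It compares a CSV table against a list of expected columns and reports every mismatch it finds.

```python
import csv
import io


def check_table(csv_text, expected_columns, categorical_codes=None):
    """Compare CSV contents with a metadata-style column description.

    expected_columns: list of (name, kind) pairs, where kind is one of
    "text", "numeric", or "categorical" (hypothetical kinds for this sketch).
    categorical_codes: dict mapping a column name to its set of allowed codes.
    Returns a list of human-readable error strings (empty if all checks pass).
    """
    categorical_codes = categorical_codes or {}
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return ["table is empty"]

    errors = []
    header, data = rows[0], rows[1:]
    expected_names = [name for name, _ in expected_columns]
    if header != expected_names:
        errors.append(f"header {header} does not match expected columns {expected_names}")

    # Line 1 is the header, so the first data row is line 2.
    for line_no, row in enumerate(data, start=2):
        if len(row) != len(expected_columns):
            errors.append(
                f"row {line_no}: expected {len(expected_columns)} values, found {len(row)}"
            )
            continue
        for value, (name, kind) in zip(row, expected_columns):
            if kind == "numeric":
                try:
                    float(value)
                except ValueError:
                    errors.append(f"row {line_no}, column {name}: {value!r} is not numeric")
            elif kind == "categorical":
                if value not in categorical_codes.get(name, set()):
                    errors.append(f"row {line_no}, column {name}: unknown code {value!r}")
    return errors


if __name__ == "__main__":
    table = "site,depth,habitat\nA,1.5,reef\nB,deep,lagoon\n"
    columns = [("site", "text"), ("depth", "numeric"), ("habitat", "categorical")]
    for problem in check_table(table, columns, {"habitat": {"reef", "lagoon"}}):
        print(problem)
```

Collecting all errors instead of stopping at the first one mirrors the behavior described above, where every problem is shown so it can be fixed before submission. A real checker would also need to handle missing-value codes, which this sketch treats as ordinary values.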
Webinar "EDI Summer Fellows present", 16 August 2022
August 11, 2022
Join us for 5-minute eLightning talks by our Summer Fellows that highlight their data publishing experience for specific host sites, spanning diverse ecosystems across the country, from Maine’s Mount Desert Island to Puerto Rico’s Luquillo Mountains to the Palmyra Atoll in the central Pacific.
EDI Data Repository scheduled maintenance
August 9, 2022
The EDI Data Repository will undergo scheduled maintenance on Wednesday, 17 August and Thursday, 18 August that will result in our systems being unavailable. The University of New Mexico Center for Advanced Research Computing, where our production infrastructure is managed, has deemed this maintenance critical and necessary to continue with uninterrupted service into the future. We will do our best to maintain access to data in a "read-only" state (no data package uploads, no data package evaluations, and no archive downloads during this period). We cannot guarantee seamless access to data until we are back on our production systems on Friday, 19 August. We plan to perform regular patching on the evening of Tuesday, 16 August and transition into a "read-only" state by Wednesday morning. If all goes as expected, EDI will be back to using our production infrastructure by Friday morning. We will keep you posted on updates and changes to this schedule as we learn more. We apologize for any inconvenience this event will cause.
DeX in production
July 6, 2022
Our new data exploration tool called “DeX” for use with published data is in production now. It provides a text and graphical summary of each variable/column in a data table within a dataset, a data sub-setting tool, and simple graphing capabilities. To access the tool, go to https://edirepository.org “Find Data” and “EDI Dataset Search.” Once you have located a data package, hitting the “Explore Data” button next to each of the data tables will show you the profile/summary for that table. The menu on the top of the page provides access to the subset and plot functions. For more information see also "A Quick Overview of EDI’s Data Explorer (DeX)".
New version of Data Package Audit Report generator released
July 5, 2022
The EDI software development team has released a new version of the Data Package Audit Report generator that includes a function to download results as a Comma-Separated Values (CSV) file. The downloaded file contains more detailed records than what is displayed in the website view. The CSV file download begins immediately instead of being delayed while the complete result set is generated on the server. This change helps prevent timeouts from the server if the result set is large and takes excessive time to create. Download times may still be on the order of minutes, but we are confident that the CSV file will be complete. However, one change to be aware of is that the set of audit records in the CSV file is no longer guaranteed to be ordered by date and time. This modification to the search query now ensures that the download stream will begin quickly - a key factor in keeping the download connection active and viable. On another note regarding the audit records, erroneous or misleading records that were generated by known web-crawler "robots" will no longer be included in the audit report display or download file. We will, however, continue to record such events until we decide whether all "robot" records should be removed from the EDI audit system.
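Because the downloaded records are no longer guaranteed to be in date and time order, anyone who needs chronological order can sort the file locally. The following is a minimal sketch only: the column name `entryTime` and the ISO 8601 timestamp format are assumptions for illustration, so check the header row of the actual CSV file and adjust accordingly.

```python
import csv
import io
from datetime import datetime


def sort_audit_records(csv_text, time_column="entryTime"):
    """Return audit records (as dicts) sorted by their timestamp column.

    time_column is a hypothetical column name; timestamps are assumed to be
    ISO 8601 strings such as "2022-07-01T09:30:00".
    """
    records = list(csv.DictReader(io.StringIO(csv_text)))
    records.sort(key=lambda record: datetime.fromisoformat(record[time_column]))
    return records


if __name__ == "__main__":
    # Invented sample rows, not real audit data.
    sample = (
        "entryTime,resourceId\n"
        "2022-07-03T10:00:00,c\n"
        "2022-07-01T09:30:00,a\n"
        "2022-07-02T23:59:59,b\n"
    )
    for record in sort_audit_records(sample):
        print(record["entryTime"], record["resourceId"])
```

Sorting on the client side keeps the server free to start streaming immediately, which is the trade-off described above.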
New Resource for Information Managers: Adding Physical Metadata
June 3, 2022
Physical metadata, such as file size, MD5 checksum, and number of rows in a table, are important pieces of information for verifying the integrity of files after uploads and downloads. When a resource is uploaded to ezEML or processed by EMLassemblyline, this information is automatically calculated. However, neither application can obtain this information from a file that is not accessible. The responsibility is instead placed on the data provider to determine and manually enter this physical metadata.
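For a data provider who needs to determine these values by hand, a short Python sketch using only the standard library is shown below. The function name, the dictionary keys, and the assumption that the table is a UTF-8 CSV file with a single header row are choices made for this example, not part of ezEML or EMLassemblyline.

```python
import csv
import hashlib
import os


def physical_metadata(path, has_header=True):
    """Compute file size (bytes), MD5 checksum, and data-row count for a CSV file."""
    md5 = hashlib.md5()
    # Read in chunks so large files do not have to fit in memory.
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            md5.update(chunk)

    # Count rows with the csv module so quoted line breaks are handled correctly.
    with open(path, newline="", encoding="utf-8") as f:
        rows = sum(1 for _ in csv.reader(f))
    if has_header and rows > 0:
        rows -= 1

    return {
        "size_bytes": os.path.getsize(path),
        "md5": md5.hexdigest(),
        "number_of_records": rows,
    }
```

Calling `physical_metadata("my_table.csv")` on a local copy of the file returns the three values, which can then be entered into the metadata manually.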
Audit Service Regression Causes EDI Data Portal System Errors
May 18, 2022
Starting on Tuesday 10 May 2022, the PASTA Audit Service began experiencing unexpected system shutdowns that resulted in user errors when attempting to access data packages from the EDI Data Portal. These events required the Audit Service to be restarted multiple times during the remainder of the week and through the weekend. The cause of these shutdowns was determined to be the introduction of a new software feature into the Audit Service, which caused a very large amount of XML data to be generated. This resulted in the Audit Service running out of system memory and ultimately the loss of the service. EDI software developers diagnosed the issue on Monday 16 May and deployed a patch on Monday evening. Unfortunately, regression testing did not catch this problem because of differences in the volume of the Audit Service database between our production system and development system. We regret any inconvenience this may have caused and thank all of our users for their patience in this matter.
2022 Data Management Fellowship Program
May 11, 2022
We have awarded 15 fellowships for our ecological data management training program for this summer. The Fellows will receive training in ecological data management and gain hands-on experience through participation in data preparation and publishing with scientists and information managers from specific host research projects. See below for a list of host projects and mentors.
EDIutils has moved to rOpenSci
April 23, 2022
We are happy to announce the EDIutils R package (an API Client for the Environmental Data Initiative Repository) has been accepted by rOpenSci. As a result, the project GitHub has moved and installation has changed to remotes::install_github("ropensci/EDIutils"). We will be submitting to CRAN in the coming weeks. Stay tuned!
Data Repositories Enriching the Global Research Infrastructure
April 20, 2022
Domain repositories have long provided a suite of services to the communities built around them including metadata management and data discovery tools, data access and preservation, and identification of resources created by and of interest to the community. As a result, domain repositories are central to important and thriving research communities. Recently a global research infrastructure for identifying and tracking many kinds of research objects has emerged. This includes Crossref, originally for journal articles and books, DataCite, originally for datasets but expanding to other research objects, as well as ORCID for identification of people and ROR for organizations. Crossref and DataCite initially focused on the creation of research object identifiers (DOIs). As researchers start to use these identifiers, it is clear that the connections enabled by these identifiers add considerable value. In order to enable these connections the growing, global research infrastructure needs content beyond minimal identification and citation metadata. Domain repositories are uniquely situated, with the deep knowledge of their communities, to extend their services by providing connections and content to deliver additional metadata to enrich the global research infrastructure. This talk by Ted Habermann, of Metadata Game Changers, shares several recent examples that demonstrate the power of enriching the global research infrastructure.
March 10, 2022
Research sites (e.g., LTER sites) and teams of researchers who are using ezEML to capture metadata may find that certain content is used repeatedly across a number of documents. Examples of such repeated content can include Creators, Contacts, Keywords, Intellectual Rights, Geographic Coverage, Project, etc. Users can avoid the tedious task of re-entering this information for each new dataset by creating and publishing one or more “templates” that are prepopulated with this standard content. Since templates exist outside of any individual user’s ezEML account, they are accessible to everyone. Everyone who uses a template will get the current version, which helps alleviate problems arising from different versions residing in different users’ accounts.
A Quick Overview of EDI’s Data Explorer (DeX)
January 28, 2022
The EDI software team is excited to announce DeX, a tool for exploring and subsetting tabular data, which is now in beta testing on the EDI staging Data Portal (https://portal-s.edirepository.org/nis). DeX provides three views into tabular data found in the EDI Data Repository: 1) a statistical profiler that analyzes the data table and displays detailed information about each attribute; 2) a filter and subsetting application that allows you to download the subsetted data, along with a new EML metadata document describing the subset; and 3) a simple-to-use scatter and line plotting application that gives you a visual glimpse into data trends. DeX is currently available on either of our development or staging Data Portals and works with CSV-based data tables (soon to work with a wider set of tabular formats). To see DeX in action, look for a data package in the staging Data Portal containing a CSV data file and click on the “Data Explorer – experimental” link at the end of the data entity record information (see below):
Normalization of Creator Names in EDI’s Data Portal
November 2, 2021
The Advanced Search feature of EDI’s Data Portal lets you select a dataset Creator name from a drop-down list of all dataset creators in our repository. The search then displays all the datasets that have that name as one of their creators.
Updates to User-contributed Journal Citation Interface on the EDI Data Portal
November 2, 2021
The Environmental Data Initiative (EDI) has recently updated its user-contributed journal citation interface on its Data Portal to include more granular information regarding the type of citation being submitted. The addition of the Relation Type form field allows you to select the relationship between the data package and the journal manuscript where the data package is mentioned using one of three relationship types: “IsCitedBy” – this data package is formally cited in the manuscript, “IsDescribedBy” – this data package is explicitly described within the manuscript, or “IsReferencedBy” – this data package is implicitly described within the manuscript. This information is conveyed to DataCite through an update of the Digital Object Identifier (DOI) metadata and provides greater exposure to the data package through DataCite’s event data and Crossref, an official DOI registrar of the International DOI Foundation for academic journals. The EDI Data Portal allows any user with an EDI-provisioned account to add a journal citation to any data package, regardless of data package ownership, thereby greatly increasing related information about the data package – a win-win for the entire community!
Harmonizing Ecological Community Survey Data for Reuse: An Update
September 6, 2021
The idea of harmonizing data is not new, and for some research domains it has been successful. Our body of long-term observations of organisms in ecological communities is growing, and many datasets have already been used in synthesis and meta-analyses – but only after considerable effort to bring them into alignment. A goal of EDI has been to develop recommendations for data harmonization, and to convert “raw data” in specific domains into a common data model to prepare them for analysis and accelerate synthesis or meta-analyses.
Integrating Long-Tail Data: How Far Are We?
September 5, 2021
EDI’s Kristin Vanderbilt and Corinna Gries co-edited a Special Issue of Ecological Informatics “Integrating Long-Tail Data: How Far Are We?” that explores how far the informatics community has come toward lessening the time researchers must spend integrating small, heterogeneous datasets prior to analyzing them.
Updating schema.org Metadata for Data Packages in the EDI Data Portal to Provide Rich Semantic Information That can be Utilized by Search Engines and Google Scholar
May 3, 2021
The EDI technical team is now updating the schema.org metadata that accompanies every data package landing page on the EDI Data Portal with new recommendations from the ESIP SOSO project (https://github.com/ESIPFed/science-on-schema.org). EDI initially released schema.org metadata for each data package in Fall 2018. The dataset schema.org metadata is encoded as a JSON-LD data structure that is embedded within script tags on the data package metadata landing page. Along with the sitemaps.org metadata, which acts as an index of site content for search engine optimization (SEO), the schema.org metadata provides rich semantic information about the data package that can be utilized by search engines (e.g., Google, Microsoft, Yandex, and even domain-specific tools like EarthCube’s Gleaner and DataONE schema.org indexers) and associated applications. For example, data packages that are archived in the EDI data repository are discoverable through Google’s Dataset Search interface (https://bit.ly/3nDhT8j) because of the detailed information provided to Google’s search engine indexer via the schema.org metadata.
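To illustrate how an indexer or other client might read JSON-LD that is embedded within script tags, here is a minimal Python sketch using only the standard library. It relies on the standard `application/ld+json` script type for JSON-LD. The class and function names are hypothetical, and the sample HTML is invented for this example; the actual landing-page markup and metadata fields may differ.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.documents.append(json.loads("".join(self._buffer)))
            self._in_jsonld = False


def extract_jsonld(html):
    """Return a list of JSON-LD documents found in an HTML string."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.documents


if __name__ == "__main__":
    # Invented sample page, not copied from the EDI Data Portal.
    sample_html = """
    <html><head>
    <script type="application/ld+json">
    {"@context": "https://schema.org", "@type": "Dataset", "name": "Example dataset"}
    </script>
    </head><body></body></html>
    """
    for document in extract_jsonld(sample_html):
        print(document["@type"], "-", document["name"])
```

Because the metadata travels inside the page itself, a crawler needs no separate API call to learn what the dataset is about.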
Rendering of Markdown and LaTeX equations in EML
April 30, 2021
The EDI Data Portal now supports the provisional rendering of Markdown and LaTeX equations in most TextType elements of the Ecological Metadata Language (e.g., “abstract”, “intellectualRights”, and the method step “description”). EDI recently updated these two features on the Data Portal’s Data Package Metadata web page through the use of “showdown.js” (https://showdownjs.com/) for Markdown and “MathJax.js” (https://www.mathjax.org/) for LaTeX-formatted math equations. Markdown provides a convenient way to add structural highlights to text elements, including the use of different heading styles, bold and italicized text, bulleted and numbered lists, and much more. The “showdown.js” package supports most of the commonly used GitHub Flavored Markdown (https://github.github.com/gfm/) syntax and is processed by the client’s web browser. For example, the following snippet from the EDI Data Portal shows both a Markdown heading style and a bulleted list from a rendered EML document:
EDI Supports Temporary Data Embargoes Upon Request
March 1, 2021
You may not know that the Environmental Data Initiative provides an embargo service to temporarily block access to data tables (and other types of data) in your EDI uploaded data packages, but it does. We provide this service to satisfy the requirements of many journals that request that data accompanying a manuscript be archived in a recognized data repository and assigned a valid Digital Object Identifier (DOI) before the manuscript is reviewed. It is often the case that the journal or manuscript author prefers that the data be off limits to the general public until the manuscript is fully published. For this reason, EDI will apply a temporary embargo on the data elements of your data package (metadata, however, are not permitted to be embargoed) at your request – all you have to do is ask through our support email address or directly on our Slack channel. We will remove the embargo once you let us know that the manuscript has been published. The best part is that the DOI remains the same before and after the embargo. Because EDI is a strong proponent of publicly accessible data, we will periodically reach out to owners of embargoed data to confirm the continued need for the embargo. Lastly, please let us know if and how we can improve this service.
Journal Citations Associated with EDI Data Packages
May 19, 2020
The EDI technical team recently modified the view of user-contributed journal citations associated with a specific version of a data package so that the citation is now displayed on all versions of the data package, not just the version for which the citation is relevant. This enhancement allows users who browse the data package metadata landing page to see all citations related to the data package series regardless of what version of the data package they are viewing. User-contributed journal citations provide an easy way for the author of a data paper (or others) to increase the impact factor of the data package by directly linking the published manuscript to the supporting data package found in the EDI data repository. In fact, older manuscripts that utilize an EDI data package may be added to the list of journal citations even if the data package DOI was not available at the time of publication. We are currently exploring options for updating the DataCite DOI metadata of the data package with these citations so that downstream services may take advantage of this crowd-sourced information.
The Summer Fellowship Program of the Environmental Data Initiative
March 25, 2020
The Environmental Data Initiative (EDI) assists researchers from field stations, individual laboratories, and research projects of all sizes to archive and publish their environmental data. EDI’s very successful Summer Fellowship Program for Data Management Training is one component of our Outreach and Training program. For the third consecutive year, EDI is reviewing applications from interested undergraduate and graduate students to become an EDI summer fellow. This year we are seeking nine fellows to be trained in the data publishing process and to support nine research sites in their efforts to manage their data. EDI’s aim is to ensure that these young professionals learn state-of-the-art data stewardship practices.
Cite: A Lightweight Citation Service for Data Packages in the EDI Data Repository
February 12, 2020
The EDI technical team has released Cite, a lightweight web service that generates citations for data packages archived in the EDI data repository. Cite is simple to use and requires only the EDI data package identifier appended to the end of the Cite URL: “https://cite.edirepository.org/cite/”. For example, the URL “https://cite.edirepository.org/cite/edi.460.1”, when entered into a web browser, returns an ESIP-stylized citation for that data package.
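The URL pattern described above can also be used from a script. The following Python sketch builds the Cite URL for a package identifier and, optionally, fetches the response. The function names are hypothetical, and treating the response body as UTF-8 text is an assumption for this example; the service's response format should be checked before relying on it.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Base URL of the Cite service, as given in the announcement.
CITE_BASE_URL = "https://cite.edirepository.org/cite/"


def cite_url(package_id):
    """Append a data package identifier (e.g. "edi.460.1") to the Cite base URL."""
    return CITE_BASE_URL + quote(package_id.strip(), safe=".")


def fetch_citation(package_id, timeout=30):
    """Request the citation for a package; assumes a UTF-8 text response."""
    with urlopen(cite_url(package_id), timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Prints the URL only; call fetch_citation() to contact the service.
    print(cite_url("edi.460.1"))
```

Keeping URL construction separate from the network call makes the pattern easy to verify offline and to reuse with any HTTP client.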
Google Scholar Highlights EDI Data Packages as First-order Citations in User Profiles and in Scholarly Articles
February 10, 2020
Data are becoming increasingly citable as first-order objects, including data archived in the EDI repository. One indication is that data package publications are indexed in personal Google Scholar user profiles, along with other scholarly articles; see, for example, the profile of Paul Hanson (Research Professor at the Center for Limnology, University of Wisconsin-Madison).
R functions for interacting with the EDI Repository API
December 12, 2018
The EDI Repository API facilitates automated data processing and publication workflows, thereby enabling reproducible and efficient data package management. Four new R functions have been added to the EDIutils R library supporting package reservation (pkg_reserve_id.R), evaluation (pkg_evaluation.R), upload (pkg_upload.R), and update (pkg_update.R). The full suite of PASTA+ API calls from the R environment will be available soon!
EDI behind the curtains
December 6, 2018
EDI technical staff recently upgraded our virtualization infrastructure to the latest version of VMware’s ESXi software (from version 5 to 6.5). All of EDI’s servers, including those that run the PASTA+ data repository software, operate as virtual clients across six ESXi host systems. These virtual hosts are configured to operate between 6 and 12 clients at one time, with some room left over to shuffle systems and for testing. The ESXi host systems are located on the campus of the University of New Mexico and connect directly to a dedicated 10Gb/s connection using UNM’s Science DMZ research network. Wide-area Internet connectivity to UNM includes 100Gb/s connections to the DOE Energy Sciences Network (ESNet) and the Western Regional Network, both through the Albuquerque Gigapop. EDI’s data storage capacity is currently at 30TB, with an equivalent 30TB mirror storage device for near-time backups and smaller SSD disks that are used for off-site backup purposes. EDI also uses the AWS Glacier storage as a long-term “cold data” archive.
EDI is 40th DataONE Member Node
April 18, 2017
The contents of the EDI Data Repository are now discoverable through DataONE. EDI became the 40th DataONE Member Node when it registered its version of the DataONE Generic Member Node (GMN) software stack to synchronize EDI data content through the DataONE Federation.