Re: [CODE4LIB] data sets in multiple respositories

Joe Hourclé Tue, 12 Mar 2024 18:03:37 -0700

> 
> On Mar 11, 2024, at 9:01 AM, Eric Lease Morgan 
> <[email protected]> wrote:
> 
> To what degree is it unethical or unprofessional to deposit data sets in 
> multiple repositories?


> A long time ago, in a galaxy far far away, the preservation of books and 
> journals was ensured when multiple libraries included books and journals in 
> their collections. This philosophy of preservation was well-articulated with 
> the advent of LOCKSS when they said, "Lots of copies keep stuff safe." See: 
> https://www.lockss.org/
> 
> Nowadays, we relegate the preservation of the scholarly record -- whether 
> that be books, journals, or data sets -- to centralized networked services. 
> Hmmm.
> 
> For decades I have been using the Internet to provide access to library 
> collections and services, and one of the things this experience has taught me is, 
> links WILL break. Thus, if I deposit my data sets in multiple Internet 
> locations, then the probability of losing access to the data sets decreases. 
> Yet, like the publishing of articles in multiple journals is seen as 
> unethical, would the publishing of data sets in multiple locations be seen in 
> the same light? One problem with multiple deposits would be generation of 
> multiple DOIs, which raises the question, "Which DOI is the authoritative one?"
> 
> Put more simply, is it okay for me to deposit my data sets in my university's 
> institutional repository as well as something like Zenodo?

Many years ago, I published an alignment of FRBR with scientific data:

https://doi.org/10.1002/meet.2008.14504503102

Although it has some issues with “Active Data” (constantly growing or otherwise 
being modified), and issues of granularity (which to be honest, I don’t think 
FRBR ever really handled the issue of dealing with collections too well), I 
think we need to ask “Is this actually a duplicate?”

Some domain repositories will insist on the data being put into a specific 
format for use by their community… so although it may be the same “data”, it’s 
actually a different Expression or Manifestation of that data.  (If the data 
are still the same but the packaging is different, e.g., saved as GeoTIFF vs. 
NetCDF vs. FITS vs. CDF, it’s a new Manifestation.  If you had to re-grid the 
data to align with a different reference system, it’s a new Expression, too.)

If it’s a bitwise duplicate (same exact file packaging, no additional metadata, 
etc), then it’s the same Manifestation of the same data, but a different Item.  
So maybe it’s a duplicate… but the access is different, so it’s still useful.
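
As a rough illustration of the "bitwise duplicate" test, one could compare checksums of the two deposited files.  This is only a sketch: the function names `sha256_of` and `same_manifestation` are made up for this example, and it assumes both copies have already been downloaded to local paths.

```python
import hashlib


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read fixed-size chunks so multi-GB files never sit fully in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_manifestation(path_a, path_b):
    """True when the two files are byte-for-byte identical.

    In the FRBR reading above, that means the same Manifestation held
    as two different Items (hypothetical helper, not a library API).
    """
    return sha256_of(path_a) == sha256_of(path_b)
```

A mismatch only says the packaging differs; it cannot tell a new Manifestation from a new Expression, which still takes someone who knows how the data was processed.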

In all of these cases, I would make use of the Alternate Identifier in Zenodo 
to link to other copies/variants of the data.  In some cases, I might also look 
to see if ARKs (Archival Resource Keys) would be appropriate to declare that 
it’s the same digital object in multiple locations:

https://arks.org/about/
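
One way to picture the linking is a small record per copy that states how it relates to the original deposit.  The sketch below is an assumption-laden illustration, not Zenodo's actual API payload: the `Deposit` class, the `describe` function, the dictionary layout, and the example.org identifiers are all hypothetical.  The relation names `IsIdenticalTo` and `IsVariantFormOf` are borrowed from the DataCite relation-type vocabulary, and mapping them onto the FRBR levels discussed above is one possible reading, not a settled convention.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deposit:
    identifier: str          # placeholder identifier, not a real DOI or ARK
    sha256: str              # checksum of the packaged file
    regridded: bool = False  # True if the values were resampled or re-gridded


def describe(original, other):
    """Build a hypothetical 'related identifier' record for another copy."""
    if other.sha256 == original.sha256:
        # Byte-for-byte identical: same Manifestation, just another Item.
        frbr, relation = "same Manifestation, different Item", "IsIdenticalTo"
    elif other.regridded:
        # The values themselves were transformed: a new Expression.
        frbr, relation = "new Expression", "IsVariantFormOf"
    else:
        # Same values, different packaging: a new Manifestation.
        frbr, relation = "new Manifestation", "IsVariantFormOf"
    return {"identifier": other.identifier, "relationType": relation, "frbr": frbr}


original = Deposit("https://example.org/ir/dataset-1", "aaa111")
mirror = Deposit("https://example.org/zenodo/dataset-1", "aaa111")
netcdf = Deposit("https://example.org/domain/dataset-1-netcdf", "bbb222")

for copy in (mirror, netcdf):
    print(describe(original, copy))
```

Whatever the field ends up being called in a given repository, the point is that each copy carries an explicit statement of what it is a copy of.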

To get back to the ‘submitting to multiple journals’ comparison… there are 
overlay journals that republish articles in their field of interest.  Yes, 
there is technically one journal that’s authoritative, but domain repositories 
are a bit special as they often provide a service by indexing the data in a 
specific way to make it findable and usable by their specific community.  They 
may also add value over time by adding/updating metadata that’s useful for 
their community (findable, usable, documenting use caveats, etc).  In this way, 
even though the ‘Data Object’ may stay the same, the ‘Information Object’ (per 
OAIS) is no longer the same as what was published in the other repositories.

It’s like if you had two copies of the same physics textbook, one of which was 
marked up by Richard Feynman.  They may have the same ISBN, but the marked up 
one may have additional value to a given community.

Because of that potential for extra value, I don’t fault them for creating a 
new DOI.  But I do believe that they need to track the alternate identifiers / 
locations for the data, especially when you’re dealing with multi-TB 
collections, so people don’t download two copies and only then realize the 
time & bandwidth they just wasted.

-Joe
(Currently unaffiliated)

PS.  There were a few people who argued that data isn’t a Creative Work, and 
therefore had no business being aligned with FRBR… but if you ever hear the 
stories about how scientists calibrate their instruments, you would agree that 
most data is a Creative Work.  (Even the raw data in some cases, when you find 
out what they have to do to get 20-year-old instruments to continue to produce 
data… or brand new instruments that took a long calibration image as part of 
commissioning just when a solar flare happened and damaged the detector.)
