[ceph-users] Re: NIH Datasets

Linas Vepstas Mon, 07 Apr 2025 12:34:54 -0700

Thanks Šarūnai and all who responded.

I guess general discussion will need to go off-list. But first:

To summarize, the situation seems to be this:
* As a general rule, principal investigators (PIs) always have a copy
of their "master dataset", which thus is "safe" as long as they don't
lose control over it.
* Certain data sets are popular and are commonly shared.
* NCBI publishes data sets with the goal of making access easy,
transparent, fast and documented, and it shoulders the burden of
network costs, sysadmin, server maintenance, etc. It is this "free,
easy, managed-for-you" infrastructure that is at risk.
* Unlike climate data, some of the NIH data is covered by HIPAA (e.g.
cancer datasets) because it contains personal identifying information.
I have no clue how this is dealt with. Encryption? Passwords?
Restricted access? Who makes the decision about who is allowed, and
who is not allowed to work with, access, copy or mirror the data? WTF?
I'm clueless here.

 What are the technical problems to be solved? As long as PIs have a
copy of a master dataset, the technical issues are:
-- how to find it?
-- what does it contain?
-- is there enough network bandwidth?
-- can it be copied in full?
-- if it can be, where are the mirrors / backups?
-- If the PI's lab is shut down, who pays for the storage and network
connectivity for the backups?
-- How to protect against loss of backup copies?
-- How to gain access to backup copies?

The above issues sit at the "library science" level: yes, technology
can help, but it's also social and organizational. So it's not really
about "how can we build a utopian decentralized data store" in some
abstract way that shards data across multiple nodes (which is what
IPFS seemed to want to be). Instead, it's four-fold:

 * How is the catalog of available data maintained?
 * How is the safety of backup copies ensured?
 * How do we cache data, improve latency, improve bandwidth?
 * How are the administrative burdens shared? (sysadmin, cost of
servers, bandwidth)

This is way far outside of the idea of "let's just harness a bunch of
disks together on the internet", but it is the actual problem being
faced.
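
To make the first two questions (catalog, safety of copies) a bit more
concrete, here is a minimal sketch in Python. Everything in it is an
assumption made for illustration: the CatalogEntry fields, the function
names, and the choice of SHA-256 are hypothetical, not something NCBI
or any PI is known to use. The idea is only that a catalog record says
what a dataset is and where its mirrors are, and carries a checksum so
that any backup copy can be verified against the master.

```python
import hashlib
from dataclasses import dataclass, field
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


@dataclass
class CatalogEntry:
    """Hypothetical catalog record: what the data is and where copies live."""
    name: str
    description: str
    size_bytes: int
    sha256: str
    mirrors: list = field(default_factory=list)


def make_entry(path, name, description, mirrors=()):
    """Build a catalog record from the master copy of a file."""
    p = Path(path)
    return CatalogEntry(name, description, p.stat().st_size,
                        sha256_of(p), list(mirrors))


def verify_copy(entry, path):
    """Return True if a backup copy matches the cataloged size and checksum."""
    p = Path(path)
    return p.stat().st_size == entry.size_bytes and sha256_of(p) == entry.sha256
```

A mirror operator would call verify_copy() periodically on each stored
file; a False result means the copy is damaged or incomplete and should
be re-fetched from another mirror. This says nothing about the
organizational questions (who pays, who is allowed access), which
remain the hard part.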

-- Linas

On Mon, Apr 7, 2025 at 8:07 AM Šarūnas Burdulis
<saru...@math.dartmouth.edu> wrote:
>
> On 4/4/25 11:39 PM, Linas Vepstas wrote:
> > OK what you will read below might sound insane but I am obliged to ask.
> >
> > There are 275 petabytes of NIH data at risk of being deleted. Cancer
> > research, medical data, HIPAA type stuff. Currently unclear where it's
> > located, how it's managed, who has access to what, but let's ignore
> > that for now. It's presumably splattered across data centers, cloud,
> > AWS, supercomputing labs, who knows. Everywhere.
>
> Similar to climate research data back in 2017... It was all accessible
> via FTP or HTTP though. A Climate Mirror initiative was created and a
> distributed copy worldwide was made eventually. Essentially, a list of
> URLs was provided and some helper scripts to slurp multiple copies of
> data repositories.
>
> https://climatemirror.org/
> https://github.com/climate-mirror
>
>
> --
> Šarūnas Burdulis
> Dartmouth Mathematics
> math.dartmouth.edu/~sarunas
>
> · https://useplaintext.email ·
> _______________________________________________
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

-- 
Patrick: Are they laughing at us?
Sponge Bob: No, Patrick, they are laughing next to us.
