Sounds like a discussion for a Discord server, or Bluesky, or something that's very definitely NOT what used to be known as Twitter.

My viewpoint is a little different. I really didn't consider the HIPAA angle, since technically that is info that shouldn't be accessible to anyone but authorized staff at NIH - and there's the rub, if the very persons/offices involved are purged. At that point, what we'd really be doing is simply hiding the data until a saner regime comes along and wants it back.

But it's not just NIH that's being tossed down the Memory Hole. NASA, NOAA, and other agencies are also being "cleansed". We should properly be safeguarding ALL of that. It reminds me of Isaac Asimov's Foundation - an institution created to preserve human knowledge through a dark age.

Also, I feel the idea of having fixed homes for complete documents is limiting. I'm reminded of how the Folding@home project distributed work units to random volunteers, and of how Ceph maps objects to placement groups (PGs) and scatters the replicas across multiple servers. It's less important for a given document server to be 100% online than it is for nodes to be able to check in and out while still maintaining a gestalt.
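To make that concrete, here's a minimal sketch of the placement idea in Python. Everything here is hypothetical - the node names, replica count, and chunk size are made up, and the placement rule is only Ceph-like in spirit:

import hashlib

NODES = ["node-a", "node-b", "node-c", "node-d", "node-e"]  # volunteers
REPLICAS = 3          # copies kept of each chunk
CHUNK_SIZE = 4 << 20  # 4 MiB chunks

def place_chunks(data: bytes) -> dict:
    """Split a document into chunks; pick REPLICAS nodes per chunk."""
    placement = {}
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        # Deterministic placement derived from the chunk id, so any
        # node can recompute who should hold what without a central
        # lookup - the same property CRUSH gives Ceph.
        start = int(digest, 16) % len(NODES)
        placement[digest] = [NODES[(start + r) % len(NODES)]
                             for r in range(REPLICAS)]
    return placement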

As for the management of all this, I'd say that the top-level domain of my theoretical namespace would be a select committee in charge of the master servers. Sub-domains would be administered by grant from the top level and have their own administrators, and so forth, down to the librarian administrators. Existing examples can be seen in some of the larger git repositories, such as the Linux kernel's. Wikipedia can also provide examples of how to administer tamper-resistant information.
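A toy sketch of that delegation chain, with every name invented purely for illustration:

from dataclasses import dataclass, field

@dataclass
class Domain:
    """One level of the namespace, with its own administrators."""
    name: str
    admins: list
    children: dict = field(default_factory=dict)

    def delegate(self, subname: str, admins: list) -> "Domain":
        # The parent's administrators grant a sub-domain its own staff.
        child = Domain(f"{self.name}/{subname}", admins)
        self.children[subname] = child
        return child

root = Domain("archive", ["select-committee"])   # master servers
nih = root.delegate("nih", ["nih-admins"])       # hypothetical names
ncbi = nih.delegate("ncbi", ["librarian-admins"])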

So, in short, I'm proposing a sort of world-wide web of documents. Something that can live in the background on ordinary users' computers, perhaps. But most importantly, something reliable, accessible, and secure.

  Tim


On 4/7/25 15:33, Linas Vepstas wrote:
Thanks Šarūnai and all who responded.

I guess general discussion will need to go off-list. But first:

To summarize, the situation seems to be this:
* As a general rule, principal investigators (PIs) always have a copy
of their "master dataset", which is thus "safe" as long as they don't
lose control over it.
* Certain data sets are popular and are commonly shared.
* NCBI publishes data sets with the goal of making access easy,
transparent, fast, and documented, and it shoulders the burden of
network costs, sysadmin, server maintenance, etc. It is this "free,
easy, managed-for-you" infrastructure that is at risk.
* Unlike climate data, some of the NIH data is covered by HIPAA (e.g.
cancer datasets) because it contains personally identifiable
information. I have no clue how this is dealt with. Encryption?
Passwords? Restricted access? Who decides who is, and who is not,
allowed to work with, access, copy, or mirror the data? WTF? I'm
clueless here.

  What are the technical problems to be solved? As long as PIs have a
copy of a master dataset, the technical issues are (a catalog sketch
follows this list):
-- How to find it?
-- What does it contain?
-- Is there enough network bandwidth?
-- Can it be copied in full?
-- If it can be, where are the mirrors / backups?
-- If the PI's lab is shut down, who pays for the storage and network
connectivity for the backups?
-- How to protect against loss of backup copies?
-- How to gain access to backup copies?
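The first two questions (and the mirrors question) are really a
catalog problem. Here is a minimal sketch of what one catalog entry
might hold - every field name and URL below is invented for
illustration:

catalog_entry = {
    "name": "example-cancer-genomics",           # made-up dataset name
    "description": "what it contains: formats, provenance, units",
    "size_bytes": 42_000_000_000,
    "sha256": "…",                                # checksum of archive
    "primary": "https://example.edu/data/…",      # the PI's master copy
    "mirrors": [
        "https://mirror-one.example.org/…",
        "https://mirror-two.example.net/…",
    ],
    "contact": "pi@example.edu",
    "access": "public",   # or "restricted" for HIPAA-covered sets
}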

The above issues sit at the "library science" level: yes, technology
can help, but it's also social and organizational. So it's not really
about "how can we build a utopian decentralized data store" in some
abstract way that shards data across multiple nodes (which is what
IPFS seemed to want to be). Instead, the problem is four-fold:

  * How is the catalog of available data maintained?
  * How is the safety of backup copies ensured? (sketch below)
  * How do we cache data, improve latency, improve bandwidth?
  * How are the administrative burdens shared? (sysadmin, cost of
servers, bandwidth)
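
To make the second point concrete, a minimal verification sketch -
the catalog format is the hypothetical one above, but the hashing
itself is standard:

import hashlib

def verify_copy(path: str, expected_sha256: str) -> bool:
    """Hash a local backup in blocks; compare with the catalog entry."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest() == expected_sha256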

This is far outside the idea of "let's just harness a bunch of disks
together on the internet", but it is the actual problem being faced.

-- Linas


On Mon, Apr 7, 2025 at 8:07 AM Šarūnas Burdulis
<saru...@math.dartmouth.edu> wrote:
On 4/4/25 11:39 PM, Linas Vepstas wrote:
OK, what you will read below might sound insane, but I am obliged to ask.

There are 275 petabytes of NIH data at risk of being deleted. Cancer
research, medical data, HIPAA-type stuff. It is currently unclear
where it's located, how it's managed, or who has access to what, but
let's ignore that for now. It's presumably splattered across data
centers, cloud, AWS, supercomputing labs, who knows. Everywhere.
Similar to climate research data back in 2017... It was all
accessible via FTP or HTTP, though. A Climate Mirror initiative was
created and a distributed worldwide copy was eventually made.
Essentially, a list of URLs was provided, along with some helper
scripts to slurp multiple copies of the data repositories.

https://climatemirror.org/
https://github.com/climate-mirror
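
For anyone who hasn't seen them, the helper scripts were roughly of
this shape (a sketch, not the actual climate-mirror code; the URL
list and destination directory are placeholders):

import os
import urllib.request

def slurp(url_list: str, dest_dir: str) -> None:
    """Fetch every URL in a text file into a local mirror directory."""
    os.makedirs(dest_dir, exist_ok=True)
    with open(url_list) as f:
        for url in (line.strip() for line in f):
            if not url:
                continue
            fname = os.path.join(dest_dir,
                                 url.rstrip("/").rsplit("/", 1)[-1])
            if not os.path.exists(fname):   # skip what's already here
                urllib.request.urlretrieve(url, fname)

slurp("urls.txt", "mirror/")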


--
Šarūnas Burdulis
Dartmouth Mathematics
math.dartmouth.edu/~sarunas

· https://useplaintext.email ·
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

