[ceph-users] Re: NIH Datasets

Linas Vepstas Mon, 07 Apr 2025 16:13:24 -0700

Hi Tim,

Agree w/ all you say. My two cents as an old-timer. These ideas have been
around for decades. And there have been many plans, attempts, screeds,
projects (I can rattle off a random list, but so can search engines
and wikipedia). All with good intentions, heart in the right place.
All got mired or stuck in one way or another.


What does it take to gain critical mass and move forward? The
traditional road to success is:
-- Build a prototype that works.
-- Make sure it solves an actual problem that actual people really have.
-- Make it easy to understand.
-- Accept patches rapidly.
-- Lots of luck, and getting slashdotted.

Maybe one of the many existing projects could be adapted, re-formed,
re-aimed. Or maybe they're all dead in the water because they failed
one or more of the above five bullets.

-- Linas

On Mon, Apr 7, 2025 at 3:11 PM Tim Holloway <t...@mousetech.com> wrote:
>
> Sounds like a discussion for a discord server. Or BlueSky or something
> that's very definitely NOT what used to be known as twitter.
>
> My viewpoint is a little different. I really didn't consider HIPAA
> stuff, although technically that is info that shouldn't be
> accessible to anyone but authorized staff at NIH - and there's the rub,
> if the very persons/offices involved are purged. At that point, what
> we'd really be doing is simply hiding it until a saner regime comes
> along and wants it back.
>
> But it's not just NIH that's being tossed down the Memory Hole. NASA,
> NOAA, and other agencies are also being "cleansed". We should properly
> be safeguarding ALL of that. Reminds me of Isaac Asimov's Foundation -
> an agency to preserve human knowledge over the dark ages.
>
> Also, the idea of having fixed homes for complete documents I feel is
> limiting. I'm minded of how the folding@home project distributed work to
> random volunteers. And again, how ceph can hash objects into PGs and
> splatter them to replicas on multiple servers. It's less important for a
> given document server to be 100% online than it is to have the ability for
> nodes to check in and out and maintain a gestalt.
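To make the "check in and out" idea concrete, here is a minimal sketch of hash-based placement. It is NOT Ceph's CRUSH algorithm; it swaps in plain rendezvous hashing as a simplified stand-in, and every name in it (placement_group, replicas_for_pg, the node names) is hypothetical.

```python
import hashlib


def _stable_hash(text: str) -> int:
    """64-bit hash that is stable across runs (built-in hash() is salted)."""
    return int.from_bytes(hashlib.sha256(text.encode("utf-8")).digest()[:8], "big")


def placement_group(object_name: str, pg_count: int) -> int:
    """Map an object name to one of pg_count placement groups."""
    return _stable_hash(object_name) % pg_count


def replicas_for_pg(pg: int, online_nodes: list[str], copies: int = 3) -> list[str]:
    """Rendezvous hashing: rank the nodes that are currently online by
    hash(pg, node) and keep the top `copies` of them."""
    ranked = sorted(online_nodes, key=lambda node: _stable_hash(f"{pg}:{node}"), reverse=True)
    return ranked[:copies]


if __name__ == "__main__":
    nodes = ["node-a", "node-b", "node-c", "node-d", "node-e"]
    pg = placement_group("nih/dataset-001/chunk-42", pg_count=128)
    print(pg, replicas_for_pg(pg, nodes))
```

The property that matters here: when a node checks out, only the placement groups it was holding need a new home. The relative ranking of the remaining nodes does not change, so the other replicas stay where they are.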
>
> As for the management of all this, I'd say that the top-level domain of
> my theoretical namespace would be a select committee in charge of the
> master servers. sub-domains would be administered by grant from the top
> level and have their own administrators. And so forth until you have
> librarian administrators. Existing examples can be seen in some of the
> larger git archives, such as for Linux. The Wikipedia can also provide
> examples of how to administer tamper-resistant information.
>
> So, in short, I'm proposing a sort of world-wide web of documents.
> Something that can live in the background of ordinary user computers,
> perhaps. But most importantly, reliable, accessible and secure.
>
>    Tim
>
>
> On 4/7/25 15:33, Linas Vepstas wrote:
> > Thanks Šarūnai and all who responded.
> >
> > I guess general discussion will need to go off-list. But first:
> >
> > To summarize, the situation seems to be this:
> > * As a general rule, principal investigators (PIs) always have a copy
> > of their "master dataset", which thus is "safe" as long as they don't
> > lose control over it.
> > * Certain data sets are popular and are commonly shared.
> > * NCBI publishes data sets, with the goal of making access easy,
> > transparent, fast, and documented, and it shoulders the burden of network
> > costs, sysadmin, server maintenance, etc. It is this "free, easy,
> > managed-for-you" infrastructure that is at risk.
> > * Unlike climate data, some of the NIH data is covered by HIPAA (e.g.
> > cancer datasets) because it contains personal identifying information.
> > I have no clue how this is dealt with. Encryption? Passwords?
> > Restricted access? Who makes the decision about who is allowed, and
> > who is not allowed to work with, access, copy or mirror the data? WTF?
> > I'm clueless here.
> >
> >   What are the technical problems to be solved? As long as PIs have a
> > copy of a master dataset, the technical issues are:
> > -- how to find it?
> > -- what does it contain?
> > -- is there enough network bandwidth?
> > -- can it be copied in full?
> > -- if it can be, where are the mirrors / backups?
> > -- If the PI's lab is shut down, who pays for the storage and network
> > connectivity for the backups?
> > -- How to protect against loss of backup copies?
> > -- How to gain access to backup copies?
> >
> > The above issues sit at the "library science" level: yes, technology
> > can help, but it's also social and organizational. So it's not really
> > about "how can we build a utopian decentralized data store" in some
> > abstract way that shards data across multiple nodes (which is what
> > IPFS seemed to want to be). Instead, it's four-fold:
> >
> >   * How is the catalog of available data maintained?
> >   * How is the safety of backup copies ensured?
> >   * How do we cache data, improve latency, improve bandwidth?
> >   * How are the administrative burdens shared? (sysadmin, cost of
> > servers, bandwidth)
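For the first two questions (catalog and safety of copies), one low-tech approach is a per-dataset manifest of file sizes and checksums that any mirror can be checked against. The sketch below assumes the dataset is a plain directory tree. The function names and the manifest layout are made up for illustration, not taken from any existing project.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files never sit in memory whole."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, dict]:
    """Catalog every file under root: relative path -> size and SHA-256."""
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            rel = path.relative_to(root).as_posix()
            manifest[rel] = {"bytes": path.stat().st_size, "sha256": sha256_of(path)}
    return manifest


def verify_copy(manifest: dict[str, dict], mirror_root: Path) -> list[str]:
    """Return a list of problems found in a mirror; empty means it matches."""
    problems = []
    for rel, entry in manifest.items():
        candidate = mirror_root / rel
        if not candidate.is_file():
            problems.append(f"missing: {rel}")
        elif sha256_of(candidate) != entry["sha256"]:
            problems.append(f"corrupt: {rel}")
    return problems
```

The manifest is tiny compared to the data, so it can be published widely (and signed) even when the data itself cannot be. It says nothing about who is allowed to hold a copy; that part stays a social problem.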
> >
> > This is way far outside of the idea of "let's just harness a bunch of
> > disks together on the internet", but it is the actual problem being
> > faced.
> >
> > -- Linas
> >
> >
> > On Mon, Apr 7, 2025 at 8:07 AM Šarūnas Burdulis
> > <saru...@math.dartmouth.edu> wrote:
> >> On 4/4/25 11:39 PM, Linas Vepstas wrote:
> >>> OK what you will read below might sound insane but I am obliged to ask.
> >>>
> >>> There are 275 petabytes of NIH data at risk of being deleted. Cancer
> >>> research, medical data, HIPAA type stuff. Currently unclear where it's
> >>> located, how it's managed, who has access to what, but let's ignore
> >>> that for now. It's presumably splattered across data centers, cloud,
> >>> AWS, supercomputing labs, who knows. Everywhere.
> >> Similar to climate research data back in 2017... It was all accessible
> >> via FTP or HTTP though. A Climate Mirror initiative was created and a
> >> distributed copy worldwide was made eventually. Essentially, a list of
> >> URLs was provided and some helper scripts to slurp multiple copies of
> >> data repositories.
> >>
> >> https://climatemirror.org/
> >> https://github.com/climate-mirror
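A rough sketch of what a "list of URLs plus helper script" mirror might look like. This is not the actual Climate Mirror tooling: the function names and the on-disk layout (host/path under a destination directory) are assumptions, and the path handling is not hardened against hostile URLs.

```python
import urllib.request
from pathlib import Path
from urllib.parse import urlsplit


def local_path_for(url: str, dest_root: Path) -> Path:
    """Store each URL under dest_root/<host>/<path> so the origin stays visible."""
    parts = urlsplit(url)
    relative = parts.path.lstrip("/") or "index"
    return dest_root / parts.netloc / relative


def mirror(urls, dest_root: Path, fetch=urllib.request.urlretrieve) -> list[Path]:
    """Download every URL that is not already present; return the new files.

    `fetch` takes (url, filename); it is a parameter so that a different
    downloader (or a fake, for testing) can be swapped in.
    """
    fetched = []
    for url in urls:
        target = local_path_for(url, dest_root)
        if target.exists():
            continue  # already mirrored on an earlier run
        target.parent.mkdir(parents=True, exist_ok=True)
        fetch(url, str(target))
        fetched.append(target)
    return fetched
```

Skipping files that already exist makes the script safe to re-run after an interruption, which matters when the list is long and the links are slow.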
> >>
> >>
> >> --
> >> Šarūnas Burdulis
> >> Dartmouth Mathematics
> >> math.dartmouth.edu/~sarunas
> >>
> >> · https://useplaintext.email ·
> >> _______________________________________________
> >> ceph-users mailing list -- ceph-users@ceph.io
> >> To unsubscribe send an email to ceph-users-le...@ceph.io
> >
> >



-- 
Patrick: Are they laughing at us?
Sponge Bob: No, Patrick, they are laughing next to us.
