[ceph-users] Re: NIH Datasets

Tim Holloway Tue, 08 Apr 2025 11:07:43 -0700

I don't think Linas is only concerned with public data, no.

The United States Government has had in place for many years effective means of preserving its data. Some of those systems may be old and creaky, granted, and not always the most efficient, but they suffice.

The problem is that the current administration and their unelected henchmen are running rampant over them. Firing people with critical knowledge, and ordering the destruction of carefully amassed data.

So what we're looking at is more akin to WikiLeaks than to simple data archival. That is, for the good of the nation - and the world - that data should be preserved, not purged. And yes, that might include obtaining information by shady means because the administration would rather see some data completely expunged in the same way that the Taliban destroyed the Buddha statues.

A Ceph-like system would definitely be a good model, but for the sake of reliability and accessibility, we'd want something that could endure nodes popping in and dropping out, and to lessen the chances of a full-out invasion, that data would best be anonymously replicated over many nodes. Ceph loves very large datasets, so it should be sufficient were we to keep all our eggs in one basket. But again, that data has enemies, so extra measures are needful that Ceph wasn't designed to handle.

I think such a project is doable, but it should borrow from many other community service architectures as well. It's not something that is currently available off-the-shelf, but neither were a lot of the technologies we now depend on every day.


On 4/8/25 13:26, Alex Gorbachev wrote:

I was trying to analyze the original request, which seems to be something
of the following set:

- The goal is to archive a large amount of (presumably public) data on a
community run globally sharded or distributed storage.

- Can Ceph be used for this?  Seems no, at least not in a sense of running
lots of OSDs at different locations by different people loosely coupled
into one global public data repo.  Perhaps, there are some other ideas from
people who have done this kind of thing.

- Are there restrictions on obtaining the data?  If it's public and
accessible now, it should be able to be copied.  If not, what are the
restrictions on obtaining and copying the data?

- Organization: how will the storage and maintenance of data be organized
(and funded)?  A foundation, a SETI-at-home like network, a blockchain (to
preserve data veracity)?

- Legal support?

--
Alex Gorbachev




On Tue, Apr 8, 2025 at 9:41 AM Anthony D'Atri <a...@dreamsnake.net> wrote:

The intent is the US administration’s assault against science. Linas
doesn’t *want* to do it; he wants to preserve it in the hope of a better
future.

On Apr 8, 2025, at 9:28 AM, Alex Gorbachev <a...@iss-integration.com> wrote:

Hi Linas,

Is the intent of purging of this data mainly due to just cost concerns?  If
the goal is purely preservation of data, the likely cheapest and least
maintenance intensive way of doing this is a large scale tape archive.
Such archives (purely based on a google search) exist at LLNL and OU, and
there is a TAPAS service from SpectraLogic.

I would imagine questions would arise about custody of the data, legal
implications etc.  The easiest is for the organization already hosting the
data to just preserve it by archiving, and thereby claim a significant cost
reduction.

--
Alex Gorbachev




On Sun, Apr 6, 2025 at 11:08 PM Linas Vepstas <linasveps...@gmail.com>
wrote:

OK what you will read below might sound insane but I am obliged to ask.

There are 275 petabytes of NIH data at risk of being deleted. Cancer
research, medical data, HIPAA type stuff. Currently unclear where it's
located, how it's managed, who has access to what, but let's ignore
that for now. It's presumably splattered across data centers, cloud,
AWS, supercomputing labs, who knows. Everywhere.

I'm talking to a biomed person in Australia who uses NCBI data
daily, she's in talks w/ Australian govt to copy and preserve the
datasets they use. Some multi-petabytes of stuff. I don't know.

While bouncing around tech ideas, IPFS and Ceph came up. My experience
with IPFS is that it's not a serious contender for anything. My
experience with Ceph is that it's more-or-less A-list.

OK. So here's the question: is it possible to (has anyone tried) set
up an internet-wide Ceph cluster? Ticking off the typical checkboxes
for "decentralized storage"? Stuff, like: internet connections need to
be encrypted. Connections go down, come back up. Slow. Sure, national
labs may have multi-terabit fiber, but little itty-bitty participants
trying to contribute a small collection of disks to a large pool might
only have a gigabit connection, of which maybe 10% is "usable".
Barely. So, a hostile networking environment.
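
For a rough sense of scale, the numbers above can be turned into a
back-of-the-envelope estimate. This is only a sketch under assumptions that
are not stated in the thread: decimal petabytes, "10% usable" read as about
100 Mbit/s sustained, no protocol overhead, no replication. The function
name transfer_years is made up for illustration.

```python
# Back-of-the-envelope only. Assumptions (not from the thread): decimal
# petabytes, "10% usable" means ~100 Mbit/s sustained, no protocol overhead,
# and every participant has an identical link.

SECONDS_PER_YEAR = 365.25 * 24 * 3600


def transfer_years(size_bytes, link_bits_per_s, usable_fraction, participants=1):
    """Years needed to move size_bytes once, split evenly across the links."""
    usable_bits_per_s = link_bits_per_s * usable_fraction * participants
    return size_bytes * 8 / usable_bits_per_s / SECONDS_PER_YEAR


NIH_BYTES = 275 * 10**15   # 275 PB, the figure quoted above
GIGABIT = 10**9            # 1 Gbit/s nominal link

one = transfer_years(NIH_BYTES, GIGABIT, 0.10)
many = transfer_years(NIH_BYTES, GIGABIT, 0.10, participants=1000)

print(f"one participant:   {one:,.0f} years")
print(f"1000 participants: {many * 12:.1f} months")
```

Under those assumptions a single copy over one such link takes on the order
of centuries, and a thousand such links bring it under a year, before any
replication or retransmission is counted.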

Is this like, totally insane, run away now, can't do that, it won't
work idea, or is there some glimmer of hope?

Am I misunderstanding something about IPFS that merits taking a second
look at it?

Is there any other way of getting scalable reliable "decentralized"
internet-wide storage?

I mean, yes, of course, the conventional answer is that it could be
copied to AWS or some national lab or two somewhere in the EU or Aus
or UK or wherever. That's the "obvious" answer. I'm looking for a
non-obvious answer, an IPFS-like thing, but one that actually works.
Could it work?

-- Linas


--
Patrick: Are they laughing at us?
Sponge Bob: No, Patrick, they are laughing next to us.
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

