Hi Taavi, thanks for your reply, it’s super helpful! I’ll give Cloud VPS a
try.

> Documentation about the Ceph cluster powering Cloud VPS is on a separate
Wikitech page:
> <https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Ceph>.

Curious: if Cloud VPS already has a working Ceph cluster, might it be
possible to run Ceph’s built-in object store?
https://docs.ceph.com/en/pacific/radosgw/
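
If that ever becomes available, using it could look roughly like the
following from Python; just a sketch, assuming an S3-compatible radosgw
endpoint (the hostname, bucket, and credentials below are invented):

import boto3  # radosgw exposes an S3-compatible API

s3 = boto3.client(
    "s3",
    endpoint_url="https://objects.cloud.example.org",  # invented endpoint
    aws_access_key_id="...",
    aws_secret_access_key="...",
)
s3.upload_file("qrank.csv.gz", "my-bucket", "qrank.csv.gz")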

— Sascha

On Wed, Dec 22, 2021 at 18:41 Taavi Väänänen <h...@taavi.wtf> wrote:

> Hi!
>
> On 12/22/21 18:29, Sascha Brawer wrote:
> > What storage options does the Wikimedia cloud have? Can external
> > developers (i.e. people not employed by the Wikimedia foundation) write
> > to Cinder and/or Swift? Either from Toolforge or from Cloud VPS?
>
> I've left more detailed replies inline. tl;dr: Currently Toolforge
> doesn't really have any option other than NFS. Cloud VPS additionally
> gives you the option to use Cinder (extra volumes that you can attach to
> a VM and move from one VM to another).
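>
> (For the curious, attaching and moving a Cinder volume looks roughly
> like this with the OpenStack SDK; just a sketch, and the cloud, server,
> and volume names below are invented:)
>
> import openstack  # pip install openstacksdk
>
> conn = openstack.connect(cloud="my-project")  # invented cloud config name
> vol = conn.create_volume(size=40, name="scratch")  # 40 GiB Cinder volume
> conn.attach_volume(conn.get_server("vm-a"), vol)
> # ... use the volume on vm-a, then move it to another VM:
> conn.detach_volume(conn.get_server("vm-a"), vol)
> conn.attach_volume(conn.get_server("vm-b"), vol)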
>
> > See below for context. (Actually, is this the right list, or should I
> > ask elsewhere?)
> >
> > For Wikidata QRank [https://qrank.toolforge.org/], I run a cronjob on
> > the toolforge
> > Kubernetes cluster. The cronjob mainly works on Wikidata dumps and
> > anonymized Wikimedia access logs, which it reads from the NFS-mounted
> > /public/dumps/public directory. Currently, the job produces 40
> > internal files with a total size of 21G; these files need to be
> > preserved between individual cronjob runs. (In a forthcoming version of
> > the cronjob, this will grow to ~200 files with a total size of ~40G).
> > For storing these intermediate files, Cinder might be a good solution.
> > However, afaik Cinder isn’t available on Toolforge. Therefore, I’m
> > currently storing the intermediate files in the account’s home directory
> > on NFS. Presumably (I'm not sure; I'm speculating because I've seen
> > NFS crumble elsewhere) Wikimedia's NFS server is easy to overload; in
> > any case, it seems to protect itself by throttling access. Because of
> > the throttling, the cronjob is slow when working with its intermediate
> > files.
> > * Will Cinder be made available to Toolforge users? When?
>
> We're interested in it, but no one has had the time to work on making
> it a reality yet. This is tracked on Phabricator:
> <https://phabricator.wikimedia.org/T275555>.
>
> As a reminder: if anyone is interested in working on this or other parts
> of the WMCS infrastructure, please talk to us!
>
> > * Or should I move from Toolforge to Cloud-VPS, so I can store my
> > intermediate files on Cinder?
>
> ~40G is in the range where Cinder/Cloud VPS might indeed be a better
> solution than NFS. While we don't currently have any official numbers on
> what is acceptable on NFS and what's not, for context the Toolforge
> project's NFS cluster currently has about 8T of storage for about 3,000
> tools.
>
> > * Or should I store my intermediate files in some object storage? Swift?
> > Ceph? Something else?
>
> WMCS currently doesn't offer direct access to any object storage
> service. This is something we're likely to work on in the mid-term (next
> 6-12 months is the last estimate I've heard). This project is currently
> stalled on some network design work:
> <https://phabricator.wikimedia.org/T289882>.
>
> > * Is access to Cinder and Swift subject to the same throttling as
> > NFS? Or will moving away from NFS increase the available I/O throughput?
>
> No, NFS is subject to completely separate throttling; Ceph-backed
> storage (local VM disks and Cinder volumes) has much more bandwidth
> available.
>
> > The final output of the QRank system is a single file, currently ~100M
> > in size but eventually growing to ~1G. When the cronjob has computed a
> > fresh version of its output, it deletes any old outputs from previous
> > runs (except for the last two versions, which are kept around
> > internally for debugging). Typical users are other bots or external
> > pipelines that need a signal for prioritizing Wikidata entities,
> > not end users on the web. Users typically check for updates with HTTP
> > HEAD, or with conditional HTTP GET requests (using the standard
> > If-Modified-Since and If-None-Match headers). Currently, I’m serving the
> > output file with a custom-written HTTP server that runs as a web service
> > on Toolforge behind Toolforge’s nginx instance. My server reads its
> > content from the NFS-mounted home directory that’s getting populated by
> > the cronjob. Now, it’s not exactly a great idea to serve large data
> > files from NFS, but afaik it’s the only option available in the
> > Wikimedia cloud, at least for Toolforge users. Of course I might be
> > wrong.
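> >
> > (For illustration, the update check described above is roughly the
> > following; a sketch in Python with the requests library, and the file
> > path is invented:)
> >
> > import requests
> >
> > url = "https://qrank.toolforge.org/download/qrank.csv.gz"  # invented path
> > etag = requests.head(url).headers.get("ETag")
> >
> > # On a later run, download only if the file changed in the meantime:
> > resp = requests.get(url, headers={"If-None-Match": etag or ""})
> > if resp.status_code == 304:
> >     print("unchanged; keeping the cached copy")
> > else:
> >     with open("qrank.csv.gz", "wb") as f:
> >         f.write(resp.content)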
>
> > * Should I move from Toolforge to Cloud-VPS, so I can serve my final
> > output files from Cinder instead of NFS?
> > * Or should I rather store my final output files in some object storage?
> > Swift? Ceph? Something else?
> > * Or is NFS just fine, even if the size of my data grows from 100M to
> > 1G+?
>
> Once we offer object storage, yes, storing your files in it will be a
> good idea. I think you should be fine with NFS for now (please don't
> quote me on that). Cloud VPS is an option too if you prefer it.
>
> >
> > The cronjob also uses ~5G of temporary files in /tmp, which it deletes
> > towards the end of each run. The temp files are used for external
> > sorting, so all access is sequential (see the sketch below). I'm not
> > sure where these temporary files currently sit when running on
> > Toolforge Kubernetes. Given their volume, I presume that the tmpfs of
> > the Kubernetes nodes will eventually run out of memory and then fall
> > back to disk, but I wouldn't know how to find this out. _If_ the
> > backing disk for tmpfs eventually ends up being mounted on NFS, that
> > sounds wasteful for the poor NFS server, especially since the files
> > get deleted at job completion. In that case, I'd love to save shared
> > resources by using a local disk. (It doesn't have to be an SSD; a
> > spinning hard drive would be fine, given the sequential access
> > pattern.) But I'm not sure how to set this up on Toolforge Kubernetes,
> > and I couldn't find docs on wikitech. Actually, this might be a
> > micro-optimization, so perhaps it's not worth the trouble. But then,
> > I'd like to be kind to the precious shared resources in the Wikimedia
> > cloud.
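> >
> > (The sketch mentioned above: external sorting with temp runs, roughly,
> > in Python; the chunk size is made up, and tempfile puts the runs in
> > /tmp by default.)
> >
> > import heapq
> > import tempfile
> >
> > def flush_run(lines):
> >     # Write one sorted run to a temp file in /tmp; rewind for reading.
> >     run = tempfile.TemporaryFile(mode="w+")
> >     run.writelines(sorted(lines))
> >     run.seek(0)
> >     return run
> >
> > def external_sort(lines, chunk_size=1_000_000):
> >     # Sort an arbitrarily large stream of newline-terminated lines.
> >     runs, buf = [], []
> >     for line in lines:
> >         buf.append(line)
> >         if len(buf) >= chunk_size:
> >             runs.append(flush_run(buf))
> >             buf = []
> >     if buf:
> >         runs.append(flush_run(buf))
> >     # Merging pre-sorted runs reads each temp file purely sequentially.
> >     return heapq.merge(*runs)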
>
> Good question; I'm not sure either whether tmpfs for Kubernetes
> containers is on Ceph (SSDs) or in RAM. At least it's not on NFS.
>
> > Sorry that I couldn’t find the answers online. While searching, I came
> > across the following pointers:
> > – https://wikitech.wikimedia.org/wiki/Ceph: This page has a warning that
> > it’s probably “no longer true”. If the warning is correct, perhaps
> > the page could be deleted entirely? Or maybe it could link to the
> > current docs?
> > – https://wikitech.wikimedia.org/wiki/Swift: This sounds perfect, but
> > the page doesn't mention how the files get populated, how the ACLs are
> > managed, or whether Wikimedia's Swift cluster is even accessible to
> > external developers.
> > – https://wikitech.wikimedia.org/wiki/Media_storage: This seems
> > current (I guess?), but the page doesn't mention if/how external
> > Toolforge/Cloud-VPS users may upload objects, or whether it is just for
> > its current users.
>
> Those pages document the media storage systems used to store uploads for
> the production MediaWiki projects (Wikipedia and friends). They are not
> accessible from WMCS and should be treated as completely separate
> systems; any future WMCS (object) storage services will not use them.
>
> Documentation about the Ceph cluster powering Cloud VPS is on a separate
> Wikitech page:
> <https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Ceph>.
>
> --
> Taavi (User:Majavah)
> volunteer Toolforge/Cloud VPS admin
>
_______________________________________________
Cloud mailing list -- cloud@lists.wikimedia.org
List information: 
https://lists.wikimedia.org/postorius/lists/cloud.lists.wikimedia.org/
