Hi Taavi, thanks for your reply, it’s super helpful! I’ll give Cloud VPS a try.
> Documentation about the Ceph cluster powering Cloud VPS is on a separate
> Wikitech page:
> <https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Ceph>.

Curious: if Cloud VPS already has a working Ceph cluster, might it be
possible to run Ceph’s built-in object store?
https://docs.ceph.com/en/pacific/radosgw/
(A rough sketch of how a tool might use such an S3-compatible endpoint is at
the bottom of this mail, after the quoted thread.)

— Sascha

On Wed, Dec 22, 2021 at 18:41, Taavi Väänänen <h...@taavi.wtf> wrote:

> Hi!
>
> On 12/22/21 18:29, Sascha Brawer wrote:
> > What storage options does the Wikimedia cloud have? Can external
> > developers (i.e. people not employed by the Wikimedia Foundation) write
> > to Cinder and/or Swift? Either from Toolforge or from Cloud VPS?
>
> I've left more detailed replies inline. tl;dr: Currently Toolforge
> doesn't really have any other options than NFS. Cloud VPS additionally
> gives you the option to use Cinder (extra volumes you can attach to a VM
> and move from one VM to another).
>
> > See below for context. (Actually, is this the right list, or should I
> > ask elsewhere?)
> >
> > For Wikidata QRank [https://qrank.toolforge.org/], I run a cronjob on
> > the Toolforge Kubernetes cluster. The cronjob mainly works on Wikidata
> > dumps and anonymized Wikimedia access logs, which it reads from the
> > NFS-mounted /public/dumps/public directory. Currently, the job produces
> > 40 internal files with a total size of 21G; these files need to be
> > preserved between individual cronjob runs. (In a forthcoming version of
> > the cronjob, this will grow to ~200 files with a total size of ~40G.)
> > For storing these intermediate files, Cinder might be a good solution.
> > However, afaik Cinder isn’t available on Toolforge. Therefore, I’m
> > currently storing the intermediate files in the account’s home
> > directory on NFS. Presumably (not sure, just speculating because I’ve
> > seen NFS crumble elsewhere) Wikimedia’s NFS server would be easily
> > overloaded; in any case, Wikimedia’s NFS server seems to protect itself
> > by throttling access. Because of the throttling, the cronjob is slow
> > when working with its intermediate files.
> > * Will Cinder be made available to Toolforge users? When?
>
> We're interested in it, but no-one has time or interest to work on
> making it a reality yet. This is tracked on Phabricator:
> <https://phabricator.wikimedia.org/T275555>.
>
> As a reminder: if anyone is interested in working on this or other parts
> of the WMCS infrastructure, please talk to us!
>
> > * Or should I move from Toolforge to Cloud-VPS, so I can store my
> > intermediate files on Cinder?
>
> ~40G is in the range where Cinder/Cloud VPS might indeed be a better
> solution than NFS. While we don't currently have any official numbers on
> what is acceptable on NFS and what's not, for context the Toolforge
> project NFS cluster currently has about 8T of storage for about 3,000
> tools.
>
> > * Or should I store my intermediate files in some object storage?
> > Swift? Ceph? Something else?
>
> WMCS currently doesn't offer direct access to any object storage
> service. This is something we're likely to work on in the mid-term (the
> next 6-12 months is the last estimate I've heard). This project is
> currently stalled on some network design work:
> <https://phabricator.wikimedia.org/T289882>.
>
> > * Is access to Cinder and Swift subject to the same throttling as
> > NFS? Or will moving away from NFS increase the available I/O throughput?
>
> No, NFS is subject to completely separate throttling, and Ceph-backed
> storage methods (local VM disks and Cinder volumes) have a much higher
> amount of bandwidth available.
>
> > The final output of the QRank system is a single file, currently ~100M
> > in size but eventually growing to ~1G. When the cronjob has computed a
> > fresh version of its output, it deletes any old outputs from previous
> > runs (with the exception of the last two versions, which are kept
> > around internally for debugging). Typical users are other bots or
> > external pipelines that need a signal for prioritizing Wikidata
> > entities, not end users on the web. Users typically check for updates
> > with HTTP HEAD, or with conditional HTTP GET requests (using the
> > standard If-Modified-Since and If-None-Match headers). Currently, I’m
> > serving the output file with a custom-written HTTP server that runs as
> > a web service on Toolforge behind Toolforge’s nginx instance. My server
> > reads its content from the NFS-mounted home directory that’s getting
> > populated by the cronjob. Now, it’s not exactly a great idea to serve
> > large data files from NFS, but afaik it’s the only option available in
> > the Wikimedia cloud, at least for Toolforge users. Of course I might be
> > wrong.
> > * Should I move from Toolforge to Cloud-VPS, so I can serve my final
> > output files from Cinder instead of NFS?
> > * Or should I rather store my final output files in some object
> > storage? Swift? Ceph? Something else?
> > * Or is NFS just fine, even if the size of my data grows from 100M to
> > 1G+?
>
> When we offer object storage, yes, storing your files in it is a good
> idea. I think you should be fine with NFS for now (please don't quote me
> on that). Cloud VPS is an option too if you prefer it.
>
> > The cronjob also uses ~5G of temporary files in /tmp, which it deletes
> > towards the end of each run. The temp files are used for external
> > sorting, so all access is sequential. I’m not sure where these
> > temporary files currently sit when running on Toolforge Kubernetes.
> > Given their volume, I presume that the tmpfs of the Kubernetes nodes
> > will eventually run out of memory and then fall back to disk, but I
> > wouldn’t know how to find this out. _If_ the backing store for tmpfs
> > eventually ends up being mounted on NFS, it sounds wasteful for the
> > poor NFS server, especially since the files get deleted at job
> > completion. In that case, I’d love to save common resources by using a
> > local disk. (It doesn’t have to be an SSD; a spinning hard drive would
> > be fine, given the sequential access pattern.) But I’m not sure how to
> > set this up on Toolforge Kubernetes, and I couldn’t find docs on
> > Wikitech. Actually, this might be a micro-optimization, so perhaps not
> > worth the trouble. But then, I’d like to be gentle with the precious
> > shared resources in the Wikimedia cloud.
>
> Good question, I'm not sure either whether tmpfs for Kubernetes
> containers is on Ceph (SSDs) or in RAM. At least it's not on NFS.
>
> > Sorry that I couldn’t find the answers online. While searching, I came
> > across the following pointers:
> > – https://wikitech.wikimedia.org/wiki/Ceph: This page has a warning
> > that it’s probably “no longer true”. If the warning is correct, perhaps
> > the page could be deleted entirely? Or maybe it could link to the
> > current docs?
> > – https://wikitech.wikimedia.org/wiki/Swift: This sounds perfect, but
> > the page doesn’t mention how the files get populated, how the ACLs are
> > managed, or whether Wikimedia’s Swift cluster is even accessible to
> > external developers.
> > – https://wikitech.wikimedia.org/wiki/Media_storage: This seems
> > current (I guess?), but the page doesn’t mention if/how external
> > Toolforge/Cloud-VPS users may upload objects, or if this is just for
> > the current users.
>
> Those pages document the media storage systems used to store uploads for
> the production MediaWiki projects (Wikipedia and friends). Those are not
> accessible from WMCS and should be treated as completely separate
> systems, and any future WMCS (object) storage services will not use
> them.
>
> Documentation about the Ceph cluster powering Cloud VPS is on a separate
> Wikitech page:
> <https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Ceph>.
>
> --
> Taavi (User:Majavah)
> volunteer Toolforge/Cloud VPS admin
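
As promised above, here is roughly what I imagine a tool would do against an
S3-compatible endpoint such as Ceph’s RadosGW. This is only a sketch under
the assumption that such a service exists someday; the endpoint URL, bucket
name, file names, and credentials below are all made up.

    # Hypothetical sketch: pushing QRank's output to an S3-compatible object
    # store (e.g. Ceph RadosGW). No such WMCS service exists today; the
    # endpoint, bucket, file names, and credentials are placeholders.
    import boto3

    s3 = boto3.client(
        "s3",
        endpoint_url="https://object.example.wmcloud.org",  # made-up endpoint
        aws_access_key_id="EXAMPLE_KEY",                     # placeholder
        aws_secret_access_key="EXAMPLE_SECRET",              # placeholder
    )

    # Upload the latest output file; consumers could then fetch it over HTTPS.
    s3.upload_file("qrank.csv.gz", "qrank", "public/qrank.csv.gz")

    # List stored objects, e.g. to prune everything older than the last two runs.
    response = s3.list_objects_v2(Bucket="qrank", Prefix="public/")
    for obj in response.get("Contents", []):
        print(obj["Key"], obj["Size"], obj["LastModified"])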
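
Related, since the thread mentions HTTP HEAD and conditional GET
(If-Modified-Since / If-None-Match): my actual server is custom-written, but
for anyone curious, a minimal Flask sketch of that serving pattern looks
roughly like this (the file path is a placeholder):

    # Minimal sketch (not QRank's real server) of serving one large file with
    # conditional-request support. With conditional=True, Flask/Werkzeug set
    # Last-Modified and an ETag from the file and answer matching
    # If-Modified-Since / If-None-Match requests with "304 Not Modified".
    # HEAD requests are handled automatically for GET routes.
    from flask import Flask, send_file

    app = Flask(__name__)
    OUTPUT_PATH = "/data/project/qrank/qrank.csv.gz"  # placeholder path

    @app.route("/download/qrank.csv.gz")
    def download():
        return send_file(OUTPUT_PATH, conditional=True)

    if __name__ == "__main__":
        app.run()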
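
And on the temporary files used for external sorting: if a local scratch disk
ever becomes mountable inside the Toolforge Kubernetes containers, pointing
the sort’s temp files at it would be a small change. A hedged Python sketch,
with the environment variable name invented purely for illustration:

    # Sketch: keep external-sort scratch files in a configurable directory
    # instead of the default /tmp. QRANK_SCRATCH_DIR is an invented name;
    # whether such a local-disk mount exists on Toolforge Kubernetes is
    # exactly the open question above.
    import os
    import tempfile

    scratch_dir = os.environ.get("QRANK_SCRATCH_DIR", tempfile.gettempdir())

    with tempfile.TemporaryFile(dir=scratch_dir) as run_file:
        run_file.write(b"one sorted run of intermediate records\n")
        run_file.seek(0)
        print(run_file.read().decode())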
_______________________________________________
Cloud mailing list -- cloud@lists.wikimedia.org
List information: https://lists.wikimedia.org/postorius/lists/cloud.lists.wikimedia.org/