Hi,

Here are the raw logs of today's meeting. I'll write down an executive summary tomorrow.
Cheers

<loicd> The Ceph User Committee monthly meeting (first edition) is about to begin, in 2 minutes :-) The agenda is:
<loicd> * Meetups https://wiki.ceph.com/Community/Meetups
<loicd> * Goodies https://ceph.myshopify.com/collections/all
<loicd> * Documentation of the new Firefly features (tiering, erasure code) http://ceph.com/docs/master/dev/
<loicd> * Careers http://ceph.com/community/careers/
<loicd> and we have kraken with us for entertainment ;-)
<loicd> !norris CephUserCommittee
<kraken> CephUserCommittee is the only person on the planet that can kick you in the back of the face.
<loicd> here we go
<loicd> This is the first meeting of the Ceph User Committee http://ceph.com/community/the-ceph-user-committee-is-born/ . All are welcome and the proposed agenda announced on the ceph mailing list ( http://www.spinics.net/lists/ceph-users/msg08743.html ) is flexible.
<loicd> does someone want to add to the agenda ?
<loicd> (I'll time out questions after 1 minute if there is no answer)
<loicd> First topic : * Documentation of the new Firefly features (tiering, erasure code) http://ceph.com/docs/master/dev/
<loicd> The Ceph User Committee could be a source of inspiration for developers
<loicd> For erasure code here is what happens :
<loicd> (I'm writing based on my experience as a user and developer)
<loicd> I'm not sure where people land when asking themselves : "let's try erasure code"
<loicd> Google's second link for https://www.google.com/search?q=erasure+code+ceph
<loicd> is https://ceph.com/docs/master/dev/erasure-coded-pool/
<nhm> loicd: One of the questions that has come up in the past for both erasure coding and tiering is ease of use.
<janos> i don't imagine it's too beneficial for smaller users
<loicd> nhm: I think it's easy to use but ... I'm biased
<loicd> janos: how do you mean ?
<nhm> loicd: developers are always biased. :) None of us are good test subjects. We need fresh users that have never tried it.
<janos> it seems in general that it's an increase in CPU usage, without much benefit unless you are really needing space savings
<loicd> nhm: right !
<Vacum_> We will definitely try it as soon as Firefly is available
<janos> i'm not saying it's a bad feature or anything. just asking about use-case
<loicd> janos: it's an interesting perspective
<janos> it's a cool parity idea, no doubt
<loicd> janos: if you have 3 machines (with 1 osd on each), then erasure code saves you space the same way RAID5 would (only with different machines instead of a single machine with 3 disks).
<Vacum_> janos: as soon as you have a lot of seldomly or nearly never read data, it's worth the cpu/io <-> space tradeoff
<janos> Vacum_, good point
<nhm> janos: space saving is definitely the big plus. Arguably it may at some point be faster for the same availability for large object writes, but will almost always be slower for reads and small object writes.
<loicd> If CPU / performance is an issue, erasure code is indeed not a good choice
<janos> is the load entirely on the OSD hosts? or do Mon's get involved much?
<Vacum_> janos: together with the Tiered Storage and its rules, you automatically benefit from EC for "cold" objects
<janos> beyond finding things
<loicd> janos: the load is on the OSDs
<janos> cool
<loicd> Vacum_: right
<janos> Vacum_, yeah i was thinking that after your first comment
<janos> this little conversation definitely increased my interest level
<loicd> as a user, I don't see why I would not use erasure code as a second tier, because it reduces space without impacting performance
<Vacum_> regarding EC: are there any plans for "glued objects"? like adding a bunch of small objects together into one large blob, then EC that blob?
* loicd notes for the record : create a ticket to clarify this use case in https://ceph.com/docs/master/dev/erasure-coded-pool/
<loicd> Vacum_: not that I know. Erasure code is not fit for small objects as it stands. But the solution to deal with it is yet to be defined, I think.
<kraken> http://i.imgur.com/6E6n1.gif
<Vacum_> loicd: regarding the documentation and the "10 DCs" example. It does not show the tradeoff of this solution: to read one object, you have to read from 6 DCs!
<loicd> Vacum_: right
* loicd notes for the record: "10 DCs" example. It does not show the tradeoff of this solution: to read one object, you have to read from 6 DCs! https://ceph.com/docs/master/dev/erasure-coded-pool/
<Vacum_> loicd: honestly, a bit lazy right now :) are there examples for Tiered Storage based on "coldness" too?
<loicd> :-)
<Vacum_> a combined example might be great. EC with TS
<loicd> I think https://ceph.com/docs/master/dev/cache-pool/ is what you're looking for
<loicd> Vacum_: ^
<loicd> it's not about EC but if you think of the second tier as EC, it's the same really
<Vacum_> loicd: cache pools actually duplicate the data, right?
<loicd> yes
<Vacum_> oh
<Vacum_> When I read about Tiered Storage on Ceph and possible rules, I imagined it would move data from one pool to the other
<loicd> it does
<loicd> data is duplicated and moved when the cache pool is full
<loicd> or when it is dirty
<loicd> Users got confused by the syntax change between 0.78 (which is the current version) and what is in master (which is what they get when compiling from sources)
<Vacum_> mh
<loicd> Vacum_: your line of questioning suggests the documentation should clarify this ;-)
<Vacum_> :)
* loicd notes for the record : clarify the relationship between tiering and erasure code because at the moment it looks like tiering is exclusively for caching
<aarontc> (loicd: worth updating topic during CUCMM?)
<loicd> aarontc: yes, it would be worth having a bot to archive also ;-)
<loicd> The unit tests show working examples https://github.com/ceph/ceph/blob/master/src/test/erasure-code/test-erasure-code.sh but users don't find them most of the time
<loicd> Unless there is more about tiering / erasure code, I propose we move to the next topic
<aarontc> +1
<Vacum> +1
<loicd> fake /topic Tracker http://tracker.ceph.com/
<loicd> People seem to have problems registering (have to try again)
<loicd> How harmful is it ? Some people don't report problems because of that.
<aarontc> Would it be possible to allow anonymous bug reports? Would that be desirable if possible?
<Vacum> loicd: yep, when I call http://tracker.ceph.com/account/register I get an Internal error
<loicd> +1
<Vacum> loicd: I created a ticket for this already on the tracker, after you created the account for me :)
<janos> has there been any significant move toward production-supported CephFS?
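For those who asked what the combination discussed above looks like in practice, here is a minimal sketch of an erasure coded pool behind a replicated cache tier on a Firefly test cluster. The pool and profile names and the PG counts are made up for illustration, and the exact option names changed between 0.78 and master, so treat the syntax as approximate and check https://ceph.com/docs/master/dev/erasure-coded-pool/ and https://ceph.com/docs/master/dev/cache-pool/ before copying it:

    # erasure code profile: k=2 data chunks, m=1 coding chunk, spread over
    # different hosts (roughly the RAID5 analogy from the discussion)
    ceph osd erasure-code-profile set ecprofile k=2 m=1 ruleset-failure-domain=host

    # cold pool using that profile, plus a small replicated pool used as cache
    ceph osd pool create ecpool 128 128 erasure ecprofile
    ceph osd pool create cachepool 128 128

    # put the replicated pool in front of the erasure coded pool as a writeback tier
    ceph osd tier add ecpool cachepool
    ceph osd tier cache-mode cachepool writeback
    ceph osd tier set-overlay ecpool cachepool

    # rough flush/evict threshold for the cache (see the cache-pool doc for the other knobs)
    ceph osd pool set cachepool target_max_bytes 10000000000

With the overlay in place, clients keep talking to ecpool as usual: the cache pool absorbs hot objects and the cold ones end up in the erasure coded tier, which is the "EC for cold data" tradeoff mentioned above.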
<loicd> :-)
* loicd adds the topic to the agenda
<loicd> I don't think there is more we can do regarding the tracker, except raise it for people to work on. I'm not sure who's tasked with this though.
<Vacum> hasn't been touched yet, my ticket
<loicd> houkouonchi-work: do you know who's in charge of bug fixing the tracker ?
<aarontc> loicd: patrick is my guess
<loicd> Vacum: could you paste the ticket URL ?
<Vacum> http://tracker.ceph.com/issues/7609
<loicd> scuttlemonkey: are you cursed with this ? ;-)
* loicd notes for the record : +3 on http://tracker.ceph.com/issues/7609 and figure out who needs help with this
<emkt> yeah also would like to know the status of a production ready cephfs :-)
<loicd> ok, let's move to this topic then :-)
* aarontc would also like to know about production CephFS
<loicd> fake /topic has there been any significant move toward production-supported CephFS?
* Fruit aol
<scuttlemonkey> wha?
<janos> run away!
<loicd> scuttlemonkey: :-)
<scuttlemonkey> haha
<loicd> my understanding, from the last CDS, is that CephFS is not production ready and no promise has been made
<scuttlemonkey> no specific promise beyond "this year"
<loicd> when I look at the work done, I'm truly impressed
<janos> is there a solid list of show-stoppers to make it prod-ready?
<loicd> but I have *no clue* how much is left to be done
<scuttlemonkey> I think the rough estimate that was given (napkin sketch) was sometime in Q3
<scuttlemonkey> but that's predicated on a lot of other things happening
<scuttlemonkey> from a stability standpoint we're actually looking pretty good (at last observation)
<Fruit> an fsck tool has yet to be developed I think?
<scuttlemonkey> ^
<loicd> out of curiosity, what use case do you have that needs CephFS ?
<scuttlemonkey> loicd: there are a surprising number of them
<scuttlemonkey> several folks want to get busy using the hdfs->cephfs shim
<loicd> are they listed somewhere ?
<scuttlemonkey> hmmm
<scuttlemonkey> don't think anyone has aggregated all the use cases
<gregsfortytwo> fyi a real fsck is unlikely to make it before production, but there are a lot of manual repair tools we think are blocking it
<aarontc> I want it to store filese!
<gregsfortytwo> in addition to way more testing
<aarontc> err, files!
* loicd notes for the record : open a wiki page for people to list their CephFS use case
<Vacum> aarontc: only store? :)
<aarontc> Vacum: touche. Also I'd like to retrieve them, and list them, and modify them.
<scuttlemonkey> aarontc: awww, and here I thought you'd just add an 's' ....fileses precious!
<emkt> i am looking to use cephfs for hadoop and files for users as well as archiving
<janos> haha
<Fruit> web content for existing non-ceph-aware applications. that's thousands just there
<loicd> ahah
<aarontc> scuttlemonkey: I did consider it :)
<loicd> Fruit: so the benefit would be to have a self-repairable posix compliant distributed FS in this case, right ? So you don't have to worry about backups ? Or so you don't have to worry about scale out issues ?
<aarontc> CephFS is (for my uses) the best API to utilize Ceph - all my existing applications support filesystems
<janos> for me it's largely scale out issues
<aarontc> for me it's capacity... RAID arrays can only get so large
<Fruit> loicd: the benefit would be a filesystem without a SPOF
<Vacum> then rbd could already be sufficient?
<scuttlemonkey> as far as use cases go I think the most popular are: hadoop, reexporting as cifs/nfs, backing existing tools that use FS, distributing images to be local for hypervisor nodes in openstack, SAN/NAS stuff....probably a few others I'm forgetting
<janos> actually both you mentioned. backups + scale out. availability
<aarontc> (and precious fileses)
* loicd notes for the record seed the CephFS use case list with the content of the conversation
<scuttlemonkey> always those :)
<loicd> scuttlemonkey: hadoop, you mean HDFS compatible access ?
<scuttlemonkey> right
<scuttlemonkey> hdfs replacement
<loicd> Using custom rados classes to offload processing to the OSD is not very popular I assume.
<aarontc> loicd: I would be interested in that, actually
<loicd> http://noahdesu.github.io/2013/02/21/writing-cls-lua-handlers.html etc.
<scuttlemonkey> loicd: yeah, I think it's largely too early for the RADOS classes
<scuttlemonkey> people haven't quite gotten to the point where they can realize the hotness :)
<aarontc> (if it makes sense - reprocessing images in batches, or assembling frames into videos)
<loicd> right
<Vacum> eh, that all only makes sense if one object is stored in one piece on one osd, right?
<Vacum> as soon as it gets chunked: no dice
<Vacum> ie image/video processing
<aarontc> (Are files in CephFS stored as one file per object?)
<nhm> scuttlemonkey: don't forget all the HPC folks that want CephFS. ;)
<loicd> Vacum: true. You get control over that though.
<Vacum> (are they with radosgw?)
<nhm> supercomputer scratch/project/home storage
<aarontc> Vacum: they are not with radosgw, IIRC
<scuttlemonkey> nhm: true
<loicd> nhm: why do they specifically ?
<scuttlemonkey> although I don't understand the workloads there...so saying "HPC applications" isn't helpful unless I can explain it
<aarontc> scuttlemonkey: I have a buddy who could get more details for us if that's something worth pursuing
<nhm> loicd: primarily as a potential alternative if for some reason they don't want to use Lustre.
<loicd> ok
* loicd notes that we're past half of the meeting, 27 minutes left
<Vacum> IMO the currently really interesting price points $/GB all tend to already have an extreme HDD / CPU proportion
<aarontc> loicd: is it beer time yet?
<scuttlemonkey> aarontc: might be interesting...even more so if he's already playing with Ceph and can give us examples of where it would be cool
<nhm> scuttlemonkey: If you want I can go through them with you. I used to help write our storage RFPs at MSI.
<Vacum> so adding more CPU load to the OSD machines could be a problem.
* loicd confesses to having a glass of wine already ;-)
<aarontc> scuttlemonkey: I'll try to get some details for you
<scuttlemonkey> nhm: probably not a bad idea to start getting my head around cephFS stuff
<nhm> aarontc: Where does your buddy work btw? I used to be at the Minnesota Supercomputing Institute. Maybe I know him. :)
<loicd> nhm: I don't know what RFPs means, could you expand ?
<scuttlemonkey> gave a talk last night to about 30 people here in Ann Arbor
<scuttlemonkey> the most interest I had was from students working on filesystem stuff
<aarontc> nhm: he's a postgrad student at a university in the UK somewhere... details escape me at the moment
<nhm> loicd: Request For Proposal, ie a request for vendors to bid on a system with defined requirements
<loicd> thanks ;-)
<loicd> anything else on the CephFS topic ?
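On aarontc's question about how CephFS files map to objects: a file is striped over many RADOS objects according to its layout (object size, stripe unit, stripe count, target pool), which is the chunking Vacum refers to. Recent kernel and FUSE clients expose the layout through virtual xattrs; the exact attribute names and the minimum client version are an assumption on my part, and the paths below are made up, so verify against the documentation before relying on it:

    # what layout does this file have? (assumed vxattr name)
    getfattr -n ceph.file.layout /mnt/cephfs/frames/frame-0001.png

    # files created under this directory afterwards: stripe across 4 objects at a time
    setfattr -n ceph.dir.layout.stripe_count -v 4 /mnt/cephfs/frames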
<loicd> (1 minute timeout on this question)
<janos> not from me
<aarontc> just that it's awesome and I can't wait to use it :)
<loicd> aarontc: haha
<Serbitar> i really want cephfs for hpc and for general university storage
<emkt> not sure if it is with cephfs or in general.. asynchronous replication
<kraken> ≖_≖
<loicd> Serbitar: could you explain your use case ?
<Serbitar> really as an alternative to lustre, which while popular is a vastly inferior design
<loicd> ok :-)
<Serbitar> we do a lot of different hpc jobs as it is both a learning and research resource
<emkt> Serbitar: agree
<nhm> Serbitar: What kind of applications are you guys running?
<loicd> what kind of hpc job ? I'm curious.
<loicd> :-)
<Serbitar> unfortunately i only maintain the system, my colleague is more familiar with the particular jobs
<Serbitar> we do quite a lot of castep
<loicd> castep ?
<Serbitar> http://www.castep.org/
* loicd looking & learning
<Serbitar> though that is probably not really storage dependent
<nhm> Serbitar: ah, not too familiar with that one. No idea what its IO looks like.
<loicd> or not learning ... this is too complicated for me ;-)
<Serbitar> its probably all ram based
<Serbitar> but there was another job that someone started using that made our storage cry
* loicd notes for the record add http://www.castep.org/ to the use case list
<Serbitar> can't remember what it was
<Serbitar> one of the worst is gaussian
<nhm> Serbitar: I was just going to say gaussian!
<Serbitar> but it wasn't that, this particular task the user was trying to be conservative with his quota
<Serbitar> so he had the job write its data down to disk, then he gzipped it
<nhm> Serbitar: probably just lots of small random reads/writes.
<Vacum> that would be something for the OSD classes: gzip on demand
<nhm> Serbitar: especially if he was doing direct IO.
<Vacum> ie identifying if it's worth it, if yes, store gzipped
<loicd> Vacum: cls_gzip :-)
<Serbitar> Vacum: very difficult if you are doing lots of random io
<aarontc> loicd: I'd like to see more documentation with regard to decoding the log messages from Ceph daemons, and "better" explanations of all the configuration parameters
<nhm> Vacum: I think so far we just rely on the underlying OSD filesystem to handle any compression.
<Serbitar> possibly you could use cache tiering to say "this slow data can be compressed now"
<Serbitar> btrfs does zlib and lzo compression
* loicd notes for the record more documentation with regard to decoding the log messages from Ceph daemons
<Vacum> nhm: fair enough!
<Serbitar> but my other use case would be to replace our netapp
* loicd notes "better" explanations of all the configuration parameters (config_opts.h)
<Vacum> loicd: ah, regarding logging. the mon node frequently FLOODs its log with always the same message
<loicd> Vacum: which one ?
<Vacum> loicd: can that be made so it will back off after like 20 in the same second and then only count, like syslog?
<nhm> Vacum: afaik you can disable all logging if you want.
<Vacum> nhm: I do not want to disable it. but yesterday we had the same message coming in like 1000 times per second
* loicd notes : make it so the logger will back off after like 20 in the same second and then only count, like syslog?
<nhm> Vacum: that's a good idea probably
<emkt> btrfs random io is not at par yet although it supports compression
<Vacum> loicd: I'm at home right now ./
<Vacum> loicd: so no access to the logs
<loicd> Vacum: I get what you're saying though ;-)
<loicd> I guess that means we're out of the CephFS topic
<janos> yeah
<emkt> asynchronous replication :-)
<loicd> fake /topic misc wishlist
<jerker> for me now speed is more important than compression. ceph is already cheaper than the alternatives, measured in hardware. So speed and stability. :) SSD cache is very good. Is it stable for production?
<loicd> emkt: like what radosgw does ?
<Vacum> loicd: "7f432467e700 1 mon.csdeveubs-u01mon01@0(leader).paxos(paxos active c 697530..698142) is_readable now=2014-04-02 17:24:28.278147 lease_expire=0.000000 has v0 lc 698142"
<Vacum> that one :)
<nhm> emkt: indeed, that would be nice. :)
<emkt> yeah for cephfs
<Vacum> +1 for asynchronous replication on _rados_ level. perhaps based on crushmap rules?
<loicd> jerker: firefly will have tiering and provide SSD cache. emperor (the current version) does not.
<nhm> jerker: firefly will be the first release with the tiering layer, but you can use SSDs locally on the OSD for Ceph journals and/or bcache/flashcache
<emkt> as in hdfs you have async replication whereas ceph has to wait for both replicas to be acked
<emkt> both or more based on settings*
<jerker> nhm: yes i am for journals, have not tried bcache/flashcache yet. Am using 8 GB flash hybrid drives though.
* loicd notes : asynchronous replication on _rados_ level
<Fruit> random wishlist item: bandwidth reservations (probably really hard)
<Fruit> bandwidth/iops
<jerker> loicd: i am looking forward to it
<loicd> Vacum: that's unlikely to happen soon though, it's complicated
<Vacum> loicd: I imagine :)
<Vacum> wishlist item: new release of rados-java :D
<kraken> AbstractHibernateSchedulerExtractionCommand
<loicd> Fruit: could you expand on bandwidth reservations (probably really hard) ?
<jerker> loicd: can one not do something ugly with normal iptables at the VM host?
<jerker> loicd: for bandwidth limitation
<Serbitar> do we know of anyone using cephfs as a backstore for samba?
<aarontc> Serbitar: I was doing that until my MDS exploded
<Fruit> loicd: it would be nice to guarantee that pools have a certain amount of iops/throughput available even if other pools are hammering the storage system
<jerker> Serbitar: i have tried, then my cephfs crashed. and then i reinstalled the cluster.
<loicd> jerker: I suspect not. But there are various levels of throttling you can tweak, depending on what you're after.
<Serbitar> these two comments make me sad panda
<nhm> we've had some folks interested in trying Samba on top of RBD
* loicd notes : guarantee that pools have a certain amount of iops/throughput available even if other pools are hammering the storage system
<nhm> But the samba/cephfs work is definitely more interesting.
<jerker> nhm: samba/netatalk on top of RBD is stable, a bit slow though, need more IOPS
<aarontc> loicd: +1 on that, I would like to see better insight into what is causing the ceph load, and ways to throttle certain clients/tasks/etc
<janos> i'd like solid samba. i export to windows machines as shares
<nhm> jerker: was that with RBD cache?
<jerker> nhm: RBD cache in KVM
<loicd> janos: solid as in fast ? or do you find samba on top of RBD + fs fragile for some reason ?
<emkt> loic: asynchronous replication for cephfs too.. as i am interested in using it with hadoop
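On the monitor log flood Vacum mentions: as far as I know there is no syslog style "last message repeated N times" suppression today (hence the note for the record), but the message pasted above is a paxos line emitted at debug level 1, so it can be silenced per subsystem without disabling monitor logging altogether. A sketch, assuming a monitor named mon.a:

    # permanently, in ceph.conf on the monitor host
    [mon]
        debug paxos = 0/5    # 0 to the log file, keep 5 in memory for crash dumps

    # or at runtime, without restarting the monitor
    ceph tell mon.a injectargs '--debug-paxos 0/5'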
<janos> it feels clunky to me. it works, but would be nice to be more direct
<janos> i do that combo right now
<mjevans> So in other news... ceph 0.72 with debian's bleeding edge 3.14-rc7 kernel fails with btrfs corruption even when a 'very small' ceph cluster has only 3 guest VMs running the phoronix-test-suite disk test on it.
<nhm> jerker: Ok. I'd be curious what the IO patterns for samba look like.
<jerker> nhm: I get 100% IO-utilization when running Time Machine from a couple of Mac clients to netatalk/Ext4/SL6/KVM/RBD
<loicd> emkt: I'm not sure if that makes sense to me ? Assuming async replication is available at the rados level, what would it mean for cephfs to have async replication ?
<jerker> nhm: but not so much usable bandwidth as I would like
<nhm> jerker: was it primarily IOPS or also slow throughput?
<loicd> mjevans: is there a ticket for that ?
<mjevans> loicd: I have no idea, but I'm already too behind schedule to even look up where to look up such a ticket... so XFS it is.
<emkt> how does the hadoop-cephfs plugin communicate with ceph now? is it using cephfs or talking to rados directly .. as i am playing around these days with this plugin
* loicd notes there's only 7 minutes left, will pause in 2 minutes for conclusions
<nhm> mjevans: probably for the best right now. RBD with BTRFS backed OSDs will fragment extremely quickly with small writes.
<mjevans> loicd: also, supposedly, 3.15 has more btrfs fixes
<jerker> nhm: I have not measured very much, but when reading 100 Mbit/s from the clients over netatalk I get about 100% IO-utilization from the virtual machine. But the ceph cluster can handle a bit more, gigabit with two nodes, four osds. Not a large cluster :-)
<loicd> mjevans: ok :-)
<nhm> mjevans: and the btrfs defrag tools don't work well with snapshots. Josef said they'll probably have it fixed some time this summer.
<Vacum> loicd: wishlist :) : "hierarchical near" backfilling, based on crush location. ie 4 replicas, 2 in each rack. instead of backfilling from the primary: backfill from an OSD in the same rack
<loicd> Vacum: +1 !
<emkt> Vacum +1
* loicd notes : "hierarchical near" backfilling, based on crush location. ie 4 replicas, 2 in each rack. instead of backfilling from the primary: backfill from an OSD in the same rack
<tracphil> Is anyone using Ceph with Cloudstack and XenServer?
<loicd> wido: is
<loicd> tracphil: ^
<janos> Vacum, i like that idea
<loicd> a few seconds left for misc / wishlist before we move to the conclusion
<nhm> Vacum: that ties in to the read-from-near-replica requests we've gotten periodically.
<Serbitar> similar to Vacum's idea: is it possible to monitor hot blocks and keep copies on more hosts for performance?
<Vacum> nhm: near replica reads are available with firefly
<nhm> Vacum: ha, shows how up to date I am. ;)
<loicd> fake /topic first Ceph User Committee meeting conclusion
<loicd> that did not go as I imagined it would *at all*
<loicd> it was much better ;-)
<janos> haha
<janos> i liked it!
<emkt> :-)
<janos> thank you for managing it
<Vacum> loicd: a bit on the "feature" side, but fine :)
<loicd> I'll skip the t-shirt / meetup thing which is not interesting in this context
<emkt> really useful and thanks loicd for arranging it
<Vacum> loicd: any plans on repeating this? frequency?
<loicd> let's surf on what triggers discussion. I'll work on writing a summary based on the log and post it tomorrow on the ceph user list.
<Serbitar> when is the next one
<loicd> we'll do it monthly
<Vacum> sounds great!
<loicd> so... that would be may 3rd ?
* jerker was not really aware i was in the middle of the meeting :)
<Vacum> thats a saturday
<loicd> saturday
<loicd> may 2nd better ?
<Vacum> IMO yes
<loicd> jerker: ahaha
<loicd> ok, let's say may 2nd then
<loicd> anything we should do differently next time ?
<loicd> ok then, I guess we're adjourned. Thanks a thousand times everyone !
<Vacum> thank you!
<aarontc> loicd: I think having a topic would help people coming and going be aware that there is a meeting going on and they are welcome to participate :)
<janos> thank you loicd
<emkt> let it be adhoc, random and chaotic like this.. so that it will be creative like this one :)
<Fruit> loicd: thanks!
<loicd> cool
<jerker> Thanks!
<Vacum> loicd: perhaps announce the next one in the topic a week before?
<loicd> Vacum: +1
* loicd notes announce the next one in the topic a week before?
<Vacum> Well, that way I won't forget it myself :)
<aarontc> +1

On 03/04/2014 18:09, Loic Dachary wrote:
> Hi,
>
> Here is the agenda:
>
> * Meetups https://wiki.ceph.com/Community/Meetups
> * Goodies https://ceph.myshopify.com/collections/all
> * Documentation of the new Firefly feature (tiering, erasure code)
>   http://ceph.com/docs/master/dev/
> * Careers http://ceph.com/community/careers/
>
> Location : irc.oftc.net#ceph
>
> Thursday April 3rd, 2014
>
> 18:00-19:00 UTC
> 14:00-15:00 US-Eastern
> 12:00-13:00 US-Mountain
> 11:00-12:00 US-Pacific
> 20:00-21:00 Europe-Central
>
> Cheers
>
> On 29/03/2014 17:57, Loic Dachary wrote:
>> Hi Ceph,
>>
>> The Ceph User Committee monthly meeting will be on irc.oftc.net#ceph
>> The minutes will be compiled and sent to the ceph-devel mailing list.
>>
>> Thursday April 3rd, 2014
>>
>> 18:00-19:00 UTC
>> 14:00-15:00 US-Eastern
>> 12:00-13:00 US-Mountain
>> 11:00-12:00 US-Pacific
>> 20:00-21:00 Europe-Central
>>
>> The agenda is not yet defined, please send your suggestions and ideas.
>>
>> Here is what I would like to discuss:
>>
>> * Documentation of the new Firefly feature (tiering, erasure code)
>>
>> Cheers
>>

-- 
Loïc Dachary, Artisan Logiciel Libre