Ran the TestDFSIO write tests:

- Increasing the journal size did not make any difference for me ... I guess the
  number I had set was already sufficient. For the rest of the tests I kept it at
  a generous 10 GB.
- Separating the journal from the data disk did make a difference, as expected.
  Unfortunately I do not currently have access to SSDs, so for now I used a
  separate spinning disk for the journal of each data disk. (A rough sketch of
  the ceph.conf settings involved is below.)
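For reference, here is a sketch of the ceph.conf bits I'm referring to. The
option names are the standard osd journal settings, but the data and journal
paths are just placeholders for my setup, so treat this as a sketch rather
than a recommendation:

    [osd]
    ; journal size in MB -- 10 GB, to match the tests below
    osd journal size = 10240

    [osd.0]
    osd data = /var/lib/ceph/osd/ceph-0
    ; journal on a partition of a separate physical disk
    ; (/dev/sdh1 is just a placeholder)
    osd journal = /dev/sdh1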
HDFS write numbers (7 disks/data node):
Average execution time: 236
Best execution time:    219
Worst execution time:   254

Ceph write numbers, journal + data on the same disk (7 disks):
Average execution time: 494
Best execution time:    468
Worst execution time:   524

So ceph was about 2x slower than HDFS for the average case when journal & data
were on the same disk. Now, separating the journal from the data disk:

HDFS write numbers (3 disks/data node):
Average execution time: 466
Best execution time:    426
Worst execution time:   508

Ceph write numbers (3 data disks/data node + 3 journal disks/data node):
Average execution time: 610
Best execution time:    593
Worst execution time:   635

So ceph was about 1.3x slower than HDFS for the average case when journal &
data are separated -- a 70% improvement over the case where journal + data
share a disk, but still a bit off from the HDFS performance. Not knowing of
other ceph knobs to play with, I'll have to leave it at that. I'll see if I
can get some system profiling done to narrow down where we're spending the
time.

thanks y'all

On Wed, Jul 10, 2013 at 11:35 AM, Noah Watkins <noah.watk...@inktank.com> wrote:
> On Wed, Jul 10, 2013 at 9:17 AM, ker can <kerca...@gmail.com> wrote:
> >
> > Seems like a good read ahead value that the ceph hadoop client can use
> > as a default !
>
> Great, I'll add this tunable to the list of changes to be pushed into the
> next release.
>
> > I'll look at the DFS write tests later today .... any tuning suggestions
> > you can think of there? I was thinking of trying out increasing the
> > journal size and separating out the journaling to a separate disk.
> > Anything else ?
>
> I expect that you will see a lot of improvement by moving the journal
> to a separate physical device, so I would start there.
>
> As for journal size tuning, I'm not completely sure, but there may be
> an opportunity to optimize for Hadoop workloads. I think ceph.com/docs
> has some general guidelines. Maybe someone more knowledgeable than me
> can chime in on the trade-offs.
>
> > For the hdfs dfsio read test:
> >
> > Average execution time: 258
> > Best execution time:    149
> > Worst execution time:   361
> >
> > For ceph with the default read ahead setting:
> >
> > Average execution time: 316
> > Best execution time:    296
> > Worst execution time:   358
> >
> > For ceph with read ahead setting = 4193404:
> >
> > Average execution time: 285
> > Best execution time:    277
> > Worst execution time:   294
>
> This is looking pretty good. I'd really like to work on that best
> execution time for Ceph. I wonder if there are any Hadoop profiling
> tools... narrowing down where time is being taken up would be very
> useful.
>
> Thanks!
> Noah
>
> > I didn't set max bytes ... I guess the default is zero, which means no
> > max ? I tried increasing the readahead max periods to 8 .. didn't look
> > like a good change.
> >
> > thanks !
> >
> > On Wed, Jul 10, 2013 at 10:56 AM, Noah Watkins <noah.watk...@inktank.com>
> > wrote:
> >>
> >> Hey KC,
> >>
> >> I wanted to follow up on this, but ran out of time yesterday. To set
> >> the options in ceph.conf you can do something like
> >>
> >> [client]
> >> readahead min = blah
> >> readahead max bytes = blah
> >> readahead max periods = blah
> >>
> >> then just make sure that your client is pointing to a ceph.conf with
> >> these settings.
> >>
> >> On Tue, Jul 9, 2013 at 4:32 PM, Noah Watkins <noah.watk...@inktank.com>
> >> wrote:
> >> > Yes, the libcephfs client.
> >> > You should be able to adjust the settings without changing any code,
> >> > either by setting the config options in ceph.conf or by using the
> >> > "ceph.conf.options" setting in Hadoop's core-site.xml.
> >> >
> >> > On Tue, Jul 9, 2013 at 4:26 PM, ker can <kerca...@gmail.com> wrote:
> >> >> Makes sense. I can try playing around with these settings .... when
> >> >> you're saying client, would this be libcephfs.so ?
> >> >>
> >> >> On Tue, Jul 9, 2013 at 5:35 PM, Noah Watkins <noah.watk...@inktank.com>
> >> >> wrote:
> >> >>>
> >> >>> Greg pointed out the read-ahead client options. I would suggest
> >> >>> fiddling with these settings. If things improve, we can put automatic
> >> >>> configuration of these settings into the Hadoop client itself. At the
> >> >>> very least, we should be able to see if it is the read-ahead that is
> >> >>> causing performance problems.
> >> >>>
> >> >>> OPTION(client_readahead_min, OPT_LONGLONG, 128*1024)  // readahead at
> >> >>> _least_ this much.
> >> >>> OPTION(client_readahead_max_bytes, OPT_LONGLONG, 0)  // 8 * 1024*1024
> >> >>> OPTION(client_readahead_max_periods, OPT_LONGLONG, 4)  // as multiple
> >> >>> of file layout period (object size * num stripes)
> >> >>>
> >> >>> -Noah
> >> >>>
> >> >>> On Tue, Jul 9, 2013 at 3:27 PM, Noah Watkins <noah.watk...@inktank.com>
> >> >>> wrote:
> >> >>> >> Is the JNI interface still an issue or have we moved past that ?
> >> >>> >
> >> >>> > We haven't done much performance tuning with Hadoop, but I suspect
> >> >>> > that the JNI interface is not a bottleneck.
> >> >>> >
> >> >>> > My very first thought about what might be causing slow read
> >> >>> > performance is the read-ahead settings we use vs Hadoop. Hadoop
> >> >>> > should be performing big, efficient, block-size reads and caching
> >> >>> > these in each map task. However, I think we are probably doing lots
> >> >>> > of small reads on demand. That would certainly hurt performance.
> >> >>> >
> >> >>> > In fact, in CephInputStream.java I see we are doing buffer-sized
> >> >>> > reads. Which, at least in my tree, turn out to be 4096 bytes :)
> >> >>> >
> >> >>> > So, there are two issues now. First, the C-Java barrier is being
> >> >>> > crossed a lot (16K times for a 64MB block). That's probably not a
> >> >>> > huge overhead, but it might be something. The second is read-ahead.
> >> >>> > I'm not sure how much read-ahead the libcephfs client is performing,
> >> >>> > but the more round trips it's doing, the more overhead we would
> >> >>> > incur.
> >> >>> >
> >> >>> >> thanks !
> >> >>> >>
> >> >>> >> On Tue, Jul 9, 2013 at 3:01 PM, ker can <kerca...@gmail.com> wrote:
> >> >>> >>>
> >> >>> >>> For this particular test I turned off replication for both hdfs
> >> >>> >>> and ceph, so there is just one copy of the data lying around.
> >> >>> >>>
> >> >>> >>> hadoop@vega7250:~$ ceph osd dump | grep rep
> >> >>> >>> pool 0 'data' rep size 1 min_size 1 crush_ruleset 0 object_hash
> >> >>> >>>   rjenkins pg_num 960 pgp_num 960 last_change 26 owner 0
> >> >>> >>>   crash_replay_interval 45
> >> >>> >>> pool 1 'metadata' rep size 2 min_size 1 crush_ruleset 1 object_hash
> >> >>> >>>   rjenkins pg_num 960 pgp_num 960 last_change 1 owner 0
> >> >>> >>> pool 2 'rbd' rep size 2 min_size 1 crush_ruleset 2 object_hash
> >> >>> >>>   rjenkins pg_num 960 pgp_num 960 last_change 1 owner 0
> >> >>> >>>
> >> >>> >>> From hdfs-site.xml:
> >> >>> >>>
> >> >>> >>> <property>
> >> >>> >>>   <name>dfs.replication</name>
> >> >>> >>>   <value>1</value>
> >> >>> >>> </property>
> >> >>> >>>
> >> >>> >>> On Tue, Jul 9, 2013 at 2:44 PM, Noah Watkins <noah.watk...@inktank.com>
> >> >>> >>> wrote:
> >> >>> >>>>
> >> >>> >>>> On Tue, Jul 9, 2013 at 12:35 PM, ker can <kerca...@gmail.com> wrote:
> >> >>> >>>> > hi Noah,
> >> >>> >>>> >
> >> >>> >>>> > while we're still on the hadoop topic ... I was also trying out
> >> >>> >>>> > the TestDFSIO tests, ceph v/s hadoop. The read tests on ceph take
> >> >>> >>>> > about 1.5x the hdfs time. The write tests are worse ... about
> >> >>> >>>> > 2.5x the time on hdfs, but I guess we have additional journaling
> >> >>> >>>> > overheads for the writes on ceph. But there should be no such
> >> >>> >>>> > overheads for the reads ?
> >> >>> >>>>
> >> >>> >>>> Out of the box Hadoop will keep 3 copies, and Ceph 2, so it could
> >> >>> >>>> be the case that reads are slower because there is less opportunity
> >> >>> >>>> for scheduling local reads. You can create a new pool with
> >> >>> >>>> replication=3 and test this out (documentation on how to do this is
> >> >>> >>>> at http://ceph.com/docs/wip-hadoop-doc/cephfs/hadoop/).
> >> >>> >>>>
> >> >>> >>>> As for writes, Hadoop will write 2 remote blocks and 1 local block,
> >> >>> >>>> however Ceph will write all copies remotely, so there is some
> >> >>> >>>> overhead for the extra remote object write (compared to Hadoop),
> >> >>> >>>> but I wouldn't have expected 2.5x. It might be useful to run dd or
> >> >>> >>>> something like that on Ceph to see if the numbers make sense, to
> >> >>> >>>> rule out Hadoop as the bottleneck.
> >> >>> >>>>
> >> >>> >>>> -Noah
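For anyone else who wants to repeat the read-ahead experiment discussed above:
per Noah's notes, the setting can go either in ceph.conf on the Hadoop client
nodes or in core-site.xml via "ceph.conf.options". A rough sketch of both
follows; the full option names are taken from the OPTION() lines quoted above,
and the comma-separated key=value syntax for core-site.xml is an assumption on
my part, so double-check it against the wip-hadoop docs:

    ; ceph.conf on the Hadoop client nodes -- full option names, per the
    ; OPTION() lines above; Noah's shorter "readahead max bytes" spelling
    ; may also work
    [client]
    client readahead max bytes = 4193404
    client readahead max periods = 4

    <!-- or in core-site.xml, using the property Noah mentions -->
    <property>
      <name>ceph.conf.options</name>
      <value>client_readahead_max_bytes=4193404,client_readahead_max_periods=4</value>
    </property>

And for Noah's replication=3 suggestion, something like
"ceph osd pool set data size 3" should bump the data pool's replica count --
again, worth checking against the docs for the release you're running.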
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com