Ran the DFS IO write tests:

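For reference, these runs used the stock TestDFSIO benchmark that ships
with Hadoop. A typical sequence looks roughly like the following - the jar
name, file count and file size are just placeholders, not necessarily the
exact values from these runs:

  # write test, matching read test, then cleanup
  hadoop jar $HADOOP_HOME/hadoop-test-*.jar TestDFSIO -write -nrFiles 16 -fileSize 1000
  hadoop jar $HADOOP_HOME/hadoop-test-*.jar TestDFSIO -read  -nrFiles 16 -fileSize 1000
  hadoop jar $HADOOP_HOME/hadoop-test-*.jar TestDFSIO -clean
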
- Increasing the journal size did not make any difference for me ... I
guess the value I had already set was sufficient. For the rest of the tests
I kept it at a generous 10 GB.
- Separating the journal out from the data disk did make a difference, as
expected. Unfortunately I currently do not have access to SSDs, so for now
each data disk got its journal on a separate spinning disk (a rough sketch
of the ceph.conf settings is below).
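
For reference, the journal layout was along these lines in ceph.conf - the
paths here are placeholders for wherever the dedicated journal disks are
mounted, not my exact layout:

[osd]
    # journal size in MB (10 GB)
    osd journal size = 10240

[osd.0]
    osd data = /data/osd.0
    # journal on a filesystem backed by a separate spindle
    osd journal = /journal-disk-0/osd.0/journal

[osd.1]
    osd data = /data/osd.1
    osd journal = /journal-disk-1/osd.1/journal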

HDFS write numbers (7 disks/data node; all times in seconds):

Average execution time: 236
Best execution time:    219
Worst execution time:   254

Ceph write numbers with journal + data on the same disk (7 disks):

Average execution time: 494
Best execution time:    468
Worst execution time:   524

So Ceph was about 2x slower for the average case when journal & data were
on the same disk.
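
As an aside, Noah's earlier dd suggestion is probably the right sanity check
on the raw write path here - something along these lines against the CephFS
mount and against one of the plain data disks, to rule out Hadoop itself as
the bottleneck (the paths are placeholders):

  # 1 GB of 64 MB writes against the CephFS mount
  dd if=/dev/zero of=/mnt/cephfs/dd-test bs=64M count=16 conv=fsync

  # same thing against a local data disk for comparison
  dd if=/dev/zero of=/data1/dd-test bs=64M count=16 conv=fsync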

Now separating out the journal from the data disk ...

HDFS write numbers (3 disks/data node):

Average execution time: 466
Best execution time:    426
Worst execution time:   508

Ceph write numbers (3 data disks + 3 journal disks/data node):

Average execution time: 610
Best execution time:    593
Worst execution time:   635

So Ceph was about 1.3x slower for the average case when journal & data are
separated - the overhead relative to HDFS drops from roughly 110% to about
30%, a big improvement over the journal-and-data-on-the-same-disk case, but
still a bit off from the HDFS numbers. Not knowing of other Ceph knobs I
can play with, I'll have to leave it at that. I'll see if I can get some
system profiling done to narrow down where we're spending time.
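
Probably just the basics to start with - iostat/vmstat on the data nodes
while a TestDFSIO write run is in flight, to see whether the disks, CPU or
network are the limiting factor (the sampling interval is arbitrary, and
sar assumes the sysstat tools are installed):

  # per-device utilization and queue sizes, 5-second samples
  iostat -xm 5

  # CPU / memory / swap pressure
  vmstat 5

  # per-NIC throughput
  sar -n DEV 5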

thanks y'all

On Wed, Jul 10, 2013 at 11:35 AM, Noah Watkins <noah.watk...@inktank.com> wrote:

> On Wed, Jul 10, 2013 at 9:17 AM, ker can <kerca...@gmail.com> wrote:
> >
> > Seems like a good read ahead value that the ceph hadoop client can use
> > as a default !
>
> Great, I'll add this tunable to the list of changes to be pushed into
> next release.
>
> > I'll look at the DFS write tests later today ... any tuning suggestions
> > you can think of there?  I was thinking of trying out increasing the
> > journal size and separating out the journaling to a separate disk.
> > Anything else ?
>
> I expect that you will see a lot of improvement by moving the journal
> to a separate physical device, so I would start there.
>
> As for journal size tuning, I'm not completely sure, but there may be
> an opportunity to optimize for Hadoop workloads. I think ceph.com/docs
> has some general guidelines. Maybe someone more knowledgeable than me
> can chime in on the trade-offs.
>
> >
> > For hdfs dfsio read test:
> >
> > Average execution time: 258
> > Best execution time: 149
> > Worst exec time: 361
> >
> > For ceph with default read ahead setting:
> >
> > Average execution time: 316
> > Best execution time: 296
> > Worst execution time: 358
> >
> > For ceph with read ahead setting = 4193404
> >
> > Average execution time: 285
> > Best execution time: 277
> > Worst execution time: 294
>
> This is looking pretty good. I'd really like to work on that best
> execution time for Ceph. I wonder if there are any Hadoop profiling
> tools... narrowing down where time is being taken up would be very
> useful.
>
> Thanks!
> Noah
>
>
> >
> > I didn't set max bytes ... I guess the default is zero which means no
> > max ?  I tried increasing the readahead max periods to 8 ... didn't look
> > like a good change.
> >
> > thanks !
> >
> >
> >
> >
> > On Wed, Jul 10, 2013 at 10:56 AM, Noah Watkins <noah.watk...@inktank.com>
> > wrote:
> >>
> >> Hey KC,
> >>
> >> I wanted to follow up on this, but ran out of time yesterday. To set
> >> the options in ceph.conf you can do something like
> >>
> >> [client]
> >>     readahead min = blah
> >>     readahead max bytes = blah
> >>     readahead max periods = blah
> >>
> >> then, make just sure that your client is pointing to a ceph.conf with
> >> these settings.
> >>
> >>
> >> On Tue, Jul 9, 2013 at 4:32 PM, Noah Watkins <noah.watk...@inktank.com>
> >> wrote:
> >> > Yes, the libcephfs client. You should be able to adjust the settings
> >> > without changing any code. The settings should be adjustable either by
> >> > setting the config options in ceph.conf, or using the
> >> > "ceph.conf.options" settings in Hadoop's core-site.xml.
> >> >
> >> > On Tue, Jul 9, 2013 at 4:26 PM, ker can <kerca...@gmail.com> wrote:
> >> >> Makes sense.  I can try playing around with these settings ... when
> >> >> you're saying client, would this be libcephfs.so ?
> >> >>
> >> >>
> >> >>
> >> >>
> >> >>
> >> >> On Tue, Jul 9, 2013 at 5:35 PM, Noah Watkins <noah.watk...@inktank.com>
> >> >> wrote:
> >> >>>
> >> >>> Greg pointed out the read-ahead client options. I would suggest
> >> >>> fiddling with these settings. If things improve, we can put automatic
> >> >>> configuration of these settings into the Hadoop client itself. At the
> >> >>> very least, we should be able to see if it is the read-ahead that is
> >> >>> causing performance problems.
> >> >>>
> >> >>> OPTION(client_readahead_min, OPT_LONGLONG, 128*1024) // readahead at
> >> >>> _least_ this much.
> >> >>> OPTION(client_readahead_max_bytes, OPT_LONGLONG, 0) //8 * 1024*1024
> >> >>> OPTION(client_readahead_max_periods, OPT_LONGLONG, 4) // as multiple
> >> >>> of file layout period (object size * num stripes)
> >> >>>
> >> >>> -Noah
> >> >>>
> >> >>>
> >> >>> On Tue, Jul 9, 2013 at 3:27 PM, Noah Watkins
> >> >>> <noah.watk...@inktank.com>
> >> >>> wrote:
> >> >>> >> Is the JNI interface still an issue or have we moved past that ?
> >> >>> >
> >> >>> > We haven't done much performance tuning with Hadoop, but I suspect
> >> >>> > that the JNI interface is not a bottleneck.
> >> >>> >
> >> >>> > My very first thought about what might be causing slow read
> >> >>> > performance is the read-ahead settings we use vs Hadoop. Hadoop
> >> >>> > should be performing big, efficient, block-size reads and caching
> >> >>> > these in each map task. However, I think we are probably doing lots
> >> >>> > of small reads on demand. That would certainly hurt performance.
> >> >>> >
> >> >>> > In fact, in CephInputStream.java I see we are doing buffer-sized
> >> >>> > reads. Which, at least in my tree, turn out to be 4096 bytes :)
> >> >>> >
> >> >>> > So, there are two issues now. First, the C-Java barrier is being
> >> >>> > crossed a lot (16K times for a 64MB block). That's probably not a
> >> >>> > huge overhead, but it might be something. The second is read-ahead.
> >> >>> > I'm not sure how much read-ahead the libcephfs client is performing,
> >> >>> > but the more round trips it's doing, the more overhead we would incur.
> >> >>> >
> >> >>> >
> >> >>> >>
> >> >>> >> thanks !
> >> >>> >>
> >> >>> >>
> >> >>> >>
> >> >>> >>
> >> >>> >> On Tue, Jul 9, 2013 at 3:01 PM, ker can <kerca...@gmail.com> wrote:
> >> >>> >>>
> >> >>> >>> For this particular test I turned off replication for both hdfs
> >> >>> >>> and ceph.  So there is just one copy of the data lying around.
> >> >>> >>>
> >> >>> >>> hadoop@vega7250:~$ ceph osd dump | grep rep
> >> >>> >>> pool 0 'data' rep size 1 min_size 1 crush_ruleset 0 object_hash
> >> >>> >>> rjenkins
> >> >>> >>> pg_num 960 pgp_num 960 last_change 26 owner 0
> >> >>> >>> crash_replay_interval 45
> >> >>> >>> pool 1 'metadata' rep size 2 min_size 1 crush_ruleset 1
> >> >>> >>> object_hash
> >> >>> >>> rjenkins pg_num 960 pgp_num 960 last_change 1 owner 0
> >> >>> >>> pool 2 'rbd' rep size 2 min_size 1 crush_ruleset 2 object_hash
> >> >>> >>> rjenkins
> >> >>> >>> pg_num 960 pgp_num 960 last_change 1 owner 0
> >> >>> >>>
> >> >>> >>> From hdfs-site.xml:
> >> >>> >>>
> >> >>> >>>   <property>
> >> >>> >>>     <name>dfs.replication</name>
> >> >>> >>>     <value>1</value>
> >> >>> >>>   </property>
> >> >>> >>>
> >> >>> >>>
> >> >>> >>>
> >> >>> >>>
> >> >>> >>>
> >> >>> >>> On Tue, Jul 9, 2013 at 2:44 PM, Noah Watkins
> >> >>> >>> <noah.watk...@inktank.com>
> >> >>> >>> wrote:
> >> >>> >>>>
> >> >>> >>>> On Tue, Jul 9, 2013 at 12:35 PM, ker can <kerca...@gmail.com>
> >> >>> >>>> wrote:
> >> >>> >>>> > hi Noah,
> >> >>> >>>> >
> >> >>> >>> > while we're still on the hadoop topic ... I was also trying out
> >> >>> >>> > the TestDFSIO tests, ceph v/s hadoop.  The Read tests on ceph
> >> >>> >>> > take about 1.5x the hdfs time.  The write tests are worse, about
> >> >>> >>> > 2.5x the time on hdfs, but I guess we have additional journaling
> >> >>> >>> > overheads for the writes on ceph.  But there should be no such
> >> >>> >>> > overheads for the read ?
> >> >>> >>>>
> >> >>> >>>> Out of the box Hadoop will keep 3 copies, and Ceph 2, so it could
> >> >>> >>>> be the case that reads are slower because there is less
> >> >>> >>>> opportunity for scheduling local reads. You can create a new pool
> >> >>> >>>> with replication=3 and test this out (documentation on how to do
> >> >>> >>>> this is on http://ceph.com/docs/wip-hadoop-doc/cephfs/hadoop/).
> >> >>> >>>>
> >> >>> >>>> As for writes, Hadoop will write 2 remote and 1 local blocks,
> >> >>> >>>> however Ceph will write all copies remotely, so there is some
> >> >>> >>>> overhead for the extra remote object write (compared to Hadoop),
> >> >>> >>>> but I wouldn't have expected 2.5x. It might be useful to run dd or
> >> >>> >>>> something like that on Ceph to see if the numbers make sense, to
> >> >>> >>>> rule out Hadoop as the bottleneck.
> >> >>> >>>>
> >> >>> >>>> -Noah
> >> >>> >>>
> >> >>> >>>
> >> >>> >>
> >> >>
> >> >>
> >
> >
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
