On Jan 3, 2011, at 8:47 PM, Christopher Smith wrote:

> On Mon, Jan 3, 2011 at 5:05 PM, Brian Bockelman <bbock...@cse.unl.edu> wrote:
> 
>> On Jan 3, 2011, at 5:17 PM, Christopher Smith wrote:
>>> On Mon, Jan 3, 2011 at 11:40 AM, Brian Bockelman <bbock...@cse.unl.edu> wrote:
>>> 
>>>> It's not immediately clear to me the size of the benefit versus the costs.
>>>> Two cases where one normally thinks about direct I/O are:
>>>> 1) The usage scenario is a cache anti-pattern.  This will be true for some
>>>> Hadoop use cases (MapReduce), not true for some others.
>>>> - http://www.jeffshafer.com/publications/papers/shafer_ispass10.pdf
>>>> 2) The application manages its own cache.  Not applicable.
>>>> Atom processors, which you mention below, will just exacerbate (1) due to
>>>> the small cache size.
>>>> 
>>> 
>>> Actually, assuming you thrash the cache anyway, having a smaller cache can
>>> often be a good thing. ;-)
>> 
>> Assuming no other thread wants to use that poor cache you are thrashing ;)
> 
> 
> Even then: a small cache can be cleared up more quickly. As in all cases, it
> very much depends on circumstance, but much like O_DIRECT, if you are
> blowing the cache anyway, there is little at stake.
> 
>>>> All-in-all, doing this specialization such that you don't hurt the general
>>>> case is going to be tough.
>>> 
>>> For the Hadoop case, the advantages of O_DIRECT would seem to be
>>> comparatively petty next to using O_APPEND and/or MMAP (yes, I realize
>>> this is not quite the same as what you are proposing, but it seems close
>>> enough for most cases).  Your best case for a win is when you have
>>> reasonably random access to a file, and then something else that would
>>> benefit from more ...
>> 
>> Actually, our particular site would greatly benefit from O_DIRECT - we have
>> non-MapReduce clients with a highly non-repetitive, random read I/O pattern
>> and an actively managed application-level read-ahead (note: since 2PB of
>> SSDs are a touch pricey, we're almost guaranteed to wait for a disk seek,
>> so the latency overheads of Java are not actually too important).  The OS
>> page cache is mostly useless for us, as the working set size is on the
>> order of a few hundred TB.
>> 
> 
> Sounds like a lot of fun! Even in a circumstance like the one you describe,
> unless the I/O pattern isn't truly random and some application-level insight
> provides a unique advantage, the page cache will often do a better job of
> managing the memory, both in terms of caching and read-ahead (it becomes a
> lot like "building a better TCP using UDP": possible, but not really worth
> the effort). If you can pull off zero-copy I/O, O_DIRECT can be a huge win,
> but Java makes that very, very difficult, and horribly painful to manage.
> 
> 

The I/O pattern isn't truly random.  To convert from physicist terms to CS 
terms, the application is iterating through the rows of a column-oriented 
store, reading out somewhere between 1 and 10% of the columns.  The twist is 
that the columns are compressed, meaning the size of a set of rows on disk is 
variable.  This prevents any sort of OS page cache stride detection from 
helping - the OS sees everything as random.  However, the application also has 
an index of where each row is located, so if it knows the active set of 
columns, it can predict the reads the client will perform and do its own 
read-ahead.
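
To make the shape of that read-ahead concrete, here is a minimal sketch 
against a plain local FileChannel.  The PrefetchingReader and Extent names 
are made up for illustration, not our production code; the real thing sits 
on top of HDFS and deals with decompression, errors, and a deeper prefetch 
queue.  The idea is only that the index turns an OS-random pattern into a 
predictable one:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

/** Hypothetical (offset, length) entry taken from the application's row index. */
record Extent(long offset, int length) {}

/**
 * Index-driven read-ahead: while the caller is decompressing extent i, a
 * background thread is already fetching extent i+1.  The index is what lets
 * us predict a read pattern that looks random to the OS.
 */
class PrefetchingReader implements AutoCloseable {
    private final FileChannel channel;
    private final ExecutorService prefetcher = Executors.newSingleThreadExecutor();

    PrefetchingReader(Path file) throws IOException {
        this.channel = FileChannel.open(file, StandardOpenOption.READ);
    }

    private ByteBuffer readExtent(Extent e) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(e.length());
        channel.read(buf, e.offset());      // positioned read; no shared seek state
        buf.flip();
        return buf;
    }

    /** Visit the extents of the active columns, overlapping I/O with processing. */
    void scan(List<Extent> extents, Consumer<ByteBuffer> consume) throws Exception {
        Future<ByteBuffer> inFlight = null;
        for (int i = 0; i < extents.size(); i++) {
            final Extent current = extents.get(i);
            Future<ByteBuffer> pending =
                    (inFlight != null) ? inFlight
                                       : prefetcher.submit(() -> readExtent(current));
            if (i + 1 < extents.size()) {   // queue the next read before we block
                final Extent next = extents.get(i + 1);
                inFlight = prefetcher.submit(() -> readExtent(next));
            } else {
                inFlight = null;
            }
            consume.accept(pending.get());  // decompress / deserialize the rows here
        }
    }

    @Override
    public void close() throws IOException {
        prefetcher.shutdownNow();
        channel.close();
    }
}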

Some days, it does feel like "building a better TCP using UDP".  However, we 
got a 3x performance improvement by building it (and multiplied across 10-15k 
cores for just our LHC experiment, that's real money!), so it's a particular 
monstrosity we are stuck with.

>> However, I wouldn't actively clamor for O_DIRECT support, but I could
>> probably do wonders with an HDFS equivalent of fadvise.  I really don't want
>> to get into the business of managing buffering in my application code any
>> more than we already do.
> 
> 
> Yes, I think a few simple tweaks to HDFS could help tremendously,
> particularly for MapReduce-style jobs.
> 
> 
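
To be concrete about the kind of hint I mean: on a local filesystem you can 
already get at posix_fadvise from Java via JNA direct mapping.  The sketch 
below is only an illustration (it assumes 64-bit Linux, uses the Linux advice 
constants from fcntl.h, and leaves aside how you dig the integer fd out of a 
Java stream); an HDFS equivalent would presumably take a path plus an 
offset/length range instead of an fd.

import com.sun.jna.Native;

/**
 * Sketch: expose posix_fadvise(2) to Java so an application that knows its
 * access pattern can tell the kernel which byte ranges it will (or will not)
 * need soon.  Assumes 64-bit Linux and JNA on the classpath.
 */
final class Fadvise {
    static { Native.register("c"); }            // bind the native method below to libc

    // Advice values from <fcntl.h> on Linux.
    static final int POSIX_FADV_WILLNEED = 3;   // read-ahead this range soon
    static final int POSIX_FADV_DONTNEED = 4;   // drop this range from the page cache

    /** Returns 0 on success, otherwise an errno value. */
    static native int posix_fadvise(int fd, long offset, long len, int advice);

    /** Hint that we will read [offset, offset+len) shortly. */
    static void willNeed(int fd, long offset, long len) {
        posix_fadvise(fd, offset, len, POSIX_FADV_WILLNEED);
    }

    /** Hint that we are done with [offset, offset+len). */
    static void doneWith(int fd, long offset, long len) {
        posix_fadvise(fd, offset, len, POSIX_FADV_DONTNEED);
    }
}
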
>> PS - if there are bored folks wanting to do something beneficial to
>> high-performance HDFS, I'd note that currently it is tough to get >1Gbps
>> performance from a single Hadoop client transferring multiple files.
>> However, HP Labs had a clever approach:
>> http://www.hpl.hp.com/techreports/2009/HPL-2009-345.pdf .  I'd love to see
>> a generic, easy-to-use API to do this.
>> 
>> 
> Interesting. We haven't tried to push the envelope, but we have achieved
> >1Gbps... I can't recall if we ever got over 2Gbps though...
> 

We hit a real hard wall at 2.5Gbps / server.  Hence, to fill our 10Gbps pipe, 
we've taken the approach of deploying 12 moderate external-facing servers 
instead of one large, fast server.  Unfortunately, buying new servers was much 
cheaper than finding more time to track down the bottlenecks.
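
For what it's worth, the obvious client-side workaround is a thread pool with 
one stream per file, roughly as sketched below (plain FileSystem API, made-up 
class name, and not the HP Labs approach from the paper - just the naive 
fan-out one tries first).  Something like this is roughly what I'd want a 
generic API to hide behind a single call:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Naive parallel fetch: N files in flight at once, one stream each, so the
 * aggregate transfer is not limited by the throughput of any single stream.
 */
public class ParallelFetch {

    public static long fetchAll(List<Path> paths, int parallelism) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        try {
            List<Future<Long>> results = new ArrayList<>();
            for (Path p : paths) {
                results.add(pool.submit(() -> drain(fs, p)));
            }
            long total = 0;
            for (Future<Long> r : results) {
                total += r.get();               // bytes moved per file
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    /** Read a file end to end and discard the bytes (stand-in for real processing). */
    private static long drain(FileSystem fs, Path p) throws Exception {
        byte[] buf = new byte[1 << 20];         // 1 MiB per read call
        long total = 0;
        try (FSDataInputStream in = fs.open(p)) {
            int n;
            while ((n = in.read(buf)) > 0) {
                total += n;
            }
        }
        return total;
    }
}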

Brian
