To change the readahead amount, IIRC it's something like blockdev --setra 16 /dev/sda, where the number is the readahead size (in 512-byte sectors, I believe).
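For reference, a minimal sketch of checking and setting readahead on Linux (assuming the data disk is /dev/sda; blockdev reports and sets the value in 512-byte sectors, so 16 is 8KB, and the change needs root and does not persist across reboots):

    # show the current readahead setting, in 512-byte sectors
    blockdev --getra /dev/sda

    # drop readahead to 16 sectors (8KB) for purely random SSD access
    blockdev --setra 16 /dev/sda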
-Jay

On Sun, Jan 25, 2015 at 3:50 PM, Roger Hoover <roger.hoo...@gmail.com> wrote:

> I haven't had a chance to try it yet. Hopefully next week. I'll let you
> know what I find.
>
> On Sun, Jan 25, 2015 at 2:40 PM, Chris Riccomini <criccom...@apache.org>
> wrote:
>
> > Awesome, I'll have a look at this. @Roger, did setting this improve your
> > RocksDB throughput?
> >
> > On Sun, Jan 25, 2015 at 12:53 PM, Jay Kreps <jay.kr...@gmail.com> wrote:
> >
> > > I have seen a similar thing from the OS tunable readahead. I think
> > > Linux defaults to reading a full 128K into pagecache with every read.
> > > This is sensible for spinning disks where maybe blowing 500us may mean
> > > you get lucky and save a 10ms seek. But for SSDs, especially a
> > > key-value store doing purely random access, it is a total waste and
> > > huge perf hit.
> > >
> > > -Jay
> > >
> > > On Sun, Jan 25, 2015 at 12:29 PM, Roger Hoover <roger.hoo...@gmail.com>
> > > wrote:
> > >
> > > > FYI, for Linux with SSDs, changing the io scheduler to deadline or
> > > > noop can make a 500x improvement. I haven't tried this myself.
> > > >
> > > > http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/hardware.html#_disks
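A rough sketch of the scheduler change Roger describes above, assuming the SSD is /dev/sda (the set of available schedulers and the right way to persist the change vary by kernel and distro):

    # list the available schedulers; the active one is shown in brackets
    cat /sys/block/sda/queue/scheduler

    # switch to deadline (or noop) at runtime, as root
    echo deadline > /sys/block/sda/queue/scheduler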
> > > > On Tue, Jan 20, 2015 at 9:28 AM, Chris Riccomini
> > > > <criccom...@linkedin.com.invalid> wrote:
> > > >
> > > > > Hey Roger,
> > > > >
> > > > > We did some benchmarking, and discovered very similar performance
> > > > > to what you've described. We saw ~40k writes/sec and ~20k
> > > > > reads/sec, per-container, on a Virident SSD. This was without any
> > > > > changelog. Are you using a changelog on the store?
> > > > >
> > > > > When we attached a changelog to the store, the writes dropped
> > > > > significantly (~1000 writes/sec). When we hooked up VisualVM, we
> > > > > saw that the container was spending > 99% of its time in
> > > > > KafkaSystemProducer.send().
> > > > >
> > > > > We're currently doing two things:
> > > > >
> > > > > 1. Working with our performance team to understand and tune RocksDB
> > > > > properly.
> > > > > 2. Upgrading the Kafka producer to use the new Java-based API.
> > > > > (SAMZA-227)
> > > > >
> > > > > For (1), it seems like we should be able to get a lot higher
> > > > > throughput from RocksDB. Anecdotally, we've heard that RocksDB
> > > > > requires many threads in order to max out an SSD, and since Samza
> > > > > is single-threaded, we could just be hitting a RocksDB bottleneck.
> > > > > We won't know until we dig into the problem (which we started
> > > > > investigating last week). The current plan is to start by
> > > > > benchmarking RocksDB JNI outside of Samza, and see what we can get.
> > > > > From there, we'll know our "speed of light", and can try to get
> > > > > Samza as close as possible to it. If RocksDB JNI can't be made to
> > > > > go "fast", then we'll have to understand why.
> > > > >
> > > > > (2) should help with the changelog issue. I believe that the
> > > > > slowness with the changelog is caused because the changelog is
> > > > > using a sync producer to send to Kafka, and is blocking when a
> > > > > batch is flushed. In the new API, the concept of a "sync" producer
> > > > > is removed. All writes are handled on an async writer thread
> > > > > (though we can still guarantee writes are safely written before
> > > > > checkpointing, which is what we need).
> > > > >
> > > > > In short, I agree, it seems slow. We see this behavior, too. We're
> > > > > digging into it.
> > > > >
> > > > > Cheers,
> > > > > Chris
> > > > >
> > > > > On 1/17/15 12:58 PM, "Roger Hoover" <roger.hoo...@gmail.com> wrote:
> > > > >
> > > > > > Michael,
> > > > > >
> > > > > > Thanks for the response. I used VisualVM and YourKit and see the
> > > > > > CPU is not being used (0.1%). I took a few thread dumps and see
> > > > > > the main thread blocked on the flush() method inside the KV
> > > > > > store.
> > > > > >
> > > > > > On Sat, Jan 17, 2015 at 7:09 AM, Michael Rose
> > > > > > <elementat...@gmail.com> wrote:
> > > > > >
> > > > > > > Is your process at 100% CPU? I suspect you're spending most of
> > > > > > > your time in JSON deserialization, but profile it and check.
> > > > > > >
> > > > > > > Michael
> > > > > > >
> > > > > > > On Friday, January 16, 2015, Roger Hoover
> > > > > > > <roger.hoo...@gmail.com> wrote:
> > > > > > >
> > > > > > > > Hi guys,
> > > > > > > >
> > > > > > > > I'm testing a job that needs to load 40M records (6GB in
> > > > > > > > Kafka as JSON) from a bootstrap topic. The topic has 4
> > > > > > > > partitions and I'm running the job using the
> > > > > > > > ProcessJobFactory so all four tasks are in one container.
> > > > > > > >
> > > > > > > > Using RocksDB, it's taking 19 minutes to load all the data,
> > > > > > > > which amounts to 35k records/sec or 5MB/s based on input
> > > > > > > > size. I ran iostat during this time and see the disk write
> > > > > > > > throughput is 14MB/s.
> > > > > > > >
> > > > > > > > I didn't tweak any of the storage settings.
> > > > > > > >
> > > > > > > > A few questions:
> > > > > > > > 1) Does this seem low? I'm running on a Macbook Pro with SSD.
> > > > > > > > 2) Do you have any recommendations for improving the load
> > > > > > > > speed?
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > >
> > > > > > > > Roger
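For anyone who wants to repeat the thread-dump check Roger describes above (main thread blocked in the KV store's flush()), a minimal sketch using the standard JDK tools, assuming the container runs as a plain local JVM (replace 12345 with the pid that jps reports):

    # find the Samza container's JVM pid
    jps -lm

    # capture a thread dump
    jstack 12345 > dump-1.txt

    # see where the main thread is blocked
    grep -A 20 '"main"' dump-1.txt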