container concurrency and pipelining

2015-02-06 Thread Jordan Shaw
Hi everyone, I've done some raw disk, Kafka and Samza benchmarking. I peaked out a single Samza container's consumer at around 2MB/s. Running a Kafka consumer perf test on the same machine, though, I can do hundreds of MB/s. It seems like most of the bottleneck exists in the Kafka async client. There ap

Re: container concurrency and pipelining

2015-02-08 Thread Jordan Shaw
s.apache.org/jira/browse/SAMZA-6 > > > > Here's what I'd recommend: > > > > 0. Write something reproducible and post it on SAMZA-6. For bonus points, > > write an equivalent raw-Kafka-producer test (no Samza) so we can compare > > them. > > 1. Ch

Re: container concurrency and pipelining

2015-02-10 Thread Jordan Shaw
VM, and do CPU sampling? > It'd be good to get a view of exactly where in the "produce" call things > are slow. > > Cheers, > Chris > > On Sun, Feb 8, 2015 at 9:47 PM, Jordan Shaw wrote: > > > Hey Chris, > > Sorry for the delayed response, did a Tahoe

Re: container concurrency and pipelining

2015-02-10 Thread Jordan Shaw
rmal results. Thanks! -Jordan On Tue, Feb 10, 2015 at 10:27 AM, Jordan Shaw wrote: > Hey Chris, > We've done pretty extensive testing already on that task. Here's a SS of a > sample of those results showing the 2MB/s rate. I haven't done those > profiling specifically, we

Java Opts and Max Heap

2015-03-10 Thread Jordan Shaw
Hey Everyone, I have a question somewhat related to SAMZA-109 and this line in run-class.sh:

    # Check if a max-heap size is specified. If not - set a 768M heap
    [[ $JAVA_OPTS != *-Xmx* ]] && JAVA_OPTS="$JAVA_OPTS -Xmx768M"

If I were to set the container.memory.mb for YARN to 4GB ( yarn.containe
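The check quoted from run-class.sh only adds a default heap when JAVA_OPTS carries no -Xmx flag of its own. As a rough illustration of that logic (a Python mirror of the shell test, not code from Samza; the function name `with_default_heap` is made up for this sketch):

```python
def with_default_heap(java_opts: str, default_flag: str = "-Xmx768M") -> str:
    """Mirror of the shell check: keep JAVA_OPTS untouched when it already
    contains an -Xmx flag, otherwise append the default max-heap flag."""
    if "-Xmx" in java_opts:
        # The caller already chose a max heap; the default must not override it.
        return java_opts
    # Same string concatenation as JAVA_OPTS="$JAVA_OPTS -Xmx768M" in the script.
    return f"{java_opts} {default_flag}"


if __name__ == "__main__":
    print(with_default_heap("-server"))
    print(with_default_heap("-server -Xmx3G"))
```

Under this reading, raising only the YARN container memory leaves the JVM at the 768M default unless an -Xmx flag is also passed through the Java options.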

Re: Java Opts and Max Heap

2015-03-10 Thread Jordan Shaw
es, and off-heap memory usage. All of these > contribute to the physical memory usage that YARN cares about, but are > outside the JVM heap. This means that we can't just use one memory setting > for both YARN and Java. We have to have two. > > Cheers, > Chris > > On Tue
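Since the container limit that YARN enforces has to cover both the heap and everything outside it, the two settings are usually chosen together. A minimal sketch of that arithmetic, assuming a 25% headroom for non-heap memory (the percentage and the helper name `heap_flag_for_container` are illustrative assumptions, not values from the thread or from Samza):

```python
def heap_flag_for_container(container_mb: int, headroom_pct: int = 25) -> str:
    """Derive an -Xmx flag that leaves part of the container for memory
    outside the JVM heap. headroom_pct is an assumed, tunable value."""
    if container_mb <= 0:
        raise ValueError("container_mb must be positive")
    if not 0 <= headroom_pct < 100:
        raise ValueError("headroom_pct must be in [0, 100)")
    # Integer arithmetic keeps the result an exact number of megabytes.
    heap_mb = container_mb * (100 - headroom_pct) // 100
    return f"-Xmx{heap_mb}M"


if __name__ == "__main__":
    # A 4 GB container with the assumed 25% headroom.
    print(heap_flag_for_container(4096))
```

The right headroom depends on how much off-heap memory the job actually uses, so the figure above should be treated as a starting point to measure against, not a recommendation.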

Re: How do you serve the data computed by Samza?

2015-03-27 Thread Jordan Shaw
all over? > 6. If there was a highly-optimized and reliable way of ingesting > partitioned streams quickly into your online serving system, would that > help you leverage Samza more effectively? > > Your insights would be much appreciated! > > > Thanks (: > > > -- > Felix > -- Jordan Shaw Full Stack Software Engineer PubNub Inc 1045 17th St San Francisco, CA 94107

Re: Thoughts and obesrvations on Samza

2015-07-08 Thread Jordan Shaw
I'm all for any optimizations that can be made to the YARN workflow. I actually agree with Jakob in regard to the producers/consumers. I have spent some time writing consumers and producers for other transport abstractions, and overall I feel the current API abstractions in Samza are pretty good. Th

Re: Thoughts and obesrvations on Samza

2015-07-13 Thread Jordan Shaw
Jay, I think doing this iteratively in smaller chunks is a better way to go as new issues arise. As Navina said, Kafka is a "stream system" and Samza is a "stream processor", and those two ideas should be mutually exclusive. -Jordan On Mon, Jul 13, 2015 at 10:06 AM, Jay Kreps wrote: > Hmm, though

Kafka broker error from samza producer

2015-07-22 Thread Jordan Shaw
from the Samza producer. Any idea what could be causing this? Just about the only thing that I can find is maybe an issue with snappy or compression, but I don't see a snappy call in the traceback.

Re: Kafka broker error from samza producer

2015-07-23 Thread Jordan Shaw
on and the new > producer. If you disable compression or switch to lz4 or gzip, does the > issue go away? > > Cheers, > > Roger > > On Wed, Jul 22, 2015 at 11:54 PM, Jordan Shaw wrote: > > > Hey Everyone, > > I'm getting an: > > "kafka.me

Re: [Discuss/Vote] upgrade to Yarn 2.6.0

2015-08-24 Thread Jordan Shaw
Roger, We upgraded from YARN 2.4 to 2.6 a while ago and have been running it in prod with no issues. It was basically a drop-in if I remember right. Jordan > On Aug 20, 2015, at 1:48 PM, Yi Pan wrote: > > Hi, Selina, > > Samza 0.9.1 on YARN 2.6 is the proved working solution. > > Best, > > -Yi >

Re: Asynchronous approach and samza

2015-09-21 Thread Jordan Shaw
>>>>> ou to do this, (well, discouraged anyway :). Samza by default does not
>>>>> provide this feature. So you may want to be a little cautious when
>>>>> implementing this.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Fang, Yan
>>>>> yanfang...@gmail.com
>>>>>
>>>>> On Sun, Sep 20, 2015 at 4:28 PM, Michael Sklyar wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> What would be the best approach for doing "blocking" operations in Samza?
>>>>>>
>>>>>> For example, we have a kafka stream of urls for which we need to gather
>>>>>> external data via HTTP (such as alexa rank, get the page title and
>>>>>> headers..). Other scenarios include database access and decision making
>>>>>> via a rule engine.
>>>>>>
>>>>>> Samza processes messages in a single thread, HTTP requests might take
>>>>>> hundreds of milliseconds. With the single threaded design the throughput
>>>>>> would be very limited, which can be solved with an asynchronous approach.
>>>>>> However Samza documentation explicitly states
>>>>>> "*You are strongly discouraged from using threads in your job’s code*".
>>>>>>
>>>>>> It seems that Samza design suits very well "data transformation"
>>>>>> scenarios, what is not clear is how well can it support external services?
>>>>>>
>>>>>> Thanks,
>>>>>> Michael Sklyar
>
> --
> Ken Krugler
> +1 530-210-6378
> http://www.scaleunlimited.com
> custom big data solutions & training
> Hadoop, Cascading, Cassandra & Solr
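The throughput concern in the quoted question comes from running slow calls one after another. One general way to overlap them while keeping the caller single-threaded is to fan a batch out to a bounded pool and wait for every result before moving on. The following is a language-neutral sketch written in Python, not Samza API code, and it does not change the caution quoted above about threads in job code; `fetch_title` is a hypothetical stand-in for the blocking HTTP lookup:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Iterable, List


def fetch_title(url: str) -> str:
    """Hypothetical stand-in for a blocking call such as an HTTP request."""
    return f"title-of:{url}"


def process_batch(urls: Iterable[str], max_workers: int = 8) -> List[str]:
    """Run the blocking call for a whole batch with bounded concurrency.

    The pool is created and shut down inside this function, so no work is
    still in flight when it returns; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_title, urls))


if __name__ == "__main__":
    print(process_batch(["http://a.example", "http://b.example"]))
```

Joining the whole batch before returning matters for checkpointing: offsets should only be treated as processed once every call in the batch has finished.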

Re: Does Samza work with ResourceManager in HA?

2015-11-03 Thread Jordan Shaw
> > > > > If I set it to the first of my RMs, the job submission works ok if I > submit the job from that RM and the RM is the active one. If the RM machine > that I run the job submission from is not active, I get connection refused > errors on port 8032. If I don't set it, I get errors where run-job.sh > tries to submit to 0.0.0.0:8032 > > > > > > Many thanks, > > > > > > John

Re: Monitoring consumer lag

2015-11-16 Thread Jordan Shaw
ging? > > On a related subject, I'd also like to monitor throughput per topic in > terms of messages per second and bytes per second. Should I query brokers > periodically, or maybe there is a better way? > > Thanks, > Michael
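Whichever source the counters come from, messages per second and bytes per second for a topic reduce to the difference between two cumulative samples divided by the time between them. A small sketch of that calculation, assuming the counters are already available (the `Sample` type and its fields are hypothetical; how the values are collected is not specified in the thread):

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Sample:
    """One reading of a topic's cumulative counters (hypothetical shape)."""
    timestamp_s: float
    messages: int
    bytes_: int


def rates(prev: Sample, curr: Sample) -> Tuple[float, float]:
    """Return (messages per second, bytes per second) between two samples."""
    elapsed = curr.timestamp_s - prev.timestamp_s
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    return (
        (curr.messages - prev.messages) / elapsed,
        (curr.bytes_ - prev.bytes_) / elapsed,
    )


if __name__ == "__main__":
    earlier = Sample(timestamp_s=0.0, messages=1_000, bytes_=2_000_000)
    later = Sample(timestamp_s=10.0, messages=6_000, bytes_=22_000_000)
    print(rates(earlier, later))
```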

Re: Monitoring consumer lag

2015-11-16 Thread Jordan Shaw
umed that it's the default Kafka config > for committing offsets. Will try again with Burrow set to read from > __consumer_offsets. > > Thanks > > On Mon, Nov 16, 2015 at 8:04 PM, Jordan Shaw wrote: > > > Michael, > > It depends on how you define lag. > >
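The quoted reply notes that lag depends on its definition, and the rest of that reply is cut off here. One common offset-based definition, offered only as an example and not necessarily the one meant in the thread, is the log end offset minus the committed offset for each partition. A sketch under that assumption (the function name and the dictionary inputs are illustrative):

```python
from typing import Dict


def consumer_lag(
    log_end_offsets: Dict[int, int],
    committed_offsets: Dict[int, int],
) -> Dict[int, int]:
    """Per-partition lag as log end offset minus committed offset.

    Assumption for this sketch: a partition with no committed offset is
    measured from offset 0, and negative differences are clamped to 0."""
    lag = {}
    for partition, end_offset in log_end_offsets.items():
        committed = committed_offsets.get(partition, 0)
        lag[partition] = max(end_offset - committed, 0)
    return lag


if __name__ == "__main__":
    print(consumer_lag({0: 100, 1: 50}, {0: 90, 1: 50}))
```

A time-based definition (how old the last processed message is) would need message timestamps instead of offsets, which is one reason the definition has to be settled first.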