sending data to a different partitions of the output stream

2015-04-07 Thread Vladimir Lebedev
Hey, I can not find clear explanation of this in the documentation or in hello-samza: how to tell collector to send my output data to a particular partition of the output stream? My understanding is that in my process() method I have to create OutgoingMessageEnvelope object passing not only

Re: sending data to a different partitions of the output stream

2015-04-07 Thread Tommy Becker
If you want to send to a specific partition number, you can just pass that number as the partition key. This works because the default partitioner is via hashcode, and the hash of integers is the value itself. On 04/07/2015 07:40 AM, Vladimir Lebedev wrote: Hey, I can not find clear explanati

Re: sending data to a different partitions of the output stream

2015-04-07 Thread Vladimir Lebedev
Great, passing partition number as a partition key is enough for my needs. Many thanks, Tommy! On 04/07/2015 02:46 PM, Tommy Becker wrote: If you want to send to a specific partition number, you can just pass that number as the partition key. This works because the default partitioner is via

consistency between input, output and changelog streams

2015-04-07 Thread Bart Wyatt
We are trying to make sure that we are handling the proper edge cases in our stateful tasks so that our processes can survive failure well. Given the changelog will recreate the KV store (state) up to the point of time of the last durable changelog write(Ts), the checkpoint will start input from

Dealing with partitioning mismatches between bootstrap and input streams

2015-04-07 Thread Tommy Becker
We have a Kafka topic containing data needed by several Samza jobs. These jobs will essentially read the data and build up state that will be used for processing their inputs. Ideally, we would use the topic as a bootstrap stream to build up this state. The problem with that is the topic contai

Producer performance in 0.9.0

2015-04-07 Thread Gian Merlino
Has anyone else seen issues with producer performance in 0.9.0? I updated a few of our jobs recently and ended up rolling one back to 0.8 since it was being really sluggish. I profiled it for a bit and a lot of time was being spent in BufferPool.allocate and the busy-loop in KafkaSystemProducer's

Re: Producer performance in 0.9.0

2015-04-07 Thread Chris Riccomini
Hey Gian, Hmm, this is strange. We ran some tests, and found that the new producer to be faster than the old producer default (sync), and almost as fast as the old producer's async producer. Could you paste all of your configs? Cheers, Chris On Tue, Apr 7, 2015 at 10:40 AM, Gian Merlino wrote:

Re: consistency between input, output and changelog streams

2015-04-07 Thread Yan Fang
Hi Bart, In terms of your assumption, * Ts <= To , this is correction. The code backups this assumption is here: in RunLoop , the commit is called after each process and window methods

Re: Dealing with partitioning mismatches between bootstrap and input streams

2015-04-07 Thread Chris Riccomini
Hey Tommy, Your summary sounds pretty accurate. One other way, which requires no change to Samza, would be to repartition the input topic properly for each task. This is kind of hacky, though. (2) is the ideal solution. It is a bit of work, but it might not be so bad. I think most of the changes

Re: Producer performance in 0.9.0

2015-04-07 Thread Gian Merlino
Hey Chris, I tried setting producer.batch.size to 256KB (the Kafka docs say the default is 16KB) and the throughput is much better. That job is running a bit faster than 0.8 now. Gian On Tue, Apr 7, 2015 at 2:02 PM, Chris Riccomini wrote: > Hey Gian, > > Hmm, this is strange. We ran some tests,

Re: Review Request 32877: SAMZA-616

2015-04-07 Thread Tommy Becker
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/32877/ --- (Updated April 7, 2015, 8:19 p.m.) Review request for samza. Bugs: SAMZA-616