Hi, Yan: Thanks a lot for your reply.
You mentioned "if you give the msgs the same partition key", which mean same partition key value or same partition key attribute name? I mentioned "primary key" as "key" at public KeyedMessage(java.lang.String topic, K key, V message) or you can ignore it. I explain it in another way below. If I need aggregate data, but the data are not in same partition, do we need consumer the data, and put it back it to Kafka with with new key and then consumer it again and aggregate it in Samza. For example, messages about student GPA information was send to Kafka by* K key(String schoolName)*. The message looks like "name, schoolName, departmentName, grade, GPA", and assuming I have 3 partitions, With my understanding, all student records in one school should go to same partition. Right now I need to aggregate data for same department, no matter which school. which mean all the same departmentName message will be in three different partition. If I just count it in one samza job, will the result correct? Do I need to consumer the original input and send it back to Kafka and reset the* Key to departmentName *and then consume it again to count in Samza? If I did not understand the partition and task of Samza, would you like to correct me? Sincerely, Selina On Sat, Oct 24, 2015 at 2:45 AM, Yan Fang <yanfangw...@163.com> wrote: > > > Hi Selina, > > > what do you mean by "primary key" here? Is it one of the partitions of > "input" or something like "if one msg meets condition x, we think msg has > the primary key"? > > > If you just want to count the msgs, you can count in one Samza job and > send the result to "output" topic. You can send to any partition of the > "output" if you give the msgs the same partition key. > > > Thanks, > Yan > > > > > > > > At 2015-10-22 08:30:15, "Selina Tech" <swucaree...@gmail.com> wrote: > >Hi, All: > > > > In the Samza document, it mentioned "Each task consumes data from > >one partition for each of the job’s input streams." Does it mean if the > >data processing one job is not in one partition, the result will be wrong. > > > > Assuming my Samza input data on Kafka topic -- "input" is > >partitioned by default -- round robin. And I have five partitions. If my > >Samza job is to count messages by primary key of the message at "input" > >topic, and then output it to kafka topic -- "output". > > > > So I need steps as below > > 1. read data from Kafka topic "input" > > 2. reset the partition key to "primary key" in Samza > > 3. produce it back to Kafka topic named as "temp" > > 4. read "temp" topic at Samza > > 5. count it in Samza > > 6. write it to Kafka topic named as "output" > > > > If I just read data from Kafka topic "input" and count it in Samza > >and write it to topic "output". The result will not be correct because > there > >might have multiple messages for same "primary key" in "output" topic. Do > >I understand it correctly? > > > >Sincerely, > >Selina >