Hi, Yan: Thanks a lot for your answer.
Sincerely, Selina On Mon, Oct 26, 2015 at 8:03 PM, Yan Fang <yanfangw...@163.com> wrote: > Hi Selina, > > > Your understanding is correct. Yes, you "need to consumer the original > input and send it back to Kafka and reset the* Key to departmentName *and > then consume it again > to count in Samza" if you want to count the number of students in the same > departmentName. This is a typical aggregation use case. Because after > aggregating the students in the same department, you can do more than just > "count". :) > > > Cheers, > Yan > > > At 2015-10-25 06:12:50, "Selina Tech" <swucaree...@gmail.com> wrote: > >Hi, Yan: > > > > Thanks a lot for your reply. > > > > You mentioned "if you give the msgs the same partition key", which > >mean same partition key value or same partition key attribute name? > > > > I mentioned "primary key" as "key" at public > >KeyedMessage(java.lang.String topic, K key, V message) or you can ignore > >it. I explain it in another way below. > > > > If I need aggregate data, but the data are not in same partition, > do > >we need consumer the data, and put it back it to Kafka with with new key > >and then consumer it again and aggregate it in Samza. > > > > For example, messages about student GPA information was send to > >Kafka by* K key(String schoolName)*. The message looks like "name, > >schoolName, departmentName, grade, GPA", and assuming I have 3 > >partitions, With my understanding, all student records in one school > should > >go to same partition. > > > > Right now I need to aggregate data for same department, no matter > >which school. which mean all the same departmentName message will be in > >three different partition. If I just count it in one samza job, will the > >result correct? Do I need to consumer the original input and send it back > >to Kafka and reset the* Key to departmentName *and then consume it again > >to count in Samza? > > > > If I did not understand the partition and task of Samza, would you > >like to correct me? > > > >Sincerely, > >Selina > > > >On Sat, Oct 24, 2015 at 2:45 AM, Yan Fang <yanfangw...@163.com> wrote: > > > >> > >> > >> Hi Selina, > >> > >> > >> what do you mean by "primary key" here? Is it one of the partitions of > >> "input" or something like "if one msg meets condition x, we think msg > has > >> the primary key"? > >> > >> > >> If you just want to count the msgs, you can count in one Samza job and > >> send the result to "output" topic. You can send to any partition of the > >> "output" if you give the msgs the same partition key. > >> > >> > >> Thanks, > >> Yan > >> > >> > >> > >> > >> > >> > >> > >> At 2015-10-22 08:30:15, "Selina Tech" <swucaree...@gmail.com> wrote: > >> >Hi, All: > >> > > >> > In the Samza document, it mentioned "Each task consumes data > from > >> >one partition for each of the job’s input streams." Does it mean if the > >> >data processing one job is not in one partition, the result will be > wrong. > >> > > >> > Assuming my Samza input data on Kafka topic -- "input" is > >> >partitioned by default -- round robin. And I have five partitions. If > my > >> >Samza job is to count messages by primary key of the message at "input" > >> >topic, and then output it to kafka topic -- "output". > >> > > >> > So I need steps as below > >> > 1. read data from Kafka topic "input" > >> > 2. reset the partition key to "primary key" in Samza > >> > 3. produce it back to Kafka topic named as "temp" > >> > 4. read "temp" topic at Samza > >> > 5. count it in Samza > >> > 6. write it to Kafka topic named as "output" > >> > > >> > If I just read data from Kafka topic "input" and count it in > Samza > >> >and write it to topic "output". The result will not be correct because > >> there > >> >might have multiple messages for same "primary key" in "output" > topic. Do > >> >I understand it correctly? > >> > > >> >Sincerely, > >> >Selina > >> >