Re: questions of partition and task of Samza

Selina Tech Sat, 24 Oct 2015 15:13:05 -0700

Hi, Yan:

      Thanks a lot for your reply.


      You mentioned "if you give the msgs the same partition key". Does
that mean the same partition key value, or the same partition key
attribute name?

      By "primary key" I meant the "key" argument in public
KeyedMessage(java.lang.String topic, K key, V message), but you can ignore
that. Let me explain it another way below.

      If I need to aggregate data, but the data are not in the same
partition, do I need to consume the data, put it back into Kafka with a
new key, and then consume it again and aggregate it in Samza?

      For example, messages about student GPA information were sent to
Kafka with *K key (String schoolName)*. Each message looks like "name,
schoolName, departmentName, grade, GPA". Assuming I have 3 partitions, my
understanding is that all student records for one school should go to the
same partition.
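
The relationship between key value and partition can be sketched in plain
Java. This is only an illustration under stated assumptions: partitionFor
is a hypothetical helper that hashes the key value and takes it modulo the
partition count, a simplified stand-in rather than Kafka's own partitioner,
and the school names are made up. What it shows is that the partition is
derived from the key's value, so equal values always map to the same
partition.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only: a simplified hash partitioner,
// not the Kafka or Samza implementation.
class KeyPartitionSketch {

    // Map a key VALUE to a partition index in [0, numPartitions).
    // floorMod keeps the result non-negative even for negative hash codes.
    static int partitionFor(String keyValue, int numPartitions) {
        return Math.floorMod(keyValue.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        int numPartitions = 3; // matches the 3 partitions assumed in the example
        List<String> schoolNames =
                Arrays.asList("SchoolA", "SchoolB", "SchoolA", "SchoolC");
        for (String schoolName : schoolNames) {
            // Both "SchoolA" records print the same partition number.
            System.out.println(schoolName + " -> partition "
                    + partitionFor(schoolName, numPartitions));
        }
    }
}
```

Two different schools may still share a partition, since there can be more
distinct key values than partitions.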

      Right now I need to aggregate data for the same department, no
matter which school, which means messages with the same departmentName
may be spread across all three partitions. If I just count them in one
Samza job, will the result be correct? Or do I need to consume the
original input, send it back to Kafka with the *key reset to
departmentName*, and then consume it again to count in Samza?
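
The difference between the two approaches can be simulated in plain Java.
This is a sketch under stated assumptions, not Samza or Kafka code:
partitionFor is a simplified hash-modulo partitioner, each partition is
counted independently (as one task per partition would do), and the sample
records and single-letter school names are invented. Keyed by schoolName,
each partition holds only a partial count for a department; keyed by
departmentName, one partition holds the complete count.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical simulation in plain Java; it does not call the Samza or Kafka APIs.
class RekeySketch {

    // Simplified hash partitioner (an assumption, not Kafka's implementation).
    static int partitionFor(String keyValue, int numPartitions) {
        return Math.floorMod(keyValue.hashCode(), numPartitions);
    }

    // Route each record to a partition by the field at keyIndex, then count
    // departmentName separately inside each partition.
    // Record layout: {name, schoolName, departmentName}.
    static List<Map<String, Integer>> countPerPartition(
            List<String[]> records, int keyIndex, int numPartitions) {
        List<Map<String, Integer>> counts = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            counts.add(new TreeMap<>());
        }
        for (String[] record : records) {
            int partition = partitionFor(record[keyIndex], numPartitions);
            counts.get(partition).merge(record[2], 1, Integer::sum);
        }
        return counts;
    }

    // Number of partitions that hold a (partial or complete) count for a department.
    static int partitionsHolding(List<Map<String, Integer>> counts, String department) {
        int holders = 0;
        for (Map<String, Integer> perPartition : counts) {
            if (perPartition.containsKey(department)) {
                holders++;
            }
        }
        return holders;
    }

    // Made-up sample data: three schools, two departments.
    static List<String[]> sampleRecords() {
        return Arrays.asList(
                new String[] {"Ann", "A", "Math"},
                new String[] {"Bob", "B", "Math"},
                new String[] {"Cid", "C", "Math"},
                new String[] {"Dee", "A", "Physics"},
                new String[] {"Eve", "B", "Physics"});
    }

    public static void main(String[] args) {
        List<String[]> records = sampleRecords();
        // Index 1 = schoolName (original key), index 2 = departmentName (new key).
        System.out.println("keyed by schoolName:     " + countPerPartition(records, 1, 3));
        System.out.println("keyed by departmentName: " + countPerPartition(records, 2, 3));
    }
}
```

In this simulation the per-partition counts under the schoolName key would
still have to be merged somewhere to get a department total; re-keying by
departmentName is one way to make a single task see all records for a
department.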

      If I have misunderstood the partitions and tasks of Samza, would
you please correct me?

Sincerely,
Selina

On Sat, Oct 24, 2015 at 2:45 AM, Yan Fang <yanfangw...@163.com> wrote:

>
>
> Hi Selina,
>
>
> what do you mean by "primary key" here? Is it one of the partitions of
> "input" or something like "if one msg meets condition x, we think msg has
> the primary key"?
>
>
> If you just want to count the msgs, you can count in one Samza job and
> send the result to "output" topic. You can send to any partition of the
> "output" if you give the msgs the same partition key.
>
>
> Thanks,
> Yan
>
>
>
>
>
>
>
> At 2015-10-22 08:30:15, "Selina Tech" <swucaree...@gmail.com> wrote:
> >Hi, All:
> >
> >        The Samza documentation says "Each task consumes data from
> >one partition for each of the job’s input streams." Does that mean that if
> >the data one job processes is not in one partition, the result will be
> >wrong?
> >
> >        Suppose my Samza input data on the Kafka topic "input" is
> >partitioned by the default, round robin, and I have five partitions. My
> >Samza job is to count messages by the primary key of the messages in the
> >"input" topic, and then output the counts to the Kafka topic "output".
> >
> >       So I need steps as below
> >      1. read data from Kafka topic "input"
> >      2. reset the partition key to "primary key" in Samza
> >      3. produce it back to Kafka topic named as "temp"
> >      4. read "temp" topic at Samza
> >      5. count it in Samza
> >      6. write it to Kafka topic named as "output"
> >
> >      If I just read data from Kafka topic "input", count it in Samza,
> >and write it to topic "output", the result will not be correct because
> >there might be multiple messages for the same "primary key" in the
> >"output" topic. Do I understand it correctly?
> >
> >Sincerely,
> >Selina
>
