Hi all,

I'm stuck on a problem I faced recently. The problem statement is
roughly:
An Event bean consists of eventId, eventTag, text, ...
We need to run a Spark job that aggregates on the eventTag column and
picks the top K1 tags by count.
Additionally, for each of those eventTags, we need the list of the
first K2 eventIds that contain that eventTag.
K1, K2 < 100

For example, for the top 1 eventTag, with K1 = 1, K2 = 2:
id_1   eventTag1  text1
id_2   eventTag2  text2
id_3   eventTag3  text3
id_4   eventTag1  text4

The answer would be:   eventTag1 -> id_1, id_4

The job would run over the last 30 days of data (iterating over
200 million × 30 records).
The records are flattened and saved in Parquet.

What could be the solution for the same?
1.
Group by eventTag, sum the counts, order by count, and limit to K1.
(Aggregate Task)
Then run a map task that filters out the non-relevant eventTag rows,
and aggregate with a User Defined Aggregate Function (UDAF) that
builds a column of eventIds.
One problem:
We need to run the operation over all the files multiple times (2 here).
After we find the top K1 elements, is it possible to re-read only the
rows that are already present in memory?
I could persist the initial dataset, but it will be large.
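To make option 1 concrete, here is a minimal local sketch of the two-pass logic, with plain dicts standing in for the Spark group-by and UDAF (the toy records are the four rows from the example above; this is an illustration of the logic, not actual Spark code):

```python
from collections import Counter

# Toy records: (eventId, eventTag, text)
records = [
    ("id_1", "eventTag1", "text1"),
    ("id_2", "eventTag2", "text2"),
    ("id_3", "eventTag3", "text3"),
    ("id_4", "eventTag1", "text4"),
]
K1, K2 = 1, 2

# Pass 1 (Aggregate Task): count per tag, take the top K1 tags.
counts = Counter(tag for _, tag, _ in records)
top_tags = {tag for tag, _ in counts.most_common(K1)}

# Pass 2 (map/filter + UDAF): keep only rows whose tag is in the top K1,
# collecting the first K2 eventIds per tag.
ids = {tag: [] for tag in top_tags}
for event_id, tag, _ in records:
    if tag in top_tags and len(ids[tag]) < K2:
        ids[tag].append(event_id)

print(ids)  # {'eventTag1': ['id_1', 'id_4']}
```

The cost this sketch hides is exactly the stated problem: `records` is iterated twice, which in Spark means listing and reading all the Parquet files twice unless the dataset is persisted.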

2.
We write a custom UDAF, and in the reduce and merge phases add up the
counts of the grouped eventTags.
During the final phase, if I could somehow broadcast all the counts and
add them to a priority queue of size K1, I could then filter in the
last phase on whether a column is needed or not by setting a boolean
variable. We could also collect the eventIds in the UDAF, and for the
columns flagged true, filter them out and get the result.
Two problems:
How would I achieve the broadcast? Can we use accumulators?
For each eventTag, I will end up holding K2 eventIds (wasted memory).
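The single-pass idea behind option 2 can be sketched locally like this (a dict stands in for the UDAF buffers, `heapq.nlargest` for the size-K1 priority queue; the toy records are from the example above, and the second problem is visible in the code: every tag carries a K2-sized id buffer, even tags that will never make the top K1):

```python
import heapq
from collections import defaultdict

# Toy records: (eventId, eventTag, text)
records = [
    ("id_1", "eventTag1", "text1"),
    ("id_2", "eventTag2", "text2"),
    ("id_3", "eventTag3", "text3"),
    ("id_4", "eventTag1", "text4"),
]
K1, K2 = 1, 2

# Single pass (UDAF update/merge): per-tag count plus a bounded id buffer.
# Note the wasted memory: up to K2 ids are kept for *every* tag.
state = defaultdict(lambda: [0, []])  # tag -> [count, first K2 eventIds]
for event_id, tag, _ in records:
    entry = state[tag]
    entry[0] += 1
    if len(entry[1]) < K2:
        entry[1].append(event_id)

# Final phase: a size-K1 priority queue over the counts.
top = heapq.nlargest(K1, state.items(), key=lambda kv: kv[1][0])
result = {tag: ids for tag, (count, ids) in top}

print(result)  # {'eventTag1': ['id_1', 'id_4']}
```

In a real job the hard part is the step this sketch skips: `state` here is one global dict, whereas Spark's final merge happens per partition, which is why the question of broadcasting the counts (accumulators only flow executor-to-driver, not the other way) arises at all.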

3.
We write a custom UDAF that accumulates the count and the eventIds for
each eventTag.
Create a string column, with some delimiter, consisting of the count
and the K2 eventIds.
Break the column apart on the delimiter, then order by the count
column and pick K1 rows.
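Option 3, sketched locally (the `|` and `,` delimiters are my arbitrary choice, and a dict again stands in for the UDAF; the toy records are from the example above):

```python
from collections import defaultdict

# Toy records: (eventId, eventTag, text)
records = [
    ("id_1", "eventTag1", "text1"),
    ("id_2", "eventTag2", "text2"),
    ("id_3", "eventTag3", "text3"),
    ("id_4", "eventTag1", "text4"),
]
K1, K2 = 1, 2

# UDAF phase: accumulate count and first K2 ids, then encode both into
# a single delimited string per tag, e.g. "2|id_1,id_4".
agg = defaultdict(lambda: [0, []])
for event_id, tag, _ in records:
    agg[tag][0] += 1
    if len(agg[tag][1]) < K2:
        agg[tag][1].append(event_id)
encoded = {tag: f"{count}|{','.join(ids)}" for tag, (count, ids) in agg.items()}

# Split phase: break the string on the delimiter, order by count, pick K1.
rows = []
for tag, value in encoded.items():
    count_str, ids_str = value.split("|")
    rows.append((tag, int(count_str), ids_str.split(",")))
rows.sort(key=lambda r: r[1], reverse=True)
result = {tag: ids for tag, _, ids in rows[:K1]}

print(result)  # {'eventTag1': ['id_1', 'id_4']}
```

This keeps everything to a single scan of the data, at the cost of encoding/decoding overhead and of still materialising K2 ids for every tag, like option 2.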


Can someone suggest a better approach?
1. I want to reduce the number of times all the files are listed, their
footers read, and their contents brought into memory.
2. If possible, avoid collecting eventIds for every eventTag.


Thanks



