Hi Xinh,
Sorry for my late reply.
It's slow for two reasons (at least to my knowledge):
1. Lots of IO - writing as JSON, then reading it back and writing it again as Parquet.
2. Because of the nested RDD I can't run the cycle and filter by event_type
in parallel - this applies to your solution (3rd step).
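Roughly, the current pipeline has the shape of the sketch below (simplified; the function name, directory layout, and the contains-based filter are placeholders, not the real code):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.SQLContext

    def slowPipeline(sqlContext: SQLContext, events: RDD[String],
                     eventTypes: Seq[String], tmpDir: String, outDir: String): Unit = {
      for (eventType <- eventTypes) {                      // sequential driver-side loop (reason 2)
        val subset = events.filter(_.contains(eventType))  // stand-in for the real filter
        subset.saveAsTextFile(s"$tmpDir/$eventType")       // IO: write the JSON strings out
        val df = sqlContext.read.json(s"$tmpDir/$eventType") // IO: read them back, infer schema
        df.write.parquet(s"$outDir/$eventType")            // IO: write everything again as Parquet
      }
    }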
Hi Michal,
Why is your solution so slow? Is it from the file IO caused by storing the
data in a temp file as JSON and then reading it back in and writing it out
as Parquet?
How are you getting "events" in the first place?
Do you have the original Kafka messages as an RDD[String]? Then how about:
1. Start wit
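A minimal sketch of the idea (just an illustration, assuming the messages are JSON strings; it parses them into a DataFrame directly and writes Parquet once, with no temp file in between):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.SQLContext

    def directToParquet(sqlContext: SQLContext, messages: RDD[String], outDir: String): Unit = {
      val df = sqlContext.read.json(messages) // infers the schema from the JSON strings
      df.write.parquet(outDir)                // single write, no JSON round-trip
    }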
Hi Xinh,
For (1) the biggest problem is those null columns. E.g. the DF will have
~1000 columns, so every partition of that DF will carry all ~1000 columns,
and the slice for a single event type can have 996 all-null columns, which
is a big waste of space (in my case more than 80% on average) - see the
sketch below.
For (2) I can't really cha
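For illustration, pruning the all-null columns from one event type's slice could look roughly like this sketch - but note it forces a separate select-and-write per type, which is exactly the loop from (2):

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.count

    // count() ignores nulls, so a non-null count of 0 means the column is
    // entirely null in this slice and can be dropped before writing.
    def dropAllNullColumns(slice: DataFrame): DataFrame = {
      val nonNullCounts = slice.agg(
        count(slice.columns.head).as(slice.columns.head),
        slice.columns.tail.map(c => count(c).as(c)): _*
      ).first()
      val keep = slice.columns.filter(c => nonNullCounts.getAs[Long](c) > 0)
      slice.select(keep.head, keep.tail: _*)  // assumes at least one column survives
    }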
Hi Michal,
For (1), would it be possible to partitionBy two columns to reduce the
size? Something like partitionBy("event_type", "date").
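For example (a sketch - "timestamp" here is a stand-in for whatever event-time field you have):

    import org.apache.spark.sql.functions.to_date

    // Derive a date column from the event timestamp, then partition by both.
    val withDate = df.withColumn("date", to_date(df("timestamp")))
    withDate.write.partitionBy("event_type", "date").parquet("/path/to/output")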
For (2), is there a way to separate the different event types upstream,
like on different Kafka topics, and then process them separately?
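A sketch of what that could look like with one direct stream per topic (the broker address, topic name, and helper are made up):

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.StreamingContext
    import org.apache.spark.streaming.kafka.KafkaUtils

    // One stream per event type, each of which can be processed on its own.
    def streamFor(ssc: StreamingContext, brokers: String, topic: String) =
      KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
        ssc, Map("metadata.broker.list" -> brokers), Set(topic))

    // e.g. val clicks = streamFor(ssc, "broker1:9092", "events_click")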
Xinh