Hi Xinh,
Sorry for my late reply.
It's slow for two reasons (at least to my knowledge):
1. Lots of IO - writing as JSON, then reading it back and writing it again as Parquet.
2. Because of the nested RDD I can't run the cycle and filter by event_type
in parallel - this applies to your solution (3rd step).
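Roughly, the current pipeline has the shape of the sketch below (simplified; the function name, directory layout, and the contains-based filter are placeholders, not the real code):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.SQLContext

    def slowPipeline(sqlContext: SQLContext, events: RDD[String],
                     eventTypes: Seq[String], tmpDir: String, outDir: String): Unit = {
      for (eventType <- eventTypes) {                      // sequential driver-side loop (reason 2)
        val subset = events.filter(_.contains(eventType))  // stand-in for the real filter
        subset.saveAsTextFile(s"$tmpDir/$eventType")       // IO: write the JSON strings out
        val df = sqlContext.read.json(s"$tmpDir/$eventType") // IO: read them back, infer schema
        df.write.parquet(s"$outDir/$eventType")            // IO: write everything again as Parquet
      }
    }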
Hi Michal,
Why is your solution so slow? Is it from the file IO caused by storing the
data in a temp file as JSON and then reading it back in and writing it out
as Parquet?
How are you getting "events" in the first place?
Do you have the original Kafka messages as an RDD[String]? Then how about:
1. Start wit
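A minimal sketch of the idea (just an illustration, assuming the messages are JSON strings; it parses them into a DataFrame directly and writes Parquet once, with no temp file in between):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.SQLContext

    def directToParquet(sqlContext: SQLContext, messages: RDD[String], outDir: String): Unit = {
      val df = sqlContext.read.json(messages) // infers the schema from the JSON strings
      df.write.parquet(outDir)                // single write, no JSON round-trip
    }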
Hi Xinh,
For (1) the biggest problem is those null columns. E.g. the DF will have
~1000 columns, so every partition of that DF will carry all ~1000 columns,
and the slice for a single event type can have 996 all-null columns, which
is a big waste of space (in my case more than 80% on average) - see the
sketch below.
For (2) I can't really cha
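For illustration, pruning the all-null columns from one event type's slice could look roughly like this sketch - but note it forces a separate select-and-write per type, which is exactly the loop from (2):

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.count

    // count() ignores nulls, so a non-null count of 0 means the column is
    // entirely null in this slice and can be dropped before writing.
    def dropAllNullColumns(slice: DataFrame): DataFrame = {
      val nonNullCounts = slice.agg(
        count(slice.columns.head).as(slice.columns.head),
        slice.columns.tail.map(c => count(c).as(c)): _*
      ).first()
      val keep = slice.columns.filter(c => nonNullCounts.getAs[Long](c) > 0)
      slice.select(keep.head, keep.tail: _*)  // assumes at least one column survives
    }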
Hi Michal,
For (1), would it be possible to partitionBy two columns to reduce the
size? Something like partitionBy("event_type", "date").
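For example (a sketch - "timestamp" here is a stand-in for whatever event-time field you have):

    import org.apache.spark.sql.functions.to_date

    // Derive a date column from the event timestamp, then partition by both.
    val withDate = df.withColumn("date", to_date(df("timestamp")))
    withDate.write.partitionBy("event_type", "date").parquet("/path/to/output")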
For (2), is there a way to separate the different event types upstream,
like on different Kafka topics, and then process them separately?
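A sketch of what that could look like with one direct stream per topic (the broker address, topic name, and helper are made up):

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.StreamingContext
    import org.apache.spark.streaming.kafka.KafkaUtils

    // One stream per event type, each of which can be processed on its own.
    def streamFor(ssc: StreamingContext, brokers: String, topic: String) =
      KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
        ssc, Map("metadata.broker.list" -> brokers), Set(topic))

    // e.g. val clicks = streamFor(ssc, "broker1:9092", "events_click")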
Xinh