Re: Best way to deal with skewed partition sizes

2017-03-23 Thread Gourav Sengupta
Hi, In the latest release of Spark I have seen significant improvements when the data is in Parquet format, which I see yours is. But since you are not using SparkSession and are instead using Spark's older APIs through sqlContext, there is a high chance that you are not using the Spark impro

Re: Best way to deal with skewed partition sizes

2017-03-23 Thread Gourav Sengupta
On another note, is there any particular reason you are using s3a:// instead of s3://?

Regards,
Gourav

On Wed, Mar 22, 2017 at 8:30 PM, Matt Deaver wrote:
> For various reasons, our data set is partitioned in Spark by customer id
> and saved to S3. When trying to read this data, however,

Re: Best way to deal with skewed partition sizes

2017-03-22 Thread Ryan
Could you share the event timeline and DAG for the time-consuming stages from the Spark UI?

On Thu, Mar 23, 2017 at 4:30 AM, Matt Deaver wrote:
> For various reasons, our data set is partitioned in Spark by customer id
> and saved to S3. When trying to read this data, however, the larger
> partitions m