EMR: Use extra mounted EBS volumes for spark.local.dir

2017-10-08 Thread Tushar Sudake
Hello everyone, I'm using 'r4.8xlarge' instances on EMR for my Spark application. To each node, I'm attaching one 512 GB EBS volume. By logging into the nodes, I tried to verify that EMR automatically sets this volume for 'spark.local.dir', but couldn't find any such configuration. Can

Re: Quick one... AWS SDK version?

2017-10-08 Thread Tushar Sudake
Hi Jonathan, Does that mean Hadoop-AWS 2.7.3 too is built against AWS SDK 1.11.160 and not 1.7.4? Thanks. On Oct 7, 2017 3:50 PM, "Jean Georges Perrin" wrote: Hey Marco, I am actually reading from S3 and I use 2.7.3, but I inherited the project and they use some AWS API from Amazon SDK, whi

How to merge fragmented IDs into one cluster if one/more IDs are shared

2017-10-05 Thread Tushar Sudake
Hello Sparkans, I want to merge the following clusters (sets of IDs) into one if they share IDs. For example: uuid_3_1,uuid_3_2,uuid_3_3,uuid_3_4 uuid_3_2,uuid_3_5,uuid_3_6 uuid_3_5,uuid_3_7,uuid_3_8,uuid_3_9 into single: uuid_3_1,uuid_3_2,uuid_3_3,uuid_3_4,uuid_3_5,uuid_3_6,uuid_3_7,uuid_3_8,
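
The merge described in this preview amounts to finding connected components: two sets belong together if a chain of shared IDs links them. Below is a minimal single-machine sketch using union-find, not the approach taken in the thread. The function name `merge_clusters` is hypothetical, and the input is assumed to fit in memory; for Spark-scale data a distributed connected-components routine (for example the one in GraphX) would likely be a better fit.

```python
from collections import defaultdict


def merge_clusters(clusters):
    """Merge lists of IDs that share at least one ID (hypothetical helper)."""
    parent = {}

    def find(x):
        # Register unseen IDs as their own root.
        parent.setdefault(x, x)
        root = x
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every node on the path at the root.
        while parent[x] != root:
            parent[x], x = root, parent[x]
        return root

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # Link every ID in a cluster to the first ID of that cluster.
    for cluster in clusters:
        if not cluster:
            continue
        first = cluster[0]
        find(first)
        for other in cluster[1:]:
            union(first, other)

    # Group IDs by their final root.
    merged = defaultdict(set)
    for node in list(parent):
        merged[find(node)].add(node)
    return [sorted(ids) for ids in merged.values()]


# Input sets taken from the example in the message above.
example = [
    ["uuid_3_1", "uuid_3_2", "uuid_3_3", "uuid_3_4"],
    ["uuid_3_2", "uuid_3_5", "uuid_3_6"],
    ["uuid_3_5", "uuid_3_7", "uuid_3_8", "uuid_3_9"],
]
result = merge_clusters(example)
```

The first and second sets share uuid_3_2, and the second and third share uuid_3_5, so all three collapse into one group; sets with no shared ID stay separate.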

'Premature end of Content-Length' using S3A to read huge data

2017-08-21 Thread Tushar Sudake
Hello, I'm writing a Spark-based application that works on a pretty huge dataset stored on S3. It's about **15 TB** in size uncompressed. The data is spread across multiple small LZO-compressed files, ranging from 10 to 100 MB. By default the job spawns 130k tasks while reading dataset and mapping