Handling very large volume (500TB) of data using Spark

2018-08-25 Thread Great Info
Hi All, I have a large volume of data, nearly 500TB (from 2016 to date), and I have to do some ETL on it. The data is in AWS S3, so I am planning to use an AWS EMR setup to process it, but I am not sure what configuration I should select. 1. Do I need to process monthly o
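A minimal sketch (Scala/Spark) of one way to keep each job's working set small: process one month's S3 prefix per EMR run instead of the full 500TB at once. The bucket name, path layout (year=/month= partitions), Parquet format, and the filter are all assumptions for illustration, not details from the original post.

// Hypothetical monthly ETL job: args select the month to process in this run.
import org.apache.spark.sql.SparkSession

object MonthlyEtl {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("monthly-etl")
      .getOrCreate()

    // e.g. args = Array("2017", "03")
    val Array(year, month) = args

    // Read only the requested month's partition, not the whole dataset
    val input = spark.read
      .parquet(s"s3://my-bucket/events/year=$year/month=$month/")

    // Placeholder for the actual ETL transformations
    val transformed = input.filter("event_type IS NOT NULL")

    // Write results back to S3, partitioned the same way for downstream reads
    transformed.write
      .mode("overwrite")
      .parquet(s"s3://my-bucket/etl-output/year=$year/month=$month/")

    spark.stop()
  }
}

Running one such job per month (sequentially or as separate EMR steps) also makes it easier to size the cluster, since the memory and shuffle footprint is bounded by a single month's data rather than the full history.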

Spark: rename or access columns which have special chars " ?:

2018-07-13 Thread Great Info
I have columns like below:

root
 |-- metadata: struct (nullable = true)
 |    |-- "drop":{"dropPath":" https://dstpath.media27.ec2.st-av.net/drop?source_id: string (nullable = true)
 |    |-- "selection":{"AlllURL":" https://dstpath.media27.ec2.st-av.net/image?source_id: string (
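A minimal sketch (Scala/Spark) of one common way to handle such names: wrap the field name in backticks when referring to it, then alias it to a clean name. The input source, the DataFrame name df, and the target column name drop_path are assumptions; the field name itself is copied from the schema above, and whether backtick quoting resolves it may depend on the Spark version.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("rename-special-cols").getOrCreate()

// Hypothetical source; substitute however the DataFrame is actually loaded
val df = spark.read.json("s3://my-bucket/input/")

// Struct field name containing quotes, colons, etc., copied from printSchema()
val uglyField =
  """"drop":{"dropPath":" https://dstpath.media27.ec2.st-av.net/drop?source_id"""

// Backticks let the analyzer treat the whole name as one identifier;
// alias() gives it a clean name in the result
val cleaned = df.select(
  col(s"metadata.`$uglyField`").alias("drop_path")
)
cleaned.printSchema()

For top-level (non-nested) columns, df.withColumnRenamed(oldName, newName) or df.toDF(cleanNames: _*) are alternative ways to strip the special characters before doing further processing.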