Re: Efficient way to split an input data set into different output files

2014-11-19 Thread Nicholas Chammas
I don't have a solution for you, but it sounds like you might want to follow this issue: SPARK-3533 - Add saveAsTextFileByKey() method to RDDs.

On Wed Nov 19 2014 at 6:41:11 AM Tom Seddon wrote:
> I'm trying to set up a PySpark ETL job that takes in JSON log files [...]
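[Editor's note: until something like SPARK-3533 lands, one single-pass workaround in PySpark is to route each event type to its own partition with partitionBy, so that one save produces one part file per type. A minimal sketch, assuming a hypothetical input path and an "event_type" field; this technique is not from the thread itself:]

    import json
    from pyspark import SparkContext

    sc = SparkContext(appName="split-by-event-type")

    # Hypothetical input path and field name -- adjust for the real logs.
    events = sc.textFile("hdfs:///logs/input/*.json").map(json.loads)

    # Key each record by its event type; cache so the distinct-keys job
    # below does not force a second parse of the raw input.
    keyed = events.map(lambda e: (e["event_type"], json.dumps(e))).cache()

    # One partition per event type, so each output part file holds
    # exactly one type after the shuffle.
    types = sorted(keyed.keys().distinct().collect())
    index = {t: i for i, t in enumerate(types)}

    (keyed
     .partitionBy(len(types), lambda k: index[k])
     .map(lambda kv: kv[1])          # drop the key before writing
     .saveAsTextFile("hdfs:///logs/output/facts"))

[The part files come out numbered (part-00000, part-00001, ...) rather than named by type; a driver-side rename against the sorted list of types can fix that up after the write.]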

Efficient way to split an input data set into different output files

2014-11-19 Thread Tom Seddon
I'm trying to set up a PySpark ETL job that takes in JSON log files and spits out fact table files for upload to Redshift. Is there an efficient way to send different event types to different outputs without having to read the same cached RDD twice? I have my first RDD, which is just a json p[...]
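[Editor's note: the pattern the question describes -- parse once, cache, then filter and save per event type -- looks roughly like this minimal sketch; the paths, field name, and event types are placeholders, not details from the thread:]

    import json
    from pyspark import SparkContext

    sc = SparkContext(appName="etl-split")

    # Parse the JSON logs once and keep the result in memory, so each
    # filter below rescans the cache rather than the raw files.
    events = sc.textFile("s3n://logs/input/*.json").map(json.loads).cache()

    for event_type in ("page_view", "click"):  # placeholder event types
        (events
         # bind event_type as a default arg so the closure captures the
         # current loop value rather than the last one
         .filter(lambda e, t=event_type: e["event_type"] == t)
         .map(json.dumps)
         .saveAsTextFile("s3n://logs/output/%s" % event_type))

[Each pass still scans the cached RDD once per event type, which is exactly the cost the question is asking to avoid; a keyed single-pass write is what SPARK-3533, mentioned in the reply above, aims to provide.]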