Hey Akriti23,
PySpark gives you a saveAsParquetFile() API to save your RDD as Parquet. You
will, however, need to infer the schema or describe it manually before you
can do so. Here are the docs about that (v1.2.1; you can search for the other
versions, they are relatively similar from 1.1 up):
http://spa
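A minimal sketch of how that fits together in the 1.2-era API: you build a
SchemaRDD via SQLContext.inferSchema() (or applySchema() for a manual
schema), and the SchemaRDD exposes saveAsParquetFile(). The sample rows and
output path below are made up for illustration:

```python
from pyspark import SparkContext
from pyspark.sql import SQLContext, Row

sc = SparkContext(appName="parquet-example")
sqlContext = SQLContext(sc)

# A plain RDD of Rows; inferSchema() samples the values to build a schema.
rdd = sc.parallelize([Row(name="alice", age=30),
                      Row(name="bob", age=25)])
schema_rdd = sqlContext.inferSchema(rdd)

# SchemaRDD supports saveAsParquetFile(); the path here is hypothetical.
schema_rdd.saveAsParquetFile("/tmp/people.parquet")
```

(In 1.3 and later, inferSchema() was superseded by createDataFrame() and the
result is a DataFrame with a write.parquet() method, but the idea is the
same.)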
In case anyone wants to learn about my solution for this:
groupByKey is highly inefficient: it shuffles every element across
partitions, and it requires each worker to have enough memory to hold all of
the values for a group. So instead of using groupByKey, I ended up taking
the fla
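(The message above is cut off mid-word, so the author's exact approach isn't
recoverable.) One common way to avoid groupByKey when you only need a
per-key aggregate is reduceByKey, which combines values within each
partition before the shuffle, so only partial results cross the network and
no worker has to materialize a whole group. A minimal sketch; the data and
the sum aggregation are illustrative, not the author's actual solution:

```python
from pyspark import SparkContext

sc = SparkContext(appName="reduce-example")

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("b", 4)])

# groupByKey would ship every value across the network and hold each
# full group in one worker's memory:
#   grouped = pairs.groupByKey().mapValues(sum)

# reduceByKey pre-aggregates map-side, so only partial sums are shuffled.
sums = pairs.reduceByKey(lambda a, b: a + b)

print(sums.collect())  # e.g. [('a', 4), ('b', 6)]
```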