Hey Akriti23,
PySpark gives you a saveAsParquetFile() API to save your RDD as Parquet. You
will, however, need to infer the schema or describe it manually before you
can do so. Here are the docs about that (v1.2.1; you can search for the other
versions, they are relatively similar from 1.1 up):
http://spa
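A minimal sketch of how that fits together in the 1.2-era API: you build a
SchemaRDD via SQLContext.inferSchema() (or applySchema() for a manual
schema), and the SchemaRDD exposes saveAsParquetFile(). The sample rows and
output path below are made up for illustration:

```python
from pyspark import SparkContext
from pyspark.sql import SQLContext, Row

sc = SparkContext(appName="parquet-example")
sqlContext = SQLContext(sc)

# A plain RDD of Rows; inferSchema() samples the values to build a schema.
rdd = sc.parallelize([Row(name="alice", age=30),
                      Row(name="bob", age=25)])
schema_rdd = sqlContext.inferSchema(rdd)

# SchemaRDD supports saveAsParquetFile(); the path here is hypothetical.
schema_rdd.saveAsParquetFile("/tmp/people.parquet")
```

(In 1.3 and later, inferSchema() was superseded by createDataFrame() and the
result is a DataFrame with a write.parquet() method, but the idea is the
same.)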
In case anyone wants to learn about my solution for this:
groupByKey is highly inefficient: it shuffles every element across
partitions, and it requires each worker to have enough memory to hold all of
the values for a group. So instead of using groupByKey, I ended up taking
the fla
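(The message above is cut off mid-word, so the author's exact approach isn't
recoverable.) One common way to avoid groupByKey when you only need a
per-key aggregate is reduceByKey, which combines values within each
partition before the shuffle, so only partial results cross the network and
no worker has to materialize a whole group. A minimal sketch; the data and
the sum aggregation are illustrative, not the author's actual solution:

```python
from pyspark import SparkContext

sc = SparkContext(appName="reduce-example")

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("b", 4)])

# groupByKey would ship every value across the network and hold each
# full group in one worker's memory:
#   grouped = pairs.groupByKey().mapValues(sum)

# reduceByKey pre-aggregates map-side, so only partial sums are shuffled.
sums = pairs.reduceByKey(lambda a, b: a + b)

print(sums.collect())  # e.g. [('a', 4), ('b', 6)]
```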