I'm trying to use MongoDB as a destination for an ETL I'm writing in Spark. It appears I'm incurring a lot of overhead in my system databases (and possibly in the primary documents themselves); I can only assume it's because I'm stuck using PairRDD.saveAsNewAPIHadoopFile.
- Is there a way to batch some of the data together and use Casbah natively so I can do bulk inserts? (A rough sketch of what I have in mind is below.)
- Is there maybe a less "hacky" way to load into MongoDB (instead of using saveAsNewAPIHadoopFile)?

Thanks in advance!
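For the first question, this is roughly what I'm imagining: open a Casbah connection per partition and insert in fixed-size batches. The record type, host, database, and collection names here are just placeholders for my actual ETL output, and I'm not sure whether a connection per partition is the right approach:

```scala
import com.mongodb.casbah.Imports._
import org.apache.spark.rdd.RDD

// Placeholder record type standing in for whatever my ETL actually produces.
case class Event(id: String, payload: String)

def saveToMongo(rdd: RDD[Event]): Unit = {
  rdd.foreachPartition { partition =>
    // Create the client on the executor so nothing non-serializable
    // gets captured in the closure shipped from the driver.
    val client = MongoClient("localhost", 27017)   // placeholder host/port
    val coll   = client("etl_db")("events")        // placeholder db/collection
    try {
      partition
        .map(e => MongoDBObject("_id" -> e.id, "payload" -> e.payload))
        .grouped(1000) // batch size pulled out of thin air
        .foreach { batch =>
          // One insert call per batch instead of one round trip per document.
          coll.insert(batch.head, batch.tail: _*)
        }
    } finally {
      client.close()
    }
  }
}
```

Is something along these lines reasonable, or does it just trade one kind of overhead for another?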