Re: Using mongo with PySpark

2014-06-06 Thread Mayur Rustagi
Yes, initialization on each turn is hard.. you seem to be using Python. Another risky thing you can try is to serialize the mongoclient object using any serializer (like Kryo wrappers in Python) & pass it on to the mappers.. then in each mapper you'll just have to deserialize it & use it directly... This may

Re: Using mongo with PySpark

2014-06-04 Thread Samarth Mailinglist
Thanks a lot, sorry for the really late reply! (Didn't have my laptop) This is working, but it's dreadfully slow and seems to not run in parallel? On Mon, May 19, 2014 at 2:54 PM, Nick Pentreath wrote: > You need to use mapPartitions (or foreachPartition) to instantiate your > client in each p

Re: Using mongo with PySpark

2014-05-19 Thread Nick Pentreath
You need to use mapPartitions (or foreachPartition) to instantiate your client in each partition, as it is not serializable by the pickle library. Something like:

def mapper(iter):
    db = MongoClient()['spark_test_db']
    collec = db['programs']
    for val in iter:
        asc = val.encode('
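A minimal sketch of the per-partition pattern described above, assuming pymongo against a local MongoDB; the rdd variable and the convertToJSON/indexMap helpers are taken from elsewhere in this thread and assumed to be in scope:

    from pymongo import MongoClient

    def insert_partition(iterator):
        # Create the client inside the partition so nothing unpicklable
        # is captured in the closure shipped to the workers.
        db = MongoClient()['spark_test_db']
        collec = db['programs']
        for val in iterator:
            asc = val.encode('ascii', 'ignore')
            collec.insert(convertToJSON(asc, indexMap))

    # Driver side: write each partition out to mongo without collecting
    # anything back to the driver.
    rdd.foreachPartition(insert_partition)

This opens one connection per partition rather than one per record, which is also what keeps the inserts from being dreadfully slow.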

Re: Using mongo with PySpark

2014-05-18 Thread Samarth Mailinglist
db = MongoClient()['spark_test_db']
collec = db['programs']

def mapper(val):
    asc = val.encode('ascii', 'ignore')
    json = convertToJSON(asc, indexMap)
    collec.insert(json)  # this is not working

def convertToJSON(string, indexMap):
    values = string.strip().split(",")
    json = {}

Re: Using mongo with PySpark

2014-05-17 Thread Mayur Rustagi
Ideally you have to pass the mongoclient object along with your data in the mapper (Python should try to serialize your mongoclient, but explicit is better). If the client is serializable then all should end well.. if not, then you are better off using mapPartitions & initializing the driver in each
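A quick way to test the "if client is serializable" condition (this check is a sketch, not part of the original message) is to try pickling the client directly, since pickle is what PySpark uses to ship closures to the workers:

    import pickle
    from pymongo import MongoClient

    try:
        # PySpark pickles whatever a closure captures, so this is roughly
        # the same test the cluster would perform implicitly.
        pickle.dumps(MongoClient())
        print("client pickles; passing it along to the mappers could work")
    except Exception as err:
        print("client is not picklable; fall back to per-partition init:", err)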

Re: Using mongo with PySpark

2014-05-17 Thread Nicholas Chammas
Where's your driver code (the code interacting with the RDDs)? Are you getting serialization errors? On Saturday, May 17, 2014, Samarth Mailinglist wrote: > Hi all, > > I am trying to store the results of a reduce into mongo. > I want to share the variable "collection" in the mappers. > > > Here's wha

Using mongo with PySpark

2014-05-17 Thread Samarth Mailinglist
Hi all, I am trying to store the results of a reduce into mongo. I want to share the variable "collection" in the mappers. Here's what I have so far (I'm using pymongo):

db = MongoClient()['spark_test_db']
collec = db['programs']

def
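For the driver-code question Nicholas asks above, a hypothetical driver-side call (the actual driver code never appears in the thread) would look something like the line below; in PySpark the mapper closure, and the collec client it captures, must be pickled to reach the workers, which is where this setup breaks down:

    # Hypothetical driver code, not shown in the original thread; sc is an
    # existing SparkContext and lines is the input data.
    sc.parallelize(lines).map(mapper).collect()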