Re: Using mongo with PySpark

2014-06-06 Thread Mayur Rustagi
Yes, initialization on each turn is hard.. you seem to be using Python. Another risky thing you can try is to serialize the mongoclient object using any serializer (like Kryo wrappers in Python) & pass it on to the mappers.. then in each mapper you'll just have to deserialize it & use it directly... This may

Re: Using mongo with PySpark

2014-06-04 Thread Samarth Mailinglist
Thanks a lot, sorry for the really late reply! (Didn't have my laptop) This is working, but it's dreadfully slow and seems to not run in parallel? On Mon, May 19, 2014 at 2:54 PM, Nick Pentreath wrote: > You need to use mapPartitions (or foreachPartition) to instantiate your > client in each p

Re: Using mongo with PySpark

2014-05-19 Thread Nick Pentreath
You need to use mapPartitions (or foreachPartition) to instantiate your client in each partition, as it is not serializable by the pickle library. Something like:

def mapper(iter):
    db = MongoClient()['spark_test_db']
    collec = db['programs']
    for val in iter:
        asc = val.encode('
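A minimal sketch of the per-partition pattern described above, assuming pymongo against a local MongoDB; the rdd variable and the convertToJSON/indexMap helpers are taken from elsewhere in this thread and assumed to be in scope:

    from pymongo import MongoClient

    def insert_partition(iterator):
        # Create the client inside the partition so nothing unpicklable
        # is captured in the closure shipped to the workers.
        db = MongoClient()['spark_test_db']
        collec = db['programs']
        for val in iterator:
            asc = val.encode('ascii', 'ignore')
            collec.insert(convertToJSON(asc, indexMap))

    # Driver side: write each partition out to mongo without collecting
    # anything back to the driver.
    rdd.foreachPartition(insert_partition)

This opens one connection per partition rather than one per record, which is also what keeps the inserts from being dreadfully slow.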

Re: Using mongo with PySpark

2014-05-18 Thread Samarth Mailinglist
db = MongoClient()['spark_test_db']
collec = db['programs']

def mapper(val):
    asc = val.encode('ascii', 'ignore')
    json = convertToJSON(asc, indexMap)
    collec.insert(json)  # this is not working

def convertToJSON(string, indexMap):
    values = string.strip().split(",")
    json = {}

Re: Using mongo with PySpark

2014-05-17 Thread Mayur Rustagi
Ideally you have to pass the mongoclient object along with your data in the mapper (Python should try to serialize your mongoclient, but explicit is better). If the client is serializable then all should end well.. if not, then you are better off using mapPartitions & initializing the driver in each
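A quick way to test the "if client is serializable" condition (this check is a sketch, not part of the original message) is to try pickling the client directly, since pickle is what PySpark uses to ship closures to the workers:

    import pickle
    from pymongo import MongoClient

    try:
        # PySpark pickles whatever a closure captures, so this is roughly
        # the same test the cluster would perform implicitly.
        pickle.dumps(MongoClient())
        print("client pickles; passing it along to the mappers could work")
    except Exception as err:
        print("client is not picklable; fall back to per-partition init:", err)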

Re: Using mongo with PySpark

2014-05-17 Thread Nicholas Chammas
Where's your driver code (the code interacting with the RDDs)? Are you getting serialization errors? On Saturday, May 17, 2014, Samarth Mailinglist wrote: > Hi all, > > I am trying to store the results of a reduce into mongo. > I want to share the variable "collection" in the mappers. > > > Here's wha

Using mongo with PySpark

2014-05-17 Thread Samarth Mailinglist
Hi all, I am trying to store the results of a reduce into mongo. I want to share the variable "collection" in the mappers. Here's what I have so far (I'm using pymongo):

db = MongoClient()['spark_test_db']
collec = db['programs']

def
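For the driver-code question Nicholas asks above, a hypothetical driver-side call (the actual driver code never appears in the thread) would look something like the line below; in PySpark the mapper closure, and the collec client it captures, must be pickled to reach the workers, which is where this setup breaks down:

    # Hypothetical driver code, not shown in the original thread; sc is an
    # existing SparkContext and lines is the input data.
    sc.parallelize(lines).map(mapper).collect()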