Yes, initialization on each turn is hard.. you seem to be using Python. Another
risky thing you can try is to serialize the MongoClient object using any
serializer (like Kryo wrappers, in Python) & pass it on to the mappers.. then in
each mapper you'll just have to deserialize it & use it directly... This
may
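The catch, noted further down the thread, is that pymongo's MongoClient will not pickle, because it holds sockets and thread locks. A quick check outside Spark, as a minimal sketch assuming only that pymongo is installed:

import pickle
from pymongo import MongoClient

client = MongoClient()                        # lazy-connecting client in recent pymongo versions
try:
    pickle.dumps(client)                      # this is the step a closure-capturing mapper would need
except (TypeError, pickle.PicklingError) as exc:
    print("MongoClient is not picklable:", exc)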
Thanks a lot, sorry for the really late reply! (Didn't have my laptop)
This is working, but it's dreadfully slow and seems to not run in parallel?
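If it runs but does not parallelize, one common cause is that the RDD has only a single partition, since foreachPartition/mapPartitions launches one task per partition. A minimal sketch of spreading the work out first, assuming the RDD of CSV lines is named lines (a name not taken from the thread):

lines = lines.repartition(8)        # pick a count that matches your cluster's cores
lines.foreachPartition(mapper)      # now up to 8 tasks, each with its own MongoClient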
On Mon, May 19, 2014 at 2:54 PM, Nick Pentreath wrote:
> You need to use mapPartitions (or foreachPartition) to instantiate your
> client in each partition as it is not serializable by the pickle library.
You need to use mapPartitions (or foreachPartition) to instantiate your
client in each partition as it is not serializable by the pickle library.
Something like
def mapper(iter):
    db = MongoClient()['spark_test_db']
    collec = db['programs']
    for val in iter:
        asc = val.encode('ascii', 'ignore')
        json = convertToJSON(asc, indexMap)
        collec.insert(json)
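To wire that in on the driver side (a usage sketch; lines is an assumed name for the RDD of CSV strings): foreachPartition fits pure side-effecting inserts, while mapPartitions expects the function to yield values back.

lines.foreachPartition(mapper)              # one task, and one MongoClient, per partition
# result = lines.mapPartitions(mapper)      # only if mapper is changed to yield something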
db = MongoClient()['spark_test_db']
collec = db['programs']

def mapper(val):
    asc = val.encode('ascii', 'ignore')
    json = convertToJSON(asc, indexMap)
    collec.insert(json)  # this is not working

def convertToJSON(string, indexMap):
    values = string.strip().split(",")
    json = {}
Ideally you have to pass the MongoClient object along with your data in the
mapper (Python should try to serialize your MongoClient, but explicit is
better). If the client is serializable then all should end well.. if not, then you are
better off using mapPartitions & initializing the driver in each partition.
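One way to be explicit here is to ship only the connection details (plain strings pickle fine) and construct the client inside each partition. A minimal sketch; the URI, RDD name, and collection layout below are assumptions rather than details from the thread:

mongo_uri = "mongodb://localhost:27017"          # hypothetical connection string

def insert_partition(records):
    from pymongo import MongoClient              # import on the worker
    collec = MongoClient(mongo_uri)['spark_test_db']['programs']
    for rec in records:                          # records is an iterator of already-converted dicts
        collec.insert(rec)

json_rdd.foreachPartition(insert_partition)      # 'json_rdd' is a hypothetical RDD of dicts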
Where's your driver code (the code interacting with the RDDs)? Are you
getting serialization errors?
On Saturday, May 17, 2014, Samarth Mailinglist wrote:
> Hi all,
>
> I am trying to store the results of a reduce into mongo.
> I want to share the variable "collection" in the mappers.
>
> Here's what I have so far (I'm using pymongo)
Hi all,
I am trying to store the results of a reduce into mongo.
I want to share the variable "collection" in the mappers.
Here's what I have so far (I'm using pymongo)
db = MongoClient()['spark_test_db']
collec = db['programs']
def mapper(val):