foreach vs. map isn't the issue. Both require serializing the called function, so the pickle error would still apply, yes?
And at the moment, I'm just testing. I definitely wouldn't want to log something for each element, but I may want to detect something and log for SOME elements.

So my question is: how are other people doing logging from distributed tasks, given the serialization issues? The same issue actually exists in Scala, too. I could work around it by creating a small serializable object that provides a logger (rough sketch below the quoted thread), but it seems kind of kludgy to me, so I'm wondering whether other people are logging from tasks, and if so, how?

Diana

On Tue, May 6, 2014 at 6:24 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:

> I think you're looking for
> RDD.foreach()<http://spark.apache.org/docs/latest/api/pyspark/pyspark.rdd.RDD-class.html#foreach>.
>
> According to the programming
> guide<http://spark.apache.org/docs/latest/scala-programming-guide.html>:
>
>> Run a function func on each element of the dataset. This is usually done
>> for side effects such as updating an accumulator variable (see below) or
>> interacting with external storage systems.
>
> Do you really want to log something for each element of your RDD?
>
> Nick
>
> On Tue, May 6, 2014 at 3:31 PM, Diana Carroll <dcarr...@cloudera.com> wrote:
>
>> What should I do if I want to log something as part of a task?
>>
>> This is what I tried. To set up a logger, I followed the advice here:
>> http://py4j.sourceforge.net/faq.html#how-to-turn-logging-on-off
>>
>> logger = logging.getLogger("py4j")
>> logger.setLevel(logging.INFO)
>> logger.addHandler(logging.StreamHandler())
>>
>> This works fine when I call it from my driver (i.e. pyspark):
>>
>> logger.info("this works fine")
>>
>> But I want to try logging within a distributed task, so I did this:
>>
>> def logTestMap(a):
>>     logger.info("test")
>>     return a
>>
>> myrdd.map(logTestMap).count()
>>
>> and got:
>>
>> PicklingError: Can't pickle 'lock' object
>>
>> So it's trying to serialize my function and can't because of a lock
>> object used in the logger, presumably for thread safety. But then how
>> would I do it? Or is this just a really bad idea?
>>
>> Thanks,
>> Diana
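
For concreteness, here is a rough sketch of the workaround mentioned above: a small picklable wrapper that carries only the logger's name and level, and creates the real logger lazily in whichever process the task runs in, so nothing holding a lock is captured in the closure. The LazyLogger name, the keep() predicate, and the log message are illustrative placeholders rather than code from this thread; myrdd is the RDD from the example above.

import logging

class LazyLogger(object):
    """Picklable stand-in for a logger: only the name and level are
    serialized; the real logger (which holds a lock) is created on
    first use in whatever process the task actually runs in."""

    def __init__(self, name="py4j", level=logging.INFO):
        self.name = name
        self.level = level
        self._logger = None

    def __getstate__(self):
        # Drop the unpicklable logger object when the wrapper is serialized.
        return {"name": self.name, "level": self.level}

    def __setstate__(self, state):
        self.__dict__.update(state)
        self._logger = None

    def get(self):
        if self._logger is None:
            self._logger = logging.getLogger(self.name)
            self._logger.setLevel(self.level)
            if not self._logger.handlers:
                self._logger.addHandler(logging.StreamHandler())
        return self._logger

log = LazyLogger()

def keep(a):
    # hypothetical predicate for the "log for SOME elements" case
    return a is not None

def logTestMap(a):
    if keep(a):
        log.get().info("interesting element: %r", a)
    return a

myrdd.map(logTestMap).count()

The simpler route is to call logging.getLogger() inside the mapped function itself, so no logger object is referenced from the closure at all. Either way, whatever the task writes goes to the Python worker's stderr, so these messages should show up in the executor logs rather than in the pyspark shell.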