Avro support broken?

2019-07-04 Thread Paul Wais
Dear List, Has anybody gotten avro support to work in pyspark? I see multiple reports of it being broken on Stackoverflow and added my own repro to this ticket: https://issues.apache.org/jira/browse/SPARK-27623?focusedCommentId=16878896&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomm

pyspark - memory leak leading to OOM after submitting 100 jobs?

2019-10-20 Thread Paul Wais
Dear List, I've observed some sort of memory leak when using pyspark to run ~100 jobs in local mode. Each job is essentially a create RDD -> create DF -> write DF sort of flow. The RDD and DFs go out of scope after each job completes, hence I call this issue a "memory leak." Here's pseudocode:
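The pseudocode itself is cut off in this archive. Based only on the flow described above (per job: create RDD -> create DF -> write DF, ~100 times against one local-mode session), a hypothetical reconstruction — not the original poster's code, and the helper names `load_records`/`out_path` are invented — would look roughly like:

```
# pseudocode sketch of the reported pattern (hypothetical)
for i in range(100):                     # ~100 jobs, one local SparkSession
    rdd = spark.sparkContext.parallelize(load_records(i))
    df = spark.createDataFrame(rdd)      # RDD -> DataFrame
    df.write.parquet(out_path(i))        # write; rdd/df then go out of scope
```

The point of the report is that even though `rdd` and `df` are rebound each iteration, driver memory grows until OOM.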

Re: pyspark - memory leak leading to OOM after submitting 100 jobs?

2019-10-31 Thread Paul Wais
produce without reproducer and even couldn't reproduce even > they spent their time. Memory leak issue is not really easy to reproduce, > unless it leaks some objects without any conditions. > > - Jungtaek Lim (HeartSaVioR) > > On Sun, Oct 20, 2019 at 7:18 PM Paul Wais wrote

Re: spark 1.2 three times slower than spark 1.1, why?

2015-01-21 Thread Paul Wais
To force one instance per executor, you could explicitly subclass FlatMapFunction and have it lazy-create your parser in the subclass constructor. You might also want to try RDD#mapPartitions() (instead of RDD#flatMap()) if you want one instance per partition. This approach worked well for me when
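The "one parser per partition" idea can be shown without Spark at all. A pure-Python sketch (the `Parser` class is a stand-in for an expensive-to-construct parser; in Spark, `process_partition` is the kind of function you would pass to RDD#mapPartitions()):

```python
# Sketch of the one-instance-per-partition pattern: the parser is
# constructed once per partition, then reused for every record in it.

class Parser:
    instances = 0  # count constructions to demonstrate one per partition

    def __init__(self):
        Parser.instances += 1

    def parse(self, line):
        return line.upper()

def process_partition(lines):
    parser = Parser()          # created once per partition, not per record
    for line in lines:
        yield parser.parse(line)

# Drive it by hand over two in-memory "partitions":
partitions = [["a", "b"], ["c"]]
results = [out for part in partitions for out in process_partition(iter(part))]
# results == ["A", "B", "C"]; Parser.instances == 2 (one per partition)
```

With a plain per-record map, `Parser()` would have been constructed three times instead of two.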

Re: Support for SQL on unions of tables (merge tables?)

2015-01-21 Thread Paul Wais
; On 1/11/15 9:51 PM, Paul Wais wrote: >> >> >> Dear List, >> >> What are common approaches for addressing over a union of tables / RDDs? >> E.g. suppose I have a collection of log files in HDFS, one log file per day, >> and I want to compute the sum of some fi

Perf impact of BlockManager byte[] copies

2015-02-27 Thread Paul Wais
Dear List, I'm investigating some problems related to native code integration with Spark, and while picking through BlockManager I noticed that data (de)serialization currently issues lots of array copies. Specifically: - Deserialization: BlockManager marshals all deserialized bytes through a spa
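The copy-vs-view distinction at issue here can be illustrated in Python terms (this is an analogy, not Spark's actual code path): slicing a `bytes`/`bytearray` materializes a copy, while a `memoryview` slice shares the underlying buffer — the zero-copy behavior one would want from BlockManager.

```python
# Copy vs. zero-copy view over the same underlying buffer.
data = bytearray(b"0123456789")

copied = bytes(data[2:6])      # materializes a new 4-byte copy
view = memoryview(data)[2:6]   # zero-copy view into the same buffer

data[2:6] = b"WXYZ"            # mutate the underlying buffer in place

assert copied == b"2345"       # the copy is unaffected by the mutation
assert bytes(view) == b"WXYZ"  # the view reflects the mutation
```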

Release date for new pyspark

2014-07-16 Thread Paul Wais
gards, -Paul Wais

Re: Release date for new pyspark

2014-07-17 Thread Paul Wais
ld and pass tests on Jenkins. >> >> You shouldn't expect new features to be added to stable code in >> maintenance releases (e.g. 1.0.1). >> >> AFAIK, we're still on track with Spark 1.1.0 development, which means that >> it should be released sometime in

Spark 1.1 / cdh4 stuck using old hadoop client?

2014-09-15 Thread Paul Wais
Dear List, I'm having trouble getting Spark 1.1 to use the Hadoop 2 API for reading SequenceFiles. In particular, I'm seeing: Exception in thread "main" org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot communicate with client version 4 at org.apache.hadoop.ipc.Client.ca

Re: Spark 1.1 / cdh4 stuck using old hadoop client?

2014-09-16 Thread Paul Wais
it --master yarn-cluster ... > > will work, but > > spark-submit --master yarn-client ... > > will fail. > > > But on the personal build obtained from the command above, both will then > work. > > > -Christian > > > > > On Sep 15, 2014, at 6:

Re: Spark 1.1 / cdh4 stuck using old hadoop client?

2014-09-16 Thread Paul Wais
nd on Hadoop 1.0.4. I > suspect that's what you're doing -- packaging Spark(+Hadoop1.0.4) with > your app, when it shouldn't be packaged. > > Spark works out of the box with just about any modern combo of HDFS and YARN. > > On Tue, Sep 16, 2014 at 2:28 AM, Paul

Re: Stable spark streaming app

2014-09-17 Thread Paul Wais
Thanks Tim, this is super helpful! Question about jars and spark-submit: why do you provide myawesomeapp.jar as the program jar but then include other jars via the --jars argument? Have you tried building one uber jar with all dependencies and just sending that to Spark as your app jar? Also, h
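An uber jar is indeed a common alternative to a long --jars list. A minimal sketch using the Maven Shade plugin (assuming a Maven build; the version number is illustrative, and Spark itself should be marked `provided` so it is not bundled):

```xml
<!-- Illustrative maven-shade-plugin config: bundles the app and its
     dependencies into one jar, which is then the single jar passed to
     spark-submit. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>2.3</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
    </execution>
  </executions>
</plugin>
```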

Unable to find proto buffer class error with RDD

2014-09-18 Thread Paul Wais
Dear List, I'm writing an application where I have RDDs of protobuf messages. When I run the app via bin/spark-submit with --master local --driver-class-path path/to/my/uber.jar, Spark is able to ser/deserialize the messages correctly. However, if I run WITHOUT --driver-class-path path/to/my/uber.

Re: Unable to find proto buffer class error with RDD

2014-09-18 Thread Paul Wais
ache.org/repos/asf/hadoop/common/branches/branch-2.3.0/hadoop-project/pom.xml On Thu, Sep 18, 2014 at 1:06 AM, Paul Wais wrote: > Dear List, > > I'm writing an application where I have RDDs of protobuf messages. > When I run the app via bin/spark-submit with --master local >

Re: Unable to find proto buffer class error with RDD

2014-09-18 Thread Paul Wais
d3259.html * https://github.com/apache/spark/pull/181 * http://mail-archives.apache.org/mod_mbox/spark-user/201311.mbox/%3c7f6aa9e820f55d4a96946a87e086ef4a4bcdf...@eagh-erfpmbx41.erf.thomson.com%3E * https://groups.google.com/forum/#!topic/spark-users/Q66UOeA2u-I On Thu, Sep 18, 2014 at 4:51 PM,

Re: Unable to find proto buffer class error with RDD

2014-09-18 Thread Paul Wais
hmm would using Kryo help me here? On Thursday, September 18, 2014, Paul Wais wrote: > Ah, can one NOT create an RDD of any arbitrary Serializable type? It > looks like I might be getting bitten by the same > "java.io.ObjectInputStream uses root class loader only"

Re: Unable to find proto buffer class error with RDD

2014-09-18 Thread Paul Wais
es the problem): https://github.com/apache/spark/blob/2f9b2bd7844ee8393dc9c319f4fefedf95f5e460/core/src/main/scala/org/apache/spark/rdd/ParallelCollectionRDD.scala#L74 If uber.jar is on the classpath, then the root classloader would have the code, hence why --driver-class-path fixes the bug. On Thu, Sep 18, 201

Re: Unable to find proto buffer class error with RDD

2014-09-19 Thread Paul Wais
Well it looks like this is indeed a protobuf issue. Poked a little more with Kryo. Since protobuf messages are serializable, I tried just making Kryo use the JavaSerializer for my messages. The resulting stack trace made it look like protobuf GeneratedMessageLite is actually using the classloade

Re: Unable to find proto buffer class error with RDD

2014-09-19 Thread Paul Wais
Derp, one caveat to my "solution": I guess Spark doesn't use Kryo for Function serde :( On Fri, Sep 19, 2014 at 12:44 AM, Paul Wais wrote: > Well it looks like this is indeed a protobuf issue. Poked a little more > with Kryo. Since protobuf messages are serializable

Re: Any issues with repartition?

2014-10-08 Thread Paul Wais
Looks like an OOM issue? Have you tried persisting your RDDs to allow disk writes? I've seen a lot of similar crashes in a Spark app that reads from HDFS and does joins. I.e. I've seen "java.io.IOException: Filesystem closed," "Executor lost," "FetchFailed," etc etc with non-deterministic crashe

Do Spark executors restrict native heap vs JVM heap?

2014-10-30 Thread Paul Wais
freeMemory() shows gigabytes free and the native code needs only megabytes. Does spark limit the /native/ heap size somehow? Am poking through the executor code now but don't see anything obvious. Best Regards, -Paul Wais
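For context: the JVM's -Xmx caps only the JVM heap, so native malloc() calls made by JNI/JNA code are bounded by the OS or the container, not by Spark. In the Spark 1.x / YARN era the relevant knob was the executor memory-overhead setting. A sketch with illustrative values only:

```
# spark-defaults.conf sketch (values illustrative, not recommendations):
spark.executor.memory                4g     # JVM heap only (-Xmx)
spark.yarn.executor.memoryOverhead   1024   # MB of extra container room for
                                            # native / off-heap allocations
```

If native code ENOMEMs while the container still has headroom, the limit is typically coming from the container manager killing or constraining the process, not from the JVM heap settings.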

Do Spark executors restrict native heap vs JVM heap?

2014-11-02 Thread Paul Wais
s also taking memory. > > On Oct 30, 2014 6:43 PM, "Paul Wais" > wrote: >> >> Dear Spark List, >> >> I have a Spark app that runs native code inside map functions. I've >> noticed that the native code sometimes sets errno to ENOMEM indicating

Native / C/C++ code integration

2014-11-07 Thread Paul Wais
Dear List, Has anybody had experience integrating C/C++ code into Spark jobs? I have done some work on this topic using JNA. I wrote a FlatMapFunction that processes all partition entries using a C++ library. This approach works well, but there are some tradeoffs: * Shipping the native dylib
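The thread discusses calling native code from JVM map functions via JNA. The same idea can be shown self-contained with Python's stdlib `ctypes` (an analogue, not the JNA API): load the C library and call a native function from an ordinary map-style function. Assumes a POSIX system where `CDLL(None)` exposes libc.

```python
# ctypes analogue of the JNA pattern: bind a native function once,
# then call it per record from a map-style Python function.
import ctypes

libc = ctypes.CDLL(None)  # the already-linked C library (POSIX)
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

def native_len(s: bytes) -> int:
    return libc.strlen(s)  # each call crosses the Python <-> C boundary

lengths = [native_len(s) for s in [b"spark", b"native", b""]]
assert lengths == [5, 6, 0]
```

As with JNA, the per-call marshalling cost is the tradeoff; batching work per partition (one native call over many records) amortizes it.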

Re: Native / C/C++ code integration

2014-11-11 Thread Paul Wais
More thoughts. I took a deeper look at BlockManager, RDD, and friends. Suppose one wanted to get native code access to un-deserialized blocks. This task looks very hard. An RDD behaves much like a Scala iterator of deserialized values, and interop with BlockManager is all on deserialized data.

Support for SQL on unions of tables (merge tables?)

2015-01-11 Thread Paul Wais
Dear List, What are common approaches for querying over a union of tables / RDDs? E.g. suppose I have a collection of log files in HDFS, one log file per day, and I want to compute the sum of some field over a date range in SQL. Using log schema, I can read each as a distinct SchemaRDD, but I wa
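The "one table per day, query over the union" shape can be sketched with stdlib sqlite3 (table and column names are illustrative, not from the original thread): create per-day tables, define a UNION ALL view, then run a ranged SUM against the view.

```python
# sqlite3 sketch: per-day log tables unified behind a UNION ALL view.
import sqlite3

conn = sqlite3.connect(":memory:")
for day, vals in [("log_20150101", [1, 2]), ("log_20150102", [3, 4])]:
    conn.execute(f"CREATE TABLE {day} (day TEXT, bytes INTEGER)")
    conn.executemany(f"INSERT INTO {day} VALUES (?, ?)",
                     [(day[4:], v) for v in vals])  # day column, e.g. '20150101'

# The view plays the role of the merged table / unioned SchemaRDD.
conn.execute("""CREATE VIEW logs AS
    SELECT * FROM log_20150101
    UNION ALL
    SELECT * FROM log_20150102""")

(total,) = conn.execute(
    "SELECT SUM(bytes) FROM logs WHERE day >= '20150101'").fetchone()
assert total == 10  # 1 + 2 + 3 + 4
```

In Spark terms the analogous move is unioning the per-day SchemaRDDs (or, later, DataFrames) and registering the result as one queryable table.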