Dear List,
Has anybody gotten Avro support to work in PySpark? I see multiple
reports of it being broken on Stack Overflow and added my own repro to
this ticket:
https://issues.apache.org/jira/browse/SPARK-27623?focusedCommentId=16878896&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomm
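For context, the call pattern being tested is just the stock DataFrame reader/writer; a minimal sketch, assuming Spark 2.4.x with the external spark-avro package supplied via --packages (the package version and paths below are placeholders):

    # e.g. spark-submit --packages org.apache.spark:spark-avro_2.11:2.4.3 repro.py
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("avro-repro").getOrCreate()

    # Round-trip a tiny DataFrame through Avro; the output path is a placeholder.
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])
    df.write.format("avro").mode("overwrite").save("/tmp/avro_out")
    spark.read.format("avro").load("/tmp/avro_out").show()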
Dear List,
I've observed some sort of memory leak when using PySpark to run ~100
jobs in local mode. Each job is essentially a create RDD -> create DF
-> write DF sort of flow. The RDD and DFs go out of scope after each
job completes, hence I call this issue a "memory leak." Here's
pseudocode:
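Roughly (row contents, counts, and output paths are made up for illustration):

    from pyspark.sql import Row, SparkSession

    spark = SparkSession.builder.master("local[*]").getOrCreate()
    sc = spark.sparkContext

    for job_id in range(100):
        # create RDD (stand-in data; the real jobs read job-specific inputs)
        rdd = sc.parallelize([Row(job=job_id, value=i) for i in range(1000)])
        # create DF
        df = spark.createDataFrame(rdd)
        # write DF
        df.write.mode("overwrite").parquet("/tmp/out_job_%d" % job_id)
        # rdd and df go out of scope at the end of each iteration, yet driver
        # memory use keeps growing from job to job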
> produce without a reproducer, and even couldn't reproduce even though
> they spent their time. A memory leak issue is not really easy to reproduce
> unless it leaks objects unconditionally.
>
> - Jungtaek Lim (HeartSaVioR)
>
> On Sun, Oct 20, 2019 at 7:18 PM Paul Wais wrote
To force one instance per executor, you could explicitly subclass
FlatMapFunction and have it lazy-create your parser in the subclass
constructor. You might also want to try RDD#mapPartitions() (instead of
RDD#flatMap()) if you want one instance per partition. This approach worked
well for me when
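In PySpark terms (rather than the Java FlatMapFunction API above), the one-instance-per-partition version looks roughly like this; MyParser and its parse() method are stand-ins for the real, expensive-to-construct parser:

    from pyspark import SparkContext

    class MyParser(object):
        """Stand-in for a parser that is expensive to construct."""
        def parse(self, rec):
            return [rec.upper()]          # dummy work; may return 0..N outputs

    def parse_partition(records):
        parser = MyParser()               # built once per partition, not per record
        for rec in records:
            for out in parser.parse(rec):
                yield out

    sc = SparkContext.getOrCreate()
    raw = sc.parallelize(["a", "b", "c", "d"], 2)
    print(raw.mapPartitions(parse_partition).collect())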
On 1/11/15 9:51 PM, Paul Wais wrote:
>>
>>
>> Dear List,
>>
>> What are common approaches for querying over a union of tables / RDDs?
>> E.g. suppose I have a collection of log files in HDFS, one log file per day,
>> and I want to compute the sum of some field over a date range in SQL.
Dear List,
I'm investigating some problems related to native code integration
with Spark, and while picking through BlockManager I noticed that data
(de)serialization currently issues lots of array copies.
Specifically:
- Deserialization: BlockManager marshals all deserialized bytes
through a spa
Regards,
-Paul Wais
ld and pass tests on Jenkins.
>>
>> You shouldn't expect new features to be added to stable code in
>> maintenance releases (e.g. 1.0.1).
>>
>> AFAIK, we're still on track with Spark 1.1.0 development, which means that
>> it should be released sometime in
Dear List,
I'm having trouble getting Spark 1.1 to use the Hadoop 2 API for
reading SequenceFiles. In particular, I'm seeing:
Exception in thread "main" org.apache.hadoop.ipc.RemoteException:
Server IPC version 7 cannot communicate with client version 4
at org.apache.hadoop.ipc.Client.ca
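For reference, a read through the new (org.apache.hadoop.mapreduce) API looks roughly like the sketch below, shown here via the PySpark wrapper of the same call; the path and key/value classes are placeholders for whatever the SequenceFiles were written with. The IPC error above is the classic symptom of Hadoop 1 client jars talking to a Hadoop 2 cluster, so the classpath matters as much as the API choice.

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    # SequenceFile read via the new-API input format; path and key/value
    # classes below are illustrative.
    rdd = sc.newAPIHadoopFile(
        "hdfs:///data/logs.seq",
        "org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat",
        "org.apache.hadoop.io.Text",
        "org.apache.hadoop.io.BytesWritable")
    print(rdd.take(1))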
> spark-submit --master yarn-cluster ...
>
> will work, but
>
> spark-submit --master yarn-client ...
>
> will fail.
>
>
> But on the personal build obtained from the command above, both will then
> work.
>
>
> -Christian
>
>
>
>
> On Sep 15, 2014, at 6:
nd on Hadoop 1.0.4. I
> suspect that's what you're doing -- packaging Spark(+Hadoop1.0.4) with
> your app, when it shouldn't be packaged.
>
> Spark works out of the box with just about any modern combo of HDFS and YARN.
>
> On Tue, Sep 16, 2014 at 2:28 AM, Paul
Thanks Tim, this is super helpful!
Question about jars and spark-submit: why do you provide
myawesomeapp.jar as the program jar but then include other jars via
the --jars argument? Have you tried building one uber jar with all
dependencies and just sending that to Spark as your app jar?
Also, h
Dear List,
I'm writing an application where I have RDDs of protobuf messages.
When I run the app via bin/spark-submit with --master local
--driver-class-path path/to/my/uber.jar, Spark is able to
serialize/deserialize the messages correctly.
However, if I run WITHOUT --driver-class-path path/to/my/uber.
ache.org/repos/asf/hadoop/common/branches/branch-2.3.0/hadoop-project/pom.xml
On Thu, Sep 18, 2014 at 1:06 AM, Paul Wais wrote:
> Dear List,
>
> I'm writing an application where I have RDDs of protobuf messages.
> When I run the app via bin/spark-submit with --master local
>
d3259.html
* https://github.com/apache/spark/pull/181
* http://mail-archives.apache.org/mod_mbox/spark-user/201311.mbox/%3c7f6aa9e820f55d4a96946a87e086ef4a4bcdf...@eagh-erfpmbx41.erf.thomson.com%3E
* https://groups.google.com/forum/#!topic/spark-users/Q66UOeA2u-I
On Thu, Sep 18, 2014 at 4:51 PM,
Hmm, would using Kryo help me here?
On Thursday, September 18, 2014, Paul Wais wrote:
> Ah, can one NOT create an RDD of any arbitrary Serializable type? It
> looks like I might be getting bitten by the same
> "java.io.ObjectInputStream uses root class loader only"
es the problem):
https://github.com/apache/spark/blob/2f9b2bd7844ee8393dc9c319f4fefedf95f5e460/core/src/main/scala/org/apache/spark/rdd/ParallelCollectionRDD.scala#L74
If uber.jar is on the classpath, then the root classloader would have
the code, which is why --driver-class-path fixes the bug.
On Thu, Sep 18, 201
Well it looks like this is indeed a protobuf issue. Poked a little more
with Kryo. Since protobuf messages are serializable, I tried just making
Kryo use the JavaSerializer for my messages. The resulting stack trace
made it look like protobuf GeneratedMessageLite is actually using the
classloade
Derp, one caveat to my "solution": I guess Spark doesn't use Kryo for
Function serde :(
On Fri, Sep 19, 2014 at 12:44 AM, Paul Wais wrote:
> Well it looks like this is indeed a protobuf issue. Poked a little more
> with Kryo. Since protobuf messages are serializable
Looks like an OOM issue? Have you tried persisting your RDDs to allow
disk writes?
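Concretely, something along these lines (the input path is a placeholder; MEMORY_AND_DISK is the stock storage level that allows spilling):

    from pyspark import SparkContext, StorageLevel

    sc = SparkContext.getOrCreate()
    rdd = sc.textFile("hdfs:///some/big/input")    # placeholder input

    # Partitions that don't fit in memory spill to local disk instead of
    # being dropped and recomputed (or pushing the heap toward OOM).
    rdd = rdd.persist(StorageLevel.MEMORY_AND_DISK)
    print(rdd.count())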
I've seen a lot of similar crashes in a Spark app that reads from HDFS
and does joins. I.e. I've seen "java.io.IOException: Filesystem
closed," "Executor lost," "FetchFailed," etc etc with
non-deterministic crashe
freeMemory()
shows gigabytes free and the native code needs only megabytes. Does
Spark limit the /native/ heap size somehow? I'm poking through the
executor code now but don't see anything obvious.
Best Regards,
-Paul Wais
s also taking memory.
>
> On Oct 30, 2014 6:43 PM, "Paul Wais" wrote:
>>
>> Dear Spark List,
>>
>> I have a Spark app that runs native code inside map functions. I've
>> noticed that the native code sometimes sets errno to ENOMEM indicating
Dear List,
Has anybody had experience integrating C/C++ code into Spark jobs?
I have done some work on this topic using JNA. I wrote a FlatMapFunction
that processes all partition entries using a C++ library. This approach
works well, but there are some tradeoffs:
* Shipping the native dylib
More thoughts. I took a deeper look at BlockManager, RDD, and friends.
Suppose one wanted to get native code access to un-deserialized blocks.
This task looks very hard. An RDD behaves much like a Scala iterator of
deserialized values, and interop with BlockManager is all on deserialized
data.
Dear List,
What are common approaches for querying over a union of tables / RDDs?
E.g. suppose I have a collection of log files in HDFS, one log file per
day, and I want to compute the sum of some field over a date range in SQL.
Using log schema, I can read each as a distinct SchemaRDD, but I wa
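A sketch of the union-then-query pattern, written in DataFrame terms rather than SchemaRDDs; the paths, schema, and the summed column are illustrative:

    from functools import reduce
    from pyspark.sql import DataFrame, SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    days = ["2015-01-01", "2015-01-02", "2015-01-03"]     # one log file per day
    daily = [spark.read.json("hdfs:///logs/%s.json" % d).withColumn("day", F.lit(d))
             for d in days]

    logs = reduce(DataFrame.unionAll, daily)              # one logical table
    logs.createOrReplaceTempView("logs")
    spark.sql("""
        SELECT SUM(bytes) AS total_bytes
        FROM logs
        WHERE day BETWEEN '2015-01-01' AND '2015-01-02'
    """).show()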