> produce without a reproducer, and couldn't reproduce it even though
> they spent their time on it. A memory leak issue is not really easy to reproduce,
> unless it leaks some objects unconditionally.
>
> - Jungtaek Lim (HeartSaVioR)
>
> On Sun, Oct 20, 2019 at 7:18 PM Paul Wais wrote:
Dear List,
I've observed some sort of memory leak when using pyspark to run ~100
jobs in local mode. Each job is essentially a create RDD -> create DF
-> write DF sort of flow. The RDD and DFs go out of scope after each
job completes, hence I call this issue a "memory leak." Here's
pseudocode:
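Roughly, a minimal PySpark sketch of that create RDD -> create DF -> write DF loop (paths, row schema, and data below are hypothetical stand-ins):

    from pyspark.sql import Row, SparkSession

    spark = SparkSession.builder.master("local[*]").appName("leak-repro").getOrCreate()
    sc = spark.sparkContext

    for i in range(100):  # ~100 jobs, as described above
        # Create an RDD of rows (hypothetical data).
        rdd = sc.parallelize(range(10000)).map(lambda x: Row(id=x, value=x * 2))
        # Convert the RDD to a DataFrame.
        df = spark.createDataFrame(rdd)
        # Write the DataFrame out (hypothetical path).
        df.write.mode("overwrite").parquet("/tmp/leak_repro/job_%d" % i)
        # rdd and df are rebound on the next iteration, so the previous
        # ones become unreachable, yet driver memory keeps growing.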
Dear List,
Has anybody gotten avro support to work in pyspark? I see multiple
reports of it being broken on Stackoverflow and added my own repro to
this ticket:
https://issues.apache.org/jira/browse/SPARK-27623?focusedCommentId=16878896&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomm
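For context, the usual way to read Avro from PySpark is via the external spark-avro module; a minimal sketch (package version and path below are hypothetical) is:

    from pyspark.sql import SparkSession

    # Assumes spark-avro was supplied at launch, e.g.:
    #   pyspark --packages org.apache.spark:spark-avro_2.11:2.4.4
    spark = SparkSession.builder.appName("avro-check").getOrCreate()

    df = spark.read.format("avro").load("/tmp/example.avro")  # hypothetical path
    df.printSchema()
    df.show(5)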
Dear List,
I'm investigating some problems related to native code integration
with Spark, and while picking through BlockManager I noticed that data
(de)serialization currently issues lots of array copies.
Specifically:
- Deserialization: BlockManager marshals all deserialized bytes
through a spa
On 1/11/15 9:51 PM, Paul Wais wrote:
>>
>>
>> Dear List,
>>
>> What are common approaches for querying over a union of tables / RDDs?
>> E.g. suppose I have a collection of log files in HDFS, one log file per day,
>> and I want to compute the sum of some field over a date range in SQL.
To force one instance per executor, you could explicitly subclass
FlatMapFunction and have it lazy-create your parser in the subclass
constructor. You might also want to try RDD#mapPartitions() (instead of
RDD#flatMap()) if you want one instance per partition. This approach worked
well for me when
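For what it's worth, a PySpark analogue of the mapPartitions() approach (ExpensiveParser and raw_rdd are hypothetical stand-ins) looks like:

    def parse_partition(lines):
        # One parser per partition, created lazily on the executor
        # when the partition is first processed.
        parser = ExpensiveParser()  # hypothetical, expensive to construct
        for line in lines:
            for record in parser.parse(line):
                yield record

    parsed = raw_rdd.mapPartitions(parse_partition)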
Dear List,
What are common approaches for querying over a union of tables / RDDs?
E.g. suppose I have a collection of log files in HDFS, one log file per
day, and I want to compute the sum of some field over a date range in SQL.
Using log schema, I can read each as a distinct SchemaRDD, but I wa
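One common pattern (sketched here with the present-day DataFrame API rather than SchemaRDD, and with hypothetical paths and field names) is to union the per-day tables and aggregate over the result:

    from functools import reduce
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # One log file per day (hypothetical paths), all sharing the log schema.
    paths = ["/logs/2015-01-%02d.json" % d for d in range(1, 8)]
    daily = [spark.read.json(p) for p in paths]

    # Union the per-day DataFrames, then run the SQL aggregate over a date range.
    logs = reduce(lambda a, b: a.union(b), daily)
    logs.createOrReplaceTempView("logs")

    spark.sql("""
        SELECT SUM(bytes_sent) AS total_bytes      -- hypothetical field
        FROM logs
        WHERE log_date BETWEEN '2015-01-02' AND '2015-01-05'
    """).show()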
More thoughts. I took a deeper look at BlockManager, RDD, and friends.
Suppose one wanted to get native code access to un-deserialized blocks.
This task looks very hard. An RDD behaves much like a Scala iterator of
deserialized values, and interop with BlockManager is all on deserialized
data.
Dear List,
Has anybody had experience integrating C/C++ code into Spark jobs?
I have done some work on this topic using JNA. I wrote a FlatMapFunction
that processes all partition entries using a C++ library. This approach
works well, but there are some tradeoffs:
* Shipping the native dylib
s also taking memory.
>
> On Oct 30, 2014 6:43 PM, "Paul Wais" wrote:
>>
>> Dear Spark List,
>>
>> I have a Spark app that runs native code inside map functions. I've
>> noticed that the native code sometimes sets errno to ENOMEM indicating
freeMemory()
shows gigabytes free and the native code needs only megabytes. Does
Spark limit the /native/ heap size somehow? I'm poking through the
executor code now but don't see anything obvious.
Best Regards,
-Paul Wais
Looks like an OOM issue? Have you tried persisting your RDDs to allow
disk writes?
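In PySpark terms that suggestion is roughly (rdd being a hypothetical stand-in for the RDD in question):

    from pyspark import StorageLevel

    # Let blocks that don't fit in memory spill to local disk
    # instead of triggering recomputation or OOMs.
    rdd = rdd.persist(StorageLevel.MEMORY_AND_DISK)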
I've seen a lot of similar crashes in a Spark app that reads from HDFS
and does joins. I.e. I've seen "java.io.IOException: Filesystem
closed," "Executor lost," "FetchFailed," etc etc with
non-deterministic crashe
Derp, one caveat to my "solution": I guess Spark doesn't use Kryo for
Function serde :(
On Fri, Sep 19, 2014 at 12:44 AM, Paul Wais wrote:
> Well it looks like this is indeed a protobuf issue. Poked a little more
> with Kryo. Since protobuf messages are serializable
Well it looks like this is indeed a protobuf issue. Poked a little more
with Kryo. Since protobuf messages are serializable, I tried just making
Kryo use the JavaSerializer for my messages. The resulting stack trace
made it look like protobuf GeneratedMessageLite is actually using the
classloade
es the problem):
https://github.com/apache/spark/blob/2f9b2bd7844ee8393dc9c319f4fefedf95f5e460/core/src/main/scala/org/apache/spark/rdd/ParallelCollectionRDD.scala#L74
If uber.jar is on the classpath, then the root classloader would have
the code, hence why --driver-class-path fixes the bug.
On Thu, Sep 18, 201
Hmm, would using Kryo help me here?
On Thursday, September 18, 2014, Paul Wais wrote:
> Ah, can one NOT create an RDD of any arbitrary Serializable type? It
> looks like I might be getting bitten by the same
> "java.io.ObjectInputStream uses root class loader only"
d3259.html
* https://github.com/apache/spark/pull/181
* http://mail-archives.apache.org/mod_mbox/spark-user/201311.mbox/%3c7f6aa9e820f55d4a96946a87e086ef4a4bcdf...@eagh-erfpmbx41.erf.thomson.com%3E
* https://groups.google.com/forum/#!topic/spark-users/Q66UOeA2u-I
On Thu, Sep 18, 2014 at 4:51 PM,
ache.org/repos/asf/hadoop/common/branches/branch-2.3.0/hadoop-project/pom.xml
On Thu, Sep 18, 2014 at 1:06 AM, Paul Wais wrote:
> Dear List,
>
> I'm writing an application where I have RDDs of protobuf messages.
> When I run the app via bin/spark-submit with --master local
>
Dear List,
I'm writing an application where I have RDDs of protobuf messages.
When I run the app via bin/spark-submit with --master local
--driver-class-path path/to/my/uber.jar, Spark is able to
serialize and deserialize the messages correctly.
However, if I run WITHOUT --driver-class-path path/to/my/uber.
Thanks Tim, this is super helpful!
Question about jars and spark-submit: why do you provide
myawesomeapp.jar as the program jar but then include other jars via
the --jars argument? Have you tried building one uber jar with all
dependencies and just sending that to Spark as your app jar?
Also, h
nd on Hadoop 1.0.4. I
> suspect that's what you're doing -- packaging Spark(+Hadoop1.0.4) with
> your app, when it shouldn't be packaged.
>
> Spark works out of the box with just about any modern combo of HDFS and YARN.
>
> On Tue, Sep 16, 2014 at 2:28 AM, Paul
it --master yarn-cluster ...
>
> will work, but
>
> spark-submit --master yarn-client ...
>
> will fail.
>
>
> But on the personal build obtained from the command above, both will then
> work.
>
>
> -Christian
>
>
>
>
> On Sep 15, 2014, at 6:
Dear List,
I'm having trouble getting Spark 1.1 to use the Hadoop 2 API for
reading SequenceFiles. In particular, I'm seeing:
Exception in thread "main" org.apache.hadoop.ipc.RemoteException:
Server IPC version 7 cannot communicate with client version 4
at org.apache.hadoop.ipc.Client.ca
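For reference, reading a SequenceFile through the Hadoop 2 (mapreduce) API from PySpark typically looks like the sketch below (path and key/value classes are hypothetical); the IPC error above typically indicates a Hadoop 1 client jar talking to a Hadoop 2 cluster, rather than a problem with the read call itself.

    from pyspark import SparkContext

    sc = SparkContext(appName="seqfile-read")

    # Read a SequenceFile via the Hadoop 2 (mapreduce) input format.
    rdd = sc.newAPIHadoopFile(
        "hdfs:///data/events.seq",  # hypothetical path
        "org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat",
        "org.apache.hadoop.io.Text",
        "org.apache.hadoop.io.BytesWritable",
    )
    print(rdd.take(1))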
ld and pass tests on Jenkins.
>>
>> You shouldn't expect new features to be added to stable code in
>> maintenance releases (e.g. 1.0.1).
>>
>> AFAIK, we're still on track with Spark 1.1.0 development, which means that
>> it should be released sometime in
Regards,
-Paul Wais