Greetings!
Is it true that functions, such as those passed to RDD.map(), are deserialized once per task? This seems to be the case looking at Executor.scala, but I don't really understand the code.
I'm hoping the answer is yes, because that would make it easier to write code without worrying about thread safety.
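As a concrete illustration of why the answer matters (a sketch only; whether each task really gets its own deserialized copy of the closure is exactly the question being asked):

    import java.text.SimpleDateFormat
    import org.apache.spark.{SparkConf, SparkContext}

    object PerTaskState {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("per-task-state"))
        val timestamps = sc.parallelize(Seq(1425000000000L, 1425003600000L))

        // SimpleDateFormat is not thread-safe. If the deserialized closure
        // (and therefore this captured formatter) is private to each task,
        // code like this is safe; if tasks within an executor shared one
        // deserialized copy, it would not be.
        val fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
        val formatted = timestamps.map(t => fmt.format(new java.util.Date(t)))

        formatted.collect().foreach(println)
        sc.stop()
      }
    }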
This is something of a wild guess, but I find that when executors start disappearing for no obvious reason, it is usually because the YARN node-managers have decided that the containers are using too much memory and have terminated the executors.
Unfortunately, to see evidence of this, one needs to look at the YARN node-manager logs on the worker nodes.
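If that is what is happening, the setting usually involved (Spark 1.x on YARN) is the executor memory overhead; a minimal sketch, with the sizes purely illustrative:

    import org.apache.spark.{SparkConf, SparkContext}

    // The YARN container limit is roughly executor heap + overhead. If the
    // node-manager kills containers for exceeding physical memory, raising
    // the overhead (or lowering the heap) is the usual first step.
    val conf = new SparkConf()
      .setAppName("memory-overhead-example")
      .set("spark.executor.memory", "7g")                 // heap per executor
      .set("spark.yarn.executor.memoryOverhead", "1024")  // extra MB per container
    val sc = new SparkContext(conf)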
Note that in Scala, "return" is a non-local return:
https://tpolecat.github.io/2014/05/09/return.html
So that "return" is *NOT* returning from the anonymous function, but attempting to return from the enclosing method, i.e., "main", which is running on the driver, not on the workers. So on the workers
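A small, purely local illustration of the language rule being described (not the original poster's code):

    object ReturnPitfall {
      // The "return" exits add() itself, not the function literal passed to map.
      def add(xs: List[Int]): Int = {
        var sum = 0
        xs.map { x =>
          if (x < 0) return -1   // non-local return: abandons add() immediately
          sum += x
        }
        sum
      }

      def main(args: Array[String]): Unit = {
        println(add(List(1, 2, 3)))    // 6
        println(add(List(1, -2, 3)))   // -1: the map is abandoned part-way through
      }
    }

Inside a Spark job the function literal runs in a task on a worker, where there is no enclosing method on the call stack to return to, so a "return" like this cannot do what the author of the closure expects.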
My apologies for following up on my own post, but a friend just pointed out that if I use kryo with reference counting AND copy-and-paste, this runs.
However, if I try to "load ", this fails as described below.
I thought load was supposed to be equivalent?
Thanks!
-Mike
From: Michael A
Greetings!
For me, the code below fails from the shell. However, I can do essentially the same thing from compiled code, exporting the jar.
If I use default serialization or kryo with reference tracking, the error message tells me it can't find the constructor for "A". If I use kryo with reference tracking
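For reference, a sketch of the settings being compared in this thread; the class A here is only a stand-in for the poster's class, and the values shown are not a recommendation:

    import org.apache.spark.{SparkConf, SparkContext}

    // Stand-in for the class whose constructor kryo reportedly cannot find.
    case class A(x: Int, s: String)

    // As compiled code: configure kryo before the SparkContext is created.
    // (In spark-shell the serializer has to be set when the shell starts.)
    val conf = new SparkConf()
      .setAppName("kryo-example")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.referenceTracking", "true")   // the setting being toggled
      .registerKryoClasses(Array(classOf[A]))
    val sc = new SparkContext(conf)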
;",-1L) } } }}
From: Sean Owen
To: Michael Albert
Cc: User
Sent: Monday, March 23, 2015 7:31 AM
Subject: Re: How to check that a dataset is sorted after it has been written
out?
Data is not (necessarily) sorted when read from disk, no. A file might
have many block
Greetings!
[My apologies for this repost; I'm not certain that the first message made it to the list.]
I sorted a dataset in Spark and then wrote it out in avro/parquet.
Then I wanted to check that it was sorted.
It looks like each partition has been sorted, but when reading in, the first
"partit
Greetings!
I sorted a dataset in Spark and then wrote it out in avro/parquet.
Then I wanted to check that it was sorted.
It looks like each partition has been sorted, but when reading in, the first
"partition" (i.e., as seen in the partition index of mapPartitionsWithIndex) is
not the same as im
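A sketch of one way to check this, assuming an RDD of key/value pairs with an ordered key (names are illustrative, not from the original post): collect the first and last key of every partition along with its index, and verify both that each partition is internally sorted and that consecutive partitions do not overlap.

    import org.apache.spark.rdd.RDD

    // True if every partition is internally sorted and partition i's last key
    // is <= partition i+1's first key, ordering partitions by their index.
    def looksSorted[K: Ordering, V](rdd: RDD[(K, V)]): Boolean = {
      val ord = implicitly[Ordering[K]]
      val bounds = rdd.mapPartitionsWithIndex { (idx, iter) =>
        val keys = iter.map(_._1).toVector
        val sortedWithin = keys.sliding(2).forall {
          case Seq(a, b) => ord.lteq(a, b)
          case _         => true
        }
        if (keys.isEmpty) Iterator.empty
        else Iterator((idx, keys.head, keys.last, sortedWithin))
      }.collect().sortBy(_._1)

      bounds.forall(_._4) &&
        bounds.sliding(2).forall {
          case Array((_, _, lastA, _), (_, firstB, _, _)) => ord.lteq(lastA, firstB)
          case _                                          => true
        }
    }

Note that this compares partitions by their index, which is exactly the comparison in question here: whether the partition read back as index 0 really holds the lowest keys.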
For what it's worth, I was seeing mysterious hangs, but they went away when upgrading from Spark 1.2 to 1.2.1. I don't know if this is your problem. Also, I'm using AWS EMR images, which were also "upgraded".
Anyway, that's my experience.
-Mike
From: Manas Kar
To: "user@spark.apache.org"
S
confused :-).
Thanks!
-Mike
From: Michael Albert
To: "user@spark.apache.org"
Sent: Thursday, February 5, 2015 9:04 PM
Subject: Spark stalls or hangs: is this a clue? remote fetches seem to never
return?
Greetings!
Again, thanks to all who have given suggestions. I am still
Greetings!
Again, thanks to all who have given suggestions. I am still trying to diagnose a problem where I have processes that run for one or several hours but intermittently stall or hang. By "stall" I mean that there is no CPU usage on the workers or the driver, nor network activity, nor do I s
From: Sandy Ryza
To: Imran Rashid
Cc: Michael Albert ; "user@spark.apache.org"
Sent: Wednesday, February 4, 2015 12:54 PM
Subject: Re: advice on diagnosing Spark stall for 1.5hr out of 3.5hr job?
Also, do you see any lines in the YARN NodeManager logs where it says
1) Parameters like "--num-executors" should come before the jar. That is, you want something like:

    $SPARK_HOME --num-executors 3 --driver-memory 6g --executor-memory 7g \
        --master yarn-cluster --class EDDApp \
        target/scala-2.10/eddjar \

That is, *your* parameters come after the jar; Spark's parameters come before it.
Greetings!
First, my sincere thanks to all who have given me advice. Following previous discussion, I've rearranged my code to try to keep the partitions to more manageable sizes. Thanks to all who commented.
At the moment, the input set I'm trying to work with is about 90GB (avro parquet format).
Thank you!
This is very helpful.
-Mike
From: Aaron Davidson
To: Imran Rashid
Cc: Michael Albert ; Sean Owen ;
"user@spark.apache.org"
Sent: Tuesday, February 3, 2015 6:13 PM
Subject: Re: 2GB limit for partitions?
To be clear, there is no distinction between part
You might also try "stty sane".
From: amoners
I am not sure whether this will help you. In my situation, I could not see any input in the terminal after some work was done via spark-shell; I ran the command "stty echo", and it fixed it.
ArrayOps.scala:108) at
org.apache.spark.network.netty.NettyBlockRpcServer.receive(NettyBlockRpcServer.scala:57)
From: Sean Owen
To: Michael Albert
Cc: "user@spark.apache.org"
Sent: Monday, February 2, 2015 10:13 PM
Subject: Re: 2GB limit for partitions?
The limit is on blocks, not partitions. Partitions ha
Greetings!
SPARK-1476 says that there is a 2G limit for "blocks". Is this the same as a 2G limit for partitions (or approximately so)?
What I had been attempting to do is the following.
1) Start with a moderately large data set (currently about 100GB, but growing).
2) Create about 1,000 files (ye
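A sketch of the workaround that usually gets suggested for that limit, assuming the data can simply be spread over more partitions (paths and counts below are illustrative, and sc is the spark-shell's context):

    // ~100GB spread over ~1,000 partitions keeps each partition near 100MB,
    // comfortably below the 2GB block limit described in SPARK-1476.
    val input = sc.textFile("hdfs:///data/input")     // hypothetical path
    val resized = input.repartition(1000)
    resized.saveAsTextFile("hdfs:///data/resized")    // hypothetical path

If the data must also stay sorted by key, sortByKey with an explicit number of partitions achieves a similar spreading while preserving a total order across partitions.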
Greetings!
My executors apparently are being terminated because they are "running beyond
physical memory limits" according to the "yarn-hadoop-nodemanager" logs on the
worker nodes (/mnt/var/log/hadoop on AWS EMR). I'm setting the "driver-memory"
to 8G. However, looking at "stdout" in userlogs,
so creates an "fs" object and starts writing, but perhaps there is
some subtle difference in the context?
Thank you.
Sincerely, Mike
From: Akhil Das
To: Michael Albert
Cc: "user@spark.apache.org"
Sent: Monday, January 5, 2015 1:21 AM
Subject: Re: a vague ques
Greetings!
So, I think I have data saved so that each partition (part-r-0, etc.) is exactly what I want to translate into an output file of a format not related to Hadoop.
I believe I've figured out how to tell Spark to read the data set without re-partitioning (in another post I mentioned this
Greetings!
I would like to know if the code below will read "one-partition-at-a-time",
and whether I am reinventing the wheel.
If I may explain, upstream code has managed (I hope) to save an RDD such that each partition file (e.g., part-r-0, part-r-1) contains exactly the data subset whi
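A sketch of the pattern being asked about, with the output path, record type, and formatting all stand-ins rather than anything from the original code: walk each saved partition with mapPartitionsWithIndex and write one non-Hadoop output file per partition through the HDFS FileSystem API.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.rdd.RDD

    // Writes each partition's records to its own file, one partition per task,
    // in a plain format unrelated to Hadoop's output formats.
    def writeOnePartitionPerFile(rdd: RDD[String], outDir: String): Unit = {
      rdd.mapPartitionsWithIndex { (idx, iter) =>
        // The FileSystem handle is created inside the task, on the worker;
        // this is the "fs" object mentioned earlier in the thread.
        val fs = FileSystem.get(new Configuration())
        val out = fs.create(new Path(s"$outDir/custom-part-$idx.txt"))
        try {
          iter.foreach(rec => out.write((rec + "\n").getBytes("UTF-8")))
        } finally {
          out.close()
        }
        Iterator(idx)   // emit something small so the partition is evaluated
      }.count()         // an action is needed to trigger the writes
    }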
HADOOP_OPTS="-Xmx6g -Xmx5g")
Again, each grouping should have no more than 6E7 values, and the data is
(DataKey(Int,Int), Option[Float]), so that shouldn't need 5g?
Anyway, thanks for the info.
Best wishes,
Mike
From: Sean Owen
To: Michael Albert
Cc: user@spark.apache.o
Greetings!
I'm trying to do something similar, and having a very bad time of it.
What I start with is
key1: (col1: val-1-1, col2: val-1-2, col3: val-1-3, col4: val-1-4, ...)
key2: (col1: val-2-1, col2: val-2-2, col3: val-2-3, col4: val-2-4, ...)
What I want (what I have been asked to produce :-)
Spark doesn't know how to serialize, instead of a guava.util.List, which Spark likes.
Hive at 0.13.1 still can't read it, though...
Thanks!
-Mike
From: Michael Armbrust
To: Michael Albert
Cc: "user@spark.apache.org"
Sent: Tuesday, November 4, 2014 2:37 PM
Subject: Re:
Greetings!
I'm trying to use avro and parquet with the following schema:
{
  "name": "TestStruct",
  "namespace": "bughunt",
  "type": "record",
  "fields": [
    {
      "name": "string_array",
      "type": { "type": "array", "items": "string" }
    }
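For concreteness, a minimal sketch of building one record against that schema with the plain Avro generic API (the values are invented, and the fields list is closed off here so the snippet parses, since the post is truncated at that point; how such records are then handed to Spark/parquet is the part under discussion):

    import org.apache.avro.Schema
    import org.apache.avro.generic.GenericData

    val schemaJson = """
      {
        "name": "TestStruct",
        "namespace": "bughunt",
        "type": "record",
        "fields": [
          { "name": "string_array",
            "type": { "type": "array", "items": "string" } }
        ]
      }
    """
    val schema = new Schema.Parser().parse(schemaJson)

    // The generic API represents the array field as a GenericData.Array,
    // which is the sort of avro-specific type that the serialization
    // complaints elsewhere in this thread are about.
    val record = new GenericData.Record(schema)
    val arr = new GenericData.Array[String](
      schema.getField("string_array").schema(),
      java.util.Arrays.asList("a", "b", "c"))
    record.put("string_array", arr)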
Greetings!
This might be a documentation issue as opposed to a coding issue, in that
perhaps the correct answer is "don't do that", but as this is not obvious, I am
writing.
The following code produces output most would not expect:
package misc
import org.apache.spark.SparkConf
import org.apache.s