Thanks for the input. I will give foldByKey a shot.
The way I am doing it: the data is partitioned hourly, so I am computing
distinct values hourly. Then I use unionRDD to merge them and compute
distinct on the overall data.
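In code, that approach looks roughly like this (just a sketch; hourlyRdds
here stands in for the per-hour RDDs, not my actual variable names):

// distinct per hour, then union the hourly results and distinct again
val hourlyDistinct = hourlyRdds.map(_.distinct())
val overallDistinct = sc.union(hourlyDistinct).distinct()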
> Is there a way to know which (key, value) pair is resulting in the OOM?
> Is
Hi,
I have a single-node (192G RAM) standalone Spark setup, with memory
configuration like this in spark-env.sh:
SPARK_WORKER_MEMORY=180g
SPARK_MEM=180g
In spark-shell I have a program like this:
val file = sc.textFile("/localpath") //file size is 40G
file.cache()
val output = file.map(line =>
Thanks Matei!
In the example all three items have the same key, so they go to the same
partition:
scala> sc.parallelize(Seq(0->3, 0->2, 0->1), 3).partitionBy(new
HashPartitioner(3)).glom.collect
Array(Array((0,3), (0,2), (0,1)), Array(), Array())
I guess the apparent stability is just due to
Hi Zhen,
I met the same problem on EC2: the application details cannot be accessed,
but I can read stdout and stderr. The problem has not been solved yet.
Actually, the order is not guaranteed; only which keys end up in each partition is.
Reducers may fetch data from map tasks in an arbitrary order, depending on
which ones are available first. If you’d like a specific order, you should sort
each partition. Here you might be getting it because each par
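A minimal sketch of sorting within each partition (the names here are mine,
not from the thread; mapPartitions keeps the partitioning):

import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq(0 -> 3, 0 -> 2, 0 -> 1), 3)
val sorted = pairs
  .partitionBy(new HashPartitioner(3))
  .mapPartitions(it => it.toSeq.sortBy(_._2).iterator, preservesPartitioning = true)
sorted.glom().collect()   // e.g. Array(Array((0,1), (0,2), (0,3)), Array(), Array())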
Hi,
I'm attempting to run the following simple standalone app on Mac OS and
Spark 1.0 using sbt:

val sparkConf = new SparkConf()
  .setAppName("ProcessEvents")
  .setMaster("local[*]")
  .setSparkHome("/Users/me/Downloads/spark")
val ssc = new StreamingContext(sparkConf, Seconds(10))
val lines = ssc.textFileStr
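For reference, a guess at a matching build.sbt for a Spark 1.0 streaming app
built with sbt (the versions here are my assumptions, not taken from the
original message):

name := "ProcessEvents"
scalaVersion := "2.10.4"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.0.0",
  "org.apache.spark" %% "spark-streaming" % "1.0.0"
)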
Actually, are you defining Person as an inner class?
You might be running into this:
http://stackoverflow.com/questions/18866866/why-there-is-no-typetag-available-in-nested-instantiations-when-interpreted-by
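The usual fix for that (a sketch, with a placeholder object name) is to move
the case class to the top level of the file, outside any class or method:

// top level, not nested, so a TypeTag can be found for it
case class Person(name: String, age: Int)

object SqlExample {
  def main(args: Array[String]): Unit = {
    // build the SQLContext and register the people table here
  }
}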
On Sat, Jun 14, 2014 at 1:51 PM, Michael Armbrust
wrote:
> Can you maybe attach the fu
Can you maybe attach the full scala file?
On Sat, Jun 14, 2014 at 5:03 AM, premdass wrote:
> Hi,
>
> I am trying to run the spark sql example provided on the example
> https://spark.apache.org/docs/latest/sql-programming-guide.html as a
> standalone program.
>
> When i try to run the compile t
On Jun 14, 2014 4:05 AM, "Matei Zaharia" wrote:
> You need to factor your program so that it’s not just a main(). This is
> not a Spark-specific issue, it’s about how you’d unit test any program in
> general. In this case, your main() creates a SparkContext, so you can’t
> pass one from o
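A minimal sketch of that factoring (names are mine, not from the thread):
the logic lives in a method that takes a SparkContext, and main() only wires
one up, so a test can pass in its own local context.

import org.apache.spark.{SparkConf, SparkContext}

object MyJob {
  // testable: takes the SparkContext instead of creating it
  def run(sc: SparkContext, input: String): Long =
    sc.textFile(input).count()

  def main(args: Array[String]): Unit = {
    // master comes from spark-submit (or add .setMaster for local runs)
    val sc = new SparkContext(new SparkConf().setAppName("MyJob"))
    try run(sc, args(0)) finally sc.stop()
  }
}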
What I mean is, let's say I run this:
sc.parallelize(Seq(0->3, 0->2, 0->1), 3).partitionBy(new HashPartitioner(3)).collect
Will the result always be Array((0,3), (0,2), (0,1))? Or could I
possibly get a different order?
I'm pretty sure the shuffle files are taken in the order of the source
partiti
Hi All,
I have some streaming code in Java that works on 0.9.1. After upgrading to
1.0 (with fixes for minor API changes), the DStream does not seem to be executing.
The tasks got killed in 1 second by the worker. Any idea what is causing
it?
The worker log file is not logging my debug statements. The f
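One thing worth double-checking (an assumption on my part, not something
established in this thread): the 1.0 streaming pattern blocks on
awaitTermination after start, and receiver-based sources need at least two
local cores, e.g.:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("StreamingCheck").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(10))
ssc.socketTextStream("localhost", 9999).count().print()
ssc.start()
ssc.awaitTermination()   // keep the driver alive so batches keep running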
Grouping by key is always problematic since a key might have a huge number
of values. You can do a little better than grouping *all* values and *then*
finding distinct values by using foldByKey, putting values into a Set. At
least you end up with only distinct values in memory. (You don't need two
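A sketch of what that looks like (a minimal example, not code from the
thread):

// fold each key's values into a Set, so only distinct values are kept
val pairs = sc.parallelize(Seq(("a", 1), ("a", 1), ("a", 2), ("b", 3)))
val distinctValues = pairs
  .mapValues(v => Set(v))
  .foldByKey(Set.empty[Int])(_ ++ _)
val distinctCounts = distinctValues.mapValues(_.size)
// distinctCounts.collect()   // e.g. Array((a,2), (b,1))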
Hi,
For the last couple of days I have been trying hard to get around this
problem. Please share any insights on solving it.
Problem :
There is a huge list of (key, value) pairs. I want to transform this to
(key, distinct values) and then eventually to (key, distinct values count)
On sma
Hey all!
I have got an iterative problem. I'm trying to find something similar to
Hadoop's MultipleOutputs [1] in Spark 1.0. I need to build up a couple of
large dense vectors (may contain billions of elements - 2 billion doubles =>
at least 16GB) by adding partial vector chunks to it. This can be
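One way I could imagine attacking it (just a sketch of an assumption, not
something from the thread): keep each big vector as chunks keyed by
(vectorId, chunkIndex) and sum partial chunks element-wise with reduceByKey,
so no single task ever holds a whole 16GB array.

// hypothetical input: ((vectorId, chunkIndex), partial chunk values)
val partialChunks = sc.parallelize(Seq(
  ((0, 0), Array(1.0, 2.0)), ((0, 0), Array(3.0, 4.0)),
  ((0, 1), Array(5.0, 6.0))
))
val summedChunks = partialChunks.reduceByKey { (a, b) =>
  val out = new Array[Double](a.length)
  var i = 0
  while (i < a.length) { out(i) = a(i) + b(i); i += 1 }
  out
}
// => one summed chunk per (vectorId, chunkIndex)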
Hi,
I am trying to run the Spark SQL example from
https://spark.apache.org/docs/latest/sql-programming-guide.html as a
standalone program.
When I try to compile the program, I am getting the error below:
Done updating.
Compiling 1 Scala source to
C:\Work\Dev\scala\wo