Try this:
spark.storage.memoryFraction 0.9
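If you set it programmatically rather than in spark-defaults.conf, a minimal sketch (the 0.9 value is just the suggestion above; the app name is a placeholder):
{code}
import org.apache.spark.{SparkConf, SparkContext}

// Raise the fraction of executor heap used for cached blocks (default is 0.6).
val conf = new SparkConf()
  .setAppName("my-app")
  .set("spark.storage.memoryFraction", "0.9")
val sc = new SparkContext(conf)
{code}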
On 31 Oct 2014 20:27, "akhandeshi" wrote:
> Thanks for the pointers! I tried it, but it didn't seem to help...
>
> In my latest try, I am doing spark-submit local
>
> But I see the same message in the Spark app UI (4040):
> localhost CANNOT FIND ADDRESS
>
Hi,
Maybe it is a stupid question, but I am running Spark on YARN. I request
resources with the following command:
{code}
./spark-submit --master yarn-client --num-executors <number of executors>
--executor-cores <number of cores> ...
{code}
However, after launching the task, I use /yarn node -statu
Are these /vols formatted? You typically need to format and define a mount
point in /mnt for attached EBS volumes.
I’m not using the ec2 script, so I don’t know what is installed, but there’s
usually an HDFS info service running on port 50070. After changing
hdfs-site.xml, you have to restart t
Hi Jean,
Thanks for reporting this. This is indeed a bug: for some column types (Binary,
Array, Map and Struct, and unfortunately, for some reason, Boolean), a
NoopColumnStats is used to collect column statistics, which causes this
issue. Filed SPARK-4182 to track this issue; will fix it ASAP.
Cheng
Great! Thanks.
Sent from my iPad
> On Nov 1, 2014, at 8:35 AM, Cheng Lian wrote:
>
> Hi Jean,
>
> Thanks for reporting this. This is indeed a bug: some column types (Binary,
> Array, Map and Struct, and unfortunately for some reason, Boolean), a
> NoopColumnStats is used to collect column st
A join seems to me the proper approach, followed by keying the fits by KeyID
and using combineByKey to choose the best. I am implementing that now and will
report on performance; a rough sketch is below.
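Roughly what I have in mind, as a sketch only (String keys and a plain Double score stand in for my real KeyID and fit types):
{code}
import org.apache.spark.SparkContext._   // pair-RDD functions on Spark 1.x
import org.apache.spark.rdd.RDD

// After the join, key the candidate fits by KeyID and keep the best score per key.
def bestFitPerKey(fits: RDD[(String, Double)]): RDD[(String, Double)] =
  fits.combineByKey[Double](
    (v: Double) => v,                                  // first fit seen for a key
    (best: Double, v: Double) => math.max(best, v),    // fold in another fit
    (b1: Double, b2: Double) => math.max(b1, b2))      // merge per-partition bests
{code}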
On Fri, Oct 31, 2014 at 11:56 AM, Sonal Goyal wrote:
> Does the following help?
>
> JavaPairRDD join with JavaPairRDD
>
Hi,
I get exactly the same error. It runs on my local machine but not on the
cluster. I am running the pi.py example.
Best,
Tassilo
Now I am running into problems using:
distData = sc.textFile(sys.argv[2]).coalesce(10)
The problem is that Spark seems to try to load all the data into RAM first and
then perform the coalesce. Do you know if there is something that would do the
coalesce on the fly, with for example a fixed size of the
Hi,
I'm trying to run groupBy(function) followed by saveAsTextFile on an RDD of
count ~ 100 million. The data size is 20GB and groupBy results in an RDD of
1061 keys with values being Iterable<...>. The job runs on 3 hosts in a standalone setup with each host's
executor having 100G RAM and 24 cores de
Minor clarification: I'm running spark 1.1.0 on JDK 1.8, Linux 64 bit.
On Sun, Nov 2, 2014 at 1:06 AM, Bharath Ravi Kumar
wrote:
> Hi,
>
> I'm trying to run groupBy(function) followed by saveAsTextFile on an RDD
> of count ~ 100 million. The data size is 20GB and groupBy results in an RDD
> of 1
coalesce() is a streaming operation if used without the second parameter;
it does not put all the data in RAM. If used with the second parameter
(shuffle = true), then it performs a shuffle, but still does not put all
the data in RAM.
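In other words (the path and partition count are only illustrative):
{code}
val distData = sc.textFile("hdfs:///input")

// Without the second argument: a narrow, streaming repartition down to 10
// partitions; the data is read partition by partition, not held in RAM.
val narrow = distData.coalesce(10)

// With shuffle = true: a full shuffle is performed, but the data is still
// streamed and spilled as needed rather than materialized in memory.
val shuffled = distData.coalesce(10, shuffle = true)
{code}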
On Sat, Nov 1, 2014 at 12:09 PM, wrote:
> Now I am getting to
I would like to combine 2 parquet tables I have created.
I tried:
sc.union(sqx.parquetFile("fileA"), sqx.parquetFile("fileB"))
but that just returns RDD[Row].
How do I combine them to get a SchemaRDD[Row]?
thanks
Daniel
Try unionAll, which is a special method on SchemaRDDs that keeps the schema on
the results.
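A minimal sketch of that suggestion, assuming sqx is the SQLContext from the original snippet:
{code}
val a = sqx.parquetFile("fileA")
val b = sqx.parquetFile("fileB")

// unionAll is defined on SchemaRDD, so the result keeps the schema,
// unlike sc.union, which only returns an RDD[Row].
val combined = a.unionAll(b)
{code}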
Matei
> On Nov 1, 2014, at 3:57 PM, Daniel Mahler wrote:
>
> I would like to combine 2 parquet tables I have created.
> I tried:
>
> sc.union(sqx.parquetFile("fileA"), sqx.parquetFile("fileB"))
>
Thanks Matei. What does unionAll do if the input RDD schemas are not 100%
compatible? Does it take the union of the columns and generalize the types?
thanks
Daniel
On Sat, Nov 1, 2014 at 6:08 PM, Matei Zaharia
wrote:
> Try unionAll, which is a special method on SchemaRDDs that keeps the
> schem
Hi there,
I am trying to run the example code pi.py on a cluster; however, I only got
it working on localhost. When trying to run in standalone mode,
./bin/spark-submit \
--master spark://[mymaster]:7077 \
examples/src/main/python/pi.py \
I get warnings about resources and memory (the works
It does generalize types, but only on the intersection of the columns it seems.
There might be a way to get the union of the columns too using HiveQL. Types
generalize up with string being the "most general".
Matei
> On Nov 1, 2014, at 6:22 PM, Daniel Mahler wrote:
>
> Thanks Matei. What does
I have the same requirement of passing a list of values to an IN clause. When
I try to do so, I get the error below:
scala> val longList = Seq[Expression]("a", "b")
:11: error: type mismatch;
found : String("a")
required: org.apache.spark.sql.catalyst.expressions.Expression
val longList = S
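One thing to try: the elements of the Seq have to be Expressions themselves, so wrap the strings in Literal. A sketch of that idea (how the list is then applied is only indicated in the comment):
{code}
import org.apache.spark.sql.catalyst.expressions.{Expression, In, Literal}

// Wrap the raw strings in Literal so they satisfy the Expression type.
val longList: Seq[Expression] = Seq(Literal("a"), Literal("b"))

// The list can then be used to build an IN predicate, e.g. In(someColumnExpr, longList).
{code}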
Just submitted a PR to fix this https://github.com/apache/spark/pull/3059
On Sun, Nov 2, 2014 at 12:36 AM, Jean-Pascal Billaud
wrote:
> Great! Thanks.
>
> Sent from my iPad
>
> On Nov 1, 2014, at 8:35 AM, Cheng Lian wrote:
>
> Hi Jean,
>
> Thanks for reporting this. This is indeed a bug: some c
Resurfacing the thread. OOM shouldn't be the norm for a common groupBy /
sort use case in a framework that is leading in sorting benchmarks, should it?
Or is there something fundamentally wrong in the usage?
On 02-Nov-2014 1:06 am, "Bharath Ravi Kumar" wrote:
> Hi,
>
> I'm trying to run groupBy(function) f
Hi,
FYI as follows. Could you post your heap size settings as well as your Spark app
code?
Regards
Arthur
3.1.3 Detail Message: Requested array size exceeds VM limit
The detail message Requested array size exceeds VM limit indicates that the
application (or APIs used by that application) attem
Hello,
I am trying to run the Connected Components algorithm on a very big graph. In
practice I found that a small number of partitions leads to OOM, while a large
number causes various timeout exceptions. So I wonder: how should I estimate
the number of partitions for a graph in GraphX?
Alcai
How large is your graph, and how much memory does your cluster have?
We don't have a good way to determine the *optimal* number of partitions
aside from trial and error, but to get the job to at least run to
completion, it might help to use the MEMORY_AND_DISK storage level and a
large number of p
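A rough sketch of that approach (assuming Spark 1.2+, where GraphLoader.edgeListFile accepts storage-level arguments; the path and partition count are only illustrative):
{code}
import org.apache.spark.graphx.GraphLoader
import org.apache.spark.storage.StorageLevel

// Spill to disk instead of failing with OOM, and start with many partitions.
val graph = GraphLoader.edgeListFile(sc, "hdfs:///edges",
  numEdgePartitions = 2000,
  edgeStorageLevel = StorageLevel.MEMORY_AND_DISK,
  vertexStorageLevel = StorageLevel.MEMORY_AND_DISK)

val cc = graph.connectedComponents().vertices
{code}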
Hello,
We have a graph with 100B edges, nearly 800GB in gz format.
We have 80 machines, each with 60GB of memory.
I have never seen the program run to completion.
Alcaid
2014-11-02 14:06 GMT+08:00 Ankur Dave :
> How large is your graph, and how much memory does your cluster have?
>
> We don
None of your tuning will help here because the problem is actually the way
you are saving the output. If you take a look at the stack trace, it is
trying to build a single string that is too large for the VM to allocate
memory for. The VM is actually not running out of memory, but rather, the JVM
cannot sup
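If the save step builds one concatenated string per key, a possible workaround (a sketch only; the RDD name and element types are made up) is to emit one line per value instead:
{code}
import org.apache.spark.rdd.RDD

// Write each (key, value) pair on its own line so no single output string
// has to hold an entire group in memory.
def saveFlattened(grouped: RDD[(String, Iterable[String])], path: String): Unit =
  grouped.flatMap { case (k, vs) => vs.map(v => k + "\t" + v) }
         .saveAsTextFile(path)
{code}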