Re: "CANNOT FIND ADDRESS"

2014-11-01 Thread Akhil Das
Try this: spark.storage.memoryFraction 0.9 On 31 Oct 2014 20:27, "akhandeshi" wrote: > Thanks for the pointers! I did try, but it didn't seem to help... > > In my latest try, I am doing spark-submit local > > But I see the same message in the Spark App UI (4040): > localhost CANNOT FIND ADDRESS >
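For reference, a minimal sketch (Scala; the app name is a placeholder) of setting this property in code before the SparkContext is created:
{code}
import org.apache.spark.{SparkConf, SparkContext}

// Give the block manager 90% of executor heap for caching (default is 0.6).
// Note this leaves less headroom for shuffle buffers and user objects.
val conf = new SparkConf()
  .setAppName("MemoryFractionExample")          // hypothetical app name
  .set("spark.storage.memoryFraction", "0.9")

val sc = new SparkContext(conf)
{code}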

--executor-cores cannot change vcores in yarn?

2014-11-01 Thread Gen
Hi, Maybe it is a stupid question, but I am running Spark on YARN. I request the resources with the following command: {code} ./spark-submit --master yarn-client --num-executors #number of workers --executor-cores #number of cores ... {code} However, after launching the task, I use /yarn node -statu
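A hedged sketch of making the same core request through a configuration property rather than the command-line flag (the app name and value are placeholders; the executor count is still typically passed as --num-executors on spark-submit):
{code}
import org.apache.spark.{SparkConf, SparkContext}

// Same request as --executor-cores 4 when running on YARN.
val conf = new SparkConf()
  .setAppName("YarnCoresExample")        // hypothetical app name
  .set("spark.executor.cores", "4")

val sc = new SparkContext(conf)
{code}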

Re: use additional ebs volumes for hdfs storage with spark-ec2

2014-11-01 Thread Marius Soutier
Are these /vols formatted? You typically need to format and define a mount point in /mnt for attached EBS volumes. I’m not using the ec2 script, so I don’t know what is installed, but there’s usually an HDFS info service running on port 50070. After changing hdfs-site.xml, you have to restart t
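If it helps to confirm whether HDFS actually sees the extra capacity after formatting and remounting, a small hedged sketch using the Hadoop FileSystem API (run on a node that has the cluster's Hadoop configuration on its classpath):
{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem

// Ask the NameNode for overall capacity and usage; the numbers should grow
// once the EBS volumes are formatted, mounted, and listed in hdfs-site.xml.
val fs = FileSystem.get(new Configuration())
val status = fs.getStatus()
println(s"capacity = ${status.getCapacity} bytes")
println(s"used     = ${status.getUsed} bytes")
println(s"free     = ${status.getRemaining} bytes")
{code}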

Re: SparkSQL + Hive Cached Table Exception

2014-11-01 Thread Cheng Lian
Hi Jean, Thanks for reporting this. This is indeed a bug: for some column types (Binary, Array, Map and Struct, and unfortunately, for some reason, Boolean), a NoopColumnStats is used to collect column statistics, which causes this issue. Filed SPARK-4182 to track this issue; will fix this ASAP. Cheng
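For context, a minimal sketch (Scala; table and column names are made up) of the kind of usage that exercises the code path described, i.e. caching a table that has one of the affected column types:
{code}
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)   // sc is the SparkContext

// A table with a Boolean column -- one of the types for which
// NoopColumnStats was used when building in-memory column statistics.
hiveContext.sql("CREATE TABLE IF NOT EXISTS flags (id INT, active BOOLEAN)")
hiveContext.cacheTable("flags")         // builds the in-memory columnar cache
hiveContext.sql("SELECT * FROM flags WHERE active = true").collect()
{code}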

Re: SparkSQL + Hive Cached Table Exception

2014-11-01 Thread Jean-Pascal Billaud
Great! Thanks. Sent from my iPad > On Nov 1, 2014, at 8:35 AM, Cheng Lian wrote: > > Hi Jean, > > Thanks for reporting this. This is indeed a bug: for some column types (Binary, > Array, Map and Struct, and unfortunately, for some reason, Boolean), a > NoopColumnStats is used to collect column st

Re: A Spark Design Problem

2014-11-01 Thread Steve Lewis
Join seems to me the proper approach, followed by keying the fits by KeyID and using combineByKey to choose the best. I am implementing that now and will report on performance. On Fri, Oct 31, 2014 at 11:56 AM, Sonal Goyal wrote: > Does the following help? > > JavaPairRDD join with JavaPairRDD >
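As a rough illustration of that approach (Scala; the Fit type and its score are hypothetical stand-ins for the real data):
{code}
import org.apache.spark.SparkContext._   // pair RDD functions on Spark 1.x
import org.apache.spark.rdd.RDD

// Hypothetical type: a fit carries the key it belongs to and a score.
case class Fit(keyId: Long, score: Double)

def bestFitPerKey(fits: RDD[Fit]): RDD[(Long, Fit)] = {
  fits
    .keyBy(_.keyId)                       // key the fits by KeyID
    .combineByKey(
      (f: Fit) => f,                                                  // first fit seen
      (best: Fit, f: Fit) => if (f.score > best.score) f else best,   // merge a value
      (a: Fit, b: Fit) => if (a.score > b.score) a else b)            // merge combiners
}
{code}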

Re: stage failure: java.lang.IllegalStateException: unread block data

2014-11-01 Thread TJ Klein
Hi, I get exactly the same error. It runs on my local machine but not on the cluster. I am running the pi.py example. Best, Tassilo -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/stage-failure-java-lang-IllegalStateException-unread-block-data-tp1

Re: Spark speed performance

2014-11-01 Thread jan.zikes
Now I am running into problems using: distData = sc.textFile(sys.argv[2]).coalesce(10) The problem is that Spark seems to try to put all the data into RAM first and then perform the coalesce. Do you know if there is something that would do the coalesce on the fly, with for example a fixed size of the

OOM with groupBy + saveAsTextFile

2014-11-01 Thread Bharath Ravi Kumar
Hi, I'm trying to run groupBy(function) followed by saveAsTextFile on an RDD of count ~100 million. The data size is 20GB and the groupBy results in an RDD of 1061 keys with values being Iterable<...>. The job runs on 3 hosts in a standalone setup, with each host's executor having 100G RAM and 24 cores de
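For reference, a stripped-down sketch of the pattern being described (Scala; the record type, grouping key, and paths are hypothetical, and sc is the SparkContext):
{code}
// Hypothetical record type and grouping function standing in for the real job.
case class Record(key: String, payload: String)

val records = sc.textFile("hdfs:///input")               // placeholder path
  .map(line => Record(line.takeWhile(_ != '\t'), line))

records
  .groupBy(_.key)                       // a few thousand keys -> Iterable[Record]
  .saveAsTextFile("hdfs:///output")     // each key's whole group becomes one text line
{code}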

Re: OOM with groupBy + saveAsTextFile

2014-11-01 Thread Bharath Ravi Kumar
Minor clarification: I'm running spark 1.1.0 on JDK 1.8, Linux 64 bit. On Sun, Nov 2, 2014 at 1:06 AM, Bharath Ravi Kumar wrote: > Hi, > > I'm trying to run groupBy(function) followed by saveAsTextFile on an RDD > of count ~ 100 million. The data size is 20GB and groupBy results in an RDD > of 1

Re: Spark speed performance

2014-11-01 Thread Aaron Davidson
coalesce() is a streaming operation if used without the second parameter; it does not put all the data in RAM. If used with the second parameter (shuffle = true), then it performs a shuffle, but still does not put all the data in RAM. On Sat, Nov 1, 2014 at 12:09 PM, wrote: > Now I am running into
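To illustrate the two variants being described (Scala equivalent of the Python snippet above; the path is a placeholder):
{code}
// Narrowing without a shuffle: partitions are combined as a stream,
// so the data is not materialized in RAM first.
val narrowed = sc.textFile("hdfs:///some/input").coalesce(10)

// With shuffle = true a full shuffle is performed (useful to rebalance skew
// or increase parallelism), but it still does not hold everything in memory.
val reshuffled = sc.textFile("hdfs:///some/input").coalesce(10, shuffle = true)
{code}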

union of SchemaRDDs

2014-11-01 Thread Daniel Mahler
I would like to combine 2 parquet tables I have created. I tried: sc.union(sqx.parquetFile("fileA"), sqx.parquetFile("fileB")) but that just returns RDD[Row]. How do I combine them to get a SchemaRDD[Row]? thanks Daniel

Re: union of SchemaRDDs

2014-11-01 Thread Matei Zaharia
Try unionAll, which is a special method on SchemaRDDs that keeps the schema on the results. Matei > On Nov 1, 2014, at 3:57 PM, Daniel Mahler wrote: > > I would like to combine 2 parquet tables I have created. > I tried: > > sc.union(sqx.parquetFile("fileA"), sqx.parquetFile("fileB")) >
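A small sketch of the suggestion, using the file names from the original question:
{code}
val a = sqx.parquetFile("fileA")   // SchemaRDD
val b = sqx.parquetFile("fileB")   // SchemaRDD

// unionAll is defined on SchemaRDD and preserves the schema,
// unlike sc.union, which falls back to a plain RDD[Row].
val combined = a.unionAll(b)
{code}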

Re: union of SchemaRDDs

2014-11-01 Thread Daniel Mahler
Thanks Matei. What does unionAll do if the input RDD schemas are not 100% compatible? Does it take the union of the columns and generalize the types? thanks Daniel On Sat, Nov 1, 2014 at 6:08 PM, Matei Zaharia wrote: > Try unionAll, which is a special method on SchemaRDDs that keeps the > schem

org.apache.hadoop.security.UserGroupInformation.doAs Issue

2014-11-01 Thread TJ Klein
Hi there, I am trying to run the example code pi.py on a cluster; however, I only got it working on localhost. When trying to run in standalone mode, ./bin/spark-submit \ --master spark://[mymaster]:7077 \ examples/src/main/python/pi.py \ I get warnings about resources and memory (the works
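If the warnings are the usual "initial job has not accepted any resources" kind, one common cause is requesting more memory or cores than the workers advertise. A hedged sketch (Scala; values and master URL are placeholders) of keeping the request within those limits:
{code}
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("spark://mymaster:7077")      // placeholder master URL
  .setAppName("Pi")
  .set("spark.executor.memory", "512m")    // must fit within each worker's advertised memory
  .set("spark.cores.max", "4")             // cap total cores requested from the standalone master

val sc = new SparkContext(conf)
{code}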

Re: union of SchemaRDDs

2014-11-01 Thread Matei Zaharia
It does generalize types, but only on the intersection of the columns it seems. There might be a way to get the union of the columns too using HiveQL. Types generalize up with string being the "most general". Matei > On Nov 1, 2014, at 6:22 PM, Daniel Mahler wrote: > > Thanks Matei. What does
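One manual workaround (not from this thread) for getting the union of the columns is to pad each table with NULLs for the columns it lacks and then UNION ALL them in HiveQL. A hedged sketch, assuming sqx is a HiveContext and using made-up column names:
{code}
val a = sqx.parquetFile("fileA")   // columns: id, name (made up)
val b = sqx.parquetFile("fileB")   // columns: id, age  (made up)
a.registerTempTable("a")
b.registerTempTable("b")

// Pad each side with typed NULLs for the columns it lacks, then union.
val combined = sqx.sql("""
  SELECT id, name, CAST(NULL AS INT) AS age FROM a
  UNION ALL
  SELECT id, CAST(NULL AS STRING) AS name, age FROM b
""")
{code}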

Re: Spark SQL : how to find element where a field is in a given set

2014-11-01 Thread abhinav chowdary
I have the same requirement of passing a list of values to an IN clause; when I try to do it I get the error below: scala> val longList = Seq[Expression]("a", "b") <console>:11: error: type mismatch; found : String("a") required: org.apache.spark.sql.catalyst.expressions.Expression val longList = S
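A hedged sketch of one way around the type mismatch: wrap the raw values in Literal so they become Expressions, then build an In predicate. This assumes the experimental Catalyst DSL of that era and a SchemaRDD named people (both assumptions, not part of the original message):
{code}
import org.apache.spark.sql.catalyst.dsl.expressions._
import org.apache.spark.sql.catalyst.expressions.{Expression, In, Literal}

// Lift the raw strings into Catalyst literals so the Seq type-checks.
val longList: Seq[Expression] = Seq("a", "b").map(Literal(_))

// Keep rows whose 'category column matches any value in the list.
val filtered = people.where(In('category, longList))
{code}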

Re: SparkSQL + Hive Cached Table Exception

2014-11-01 Thread Cheng Lian
Just submitted a PR to fix this: https://github.com/apache/spark/pull/3059 On Sun, Nov 2, 2014 at 12:36 AM, Jean-Pascal Billaud wrote: > Great! Thanks. > > Sent from my iPad > > On Nov 1, 2014, at 8:35 AM, Cheng Lian wrote: > > Hi Jean, > > Thanks for reporting this. This is indeed a bug: for some c

Re: OOM with groupBy + saveAsTextFile

2014-11-01 Thread Bharath Ravi Kumar
Resurfacing the thread. OOM shouldn't be the norm for a common groupBy / sort use case in a framework that is leading sorting benchmarks, should it? Or is there something fundamentally wrong in the usage? On 02-Nov-2014 1:06 am, "Bharath Ravi Kumar" wrote: > Hi, > > I'm trying to run groupBy(function) f

Re: OOM with groupBy + saveAsTextFile

2014-11-01 Thread arthur.hk.c...@gmail.com
Hi, FYI as follows. Could you post your heap size settings as well as your Spark app code? Regards Arthur 3.1.3 Detail Message: Requested array size exceeds VM limit The detail message "Requested array size exceeds VM limit" indicates that the application (or APIs used by that application) attem

How to correctly estimate the number of partitions of a graph in GraphX

2014-11-01 Thread James
Hello, I am trying to run the Connected Components algorithm on a very big graph. In practice I found that a small number of partitions would lead to OOM, while a large number would cause various timeout exceptions. Thus I wonder how to estimate the number of partitions of a graph in GraphX? Alcai

Re: How to correctly estimate the number of partitions of a graph in GraphX

2014-11-01 Thread Ankur Dave
How large is your graph, and how much memory does your cluster have? We don't have a good way to determine the *optimal* number of partitions aside from trial and error, but to get the job to at least run to completion, it might help to use the MEMORY_AND_DISK storage level and a large number of p
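A hedged sketch of what that might look like when loading an edge list (Scala; the path and partition count are placeholders, and the storage-level parameters assume a GraphLoader version that exposes them, as in Spark 1.1+):
{code}
import org.apache.spark.graphx.GraphLoader
import org.apache.spark.storage.StorageLevel

// Load with many partitions and allow edges/vertices to spill to disk
// instead of failing with OOM when they do not fit in memory.
val graph = GraphLoader.edgeListFile(
  sc,
  "hdfs:///edges.txt",                              // placeholder path
  minEdgePartitions = 1024,                         // placeholder; tune by trial and error
  edgeStorageLevel = StorageLevel.MEMORY_AND_DISK,
  vertexStorageLevel = StorageLevel.MEMORY_AND_DISK)

val cc = graph.connectedComponents().vertices
{code}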

Re: How to correctly estimate the number of partitions of a graph in GraphX

2014-11-01 Thread James
Hello, We have a graph with 100B edges, nearly 800GB in gz format. We have 80 machines, each with 60GB of memory. I have never seen the program run to completion. Alcaid 2014-11-02 14:06 GMT+08:00 Ankur Dave : > How large is your graph, and how much memory does your cluster have? > > We don

Re: OOM with groupBy + saveAsTextFile

2014-11-01 Thread Reynold Xin
None of your tuning will help here because the problem is actually the way you are saving the output. If you take a look at the stack trace, it is trying to build a single string that is too large for the VM to allocate. The VM is actually not running out of memory; rather, the JVM cannot sup
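One way around that (a sketch consistent with the diagnosis above, not necessarily what the thread later recommended): flatten each group so every value becomes its own modest-sized output line instead of serializing a whole Iterable into one enormous string. Names below are hypothetical:
{code}
import org.apache.spark.rdd.RDD

// 'records' stands in for the input RDD and 'keyOf' for the grouping function
// from the job; the output path is a placeholder.
def saveGroupedLineByLine[T](records: RDD[T], keyOf: T => String, out: String): Unit = {
  records
    .groupBy(keyOf)
    // Emit one line per value rather than one giant line per key,
    // so no single string approaches the JVM's array size limit.
    .flatMap { case (key, values) => values.iterator.map(v => s"$key\t$v") }
    .saveAsTextFile(out)
}
{code}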