Re: Submitting Spark application through code

2014-11-02 Thread Marius Soutier
Just a wild guess, but I had to exclude "javax.servlet.servlet-api" from my Hadoop dependencies to run a SparkContext. In your build.sbt: "org.apache.hadoop" % "hadoop-common" % "..." exclude("javax.servlet", "servlet-api"), "org.apache.hadoop" % "hadoop-hdfs" % "..." exclude("javax.servlet",
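
A fuller version of that build.sbt fragment might look like the following sketch; the version numbers are placeholders, not taken from the thread:

    // build.sbt -- exclude the old servlet-api that clashes with Spark's Jetty
    libraryDependencies ++= Seq(
      "org.apache.spark"  %% "spark-core"    % "1.1.0",
      "org.apache.hadoop" %  "hadoop-common" % "2.4.0" exclude("javax.servlet", "servlet-api"),
      "org.apache.hadoop" %  "hadoop-hdfs"   % "2.4.0" exclude("javax.servlet", "servlet-api")
    )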

Spark on Yarn probably trying to load all the data to RAM

2014-11-02 Thread jan.zikes
Hi, I am using Spark on Yarn, particularly Spark in Python. I am trying to run: myrdd = sc.textFile("s3n://mybucket/files/*/*/*.json") myrdd.getNumPartitions() Unfortunately it seems that Spark tries to load everything into RAM, or at least after a while of running this everything slows down and t
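
For reference, textFile itself is lazy, so counting partitions alone should not pull data into RAM; a Scala sketch of the equivalent call (the bucket path and partition hint are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("count-partitions"))
    // textFile is lazy; the optional minPartitions hint controls the split count
    val myrdd = sc.textFile("s3n://mybucket/files/*/*/*.json", 1000)
    println(myrdd.partitions.length) // computed from the file listing, not the data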

Re: Spark speed performance

2014-11-02 Thread jan.zikes
Thank you, I would expect it to work as you write, but I am probably experiencing it working the other way. But now it seems that Spark is generally trying to fit everything into RAM. I run Spark on YARN and I have wrapped this into another question: http://apache-spark-user-list.1001560.n3.nabble.com/

Re: Spark SQL : how to find element where a field is in a given set

2014-11-02 Thread Rishi Yadav
Did you create SQLContext? On Sat, Nov 1, 2014 at 7:51 PM, abhinav chowdary wrote: > I have the same requirement of passing a list of values to an in clause; when I am > trying to do it > > I am getting the below error > > scala> val longList = Seq[Expression]("a", "b") > :11: error: type mismatch; > found :
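
The usual fix for that type mismatch was to wrap the strings in Literal so they become Catalyst Expressions; a sketch against the 1.1-era DSL (the table and column names are made up, and the Symbol-to-attribute conversion is assumed to come from the SQLContext's implicit imports):

    import org.apache.spark.sql.catalyst.expressions.{Expression, In, Literal}

    // Plain strings are not Expressions; Literal(...) lifts them into the tree
    val longList = Seq[Expression](Literal("a"), Literal("b"))

    // 'name is resolved against the schema by the DSL's implicit conversions
    val matching = people.where(In('name, longList))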

Re: OOM with groupBy + saveAsTextFile

2014-11-02 Thread Bharath Ravi Kumar
Thanks for responding. This is what I initially suspected, and hence asked why the library needed to construct the entire value buffer on a single host before writing it out. The stacktrace appeared to suggest that user code is not constructing the large buffer. I'm simply calling groupBy and saveA

Re: properties file on a spark cluster

2014-11-02 Thread Akhil Das
The problem here is that when you run a Spark program in cluster mode, it will look for the file on the worker machine. The best approach would be to put the file in HDFS and use it instead of a local path. Another approach would be to create the same file in the same path on all worker machines and hopefull
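
A minimal sketch of the HDFS approach, loading the properties once on the driver (the path and helper are illustrative, not from the thread):

    import java.util.Properties
    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.SparkContext

    def loadProps(sc: SparkContext, path: String): Properties = {
      val fs = FileSystem.get(sc.hadoopConfiguration)
      val in = fs.open(new Path(path))       // works for hdfs:// paths
      val props = new Properties()
      try props.load(in) finally in.close()
      props
    }

    // e.g. val props = loadProps(sc, "hdfs:///config/app.properties")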

Re: ExecutorLostFailure (executor lost)

2014-11-02 Thread Akhil Das
You can check the worker logs for more accurate information (they are found under the work directory inside the Spark directory). I used to hit this issue with: - Too many open files: increasing the ulimit would solve this issue - Akka connection timeout/framesize: setting the following while creat
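
Those Akka settings were typically applied on the SparkConf before the context is created; a sketch with placeholder values (the keys are the 1.x-era ones):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("my-app")
      .set("spark.akka.frameSize", "128") // MB; raise if tasks ship large results
      .set("spark.akka.timeout", "300")   // seconds; tolerate slow, loaded nodes
    val sc = new SparkContext(conf)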

Re: Cannot instantiate hive context

2014-11-02 Thread Akhil Das
Adding the libthrift jar to the classpath would resolve this issue. Thanks Best Regards On Sat, Nov 1, 2014 at 12:34 AM, Pala M Muthaia wrote: > Hi, > > I am trying to load hive datasets using HiveContext, in spark shell. Sp

Re: hadoop_conf_dir when running spark on yarn

2014-11-02 Thread Akhil Das
You can set HADOOP_CONF_DIR inside the spark-env.sh file. Thanks Best Regards On Sat, Nov 1, 2014 at 4:14 AM, ameyc wrote: > How do I set up hadoop_conf_dir correctly when I'm running my spark job on > Yarn? My Yarn environment has the correct hadoop_conf_dir settings by the > configuration that
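
Concretely, that is a single line in conf/spark-env.sh (the path below is just an example):

    export HADOOP_CONF_DIR=/etc/hadoop/conf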

Re: OOM with groupBy + saveAsTextFile

2014-11-02 Thread Sean Owen
saveAsText means "save every element of the RDD as one line of text". It works like TextOutputFormat in Hadoop MapReduce since that's what it uses. So you are causing it to create one big string out of each Iterable this way. On Sun, Nov 2, 2014 at 4:48 PM, Bharath Ravi Kumar wrote: > Thanks for
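
One way to avoid materializing each group as a single giant line is to skip the grouping entirely when the goal is one line of text per element; a sketch (the key and value types, and the output path, are illustrative):

    // Instead of rdd.groupByKey().saveAsTextFile(out), which renders the whole
    // Iterable per key as one enormous string, emit one line per pair:
    rdd.sortByKey()                       // optional: keeps a key's lines adjacent
       .map { case (k, v) => s"$k\t$v" }  // no single record grows with group size
       .saveAsTextFile("hdfs:///out/grouped")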

RE: Prediction using Classification with text attributes in Apache Spark MLLib

2014-11-02 Thread ashu
Hi, sorry to bump this old thread. What is the state now? Is this problem solved? How does Spark handle categorical data now? Regards, Ashutosh -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Prediction-using-Classification-with-text-attributes-in-Apac

Re: Prediction using Classification with text attributes in Apache Spark MLLib

2014-11-02 Thread Xiangrui Meng
This operation requires two transformers: 1) Indexer, which maps string features into categorical features; 2) OneHotEncoder, which flattens categorical features into binary features. We are working on the new dataset implementation, so we can easily express those transformations. Sorry for the late reply! If
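
Until those transformers landed, the two steps could be done by hand; a rough sketch that indexes and one-hot encodes a single string column (the helper below is not an MLlib API of the time, just an illustration):

    import org.apache.spark.mllib.linalg.{Vector, Vectors}
    import org.apache.spark.rdd.RDD

    def oneHot(col: RDD[String]): RDD[Vector] = {
      // 1) Indexer: map each distinct string to a categorical index
      val index: Map[String, Int] = col.distinct().collect().zipWithIndex.toMap
      // 2) OneHotEncoder: flatten each index into a sparse binary vector
      col.map(s => Vectors.sparse(index.size, Array(index(s)), Array(1.0)))
    }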

Spark Master Web UI showing "0 cores" in Completed Applications

2014-11-02 Thread Justin Yip
Hello, I have a question about the "Completed Applications" table on the Spark Master web UI page. For the column "Cores", it used to show the number of cores used in the application. However, after I added a line "sparkContext.stop()" at the end of my Spark app, it shows "0 cores". My application

How do I kill a job submitted with spark-submit

2014-11-02 Thread Steve Lewis
I see the job in the web interface but don't know how to kill it
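
Two era-appropriate options, depending on the deploy mode (the application and driver IDs below are placeholders):

    # On YARN, kill by application ID (from the RM UI or `yarn application -list`):
    yarn application -kill application_1414000000000_0001

    # Standalone cluster mode, kill by driver ID from the Master web UI:
    ./bin/spark-class org.apache.spark.deploy.Client kill spark://master:7077 driver-20141102120000-0000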

Re: hadoop_conf_dir when running spark on yarn

2014-11-02 Thread Amey Chaugule
I thought that only applied when you're trying to run a job using spark-submit or in the shell... On Sun, Nov 2, 2014 at 8:47 AM, Akhil Das wrote: > You can set HADOOP_CONF_DIR inside the spark-env.sh file > > Thanks > Best Regards > > On Sat, Nov 1, 2014 at 4:14 AM, ameyc wrote: > >> How do i

Do Spark executors restrict native heap vs JVM heap?

2014-11-02 Thread Paul Wais
Thanks Sean! My novice understanding is that the 'native heap' is the address space not allocated to the JVM heap, but I wanted to check to see if I'm missing something. It turned out my issue was actual memory pressure on the executor machine. There was space for the JVM heap but not mu

Spark SQL takes unexpected time

2014-11-02 Thread Shailesh Birari
Hello, I have written a Spark SQL application which reads data from HDFS and queries it. The data size is around 2GB (30 million records). The schema and query I am running are as below. The query takes around 5+ seconds to execute. I tried adding rdd.persist(StorageLevel.MEMORY_AND
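
For repeated SQL scans, the 1.1-era API also had a table-level columnar cache that usually beats persisting the raw row RDD; a sketch assuming an existing SparkContext sc (the table name, input RDD, and query are placeholders):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD       // implicit RDD[case class] -> SchemaRDD

    recordsRdd.registerTempTable("records") // recordsRdd: an RDD of case classes
    sqlContext.cacheTable("records")        // columnar in-memory cache

    // The first query pays the scan-and-cache cost; repeats should be faster
    val counts = sqlContext.sql("SELECT key, COUNT(*) FROM records GROUP BY key")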

Re: Does SparkSQL work with custom defined SerDe?

2014-11-02 Thread Chirag Aggarwal
Did https://issues.apache.org/jira/browse/SPARK-3807 fix the issue you were seeing? If yes, then please note that it shall be part of 1.1.1 and 1.2. Chirag From: Chen Song <chen.song...@gmail.com> Date: Wednesday, 15 October 2014 4:03 AM To: "user@spark.apache.org

Spark cluster stability

2014-11-02 Thread jatinpreet
Hi, I am running a small 6-node Spark cluster for testing purposes. Recently, one of the nodes' disks was filled up by temporary files and there was no space left. Due to this my Spark jobs started failing even though on the Spark Web UI the node was shown as 'Alive'. Once I logged o

Re: Do Spark executors restrict native heap vs JVM heap?

2014-11-02 Thread Sean Owen
Yes, that's correct to my understanding and the probable explanation of your issue. There are no additional limits or differences from how the JVM works here. On Nov 3, 2014 4:40 AM, "Paul Wais" wrote: > Thanks Sean! My novice understanding is that the 'native heap' is the > address space not all
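
On YARN, the knob for memory outside the JVM heap at the time was the executor memory overhead; a hedged sketch of the relevant settings (the values are placeholders):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.executor.memory", "4g")                // JVM heap per executor
      .set("spark.yarn.executor.memoryOverhead", "1024") // MB left for native use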

Re: Spark cluster stability

2014-11-02 Thread Akhil Das
You can enable monitoring (e.g. Nagios) with alerts to tackle these kinds of issues. Thanks Best Regards On Mon, Nov 3, 2014 at 10:55 AM, jatinpreet wrote: > Hi, > > I am running a small 6 node spark cluster for testing purposes. Recently, > one of the node's physical memory was filled up by temporar

Parquet files are only 6-20MB in size?

2014-11-02 Thread ag007
Hi there, I have a pySpark job that simply takes a tab-separated CSV and outputs it to a Parquet file. The code is based on the SQL write parquet example. (Using a different inferred schema, only 35 columns.) The input files range from 100MB to 12GB. I have tried different block s
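
One Parquet part-file is written per partition, so a common fix was to coalesce before saving; a Scala sketch assuming the 1.1-era SchemaRDD API and an existing schemaRdd (the partition count and path are placeholders):

    // Fewer, larger partitions yield fewer, larger part-files; aim for
    // partitions roughly the size of an HDFS block.
    val fewer = schemaRdd.coalesce(8)
    fewer.saveAsParquetFile("hdfs:///out/data.parquet")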

GraphX: extracting the path

2014-11-02 Thread dizzy5112
Hi all, just wondering if there is a way to extract paths in GraphX. For example, if I have the graph attached I would like to return results along the lines of: 101 -> 103, 101 -> 104 -> 108, 102 -> 105, 102 -> 106 -> 107
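
For a graph that small, one hedged sketch is to collect the edges to the driver and walk them depth-first from the roots; nothing below is a GraphX built-in, and graph is assumed to be an existing Graph:

    // Collect (src, dst) pairs into an adjacency map -- fine for small graphs only
    val adj: Map[Long, Seq[Long]] =
      graph.edges.map(e => (e.srcId, e.dstId)).collect()
        .groupBy(_._1).mapValues(_.map(_._2).toSeq).toMap

    def paths(v: Long): Seq[List[Long]] = adj.get(v) match {
      case None | Some(Seq()) => Seq(List(v))   // leaf: the path ends here
      case Some(children)     => children.flatMap(c => paths(c).map(v :: _))
    }

    // (paths(101L) ++ paths(102L)).map(_.mkString(" -> ")).foreach(println)
    // prints lines like: 101 -> 103 and 101 -> 104 -> 108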