Re: Problem getting Spark running on a Yarn cluster

2015-01-06 Thread Akhil Das
Just follow this documentation http://spark.apache.org/docs/1.1.1/running-on-yarn.html Ensure that *HADOOP_CONF_DIR* or *YARN_CONF_DIR* points to the directory which contains the (client side) configuration files for the Hadoop cluster. These configs are used to write to the dfs and connect to the

Re: Why Parquet Predicate Pushdown doesn't work?

2015-01-06 Thread Daniel Haviv
Quoting Michael: Predicate push down into the input format is turned off by default because there is a bug in the current parquet library that causes null pointer exceptions when there are full row groups that are null. https://issues.apache.org/jira/browse/SPARK-4258 You can turn it on if you want: http://spa

Why Parquet Predicate Pushdown doesn't work?

2015-01-06 Thread Xuelin Cao
Hi,        I'm testing parquet file format, and the predicate pushdown is a very useful feature for us.        However, it looks like the predicate push down doesn't work after I set        sqlContext.sql("SET spark.sql.parquet.filterPushdown=true")        Here is my sql:       sqlContext.sql("
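For reference, a minimal sketch of enabling the flag programmatically and issuing a filtered query against the Spark 1.2 SQL API; the file path, table name and column names below are hypothetical.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Sketch: enable Parquet filter pushdown before reading, then run a filtered query.
object FilterPushdownExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("pushdown-test"))
    val sqlContext = new SQLContext(sc)

    // Equivalent to running SET spark.sql.parquet.filterPushdown=true as SQL.
    sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")

    // Hypothetical Parquet data set with an "id" and a "ts" column.
    val events = sqlContext.parquetFile("/data/events.parquet")
    events.registerTempTable("events")

    // With pushdown enabled, the predicate can be evaluated inside the Parquet
    // reader, skipping row groups whose statistics rule them out.
    val recent = sqlContext.sql("SELECT id, ts FROM events WHERE ts > 1420000000")
    println(recent.count())
  }
}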

Re: Spark SQL implementation error

2015-01-06 Thread Pankaj Narang
As per our telephonic call, here is how we can fetch the count: val tweetsCount = sql("SELECT COUNT(*) FROM tweets") println(f"\n\n\nThere are ${tweetsCount.collect.head.getLong(0)} Tweets on this Dataset\n\n") -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spar

Re: Using graphx to calculate average distance of a big graph

2015-01-06 Thread James
We are going to estimate the average distance using [HyperAnf]( http://arxiv.org/abs/1011.5599) on a 100 billion edge graph. 2015-01-07 2:18 GMT+08:00 Ankur Dave : > [-dev] > > What size of graph are you hoping to run this on? For small graphs where > materializing the all-pairs shortest path is

Re: Guava 11 dependency issue in Spark 1.2.0

2015-01-06 Thread Niranda Perera
Hi Sean, I removed the hadoop dependencies from the app and ran it on the cluster. It gives a java.io.EOFException 15/01/07 11:19:29 INFO MemoryStore: ensureFreeSpace(177166) called with curMem=0, maxMem=2004174766 15/01/07 11:19:29 INFO MemoryStore: Block broadcast_0 stored as values in memory (

Re: TF-IDF from spark-1.1.0 not working on cluster mode

2015-01-06 Thread Xiangrui Meng
Could you attach the executor log? That may help identify the root cause. -Xiangrui On Mon, Jan 5, 2015 at 11:12 PM, Priya Ch wrote: > Hi All, > > Word2Vec and TF-IDF algorithms in spark mllib-1.1.0 are working only in > local mode and not on distributed mode. Null pointer exception has been > th

Re: How to merge a RDD of RDDs into one uber RDD

2015-01-06 Thread k.tham
an RDD cannot contain elements of type RDD. (i.e. you can't nest RDDs within RDDs, in fact, I don't think it makes any sense) I suggest rather than having an RDD of file names, collect those file name strings back on to the driver as a Scala array of file names, and then from there, make an array
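A small sketch of that suggestion, assuming the files are plain text and their names are available on the driver; the paths are made up.

import org.apache.spark.{SparkConf, SparkContext}

// Sketch: build one RDD per file name on the driver, then union them into a single RDD.
object MergeFilesExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("merge-files"))

    // File names collected to the driver (e.g. via fileNameRDD.collect()).
    val fileNames: Seq[String] = Seq("/data/part-0001.txt", "/data/part-0002.txt")

    // A Seq of RDDs held on the driver -- not an RDD of RDDs.
    val rdds = fileNames.map(name => sc.textFile(name))

    // SparkContext.union flattens them into one RDD.
    val uber = sc.union(rdds)
    println(uber.count())
  }
}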

Re: confidence/probability for prediction in MLlib

2015-01-06 Thread Xiangrui Meng
This is addressed in https://issues.apache.org/jira/browse/SPARK-4789. In the new pipeline API, we can simply output two columns, one for the best predicted class, and the other for probabilities or confidence scores for each class. -Xiangrui On Tue, Jan 6, 2015 at 11:43 AM, Jianguo Li wrote: > H

Re: [MLLib] storageLevel in ALS

2015-01-06 Thread Xiangrui Meng
Which Spark version are you using? We made this configurable in 1.1: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala#L202 -Xiangrui On Tue, Jan 6, 2015 at 12:57 PM, Fernando O. wrote: > Hi, >I was doing a tests with ALS and I
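A hedged sketch of using that setting, assuming the setter is named setIntermediateRDDStorageLevel as in the linked ALS.scala (verify against the Spark version actually in use); the ratings file format is made up.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.recommendation.{ALS, Rating}
import org.apache.spark.storage.StorageLevel

// Sketch: configure the storage level used for ALS's intermediate RDDs.
object AlsStorageLevelExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("als-storage-level"))

    // Hypothetical "user,product,rating" CSV.
    val ratings = sc.textFile("/data/ratings.csv").map { line =>
      val Array(user, product, rating) = line.split(",")
      Rating(user.toInt, product.toInt, rating.toDouble)
    }

    val als = new ALS()
      .setRank(10)
      .setIterations(10)
      // assumed setter name; see the ALS source linked above
      .setIntermediateRDDStorageLevel(StorageLevel.MEMORY_AND_DISK_SER)

    val model = als.run(ratings)
    println(model.rank)
  }
}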

Re: MLLIB and Openblas library in non-default dir

2015-01-06 Thread Xiangrui Meng
spark-submit may not share the same JVM with Spark master and executors. On Tue, Jan 6, 2015 at 11:40 AM, Tomas Hudik wrote: > thanks Xiangrui > > I'll try it. > > BTW: spark-submit is a standalone program (bin/spark-submit). Therefore, JVM > has to be executed after spark-submit script > Am I co

Re: Shuffle Problems in 1.2.0

2015-01-06 Thread Sven Krasser
Hey Davies, Here are some more details on a configuration that causes this error for me. Launch an AWS Spark EMR cluster as follows: *aws emr create-cluster --region us-west-1 --no-auto-terminate \ --ec2-attributes KeyName=your-key-here,SubnetId=your-subnet-here \ --bootstrap-actions Path=

Re: Re: I think I am almost lost in the internals of Spark

2015-01-06 Thread Krishna Sankar
Interestingly Google Chrome translates the materials. Cheers On Tue, Jan 6, 2015 at 7:26 PM, Boromir Widas wrote: > I do not understand Chinese but the diagrams on that page are very helpful. > > On Tue, Jan 6, 2015 at 9:46 PM, eric wong wrote: > >> A good beginning if you are chinese. >> >> h

SparkSQL schemaRDD & MapPartitions calls - performance issues - columnar formats?

2015-01-06 Thread Nathan McCarthy
Hi, I’m trying to use a combination of SparkSQL and ‘normal’ Spark/Scala via rdd.mapPartitions(…). Using the latest release 1.2.0. Simple example; load up some sample data from parquet on HDFS (about 380m rows, 10 columns) on a 7 node cluster. val t = sqlC.parquetFile("/user/n/sales-tran12m.

Re: Re: I think I am almost lost in the internals of Spark

2015-01-06 Thread Boromir Widas
I do not understand Chinese but the diagrams on that page are very helpful. On Tue, Jan 6, 2015 at 9:46 PM, eric wong wrote: > A good beginning if you are chinese. > > https://github.com/JerryLead/SparkInternals/tree/master/markdown > > 2015-01-07 10:13 GMT+08:00 bit1...@163.com : > >> Thank you

Re: Cannot see RDDs in Spark UI

2015-01-06 Thread Andrew Ash
Hi Manoj, I've noticed that the storage tab only shows RDDs that have been cached. Did you call .cache() or .persist() on any of the RDDs? Andrew On Tue, Jan 6, 2015 at 6:48 PM, Manoj Samel wrote: > Hi, > > I create a bunch of RDDs, including schema RDDs. When I run the program > and go to UI
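A minimal sketch of this from a spark-shell session (where sc is predefined); the input path is hypothetical. The Storage tab only lists an RDD once it has been marked persistent and materialized by an action.

val lines = sc.textFile("/data/input.txt")
lines.cache()   // mark the RDD persistent (or lines.persist(StorageLevel.MEMORY_ONLY))
lines.count()   // run an action so the blocks are actually stored
// The RDD should now appear under the Storage tab at http://<driver>:4040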

Re: Re: I think I am almost lost in the internals of Spark

2015-01-06 Thread bit1...@163.com
Thanks Eric. Yes..I am Chinese, :-). I will read through the articles, thank you! bit1...@163.com From: eric wong Date: 2015-01-07 10:46 To: bit1...@163.com CC: user Subject: Re: Re: I think I am almost lost in the internals of Spark A good beginning if you are chinese. https://github.com/Je

Cannot see RDDs in Spark UI

2015-01-06 Thread Manoj Samel
Hi, I create a bunch of RDDs, including schema RDDs. When I run the program and go to UI on xxx:4040, the storage tab does not shows any RDDs. Spark version is 1.1.1 (Hadoop 2.3) Any thoughts? Thanks,

Re: Re: I think I am almost lost in the internals of Spark

2015-01-06 Thread eric wong
A good beginning if you are chinese. https://github.com/JerryLead/SparkInternals/tree/master/markdown 2015-01-07 10:13 GMT+08:00 bit1...@163.com : > Thank you, Tobias. I will look into the Spark paper. But it looks that > the paper has been moved, > http://www.cs.berkeley.edu/~matei/papers/2012

Re: Re: I think I am almost lost in the internals of Spark

2015-01-06 Thread bit1...@163.com
Thank you, Tobias. I will look into the Spark paper. But it looks like the paper has been moved, http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf. A web page is returned (Resource not found) when I access it. bit1...@163.com From: Tobias Pfeiffer Date: 2015-01-07 09:24 To: Todd C

Re: How to replace user.id to user.names in a file

2015-01-06 Thread Tobias Pfeiffer
Hi, On Wed, Jan 7, 2015 at 11:13 AM, Riginos Samaras wrote: > exactly thats what I'm looking for, my code is like this: > //code > > val users_map = users_file.map{ s => > > val parts = s.split(",") > > (parts(0).toInt, parts(1)) > > }.distinct > > //code > > > but i get the error: > > error: va

Re: How to replace user.id to user.names in a file

2015-01-06 Thread Tobias Pfeiffer
Hi, On Wed, Jan 7, 2015 at 10:47 AM, Riginos Samaras wrote: > Yes something like this. Can you please give me an example to create a Map? > That depends heavily on the shape of your input file. What about something like: (for (line <- Source.fromFile(filename).getLines()) { val items = line.

Re: How to replace user.id to user.names in a file

2015-01-06 Thread Tobias Pfeiffer
Hi, it looks to me as if you need the whole user database on every node, so maybe put the id->name information as a Map[Id, String] in a broadcast variable and then do something like recommendations.map(line => { line.map(uid => usernames(uid)) }) or so? Tobias
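A fuller sketch of the broadcast-variable idea, assuming the id-to-name pairs sit in a comma-separated users file and the recommendations use the user::rec1:rec2:... format from the original question; all paths are hypothetical.

import org.apache.spark.{SparkConf, SparkContext}

// Sketch: ship the id -> name map to every node once, then substitute names for ids.
object ReplaceIdsWithNames {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ids-to-names"))

    // users.txt: one "id,name" pair per line, small enough to collect to the driver.
    val userMap: Map[Int, String] = sc.textFile("/data/users.txt").map { line =>
      val parts = line.split(",")
      (parts(0).toInt, parts(1))
    }.collect().toMap

    val usernames = sc.broadcast(userMap)

    // recommends.txt: "user::rec1:rec2:rec3:rec4:rec5"
    val withNames = sc.textFile("/data/recommends.txt").map { line =>
      val Array(user, recs) = line.split("::")
      val names = recs.split(":").map(id => usernames.value.getOrElse(id.toInt, id))
      usernames.value.getOrElse(user.toInt, user) + "::" + names.mkString(":")
    }

    withNames.saveAsTextFile("/data/recommends-names")
  }
}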

Re: Parquet schema changes

2015-01-06 Thread Michael Armbrust
I want to support this but we don't yet. Here is the JIRA: https://issues.apache.org/jira/browse/SPARK-3851 On Tue, Jan 6, 2015 at 5:23 PM, Adam Gilmore wrote: > Anyone got any further thoughts on this? I saw the _metadata file seems > to store the schema of every single part (i.e. file) in th

Re: I think I am almost lost in the internals of Spark

2015-01-06 Thread Tobias Pfeiffer
Hi, On Tue, Jan 6, 2015 at 11:24 PM, Todd wrote: > I am a bit new to Spark, except that I tried simple things like word > count, and the examples given in the spark sql programming guide. > Now, I am investigating the internals of Spark, but I think I am almost > lost, because I could not grasp

Re: Parquet schema changes

2015-01-06 Thread Adam Gilmore
Anyone got any further thoughts on this? I saw the _metadata file seems to store the schema of every single part (i.e. file) in the parquet directory, so in theory it should be possible. Effectively, our use case is that we have a stack of JSON that we receive and we want to encode to Parquet for

Re: Parquet predicate pushdown

2015-01-06 Thread Adam Gilmore
Thanks for that. Strangely enough I was actually using 1.1.1 where it did seem to be enabled by default. Since upgrading to 1.2.0 and setting that flag, I do get the expected result! Looks good! On Tue, Jan 6, 2015 at 12:17 PM, Michael Armbrust wrote: > Predicate push down into the input form

How to replace user.id to user.names in a file

2015-01-06 Thread riginos
I work on a user-to-user recommender for a website using mllib.recommendation. I have created a file (recommends.txt) which contains the top 5 recommendations for each user id. The file's form (recommends.txt) is something like this (user::rec1:rec2:rec3:rec4:rec5): /**file's snapshot**/ 5823::944

Re: RDD Moving Average

2015-01-06 Thread Asim Jalis
One approach I was considering was to use mapPartitions. It is straightforward to compute the moving average over a partition, except near the partition boundaries. Does anyone see how to fix that? On Tue, Jan 6, 2015 at 7:20 PM, Sean Owen wrote: > Interesting, I am not sure the order in which fold() e

Re: Using ec2 launch script with locally built version of spark?

2015-01-06 Thread gen tang
Hi, Since the ec2 launch script provided by Spark uses https://github.com/mesos/spark-ec2 to download and configure all the tools in the cluster (spark, hadoop etc), you can create your own git repository to achieve your goal. More precisely: 1. Upload your own version of spark in s3 at address 2.

Re: RDD Moving Average

2015-01-06 Thread Sean Owen
Interesting, I am not sure the order in which fold() encounters elements is guaranteed, although from reading the code, I imagine in practice it is first-to-last by partition and then folded first-to-last from those results on the driver. I don't know this would lead to a solution though as the res

Re: RDD Moving Average

2015-01-06 Thread Paolo Platter
In my opinion you should use a fold pattern, obviously after a sort-by transformation. Paolo Sent from my Windows Phone From: Asim Jalis Sent: 06/01/2015 23:11 To: Sean Owen Cc: user@spark.apache.org

Using ec2 launch script with locally built version of spark?

2015-01-06 Thread Ganon Pierce
Is there a way to use the ec2 launch script with a locally built version of spark? I launch and destroy clusters pretty frequently and would like to not have to wait each time for the master instance to compile the source as happens when I set the -v tag with the latest git commit. To be clear,

Re: Snappy error when driver is running in JBoss

2015-01-06 Thread Charles Li
Hi Thanks for the reply! I did echo $CLASSPATH, but I got nothing. Since we are running inside JBoss, I guess the classpath is not set? People did mention that JBoss loads snappy-java multiple times. But I cannot find a way to solve that problem. Cheers On Jan 6, 2015, at 5:3

Re: Snappy error when driver is running in JBoss

2015-01-06 Thread Ted Yu
Might be due to conflict between multiple snappy jars. Can you check the classpath to see if there are more than one snappy jar ? Cheers On Tue, Jan 6, 2015 at 2:26 PM, Charles wrote: > I get this exception(java.lang.UnsatisfiedLinkError) when the driver is > running inside JBoss. > > We are r

Re: Spark 1.1.0 and HBase: Snappy UnsatisfiedLinkError

2015-01-06 Thread Charles
Hi, I am getting this same error. Did you figure out how to solve the problem? Thanks! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-1-0-and-HBase-Snappy-UnsatisfiedLinkError-tp19827p21005.html Sent from the Apache Spark User List mailing list arch

Snappy error when driver is running in JBoss

2015-01-06 Thread Charles
I get this exception(java.lang.UnsatisfiedLinkError) when the driver is running inside JBoss. We are running with DataStax 4.6 version, which is using spark 1.1.0. The driver runs inside a wildfly container. The snappy-java version is 1.0.5. 2015-01-06 20:25:03,771 ERROR [akka.actor.ActorSystem

Re: RDD Moving Average

2015-01-06 Thread Asim Jalis
One problem with this is that we are creating a lot of iterables containing a lot of repeated data. Is there a way to do this so that we can calculate a moving average incrementally? On Tue, Jan 6, 2015 at 4:44 PM, Sean Owen wrote: > Yes, if you break it down to... > > tickerRDD.map(ticker => >

Re: Current Build Gives HTTP ERROR

2015-01-06 Thread Sean Owen
FWIW I do not see any such error, after a "mvn -DskipTests clean package" and "./bin/spark-shell" from master. Maybe double-check you have done a full clean build. On Tue, Jan 6, 2015 at 9:09 PM, Ganon Pierce wrote: > I’m attempting to build from the latest commit on git and receive the > follow

Re: Launching Spark app in client mode for standalone cluster

2015-01-06 Thread Boromir Widas
Thanks for the pointers. The issue was due to route caching by Spray, which would always return the same value. Other than that the program is working fine. On Mon, Jan 5, 2015 at 12:44 AM, Simon Chan wrote: > Boromir, > > You may like to take a look at how we make Spray and Spark working > toge

Re: RDD Moving Average

2015-01-06 Thread Sean Owen
Yes, if you break it down to... tickerRDD.map(ticker => (ticker.timestamp, ticker) ).map { case(ts, ticker) => ((ts / 6) * 6, ticker) }.groupByKey ... as Michael alluded to, then it more naturally extends to the sliding window, since you can flatMap one Ticker to many (bucket, ticker)
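A sketch of that extension: each Ticker is emitted into every window that should contain it, so consecutive windows overlap. The Ticker class, window length and slide interval below are illustrative assumptions.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

// Sketch: flatMap each event into all overlapping time buckets, then aggregate per bucket.
object SlidingWindowBuckets {
  case class Ticker(timestamp: Long, price: Double) // timestamp in ms

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("sliding-window"))
    val tickerRDD = sc.parallelize(Seq(
      Ticker(1000L, 10.0), Ticker(61000L, 11.0), Ticker(122000L, 12.0)))

    val windowMs = 180000L // each window covers 3 minutes
    val slideMs  = 60000L  // a new window starts every minute

    val averages = tickerRDD.flatMap { t =>
      // every window start s (a multiple of slideMs) with s <= t.timestamp < s + windowMs;
      // the earliest events fall into windows starting before time zero, which is harmless
      val lastStart  = (t.timestamp / slideMs) * slideMs
      val firstStart = lastStart - windowMs + slideMs
      (firstStart to lastStart by slideMs).map(start => (start, t))
    }.groupByKey()
     .mapValues(ts => ts.map(_.price).sum / ts.size)

    averages.collect().sortBy(_._1).foreach(println)
  }
}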

Current Build Gives HTTP ERROR

2015-01-06 Thread Ganon Pierce
I’m attempting to build from the latest commit on git and receive the following error upon attempting to access the application web ui: HTTP ERROR: 500 Problem accessing /jobs/. Reason: Server Error Powered by Jetty:// My driver also prints this error: java.lang.UnsupportedOperationExcept

Re: Driver hangs on running mllib word2vec

2015-01-06 Thread Ganon Pierce
Oops, just kidding, this method is not in the current release. However, it is included in the latest commit on git if you want to do a build. > On Jan 6, 2015, at 2:56 PM, Ganon Pierce wrote: > > Two billion words is a very large vocabulary… You can try solving this issue > by by setting the

Re: Data Locality

2015-01-06 Thread Andrew Ash
You can also read about locality here in the docs: http://spark.apache.org/docs/latest/tuning.html#data-locality On Tue, Jan 6, 2015 at 8:37 AM, Cody Koeninger wrote: > No, not all rdds have location information, and in any case tasks may be > scheduled on non-local nodes if there is idle capaci

Re: Driver hangs on running mllib word2vec

2015-01-06 Thread Ganon Pierce
Two billion words is a very large vocabulary… You can try solving this issue by setting the number of times words must occur in order to be included in the vocabulary using setMinCount; this will prevent common misspellings, websites, and other things from being included and may improve th
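A hedged sketch of that suggestion; note the correction elsewhere in this thread that setMinCount was not in the 1.2.0 release and required a build from the then-current master. The corpus path is made up.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.feature.Word2Vec

// Sketch: cap the vocabulary by requiring a minimum token frequency before training.
object Word2VecMinCountExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("word2vec-mincount"))
    val corpus = sc.textFile("/data/corpus.txt").map(_.split(" ").toSeq)

    val w2v = new Word2Vec()
      .setVectorSize(100)
      .setMinCount(50) // drop rare tokens (typos, URLs, ...) from the vocabulary

    val model = w2v.fit(corpus)
    model.findSynonyms("spark", 5).foreach(println)
  }
}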

[MLLib] storageLevel in ALS

2015-01-06 Thread Fernando O.
Hi, I was doing some tests with ALS and I noticed that if I persist the inner RDDs from a MatrixFactorizationModel, the RDD is not replicated; it seems like the storage level is hardcoded to MEMORY_AND_DISK. Do you think it makes sense to make that configurable? [image: Inline image 1]

Re: RDD Moving Average

2015-01-06 Thread Michael Malak
Asim Jalis writes: > > Thanks. Another question. I have event data with timestamps. I want to > create a sliding window > using timestamps. Some windows will have a lot of events in them others > won’t. Is there a way > to get an RDD made of this kind of a variable length window? You should c

Re: RDD Moving Average

2015-01-06 Thread Asim Jalis
I guess I can use a similar groupBy approach. Map each event to all the windows that it can belong to. Then do a groupBy, etc. I was wondering if there was a more elegant approach. On Tue, Jan 6, 2015 at 3:45 PM, Asim Jalis wrote: > Except I want it to be a sliding window. So the same record cou

Re: RDD Moving Average

2015-01-06 Thread Asim Jalis
Except I want it to be a sliding window. So the same record could be in multiple buckets. On Tue, Jan 6, 2015 at 3:43 PM, Sean Owen wrote: > So you want windows covering the same length of time, some of which will > be fuller than others? You could, for example, simply bucket the data by > minut

Re: RDD Moving Average

2015-01-06 Thread Sean Owen
So you want windows covering the same length of time, some of which will be fuller than others? You could, for example, simply bucket the data by minute to get this kind of effect. If you have an RDD[Ticker], where Ticker has a timestamp in ms, you could: tickerRDD.groupBy(ticker => (ticker.timestamp /

Re: RDD Moving Average

2015-01-06 Thread Asim Jalis
Thanks. Another question. I have event data with timestamps. I want to create a sliding window using timestamps. Some windows will have a lot of events in them others won’t. Is there a way to get an RDD made of this kind of a variable length window? On Tue, Jan 6, 2015 at 1:03 PM, Sean Owen wr

Re: Shuffle Problems in 1.2.0

2015-01-06 Thread Davies Liu
I still can not reproduce it with 2 nodes (4 CPUs). Your repro.py could be faster (10 min) than before (22 min): inpdata.map(lambda (pc, x): (x, pc=='p' and 2 or 1)).reduceByKey(lambda x, y: x|y).filter(lambda (x, pc): pc==3).collect() (also, no cache needed anymore) Davies On Tue, Jan 6, 20

Re: Reading from a centralized stored

2015-01-06 Thread Franc Carter
Ah, so it's rdd specific - that would make sense. For those systems where it is possible to extract sensible subsets, the rdds do so. My use case, which is probably biasing my thinking, is DynamoDB, which I don't think can efficiently extract records from M-to-N cheers On Wed, Jan 7, 2015 at 6:59 AM

Re: disable log4j for spark-shell

2015-01-06 Thread brichards
FYI it's --driver-java-options "-Dkey=value", with no equal sign between the flag and the arguments. Chewed up some time figuring that out. Bobby -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/disable-log4j-for-spark-shell-tp11278p21003.html Sent from the Apa

Re: Multiple Spark Streaming receiver model

2015-01-06 Thread Silvio Fiorito
Hi Manjul, Each StreamingContext will have its own batch size. If that doesn’t work for the different sources you have then you would have to create different streaming apps. You can only create a new StreamingContext in the same Spark app, once you’ve stopped the previous one. Spark certainly

Re: Reading from a centralized stored

2015-01-06 Thread Cody Koeninger
No, most rdds partition input data appropriately. On Tue, Jan 6, 2015 at 1:41 PM, Franc Carter wrote: > > One more question, to be clarify. Will every node pull in all the data ? > > thanks > > On Tue, Jan 6, 2015 at 12:56 PM, Cody Koeninger > wrote: > >> If you are not co-locating spark execut

Re: Reading from a centralized stored

2015-01-06 Thread Franc Carter
One more question, to clarify. Will every node pull in all the data ? thanks On Tue, Jan 6, 2015 at 12:56 PM, Cody Koeninger wrote: > If you are not co-locating spark executor processes on the same machines > where the data is stored, and using an rdd that knows about which node to > prefer

confidence/probability for prediction in MLlib

2015-01-06 Thread Jianguo Li
Hi, A while ago, somebody asked about getting a confidence value of a prediction with MLlib's implementation of Naive Bayes's classification. I was wondering if there is any plan in the near future for the predict function to return both a label and a confidence/probability? Or could the private

Re: MLLIB and Openblas library in non-default dir

2015-01-06 Thread Tomas Hudik
thanks Xiangrui I'll try it. BTW: spark-submit is a standalone program (bin/spark-submit). Therefore, the JVM has to be executed after the spark-submit script. Am I correct? On Mon, Jan 5, 2015 at 10:35 PM, Xiangrui Meng wrote: > It might be hard to do that with spark-submit, because the executor > J

HDFS_DELEGATION_TOKEN errors after switching Spark Contexts

2015-01-06 Thread Ganelin, Ilya
Hi all. In order to get Spark to properly release memory during batch processing as a workaround to issue https://issues.apache.org/jira/browse/SPARK-4927 I tear down and re-initialize the spark context with : context.stop() and context = new SparkContext() The problem I run into is that eventu

Is there a way to read a parquet database without generating an RDD

2015-01-06 Thread Steve Lewis
I have an application where a function needs access to the results of a select from a parquet database. Creating a JavaSQLContext and from it a JavaSchemaRDD as shown below works but the parallelism is not needed - a simple JDBC call would work - Are there alternative non-parallel ways to achieve

Multiple Spark Streaming receiver model

2015-01-06 Thread manjuldixit
Hi, We have a requirement of receiving live input messages from RabbitMQ and process them into micro batches. For this we have selected SparkStreaming and we have written a connector for RabbitMQ receiver and Spark streaming, it is working fine. Now the main requirement is to receive different c

Re: Set EXTRA_JAR environment variable for spark-jobserver

2015-01-06 Thread Todd Nist
*@Sasi* You should be able to create a job something like this: package io.radtech.spark.jobserver import java.util.UUID import org.apache.spark.{ SparkConf, SparkContext } import org.apache.spark.rdd.RDD import org.joda.time.DateTime import com.datastax.spark.connector.types.TypeConverter impor
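For readers following along, a rough skeleton of such a job, assuming the spark.jobserver.SparkJob trait of the ooyala spark-jobserver of that era; the trait, package and config key should be checked against the jobserver version actually in use.

import com.typesafe.config.Config
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import spark.jobserver.{SparkJob, SparkJobValid, SparkJobValidation}

// Sketch: the jobserver owns the SparkContext and passes it into runJob.
object WordCountJob extends SparkJob {
  override def validate(sc: SparkContext, config: Config): SparkJobValidation =
    SparkJobValid

  override def runJob(sc: SparkContext, config: Config): Any = {
    val path = config.getString("input.path") // hypothetical key supplied when the job is posted
    sc.textFile(path)
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .take(10)
  }
}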

Re: Using graphx to calculate average distance of a big graph

2015-01-06 Thread Ankur Dave
[-dev] What size of graph are you hoping to run this on? For small graphs where materializing the all-pairs shortest path is an option, you could simply find the APSP using https://github.com/apache/spark/pull/3619 and then take the average distance (apsp.map(_._2.toDouble).mean). Ankur

RE: Saving data to Hbase hung in Spark streaming application with Spark 1.2.0

2015-01-06 Thread Max Xu
Issue resolved after updating the Hbase version to 0.98.8-hadoop2. Thanks Ted for all the help! For future reference: this problem has nothing to do with Spark 1.2.0; it happened simply because I built Spark 1.2.0 with the wrong Hbase version. From: Ted Yu [mailto:yuzhih...@gmail.com] Sent: Tuesday, Janu

Re: Saving data to Hbase hung in Spark streaming application with Spark 1.2.0

2015-01-06 Thread Ted Yu
I doubt anyone would deploy hbase 0.98.x on hadoop-1. Looks like the hadoop2 profile should be made the default. Cheers On Tue, Jan 6, 2015 at 9:49 AM, Max Xu wrote: > Awesome. Thanks again Ted. I remember there is a block in the pom.xml > under the example folder that default hbase version to had

Re: RDD Moving Average

2015-01-06 Thread Sean Owen
First you'd need to sort the RDD to give it a meaningful order, but I assume you have some kind of timestamp in your data you can sort on. I think you might be after the sliding() function, a developer API in MLlib: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark
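A short sketch of that approach, assuming simple (timestamp, price) pairs; sliding() lives in org.apache.spark.mllib.rdd.RDDFunctions and is a developer API, so its location may change between versions.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._
import org.apache.spark.mllib.rdd.RDDFunctions._

// Sketch: sort by timestamp, then average each fixed-size sliding window of prices.
object MovingAverageExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("moving-average"))

    // Made-up (timestamp, price) data.
    val prices = sc.parallelize(Seq(
      (1L, 10.0), (2L, 11.0), (3L, 9.0), (4L, 12.0), (5L, 13.0)))

    val window = 3
    val movingAvg = prices
      .sortByKey()            // give the RDD a meaningful order first
      .map(_._2)
      .sliding(window)        // RDD[Array[Double]] of consecutive windows
      .map(w => w.sum / w.length)

    movingAvg.collect().foreach(println)
  }
}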

RE: Saving data to Hbase hung in Spark streaming application with Spark 1.2.0

2015-01-06 Thread Max Xu
Awesome. Thanks again Ted. I remember there is a block in the pom.xml under the example folder that defaults the hbase version to hadoop1. I figured this out last time when I built Spark 1.1.1 but forgot this time. hbase-hadoop1 !hbase.profile

Re: Saving data to Hbase hung in Spark streaming application with Spark 1.2.0

2015-01-06 Thread Ted Yu
Default profile is hbase-hadoop1 so you need to specify -Dhbase.profile=hadoop2 See SPARK-1297 Cheers On Tue, Jan 6, 2015 at 9:11 AM, Max Xu wrote: > Thanks Ted. You are right, hbase-site.xml is in the classpath. But > previously I have it in the classpath too and the app works fine. I believ

1.2.0 - java.lang.ClassCastException: scala.Tuple2 cannot be cast to scala.collection.Iterator

2015-01-06 Thread bchazalet
I am running into the same problem described here https://www.mail-archive.com/user%40spark.apache.org/msg17788.html which for some reasons does not appear in the archives. I have a standalone scala application, built (using sbt) with spark jars from maven: "org.apache.spark" %% "spark-core

RDD Moving Average

2015-01-06 Thread Asim Jalis
Is there an easy way to do a moving average across a single RDD (in a non-streaming app). Here is the use case. I have an RDD made up of stock prices. I want to calculate a moving average using a window size of N. Thanks. Asim

RE: Saving data to Hbase hung in Spark streaming application with Spark 1.2.0

2015-01-06 Thread Max Xu
Thanks Ted. You are right, hbase-site.xml is in the classpath. But previously I have it in the classpath too and the app works fine. I believe I found the problem. I built Spark 1.2.0 myself and forgot to change the dependency hbase version to 0.98.8-hadoop2, which is the version I use. When I u

Re: Shuffle Problems in 1.2.0

2015-01-06 Thread Sven Krasser
The issue has been sensitive to the number of executors and input data size. I'm using 2 executors with 4 cores each, 25GB of memory, 3800MB of memory overhead for YARN. This will fit onto Amazon r3 instance types. -Sven On Tue, Jan 6, 2015 at 12:46 AM, Davies Liu wrote: > I had ran your scripts

Re: Saving data to Hbase hung in Spark streaming application with Spark 1.2.0

2015-01-06 Thread Ted Yu
I assume hbase-site.xml is in the classpath. Can you try the code snippet in standalone program to see if the problem persists ? Cheers On Tue, Jan 6, 2015 at 6:42 AM, Max Xu wrote: > Hi all, > > > > I have a Spark streaming application that ingests data from a Kafka topic > and persists rece

Re: Data Locality

2015-01-06 Thread Cody Koeninger
No, not all rdds have location information, and in any case tasks may be scheduled on non-local nodes if there is idle capacity. see spark.locality.wait http://spark.apache.org/docs/latest/configuration.html On Tue, Jan 6, 2015 at 10:17 AM, gtinside wrote: > Does spark guarantee to push the
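For completeness, a small sketch of adjusting that setting on a SparkConf; the value shown is illustrative (spark.locality.wait is given in milliseconds in Spark 1.x).

import org.apache.spark.{SparkConf, SparkContext}

// Sketch: make the scheduler wait longer for a data-local slot before
// falling back to a non-local node.
val conf = new SparkConf()
  .setAppName("locality-demo")
  .set("spark.locality.wait", "6000")
val sc = new SparkContext(conf)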

Data Locality

2015-01-06 Thread gtinside
Does spark guarantee to push the processing to the data ? Before creating tasks does spark always check for data location ? So for example if I have 3 spark nodes (Node1, Node2, Node3) and data is local to just 2 nodes (Node1 and Node2) , will spark always schedule tasks on the node for which the d

Re: Spark Driver "behind" NAT

2015-01-06 Thread Aaron
Found the issue in JIRA: https://issues.apache.org/jira/browse/SPARK-4389?jql=project%20%3D%20SPARK%20AND%20text%20~%20NAT On Tue, Jan 6, 2015 at 10:45 AM, Aaron wrote: > From what I can tell, this isn't a "firewall" issue per se..it's how the > Remoting Service "binds" to an IP given cmd line

Re: Spark Driver "behind" NAT

2015-01-06 Thread Aaron
From what I can tell, this isn't a "firewall" issue per se..it's how the Remoting Service "binds" to an IP given cmd line parameters. So, if I have a VM (or OpenStack or EC2 instance) running on a private network let's say, where the IP address is 192.168.X.Y...I can't tell the Workers to "reach

Re: different akka versions and spark

2015-01-06 Thread Koert Kuipers
If the classes are in the original location then I think it's safe to say that this makes it impossible for us to build one app that can run against spark 1.0.x, 1.1.x and spark 1.2.x. That's no big deal, but it does beg the question of what compatibility can reasonably be expected for spark 1.x ser

Location of logs in local mode

2015-01-06 Thread Brett Meyer
I'm submitting a script using spark-submit in local mode for testing, and I'm having trouble figuring out where the logs are stored. The documentation indicates that they should be in the work folder in the directory in which Spark lives on my system, but I see no such folder there. I've set the S

Re: Set EXTRA_JAR environment variable for spark-jobserver

2015-01-06 Thread bchazalet
It does not look like you're supposed to fiddle with the SparkConf and even SparkContext in a 'job' (again, I don't know much about jobserver), as you're given a SparkContext as parameter in the build method. I guess jobserver initialises the SparkConf and SparkContext itself when it first starts,

Re: Finding most occurrences in a JSON Nested Array

2015-01-06 Thread Pankaj Narang
That's great. I didn't have access to the developer machine, so I sent you the pseudo code only. Happy to see it's working. If you need any more help related to Spark let me know anytime. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Finding-most-occurrence

Problem getting Spark running on a Yarn cluster

2015-01-06 Thread Sharon Rapoport
Hello, We have hadoop 2.6.0 and Yarn set up on ec2. Trying to get spark 1.1.1 running on the Yarn cluster. I have of course googled around and found that this problem is solved for most after removing the line including 127.0.1.1 from /etc/hosts. That hasn't solved it for me, though. Anyone

Saving data to Hbase hung in Spark streaming application with Spark 1.2.0

2015-01-06 Thread Max Xu
Hi all, I have a Spark streaming application that ingests data from a Kafka topic and persists received data to Hbase. It works fine with Spark 1.1.1 in YARN cluster mode. Basically, I use the following code to persist each partition of each RDD to Hbase: @Override void call(It

How to limit the number of concurrent tasks per node?

2015-01-06 Thread Pengcheng YIN
Hi Pro, One map() operation in my Spark APP takes an RDD[A] as input and maps each element in RDD[A] using a custom mapping function func(x:A):B to another object of type B. I received lots of OutOfMemory errors, and after some debugging I found this is because func() requires significant amount

I think I am almost lost in the internals of Spark

2015-01-06 Thread Todd
I am a bit new to Spark, except that I tried simple things like word count, and the examples given in the spark sql programming guide. Now, I am investigating the internals of Spark, but I think I am almost lost, because I could not grasp a whole picture of what spark does when it executes the word

Re: Finding most occurrences in a JSON Nested Array

2015-01-06 Thread adstan
Many thanks Pankaj, I've got it working. For completeness, here's the whole segment (including the printout at diff stages): -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Finding-most-occurrences-in-a-JSON-Nested-Array-tp20971p20996.html Sent from the Ap

Trouble with large Yarn job

2015-01-06 Thread Anders Arpteg
Hey, I have a job that keeps failing if too much data is processed, and I can't see how to get it working. I've tried repartitioning with more partitions and increasing the amount of memory for the executors (now about 12G and 400 executors. Here is a snippet of the first part of the code, which succ

Pyspark Interactive shell

2015-01-06 Thread Naveen Kumar Pokala
Hi, Has anybody tried to connect to a Spark cluster (on UNIX machines) from a Windows interactive shell? -Naveen.

Re: Set EXTRA_JAR environment variable for spark-jobserver

2015-01-06 Thread Sasi
Boris, Thank you for your suggestion. I used the following code and am still facing the same issue - val conf = new SparkConf(true).set("spark.cassandra.connection.host", "127.0.0.1") .setAppName("jobserver test demo") .set

streamSQL - is it available or is it in POC ?

2015-01-06 Thread tfrisk
Hi, Just wondering whether this is released yet and if so on which version of Spark ? Many Thanks, Thomas -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/streamSQL-is-it-available-or-is-it-in-POC-tp20993.html Sent from the Apache Spark User List mailing

Re: Set EXTRA_JAR environment variable for spark-jobserver

2015-01-06 Thread Akhil Das
Or you can use: sc.addJar("/path/to/your/datastax.jar") Thanks Best Regards On Tue, Jan 6, 2015 at 5:53 PM, bchazalet wrote: > I don't know much about spark-jobserver, but you can set jars > programatically > using the method setJars on SparkConf. Looking at your code it seems that > you're i

Re: Set EXTRA_JAR environment variable for spark-jobserver

2015-01-06 Thread Pankaj Narang
I suggest creating an uber jar instead. Check my thread for the same: http://apache-spark-user-list.1001560.n3.nabble.com/NoSuchMethodError-com-typesafe-config-Config-getDuration-with-akka-http-akka-stream-td20926.html Regards -Pankaj Linkedin https://www.linkedin.com/profile/view?id=171566646 S

Re: NoSuchMethodError: com.typesafe.config.Config.getDuration with akka-http/akka-stream

2015-01-06 Thread Pankaj Narang
Good luck. Let me know If I can assist you further Regards -Pankaj Linkedin https://www.linkedin.com/profile/view?id=171566646 Skype pankaj.narang -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/NoSuchMethodError-com-typesafe-config-Config-getDuration-wi

Re: Spark error in execution

2015-01-06 Thread Daniel Darabos
Hello! I just had a very similar stack trace. It was caused by an Akka version mismatch. (From trying to use Play 2.3 with Spark 1.1 by accident instead of 1.2.) On Mon, Nov 24, 2014 at 7:15 PM, Blackeye wrote: > I created an application in spark. When I run it with spark, everything > works > f

Re: Set EXTRA_JAR environment variable for spark-jobserver

2015-01-06 Thread bchazalet
I don't know much about spark-jobserver, but you can set jars programatically using the method setJars on SparkConf. Looking at your code it seems that you're importing classes from com.datastax.spark.connector._ to load data from cassandra, so you may need to add that datastax jar to your SparkCo
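A minimal sketch of that suggestion; the jar path is a placeholder for the assembled Cassandra connector jar.

import org.apache.spark.SparkConf

// Sketch: register dependency jars programmatically so they are shipped to executors.
val conf = new SparkConf()
  .setAppName("jobserver test demo")
  .set("spark.cassandra.connection.host", "127.0.0.1")
  .setJars(Seq("/path/to/spark-cassandra-connector-assembly.jar"))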

Set EXTRA_JAR environment variable for spark-jobserver

2015-01-06 Thread Sasi
We are trying to use spark-jobserver for one of our requirements. We referred to *https://github.com/fedragon/spark-jobserver-examples* and modified it a little to match our requirement, as below - /** ProductionRDDBuilder.scala ***/ package sparking package jobserver // Import required libraries.

Re: NoSuchMethodError: com.typesafe.config.Config.getDuration with akka-http/akka-stream

2015-01-06 Thread Christophe Billiard
Thanks Pankaj for the assembly plugin tip. Yes there is a version mismatch of akka actor between Spark 1.1.1 and akka-http/akka-stream (2.2.3 versus 2.3.x). After some digging, I see 4 options for this problem (in case others encounter it): 1) Upgrade to Spark 1.2.0, the same code will work (not

Re: Guava 11 dependency issue in Spark 1.2.0

2015-01-06 Thread Sean Owen
Oh, are you actually bundling Hadoop in your app? That may be the problem. If you're using stand-alone mode, why include Hadoop? In any event, Spark and Hadoop are intended to be 'provided' dependencies in the app you send to spark-submit. On Tue, Jan 6, 2015 at 10:15 AM, Niranda Perera wrote: >
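As a hedged illustration of that advice, an sbt dependency block marking Spark and Hadoop as provided so they stay out of the application jar handed to spark-submit (versions are examples only).

libraryDependencies ++= Seq(
  "org.apache.spark"  %% "spark-core"    % "1.2.0" % "provided",
  "org.apache.hadoop" %  "hadoop-client" % "1.2.1" % "provided"
)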

Re: Guava 11 dependency issue in Spark 1.2.0

2015-01-06 Thread Niranda Perera
Hi Sean, My mistake, Guava 11 dependency came from the hadoop-commons indeed. I'm running the following simple app in spark 1.2.0 standalone local cluster (2 workers) with Hadoop 1.2.1 public class AvroSparkTest { public static void main(String[] args) throws Exception { SparkConf sp
