Re: Errors accessing hdfs while in local mode

2014-07-16 Thread Akhil Das
You can try the following in the spark-shell:

1. Run it in *Cluster mode* by going inside the spark directory:

   $ SPARK_MASTER=spark://masterip:7077 ./bin/spark-shell

   val textFile = sc.textFile("hdfs://masterip/data/blah.csv")
   textFile.take(10).foreach(println)

2. Now try running in *Local mode*:

Re: jar changed on src filesystem

2014-07-16 Thread cmti95035
They're all the same version. Actually, even without the "--jars" parameter it got the same error. It looks like it needs to copy the assembly jar during staging in order to run the example jar anyway. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/jar-changed-

Re: can we insert and update with spark sql

2014-07-16 Thread Akhil Das
Is this what you are looking for? https://spark.apache.org/docs/1.0.0/api/java/org/apache/spark/sql/parquet/InsertIntoParquetTable.html According to the doc, it says "Operator that acts as a sink for queries on RDDs and can be used to store the output inside a directory of Parquet files. This ope
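For reference, a minimal sketch of the user-facing path to this operator on Spark 1.0 (saveAsParquetFile / insertInto on a SchemaRDD; the case class, data, and paths are made up for illustration):

import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

object ParquetSink {
  def main(args: Array[String]) {
    val sc = new SparkContext("local", "ParquetSink")
    val sqlContext = new SQLContext(sc)
    import sqlContext._

    // Write an RDD of case classes out as a directory of Parquet files.
    val people = sc.parallelize(Seq(Person("Alice", 30), Person("Bob", 25)))
    people.saveAsParquetFile("people.parquet")

    // Re-open it as a table and append more rows (insert, but no update).
    sqlContext.parquetFile("people.parquet").registerAsTable("people")
    val more = sc.parallelize(Seq(Person("Carol", 41)))
    more.insertInto("people")
  }
}

Note there is no UPDATE in Spark SQL 1.0; Parquet tables only support appending.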

Using RDD in RDD transformation

2014-07-16 Thread tbin
I implemented a simple KNN classifier. I can run it successfully on a single sample, but it throws an error when run on an RDD of test samples. I attach the source code in the attachment. Looking forward to your reply! Best wishes! The following is the source code. import math from pyspark im
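The usual cause of this error is referencing one RDD inside a transformation of another: nested RDD operations are not supported, in PySpark or otherwise. A common workaround is to collect and broadcast the smaller dataset. A rough sketch of the pattern (in Scala for brevity, with made-up data; the same idea applies in PySpark):

import org.apache.spark.SparkContext

object Knn {
  def dist(a: Array[Double], b: Array[Double]): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  def main(args: Array[String]) {
    val sc = new SparkContext("local", "Knn")
    val training = Seq((Array(0.0, 0.0), 0), (Array(1.0, 1.0), 1))
    val test = sc.parallelize(Seq(Array(0.2, 0.1), Array(0.9, 0.8)))

    // Don't do test.map(x => trainingRDD.map(...)): RDDs can't be nested.
    // Instead, ship a local copy of the training set to the workers.
    val broadcastTraining = sc.broadcast(training)

    val predictions = test.map { x =>
      // 1-NN: take the label of the closest training sample.
      broadcastTraining.value.minBy { case (features, _) => dist(x, features) }._2
    }
    predictions.collect().foreach(println)
  }
}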

can we insert and update with spark sql

2014-07-16 Thread Hu, Leo
Hi, As for Spark 1.0, can we insert and update a table with Spark SQL, and how? Thanks. Best Regards

Re: jar changed on src filesystem

2014-07-16 Thread Chester@work
Since you are running in yarn-cluster mode and you are supplying the spark assembly jar file, there is no need to install spark on each node. Is it possible the two spark jars have different versions? Chester Sent from my iPad On Jul 16, 2014, at 22:49, cmti95035 wrote: > Hi, > > I need some hel

jar changed on src filesystem

2014-07-16 Thread cmti95035
Hi, I need some help for running Spark over Yarn: I set up a cluster running HDP 2.0.6 with 6 nodes, and then installed the spark-1.0.1-bin-hadoop2 on each node. When running the SparkPi example with the following command: ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master

Re: Error when testing with large sparse svm

2014-07-16 Thread crater
I don't really know how to create a JIRA :( Specifically, the code I commented out is:

//val prediction = model.predict(test.map(_.features))
//val predictionAndLabel = prediction.zip(test.map(_.label))
//val prediction = model.predict(training.map(_.features))
//val predictionAndL

Re: Error: No space left on device

2014-07-16 Thread Chris DuBois
Hi Xiangrui, I will try this shortly. When using N partitions, do you recommend N be the number of cores on each slave or the number of cores on the master? Forgive my ignorance, but is this best achieved as an argument to sc.textFile? The slaves on the EC2 clusters start with only 8gb of storage
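For reference, the partition count can be passed directly to sc.textFile; the path here is a placeholder:

// Ask for at least 64 partitions when reading; a common rule of thumb
// is 2-4 partitions per CPU core in the cluster.
val ratings = sc.textFile("hdfs://masterip/data/ratings.csv", 64)

// Alternatively, repartition an already-loaded RDD:
val repartitioned = ratings.repartition(64)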

Re: Kryo deserialisation error

2014-07-16 Thread Hao Wang
I am not sure. Not every task fails at this Kryo exception. Most of the time, the cluster can successfully finish the WikipediaPageRank. How can I debug this exception? Thanks. Regards, Wang Hao(王灏) CloudTeam | School of Software Engineering Shanghai Jiao Tong University Address: 800 Dongchuan

RE: spark building error

2014-07-16 Thread Jack Yang
Hi all, Please ignore my question. I forgot to install unzip. Sorry for that. Jack From: Jack Yang [mailto:j...@uow.edu.au] Sent: Thursday, 17 July 2014 2:22 PM To: user@spark.apache.org Subject: spark building error Hi all, I got one problem when building spark. I am using maven 3.1.1, spark 1.0

spark building error

2014-07-16 Thread Jack Yang
Hi all, I got a problem when building spark. I am using maven 3.1.1, spark 1.0.1, scala 2.10.1, Hadoop 1.2.1 (OS: Ubuntu 12.04). I first downloaded the binary package and unzipped it into a directory called "/home/hduser/spark". Then I did the following: $ cd /home/hduser/spark $ export MAVEN_OPTS="-Xmx2

Re: Kmeans

2014-07-16 Thread Xiangrui Meng
kmeans.py contains a naive implementation of k-means in Python, serving as an example of how to use PySpark. Please use MLlib's implementation in practice. There is a JIRA for making this clear: https://issues.apache.org/jira/browse/SPARK-2434 -Xiangrui On Wed, Jul 16, 2014 at 8:16 PM, amin mohebbi
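For reference, a minimal sketch of the MLlib version (the file path is made up; space-separated numeric features assumed):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val data = sc.textFile("hdfs:///data/kmeans_data.txt")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
  .cache()

val model = KMeans.train(data, 2, 20)   // k = 2, maxIterations = 20
println("Within-set sum of squared errors: " + model.computeCost(data))
model.clusterCenters.foreach(println)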

Re: Release date for new pyspark

2014-07-16 Thread Michael Armbrust
You should try cleaning and then building. We have recently hit a bug in the scala compiler that sometimes causes non-clean builds to fail. On Wed, Jul 16, 2014 at 7:56 PM, Matei Zaharia wrote: > Yeah, we try to have a regular 3 month release cycle; see > https://cwiki.apache.org/confluence/di

Kmeans

2014-07-16 Thread amin mohebbi
Can anyone explain to me the difference between the k-means in MLlib and the k-means in examples/src/main/python/kmeans.py? Best Regards ... Amin Mohebbi PhD candidate in Software Engineering at University of Malaysia H/P : +60 18 2040 0

Re: MLLib - Regularized logistic regression in python

2014-07-16 Thread Yanbo Liang
AFAIK for question 2, there is no built-in method to account for that problem. Right now, we can only perform one type of regularization. However, the elastic net implementation is underway. You can refer to this JIRA for further discussion: https://issues.apache.org/jira/browse/SPARK-1543
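For anyone reading along: the Scala API does expose regularization through the optimizer, even though PySpark did not at the time. A rough sketch (trainingData: RDD[LabeledPoint] assumed to exist):

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.optimization.{L1Updater, SquaredL2Updater}

val lr = new LogisticRegressionWithSGD()
lr.optimizer
  .setNumIterations(100)
  .setRegParam(0.1)
  .setUpdater(new SquaredL2Updater)   // L2; use new L1Updater for L1
val model = lr.run(trainingData)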

Re: Spark Streaming timestamps

2014-07-16 Thread Tathagata Das
Answers inline. On Wed, Jul 16, 2014 at 5:39 PM, Bill Jay wrote: > Hi all, > > I am currently using Spark Streaming to conduct a real-time data > analytics. We receive data from Kafka. We want to generate output files > that contain results that are based on the data we receive from a specific

Re: Difference among batchDuration, windowDuration, slideDuration

2014-07-16 Thread Tathagata Das
I guess this is better explained in the streaming programming guide's window operation subsection. For completeness' sake, it's worth mentioning the following: window operations can be applied on other windowed-DS
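For reference, a small sketch of chained window operations (host, port, and durations are arbitrary):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("Windows")
val ssc = new StreamingContext(conf, Seconds(1))              // batch interval
val words = ssc.socketTextStream("localhost", 9999)
  .flatMap(_.split(" ")).map((_, 1))

// 30s windows, sliding every 10s (both multiples of the batch interval).
val counts = words.reduceByKeyAndWindow(_ + _, Seconds(30), Seconds(10))

// A window operation can be applied to an already-windowed stream too.
val rewindowed = counts.window(Seconds(60))
rewindowed.print()

ssc.start()
ssc.awaitTermination()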

Use Spark with HBase's HFileOutputFormat

2014-07-16 Thread Jianshi Huang
Hi, I want to use Spark with HBase and I'm confused about how to ingest my data using HBase's HFileOutputFormat. It recommends calling configureIncrementalLoad, which does the following:
- Inspects the table to configure a total order partitioner
- Uploads the partitions file to the cluster a

Re: using multiple dstreams together (spark streaming)

2014-07-16 Thread Tathagata Das
Have you taken a look at DStream.transformWith(...)? It allows you to apply an arbitrary transformation between RDDs (of the same timestamp) of two different streams. So you can do something like this:

2s-window-stream.transformWith(1s-window-stream,
  (rdd1: RDD[...], rdd2: RDD[...]) => { ... //
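Filled out, the pattern looks roughly like this (the stream construction is illustrative):

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("TransformWith")
val ssc = new StreamingContext(conf, Seconds(1))
val base = ssc.socketTextStream("localhost", 9999).map(w => (w, 1))

val oneSec = base.reduceByKeyAndWindow(_ + _, Seconds(1))
val twoSec = base.reduceByKeyAndWindow(_ + _, Seconds(2))

// RDDs with the same batch timestamp from both streams arrive together.
val delta = twoSec.transformWith(oneSec,
  (rdd1: RDD[(String, Int)], rdd2: RDD[(String, Int)]) =>
    rdd1.join(rdd2).mapValues { case (two, one) => two - one })
delta.print()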

spark-ec2 script with Tachyon

2014-07-16 Thread nit
Hi, It seems that the spark-ec2 script deploys the Tachyon module along with the other setup. I am trying to use .persist(OFF_HEAP) for RDD persistence, but on the worker I see this error: -- Failed to connect (2) to master localhost/127.0.0.1:19998 : java.net.ConnectException: Connection refused -- From netsta
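One thing worth checking: by default Spark 1.0 looks for Tachyon at tachyon://localhost:19998, which matches the localhost address in the error above. A sketch of pointing it at the real master (the address is a placeholder):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val conf = new SparkConf()
  .setAppName("OffHeapDemo")
  // Override the default tachyon://localhost:19998
  .set("spark.tachyonStore.url", "tachyon://<tachyon-master-ip>:19998")
val sc = new SparkContext(conf)

val rdd = sc.textFile("hdfs:///data/input").persist(StorageLevel.OFF_HEAP)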

Spark Streaming timestamps

2014-07-16 Thread Bill Jay
Hi all, I am currently using Spark Streaming to conduct real-time data analytics. We receive data from Kafka. We want to generate output files that contain results based on the data we receive in a specific time interval. I have several questions on Spark Streaming's timestamps: 1) I

Re: Memory & compute-intensive tasks

2014-07-16 Thread Liquan Pei
Hi Ravi, I have seen a similar issue before. You can try setting fs.hdfs.impl.disable.cache to true in your Hadoop configuration. For example, if your Hadoop configuration object is hadoopConf, you can use hadoopConf.setBoolean("fs.hdfs.impl.disable.cache", true). Let me know if that helps. Bes

Re: Memory & compute-intensive tasks

2014-07-16 Thread rpandya
Matei - I tried using coalesce(numNodes, true), but it then seemed to run too few SNAP tasks - only 2 or 3 when I had specified 46. The job failed, perhaps for unrelated reasons, with some odd exceptions in the log (at the end of this message). But I really don't want to force data movement between

Re: Release date for new pyspark

2014-07-16 Thread Matei Zaharia
Yeah, we try to have a regular 3 month release cycle; see https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage for the current window. Matei On Jul 16, 2014, at 4:21 PM, Mark Hamstra wrote: > You should expect master to compile and run: patches aren't merged unless > they build an

Re: Cassandra driver Spark question

2014-07-16 Thread RodrigoB
Thanks to both for the comments and the debugging suggestion, which I will try to use. Regarding your comment, yes, I do agree the current solution was not efficient, but for using the saveToCassandra method I need an RDD, hence the parallelize method. I finally got directed by Piotr to use the CassandraConnec

Re: Error: No space left on device

2014-07-16 Thread Xiangrui Meng
For ALS, I would recommend repartitioning the ratings to match the number of CPU cores, or even fewer. ALS is not computation heavy for small k but communication heavy, so having a small number of partitions may help. For EC2 clusters, we use /mnt/spark and /mnt2/spark as the default local directories becau
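For reference, a rough sketch of both suggestions (paths, counts, and ALS parameters are illustrative):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.recommendation.{ALS, Rating}

val conf = new SparkConf()
  .setAppName("ALSDemo")
  // Spill shuffle data to the big ephemeral disks, not the small root volume.
  .set("spark.local.dir", "/mnt/spark,/mnt2/spark")
val sc = new SparkContext(conf)

val ratings = sc.textFile("hdfs:///data/ratings.csv").map { line =>
  val Array(user, product, rating) = line.split(',')
  Rating(user.toInt, product.toInt, rating.toDouble)
}.repartition(16)   // roughly the number of CPU cores in the cluster

val model = ALS.train(ratings, 10, 20, 0.01)   // rank, iterations, lambda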

Re: Release date for new pyspark

2014-07-16 Thread Mark Hamstra
You should expect master to compile and run: patches aren't merged unless they build and pass tests on Jenkins. You shouldn't expect new features to be added to stable code in maintenance releases (e.g. 1.0.1). AFAIK, we're still on track with Spark 1.1.0 development, which means that it should b

Re: Possible bug in ClientBase.scala?

2014-07-16 Thread Sandy Ryza
Hi Ron, I just checked and this bug is fixed in recent releases of Spark. -Sandy On Sun, Jul 13, 2014 at 8:15 PM, Chester Chen wrote: > Ron, > Which distribution and Version of Hadoop are you using ? > > I just looked at CDH5 ( hadoop-mapreduce-client-core- > 2.3.0-cdh5.0.0), > > MR

Re: ClassNotFoundException: $line11.$read$ when loading an HDFS text file with SparkQL in spark-shell

2014-07-16 Thread Michael Armbrust
Hmmm, it could be some weirdness with classloaders / Mesos / Spark SQL? I'm curious whether you would hit an error if there were no lambda functions involved, perhaps if you load the data using jsonFile or parquetFile. Either way, I'd file a JIRA. Thanks! On Jul 16, 2014 6:48 PM, "Svend" wrote: >

Release date for new pyspark

2014-07-16 Thread Paul Wais
Dear List, The version of pyspark on master has a lot of nice new features, e.g. SequenceFile reading, pickle i/o, etc: https://github.com/apache/spark/blob/master/python/pyspark/context.py#L353 I downloaded the recent 1.0.1 release and was surprised to see the distribution did not include these

Re: ClassNotFoundException: $line11.$read$ when loading an HDFS text file with SparkQL in spark-shell

2014-07-16 Thread Svend
Hi Michael, Thanks for your reply. Yes, the reduce triggered the actual execution, I got a total length (totalLength: 95068762, for the record). -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/ClassNotFoundException-line11-read-when-loading-an-HDFS-text

Re: Spark Streaming Json file groupby function

2014-07-16 Thread Tathagata Das
I think I know what the problem is. Spark Streaming is constantly doing garbage cleanup, throwing away data that it no longer needs based on the operations in the DStream. Here the DStream operations are not aware of the Spark SQL queries happening asynchronously to Spark Streaming. So data is being

Re: Multiple streams at the same time

2014-07-16 Thread Tathagata Das
I hope it all works :) On Wed, Jul 16, 2014 at 9:08 AM, gorenuru wrote: > Hi and thank you for your reply. > > Looks like it's possible. It looks like a hack for me because we are > specifying batch duration when creating context. This means that if we will > specify batch duration to 10 second

Re: can't print DStream after reduce

2014-07-16 Thread Tathagata Das
Yeah. I have been wondering how to check this in the general case, across all deployment modes, but that's a hard problem. Last week I realized that even if we can do it just for local mode, we can get the biggest bang for the buck. TD On Tue, Jul 15, 2014 at 9:31 PM, Tobias Pfeiffer wrote: > Hi, > >

Re: Spark Streaming, external windowing?

2014-07-16 Thread Tathagata Das
One way that is currently possible is given here: http://mail-archives.apache.org/mod_mbox/spark-user/201407.mbox/%3CCAMwrk0=b38dewysliwyc6hmze8tty8innbw6ixatnd1ue2-...@mail.gmail.com%3E On Wed, Jul 16, 2014 at 1:16 AM, Gerard Maas wrote: > Hi Sargun, > > There have been few discussions o

SaveAsTextFile of RDD taking much time

2014-07-16 Thread sudiprc
Hi All, I am new to Spark. I have written a program to read data from a big local file, sort it using Spark SQL, and then filter it based on some validation rules. I tested this program with a file of 23860746 lines, and it took 39 secs (2 cores and Xmx at 6gb). But when I try serializing it to a local file, i

Re: Difference among batchDuration, windowDuration, slideDuration

2014-07-16 Thread Walrus theCat
I did not! On Wed, Jul 16, 2014 at 12:31 PM, aaronjosephs wrote: > The only other thing to keep in mind is that window duration and slide > duration have to be multiples of batch duration, IDK if you made that fully > clear > > > > -- > View this message in context: > http://apache-spark-user-l

Re: ClassNotFoundException: $line11.$read$ when loading an HDFS text file with SparkQL in spark-shell

2014-07-16 Thread Michael Armbrust
Oh, I'm sorry... reduce is also an operation On Wed, Jul 16, 2014 at 3:37 PM, Michael Armbrust wrote: > > Note that running a simple map+reduce job on the same hdfs files with the >> same installation works fine: >> > > Did you call collect() on the totalLength? Otherwise nothing has > ac

Re: ClassNotFoundException: $line11.$read$ when loading an HDFS text file with SparkQL in spark-shell

2014-07-16 Thread Michael Armbrust
> Note that running a simple map+reduce job on the same hdfs files with the > same installation works fine: > Did you call collect() on the totalLength? Otherwise nothing has actually executed.

Re: can't get jobs to run on cluster (enough memory and cpus are available on worker)

2014-07-16 Thread Matt Work Coarr
Thanks Marcelo, I'm not seeing anything in the logs that clearly explains what's causing this to break. One interesting point that we just discovered is that if we run the driver and the slave (worker) on the same host it runs, but if we run the driver on a separate host it does not run. Anyways,

Re: Difference among batchDuration, windowDuration, slideDuration

2014-07-16 Thread aaronjosephs
The only other thing to keep in mind is that window duration and slide duration have to be multiples of batch duration, IDK if you made that fully clear -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Difference-among-batchDuration-windowDuration-slideDurati

Re: Difference among batchDuration, windowDuration, slideDuration

2014-07-16 Thread Walrus theCat
Here's what I understand:

batchDuration: How often should the streaming context update? How many seconds of data should each DStream contain?

windowDuration: What size windows are you looking for from this DStream?

slideDuration: Once I've given you that slice, how many units forward do you
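Concretely, with made-up durations:

import org.apache.spark.streaming.{Seconds, StreamingContext}

// batchDuration: each micro-batch covers 5 seconds of data
val ssc = new StreamingContext(conf, Seconds(5))
val stream = ssc.socketTextStream("localhost", 9999)

// windowDuration = 30s of data per window,
// slideDuration = a new window every 10s (both multiples of 5s)
val windowed = stream.window(Seconds(30), Seconds(10))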

Re: SPARK_WORKER_PORT (standalone cluster)

2014-07-16 Thread jay vyas
Now I see the answer to this. Spark slaves start on random ports and tell the master where they are; then the master acknowledges them.

(worker logs) Starting Spark worker :43282
(master logs) Registering worker on :43282 with 8 cores, 16.5 GB RAM

Thus, the port is random because t
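For anyone who needs a fixed port anyway, the standalone scripts honor an environment variable; a sketch of setting it in conf/spark-env.sh on each worker (port numbers are arbitrary):

# conf/spark-env.sh
export SPARK_WORKER_PORT=43282        # pin the worker port instead of a random one
export SPARK_WORKER_WEBUI_PORT=8081   # worker web UI port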

Re: Repeated data item search with Spark SQL(1.0.1)

2014-07-16 Thread Michael Armbrust
Mostly true. The execution of two equivalent logical plans will be exactly the same, independent of the dialect. Resolution can be slightly different as SQLContext defaults to case sensitive and HiveContext defaults to case insensitive. One other very technical detail: The actual planning done by

Re: Retrieve dataset of Big Data Benchmark

2014-07-16 Thread Burak Yavuz
Hi Tom, Actually I was mistaken, sorry about that. Indeed, on the website the keys for the datasets you mention are not showing up. However, they are still accessible through the spark-shell, which means they are there. So, to answer your questions: - Are the tiny and 1node sets s
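If memory serves, the keys can be read straight from the public bucket along these lines (the exact bucket layout is taken from the benchmark page, so treat the path as unverified; credentials are placeholders):

// S3 credentials for the s3n:// filesystem
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "<your-access-key>")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "<your-secret-key>")

val rankings = sc.textFile("s3n://big-data-benchmark/pavlo/text/tiny/rankings")
rankings.take(5).foreach(println)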

Re: Kryo deserialisation error

2014-07-16 Thread Tathagata Das
Is the class that is not found in the wikipediapagerank jar? TD On Wed, Jul 16, 2014 at 12:32 AM, Hao Wang wrote: > Thanks for your reply. The SparkContext is configured as below: > > > sparkConf.setAppName("WikipediaPageRank") > > > sparkConf.set("spark.serializer", > "org.apache.spark.

RE: executor-cores vs. num-executors

2014-07-16 Thread Wei Tan
Thanks for sharing your experience. I have had the same experience -- multiple moderate JVMs beat a single huge JVM. Besides the minor JVM startup overhead, is it always better to have multiple JVMs rather than a single one? Best regards, Wei - Wei Tan, PhD Research

Difference among batchDuration, windowDuration, slideDuration

2014-07-16 Thread hsy...@gmail.com
While reading the Spark Streaming API, I'm confused by the 3 different durations in StreamingContext(conf: SparkConf, batchDuration: Duration

Re: Repeated data item search with Spark SQL(1.0.1)

2014-07-16 Thread Jerry Lam
Hi Michael, Thank you for the explanation. Can you confirm whether the following statement is true/incomplete/false: "hql uses Hive to parse and construct the logical plan, whereas sql is a pure Spark implementation of parsing and logical plan construction. Once Spark obtains the logical plan, it is exec

Re: MLLib - Regularized logistic regression in python

2014-07-16 Thread fjeg
1) Okay, to clarify, there is *no* way to regularize logistic regression in Python (sorry if I'm repeating your answer). 2) The method you described will have overflow errors when abs(margin) > 750. Is there a built-in method to account for this? Otherwise, I will probably have to implement some

Re: Need help on spark Hbase

2014-07-16 Thread Jerry Lam
Hi Rajesh, I saw: Warning: Local jar /home/rajesh/hbase-0.96.1.1-hadoop2/lib/hbase-client-0.96.1.1-hadoop2.jar does not exist, skipping. in your log. I believe this jar contains the HBaseConfiguration. I'm not sure what went wrong in your case, but can you try without spaces in --jars, i.e. --j
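In other words, --jars takes a single comma-separated list with no whitespace; something like the following (the class and application names are placeholders):

$ ./bin/spark-submit \
    --class com.example.MyHBaseApp \
    --master spark://masterip:7077 \
    --jars /home/rajesh/hbase-0.96.1.1-hadoop2/lib/hbase-client-0.96.1.1-hadoop2.jar,/home/rajesh/hbase-0.96.1.1-hadoop2/lib/hbase-common-0.96.1.1-hadoop2.jar \
    myapp.jar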

Re: spark and mesos issue

2014-07-16 Thread Dario Rexin
Hi *, I already looked into this issue and created a PR that hopefully fixes the problem. Unfortunately I have not been able to reproduce the bug, but I could track down a possible cause. See the PR for an explanation: https://github.com/apache/spark/pull/1358 If anyone who these experi

Re: running Spark App on Yarn produces: Exception in thread "main" java.lang.NoSuchFieldException: DEFAULT_YARN_APPLICATION_CLASSPATH

2014-07-16 Thread Andrew Milkowski
For others, to solve this topic's problem, add the following to yarn-site.xml:

<property>
  <name>yarn.application.classpath</name>
  <value>
    $HADOOP_CONF_DIR,
    $HADOOP_COMMON_HOME/share/hadoop/common/*,
    $HADOOP_COMMON_HOME/share/hadoop/common/lib/*,
    $HADOOP_HDFS_HOME/share/hadoop/hdfs/*,
    $HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*
  </value>
</property>

Re: running Spark App on Yarn produces: Exception in thread "main" java.lang.NoSuchFieldException: DEFAULT_YARN_APPLICATION_CLASSPATH

2014-07-16 Thread Andrew Milkowski
Sandy, perfect! You saved me tons of time! Added this in yarn-site.xml and the job ran to completion. Can you do me (us) a favor and push the newest, patched spark/hadoop to the cdh5 tarballs if possible. Thanks again for this (huge time saver). On Wed, Jul 16, 2014 at 1:10 PM, Sandy Ryza wrote: > Andrew,

Re: using multiple dstreams together (spark streaming)

2014-07-16 Thread Walrus theCat
hey at least it's something (thanks!) ... not sure what i'm going to do if i can't find a solution (other than not use spark) as i really need these capabilities. anyone got anything else? On Wed, Jul 16, 2014 at 10:34 AM, Luis Ángel Vicente Sánchez < langel.gro...@gmail.com> wrote: > hum... ma

Re: using multiple dstreams together (spark streaming)

2014-07-16 Thread Walrus theCat
Or, if not, is there a way to do this in terms of a single dstream? Keep in mind that dstream1, dstream2, and dstream3 have already had transformations applied. I tried creating the dstreams by calling .window on the first one, but that ends up with me having ... 3 dstreams... which is the same p

Re: using multiple dstreams together (spark streaming)

2014-07-16 Thread Luis Ángel Vicente Sánchez
hum... maybe consuming all streams at the same time with an actor that would act as a new DStream source... but this is just a random idea... I don't really know if that would be a good idea or even possible. 2014-07-16 18:30 GMT+01:00 Walrus theCat : > Yeah -- I tried the .union operation and i

ClassNotFoundException: $line11.$read$ when loading an HDFS text file with SparkQL in spark-shell

2014-07-16 Thread Svend
Hi all, I just installed a mesos 0.19 cluster. I am failing to execute basic SparkQL operations on text files with Spark 1.0.1 with the spark-shell. I have one Mesos master without zookeeper and 4 mesos slaves. All nodes are running JDK 1.7.51 and Scala 2.10.4. The spark package is uploade

Re: using multiple dstreams together (spark streaming)

2014-07-16 Thread Walrus theCat
Yeah -- I tried the .union operation and it didn't work for that reason. Surely there has to be a way to do this, as I imagine this is a commonly desired goal in streaming applications? On Wed, Jul 16, 2014 at 10:10 AM, Luis Ángel Vicente Sánchez < langel.gro...@gmail.com> wrote: > I'm joining s

Re: Number of executors change during job running

2014-07-16 Thread Bill Jay
Hi Tathagata, I have tried the repartition method. The reduce stage first had 2 executors and then it had around 85 executors. I specified repartition(300), and each of the executors was given 2 cores when I submitted the job. This shows repartition works to increase the number of executors. However, t

Re: running Spark App on Yarn produces: Exception in thread "main" java.lang.NoSuchFieldException: DEFAULT_YARN_APPLICATION_CLASSPATH

2014-07-16 Thread Andrew Milkowski
thanks Sandy, no CM-managed cluster, straight from the cloudera tar (http://archive.cloudera.com/cdh5/cdh/5/hadoop-2.3.0-cdh5.0.3.tar.gz) trying your suggestion immediately! thanks so much for taking the time.. On Wed, Jul 16, 2014 at 1:10 PM, Sandy Ryza wrote: > Andrew, > > Are you running on a CM-ma

Re: running Spark App on Yarn produces: Exception in thread "main" java.lang.NoSuchFieldException: DEFAULT_YARN_APPLICATION_CLASSPATH

2014-07-16 Thread Sean Owen
OK, if you're sure your binary has Hadoop 2 and/or your classpath has Hadoop 2, that's not it. I'd look at Sandy's suggestion then. On Wed, Jul 16, 2014 at 6:11 PM, Andrew Milkowski wrote: > thanks Sean! so what I did is in project/SparkBuild.scala I made it compile > with 2.3.0-cdh5.0.3 (and I

Re: running Spark App on Yarn produces: Exception in thread "main" java.lang.NoSuchFieldException: DEFAULT_YARN_APPLICATION_CLASSPATH

2014-07-16 Thread Andrew Milkowski
thanks Sean! so what I did is in project/SparkBuild.scala I made it compile with 2.3.0-cdh5.0.3 (and I even did sbt clean before sbt/sbt assembly; this should have built the example client with 2.3.0):

object SparkBuild extends Build {
  // Hadoop version to build against. For example, "1.0.4" for A

Re: using multiple dstreams together (spark streaming)

2014-07-16 Thread Luis Ángel Vicente Sánchez
I'm joining several kafka dstreams using the join operation, but you have the limitation that the duration of the batches has to be the same, i.e. a 1 second window for all dstreams... so it would not work for you. 2014-07-16 18:08 GMT+01:00 Walrus theCat : > Hi, > > My application has multiple dstreams o

Re: running Spark App on Yarn produces: Exception in thread "main" java.lang.NoSuchFieldException: DEFAULT_YARN_APPLICATION_CLASSPATH

2014-07-16 Thread Sandy Ryza
Andrew, Are you running on a CM-managed cluster? I just checked, and there is a bug here (fixed in 1.0), but it's avoided by having yarn.application.classpath defined in your yarn-site.xml. -Sandy On Wed, Jul 16, 2014 at 10:02 AM, Sean Owen wrote: > Somewhere in here, you are not actually ru

using multiple dstreams together (spark streaming)

2014-07-16 Thread Walrus theCat
Hi, My application has multiple dstreams on the same inputstream:

dstream1 // 1 second window
dstream2 // 2 second window
dstream3 // 5 minute window

I want to write logic that deals with all three windows (e.g. when the 1 second window differs from the 2 second window by some delta ...). I've

Re: running Spark App on Yarn produces: Exception in thread "main" java.lang.NoSuchFieldException: DEFAULT_YARN_APPLICATION_CLASSPATH

2014-07-16 Thread Sean Owen
Somewhere in here, you are not actually running vs Hadoop 2 binaries. Your cluster is certainly Hadoop 2, but your client is not using the Hadoop libs you think it is (or your compiled binary is linking against Hadoop 1, which is the default for Spark -- did you change it?) On Wed, Jul 16, 2014 at

Re: Gradient Boosting Decision Trees

2014-07-16 Thread Pedro Silva
Hi Ameet, that's great news! Thanks, Pedro On Wed, Jul 16, 2014 at 9:33 AM, Ameet Talwalkar wrote: > Hi Pedro, > > Yes, although they will probably not be included in the next release > (since the code freeze is ~2 weeks away), GBM (and other ensembles of > decision trees) are currently under

Re: Spark Streaming Json file groupby function

2014-07-16 Thread Yin Huai
Hi Srinivas, It seems the query you used is val results = sqlContext.sql("select type from table1"). However, table1 does not have a field called type. The schema of table1 is defined by the class definition of your case class Record (i.e. ID, name, score, and school are the fields of your table1). Can yo
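To make the mismatch concrete, a minimal sketch (the data values are invented):

import org.apache.spark.sql.SQLContext

case class Record(ID: String, name: String, score: Double, school: String)

val sqlContext = new SQLContext(sc)
import sqlContext._

val rdd = sc.parallelize(Seq(Record("1", "alice", 3.5, "mit")))
rdd.registerAsTable("table1")

// OK: name and score are fields of Record.
sqlContext.sql("SELECT name, score FROM table1").collect().foreach(println)
// Fails: 'type' is not a field of Record.
// sqlContext.sql("SELECT type FROM table1")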

running Spark App on Yarn produces: Exception in thread "main" java.lang.NoSuchFieldException: DEFAULT_YARN_APPLICATION_CLASSPATH

2014-07-16 Thread Andrew Milkowski
Hello community, I tried to run a spark app on yarn, using the cloudera hadoop and spark distros (from http://archive.cloudera.com/cdh5/cdh/5) hadoop version: hadoop-2.3.0-cdh5.0.3.tar.gz spark version: spark-0.9.0-cdh5.0.3.tar.gz DEFAULT_YARN_APPLICATION_CLASSPATH is part of the hadoop-api-yarn jar ... tha

Re: Gradient Boosting Decision Trees

2014-07-16 Thread Ameet Talwalkar
Hi Pedro, Yes, although they will probably not be included in the next release (since the code freeze is ~2 weeks away), GBM (and other ensembles of decision trees) are currently under active development. We're hoping they'll make it into the subsequent release. -Ameet On Wed, Jul 16, 2014 at

Re: Terminal freeze during SVM

2014-07-16 Thread AlexanderRiggers
So I need to reconfigure my SparkContext this way:

val conf = new SparkConf()
  .setMaster("local")
  .setAppName("CountingSheep")
  .set("spark.executor.memory", "1g")
  .set("spark.akka.frameSize", "20")
val sc = new SparkContext(conf)

And start a new

Re: Trouble with spark-ec2 script: --ebs-vol-size

2014-07-16 Thread Chris DuBois
Hi Ben, It worked for me, but only when using the default region. Using --region=us-west-2 resulted in errors about security groups. Chris On Wed, Jul 16, 2014 at 8:53 AM, Ben Horner wrote: > please add > > From: "Ben Horner [via Apache Spark User List]" <[hidden email] >

Errors accessing hdfs while in local mode

2014-07-16 Thread Chris DuBois
Hi all, When I try setMaster("local"), I get FileNotFound exceptions; without using setMaster my application is able to properly find my datasets at hdfs://[masterip]/data/blah.csv. Is there some other setting that I need to change in order to try running in local mode? I am running from the ec2

Re: Retrieve dataset of Big Data Benchmark

2014-07-16 Thread Tom
Hi Burak, Thank you for your pointer, it is really helping out. I do have some follow-up questions though. After looking at the Big Data Benchmark page (section "Run this benchmark yourself"), I was expecting the following combination of files: Sets

Gradient Boosting Decision Trees

2014-07-16 Thread Pedro Silva
Hi there, I am looking for a GBM MLlib implementation. Does anyone know if there is a plan to roll it out soon? Thanks! Pedro

Re: Multiple streams at the same time

2014-07-16 Thread gorenuru
Hi and thank you for your reply. Looks like it's possible. It looks like a hack to me, because we are specifying the batch duration when creating the context. This means that if we specify a batch duration of 10 seconds, our time windows should be at least 10 seconds long or we will not get results in

Re: Trouble with spark-ec2 script: --ebs-vol-size

2014-07-16 Thread Ben Horner
please add

From: "Ben Horner [via Apache Spark User List]"
Date: Wednesday, July 16, 2014 at 8:47 AM
To: Ben Horner
Subject: Re: Trouble with spark-ec2 script: --ebs-vol-size

Should I take it from the lack of replies

Re: Trouble with spark-ec2 script: --ebs-vol-size

2014-07-16 Thread Ben Horner
Should I take it from the lack of replies that the --ebs-vol-size feature doesn't work? -Ben -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Trouble-with-spark-ec2-script-ebs-vol-size-tp9619p9934.html Sent from the Apache Spark User List mailing list archiv

Re: Spark Streaming Json file groupby function

2014-07-16 Thread srinivas
Hi TD, I defined the case class outside the main method and was able to compile the code successfully. But I'm getting a runtime error when trying to process a json file from kafka. Here is the code I ran:

import java.util.Properties
import kafka.producer._
import org.apache.spark.stre

Re: Error: No space left on device

2014-07-16 Thread Chris DuBois
Hi Xiangrui, I accidentally did not send df -i for the master node. Here it is at the moment of failure:

Filesystem      Inodes   IUsed    IFree    IUse%  Mounted on
/dev/xvda1      524288   280938   243350   54%    /
tmpfs           3845409  1        3845408  1%     /dev/shm
/dev/xvdb

Re: Simple record matching using Spark SQL

2014-07-16 Thread Sarath Chandra
Hi Michael, Tried it. It's correctly printing the line counts of both the files. Here's what I tried -

Code:

package test
object Test4 {
  case class Test(fld1: String,
                  fld2: String,
                  fld3: String,
                  fld4: String,
                  fld5: String,
                  fld6: Double,
                  fld7: String);

Re: Simple record matching using Spark SQL

2014-07-16 Thread Michael Armbrust
What if you just run something like:

sc.textFile("hdfs://localhost:54310/user/hduser/file1.csv").count()

On Wed, Jul 16, 2014 at 10:37 AM, Sarath Chandra < sarathchandra.jos...@algofusiontech.com> wrote: > Yes Soumya, I did it. > > First I tried with the example available in the documentation

Re: Simple record matching using Spark SQL

2014-07-16 Thread Sarath Chandra
Yes Soumya, I did it. First I tried with the example available in the documentation (the example using the people table and finding teenagers). After successfully running it, I moved on to this one, which is the starting point for a bigger requirement for which I'm evaluating Spark SQL. On Wed, Jul 16, 2014 a

Re: Ambiguous references to id : what does it mean ?

2014-07-16 Thread Michael Armbrust
Yes, but if both tagCollection and selectedVideos have a column named "id" then Spark SQL does not know which one you are referring to in the where clause. Here's an example with aliases:

val x = testData2.as('x)
val y = testData2.as('y)
val join = x.join(y, Inner, Some("x.a".attr ===

Re: Read all the columns from a file in spark sql

2014-07-16 Thread Michael Armbrust
I think what you might be looking for is the ability to programmatically specify the schema, which is coming in 1.1. Here's the JIRA: SPARK-2179 On Wed, Jul 16, 2014 at 8:24 AM, pandees waran wrote: > Hi, > > I am newbie to spark sql and i wou
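For the curious, the API as it eventually landed in 1.1 looks roughly like this (the field names and file are invented):

import org.apache.spark.sql._

val sqlContext = new SQLContext(sc)

// Build the schema at runtime instead of deriving it from a case class.
val schema = StructType(
  "name age".split(" ").map(field => StructField(field, StringType, nullable = true)))

val rowRDD = sc.textFile("people.txt").map(_.split(",")).map(p => Row(p(0), p(1).trim))
val people = sqlContext.applySchema(rowRDD, schema)
people.registerTempTable("people")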

Re: Simple record matching using Spark SQL

2014-07-16 Thread Soumya Simanta
Can you try submitting a very simple job to the cluster? > On Jul 16, 2014, at 10:25 AM, Sarath Chandra > wrote: > > Yes it is appearing on the Spark UI, and remains there with state as > "RUNNING" till I press Ctrl+C in the terminal to kill the execution. > > Barring the statements to cre

Re: Simple record matching using Spark SQL

2014-07-16 Thread Sarath Chandra
Yes it is appearing on the Spark UI, and remains there with state "RUNNING" till I press Ctrl+C in the terminal to kill the execution. Barring the statements to create the spark context, if I copy-paste the lines of my code into the spark shell, it runs perfectly, giving the desired output. ~Sarath On

Re: Simple record matching using Spark SQL

2014-07-16 Thread Soumya Simanta
When you submit your job, it should appear on the Spark UI. Same with the REPL. Make sure your job is submitted to the cluster properly. On Wed, Jul 16, 2014 at 10:08 AM, Sarath Chandra < sarathchandra.jos...@algofusiontech.com> wrote: > Hi Soumya, > > Data is very small, 500+ lines in each file.

Re: Simple record matching using Spark SQL

2014-07-16 Thread Sarath Chandra
Hi Soumya, Data is very small, 500+ lines in each file. I removed the last 2 lines and placed this at the end: "matched.collect().foreach(println);". Still no luck. It's been more than 5 min and the execution is still running. Checked logs; nothing in stdout. In stderr I don't see anything going wrong, all

Re: Re: how to construct a ClassTag object as a method parameter in Java

2014-07-16 Thread balvisio
Hi, I think the same issue is happening with the constructor of the PartitionPruningRDD class. It hasn't been fixed in version 1.0.1. Should this be reported to JIRA? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/how-to-construct-a-ClassTag-object-as-a-method
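For anyone hitting this from Java or Scala, there is a companion-object factory that sidesteps the ClassTag-taking constructor (the RDD contents here are arbitrary):

import org.apache.spark.rdd.PartitionPruningRDD

val bigRdd = sc.parallelize(1 to 1000, 100)
// Keep only the first 10 partitions; the rest are never evaluated.
val pruned = PartitionPruningRDD.create(bigRdd, partitionIndex => partitionIndex < 10)
println(pruned.partitions.length)   // 10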

Re: Simple record matching using Spark SQL

2014-07-16 Thread Soumya Simanta
Check your executor logs for the output or if your data is not big collect it in the driver and print it. > On Jul 16, 2014, at 9:21 AM, Sarath Chandra > wrote: > > Hi All, > > I'm trying to do a simple record matching between 2 files and wrote following > code - > > import org.apache.sp

Simple record matching using Spark SQL

2014-07-16 Thread Sarath Chandra
Hi All, I'm trying to do a simple record matching between 2 files and wrote following code - *import org.apache.spark.sql.SQLContext;* *import org.apache.spark.rdd.RDD* *object SqlTest {* * case class Test(fld1:String, fld2:String, fld3:String, fld4:String, fld4:String, fld5:Double, fld6:String)

Problem running Spark shell (1.0.0) on EMR

2014-07-16 Thread Ian Wilkinson
Hi, I'm trying to run the Spark (1.0.0) shell on EMR and encountering a classpath issue. I suspect I'm missing something gloriously obvious, but so far it is eluding me. I launch the EMR cluster (using the aws cli) with:

aws emr create-cluster --name "Test Cluster" \
  --ami-version 3

Re: count vs countByValue in for/yield

2014-07-16 Thread Ognen Duzlevski
Hello all, Can anyone offer any insight on the below? Both are "legal" Spark, but the first one works while the latter does not. They both work on a local machine, but in a standalone cluster the one with countByValue fails. Thanks! Ognen On 7/15/14, 2:23 PM, Ognen Duzlevski wrote: Hello, I

Re: Can Spark stack scale to petabyte scale without performance degradation?

2014-07-16 Thread Rohit Pujari
Thanks Matei. On Tue, Jul 15, 2014 at 11:47 PM, Matei Zaharia wrote: > Yup, as mentioned in the FAQ, we are aware of multiple deployments running > jobs on over 1000 nodes. Some of our proof of concepts involved people > running a 2000-node job on EC2. > > I wouldn't confuse buzz with FUD :). >

Read all the columns from a file in spark sql

2014-07-16 Thread pandees waran
Hi, I am a newbie to spark sql and I would like to know how to read all the columns from a file in spark sql. I have referred to the programming guide here: http://people.apache.org/~tdas/spark-1.0-docs/sql-programming-guide.html The example says: val people = sc.textFile("examples/src/main/re

Re: Server IPC version 7 cannot communicate with client version 4 with Spark Streaming 1.0.0 in Java and CH4 quickstart in local mode

2014-07-16 Thread Sean Owen
"Server IPC version 7 cannot communicate with client version 4" means your client is Hadoop 1.x and your cluster is Hadoop 2.x. The default Spark distribution is built for Hadoop 1.x. You would have to make your own build (or, use the artifacts distributed for CDH4.6 maybe? they are certainly built

Re: Reading file header in Spark

2014-07-16 Thread Silvina Caíno Lores
Thank you! This is what I needed. I've read that the first() method should work as well. It's a pity that the taken element cannot be removed from the RDD, though. Thanks again! On 16 July 2014 12:09, Sean Owen wrote: > You can rdd.take(1) to get just the header line. > > I think someone menti
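A common workaround is to filter the header back out (the file name is a placeholder):

val rdd = sc.textFile("data.csv")
val header = rdd.first()              // peek at the header line
val data = rdd.filter(_ != header)    // RDDs are immutable, so filter instead of remove

// Or drop only the first line of the first partition, which also
// handles data rows that happen to equal the header:
val data2 = rdd.mapPartitionsWithIndex { (i, iter) =>
  if (i == 0) iter.drop(1) else iter
}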
