Unable to run spark-shell after build

2015-02-03 Thread Jaonary Rabarisoa
Hi all, I'm trying to run the master version of Spark in order to test some alpha components in the ml package. I followed the Spark build documentation and built it with: $ mvn clean package. The build is successful but when I try to run spark-shell I get the following error: *Exception in thr

connecting spark with ActiveMQ

2015-02-03 Thread Mohit Durgapal
Hi All, I have a requirement where I need to consume messages from ActiveMQ and do live stream processing as well as batch processing using Spark. Is there a spark-plugin or library that can enable this? If not, then do you know any other way this could be done? Regards Mohit

Pig loader in Spark

2015-02-03 Thread Jianshi Huang
Hi, Has anyone implemented the default Pig Loader in Spark? (loading delimited text files with .pig_schema) Thanks, -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/

Setting maxPrintString in Spark Repl to view SQL query plans

2015-02-03 Thread Mick Davies
Hi, I want to increase the maxPrintString in the Spark repl to look at SQL query plans, as they are truncated by default at 800 chars, but don't know how to set this. You don't seem to be able to do it in the same way as you would with the Scala repl. Anyone know how to set this? Also anyone kno

LeaseExpiredException while writing schemardd to hdfs

2015-02-03 Thread Hafiz Mujadid
I want to write a whole SchemaRDD to a single file in HDFS but am facing the following exception: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /test/data/data1.csv (inode 402042): File does not exist. Holder DFSClient_NONMAPREDUCE_-564238432_57 doe

Re: Spark streaming - tracking/deleting processed files

2015-02-03 Thread Prannoy
Hi, To process the older files as well, you can use fileStream instead of textFileStream. It has a parameter that tells it to also look for files already present. For deleting the processed files, one way is to get the list of all files in the dStream. This can be done by using the foreachRDD api of th
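
A minimal sketch of that approach, assuming an existing StreamingContext ssc and a placeholder HDFS directory (fileStream's third argument controls whether pre-existing files are picked up):

    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    // newFilesOnly = false also processes files already present in the directory
    val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](
      "hdfs:///incoming", (p: Path) => true, newFilesOnly = false
    ).map(_._2.toString)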

Re: 2GB limit for partitions?

2015-02-03 Thread Michael Albert
Greetings! Thanks for the response. Below is an example of the exception I saw. I'd rather not post code at the moment, so I realize it is completely unreasonable to ask for a diagnosis. However, I will say that adding a "partitionBy()" was the last change before this error appeared. Thanks fo

Re: Spark Shell Timeouts

2015-02-03 Thread amoners
I am not sure whether this will help you. In my situation, I could not see any input in the terminal after some work got done via spark-shell; I ran the command "stty echo" and it fixed it. Best, Amoners -- View this message in context: http://apache-spark-user-list.1001560.n3.nabb

Re: How to define a file filter for file name patterns in Apache Spark Streaming in Java?

2015-02-03 Thread Emre Sevinc
Hello Akhil, Thank you for taking the time to write a detailed answer. I managed to solve it in a very similar manner. Kind regards, Emre Sevinç On Mon, Feb 2, 2015 at 8:22 PM, Akhil Das wrote: > Hi Emre, > > This is how you do that in scala: > > val lines = ssc.fileStream[LongWritable, Text, > T
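
A sketch of what such a name-based filter can look like (ssc, the directory, and the .json suffix are placeholders):

    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    // Only accept files whose names end in .json
    val jsonOnly = (path: Path) => path.getName.endsWith(".json")
    val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](
      "hdfs:///input", jsonOnly, newFilesOnly = true
    ).map(_._2.toString)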

Re: Unable to run spark-shell after build

2015-02-03 Thread Sean Owen
Yes, I see this too. I think the Jetty shading still needs a tweak. It's not finding the servlet API classes. Let's converge on SPARK-5557 to discuss. On Tue, Feb 3, 2015 at 2:04 AM, Jaonary Rabarisoa wrote: > Hi all, > > I'm trying to run the master version of spark in order to test some alpha >

Supported Notebooks (and other viz tools) for Spark 0.9.1?

2015-02-03 Thread Adamantios Corais
Hi, I am using Spark 0.9.1 and I am looking for a proper viz tool that supports that specific version. As far as I have seen, all relevant tools (e.g. spark-notebook, zeppelin-project etc) only support 1.1 or 1.2; no mention of older versions of Spark. Any ideas or suggestions? *// Adamantio

Writing RDD to a csv file

2015-02-03 Thread kundan kumar
I have an RDD of type org.apache.spark.rdd.RDD[(String, (Array[String], Option[Array[String]]))] I want to write it as a csv file. Please suggest how this can be done. myrdd.map(line => (line._1 + "," + line._2._1.mkString(",") + "," + line._2._2.mkString(','))).saveAsTextFile("hdfs://.

Re: Writing RDD to a csv file

2015-02-03 Thread Gerard Maas
This is more of a Scala question, so next time you may want to address a Scala forum, e.g. http://stackoverflow.com/questions/tagged/scala val optArrStr:Option[Array[String]] = ??? optArrStr.map(arr => arr.mkString(",")).getOrElse("") // empty string or whatever default value you have for th
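
Putting it together for the RDD shape in the question, a sketch (myrdd and the output path are placeholders):

    // myrdd: RDD[(String, (Array[String], Option[Array[String]]))]
    val csvLines = myrdd.map { case (key, (arr, optArr)) =>
      val tail = optArr.map(_.mkString(",")).getOrElse("")  // empty string when None
      (key +: arr :+ tail).mkString(",")
    }
    csvLines.saveAsTextFile("hdfs:///out/data-csv")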

Re: Writing RDD to a csv file

2015-02-03 Thread kundan kumar
Thanks Gerard !! This is working. On Tue, Feb 3, 2015 at 6:44 PM, Gerard Maas wrote: > this is more of a scala question, so probably next time you'd like to > address a Scala forum eg. http://stackoverflow.com/questions/tagged/scala > > val optArrStr:Option[Array[String]] = ??? > optArrStr.map(

Re: Supported Notebooks (and other viz tools) for Spark 0.9.1?

2015-02-03 Thread andy petrella
Hello Adamantios, Thanks for the poke and the interest. Actually, you're the second person asking about backporting it. Yesterday (late), I created a branch for it... and the simple local spark test worked! \o/. However, it'll be the 'old' UI :-/. Since I didn't port the code using play 2.2.6 to the ne

Re: Spark Shell Timeouts

2015-02-03 Thread Michael Albert
You might also try "stty sane". From: amoners I am not sure that this way can help you. There is my situation that I can not see any input in terminal after some work gets done via spark-shell, I used to run a command "stty echo", and it fixed.

Spark Master Build Failing to run on cluster in standalone ClassNotFoundException: javax.servlet.FilterRegistration

2015-02-03 Thread Night Wolf
Hi, I just built Spark 1.3 master using Maven via make-distribution.sh: ./make-distribution.sh --name mapr3 --skip-java-test --tgz -Pmapr3 -Phive -Phive-thriftserver -Phive-0.12.0 When trying to start the standalone spark master on a cluster I get the following stack trace; 15/02/04 08:53:56 I

Re: Spark Master Build Failing to run on cluster in standalone ClassNotFoundException: javax.servlet.FilterRegistration

2015-02-03 Thread Sean Owen
Already come up several times today: https://issues.apache.org/jira/browse/SPARK-5557 On Tue, Feb 3, 2015 at 8:04 AM, Night Wolf wrote: > Hi, > > I just built Spark 1.3 master using maven via make-distribution.sh; > > ./make-distribution.sh --name mapr3 --skip-java-test --tgz -Pmapr3 -Phive > -Ph

Re: GraphX: ShortestPaths does not terminate on a grid graph

2015-02-03 Thread Jay Hutfles
I think this is a separate issue with how the EdgeRDDImpl partitions edges. If you can merge this change in and rebuild, it should work: https://github.com/apache/spark/pull/4136/files If you can't, I just called the Graph.partitionBy() method right after constructing my graph but before perfo
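
A sketch of that workaround (vertices, edges, and landmarks are assumed to exist already):

    import org.apache.spark.graphx.{Graph, PartitionStrategy}
    import org.apache.spark.graphx.lib.ShortestPaths

    // Repartition the edges right after construction, before running the algorithm
    val g = Graph(vertices, edges).partitionBy(PartitionStrategy.EdgePartition2D)
    val result = ShortestPaths.run(g, landmarks)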

Is LogisticRegressionWithSGD in MLlib scalable?

2015-02-03 Thread Peng Zhang
Hi Everyone, Is LogisticRegressionWithSGD in MLlib scalable? If so, what is the idea behind the scalable implementation? Thanks in advance, Peng - Peng Zhang -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Is-LogisticRegressionWithSGD-in-MLlib-sca

RE: Spark sql failed in yarn-cluster mode when connecting to non-default hive database

2015-02-03 Thread Andrew Lee
Hi All, In Spark 1.2.0-rc1, I have tried to set hive.metastore.warehouse.dir to share the Hive warehouse location on HDFS; however, it does NOT work in yarn-cluster mode. In the Namenode audit log, I see that Spark is trying to access the default hive warehouse location, which is /user/

Re: how to send JavaDStream RDD using foreachRDD using Java

2015-02-03 Thread sachin Singh
Hi all, the issue has been resolved. I used rdd.foreachRDD(new Function<JavaRDD<String>, Void>() { @Override public Void call(JavaRDD<String> rdd) throws Exception { if (rdd != null) { List<String> result = rdd.col

RE: Fail to launch spark-shell on windows 2008 R2

2015-02-03 Thread Wang, Ningjun (LNG-NPV)
Hi Gen Thanks for your feedback. We do have a business reason to run spark on windows. We have an existing application that is built on C# .NET running on windows. We are considering adding spark to the application for parallel processing of large data. We want spark to run on windows so it int

ephemeral-hdfs vs persistent-hdfs - performance

2015-02-03 Thread Joe Wass
I want to process about 800 GB of data on an Amazon EC2 cluster. So, I need to store the input in HDFS somehow. I currently have a cluster of 5 x m3.xlarge, each of which has 80GB disk. Each HDFS node reports 73 GB, and the total capacity is ~370 GB. If I want to process 800 GB of data (assuming
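
As a rough check, assuming the default HDFS replication factor of 3: 5 nodes × 73 GB ≈ 365 GB of raw capacity holds only ~120 GB of replicated data, while 800 GB of input would need roughly 2.4 TB raw.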

Re: ephemeral-hdfs vs persistent-hdfs - performance

2015-02-03 Thread David Rosenstrauch
You could also just push the data to Amazon S3, which would un-link the size of the cluster needed to process the data from the size of the data. DR On 02/03/2015 11:43 AM, Joe Wass wrote: I want to process about 800 GB of data on an Amazon EC2 cluster. So, I need to store the input in HDFS so

Spark (SQL) as OLAP engine

2015-02-03 Thread Adamantios Corais
Hi, After some research I have decided that Spark (SQL) would be ideal for building an OLAP engine. My goal is to push aggregated data (to Cassandra or other low-latency data storage) and then be able to project the results on a web page (web service). New data will be added (aggregated) once a da

Re: Spark on Yarn: java.lang.IllegalArgumentException: Invalid rule

2015-02-03 Thread maven
The version I'm using was already pre-built for Hadoop 2.3. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-Yarn-java-lang-IllegalArgumentException-Invalid-rule-tp21382p21485.html Sent from the Apache Spark User List mailing list archive at Nabble

Re: Error in saving schemaRDD with Decimal as Parquet

2015-02-03 Thread Manoj Samel
Hi, Any thoughts ? Thanks, On Sun, Feb 1, 2015 at 12:26 PM, Manoj Samel wrote: > Spark 1.2 > > SchemaRDD has schema with decimal columns created like > > x1 = new StructField("a", DecimalType(14,4), true) > > x2 = new StructField("b", DecimalType(14,4), true) > > Registering as SQL Temp table

Re: ephemeral-hdfs vs persistent-hdfs - performance

2015-02-03 Thread Joe Wass
The data is coming from S3 in the first place, and the results will be uploaded back there. But even in the same availability zone, fetching 170 GB (that's gzipped) is slow. From what I understand of the pipelines, multiple transforms on the same RDD might involve re-reading the input, which very q

Re: ephemeral-hdfs vs persistent-hdfs - performance

2015-02-03 Thread David Rosenstrauch
We use S3 as a main storage for all our input data and our generated (output) data. (10's of terabytes of data daily.) We read gzipped data directly from S3 in our Hadoop/Spark jobs - it's not crazily slow, as long as you parallelize the work well by distributing the processing across enough
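
A one-line sketch of this pattern (bucket and path are placeholders); since gzip is not splittable, each file becomes a single partition and parallelism comes from the file count:

    // One partition per .gz file; parallelism = number of matching files
    val logs = sc.textFile("s3n://my-bucket/input/2015-02-03/*.gz")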

Kryo serialization and OOM

2015-02-03 Thread Joe Wass
I have about 500 MB of data and I'm trying to process it on a single `local` instance. I'm getting an Out of Memory exception. Stack trace at the end. Spark 1.1.1 My JVM has -Xmx2g spark.driver.memory = 1000M spark.executor.memory = 1000M spark.kryoserializer.buffer.mb = 256 spark.kryoserializer

Re: Spark (SQL) as OLAP engine

2015-02-03 Thread Sean McNamara
We have gone down a similar path at Webtrends, Spark has worked amazingly well for us in this use case. Our solution goes from REST, directly into spark, and back out to the UI instantly. Here is the resulting product in case you are curious (and please pardon the self promotion): https://www

Re: Spark (SQL) as OLAP engine

2015-02-03 Thread Jonathan Haddad
Write out the rdd to a cassandra table. The datastax driver provides saveToCassandra() for this purpose. On Tue Feb 03 2015 at 8:59:15 AM Adamantios Corais < adamantios.cor...@gmail.com> wrote: > Hi, > > After some research I have decided that Spark (SQL) would be ideal for > building an OLAP en
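
A minimal sketch with the spark-cassandra-connector, assuming aggregatedRdd is an RDD of tuples matching the listed columns and that the keyspace and table already exist (all names are placeholders):

    import com.datastax.spark.connector._

    // Write the aggregated RDD straight into a Cassandra table
    aggregatedRdd.saveToCassandra("olap", "daily_agg", SomeColumns("day", "metric", "value"))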

Re: ephemeral-hdfs vs persistent-hdfs - performance

2015-02-03 Thread Joe Wass
Thanks very much, that's good to know, I'll certainly give it a look. Can you give me a hint about how you unzip your input files on the fly? I thought that it wasn't possible to parallelize zipped inputs unless they were unzipped before passing to Spark? Joe On 3 February 2015 at 17:48, David Rosen

Re: ephemeral-hdfs vs persistent-hdfs - performance

2015-02-03 Thread Ted Yu
Using s3a protocol (introduced in hadoop 2.6.0) would be faster compared to s3. The upcoming hadoop 2.7.0 contains some bug fixes for s3a. FYI On Tue, Feb 3, 2015 at 9:48 AM, David Rosenstrauch wrote: > We use S3 as a main storage for all our input data and our generated > (output) data. (10'

GraphX pregel: getting the current iteration number

2015-02-03 Thread Matthew Cornell
Hi Folks, I'm new to GraphX and Scala and my sendMsg function needs to index into an input list to my algorithm based on the pregel()() iteration number, but I don't see a way to access that. I see in https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/Preg

Re: GraphX pregel: getting the current iteration number

2015-02-03 Thread Daniil Osipov
I don't think it's possible to access. What I've done before is send the current or next iteration index with the message, where the message is a case class. HTH Dan On Tue, Feb 3, 2015 at 10:20 AM, Matthew Cornell wrote: > Hi Folks, > > I'm new to GraphX and Scala and my sendMsg function needs
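
A sketch of that pattern, assuming an existing graph: Graph[Double, Int]; each vertex attribute keeps the last iteration number it saw:

    import org.apache.spark.graphx._

    // Carry the iteration counter inside the message itself
    case class Msg(iter: Int, value: Double)

    val result = Pregel(graph.mapVertices((_, v) => (0, v)), Msg(0, 0.0), maxIterations = 5)(
      (id, attr, msg) => (msg.iter, attr._2 + msg.value),             // vprog: record the iteration
      t => Iterator((t.dstId, Msg(t.srcAttr._1 + 1, t.srcAttr._2))),  // sendMsg: bump the counter
      (a, b) => Msg(math.max(a.iter, b.iter), a.value + b.value)      // mergeMsg
    )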

Re: Setting maxPrintString in Spark Repl to view SQL query plans

2015-02-03 Thread Michael Armbrust
You should be able to do something like: sbt -Dscala.repl.maxprintstring=64000 hive/console Here's an overview of catalyst: https://docs.google.com/a/databricks.com/document/d/1Hc_Ehtr0G8SQUg69cmViZsMi55_Kf3tISD9GPGU5M1Y/edit#heading=h.vp2tej73rtm2 On Tue, Feb 3, 2015 at 1:37 AM, Mick Davies wr

Re: Setting maxPrintString in Spark Repl to view SQL query plans

2015-02-03 Thread Michael Armbrust
I'll add that I usually just do println(query.queryExecution) On Tue, Feb 3, 2015 at 11:34 AM, Michael Armbrust wrote: > You should be able to do something like: > > sbt -Dscala.repl.maxprintstring=64000 hive/console > > Here's an overview of catalyst: > https://docs.google.com/a/databricks.com/docu

Re: ephemeral-hdfs vs persistent-hdfs - performance

2015-02-03 Thread David Rosenstrauch
Not all of our input files are zipped. The ones that are zipped obviously are not parallelized - they're just processed by a single task. Not a big issue for us, though, as those zipped files aren't too big. DR On 02/03/2015 01:08 PM, Joe Wass wrote: Thanks very much, that's good to know, I'll

Re: Supported Notebooks (and other viz tools) for Spark 0.9.1?

2015-02-03 Thread andy petrella
Adamantios, As I said, I backported it to 0.9.x and it's now pushed to this branch: https://github.com/andypetrella/spark-notebook/tree/spark-0.9.x. I haven't created a dist atm, because I'd prefer to do that only if necessary :-). So, if you want to try it out, just clone the repo, check out this

Re: ephemeral-hdfs vs persistent-hdfs - performance

2015-02-03 Thread Sven Krasser
Hey Joe, With the ephemeral HDFS, you get the instance store of your worker nodes. For m3.xlarge that will be two 40 GB SSDs local to each instance, which are very fast. For the persistent HDFS, you get whatever EBS volumes the launch script configured. EBS volumes are always network drives, so t

Sort based shuffle not working properly?

2015-02-03 Thread nitinkak001
I am trying to implement secondary sort in Spark as we do in map-reduce. Here is my data (tab separated, without the header c1, c2, c3).

c1  c2  c3
1   2   4
1   3   6
2   4   7
2   6   8
3   5   5
3   1   8
3   2   0

To do secondary sort, I crea

Re: Sort based shuffle not working properly?

2015-02-03 Thread nitinkak001
Just to add, I am suing Spark 1.1.0 -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Sort-based-shuffle-not-working-properly-tp21487p21488.html Sent from the Apache Spark User List mailing list archive at Nabble.com. -

RE: Sort based shuffle not working properly?

2015-02-03 Thread Mohammed Guller
Nitin, Suing Spark is not going to help. Perhaps you should sue someone else :-) Just kidding! Mohammed -Original Message- From: nitinkak001 [mailto:nitinkak...@gmail.com] Sent: Tuesday, February 3, 2015 1:57 PM To: user@spark.apache.org Subject: Re: Sort based shuffle not working prop

Re: Sort based shuffle not working properly?

2015-02-03 Thread Sean Owen
Hm, I don't think the sort partitioner is going to cause the result to be ordered by c1,c2 if you only partitioned on c1. I mean, it's not even guaranteed that the type of c2 has an ordering, right? On Tue, Feb 3, 2015 at 3:38 PM, nitinkak001 wrote: > I am trying to implement secondary sort in sp

Re: Sort based shuffle not working properly?

2015-02-03 Thread Nitin kak
I thought that's what sort-based shuffle did, sort the keys going to the same partition. I have tried (c1, c2) as an (Int, Int) tuple as well, so I don't think the ordering of c2's type is the problem here. On Tue, Feb 3, 2015 at 5:21 PM, Sean Owen wrote: > Hm, I don't think the sort partitioner is go

Re: Sort based shuffle not working properly?

2015-02-03 Thread Nitin kak
This is an excerpt from the design document of the sort-based shuffle implementation. I am thinking I might be wrong in my perception of sort-based shuffle; I don't completely understand it though. *Motivation* A sort-based shuffle can be more scalable than Spark's current hash-based one because
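
For reference, the usual way to get a secondary sort is a composite key with a partitioner on c1 alone; a sketch assuming rows: RDD[(Int, Int, Int)] (repartitionAndSortWithinPartitions needs Spark 1.2+):

    import org.apache.spark.Partitioner

    // Partition on c1 only, but sort within each partition by the full (c1, c2) key
    class C1Partitioner(parts: Int) extends Partitioner {
      def numPartitions: Int = parts
      def getPartition(key: Any): Int = key match {
        case (c1: Int, _) => ((c1.hashCode % parts) + parts) % parts
      }
    }

    val sorted = rows
      .map { case (c1, c2, c3) => ((c1, c2), c3) }
      .repartitionAndSortWithinPartitions(new C1Partitioner(4))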

Re: 2GB limit for partitions?

2015-02-03 Thread Imran Rashid
Michael, you are right, there is definitely some limit at 2GB. Here is a trivial example to demonstrate it:

    import org.apache.spark.storage.StorageLevel
    val d = sc.parallelize(1 to 1e6.toInt, 1).map{ i => new Array[Byte](5e3.toInt) }.persist(StorageLevel.DISK_ONLY)
    d.count()

It gives the same err

Re: 2GB limit for partitions?

2015-02-03 Thread Aaron Davidson
To be clear, there is no distinction between partitions and blocks for RDD caching (each RDD partition corresponds to 1 cache block). The distinction is important for shuffling, where by definition N partitions are shuffled into M partitions, creating N*M intermediate blocks. Each of these blocks m

Re: 2GB limit for partitions?

2015-02-03 Thread Reynold Xin
cc dev list How are you saving the data? There are two relevant 2GB limits: 1. Caching 2. Shuffle For caching, a partition is turned into a single block. For shuffle, each map partition is partitioned into R blocks, where R = number of reduce tasks. It is unlikely a shuffle block > 2G, altho

Re: 2GB limit for partitions?

2015-02-03 Thread Michael Albert
Thank you! This is very helpful. -Mike From: Aaron Davidson To: Imran Rashid Cc: Michael Albert ; Sean Owen ; "user@spark.apache.org" Sent: Tuesday, February 3, 2015 6:13 PM Subject: Re: 2GB limit for partitions? To be clear, there is no distinction between partitions and blocks

Re: 2GB limit for partitions?

2015-02-03 Thread Imran Rashid
Thanks for the explanations, makes sense. For the record looks like this was worked on a while back (and maybe the work is even close to a solution?) https://issues.apache.org/jira/browse/SPARK-1476 and perhaps an independent solution was worked on here? https://issues.apache.org/jira/browse/SP

Re: Writing RDD to a csv file

2015-02-03 Thread Charles Feduke
In case anyone needs to merge all of their part-n files (small result set only) into a single *.csv file or needs to generically flatten case classes, tuples, etc., into comma separated values: http://deploymentzone.com/2015/01/30/spark-and-merged-csv-files/ On Tue Feb 03 2015 at 8:23:59 AM k
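
An alternative sketch using Hadoop's FileUtil.copyMerge (available in Hadoop 2.x; paths are placeholders, and this only makes sense for small outputs since everything funnels into one file):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

    // Concatenate all part-n files under the source directory into a single file
    val conf = new Configuration()
    val fs = FileSystem.get(conf)
    FileUtil.copyMerge(fs, new Path("/out/data-csv"), fs, new Path("/out/data.csv"), false, conf, null)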

advice on diagnosing Spark stall for 1.5hr out of 3.5hr job?

2015-02-03 Thread Michael Albert
Greetings! First, my sincere thanks to all who have given me advice. Following previous discussion, I've rearranged my code to try to keep the partitions to more manageable sizes. Thanks to all who commented. At the moment, the input set I'm trying to work with is about 90GB (avro parquet format).

Re: Is LogisticRegressionWithSGD in MLlib scalable?

2015-02-03 Thread Joseph Bradley
Hi Peng, Short answer: Yes. It has been run on billions of rows and tens of millions of columns. Long answer: There are many ways to implement LR in a distributed fashion, and their dependence on the dataset dimensions and compute cluster size varies. The implementation distributes the gradient
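
A minimal usage sketch (training: RDD[LabeledPoint] is assumed, and caching it matters since SGD makes one pass per iteration):

    import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
    import org.apache.spark.mllib.regression.LabeledPoint

    // Each iteration computes the gradient in a distributed pass over the data
    val model = LogisticRegressionWithSGD.train(training, 100)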

Re: Spark (SQL) as OLAP engine

2015-02-03 Thread Denny Lee
A great presentation by Evan Chan on utilizing Cassandra as Jonathan noted is at: OLAP with Cassandra and Spark http://www.slideshare.net/EvanChan2/2014-07olapcassspark. On Tue Feb 03 2015 at 10:03:34 AM Jonathan Haddad wrote: > Write out the rdd to a cassandra table. The datastax driver provid

Re: 2GB limit for partitions?

2015-02-03 Thread Mridul Muralidharan
That is fairly out of date (we used to run some of our jobs on it ... But that is forked off 1.1 actually). Regards Mridul On Tuesday, February 3, 2015, Imran Rashid wrote: > Thanks for the explanations, makes sense. For the record looks like this > was worked on a while back (and maybe the wo

Re: connector for CouchDB

2015-02-03 Thread hnahak
Spark doesn't support it, but this connector is open source; you can get it from GitHub. The difference between these two DBs depends on what type of solution you are looking for. Please refer to this link: http://blog.nahurst.com/visual-guide-to-nosql-systems FYI, from the list of NOSQL in

StackOverflowError on RDD.union

2015-02-03 Thread Thomas Kwan
I am trying to combine multiple RDDs into 1 RDD, and I am using the union function. I wonder if anyone has seen a StackOverflowError as follows:

    Exception in thread "main" java.lang.StackOverflowError
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
        at org.apache.spark.rdd.Union

Re: StackOverflowError on RDD.union

2015-02-03 Thread Mark Hamstra
Use SparkContext#union[T](rdds: Seq[RDD[T]]) On Tue, Feb 3, 2015 at 7:43 PM, Thomas Kwan wrote: > I am trying to combine multiple RDDs into 1 RDD, and I am using the union > function. I wonder if anyone has seen StackOverflowError as follows: > > Exception in thread "main" java.lang.StackOverflo
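
The difference, sketched (rdds: Seq[RDD[String]] is assumed):

    // Flat union: one UnionRDD over all inputs, shallow lineage
    val all = sc.union(rdds)

    // Pairwise fold: lineage depth grows with rdds.size and can overflow the stack
    // val all = rdds.reduce(_ union _)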

“mapreduce.job.user.classpath.first” for Spark

2015-02-03 Thread Corey Nolet
I'm having a really bad dependency conflict right now with Guava versions between my Spark application in Yarn and (I believe) Hadoop's version. The problem is, my driver has the version of Guava which my application is expecting (15.0) while it appears the Spark executors that are working on my R

Re: “mapreduce.job.user.classpath.first” for Spark

2015-02-03 Thread M. Dale
Try spark.yarn.user.classpath.first (see https://issues.apache.org/jira/browse/SPARK-2996 - only works for YARN). Also thread at http://apache-spark-user-list.1001560.n3.nabble.com/netty-on-classpath-when-using-spark-submit-td18030.html. HTH, Markus On 02/03/2015 11:20 PM, Corey Nolet wrote:
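
A sketch of setting it programmatically (it can equally be passed with --conf on the spark-submit command line):

    import org.apache.spark.{SparkConf, SparkContext}

    // YARN-only setting (SPARK-2996); must be set before the SparkContext is created
    val conf = new SparkConf().set("spark.yarn.user.classpath.first", "true")
    val sc = new SparkContext(conf)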

HiveContext in SparkSQL - concurrency issues

2015-02-03 Thread matha.harika
Hi, I've been trying to use HiveContext (instead of SQLContext) in my SparkSQL application, and when I run the application concurrently, only the first call works and every other call throws the following error: ERROR Datastore.Schema: Failed initialising database. Failed to start database

Re: Spark (SQL) as OLAP engine

2015-02-03 Thread McNerlin, Andrew (Agoda)
Hi Sean, I'm interested in trying something similar. How was your performance when you had many concurrent queries running against spark? I know this will work well where you have a low volume of queries against a large dataset, but am concerned about having a high volume of queries against t

Multiple running SparkContexts detected in the same JVM!

2015-02-03 Thread gavin zhang
I have a cluster which running CDH5.1.0 with Spark component. Because the default version of Spark from CDH5.1.0 is 1.0.0 while I want to use some features of Spark 1.2.0, I compiled another Spark with Maven. But when I run into Spark-shell and created a new SparkContext, I met the below error: 15

Re: “mapreduce.job.user.classpath.first” for Spark

2015-02-03 Thread bo yang
Corey, Which version of Spark do you use? I am using Spark 1.2.0, and guava 15.0. It seems fine. Best, Bo On Tue, Feb 3, 2015 at 8:56 PM, M. Dale wrote: > Try spark.yarn.user.classpath.first (see > https://issues.apache.org/jira/browse/SPARK-2996 - only works for YARN). > Also thread at > h

Re: Fail to launch spark-shell on windows 2008 R2

2015-02-03 Thread Denny Lee
Hi Ningjun, I have been working with Spark 1.2 on Windows 7 and Windows 2008 R2 (purely for development purposes). I had most recently installed them utilizing Java 1.8, Scala 2.10.4, and Spark 1.2 Precompiled for Hadoop 2.4+. A handy thread concerning the null\bin\winutils issue is addressed in

Exception in thread "main" java.lang.SecurityException: class "javax.servlet.ServletRegistration"'

2015-02-03 Thread DEVAN M.S.
HI all, I need some help. When I try to run a Spark project it shows: "Exception in thread "main" java.lang.SecurityException: class "javax.servlet.ServletRegistration"'s signer information does not match signer information of other classes in the same package". *After deleting "/home/d

Spark SQL taking long time to print records from a table

2015-02-03 Thread jguliani
I have 3 text files in HDFS which I am reading using Spark SQL and registering them as tables. After that I am doing almost 5-6 operations - including joins, group by etc. And this whole process takes barely 6-7 secs. (Source file size - 3 GB with almost 20 million rows.) As a final step of