Re: Set EXTRA_JAR environment variable for spark-jobserver

2015-01-09 Thread Sasi
Boris, yes, as you mentioned, we are creating a new SparkContext for our job. The reason is to define the Apache Cassandra connection using SparkConf. We hope this should also work. For uploading the JAR, we followed: (1) package the JAR using *sbt package*; (2) use *curl --data-binary @target/s
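
For reference, a minimal sketch of defining the Cassandra connection through SparkConf, assuming the DataStax spark-cassandra-connector is on the classpath; the host address and app name below are hypothetical:

    import org.apache.spark.{SparkConf, SparkContext}

    // spark.cassandra.connection.host is the connector's contact-point setting;
    // the address and app name are placeholders.
    val conf = new SparkConf()
      .setAppName("cassandra-job")
      .set("spark.cassandra.connection.host", "10.0.0.1")
    val sc = new SparkContext(conf)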

RE: skipping header from each file

2015-01-09 Thread Somnath Pandeya
Maybe you can use the wholeTextFiles method, which returns the filename and content of each file as a PairRDD, and then you can remove the first line from the files. -Original Message- From: Hafiz Mujadid [mailto:hafizmujadi...@gmail.com] Sent: Friday, January 09, 2015 11:48 AM To: user@spark.apache.
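
A rough sketch of that approach (the input path is hypothetical); wholeTextFiles yields (filename, content) pairs, so the first line can be dropped per file:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("skip-headers"))
    // one (filename, whole file content) pair per input file
    val files = sc.wholeTextFiles("hdfs:///data/csv")
    // drop the header line of every file and emit the remaining lines as records
    val records = files.flatMap { case (_, content) => content.split("\n").drop(1) }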

RE: Spark History Server can't read event logs

2015-01-09 Thread michael.england
Hi Marcelo, On MapR, the mapr user can read the files using the NFS mount; however, using the normal hadoop fs -cat /... command, I get permission denied. As the history server is pointing to a location on maprfs, not the NFS mount, I'd imagine the Spark history server is trying to read the files

Re: Did anyone tried overcommit of CPU cores?

2015-01-09 Thread Xuelin Cao
Thanks, but how do I increase the tasks per core? For example, if the application claims 10 cores, is it possible to launch 100 tasks concurrently? On Fri, Jan 9, 2015 at 2:57 PM, Jörn Franke wrote: > Hello, > > Based on experiences with other software in virtualized environments I > cannot re

Streaming Checkpointing

2015-01-09 Thread Asim Jalis
In Spark Streaming apps, if I enable ssc.checkpoint(dir), does this checkpoint all RDDs? Or is it just checkpointing windowing and state RDDs? For example, if in a DStream I am using an iterative algorithm on a non-state, non-window RDD, do I have to checkpoint it explicitly myself, or can I assume t
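
For context, a sketch of checkpointing a plain (non-state, non-window) RDD explicitly inside a streaming job, under the assumption that automatic checkpointing only covers state and window DStreams; the source and transformation below are hypothetical:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(new SparkConf().setAppName("ckpt-sketch"), Seconds(10))
    ssc.checkpoint("hdfs:///tmp/checkpoints")            // also sets the RDD checkpoint dir

    val lines = ssc.socketTextStream("localhost", 9999)  // hypothetical source
    lines.foreachRDD { rdd =>
      val derived = rdd.map(_.length)                    // stand-in for an iterative computation
      derived.checkpoint()                               // explicit checkpoint of a plain RDD
      derived.count()                                    // materialize so the checkpoint is written
    }
    ssc.start()
    ssc.awaitTermination()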

Re: SPARKonYARN failing on CDH 5.3.0 : container cannot be fetched because of NumberFormatException

2015-01-09 Thread Mukesh Jha
I am using the pre-built *spark-1.2.0-bin-hadoop2.4* from *[1]* to submit Spark applications to YARN; I cannot find a pre-built Spark for *CDH-5.x* versions. So, in my case the org.apache.hadoop.yarn.util.ConverterUtils class is coming from the spark-assembly-1.1.0-hadoop2.4.0.jar which is part of th

Re: skipping header from each file

2015-01-09 Thread Sean Owen
I think this was already answered on stackoverflow: http://stackoverflow.com/questions/27854919/skipping-header-file-from-each-csv-file-in-spark where the one additional idea would be: If there were just one header line, in the first record, then the most efficient way to filter it out is: rdd.m
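
One common shape of that filter, as a sketch (not necessarily the exact snippet from the linked answer), drops the first line of the first partition only:

    // sc is an existing SparkContext; the path is hypothetical
    val rdd = sc.textFile("hdfs:///data/with-header.csv")
    val noHeader = rdd.mapPartitionsWithIndex { (idx, iter) =>
      if (idx == 0) iter.drop(1) else iter
    }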

EventBatch and SparkFlumeProtocol not found in spark codebase?

2015-01-09 Thread bit1...@163.com
Hi, When I fetch the Spark code base and import it into IntelliJ IDEA as an SBT project and then build it with SBT, there are compile errors in the examples module complaining that EventBatch and SparkFlumeProtocol cannot be found; it looks like they should be in the org.apache.spark.streaming.flume.sink package. Not su

Re: EventBatch and SparkFlumeProtocol not found in spark codebase?

2015-01-09 Thread Sean Owen
What's up with the IJ questions all of a sudden? This PR from yesterday contains a summary of the answer to your question: https://github.com/apache/spark/pull/3952 : "Rebuild Project" can fail the first time the project is compiled, because generated source files are not automatically generated

Is It Feasible for Spark 1.1 Broadcast to Fully Utilize the Ethernet Card Throughput?

2015-01-09 Thread Jun Yang
Guys, I have a question regarding the Spark 1.1 broadcast implementation. In our pipeline, we have a large multi-class LR model, which is about 1 GiB in size. To take advantage of Spark parallelism, a natural approach is to broadcast this model file to the worker nodes. However, it looks that bro

Re: Is It Feasible for Spark 1.1 Broadcast to Fully Utilize the Ethernet Card Throughput?

2015-01-09 Thread Akhil Das
You can try the following: - Increase spark.akka.frameSize (default is 10 MB) - Try using TorrentBroadcast Thanks Best Regards On Fri, Jan 9, 2015 at 3:41 PM, Jun Yang wrote: > Guys, > > I have a question regarding the Spark 1.1 broadcast implementation. > > In our pipeline, we have a large mu
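
As a sketch, both suggestions are SparkConf settings applied before the context is created (the values below are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("big-broadcast")
      .set("spark.akka.frameSize", "128")   // in MB; the default was around 10
      .set("spark.broadcast.factory", "org.apache.spark.broadcast.TorrentBroadcastFactory")
    val sc = new SparkContext(conf)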

Re: spark 1.2 defaults to MR1 class when calling newAPIHadoopRDD

2015-01-09 Thread Shixiong Zhu
The official distribution has the same issue. I opened a ticket: https://issues.apache.org/jira/browse/SPARK-5172 Best Regards, Shixiong Zhu 2015-01-08 15:51 GMT+08:00 Shixiong Zhu : > I have not used CDH5.3.0. But it looks like > spark-examples-1.2.0-cdh5.3.0-hadoop2.5.0-cdh5.3.0.jar contains some > ha

KryoDeserialization getting java.io.EOFException

2015-01-09 Thread yuemeng1
Hi, when I run a query in Spark SQL, it gives me the following error. What possible reasons could cause this problem? java.io.EOFException at org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:148) at org.apache.spark.sql.hbase.HBasePartitioner$$an

PermGen issues on AWS

2015-01-09 Thread Joe Wass
I'm running on an AWS cluster of 10 x m1.large (64 bit, 7.5 GiB RAM). FWIW I'm using the Flambo Clojure wrapper which uses the Java API but I don't think that should make any difference. I'm running with the following command: spark/bin/spark-submit --class mything.core --name "My Thing" --conf sp

Parquet compression codecs not applied

2015-01-09 Thread Ayoub
Hello, I tried to save a table created via the Hive context as a Parquet file, but whatever compression codec (uncompressed, snappy, gzip or lzo) I set via setConf, like: setConf("spark.sql.parquet.compression.codec", "gzip"), the size of the generated files is always the same, so it seems l

Queue independent jobs

2015-01-09 Thread Anders Arpteg
Hey, Let's say we have multiple independent jobs that each transform some data and store it in distinct HDFS locations. Is there a nice way to run them in parallel? See the following pseudo code snippet: dateList.map(date => sc.hdfsFile(date).map(transform).saveAsHadoopFile(date)) It's unfortunate i

OptionalDataException during Naive Bayes Training

2015-01-09 Thread jatinpreet
Hi, I am using Spark version 1.1 in standalone mode on the cluster. Sometimes, during Naive Bayes training, I get an OptionalDataException at the line "map at NaiveBayes.scala:109". I am getting the following exception on the console: java.io.OptionalDataException: java.io.ObjectInputStream.readO

Re: Set EXTRA_JAR environment variable for spark-jobserver

2015-01-09 Thread Sasi
We were able to resolve *SparkException: Job aborted due to stage failure: All masters are unresponsive! Giving up* as well. spark-jobserver is working fine now and we need to experiment more. Thank you guys. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Set-EXT

Re: Did anyone tried overcommit of CPU cores?

2015-01-09 Thread gen tang
Hi, As you said, --executor-cores defines the max number of tasks that an executor can run simultaneously. So, if you claim 10 cores, it is not possible to launch more than 10 tasks in an executor at the same time. In my experience, setting cores higher than the number of physical CPU cores will cau

Zipping RDDs of equal size not possible

2015-01-09 Thread Niklas Wilcke
Hi Spark community, I have a problem with zipping two RDDs of the same size and same number of partitions. The error message says that zipping is only allowed on RDDs which are partitioned into chunks of exactly the same sizes. How can I ensure this? My workaround at the moment is to repartition b

Re: Queue independent jobs

2015-01-09 Thread Sean Owen
You can parallelize on the driver side. The way to do it is almost exactly what you have here, where you're iterating over a local Scala collection of dates and invoking a Spark operation for each. Simply write "dateList.par.map(...)" to make the local map proceed in parallel. It should invoke the
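
A sketch of the full pattern (dates, paths and the transform are hypothetical); .par turns the local list into a parallel collection, so each save action is submitted from its own driver thread:

    // sc is an existing SparkContext
    val dateList = Seq("2015-01-07", "2015-01-08", "2015-01-09")
    dateList.par.map { date =>
      sc.textFile(s"hdfs:///input/$date")
        .map(_.toUpperCase)                        // stand-in for the real transform
        .saveAsTextFile(s"hdfs:///output/$date")   // independent job per date
    }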

Re: Queue independent jobs

2015-01-09 Thread Anders Arpteg
Awesome, it actually seems to work. Amazing how simple it can be sometimes... Thanks Sean! On Fri, Jan 9, 2015 at 12:42 PM, Sean Owen wrote: > You can parallelize on the driver side. The way to do it is almost > exactly what you have here, where you're iterating over a local Scala > collection

Re: PermGen issues on AWS

2015-01-09 Thread Sean Owen
It's normal for PermGen to be a bit more of an issue with Spark than for other JVM-based applications. You should simply increase the PermGen size, which I don't see in your command. -XX:MaxPermSize=256m allows it to grow to 256m for example. The right size depends on your total heap size and app.
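
A sketch of passing that flag to both the driver and the executors via spark-submit (the jar name and the size are illustrative):

    spark/bin/spark-submit --class mything.core --name "My Thing" \
      --driver-java-options "-XX:MaxPermSize=256m" \
      --conf spark.executor.extraJavaOptions=-XX:MaxPermSize=256m \
      myjob.jar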

Re: calculating the mean of SparseVector RDD

2015-01-09 Thread Rok Roskar
Thanks for the suggestion -- however, it looks like this is even slower. With the small data set I'm using, my aggregate function takes ~9 seconds and colStats.mean() takes ~1 minute. However, I can't get it to run with the Kryo serializer -- I get the error: com.esotericsoftware.kryo.KryoExcep

Re: SPARKonYARN failing on CDH 5.3.0 : container cannot be fetched because of NumberFormatException

2015-01-09 Thread Sean Owen
Again this is probably not the place for CDH-specific questions, and this one is already answered at http://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/CDH-5-3-0-container-cannot-be-fetched-because-of/m-p/23497#M478 On Fri, Jan 9, 2015 at 9:23 AM, Mukesh Jha wrote: > I am using pre

Re: Parallel execution on one node

2015-01-09 Thread Sean Owen
The repartitioning has some overhead. Do your results show that perhaps this is what is taking the extra time? You can try reading the source file with more partitions instead of partitioning it after the fact. See the minPartitions argument to loadLibSVMFile. But before that, I'd assess whether t
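
A sketch of the minPartitions variant (the path, feature count and partition count are illustrative):

    import org.apache.spark.mllib.util.MLUtils

    // sc is an existing SparkContext; the fourth argument is minPartitions
    val data = MLUtils.loadLibSVMFile(sc, "hdfs:///data/sample.libsvm", 10000, 32)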

Re: PermGen issues on AWS

2015-01-09 Thread Joe Wass
Thanks, I noticed this after posting. I'll try that. I also think that perhaps Clojure might be creating more classes than the equivalent Java would, so I'll nudge it a bit higher. On 9 January 2015 at 11:45, Sean Owen wrote: > It's normal for PermGen to be a bit more of an issue with Spark than

Accidental kill in UI

2015-01-09 Thread Joe Wass
So I had a Spark job with various failures, and I decided to kill it and start again. I clicked the 'kill' link in the web console, restarted the job on the command line and headed back to the web console and refreshed to see how my job was doing... the URL at the time was: /stages/stage/kill?id=1

Re: TF-IDF from spark-1.1.0 not working on cluster mode

2015-01-09 Thread Priya Ch
Please find the attached worker log. I could see stream closed exception On Wed, Jan 7, 2015 at 10:51 AM, Xiangrui Meng wrote: > Could you attach the executor log? That may help identify the root > cause. -Xiangrui > > On Mon, Jan 5, 2015 at 11:12 PM, Priya Ch > wrote: > > Hi All, > > > > Word

Re: Accidental kill in UI

2015-01-09 Thread Sean Owen
(FWIW yes I think this should certainly be a POST. The link can become a miniature form to achieve this and then the endpoint just needs to accept POST only. You should propose a pull request.) On Fri, Jan 9, 2015 at 12:51 PM, Joe Wass wrote: > So I had a Spark job with various failures, and I de

Play Scala Spark Example

2015-01-09 Thread Eduardo Cusa
Hi guys, I am running the following example: https://github.com/knoldus/Play-Spark-Scala on the same machine as the Spark master, and the Spark cluster was launched with the EC2 script. I'm stuck with these errors; any idea how to fix them? Regards Eduardo Calling the Play app prints the following exceptio

Newbie Question on How Tasks are Executed

2015-01-09 Thread mixtou
I am writing my first project in Spark. It is an implementation of the "Space Saving" counting algorithm. I am trying to understand how tasks are executed in partitions. As you can see from my code, the algorithm keeps only a small number of words in memory, for example 100: the top-k ones, not all

Re: Mapping directory structure to columns in SparkSQL

2015-01-09 Thread Michael Davies
Hi Michael, I have got the directory-based column support working, at least in a trial. I have put the trial code here: DirIndexParquet.scala. It has involved me

Re: Data locality running Spark on Mesos

2015-01-09 Thread Michael V Le
Hi Tim, Thanks for your response. The benchmark I used just reads data in from HDFS and builds the Linear Regression model using methods from the MLlib. Unfortunately, for various reasons, I can't open the source code for the benchmark at this time. I will try to replicate the problem using some

Re: SparkSQL schemaRDD & MapPartitions calls - performance issues - columnar formats?

2015-01-09 Thread Cheng Lian
Hey Nathan, Thanks for sharing, this is a very interesting post :) My comments are inlined below. Cheng On 1/7/15 11:53 AM, Nathan McCarthy wrote: Hi, I'm trying to use a combination of SparkSQL and 'normal' Spark/Scala via rdd.mapPartitions(...). Using the latest release 1.2.0. Simple exa

Cleaning up spark.local.dir automatically

2015-01-09 Thread michael.england
Hi, Is there a way of automatically cleaning up the spark.local.dir after a job has been run? I have noticed a large number of temporary files have been stored here and are not cleaned up. The only solution I can think of is to run some sort of cron job to delete files older than a few days. I

How to access OpenHashSet in my standalone program?

2015-01-09 Thread Tae-Hyuk Ahn
Hi, I would like to use OpenHashSet (org.apache.spark.util.collection.OpenHashSet) in my standalone program. I can import it without error: import org.apache.spark.util.collection.OpenHashSet However, when I try to access it, I get an error: object OpenHashSet in package collection
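
For context, OpenHashSet is marked private[spark], so it compiles as an import but cannot be referenced from code outside Spark's own packages. An unsupported workaround, sketched below (the package and object names are hypothetical, and the trick relies on Spark internals), is to place your code under the org.apache.spark namespace:

    // Unsupported workaround sketch: classes under org.apache.spark can see
    // private[spark] members.
    package org.apache.spark.myapp

    import org.apache.spark.util.collection.OpenHashSet

    object OpenHashSetDemo {
      def main(args: Array[String]): Unit = {
        val set = new OpenHashSet[Long]()
        set.add(42L)
        println(set.contains(42L))
      }
    }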

Re: Cleaning up spark.local.dir automatically

2015-01-09 Thread Raghavendra Pandey
You may like to look at the spark.cleaner.ttl configuration, which is infinite by default. Spark has that configuration to delete temp files from time to time. On Fri Jan 09 2015 at 8:34:10 PM wrote: > Hi, > > Is there a way of automatically cleaning up the spark.local.dir after a > job has been run?
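
A sketch of setting it (the value is illustrative and in seconds); see the follow-ups below for a caveat about long-running jobs and cached RDDs:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("ttl-example")
      .set("spark.cleaner.ttl", "3600")   // seconds; metadata older than this may be cleaned
    val sc = new SparkContext(conf)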

RE: Cleaning up spark.local.dir automatically

2015-01-09 Thread michael.england
Thanks, I imagine this will kill any cached RDDs if their files are beyond the ttl? Thanks From: Raghavendra Pandey [mailto:raghavendra.pan...@gmail.com] Sent: 09 January 2015 15:29 To: England, Michael (IT/UK); user@spark.apache.org Subject: Re: Cleaning up spark.local.dir automatically You m

DeepLearning and Spark ?

2015-01-09 Thread Jaonary Rabarisoa
Hi all, Deep learning algorithms are popular and achieve state-of-the-art performance on several real-world machine learning problems. Currently there is no DL implementation in Spark, and I wonder if there is ongoing work on this topic. We can do DL in Spark with Sparkling Water and H2O, but t

Re: DeepLearning and Spark ?

2015-01-09 Thread Marco Shaw
Pretty vague on details: http://www.datasciencecentral.com/m/blogpost?id=6448529%3ABlogPost%3A227199 > On Jan 9, 2015, at 11:39 AM, Jaonary Rabarisoa wrote: > > Hi all, > > DeepLearning algorithms are popular and achieve many state of the art > performance in several real world machine learn

Re: SparkSQL schemaRDD & MapPartitions calls - performance issues - columnar formats?

2015-01-09 Thread Michael Armbrust
The other thing to note here is that Spark SQL defensively copies rows when we switch into user code. This probably explains the difference between 1 & 2. The difference between 1 & 3 is likely the cost of decompressing the column buffers vs. accessing a bunch of uncompressed primitive objects.

Re: Parquet compression codecs not applied

2015-01-09 Thread Michael Armbrust
This is a little confusing, but that code path is actually going through Hive, so the Spark SQL configuration does not help. Perhaps try: set parquet.compression=GZIP; On Fri, Jan 9, 2015 at 2:41 AM, Ayoub wrote: > Hello, > > I tried to save a table created via the hive context as a parquet f
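
A sketch of issuing that through a HiveContext before writing (the table names are hypothetical):

    // hiveContext is an existing org.apache.spark.sql.hive.HiveContext
    hiveContext.sql("SET parquet.compression=GZIP")
    hiveContext.sql("CREATE TABLE events_parquet STORED AS PARQUET AS SELECT * FROM events")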

Re: Spark SQL: Storing AVRO Schema in Parquet

2015-01-09 Thread Jerry Lam
Hi Raghavendra, This makes a lot of sense. Thank you. The problem is that I'm using Spark SQL right now to generate the Parquet file. What I think I need to do is use Spark directly, transform all rows from the SchemaRDD into Avro objects, and supply them to saveAsNewAPIHadoopFile (from the Pair

Re: TF-IDF from spark-1.1.0 not working on cluster mode

2015-01-09 Thread Xiangrui Meng
This is the worker log, not the executor log. The executor log can be found in folders like /newdisk2/rta/rtauser/workerdir/app-20150109182514-0001/0/ . -Xiangrui On Fri, Jan 9, 2015 at 5:03 AM, Priya Ch wrote: > Please find the attached worker log. > I could see stream closed exception > > On Wed, Jan

Re: Is It Feasible for Spark 1.1 Broadcast to Fully Utilize the Ethernet Card Throughput?

2015-01-09 Thread Davies Liu
In the current implementation of TorrentBroadcast, the blocks are fetched one by one in a single thread, so it cannot fully utilize the network bandwidth. Davies On Fri, Jan 9, 2015 at 2:11 AM, Jun Yang wrote: > Guys, > > I have a question regarding the Spark 1.1 broadcast implementation. > > In o

Re: DeepLearning and Spark ?

2015-01-09 Thread Peng Cheng
You are not the first :) and probably not the fifth to have this question. A parameter server is not included in the Spark framework, and I've seen all kinds of hacks to improvise one: REST API, HDFS, Tachyon, etc. Not sure if an 'official' benchmark & implementation will be released soon On 9 January 2015 a

RE: RowMatrix.multiply() ?

2015-01-09 Thread Adrian Mocanu
I’m resurrecting this thread because I’m interested in doing a transpose on a RowMatrix. There is this other thread too: http://apache-spark-user-list.1001560.n3.nabble.com/Matrix-multiplication-in-spark-td12562.html which presents https://issues.apache.org/jira/browse/SPARK-3434, which is still in

Re: java.io.IOException: Mkdirs failed to create file:/some/path/myapp.csv while using rdd.saveAsTextFile(fileAddress) Spark

2015-01-09 Thread firemonk9
I am facing the same exception in saveAsObjectFile. Have you found any solution? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/java-io-IOException-Mkdirs-failed-to-create-file-some-path-myapp-csv-while-using-rdd-saveAsTextFile-k-tp20994p21066.html Sent from

Re: Failed to save RDD as text file to local file system

2015-01-09 Thread firemonk9
Have you found any resolution for this issue ? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Failed-to-save-RDD-as-text-file-to-local-file-system-tp21050p21067.html Sent from the Apache Spark User List mailing list archive at Nabble.com. -

Parquet predicate pushdown troubles

2015-01-09 Thread Yana Kadiyska
I am running the following (connecting to an external Hive Metastore) /a/shark/spark/bin/spark-shell --master spark://ip:7077 --conf *spark.sql.parquet.filterPushdown=true* val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) and then ran two queries: sqlContext.sql("select count(*)

RE: Failed to save RDD as text file to local file system

2015-01-09 Thread NingjunWang
No, do you have any idea? Regards, Ningjun Wang Consulting Software Engineer LexisNexis 121 Chanlon Road New Providence, NJ 07974-1541 From: firemonk9 [via Apache Spark User List] [mailto:ml-node+s1001560n21067...@n3.nabble.com] Sent: Friday, January 09, 2015 2:56 PM To: Wang, Ningjun (LNG-NPV)

updateStateByKey: cleaning up state for keys not in current window

2015-01-09 Thread Simone Franzini
I know that in order to clean up the state for a key I have to return None when I call updateStateByKey. However, as far as I understand, updateStateByKey only gets called for new keys (i.e. keys in current batch), not for all keys in the DStream. So, how can I clear the state for those keys in thi
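
For reference, a sketch of an update function that expires idle keys by returning None (the state type and timeout are hypothetical); this relies on updateStateByKey also invoking the function, with an empty batch of new values, for keys that already have state:

    // state is (running count, last-updated timestamp in ms); both are illustrative
    def updateWithTimeout(newValues: Seq[Long], state: Option[(Long, Long)]): Option[(Long, Long)] = {
      val now = System.currentTimeMillis()
      val timeoutMs = 30 * 60 * 1000L                   // hypothetical 30-minute timeout
      if (newValues.nonEmpty) {
        Some((state.map(_._1).getOrElse(0L) + newValues.sum, now))
      } else state match {
        case Some((_, lastSeen)) if now - lastSeen > timeoutMs => None   // drop the stale key
        case other => other
      }
    }

    // usage, given pairs: DStream[(String, Long)]
    // val counts = pairs.updateStateByKey(updateWithTimeout _)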

Re: RowMatrix.multiply() ?

2015-01-09 Thread Reza Zadeh
Yes that is the correct JIRA. It should make it to 1.3. Best, Reza On Fri, Jan 9, 2015 at 11:13 AM, Adrian Mocanu wrote: > I’m resurrecting this thread because I’m interested in doing transpose > on a RowMatrix. > > There is this other thread too: > http://apache-spark-user-list.1001560.n3.nabb

Re: RDD Moving Average

2015-01-09 Thread Mohit Jaggi
Read this: http://mail-archives.apache.org/mod_mbox/spark-user/201405.mbox/%3ccalrvtpkn65rolzbetc+ddk4o+yjm+tfaf5dz8eucpl-2yhy...@mail.gmail.com%3E

Questions about Spark and HDFS co-location

2015-01-09 Thread zfry
I am running Spark 1.1.1 built against CDH4 and have a few questions regarding Spark performance related to co-location with HDFS nodes. I want to know whether (and how efficiently) Spark takes advantage of being co-located with an HDFS node. What I mean by this is: if a file is being read by

Re: Reading HBase data - Exception

2015-01-09 Thread lmsiva
This problem got resolved. In spark-submit I was using only the --driver-class-path option; once I added the --jars option so that the workers are aware of the dependent jar files, the problem went away. I need to check if there are any worker logs that give better information than the exception I was getti
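
A sketch of the distinction (jar, class and master names are hypothetical): --driver-class-path only changes the driver's classpath, while --jars also distributes the listed jars to the executors:

    spark-submit --class com.example.HBaseReader --master spark://master:7077 \
      --jars hbase-client.jar,hbase-common.jar,hbase-protocol.jar \
      myapp.jar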

Re: Cleaning up spark.local.dir automatically

2015-01-09 Thread Andrew Ash
That's a worker setting which cleans up the files left behind by executors, so spark.cleaner.ttl isn't at the RDD level. After https://issues.apache.org/jira/browse/SPARK-1860 the cleaner won't clean up directories left by running executors. On Fri, Jan 9, 2015 at 7:38 AM, wrote: > Thanks, I

Re: DeepLearning and Spark ?

2015-01-09 Thread Andrei
Does it make sense to use Spark's actor system (e.g. via SparkContext.env.actorSystem) to create a parameter server? On Fri, Jan 9, 2015 at 10:09 PM, Peng Cheng wrote: > You are not the first :) probably not the fifth to have the question. > parameter server is not included in spark framework and

Re: DeepLearning and Spark ?

2015-01-09 Thread Krishna Sankar
I am also looking at this domain. We could potentially use the broadcast capability in Spark to distribute the parameters. Haven't thought it through yet. Cheers On Fri, Jan 9, 2015 at 2:56 PM, Andrei wrote: > Does it make sense to use Spark's actor system (e.g. via > SparkContext.env.actorSystem) t

Re: Re: EventBatch and SparkFlumeProtocol not found in spark codebase?

2015-01-09 Thread Todd
Thanks Sean. I followed the guide and imported the codebase into IntelliJ IDEA as a Maven project, with the profiles hadoop2.4 and yarn. In the Maven project view, I ran Maven Install against the module Spark Project Parent POM (root). After a pretty long time, all the modules were built successfully. But

Re: Questions about Spark and HDFS co-location

2015-01-09 Thread Sean Owen
Spark uses MapReduce InputFormat implementations to read data from disk, so in that sense it has access to, and uses, the same locality info that things like MR do. Yes, tasks go to the data, and you want to run Spark on top of the HDFS DataNodes. (Locality isn't always the only priority that deter

Re: Questions about Spark and HDFS co-location

2015-01-09 Thread Ted Yu
I was looking for related information and found: http://spark-summit.org/wp-content/uploads/2013/10/Spark-Ops-Final.pptx See also http://hbase.apache.org/book.html#perf.hdfs.configs.localread for how short circuit read is enabled. Cheers On Fri, Jan 9, 2015 at 3:50 PM, Sean Owen wrote: > Spark

Re: Questions about Spark and HDFS co-location

2015-01-09 Thread Andrew Ash
Note also that for short-circuit reads, early versions were actually net-negative in performance. Only after a second Hadoop release of the feature did it turn into a positive change. See earlier threads on this mailing list where short-circuit reads are discussed. On Fri, Jan 9, 2015 at

How to use BigInteger for userId and productId in collaborative Filtering?

2015-01-09 Thread nishanthps
Hi, The userIds and productIds in my data are BigInts; what is the best way to run collaborative filtering on this data? Should I modify MLlib's implementation to support more types, or is there an easier way? Thanks, Nishanth -- View this message in context: http://apache-spark-user-list.10

Re: DeepLearning and Spark ?

2015-01-09 Thread Peng Cheng
Not if broadcast can only be used between stages. To enable this you have to at least make broadcast asynchronous & non-blocking. On 9 January 2015 at 18:02, Krishna Sankar wrote: > I am also looking at this domain. We could potentially use the broadcast > capability in Spark to distribute the p

Rank for SQL and ORDER BY?

2015-01-09 Thread Kevin Burton
I’m trying to do a simple graph sort in Spark, which I mostly have working. The one problem I have now is that I need to order the items and then assign a rank position, so the top item should have rank 0, the next one should have rank 1, etc. Hive and Pig support this with the RANK operator. I *think*
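
One way to sketch this in plain Spark (the data is hypothetical) is to sort and then zipWithIndex, which hands out consecutive indices in the sorted order:

    // sc is an existing SparkContext; (id, score) pairs are illustrative
    val scores = sc.parallelize(Seq(("a", 10.0), ("b", 42.0), ("c", 7.0)))
    val ranked = scores
      .sortBy(_._2, ascending = false)   // highest score first
      .zipWithIndex()                    // rank 0 for the top item, 1 for the next, ...
      .map { case ((id, score), rank) => (id, score, rank) }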

OOM exception during row deserialization

2015-01-09 Thread Pala M Muthaia
Hi, I am using Spark 1.0.1. I am trying to debug an OOM exception I saw during a join step. Basically, I have an RDD of rows that I am joining with another RDD of tuples. Some of the tasks succeed, but a fair number fail with an OOM exception with the stack below. The stack belongs to the 'reducer' tha

Web Service + Spark

2015-01-09 Thread Cui Lin
Hello, all, What’s the best practice for deploying/publishing Spark-based scientific applications as a web service? Similar to Shiny for R. Thanks! Best regards, Cui Lin

Re: Discrepancy in PCA values

2015-01-09 Thread Upul Bandara
Hi Xiangrui, Thanks for the reply. The Julia code is also using the covariance matrix: (1/n)*X'*X. Thanks, Upul On Fri, Jan 9, 2015 at 2:11 AM, Xiangrui Meng wrote: > The Julia code is computing the SVD of the Gram matrix. PCA should be > applied to the covariance matrix. -Xiangrui > > On Thu, J

Issue writing to Cassandra from Spark

2015-01-09 Thread Ankur Srivastava
Hi, We are currently using Spark to join data in Cassandra and then write the results back into Cassandra. While reads happen without any error, during the writes we see many exceptions like the ones below. Our environment details are: - Spark v 1.1.0 - spark-cassandra-connector-java_2.10 v 1.1.0 We are

Re: Web Service + Spark

2015-01-09 Thread Corey Nolet
Cui Lin, The solution largely depends on how you want your services deployed (Java web container, Spray framework, etc...) and if you are using a cluster manager like Yarn or Mesos vs. just firing up your own executors and master. I recently worked on an example for deploying Spark services insid

Re: Accidental kill in UI

2015-01-09 Thread Nicholas Chammas
As Sean said, this definitely sounds like something worth a JIRA issue (and PR). On Fri Jan 09 2015 at 8:17:34 AM Sean Owen wrote: > (FWIW yes I think this should certainly be a POST. The link can become > a miniature form to achieve this and then the endpoint just needs to > accept POST only. Y

Re: Web Service + Spark

2015-01-09 Thread gtinside
You can also look at Spark Job Server https://github.com/spark-jobserver/spark-jobserver - Gaurav > On Jan 9, 2015, at 10:25 PM, Corey Nolet wrote: > > Cui Lin, > > The solution largely depends on how you want your services deployed (Java web > container, Spray framework, etc...) and if you

RE: Implement customized Join for SparkSQL

2015-01-09 Thread Dai, Kevin
Hi Rishi, You are right, but the ids may number in the tens of thousands, and B is a database with an index on id, which means querying by id is very fast. In fact, we load A and B as separate SchemaRDDs as you suggested, but we hope we can extend the join implementation to achieve this in the parsing stage.

Re: Discrepancy in PCA values

2015-01-09 Thread Xiangrui Meng
You need to subtract mean values to obtain the covariance matrix (http://en.wikipedia.org/wiki/Covariance_matrix). On Fri, Jan 9, 2015 at 6:41 PM, Upul Bandara wrote: > Hi Xiangrui, > > Thanks for the reply. > > Julia code is also using the covariance matrix: > (1/n)*X'*X ; > > Thanks, > Upul > >
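
In other words, the covariance uses centered columns, roughly Cov = (1/n) * (X - mean)' * (X - mean). A small sketch of centering with MLlib's column statistics (the rows RDD is hypothetical):

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.stat.Statistics

    // rows is an existing RDD[org.apache.spark.mllib.linalg.Vector]
    val mean = Statistics.colStats(rows).mean
    val centered = rows.map { v =>
      Vectors.dense(v.toArray.zip(mean.toArray).map { case (x, m) => x - m })
    }
    // running PCA/SVD on `centered` now corresponds to the covariance matrix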

Re: How to use BigInteger for userId and productId in collaborative Filtering?

2015-01-09 Thread Xiangrui Meng
Do you have more than 2 billion users/products? If not, you can pair each user/product id with an integer (check RDD.zipWithUniqueId), use them in ALS, and then join the original bigInt IDs back after training. -Xiangrui On Fri, Jan 9, 2015 at 5:12 PM, nishanthps wrote: > Hi, > > The userId's and
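
A sketch of that mapping (the ratings RDD is hypothetical); this variant uses zipWithIndex, a close cousin of the suggested zipWithUniqueId, because its contiguous indices narrow safely to the Int ids MLlib's ALS expects while the distinct count stays under ~2 billion:

    import org.apache.spark.SparkContext._
    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    // ratings is an existing RDD[(BigInt, BigInt, Double)] of (user, product, rating)
    val userIds = ratings.map(_._1).distinct().zipWithIndex().mapValues(_.toInt)
    val productIds = ratings.map(_._2).distinct().zipWithIndex().mapValues(_.toInt)

    val indexed = ratings
      .map { case (u, p, r) => (u, (p, r)) }.join(userIds)
      .map { case (_, ((p, r), uIdx)) => (p, (uIdx, r)) }.join(productIds)
      .map { case (_, ((uIdx, r), pIdx)) => Rating(uIdx, pIdx, r) }

    val model = ALS.train(indexed, 10, 10)   // rank and iterations are illustrative
    // afterwards, join the original BigInt ids back via userIds / productIds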

Re: Zipping RDDs of equal size not possible

2015-01-09 Thread Xiangrui Meng
"sample 2 * n tuples, split them into two parts, balance the sizes of these parts by filtering some tuples out" How do you guarantee that the two RDDs have the same size? -Xiangrui On Fri, Jan 9, 2015 at 3:40 AM, Niklas Wilcke <1wil...@informatik.uni-hamburg.de> wrote: > Hi Spark community, > >

Re: OptionalDataException during Naive Bayes Training

2015-01-09 Thread Xiangrui Meng
How big is your data? Did you see other error messages from executors? It seems to me like a shuffle communication error. This thread may be relevant: http://mail-archives.apache.org/mod_mbox/spark-user/201402.mbox/%3ccalrnvjuvtgae_ag1rqey_cod1nmrlfpesxgsb7g8r21h0bm...@mail.gmail.com%3E -Xiangru

Re: calculating the mean of SparseVector RDD

2015-01-09 Thread Xiangrui Meng
colStats() computes the mean values along with several other summary statistics, which makes it slower. How is the performance if you don't use Kryo? -Xiangrui On Fri, Jan 9, 2015 at 3:46 AM, Rok Roskar wrote: > thanks for the suggestion -- however, looks like this is even slower. With > the smal

Submitting SparkContext and seeing driverPropsFetcher exception

2015-01-09 Thread Corey Nolet
I'm seeing this exception when creating a new SparkContext in YARN: [ERROR] AssociationError [akka.tcp://sparkdri...@coreys-mbp.home:58243] <- [akka.tcp://driverpropsfetc...@coreys-mbp.home:58453]: Error [Shut down address: akka.tcp://driverpropsfetc...@coreys-mbp.home:58453] [ akka.remote.ShutDo

/tmp directory fills up

2015-01-09 Thread Alessandro Baretta
Gents, I'm building Spark from the current master branch and deploying it to Google Compute Engine on top of Hadoop 2.4/YARN via bdutil, Google's Hadoop cluster provisioning tool. bdutil configures Spark with spark.local.dir=/hadoop/spark/tmp, but this option is ignored in combination with YAR