Re: Re: how to set spark.executor.memory and heap size

2014-04-24 Thread wxhsdp
i tried, but no effect

Qin Wei wrote
> try the complete path
>
> qinwei
> From: wxhsdp
> Date: 2014-04-24 14:21
> To: user
> Subject: Re: how to set spark.executor.memory and heap size
> thank you, i add setJars, but nothing changes
>
>     val conf = new SparkConf()
>       .setMaster("spark://12
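
(For reference, a minimal sketch of the kind of configuration being discussed in this thread; the master URL, jar path, and memory value are illustrative assumptions, not values from the original message.)

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical setup: set executor memory through SparkConf and ship the
// application jar to the workers with setJars.
val conf = new SparkConf()
  .setMaster("spark://master:7077")        // placeholder master URL
  .setAppName("SimpleApp")
  .set("spark.executor.memory", "512m")    // per-executor heap
  .setJars(Seq("target/scala-2.10/simple-app_2.10-1.0.jar"))  // placeholder jar path
val sc = new SparkContext(conf)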

Re: SPARK_YARN_APP_JAR, SPARK_CLASSPATH and ADD_JARS in a spark-shell on YARN

2014-04-24 Thread Christophe Préaud
Good to know, thanks for pointing this out to me! On 23/04/2014 19:55, Sandy Ryza wrote: Ah, you're right about SPARK_CLASSPATH and ADD_JARS. My bad. SPARK_YARN_APP_JAR is going away entirely - https://issues.apache.org/jira/browse/SPARK-1053 On Wed, Apr 23, 2014 at 8:07 AM, Christophe Préau

Re: Need help about how hadoop works.

2014-04-24 Thread Carter
Thank you very much for your help Prashant. Sorry I still have another question about your answer: "however if the file("/home/scalatest.txt") is present on the same path on all systems it will be processed on all nodes." When presenting the file to the same path on all nodes, do we just simply c

Re: Need help about how hadoop works.

2014-04-24 Thread Prashant Sharma
It is the same file, and the hadoop library that we use for splitting takes care of assigning the right split to each node. Prashant Sharma On Thu, Apr 24, 2014 at 1:36 PM, Carter wrote: > Thank you very much for your help Prashant. > > Sorry I still have another question about your answer: "howeve

Re: how to set spark.executor.memory and heap size

2014-04-24 Thread wxhsdp
i think maybe it's the problem of reading a local file

val logFile = "/home/wxhsdp/spark/example/standalone/README.md"
val logData = sc.textFile(logFile).cache()

if i replace the above code with

val logData = sc.parallelize(Array(1,2,3,4)).cache()

the job can complete successfully. can't i read a

Re: how to set spark.executor.memory and heap size

2014-04-24 Thread Adnan Yaqoob
You need to use the proper url format:

file://home/wxhsdp/spark/example/standalone/README.md

On Thu, Apr 24, 2014 at 1:29 PM, wxhsdp wrote:
> i think maybe it's the problem of read local file
>
> val logFile = "/home/wxhsdp/spark/example/standalone/README.md"
> val logData = sc.textFile(logFile).c

Re: how to set spark.executor.memory and heap size

2014-04-24 Thread Adnan Yaqoob
Sorry, wrong format:

file:///home/wxhsdp/spark/example/standalone/README.md

An extra / is needed at the start. On Thu, Apr 24, 2014 at 1:46 PM, Adnan Yaqoob wrote:
> You need to use proper url format:
>
> file://home/wxhsdp/spark/example/standalone/README.md
>
> On Thu, Apr 24, 2014 at 1:29
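
(A minimal sketch of the corrected usage, assuming a local run; only the path comes from the thread. file:// plus an absolute path yields three slashes in total.)

import org.apache.spark.SparkContext

val sc = new SparkContext("local", "ReadLocalFile")
// three slashes: the file:// scheme followed by an absolute path
val logData = sc.textFile("file:///home/wxhsdp/spark/example/standalone/README.md").cache()
println(logData.count())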

Re: how to set spark.executor.memory and heap size

2014-04-24 Thread wxhsdp
thanks for your reply, adnan, i tried

val logFile = "file:///home/wxhsdp/spark/example/standalone/README.md"

i think there need to be three slashes after file:, it's just the same as

val logFile = "home/wxhsdp/spark/example/standalone/README.md"

the error remains :(

Re: how to set spark.executor.memory and heap size

2014-04-24 Thread Arpit Tak
Hi, You should be able to read it; file:// or file:/// is not even required for reading locally, just the path is enough.. What error message are you getting on spark-shell while reading locally? Also try reading the same file from hdfs: put your README file there and read it, it works both ways..

Re: how to set spark.executor.memory and heap size

2014-04-24 Thread wxhsdp
hi arpit, on the spark shell i can read the local file properly, but when i use sbt run, the error occurs. the sbt error message is at the beginning of the thread Arpit Tak-2 wrote > Hi, > > You should be able to read it, file://or file:/// not even required for > reading locally , just path is enough.. >

RE: Need help about how hadoop works.

2014-04-24 Thread Carter
Thank you very much Prashant.

Date: Thu, 24 Apr 2014 01:24:39 -0700
From: ml-node+s1001560n4739...@n3.nabble.com
To: gyz...@hotmail.com
Subject: Re: Need help about how hadoop works.

It is the same file and hadoop library that we use for splitting takes care of assigning the right spl

Re: how to set spark.executor.memory and heap size

2014-04-24 Thread Arpit Tak
Okk fine, try like this, i tried and it works.. specify the spark path in the constructor... and also

export SPARK_JAVA_OPTS="-Xms300m -Xmx512m -XX:MaxPermSize=1g"

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object SimpleApp {
  def main(args: Array[Stri
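
(The snippet above is cut off; a hedged reconstruction of how such a program might continue follows. The master URL, spark home, and jar path are placeholders, not values from the original message.)

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object SimpleApp {
  def main(args: Array[String]) {
    // 0.9.x-style constructor: master URL, app name, spark home, jars to ship
    val sc = new SparkContext("spark://master:7077", "Simple App",
      "/path/to/spark", Seq("target/scala-2.10/simple-app_2.10-1.0.jar"))
    val logData = sc.textFile("/home/wxhsdp/spark/example/standalone/README.md").cache()
    println("lines: " + logData.count())
  }
}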

Re: error in mllib lr example code

2014-04-24 Thread Arpit Tak
Also try out these examples, all of them work http://docs.sigmoidanalytics.com/index.php/MLlib if you spot any problems in those, let us know. Regards, arpit On Wed, Apr 23, 2014 at 11:08 PM, Matei Zaharia wrote: > See http://people.csail.mit.edu/matei/spark-unified-docs/ for a more > re

Re: Access Last Element of RDD

2014-04-24 Thread Sai Prasanna
Hi All, Finally i wrote the following code, which i feel is close to optimal, if not the most optimal one. Using file pointers, seeking the byte after the last \n but backwards !! This is memory efficient and i hope even the unix tail implementation is something similar !! import java.io.RandomAcces
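
(The code above is truncated; here is a minimal sketch of the seek-backwards idea described, assuming a newline-terminated file. It is an illustration, not the poster's actual code.)

import java.io.RandomAccessFile

def lastLine(path: String): String = {
  val raf = new RandomAccessFile(path, "r")
  try {
    var pos = raf.length - 2                               // skip a trailing '\n'
    while (pos > 0 && { raf.seek(pos); raf.read() != '\n' }) pos -= 1
    if (pos > 0) raf.seek(pos + 1) else raf.seek(0)        // start of final record
    raf.readLine()
  } finally raf.close()
}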

Re: SparkPi performance-3 cluster standalone mode

2014-04-24 Thread Adnan
Hi, Relatively new to spark, and I have tried running the SparkPi example on a standalone 12 core three machine cluster. What I'm failing to understand is that running this example with a single slice gives better performance as compared to using 12 slices. The same was the case when I was using parallelize

Re: how to set spark.executor.memory and heap size

2014-04-24 Thread wxhsdp
it seems that it's nothing about settings. i tried the take action and found it's ok, but errors occur when i try count and collect

val a = sc.textFile("any file")
a.take(n).foreach(println) // ok
a.count()                  // failed
a.collect()                // failed

val b = sc.parallelize(Array(1,2,3,4))
b.take(n).foreach(pri

Re: Access Last Element of RDD

2014-04-24 Thread Cheng Lian
You may try this:

val lastOption = sc.textFile("input").mapPartitions { iterator =>
  if (iterator.isEmpty) {
    iterator
  } else {
    Iterator
      .continually((iterator.next(), iterator.hasNext))
      .collect { case (value, false) => value }
      .take(1)
  }
}.collect().lastOption

It

Re: Is Spark a good choice for geospatial/GIS applications? Is a community volunteer needed in this area?

2014-04-24 Thread neveroutgunned
Thanks for the info. It seems like the JTS library is exactly what I need (I'm not doing any raster processing at this point). So, once they successfully finish the Scala wrappers for JTS, I would theoretically be able to use Scala to write a Spark job that includes the JTS library, and then run i

Deploying a python code on a spark EC2 cluster

2014-04-24 Thread Shubhabrata
I am stuck with an issue for the last two days and did not find any solution after several hours of googling. Here are the details. The following is a simple python code (Temp.py):

import sys
from random import random
from operator import add
from pyspark import SparkContext
from pyspark import Spar

Re: How do I access the SPARK SQL

2014-04-24 Thread diplomatic Guru
Hi Matei, I checked out the git repository and built it. However, I'm still getting the below error. It couldn't find those SQL packages. Please advise.

package org.apache.spark.sql.api.java does not exist
[ERROR] /home/VirtualBoxImages.com/Documents/projects/errCount/src/main/java/errorCount/TransDr

Re: Deploying a python code on a spark EC2 cluster

2014-04-24 Thread Shubhabrata
Spark Command: /usr/lib/jvm/java-1.7.0/bin/java -cp :/root/ephemeral-hdfs/conf:/root/ephemeral-hdfs/conf:/root/ephemeral-hdfs/conf:/root/ephemeral-hdfs/conf:/root/spark/conf:/root/spark/assembly/target/scala-2.10/spark-assembly_2.10-0.9.1-hadoop1.0.4.jar -Dspark.akka.logLifecycleEvents=true -Djava.

Re: Deploying a python code on a spark EC2 cluster

2014-04-24 Thread Shubhabrata
Moreover, it seems all the workers are registered and have sufficient memory (2.7GB whereas I have asked for 512 MB). The UI also shows the jobs are running on the slaves. But on the terminal it is still the same error "Initial job has not accepted any resources; check your cluster UI to ensure that

How to see org.apache.spark.executor.Executor logs

2014-04-24 Thread amit karmakar
I have changed the log level in log4j to ALL. Still i cannot see any log coming from org.apache.spark.executor.Executor Is there something i am missing ?

reduceByKeyAndWindow - spark internals

2014-04-24 Thread Adrian Mocanu
If I have this code:

val stream1 = doublesInputStream.window(Seconds(10), Seconds(2))
val stream2 = stream1.reduceByKeyAndWindow(_ + _, Seconds(10), Seconds(10))

Does reduceByKeyAndWindow merge all RDDs from stream1 that came in the 10 second window? Example, in the first 10 secs stream1 will have

IDE for sparkR

2014-04-24 Thread phoenixbai
I am new to R and I am trying to learn the sparkR source code. so, i am wondering what is the IDE for the sparkR project? is it rstudio? is there any IDE like intellij idea? I tried, but intellij idea won't be able to import sparkR as a project. please help! thank you!

Re: How do I access the SPARK SQL

2014-04-24 Thread Andrew Or
Did you build it with SPARK_HIVE=true? On Thu, Apr 24, 2014 at 7:00 AM, diplomatic Guru wrote: > Hi Matei, > > I checked out the git repository and built it. However, I'm still getting > below error. It couldn't find those SQL packages. Please advice. > > package org.apache.spark.sql.api.java do

Re: How do I access the SPARK SQL

2014-04-24 Thread Michael Armbrust
You shouldn't need to set SPARK_HIVE=true unless you want to use the JavaHiveContext. You should be able to access org.apache.spark.sql.api.java.JavaSQLContext with the default build. How are you building your application? Michael On Thu, Apr 24, 2014 at 9:17 AM, Andrew Or wrote: > Did you b

Re: How do I access the SPARK SQL

2014-04-24 Thread diplomatic Guru
It's a simple application based on the "People" example. I'm using Maven for building and below is the pom.xml. Perhaps, I need to change the version?

<groupId>Uthay.Test.App</groupId>
<artifactId>test-app</artifactId>
<modelVersion>4.0.0</modelVersion>
<name>TestApp</name>
<packaging>jar</packaging>
<version>1.0</version>
... (Akka repository)

Re: How do I access the SPARK SQL

2014-04-24 Thread Michael Armbrust
Yeah, you'll need to run `sbt publish-local` to push the jars to your local maven repository (~/.m2) and then depend on version 1.0.0-SNAPSHOT. On Thu, Apr 24, 2014 at 9:58 AM, diplomatic Guru wrote: > It's a simple application based on the "People" example. > > I'm using Maven for building and b

Re: How do I access the SPARK SQL

2014-04-24 Thread Aaron Davidson
Looks like you're depending on Spark 0.9.1, which doesn't have Spark SQL. Assuming you've downloaded Spark, just run 'mvn install' to publish Spark locally, and depend on Spark version 1.0.0-SNAPSHOT. On Thu, Apr 24, 2014 at 9:58 AM, diplomatic Guru wrote: > It's a simple application based on th

Re: How do I access the SPARK SQL

2014-04-24 Thread Michael Armbrust
Oh, and you'll also need to add a dependency on "spark-sql_2.10". On Thu, Apr 24, 2014 at 10:13 AM, Michael Armbrust wrote: > Yeah, you'll need to run `sbt publish-local` to push the jars to your > local maven repository (~/.m2) and then depend on version 1.0.0-SNAPSHOT. > > > On Thu, Apr 24, 20
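
(Putting Michael's and Aaron's suggestions together, the Maven dependency would presumably look something like the following; the exact version string is an assumption based on the locally published snapshot mentioned above.)

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql_2.10</artifactId>
  <version>1.0.0-SNAPSHOT</version>  <!-- assumes `sbt publish-local` / `mvn install` was run -->
</dependency>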

Re: How do I access the SPARK SQL

2014-04-24 Thread diplomatic Guru
Many thanks for your prompt reply. I'll try your suggestions and will get back to you. On 24 April 2014 18:17, Michael Armbrust wrote: > Oh, and you'll also need to add a dependency on "spark-sql_2.10". > > > On Thu, Apr 24, 2014 at 10:13 AM, Michael Armbrust > wrote: > >> Yeah, you'll need

Re: IDE for sparkR

2014-04-24 Thread maxpar
Rstudio should be fine.

Re: Access Last Element of RDD

2014-04-24 Thread Sai Prasanna
Thanks Cheng !! On Thu, Apr 24, 2014 at 5:43 PM, Cheng Lian wrote: > You may try this: > > val lastOption = sc.textFile("input").mapPartitions { iterator => > if (iterator.isEmpty) { > iterator > } else { > Iterator > .continually((iterator.next(), iterator.hasNext())) >

Re: Deploying a python code on a spark EC2 cluster

2014-04-24 Thread John King
Same problem. On Thu, Apr 24, 2014 at 10:54 AM, Shubhabrata wrote: > Moreover it seems all the workers are registered and have sufficient memory > (2.7GB where as I have asked for 512 MB). The UI also shows the jobs are > running on the slaves. But on the termial it is still the same error > "I

Trying to use pyspark mllib NaiveBayes

2014-04-24 Thread John King
I receive this error:

Traceback (most recent call last):
  File "", line 1, in
  File "/home/ubuntu/spark-1.0.0-rc2/python/pyspark/mllib/classification.py", line 178, in train
    ans = sc._jvm.PythonMLLibAPI().trainNaiveBayes(dataBytes._jrdd, lambda_)
  File "/home/ubuntu/spark-1.0.0-rc2/pyt

Spark mllib throwing error

2014-04-24 Thread John King
./spark-shell: line 153: 17654 Killed    $FWDIR/bin/spark-class org.apache.spark.repl.Main "$@"

Any ideas?

Re: Deploying a python code on a spark EC2 cluster

2014-04-24 Thread Matei Zaharia
Did you launch this using our EC2 scripts (http://spark.apache.org/docs/latest/ec2-scripts.html) or did you manually set up the daemons? My guess is that their hostnames are not being resolved properly on all nodes, so executor processes can’t connect back to your driver app. This error message

Re: SparkPi performance-3 cluster standalone mode

2014-04-24 Thread Matei Zaharia
The problem is that SparkPi uses Math.random(), which is a synchronized method, so it can’t scale to multiple cores. In fact it will be slower on multiple cores due to lock contention. Try another example and you’ll see better scaling. I think we’ll have to update SparkPi to create a new Random
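
(A sketch of the kind of fix Matei describes, not the actual SparkPi patch: give each task its own Random instead of sharing the synchronized Math.random().)

import scala.util.Random
import org.apache.spark.SparkContext

def estimatePi(sc: SparkContext, n: Int, slices: Int): Double = {
  val count = sc.parallelize(1 to n, slices).mapPartitions { iter =>
    val rand = new Random()                 // one generator per partition, no lock contention
    iter.map { _ =>
      val x = rand.nextDouble() * 2 - 1
      val y = rand.nextDouble() * 2 - 1
      if (x * x + y * y < 1) 1 else 0
    }
  }.reduce(_ + _)
  4.0 * count / n
}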

Re: Spark mllib throwing error

2014-04-24 Thread Xiangrui Meng
Could you share the command you used and more of the error message? Also, is it an MLlib specific problem? -Xiangrui On Thu, Apr 24, 2014 at 11:49 AM, John King wrote: > ./spark-shell: line 153: 17654 Killed > $FWDIR/bin/spark-class org.apache.spark.repl.Main "$@" > > > Any ideas?

Re: Trying to use pyspark mllib NaiveBayes

2014-04-24 Thread Xiangrui Meng
Is your Spark cluster running? Try to start with generating simple RDDs and counting. -Xiangrui On Thu, Apr 24, 2014 at 11:38 AM, John King wrote: > I receive this error: > > Traceback (most recent call last): > > File "", line 1, in > > File > "/home/ubuntu/spark-1.0.0-rc2/python/pyspark/ml

Re: Spark mllib throwing error

2014-04-24 Thread John King
Last command was:

val model = new NaiveBayes().run(points)

On Thu, Apr 24, 2014 at 4:27 PM, Xiangrui Meng wrote: > Could you share the command you used and more of the error message? > Also, is it an MLlib specific problem? -Xiangrui > > On Thu, Apr 24, 2014 at 11:49 AM, John King > wrote: >

Re: Trying to use pyspark mllib NaiveBayes

2014-04-24 Thread John King
Yes, I got it running for large RDD (~7 million lines) and mapping. Just received this error when trying to classify. On Thu, Apr 24, 2014 at 4:32 PM, Xiangrui Meng wrote: > Is your Spark cluster running? Try to start with generating simple > RDDs and counting. -Xiangrui > > On Thu, Apr 24, 201

Re: Deploying a python code on a spark EC2 cluster

2014-04-24 Thread John King
This happens to me when using the EC2 scripts for v1.0.0rc2 recent release. The Master connects and then disconnects immediately, eventually saying Master disconnected from cluster. On Thu, Apr 24, 2014 at 4:01 PM, Matei Zaharia wrote: > Did you launch this using our EC2 scripts ( > http://spark

Re: How do I access the SPARK SQL

2014-04-24 Thread diplomatic Guru
It worked!! Many thanks for your brilliant support. On 24 April 2014 18:20, diplomatic Guru wrote: > Many thanks for your prompt reply. I'll try your suggestions and will get > back to you. > > > > > On 24 April 2014 18:17, Michael Armbrust wrote: > >> Oh, and you'll also need to add a depend

Re: error in mllib lr example code

2014-04-24 Thread Mohit Jaggi
Thanks Xiangrui, Matei and Arpit. It does work fine after adding Vector.dense. I have a follow up question, I will post on a new thread. On Thu, Apr 24, 2014 at 2:49 AM, Arpit Tak wrote: > Also try out these examples, all of them works > > http://docs.sigmoidanalytics.com/index.php/MLlib > >

spark mllib to jblas calls..and comparison with VW

2014-04-24 Thread Mohit Jaggi
Folks, I am wondering how mllib interacts with jblas and lapack. Does it make copies of data from my RDD format to jblas's format? Does jblas copy it again before passing to lapack native code? I also saw some comparisons with VW and it seems mllib is slower on a single node but scales better and

Re: spark mllib to jblas calls..and comparison with VW

2014-04-24 Thread Xiangrui Meng
The data array in an RDD is passed by reference to jblas, so there is no data copying in this stage. However, if jblas uses the native interface, there is a copying overhead. I think jblas uses a java implementation for at least Level 1 BLAS, and calls the native interface for Level 2 & 3. -Xiangrui On Thu, Apr 24,

Re: Trying to use pyspark mllib NaiveBayes

2014-04-24 Thread Xiangrui Meng
I tried locally with the example described in the latest guide: http://54.82.157.211:4000/mllib-naive-bayes.html , and it worked fine. Do you mind sharing the code you used? -Xiangrui On Thu, Apr 24, 2014 at 1:57 PM, John King wrote: > Yes, I got it running for large RDD (~7 million lines) and ma

Re: Spark mllib throwing error

2014-04-24 Thread Xiangrui Meng
Do you mind sharing more code and error messages? The information you provided is too little to identify the problem. -Xiangrui On Thu, Apr 24, 2014 at 1:55 PM, John King wrote: > Last command was: > > val model = new NaiveBayes().run(points) > > > > On Thu, Apr 24, 2014 at 4:27 PM, Xiangrui Meng

Re: Trying to use pyspark mllib NaiveBayes

2014-04-24 Thread John King
I was able to run simple examples as well. Which version of Spark? Did you use the most recent commit or from branch-1.0? Some background: I tried to build both on Amazon EC2, but the master kept disconnecting from the client and executors failed after connecting. So I tried to just use one machi

Re: Spark mllib throwing error

2014-04-24 Thread John King
In the other thread I had an issue with Python. In this issue, I tried switching to Scala. The code is:

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.SparseVector
import org.apache.spark.mllib.classification.NaiveBayes
import scala.colle
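
(The snippet above is cut off; as a point of reference, a minimal Spark 1.0-era MLlib sketch of training NaiveBayes on LabeledPoints follows. The file name and parsing logic are made up for illustration.)

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.classification.NaiveBayes

val points = sc.textFile("data.txt").map { line =>
  val parts = line.split(',').map(_.toDouble)
  LabeledPoint(parts.head, Vectors.dense(parts.tail))  // features must be non-negative
}.cache()

val model = NaiveBayes.train(points, lambda = 1.0)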

Re: Trying to use pyspark mllib NaiveBayes

2014-04-24 Thread John King
Also when will the official 1.0 be released? On Thu, Apr 24, 2014 at 7:04 PM, John King wrote: > I was able to run simple examples as well. > > Which version of Spark? Did you use the most recent commit or from > branch-1.0? > > Some background: I tried to build both on Amazon EC2, but the maste

Re: Spark mllib throwing error

2014-04-24 Thread Xiangrui Meng
I don't see anything wrong with your code. Could you do points.count() to see how many training examples you have? Also, make sure you don't have negative feature values. The error message you sent did not say NaiveBayes went wrong, but the Spark shell was killed. -Xiangrui On Thu, Apr 24, 2014 at

Re: Spark mllib throwing error

2014-04-24 Thread John King
It just displayed this error and stopped on its own. Do the lines of code mentioned in the error have anything to do with it? On Thu, Apr 24, 2014 at 7:54 PM, Xiangrui Meng wrote: > I don't see anything wrong with your code. Could you do points.count() > to see how many training examples you ha

compile spark 0.9.1 in hadoop 2.2 above exception

2014-04-24 Thread martin . ou
an exception occurs when compiling spark 0.9.1 using sbt; env: hadoop 2.3

1. SPARK_HADOOP_VERSION=2.3.0 SPARK_YARN=true sbt/sbt assembly

2. found Exception:
found    : org.apache.spark.streaming.dstream.DStream[(K, V)]
[error] required: org.apache.spark.streaming.api.java.JavaPairDStream[K,V]
[error] N

Re: how to set spark.executor.memory and heap size

2014-04-24 Thread wxhsdp
does anyone know the reason? i've googled a bit, and found some guys had the same problem, but with no replies...

Finding bad data

2014-04-24 Thread Jim Blomo
I'm using PySpark to load some data and getting an error while parsing it. Is it possible to find the source file and line of the bad data? I imagine that this would be extremely tricky when dealing with multiple derived RDDs, so an answer with the caveat of "this only works when running .map() o

Re: how to set spark.executor.memory and heap size

2014-04-24 Thread wxhsdp
i noticed that error occurs at

org.apache.hadoop.io.WritableUtils.readCompressedStringArray(WritableUtils.java:183)
at org.apache.hadoop.conf.Configuration.readFields(Configuration.java:2378)
at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:28

Re: compile spark 0.9.1 in hadoop 2.2 above exception

2014-04-24 Thread Patrick Wendell
Try running sbt/sbt clean and re-compiling. Any luck? On Thu, Apr 24, 2014 at 5:33 PM, martin.ou wrote: > > > occure exception when compile spark 0.9.1 using sbt,env: hadoop 2.3 > > 1. SPARK_HADOOP_VERSION=2.3.0 SPARK_YARN=true sbt/sbt assembly > > > > 2.found Exception: > > found : org.apache

Re: Finding bad data

2014-04-24 Thread Matei Zaharia
Hey Jim, this is unfortunately harder than I’d like right now, but here’s how to do it. Look at the stderr file of the executor on that machine, and you’ll see lines like this:

14/04/24 19:17:24 INFO HadoopRDD: Input split: file:/Users/matei/workspace/apache-spark/README.md:0+2000

This says wh
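
(Complementary to the log-based approach Matei describes, not a substitute for it: a small sketch that wraps the parse so bad records are logged with their contents instead of failing the job. The split/parse logic is hypothetical.)

import scala.util.{Failure, Success, Try}

val parsed = sc.textFile("input").flatMap { line =>
  Try(line.split("\t")(1).toDouble) match {          // hypothetical parse
    case Success(v) => Some(v)
    case Failure(e) =>
      System.err.println(s"bad record: $line ($e)")  // shows up in executor stderr
      None
  }
}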

parallelize for a large Seq is extremely slow.

2014-04-24 Thread Earthson Lu
spark.parallelize(word_mapping.value.toSeq).saveAsTextFile("hdfs://ns1/nlp/word_mapping")

this line is too slow. There are about 2 million elements in word_mapping. Is there a good style for writing a large collection to hdfs?

> import org.apache.spark._
> import SparkContext._
> import scala.io

Re: parallelize for a large Seq is extremely slow.

2014-04-24 Thread Matei Zaharia
Try setting the serializer to org.apache.spark.serializer.KryoSerializer (see http://spark.apache.org/docs/0.9.1/tuning.html), it should be considerably faster. Matei On Apr 24, 2014, at 8:01 PM, Earthson Lu wrote: > spark.parallelize(word_mapping.value.toSeq).saveAsTextFile("hdfs://ns1/nlp/w

Re: Ooyala Server - plans to merge it into Apache ?

2014-04-24 Thread All In A Days Work
Thanks Andrew. Comments from Ooyala/Spark folks on reasons behind this? Cheers, On Sun, Apr 20, 2014 at 9:14 AM, Andrew Ash wrote: > The homepage for Ooyala's job server is here: > https://github.com/ooyala/spark-jobserver > > They decided (I think with input from the Spark team) that it made

what is the best way to do cartesian

2014-04-24 Thread Qin Wei
Hi All, I have a problem with the Item-Based Collaborative Filtering Recommendation Algorithms in spark. The basic flow is as below:

         (Item1, (User1, Score1))
RDD1 ==> (Item2, (User2, Score2))

Problem with the Item-Based Collaborative Filtering Recommendation Algorithms in spark

2014-04-24 Thread Qin Wei
Hi All, I have a problem with the Item-Based Collaborative Filtering Recommendation Algorithms in spark. The basic flow is as below:

         (Item1, (User1, Score1))
RDD1 ==> (Item2, (User2, Score2))

Re: parallelize for a large Seq is extreamly slow.

2014-04-24 Thread Earthson
Kryo With Exception below:

com.esotericsoftware.kryo.KryoException (com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 0, required: 1)
com.esotericsoftware.kryo.io.Output.require(Output.java:138)
com.esotericsoftware.kryo.io.Output.writeAscii_slow(Output.java:446)
com.esotericsof
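
(A sketch of one way past a Kryo buffer overflow, assuming Spark 0.9.x-era configuration names: enable Kryo and raise the serializer buffer from its small default. The 64 MB figure is an arbitrary example.)

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("WordMapping")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryoserializer.buffer.mb", "64")   // default is only a few MB
val sc = new SparkContext(conf)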

Re: how to set spark.executor.memory and heap size

2014-04-24 Thread Sean Owen
On Fri, Apr 25, 2014 at 2:20 AM, wxhsdp wrote: > 14/04/25 08:38:36 WARN util.NativeCodeLoader: Unable to load native-hadoop > library for your platform... using builtin-java classes where applicable > 14/04/25 08:38:36 WARN snappy.LoadSnappy: Snappy native library not loaded Since this comes up r

Re: Spark running slow for small hadoop files of 10 mb size

2014-04-24 Thread neeravsalaria
Thanks for the reply. It indeed increased the usage. There was another issue we found: we were broadcasting the hadoop configuration by writing a wrapper class over it, but found the proper way in the Spark code:

sc.broadcast(new SerializableWritable(conf))

Re: Spark mllib throwing error

2014-04-24 Thread Xiangrui Meng
I only see one risk: if your feature indices are not sorted, it might have undefined behavior. Other than that, I don't see anything suspicious. -Xiangrui On Thu, Apr 24, 2014 at 4:56 PM, John King wrote: > It just displayed this error and stopped on its own. Do the lines of code > mentioned in

Re: how to set spark.executor.memory and heap size

2014-04-24 Thread YouPeng Yang
Hi, I am also curious about this question. Is the textFile function supposed to read an hdfs file? In this case, the file was read from the local filesystem. Is there any way for the textFile function to recognize whether a path is on the local filesystem or on hdfs? Besides, the OOM exe

Re: Pig on Spark

2014-04-24 Thread suman bharadwaj
Hey Mayur, We use HiveColumnarLoader and XMLLoader. Are these working as well ? Will try few things regarding porting Java MR. Regards, Suman Bharadwaj S On Thu, Apr 24, 2014 at 3:09 AM, Mayur Rustagi wrote: > Right now UDF is not working. Its in the top list though. You should be > able to s

Re: Problem with the Item-Based Collaborative Filtering Recommendation Algorithms in spark

2014-04-24 Thread Sean Owen
So you are computing all-pairs similarity over 20M users? That is going to take about 200 trillion similarity computations, no? I don't think there's any way to make that fundamentally fast. I see you're copying the data set to all workers, which helps make it faster at the expense of memory consumpt

Re: Problem with the Item-Based Collaborative Filtering Recommendation Algorithms in spark

2014-04-24 Thread Sebastian Schelter
Qin, I'm not sure that I understand your source code correctly, but the common problem with item-based collaborative filtering at scale is that the comparison of all pairs of item vectors needs quadratic effort and therefore does not scale. A common approach to this problem is to selectively d