Re: Spark Streaming compilation error: algebird not a member of package com.twitter

2014-09-20 Thread Tathagata Das
There is no artifact called spark-streaming-algebird. To use algebird, you will have to add the following dependency (in maven format): com.twitter / algebird-core_${scala.binary.version} / 0.1.11. This is what is used in spark/examples/pom.xml TD On Sat, Sep 20, 2014 at 6:2
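A minimal sbt equivalent of those Maven coordinates, as a sketch; the 0.1.11 version is the one quoted in TD's message and %% is assumed to pick the right Scala binary version:

    libraryDependencies += "com.twitter" %% "algebird-core" % "0.1.11"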

Could you please add us to 'Powered by Spark' List

2014-09-20 Thread US Office Admin
Organization Name: Vectorum Inc. URL: http://www.vectorum.com List of Spark Components: Tachyon, Spark 1.1, Spark SQL, MLlib (in the works) with Hadoop and Play Framework. Working on digital fingerprint search. Use Case: Using machine data to predict machine failures. W

Re: Reproducing the function of a Hadoop Reducer

2014-09-20 Thread Victor Tso-Guillen
1. Actually, I disagree that combineByKey requires that all values be held in memory for a key. Only the use case groupByKey does that, whereas reduceByKey, foldByKey, and the generic combineByKey do not necessarily make that requirement. If your combine logic really shrinks the result
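A minimal Scala sketch of the point being made, assuming a SparkContext named sc: with combineByKey only the running combined value is kept per key per partition, not the full list of values (here the combine logic sums values per key):

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
    val sums = pairs.combineByKey(
      (v: Int) => v,                          // createCombiner
      (acc: Int, v: Int) => acc + v,          // mergeValue
      (acc1: Int, acc2: Int) => acc1 + acc2)  // mergeCombiners
    sums.collect()  // e.g. Array((a,4), (b,2)); order may vary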

Re: spark-submit command-line with --files

2014-09-20 Thread chinchu
Thanks Marcelo. The code trying to read the file always runs in the driver. I understand the problem with other master deployments, but will it work in local, yarn-client & yarn-cluster deployments? That's all I care about for now :-) Also, what is the suggested way to do something like this? Put the fil

Re: Example of Geoprocessing with Spark

2014-09-20 Thread Abel Coronado Iruegas
Thanks, Evan and Andy: Here is a very functional version. I need to improve the syntax, but this works very well: the initial version takes around 36 hours on 9 machines with 8 cores each, and this version takes 36 minutes on a cluster of 7 machines with 8 cores each: object SimpleApp { def main(

Re: spark-submit command-line with --files

2014-09-20 Thread Marcelo Vanzin
Hi chinchu, Where does the code trying to read the file run? Is it running on the driver or on some executor? If it's running on the driver, in yarn-cluster mode, the file should have been copied to the application's work directory before the driver is started. So hopefully just doing "new FileIn
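A minimal sketch of driver-side reading under the assumptions in this thread (the file name app.conf is hypothetical):

    import scala.io.Source

    // Shipped with: spark-submit --master yarn-cluster --files /local/path/app.conf ...
    // Per Marcelo's note, in yarn-cluster mode the file is localized into the driver
    // container's working directory, so a relative path should resolve. Inside tasks,
    // org.apache.spark.SparkFiles.get("app.conf") resolves the executor-side copy.
    val settings = Source.fromFile("app.conf").getLines().toList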

Re: Distributed dictionary building

2014-09-20 Thread Debasish Das
Some more debugging revealed that, as Sean said, I have to keep the dictionaries persisted until I am done with the RDD manipulation. Thanks Sean for the pointer... would it be possible to point me to the JIRA as well? Are there plans to make it more transparent for users? Is it possible for t

Spark streaming twitter exception

2014-09-20 Thread Maisnam Ns
Hi, can somebody help me with adding library dependencies in my build.sbt so that the java.lang.NoClassDefFoundError issue can be resolved? My sbt (only the dependencies part) -> libraryDependencies ++= Seq( "org.apache.spark" %% "spark-core" % "1.0.1", "org.apache.spark" %% "spark-strea
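A hedged sketch of what the dependency block might look like; the spark-streaming-twitter artifact (which pulls in twitter4j transitively) is a common missing piece behind this kind of NoClassDefFoundError, and the 1.0.1 version is assumed to match the spark-core version quoted above:

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"              % "1.0.1",
      "org.apache.spark" %% "spark-streaming"         % "1.0.1",
      "org.apache.spark" %% "spark-streaming-twitter" % "1.0.1"
    )

If the error only appears at runtime rather than at compile time, the Twitter artifact also needs to reach the cluster classpath, for example via an assembly jar or spark-submit --jars.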

Re: Distributed dictionary building

2014-09-20 Thread Debasish Das
I changed zipWithIndex to zipWithUniqueId and that seems to be working... What's the difference between zipWithIndex and zipWithUniqueId? Is it that for zipWithIndex we don't need to run the count to compute the offset, which is needed for zipWithUniqueId, and so zipWithIndex is more efficient? It's not very clea

Re: Distributed dictionary building

2014-09-20 Thread Debasish Das
I did not persist/cache it, as I assumed zipWithIndex would preserve order... There is also zipWithUniqueId... I am trying that... If that also shows the same issue, we should make it clear in the docs... On Sat, Sep 20, 2014 at 1:44 PM, Sean Owen wrote: > From offline question - zipWithIndex is

Re: Distributed dictionary building

2014-09-20 Thread Sean Owen
From an offline question - zipWithIndex is being used to assign IDs. From a recent JIRA discussion I understand this is not deterministic within a partition, so the index can be different when the RDD is re-evaluated. If you need it fixed, persist the zipped RDD on disk or in memory. On Sep 20, 2014 8:
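A minimal sketch of Sean's suggestion, assuming a hypothetical RDD[String] named products; persisting the zipped RDD pins the assigned indices so they cannot change when the lineage is recomputed:

    import org.apache.spark.storage.StorageLevel

    val dict = products.distinct().zipWithIndex()      // RDD[(String, Long)]
    dict.persist(StorageLevel.MEMORY_AND_DISK)
    dict.filter { case (product, index) => product == "almonds" }.collect()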

understanding rdd pipe() and bin/spark-submit --master

2014-09-20 Thread Andy Davidson
Hi, I am new to Spark and started writing some simple test code to figure out how things work. I am very interested in Spark Streaming and Python. It appears that streaming is not supported in Python yet. The workaround I found by googling is to write your streaming code in either Scala or Jav
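Since the subject mentions rdd.pipe(), a minimal sketch of its behavior, assuming a SparkContext named sc: each partition's elements are written to the external command's stdin, one per line, and the command's stdout lines come back as a new RDD[String] (wc -l is just an illustrative command):

    val lines = sc.parallelize(Seq("a", "b", "c"), 2)
    val counted = lines.pipe("wc -l")   // the command runs once per partition
    counted.collect()                   // one line count per partition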

java.lang.NegativeArraySizeException in pyspark

2014-09-20 Thread Brad Miller
Hi All, I'm experiencing a java.lang.NegativeArraySizeException in a pyspark script I have. I've pasted the full traceback at the end of this email. I have isolated the line of code in my script which "causes" the exception to occur. Although the exception seems to occur deterministically, it is

Distributed dictionary building

2014-09-20 Thread Debasish Das
Hi, I am building a dictionary of RDD[(String, Long)] and after the dictionary is built and cached, I find key "almonds" at value 5187 using: rdd.filter{case(product, index) => product == "almonds"}.collect Output: Debug product almonds index 5187 Now I take the same dictionary and write it out

Setting serializer to KryoSerializer from command line for spark-shell

2014-09-20 Thread Soumya Simanta
Hi, I want to set the serializer for my spark-shell to Kryo, i.e. set spark.serializer to org.apache.spark.serializer.KryoSerializer. Can I do it without setting a new SparkConf? Thanks -Soumya
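A likely option, without touching SparkConf in code, is to pass the property at launch, e.g. spark-shell --conf spark.serializer=org.apache.spark.serializer.KryoSerializer (the --conf flag is available from Spark 1.1), or to put the same spark.serializer line into conf/spark-defaults.conf so it applies to every shell session.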

Re: Reproducing the function of a Hadoop Reducer

2014-09-20 Thread Steve Lewis
OK, so in Java - pardon the verbosity - I might say something like the code below, but I face the following issues: 1) I need to store all values in memory as I run combineByKey - if I could return an RDD which consumed values that would be great, but I don't know how to do that; 2) In my version of the

SparkSQL Thriftserver in Mesos

2014-09-20 Thread John Omernik
I am running the Thrift server in SparkSQL, and running it on the node I compiled Spark on. When I run it, tasks only work if they landed on that node; other executors, started on nodes I didn't compile Spark on (and thus don't have the compile directory), fail. Should Spark be distributed properly

org.eclipse.jetty.orbit#javax.transaction;working@localhost: not found

2014-09-20 Thread jinilover
I downloaded the spark-workshop in scala from https://github.com/deanwampler/spark-workshop. When I type sbt and then compile, I got the following errors [warn] :: [warn] :: UNRESOLVED DEPENDENCIES :: [warn] ::

Re: New API for TFIDF generation in Spark 1.1.0

2014-09-20 Thread jatinpreet
Thanks Xiangrui and RJ for the responses. RJ, I have created a JIRA for the same. It would be great if you could look into this. Following is the link to the improvement task: https://issues.apache.org/jira/browse/SPARK-3614 Let me know if I can be of any help and please keep me posted! Thanks, J

Re: Problem with giving memory to executors on YARN

2014-09-20 Thread Sandy Ryza
I'm actually surprised your memory is that high. Spark only allocates spark.storage.memoryFraction for storing RDDs. This defaults to 0.6, so 32 GB * 0.6 * 10 executors should be a total of 192 GB. -Sandy On Sat, Sep 20, 2014 at 8:21 AM, Soumya Simanta wrote: > There are 128 cores on each box. Yes t

secondary sort

2014-09-20 Thread Koert Kuipers
Now that Spark has a sort-based shuffle, can we expect a secondary sort soon? There are some use cases where getting a sorted iterator of values per key is helpful.
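As an illustration of the pattern being asked for, a hedged sketch using repartitionAndSortWithinPartitions, which appeared in Spark releases later than the one discussed here; the key/value types are hypothetical. The value is folded into the key, partitioning routes by the original key only, and the shuffle sorts within each partition:

    import org.apache.spark.{HashPartitioner, Partitioner}
    import org.apache.spark.rdd.RDD

    // Routes a composite (key, value) pair by the original key only, so all values
    // for a key land in one partition and arrive sorted by (key, value).
    class KeyPartitioner(partitions: Int) extends Partitioner {
      private val delegate = new HashPartitioner(partitions)
      def numPartitions: Int = partitions
      def getPartition(composite: Any): Int = composite match {
        case (k, _) => delegate.getPartition(k)
      }
    }

    def secondarySort(rdd: RDD[(String, Int)]): RDD[(String, Int)] =
      rdd.map { case (k, v) => ((k, v), ()) }
         .repartitionAndSortWithinPartitions(new KeyPartitioner(rdd.partitions.length))
         .keys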

Re: exception in spark 1.1.0

2014-09-20 Thread Ted Yu
Can you tell us how you installed native snappy ? See D.3.1.5 in: http://hbase.apache.org/book.html#snappy.compression.installation Cheers On Sat, Sep 20, 2014 at 2:12 AM, Chen Song wrote: > I have seen below exception from spark 1.1.0. Any insights on the snappy > exception? > > 14/09/18 16:4

Re: Problem with giving memory to executors on YARN

2014-09-20 Thread Soumya Simanta
There are 128 cores on each box. Yes, there are other applications running on the cluster. YARN is assigning two containers to my application. I'll investigate this a little more. PS: I'm new to YARN. On Fri, Sep 19, 2014 at 4:49 PM, Vipul Pandey wrote: > How many cores do you have in your boxes? >

Re: Fails to run simple Spark (Hello World) scala program

2014-09-20 Thread Moshe Beeri
Hi Sean, Thanks a lot for the answer. I loved your excellent book Mahout in Action; I hope you'll keep on writing more books in the field of Big Data. The issue was with a redundant Hadoop library, but now I am facing some other issue (s

Re: Avoid broacasting huge variables

2014-09-20 Thread Sean Owen
Joining in a side conversation - yes, this is the way to go. The data is immutable, so it can be shared across all executors in one JVM in a singleton. How to load it depends on where it is, but there is nothing special to Spark here. For instance, if the file is on HDFS then you use HDFS APIs in some cl
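A minimal sketch of the singleton-per-JVM approach Sean describes, combined with the mapPartitions loading Martin mentions further down; the shared path /shared/lookup.txt and the RDD[String] of keys are hypothetical:

    import org.apache.spark.rdd.RDD
    import scala.io.Source

    // Loaded lazily, at most once per executor JVM, and shared by all tasks there.
    object Lookup {
      lazy val table: Map[String, String] =
        Source.fromFile("/shared/lookup.txt").getLines()
          .map { line => val Array(k, v) = line.split("\t", 2); k -> v }
          .toMap
    }

    def enrich(keys: RDD[String]): RDD[(String, String)] =
      keys.mapPartitions { iter =>
        val table = Lookup.table           // no broadcast needed
        iter.map(k => (k, table.getOrElse(k, "unknown")))
      }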

Re: Avoid broacasting huge variables

2014-09-20 Thread octavian.ganea
Hi Martin, Thanks. That might be really useful. Can you give me a reference or an example so that I understand how to do it ? In my case, the nodes have access to the same shared folder, so I wouldn't have to copy the file multiple times. -- View this message in context: http://apache-spark-

Re: Avoid broacasting huge variables

2014-09-20 Thread Martin Goodson
We normally copy a file to the nodes and then explicitly load it in a function passed to mapPartitions. On 9/20/14, octavian.ganea wrote: > Anyone ? > > Is there any option to load data in each node before starting any > computation like it is the initialization of mappers in Hadoop ? > > > > --

exception in spark 1.1.0

2014-09-20 Thread Chen Song
I have seen below exception from spark 1.1.0. Any insights on the snappy exception? 14/09/18 16:45:11 ERROR executor.Executor: Exception in task 763.1 in stage 6.0 (TID 17035) java.io.IOException: PARSING_ERROR(2) at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:78)

Re: Avoid broacasting huge variables

2014-09-20 Thread octavian.ganea
Anyone? Is there any option to load data on each node before starting any computation, like the initialization of mappers in Hadoop? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Avoid-broacasting-huge-variables-tp14696p14726.html Sent from the Apa

spark job hung up

2014-09-20 Thread Chen Song
I am testing my Spark job on YARN (spark: 1.0.0-cdh5.1.0, yarn: cdh5.1.0). Once in a while the Spark job hangs (stuck in some stage without any progress on the driver and executors) after some failures. Below is a list of typical failures on the driver and executor. *on master/driver* 14/09/16 06:42:28 W

Re: Fails to run simple Spark (Hello World) scala program

2014-09-20 Thread Sean Owen
Spark does not require Hadoop 2 or YARN. This looks like a problem with the Hadoop installation, as it is not finding native libraries it needs to make some security-related system call. Check the installation. On Sep 20, 2014 9:13 AM, "Manu Suryavansh" wrote: > Hi Moshe, > > Spark needs a Hadoop

Re: Fails to run simple Spark (Hello World) scala program

2014-09-20 Thread Moshe Beeri
Hi Manu/All, Now I am facing another strange error (strange relative to this new, complex framework). I ran ./sbin/start-all.sh (my computer is named after John Nash) and got the connection "Connecting to master spark://nash:7077"; running on my local machine yields java.lang.ClassNotFoundException: com.example.sc

Re: Example of Geoprocessing with Spark

2014-09-20 Thread andy petrella
It's probably slow, as you say, because it's actually also doing the map phase that will do the RTree search and so on, and only then saving to HDFS on 60 partitions. If you want to see the time spent saving to HDFS, you could do a count, for instance, before saving. Also, saving from 60 partition
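A minimal sketch of Andy's suggestion, assuming a hypothetical results RDD and output path; caching and counting first forces the map/RTree phase, so the subsequent save measures only the HDFS write:

    results.cache()
    println("matched records: " + results.count())           // triggers the map/RTree phase
    results.saveAsTextFile("hdfs:///user/abel/geo-output")    // hypothetical output path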

Re: Fails to run simple Spark (Hello World) scala program

2014-09-20 Thread Moshe Beeri
Thanks Manu, I just saw that I had included the Hadoop client 2.x in my pom.xml; removing it solved the problem. Thanks for your help -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Fails-to-run-simple-Spark-Hello-World-scala-program-tp14718p14721.html Sent from the

Re: spark-submit command-line with --files

2014-09-20 Thread chinchu
Thanks Andrew. I understand the problem a little better now. There was a typo in my earlier mail & a bug in the code (causing the NPE in SparkFiles). I am using the --master yarn-cluster (not local). And in this mode, the com.test.batch.modeltrainer.ModelTrainerMain - my main-class will run on the

Re: Fails to run simple Spark (Hello World) scala program

2014-09-20 Thread Manu Suryavansh
Hi Moshe, Spark needs a Hadoop 2.x/YARN cluster. Otherwise you can run it without Hadoop in standalone mode. Manu On Sat, Sep 20, 2014 at 12:55 AM, Moshe Beeri wrote: > object Nizoz { > > def connect(): Unit = { > val conf = new SparkConf().setAppName("nizoz").setMaster("master")

Fails to run simple Spark (Hello World) scala program

2014-09-20 Thread Moshe Beeri
object Nizoz { def connect(): Unit = { val conf = new SparkConf().setAppName("nizoz").setMaster("master"); val spark = new SparkContext(conf) val lines = spark.textFile("file:///home/moshe/store/frameworks/spark-1.1.0-bin-hadoop1/README.md") val lineLengths = lines.map(s => s.len
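For context, a minimal self-contained sketch of the kind of program the message shows; the original preview is truncated, so everything past its cut-off point is assumed, and local[*] is used instead of the arbitrary "master" string so it runs without a cluster:

    import org.apache.spark.{SparkConf, SparkContext}

    object Nizoz {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("nizoz").setMaster("local[*]")
        val spark = new SparkContext(conf)
        val lines = spark.textFile("file:///home/moshe/store/frameworks/spark-1.1.0-bin-hadoop1/README.md")
        val lineLengths = lines.map(s => s.length)
        println("total length: " + lineLengths.reduce(_ + _))
        spark.stop()
      }
    }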