RE: Spark-submit and Windows / Linux mixed network

2014-11-12 Thread Ashic Mahtab
jar not found :( Seems if I create a directory symlink so that the share path is the same on the unix mount point as in windows, and submit from the drive where the mount point is, then it works. Granted, that's quite an ugly hack. Reverting to serving the jar off http (i.e. using a relative path)

Best way of transforming stack traces

2014-11-12 Thread Kevin Kilroy
Hi, I have log files with lines that begin with a time stamp. However those lines continue onto new lines representing Java stack traces. I want to be able to search for the line & then pull out the corresponding stack traces. I was thinking of using either take(n) or reduce to 'peek' ahead at th
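
A minimal Scala sketch of one way to do this (not the poster's final approach; run in spark-shell where sc is predefined, and the path and timestamp format are assumptions): read each log with wholeTextFiles so a record never spans partition boundaries, then split just before every line that starts with a timestamp.

    // Keep each file intact so a stack trace is never split across partitions
    // (suitable for log files that fit comfortably in memory).
    val records = sc.wholeTextFiles("hdfs:///logs/*.log").flatMap { case (_, content) =>
      // Split just before every line beginning with a "yyyy-MM-dd " timestamp.
      content.split("(?m)(?=^\\d{4}-\\d{2}-\\d{2} )").filter(_.trim.nonEmpty)
    }

    // Keep only records whose continuation lines look like a Java stack trace.
    val withTraces = records.filter(_.contains("\tat "))
    withTraces.take(5).foreach(println)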

RE: scala.MatchError

2014-11-12 Thread Naveen Kumar Pokala
Hi, Do you mean that with Java I shouldn't have the Issue class as a property (attribute) in the Instrument class? Ex : Class Issue { Int a; } Class Instrument { Issue issue; } How about Scala? Does it support such user-defined datatypes in classes? Case class Issue . case class Issue( a:Int = 0) c
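
For the Scala side of the question, a minimal sketch (spark-shell, Spark 1.1-era API; the field names are made up): nested case classes work both as plain Scala values and for Spark SQL schema inference.

    case class Issue(a: Int = 0)
    case class Instrument(id: Int, issue: Issue)

    val instruments = sc.parallelize(Seq(Instrument(1, Issue(10)), Instrument(2, Issue())))
    instruments.map(_.issue.a).collect()          // Array(10, 0)

    // The nested case class becomes a struct column when used with Spark SQL:
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.createSchemaRDD
    instruments.registerTempTable("instruments")
    sqlContext.sql("SELECT id FROM instruments").collect()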

Nested Complex Type Data Parsing and Transforming to table

2014-11-12 Thread luohui20001
Hi, I got a problem when reading a textfile which contains nested complex type data and got a type mismatch problem. Any hint will be appreciated. The problem takes place at "map(s => s.map" as "type mismatch; found : scala.collection.immutable.IndexedSeq[Array[com.redhadoop.bean.S

Pass RDD to functions

2014-11-12 Thread Deep Pradhan
Hi, Can we pass RDD to functions? Like, can we do the following? *def func (temp: RDD[String]):RDD[String] = {* *//body of the function* *}* Thank You

Re: Scala vs Python performance differences

2014-11-12 Thread Andrew Ash
Jeremy, Did you complete this benchmark in a way that's shareable with those interested here? Andrew On Tue, Apr 15, 2014 at 2:50 PM, Nicholas Chammas < nicholas.cham...@gmail.com> wrote: > I'd also be interested in seeing such a benchmark. > > > On Tue, Apr 15, 2014 at 9:25 AM, Ian Ferreira >

Re: Pass RDD to functions

2014-11-12 Thread Akhil Das
Yes you can create more and more pipelines with your RDDs Thanks Best Regards On Wed, Nov 12, 2014 at 3:24 PM, Deep Pradhan wrote: > Hi, > Can we pass RDD to functions? > Like, can we do the following? > > *def func (temp: RDD[String]):RDD[String] = {* > *//body of the function* > *}* > > > Tha
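
A minimal sketch of what both answers describe (spark-shell, sc predefined): an RDD is an ordinary driver-side value, so a function can take one and return another; calling the function only extends the lineage.

    import org.apache.spark.rdd.RDD

    def toUpper(temp: RDD[String]): RDD[String] = temp.map(_.toUpperCase)

    val words   = sc.parallelize(Seq("spark", "rdd"))
    val shouted = toUpper(words)   // nothing runs yet, just another stage in the lineage
    shouted.collect()              // Array("SPARK", "RDD")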

Re: Scala vs Python performance differences

2014-11-12 Thread Samarth Mailinglist
I was about to ask this question. On Wed, Nov 12, 2014 at 3:42 PM, Andrew Ash wrote: > Jeremy, > > Did you complete this benchmark in a way that's shareable with those > interested here? > > Andrew > > On Tue, Apr 15, 2014 at 2:50 PM, Nicholas Chammas < > nicholas.cham...@gmail.com> wrote: > >>

Spark and insertion into RDBMS/NoSQL

2014-11-12 Thread nitinkalra2000
Hi All, We are exploring insertion into RDBMS(SQL Server) through Spark by JDBC Driver. The excerpt from the code is as follows : We are doing insertion inside an action : Integer res = flatMappedRDD.reduce(new Function2(){ //@Override public Int
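
The preview cuts off before the actual question, but a commonly suggested pattern for JDBC writes is sketched below (Scala, spark-shell; the connection URL, table name and RDD contents are made up): do the inserts inside foreachPartition, so one connection serves a whole partition, rather than inside reduce.

    import java.sql.DriverManager

    val flatMappedRDD = sc.parallelize(Seq("a", "b", "c"))   // stand-in for the poster's RDD

    flatMappedRDD.foreachPartition { rows =>
      // Requires the SQL Server JDBC driver on the executor classpath.
      val conn = DriverManager.getConnection(
        "jdbc:sqlserver://dbhost:1433;databaseName=mydb", "user", "password")
      val stmt = conn.prepareStatement("INSERT INTO results (value) VALUES (?)")
      try {
        rows.foreach { r => stmt.setString(1, r); stmt.executeUpdate() }
      } finally {
        stmt.close()
        conn.close()
      }
    }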

Number of partitions in RDD for input DStreams

2014-11-12 Thread Juan Rodríguez Hortalá
Hi list, In an excellent blog post on Kafka and Spark Streaming integration ( http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/), Michael Noll poses an assumption about the number of partitions of the RDDs created by input DStreams. He says his hypothe

Spark SQL configurations

2014-11-12 Thread Naveen Kumar Pokala
[inline image: screenshot of Spark SQL configuration properties] Hi, How do I set the above properties on JavaSQLContext? I am not able to see a setConf method on the JavaSQLContext object. I have added the spark core jar and spark assembly jar to my build path. And I am using Spark 1.1.0 and Hadoop 2.4.0 --Naveen

Re: How to kill a Spark job running in cluster mode ?

2014-11-12 Thread Tao Xiao
Thanks for your replies. Actually we can kill a driver by the command "bin/spark-class org.apache.spark.deploy.Client kill " if you know the driver id. 2014-11-11 22:35 GMT+08:00 Ritesh Kumar Singh : > There is a property : >spark.ui.killEnabled > which needs to be set true for killing appl

Re: Nested Complex Type Data Parsing and Transforming to table

2014-11-12 Thread Shixiong Zhu
Could you give an example of your data? This line is wrong. p(1).trim.map(_.toString.split("\002")).map(s => s.map(_.toString.split("\003")).map(t => StructField1( For example, p(1) is a String, so in p(1).trim.map(x => x.toString.split("\002")), x is a Char. That should not be what you want. I

Re: Spark SQL configurations

2014-11-12 Thread Akhil Das
JavaSQLContext.sqlContext.setConf is available. Thanks Best Regards On Wed, Nov 12, 2014 at 5:14 PM, Naveen Kumar Pokala < npok...@spcapitaliq.com> wrote: > > > Hi, > > > > How to set the above properties on JavaSQLContext. I am not able to see > setConf method on JavaSQLContext Object. > > >
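
A small sketch of what that answer points at (the property name below is only an example): from Scala you call setConf on the SQLContext itself; from Java the wrapped context is reachable as javaSqlContext.sqlContext(), on which the same setConf is available.

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    sqlContext.setConf("spark.sql.shuffle.partitions", "100")   // example property only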

Re: Pass RDD to functions

2014-11-12 Thread qinwei
I think it's ok, feel free to treat an RDD like a common object. qinwei From: Deep Pradhan Date: 2014-11-12 18:24 To: user@spark.apache.org Subject: Pass RDD to functions Hi, Can we pass RDD to functions? Like, can we do the following? def func (temp: RDD[String]):RDD[String] = {//body of the functio

Java client connection

2014-11-12 Thread Eduardo Cusa
Hi guys, I'm starting to work with Spark from Java and when I run the following code: SparkConf conf = new SparkConf().setMaster("spark://10.0.2.20:7077").setAppName("SparkTest"); JavaSparkContext sc = new JavaSparkContext(conf); I received the following error and the Java process exits:

RE: Spark SQL configurations

2014-11-12 Thread Naveen Kumar Pokala
Thanks Akhil. -Naveen From: Akhil Das [mailto:ak...@sigmoidanalytics.com] Sent: Wednesday, November 12, 2014 6:38 PM To: Naveen Kumar Pokala Cc: user@spark.apache.org Subject: Re: Spark SQL configurations JavaSQLContext.sqlContext.setConf is available. Thanks Best Regards On Wed, Nov 12, 2014 a

why flatmap has shuffle

2014-11-12 Thread qinwei
Hi, everyone! I consider flatmap a narrow dependency, but why does it have a shuffle, as shown on the web UI? My code is as below: val transferRDD = sc.textFile("hdfs://host:port/path") val rdd = transferRDD.map(line => { val trunks = line.split("\t")

Re: ISpark class not found

2014-11-12 Thread Laird, Benjamin
Sounds like an ipython notebook issue, not an ISpark one. Might want to reinstall "pip install ipython[notebook]", which will grab the necessary notebook components like tornado. Try spinning up the ispark console instead of the notebook to see if the ISpark kernel is functioning. ipython console --profile

Snappy error with Spark SQL

2014-11-12 Thread Naveen Kumar Pokala
HI, I am facing the following problem when I am trying to save my RDD as parquet File. 14/11/12 07:43:59 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 1.0 (TID 48,): org.xerial.snappy.SnappyError: [FAILED_TO_LOAD_NATIVE_LIBRARY] null org.xerial.snappy.SnappyLoader.load(SnappyLo

Re: Status of MLLib exporting models to PMML

2014-11-12 Thread Villu Ruusmann
Hi DB, DB Tsai wrote > I also worry about that the author of JPMML changed the license of > jpmml-evaluator due to his interest of his commercial business, and he > might change the license of jpmml-model in the future. I am the principal author of the said Java PMML API projects and I want to a

Getting py4j.protocol.Py4JError: An error occurred while calling o39.predict. while doing batch prediction using decision trees

2014-11-12 Thread rprabhu
Hello, I'm trying to run a classification task using mllib decision trees. After successfully training the model, I was trying to test the model using some sample rows when I hit this exception. The code snippet that caused this error is : model = DecisionTree.trainClassifier(parsedData, numClasse

join 2 tables

2014-11-12 Thread Franco Barrientos
I have 2 tables in a hive context, and I want to select one field of each table where id’s of each table are equal. For example, val tmp2=sqlContext.sql("select a.ult_fecha,b.pri_fecha from fecha_ult_compra_u3m as a, fecha_pri_compra_u3m as b where a.id=b.id") but i get an error: F

Re: How to kill a Spark job running in cluster mode ?

2014-11-12 Thread Ritesh Kumar Singh
I remember there was some issue with the above command in previous versions of Spark. It's nice that it's working now :) On Wed, Nov 12, 2014 at 5:50 PM, Tao Xiao wrote: > Thanks for your replies. > > Actually we can kill a driver by the command "bin/spark-class > org.apache.spark.deploy.Client k

RE: Snappy error with Spark SQL

2014-11-12 Thread Kapil Malik
Hi, Try adding this in spark-env.sh export JAVA_LIBRARY_PATH=$JAVA_LIBRARY_PATH:/usr/lib/hadoop-0.20-mapreduce/lib/native/Linux-amd64-64 export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib/hadoop-0.20-mapreduce/lib/native/Linux-amd64-64 export SPARK_LIBRARY_PATH=$SPARK_LIBRARY_PATH:/usr/lib/hadoo

Too many failed collects when trying to cache a table in SparkSQL

2014-11-12 Thread Sadhan Sood
We are running spark on yarn with combined memory > 1TB and when trying to cache a table partition(which is < 100G), seeing a lot of failed collect stages in the UI and this never succeeds. Because of the failed collect, it seems like the mapPartitions keep getting resubmitted. We have more than en

Re: Question about textFileStream

2014-11-12 Thread Saiph Kappa
What if the window is of 5 seconds, and the file takes longer than 5 seconds to be completely scanned? It will still attempt to load the whole file? On Mon, Nov 10, 2014 at 6:24 PM, Soumitra Kumar wrote: > Entire file in a window. > > On Mon, Nov 10, 2014 at 9:20 AM, Saiph Kappa > wrote: > >> H

No module named pyspark - latest built

2014-11-12 Thread jamborta
Hi all, I am trying to run spark with the latest build (from branch-1.2), as far as I can see, all the paths are set and SparkContext starts up OK, however, I cannot run anything that goes to the nodes. I get the following error: Error from python worker: /usr/bin/python2.7: No module named pys

Re: SVMWithSGD default threshold

2014-11-12 Thread Caron
Sean, Thanks a lot for your reply! A few follow up questions: 1. numIterations should be 100, not 100*trainingSetSize, right? 2. My training set has 90k positive data points (with label 1) and 60k negative data points (with label 0). I set my numIterations to 100 as default. I still got the same

Re: SVMWithSGD default threshold

2014-11-12 Thread Sean Owen
OK, it's not class imbalance. Yes, 100 iterations. My other guess is that the stepSize of 1 is way too big for your data. I'd suggest you look at the weights / intercept of the resulting model to see if it makes any sense. You can call clearThreshold on the model, and then it will 'predict' the S
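
A minimal MLlib sketch of those two suggestions (spark-shell; the two-point training set is only a placeholder): train with explicit stepSize/regParam, inspect the weights and intercept, then clearThreshold() so predict() returns raw margins instead of 0/1 labels.

    import org.apache.spark.mllib.classification.SVMWithSGD
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    val training = sc.parallelize(Seq(
      LabeledPoint(1.0, Vectors.dense(1.0, 2.0)),
      LabeledPoint(0.0, Vectors.dense(-1.0, -2.0))))

    val model = SVMWithSGD.train(training, 100, 1.0, 0.01)   // numIterations, stepSize, regParam
    println(s"weights=${model.weights} intercept=${model.intercept}")

    model.clearThreshold()                  // predict() now returns the raw SVM margin
    model.predict(Vectors.dense(0.5, 0.5))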

Re: Getting py4j.protocol.Py4JError: An error occurred while calling o39.predict. while doing batch prediction using decision trees

2014-11-12 Thread Davies Liu
This is a bug, will be fixed by https://github.com/apache/spark/pull/3230 On Wed, Nov 12, 2014 at 7:20 AM, rprabhu wrote: > Hello, > I'm trying to run a classification task using mllib decision trees. After > successfully training the model, I was trying to test the model using some > sample rows

Re: pyspark get column family and qualifier names from hbase table

2014-11-12 Thread freedafeng
Hi, This is my code, import org.apache.hadoop.hbase.CellUtil /** * JF: convert a Result object into a string with column family and qualifier names. Sth like * 'columnfamily1:columnqualifier1:value1;columnfamily2:columnqualifier2:value2' etc. * k-v pairs are separated by ';'. different colum

Re: Question about textFileStream

2014-11-12 Thread Rishi Yadav
Yes, you can always specify a minimum number of partitions and that would force some parallelism (assuming you have enough cores). On Wed, Nov 12, 2014 at 9:36 AM, Saiph Kappa wrote: > What if the window is of 5 seconds, and the file takes longer than 5 > seconds to be completely scanned? It will

Re: Is there a way to clone a JavaRDD without persisting it

2014-11-12 Thread Daniel Siegmann
As far as I know you basically have two options: let partitions be recomputed (possibly caching / persisting memory only), or persist to disk (and memory) and suffer the cost of writing to disk. The question is which will be more expensive in your case. My experience is you're better off letting th

Re: join 2 tables

2014-11-12 Thread Rishi Yadav
please use join syntax. On Wed, Nov 12, 2014 at 8:57 AM, Franco Barrientos < franco.barrien...@exalitica.com> wrote: > I have 2 tables in a hive context, and I want to select one field of each > table where id’s of each table are equal. For example, > > > > *val tmp2=sqlContext.sql("select a.ult_
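
A sketch of the explicit join form Rishi is suggesting, using the table and column names from the question (and assuming the poster's existing Hive context, here called sqlContext):

    val tmp2 = sqlContext.sql(
      """SELECT a.ult_fecha, b.pri_fecha
        |FROM fecha_ult_compra_u3m a
        |JOIN fecha_pri_compra_u3m b ON a.id = b.id""".stripMargin)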

using RDD result in another RDD

2014-11-12 Thread Adrian Mocanu
Hi I'd like to use the result of one RDD1 in another RDD2. Normally I would use something like a barrier to make the 2nd RDD wait till the computation of the 1st RDD is done, then include the result from RDD1 in the closure for RDD2. Currently I create another RDD, RDD3, out of the result of RDD1

Re: pyspark get column family and qualifier names from hbase table

2014-11-12 Thread freedafeng
Hi Nick, I saw the HBase api has experienced lots of changes. If I remember correctly, the default hbase in spark 1.1.0 is 0.94.6. The one I am using is 0.98.1. To get the column family names and qualifier names, we need to call different methods for these two different versions. I don't know how

Re: "overloaded method value updateStateByKey ... cannot be applied to ..." when Key is a Tuple2

2014-11-12 Thread spr
After comparing with previous code, I got it work by making the return a Some instead of Tuple2. Perhaps some day I will understand this. spr wrote > --code > > val updateDnsCount = (values: Seq[(Int, Time)], state: Option[(Int, > Time)]) => { > val currentCount = if (va

Wildly varying "aggregate" performance depending on code location

2014-11-12 Thread Jim Carroll
Hello all, I have a really strange thing going on. I have a test data set with 500K lines in a gzipped csv file. I have an array of "column processors," one for each column in the dataset. A Processor tracks aggregate state and has a method "process(v : String)" I'm calling: val processors:

Reading from Hbase using python

2014-11-12 Thread Alan Prando
Hi all, I'm trying to read an hbase table using this an example from github ( https://github.com/apache/spark/blob/master/examples/src/main/python/hbase_inputformat.py), however I have two qualifiers in a column family. Ex.: ROW COLUMN+CELL row1 column=f1:1, timestamp=1401883411986, value=valu

RE: "overloaded method value updateStateByKey ... cannot be applied to ..." when Key is a Tuple2

2014-11-12 Thread Adrian Mocanu
My understanding is that the reason you have an Option is so you could filter out tuples when None is returned. This way your state data won't grow forever. -Original Message- From: spr [mailto:s...@yarcdata.com] Sent: November-12-14 2:25 PM To: u...@spark.incubator.apache.org Subject: R

Re: using RDD result in another RDD

2014-11-12 Thread Sean Owen
You can't use RDDs inside of RDDs, so this won't work anyway. You could collect the result of RDD1 and broadcast it, perhaps. collect() blocks. On Wed, Nov 12, 2014 at 6:41 PM, Adrian Mocanu wrote: > Hi > > I’d like to use the result of one RDD1 in another RDD2. Normally I would > use something
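
A minimal sketch of that suggestion (spark-shell; the contents are made up): collect() blocks until RDD1 is computed, and the broadcast value is then readable inside RDD2's closures.

    val rdd1      = sc.parallelize(Seq("a", "b", "c"))
    val rdd2Input = sc.parallelize(Seq("a", "x"))

    val fromRdd1 = rdd1.collect().toSet    // blocks until rdd1 has been computed
    val bc       = sc.broadcast(fromRdd1)  // shipped once to each executor

    val rdd2 = rdd2Input.map(x => (x, bc.value.contains(x)))
    rdd2.collect()                         // Array((a,true), (x,false))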

Re: Too many failed collects when trying to cache a table in SparkSQL

2014-11-12 Thread Sadhan Sood
This is the log output: 2014-11-12 19:07:16,561 INFO thriftserver.SparkExecuteStatementOperation (Logging.scala:logInfo(59)) - Running query 'CACHE TABLE xyz_cached AS SELECT * FROM xyz where date_prefix = 20141112' 2014-11-12 19:07:17,455 INFO Configuration.d

Building spark targz

2014-11-12 Thread Ashwin Shankar
Hi, I just cloned spark from the github and I'm trying to build to generate a tar ball. I'm doing : mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -DskipTests clean package Although the build is successful, I don't see the targz generated. Am I running the wrong command ? -- Thanks, Ashw

Re: Spark and Play

2014-11-12 Thread Donald Szeto
Hi Akshat, If your application is to serve results directly from a SparkContext, you may want to take a look at http://prediction.io. It integrates Spark with spray.io (another REST/web toolkit by Typesafe). Some heavy lifting is done here: https://github.com/PredictionIO/PredictionIO/blob/develop

Re: Building spark targz

2014-11-12 Thread Sadhan Sood
Just making sure but are you looking for the tar in assembly/target dir ? On Wed, Nov 12, 2014 at 3:14 PM, Ashwin Shankar wrote: > Hi, > I just cloned spark from the github and I'm trying to build to generate a > tar ball. > I'm doing : mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive > -Ds

Re: spark streaming: stderr does not roll

2014-11-12 Thread Nguyen, Duc
I've also tried setting the aforementioned properties using System.setProperty() as well as on the command line while submitting the job using --conf key=value. All to no success. When I go to the Spark UI and click on that particular streaming job and then the "Environment" tab, I can see the prop

Re: Building spark targz

2014-11-12 Thread Ashwin Shankar
Yes, I'm looking at assembly/target. I don't see the tar ball. I only see scala-2.10/spark-assembly-1.2.0-SNAPSHOT-hadoop2.4.0.jar ,classes,test-classes, maven-shared-archive-resources,spark-test-classpath.txt. On Wed, Nov 12, 2014 at 12:16 PM, Sadhan Sood wrote: > Just making sure but are you l

Re: Building spark targz

2014-11-12 Thread Sean Owen
mvn package doesn't make tarballs. It creates artifacts that will generally appear in target/ and subdirectories, and likewise within modules. Look at make-distribution.sh On Wed, Nov 12, 2014 at 8:14 PM, Ashwin Shankar wrote: > Hi, > I just cloned spark from the github and I'm trying to build t

Re: Reading from Hbase using python

2014-11-12 Thread Ted Yu
Can you give us a bit more detail: hbase release you're using. whether you can reproduce using hbase shell. I did the following using hbase shell against 0.98.4: hbase(main):001:0> create 'test', 'f1' 0 row(s) in 2.9140 seconds => Hbase::Table - test hbase(main):002:0> put 'test', 'row1', 'f1:1

Re: Building spark targz

2014-11-12 Thread Sadhan Sood
I think you can provide -Pbigtop-dist to build the tar. On Wed, Nov 12, 2014 at 3:21 PM, Sean Owen wrote: > mvn package doesn't make tarballs. It creates artifacts that will > generally appear in target/ and subdirectories, and likewise within > modules. Look at make-distribution.sh > > On Wed,

Re: "overloaded method value updateStateByKey ... cannot be applied to ..." when Key is a Tuple2

2014-11-12 Thread Yana Kadiyska
Adrian, do you know if this is documented somewhere? I was also under the impression that setting a key's value to None would cause the key to be discarded (without any explicit filtering on the user's part) but can not find any official documentation to that effect On Wed, Nov 12, 2014 at 2:43 PM

Re: Reading from Hbase using python

2014-11-12 Thread Ted Yu
To my knowledge, Spark 1.1 comes with HBase 0.94 To utilize HBase 0.98, you will need: https://issues.apache.org/jira/browse/SPARK-1297 You can apply the patch and build Spark yourself. Cheers On Wed, Nov 12, 2014 at 12:57 PM, Alan Prando wrote: > Hi Ted! Thanks for anwsering... > > Maybe I di

How can my java code executing on a slave find the task id?

2014-11-12 Thread Steve Lewis
I am trying to determine how effective partitioning is at parallelizing my tasks. So far I suspect that all work is done in one task. My plan is to create a number of accumulators - one for each task - and have functions increment the accumulator for the appropriate task (or slave); the values cou

Re: Imbalanced shuffle read

2014-11-12 Thread ankits
I tried that, but that did not resolve the problem. All the executors for partitions except one have no shuffle reads and finish within 20-30 ms. one executor has a complete shuffle read of the previous stage. Any other ideas on debugging this? -- View this message in context: http://apache-spa

Re: SVMWithSGD default threshold

2014-11-12 Thread Xiangrui Meng
regParam=1.0 may penalize too much, because we use the average loss instead of total loss. I just sent a PR to lower the default: https://github.com/apache/spark/pull/3232 You can try LogisticRegressionWithLBFGS (and configure parameters through its optimizer), which should converge faster than SG

Re: Reading from Hbase using python

2014-11-12 Thread Ted Yu
Looking at HBaseResultToStringConverter : override def convert(obj: Any): String = { val result = obj.asInstanceOf[Result] Bytes.toStringBinary(result.value()) } Here is the code for Result.value(): public byte [] value() { if (isEmpty()) { return null; } retur

ec2 script and SPARK_LOCAL_DIRS not created

2014-11-12 Thread Darin McBeath
I'm using spark 1.1 and the provided ec2 scripts to start my cluster (r3.8xlarge machines). From the spark-shell, I can verify that the environment variables are set: scala> System.getenv("SPARK_LOCAL_DIRS") res0: String = /mnt/spark,/mnt2/spark However, when I look on the workers, the directories

Re: Wildly varying "aggregate" performance depending on code location

2014-11-12 Thread Jim Carroll
Well it looks like this is a scala problem after all. I loaded the file using pure scala and ran the exact same Processors without Spark and I got 20 seconds (with the code in the same file as the 'main') vs 30 seconds (with the exact same code in a different file) on the 500K rows. -- View thi

Spark SQL Lazy Schema Evaluation

2014-11-12 Thread Corey Nolet
I'm loading sequence files containing json blobs in the value, transforming them into RDD[String] and then using hiveContext.jsonRDD(). It looks like Spark reads the files twice - once when I define the jsonRDD() and then again when I actually make my call to hiveContext.sql(). Looking @ the code

RE: "overloaded method value updateStateByKey ... cannot be applied to ..." when Key is a Tuple2

2014-11-12 Thread Adrian Mocanu
You are correct; the filtering I’m talking about is done implicitly. You don’t have to do it yourself. Spark will do it for you and remove those entries from the state collection. From: Yana Kadiyska [mailto:yana.kadiy...@gmail.com] Sent: November-12-14 3:50 PM To: Adrian Mocanu Cc: spr; u...@sp

Re: MLLIB usage: BLAS dependency warning

2014-11-12 Thread jpl
Hi Xiangrui, thank you very much for your response. I looked for the .so as you suggested. It is not here: $ jar tf assembly/target/spark-assembly_2.10-1.1.0-dist/spark-assembly-1.1.0-hadoop2.4.0.jar | grep netlib-native_system-linux-x86_64.so or here: $ jar tf assembly/target/spark-assembly

Re: Spark SQL Lazy Schema Evaluation

2014-11-12 Thread Michael Armbrust
There are a few things you can do here: - Infer the schema on a subset of the data, pass that inferred schema (schemaRDD.schema) as the second argument of jsonRDD. - Hand construct a schema and pass it as the second argument including the fields you are interested in. - Instead load the data as
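
A sketch of the first suggestion (spark-shell; the JSON rows and table name are made up): infer the schema once, then pass it as the second argument to jsonRDD so the data is not scanned a second time for inference. On real data you would infer on a sample, e.g. jsonStrings.sample(false, 0.01).

    val sqlContext  = new org.apache.spark.sql.SQLContext(sc)
    val jsonStrings = sc.parallelize(Seq("""{"name":"a","n":1}""", """{"name":"b","n":2}"""))

    val schema = sqlContext.jsonRDD(jsonStrings).schema    // infer once (sample on real data)
    val table  = sqlContext.jsonRDD(jsonStrings, schema)   // reuse the schema, no second pass
    table.registerTempTable("events")
    sqlContext.sql("SELECT name FROM events WHERE n > 1").collect()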

Re: No module named pyspark - latest built

2014-11-12 Thread jamborta
Forgot to mention that this setup works in spark standalone mode; the only problem is when I run on yarn. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/No-module-named-pyspark-latest-built-tp18740p18777.html Sent from the Apache Spark User List mailing list arch

Re: Too many failed collects when trying to cache a table in SparkSQL

2014-11-12 Thread Sadhan Sood
: > This is the log output: > > 2014-11-12 19:07:16,561 INFO thriftserver.SparkExecuteStatementOperation > (Logging.scala:logInfo(59)) - Running query 'CACHE TABLE xyz_cached AS > SELECT * FROM xyz where date_prefix = 20141112' > > 2014-11-12 19:07:17,455

How (in Java) do I create an Accumulator of type Long

2014-11-12 Thread Steve Lewis
JavaSparkContext currentContext = ...; Accumulator accumulator = currentContext.accumulator(0, "MyAccumulator"); will create an Accumulator of Integers. For many large Data problems Integer is too small and Long is a better type. I see a call like the following AccumulatorPar

Cache sparkSql data without uncompressing it in memory

2014-11-12 Thread Sadhan Sood
We noticed while caching data from our hive tables which contain data in compressed sequence file format that it gets uncompressed in memory when getting cached. Is there a way to turn this off and cache the compressed data as is ?

Re: MLLIB usage: BLAS dependency warning

2014-11-12 Thread Xiangrui Meng
That means the "-Pnetlib-lgpl" option didn't work. Could you use sbt to build the assembly jar and see whether the ".so" file is inside the assembly jar? Which system and Java version are you using? -Xiangrui On Wed, Nov 12, 2014 at 2:22 PM, jpl wrote: > Hi Xiangrui, thank you very much for your

spark.parallelize seems broken on type

2014-11-12 Thread mod0
Interesting result here. I'm trying to parallelize a list for some simple tests with spark and Ganglia. It seems that spark.parallelize doesn't create partitions except for on the master node on our cluster. The image below shows the CPU utilization per node over three tests. The first two compute

Map output statuses exceeds frameSize

2014-11-12 Thread pouryas
Hey all I am doing a groupby on nearly 2TB of data and I am getting this error: 2014-11-13 00:25:30 ERROR org.apache.spark.MapOutputTrackerMasterActor - Map output statuses were 32163619 bytes which exceeds spark.akka.frameSize (10485760 bytes). org.apache.spark.SparkException: Map output statuse

Spark streaming cannot receive any message from Kafka

2014-11-12 Thread Bill Jay
Hi all, I have a Spark streaming job which constantly receives messages from Kafka. I was using Spark 1.0.2 and the job has been running for a month. However, when I am currently using Spark 1.1.0. the Spark streaming job cannot receive any messages from Kafka. I have not made any change to the co

Re: Spark streaming cannot receive any message from Kafka

2014-11-12 Thread Tobias Pfeiffer
Bill, However, when I am currently using Spark 1.1.0. the Spark streaming job > cannot receive any messages from Kafka. I have not made any change to the > code. > Do you see any suspicious messages in the log output? Tobias

Re: No module named pyspark - latest built

2014-11-12 Thread jamborta
I have figured out that building the fat jar with sbt does not seem to included the pyspark scripts using the following command: sbt/sbt -Pdeb -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -Phive clean publish-local assembly however the maven command works OK: mvn -Pdeb -Pyarn -Phadoop-2.3 -Dhadoop

Re: Getting py4j.protocol.Py4JError: An error occurred while calling o39.predict. while doing batch prediction using decision trees

2014-11-12 Thread rprabhu
Hey Thanks for responding so fast. I ran the code with the fix and it works great. Regards, Rahul -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Getting-py4j-protocol-Py4JError-An-error-occurred-while-calling-o39-predict-while-doing-batch-predics-tp18730p

Re: No module named pyspark - latest built

2014-11-12 Thread Tamas Jambor
Thanks. Will it work with sbt at some point? On Thu, 13 Nov 2014 01:03 Xiangrui Meng wrote: > You need to use maven to include python files. See > https://github.com/apache/spark/pull/1223 . -Xiangrui > > On Wed, Nov 12, 2014 at 4:48 PM, jamborta wrote: > > I have figured out that building the

Re: Spark streaming cannot receive any message from Kafka

2014-11-12 Thread Bill Jay
Hi all, Thanks for the information. I am running Spark streaming in a yarn cluster and the configuration should be correct. I followed the KafkaWordCount to write the current code three months ago. It has been working for several months. The messages are in json format. Actually, this code worked

Re: No module named pyspark - latest built

2014-11-12 Thread Xiangrui Meng
You need to use maven to include python files. See https://github.com/apache/spark/pull/1223 . -Xiangrui On Wed, Nov 12, 2014 at 4:48 PM, jamborta wrote: > I have figured out that building the fat jar with sbt does not seem to > included the pyspark scripts using the following command: > > sbt/sb

Re: MEMORY_ONLY_SER question

2014-11-12 Thread Mohit Jaggi
thanks jerry and tathagata. does anyone know how kryo compresses data? are there any other serializers that work with spark and have good compression for basic data types? On Tue, Nov 4, 2014 at 10:29 PM, Shao, Saisai wrote: > From my understanding, the Spark code use Kryo as a streaming manner

Cannot submit Spark app to cluster, stuck on “UNDEFINED”

2014-11-12 Thread brother rain
I use this command to submit a *spark application* to a *yarn cluster*: export YARN_CONF_DIR=conf bin/spark-submit --class "Mining" --master yarn-cluster --executor-memory 512m ./target/scala-2.10/mining-assembly-0.1.jar *In the Web UI, it is stuck on* UNDEFINED *I

Re: Imbalanced shuffle read

2014-11-12 Thread ankits
Adding a call to rdd.repartition() after randomizing the keys has no effect either. code - //partitioning is done like partitionIdx = f(key) % numPartitions //we use random keys to get even partitioning val uniform = other_stream.transform(rdd => { rdd.map({ kv => val

Re: No module named pyspark - latest built

2014-11-12 Thread Andrew Or
Hey Jamborta, What java version did you build the jar with? 2014-11-12 16:48 GMT-08:00 jamborta : > I have figured out that building the fat jar with sbt does not seem to > included the pyspark scripts using the following command: > > sbt/sbt -Pdeb -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -Phi

flatMap followed by mapPartitions

2014-11-12 Thread Debasish Das
Hi, I am doing a flatMap followed by mapPartitions to do some blocked operation...flatMap is shuffling data but this shuffle is strictly shuffling to disk and not over the network right ? Thanks. Deb

Re: Imbalanced shuffle read

2014-11-12 Thread ankits
I have made some progress - the partitioning is very uneven, and everything goes to one partition. I see that spark partitions by key, so I tried this: //partitioning is done like partitionIdx = f(key) % numPartitions //we use random keys to get even partitioning val uniform = other_st

Using data in RDD to specify HDFS directory to write to

2014-11-12 Thread jschindler
I am having trouble figuring out how to solve a problem. I would like to stream events from Kafka to my Spark Streaming app and write the contents of each RDD out to a HDFS directory. Each event that comes into the app via kafka will be JSON and have an event field with the name of the
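
One hedged way to approach this (not from a reply in the thread): inside foreachRDD, collect the distinct event names in the batch, then filter and save each group under its own directory. The socket source, comma parsing and output path below are placeholders; the poster would parse the JSON event field instead (spark-shell, sc predefined).

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc    = new StreamingContext(sc, Seconds(10))
    val stream = ssc.socketTextStream("localhost", 9999)   // stand-in for the Kafka stream

    stream.foreachRDD { (rdd, time) =>
      // assume the event name is the first comma-separated field
      val byEvent = rdd.map(line => (line.split(",")(0), line)).cache()
      val eventNames = byEvent.map(_._1).distinct().collect()
      eventNames.foreach { name =>
        byEvent.filter(_._1 == name)
               .map(_._2)
               .saveAsTextFile(s"hdfs:///events/$name/batch-${time.milliseconds}")
      }
      byEvent.unpersist()
    }
    // ssc.start(); ssc.awaitTermination()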

Re: "overloaded method value updateStateByKey ... cannot be applied to ..." when Key is a Tuple2

2014-11-12 Thread Steve Reinhardt
I'm missing something simpler (I think). That is, why do I need a Some instead of Tuple2? Because a Some might or might not be there, but a Tuple2 must be there? Or something like that? From: Adrian Mocanu mailto:amoc...@verticalscope.com>> You are correct; the filtering I’m talking about i
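
A minimal sketch of the signature in question, using the types from earlier in the thread (the counting logic is only illustrative): returning Some keeps that key's state, returning None lets Spark drop the key, which is why the return type is an Option rather than a bare Tuple2.

    import org.apache.spark.streaming.Time

    val updateDnsCount = (values: Seq[(Int, Time)], state: Option[(Int, Time)]) => {
      val newCount = values.map(_._1).sum + state.map(_._1).getOrElse(0)
      val lastSeen = (values.map(_._2) ++ state.map(_._2))
        .reduceOption((a, b) => if (a > b) a else b)
      lastSeen match {
        case Some(t) => Some((newCount, t))   // keep/update this key's state
        case None    => None                  // nothing known: Spark discards the key
      }
    }
    // pairs.updateStateByKey(updateDnsCount)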

RE: Spark streaming cannot receive any message from Kafka

2014-11-12 Thread Shao, Saisai
Did you configure the Spark master as local? It should be local[n], n > 1, for local mode. Besides, there's a Kafka wordcount example in the Spark Streaming examples; you can try that. I've tested with the latest master, it's OK. Thanks Jerry From: Tobias Pfeiffer [mailto:t...@preferred.jp] Sent: Thursday, Nov

Re: Unit testing jar request

2014-11-12 Thread nightwolf
+1 I agree we need this too. Looks like there is already an issue for it here; https://spark-project.atlassian.net/browse/SPARK-750 -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Unit-testing-jar-request-tp16475p18801.html Sent from the Apache Spark User L

Re: Cache sparkSql data without uncompressing it in memory

2014-11-12 Thread Cheng Lian
Currently there’s no way to cache the compressed sequence file directly. Spark SQL uses in-memory columnar format while caching table rows, so we must read all the raw data and convert them into columnar format. However, you can enable in-memory columnar compression by setting |spark.sql.inMemo
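
A small sketch of the setting Cheng mentions (spark-shell; the sample table is only a placeholder so the example stands alone):

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    sqlContext.setConf("spark.sql.inMemoryColumnarStorage.compressed", "true")

    case class Rec(k: Int, v: String)
    import sqlContext.createSchemaRDD
    sc.parallelize(Seq(Rec(1, "a"), Rec(2, "b"))).registerTempTable("t")
    sqlContext.cacheTable("t")   // cached in (compressed) in-memory columnar form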

Assigning input files to spark partitions

2014-11-12 Thread Pala M Muthaia
Hi, I have a set of input files for a spark program, with each file corresponding to a logical data partition. What is the API/mechanism to assign each input file (or a set of files) to a spark partition, when initializing RDDs? When i create a spark RDD pointing to the directory of files, my und
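
Two hedged options that are commonly used for this (spark-shell; the paths are made up): build one RDD per file and union them so file boundaries line up with partitions, or use wholeTextFiles so each file arrives as a single (path, contents) record.

    // One RDD per file, then union (each small file typically ends up as one partition):
    val files   = Seq("hdfs:///data/part-0001", "hdfs:///data/part-0002")
    val perFile = files.map(path => sc.textFile(path))
    val all     = sc.union(perFile)

    // Or keep the file-to-record mapping explicit:
    val byFile = sc.wholeTextFiles("hdfs:///data/")   // RDD[(path, contents)]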

Re: MLLIB usage: BLAS dependency warning

2014-11-12 Thread jpl
Thanks! I used sbt (command below) and the .so file is now there (shown below). Now that I have this new assembly.jar, how do I run the spark-shell so that it can see the .so file when I call the kmeans function? Thanks again for your help with this. sbt/sbt -Dhadoop.version=2.4.0 -Pyarn -Phive

Re: Pyspark Error when broadcast numpy array

2014-11-12 Thread bliuab
Dear Liu: I have tested this issue under Spark-1.1.0. The problem is solved under this newer version. On Wed, Nov 12, 2014 at 3:18 PM, Bo Liu wrote: > Dear Liu: > > Thank you for your reply. I will set up an experimental environment for > spark-1.1 and test it. > > On Wed, Nov 12, 2014 at 2:3

Re: How (in Java) do I create an Accumulator of type Long

2014-11-12 Thread Sean Owen
It's the exact same API you've already found, and it's documented: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.AccumulatorParam JavaSparkContext has helper methods for int and double but not long. You can just make your own little implementation of AccumulatorParam ri

Re: data locality, task distribution

2014-11-12 Thread Aaron Davidson
The fact that the caching percentage went down is highly suspicious. It should generally not decrease unless other cached data took its place, or if unless executors were dying. Do you know if either of these were the case? On Tue, Nov 11, 2014 at 8:58 AM, Nathan Kronenfeld < nkronenf...@oculusinf

Re: flatMap followed by mapPartitions

2014-11-12 Thread Mayur Rustagi
flatmap would have to shuffle data only if output RDD is expected to be partitioned by some key. RDD[X].flatmap(X=>RDD[Y]) If it has to shuffle it should be local. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi On Thu, Nov 1

Re: How (in Java) do I create an Accumulator of type Long

2014-11-12 Thread Steve Lewis
I see Javadoc Style documentation but nothing that looks like a code sample I tried the following before asking public static class LongAccumulableParam implements AccumulableParam,Serializable { @Override public Long addAccumulator(final Long r, final Long t) { re

Re: How (in Java) do I create an Accumulator of type Long

2014-11-12 Thread Sean Owen
Look again, the type is AccumulatorParam, not AccumulableParam. But yes that's what you do. On Thu, Nov 13, 2014 at 4:32 AM, Steve Lewis wrote: > I see Javadoc Style documentation but nothing that looks like a code sample > I tried the following before asking > > public static class LongAccum
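
A minimal sketch of that in Scala (run in spark-shell); the Java version implements the same AccumulatorParam interface, only the syntax differs.

    import org.apache.spark.AccumulatorParam

    object LongParam extends AccumulatorParam[Long] {
      def zero(initialValue: Long): Long = 0L
      def addInPlace(r1: Long, r2: Long): Long = r1 + r2
    }

    val acc = sc.accumulator(0L)(LongParam)
    sc.parallelize(1L to 100L).foreach(x => acc += x)
    acc.value   // 5050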

Re: data locality, task distribution

2014-11-12 Thread Nathan Kronenfeld
Sorry, I think I was not clear in what I meant. I didn't mean it went down within a run, with the same instance. I meant I'd run the whole app, and one time, it would cache 100%, and the next run, it might cache only 83% Within a run, it doesn't change. On Wed, Nov 12, 2014 at 11:31 PM, Aaron Da

Query from two or more tables Spark Sql .I have done this . Is there any simpler solution.

2014-11-12 Thread akshayhazari
As of now my approach is to fetch all data from tables located in different databases in separate RDD's and then make a union of them and then query on them together. I want to know whether I can perform a query on it directly along with creating an RDD. i.e. Instead of creating two RDDs , firing a

Can spark read and write to cassandra without HDFS?

2014-11-12 Thread Kevin Burton
We have all our data in Cassandra so I’d prefer to not have to bring up Hadoop/HDFS as that’s just another thing that can break. But I’m reading that spark requires a shared filesystem like HDFS or S3… Can I use Tachyon or this or something simple for a shared filesystem? -- Founder/CEO Spinn3
