Re: Building spark targz

2014-11-12 Thread Akhil Das
You need to run the make-distribution.sh to get the tar ball. Thanks Best Regards On Thu, Nov 13, 2014 at 1:44 AM, Ashwin Shankar wrote: > Hi, > I just cloned spark from the github and I'm trying to build to generate a > tar bal
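[Editor's note: a typical invocation, assuming the same Maven profiles the original poster used (check the script's --help on your branch), would be something like:

    ./make-distribution.sh --tgz -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive

The --tgz flag is what should produce the .tgz archive in addition to the dist/ directory.]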

Saving RDD into DB & then Reading back from DB

2014-11-12 Thread nsareen
Hi All, I know that Spark has integration with the Cassandra DB. Can an RDD be persisted into the DB and read back into the same state on server boot? If yes, are there any examples which would demonstrate how it's done? We have a requirement, where we are currently saving a snapshot of many rows in

Re: Joined RDD

2014-11-12 Thread qinwei
I think it is because A.join(B) is a shuffle map stage, whose result is stored temporarily (I'm not sure whether in memory or on disk). I saw the words "map output" in the log of my Spark application; I think that is the intermediate result of my application, and according to the log, it is stor

Re: MLLIB usage: BLAS dependency warning

2014-11-12 Thread jpl
Hi Xiangrui, All is well. Got it working now -- I just recompiled with sbt with the additional "package" flag and that created all the /bin files. Then when I start spark-shell, the web UI Environment tab shows the assembly jar in Spark's classpath entries, and now the kmeans function finds it -- no mo

Joined RDD

2014-11-12 Thread ajay garg
Hi, I have two RDDs A and B which are created by reading files from HDFS. I have a third RDD C which is created by taking the join of A and B. All three RDDs (A, B and C) are not cached. Now if I perform any action on C (say, collect), the action is served without reading any data from the disk.

Re: data locality, task distribution

2014-11-12 Thread Aaron Davidson
Spark's scheduling is pretty simple: it will allocate tasks to open cores on executors, preferring ones where the data is local. It even performs "delay scheduling", which means waiting a bit to see if an executor where the data resides locally becomes available. Are your tasks seeing very skewed

Re: Can spark read and write to cassandra without HDFS?

2014-11-12 Thread Kevin Burton
Yes. That’s what I was planning on using actually. I was just curious whether intermediate data had to be kept in HDFS but this answers my question. thanks. On Wed, Nov 12, 2014 at 9:33 PM, Harold Nguyen wrote: > Hi Kevin, > > Yes, Spark can read and write to Cassandra without Hadoop. Have yo

Re: Spark streaming cannot receive any message from Kafka

2014-11-12 Thread Jay Vyas
Yup, very important that n > 1 for Spark Streaming jobs; if local, use local[2]. The thing to remember is that your Spark receiver will take a thread to itself to produce data, so you need another thread to consume it. In a cluster manager like YARN or Mesos, the word "thread" is not used any
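[Editor's note: a minimal sketch of this in code (Scala, Spark 1.x receiver-based Kafka API; the ZooKeeper quorum, group id and topic name are placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    // local[2]: one thread for the Kafka receiver, at least one more to process batches
    val conf = new SparkConf().setMaster("local[2]").setAppName("KafkaWordCount")
    val ssc  = new StreamingContext(conf, Seconds(2))

    val messages = KafkaUtils.createStream(ssc, "zkhost:2181", "my-group", Map("my-topic" -> 1))
                             .map(_._2)   // drop the Kafka key, keep the message body

    messages.print()
    ssc.start()
    ssc.awaitTermination()
]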

Re: RDD to DStream

2014-11-12 Thread Jianshi Huang
I also discussed this with Liancheng two weeks ago. He suggested using toLocalIterator to collect partitions of the RDD to the driver (in the same order if the RDD is sorted), and then turning each partition into an RDD and putting them in the queue. So: to turn RDD[(timestamp, value)] into a DStream 1) Group by timestamp/window
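[Editor's note: a rough sketch of the queue-based approach (Scala); it simplifies the grouping to fixed-size chunks rather than timestamp windows, and sortedRdd, ssc and chunkSize are placeholders:

    import scala.collection.mutable.Queue
    import org.apache.spark.rdd.RDD

    // pull the sorted RDD to the driver in order, chunk it, and replay the chunks as a DStream
    val queue = new Queue[RDD[(Long, String)]]()
    sortedRdd.toLocalIterator.grouped(chunkSize).foreach { chunk =>
      queue += ssc.sparkContext.parallelize(chunk)
    }
    val replayed = ssc.queueStream(queue)   // emits one queued RDD per batch interval by default
]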

Re: Can spark read and write to cassandra without HDFS?

2014-11-12 Thread Harold Nguyen
Hi Kevin, Yes, Spark can read and write to Cassandra without Hadoop. Have you seen this: https://github.com/datastax/spark-cassandra-connector Harold On Wed, Nov 12, 2014 at 9:28 PM, Kevin Burton wrote: > We have all our data in Cassandra so I’d prefer to not have to bring up > Hadoop/HDFS as
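[Editor's note: for completeness, a sketch of reading and writing with the DataStax connector; the keyspace, table and column names are made up, and the exact API depends on the connector version:

    import com.datastax.spark.connector._
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("cassandra-no-hdfs")
      .set("spark.cassandra.connection.host", "10.0.2.20")   // a Cassandra contact point

    val sc = new SparkContext(conf)

    // read a table, derive something, write it back -- no HDFS involved
    val rows = sc.cassandraTable("my_keyspace", "my_table")
    rows.map(r => (r.getString("id"), r.getInt("count") * 2))
        .saveToCassandra("my_keyspace", "my_summary", SomeColumns("id", "count"))
]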

Can spark read and write to cassandra without HDFS?

2014-11-12 Thread Kevin Burton
We have all our data in Cassandra so I’d prefer to not have to bring up Hadoop/HDFS as that’s just another thing that can break. But I’m reading that spark requires a shared filesystem like HDFS or S3… Can I use Tachyon or this or something simple for a shared filesystem? -- Founder/CEO Spinn3

Query from two or more tables in Spark SQL. I have done this. Is there any simpler solution?

2014-11-12 Thread akshayhazari
As of now my approach is to fetch all data from tables located in different databases into separate RDDs, then make a union of them and query them together. I want to know whether I can perform a query on them directly along with creating an RDD, i.e. instead of creating two RDDs, firing a

Re: data locality, task distribution

2014-11-12 Thread Nathan Kronenfeld
Sorry, I think I was not clear in what I meant. I didn't mean it went down within a run, with the same instance. I meant I'd run the whole app, and one time it would cache 100%, and the next run it might cache only 83%. Within a run, it doesn't change. On Wed, Nov 12, 2014 at 11:31 PM, Aaron Da

Re: How (in Java) do I create an Accumulator of type Long

2014-11-12 Thread Sean Owen
Look again, the type is AccumulatorParam, not AccumulableParam. But yes that's what you do. On Thu, Nov 13, 2014 at 4:32 AM, Steve Lewis wrote: > I see Javadoc Style documentation but nothing that looks like a code sample > I tried the following before asking > > public static class LongAccum

Re: How (in Java) do I create an Accumulator of type Long

2014-11-12 Thread Steve Lewis
I see Javadoc Style documentation but nothing that looks like a code sample I tried the following before asking public static class LongAccumulableParam implements AccumulableParam,Serializable { @Override public Long addAccumulator(final Long r, final Long t) { re

Re: flatMap followed by mapPartitions

2014-11-12 Thread Mayur Rustagi
flatmap would have to shuffle data only if output RDD is expected to be partitioned by some key. RDD[X].flatmap(X=>RDD[Y]) If it has to shuffle it should be local. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi On Thu, Nov 1

Re: data locality, task distribution

2014-11-12 Thread Aaron Davidson
The fact that the caching percentage went down is highly suspicious. It should generally not decrease unless other cached data took its place, or unless executors were dying. Do you know if either of these was the case? On Tue, Nov 11, 2014 at 8:58 AM, Nathan Kronenfeld < nkronenf...@oculusinf

Re: How (in Java) do I create an Accumulator of type Long

2014-11-12 Thread Sean Owen
It's the exact same API you've already found, and it's documented: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.AccumulatorParam JavaSparkContext has helper methods for int and double but not long. You can just make your own little implementation of AccumulatorParam ri

Re: Pyspark Error when broadcast numpy array

2014-11-12 Thread bliuab
Dear Liu: I have tested this issue under Spark-1.1.0. The problem is solved under this newer version. On Wed, Nov 12, 2014 at 3:18 PM, Bo Liu wrote: > Dear Liu: > > Thank you for your reply. I will set up an experimental environment for > spark-1.1 and test it. > > On Wed, Nov 12, 2014 at 2:3

Assigning input files to spark partitions

2014-11-12 Thread Pala M Muthaia
Hi, I have a set of input files for a Spark program, with each file corresponding to a logical data partition. What is the API/mechanism to assign each input file (or a set of files) to a Spark partition when initializing RDDs? When I create a Spark RDD pointing to the directory of files, my und
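[Editor's note: one common pattern (a sketch, not necessarily what the poster needs) is to create one RDD per file and union them, so each file's data stays in its own partitions:

    // paths: a hypothetical Seq[String], one entry per logical data partition
    val perFile  = paths.map(p => sc.textFile(p, 1))  // minPartitions = 1, so a small file stays a single partition
    val combined = sc.union(perFile)                  // the union preserves the per-file partitions
]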

Re: MLLIB usage: BLAS dependency warning

2014-11-12 Thread jpl
Thanks! I used sbt (command below) and the .so file is now there (shown below). Now that I have this new assembly.jar, how do I run the spark-shell so that it can see the .so file when I call the kmeans function? Thanks again for your help with this. sbt/sbt -Dhadoop.version=2.4.0 -Pyarn -Phive

Re: Cache sparkSql data without uncompressing it in memory

2014-11-12 Thread Cheng Lian
Currently there’s no way to cache the compressed sequence file directly. Spark SQL uses an in-memory columnar format while caching table rows, so we must read all the raw data and convert them into columnar format. However, you can enable in-memory columnar compression by setting spark.sql.inMemo
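[Editor's note: a sketch of turning that on before caching (Spark 1.1-style SQLContext API; the table name is a placeholder):

    // enable compression for the in-memory columnar cache, then cache as usual
    sqlContext.setConf("spark.sql.inMemoryColumnarStorage.compressed", "true")
    sqlContext.cacheTable("my_table")
]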

Re: Unit testing jar request

2014-11-12 Thread nightwolf
+1 I agree we need this too. Looks like there is already an issue for it here; https://spark-project.atlassian.net/browse/SPARK-750 -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Unit-testing-jar-request-tp16475p18801.html Sent from the Apache Spark User L

RE: Spark streaming cannot receive any message from Kafka

2014-11-12 Thread Shao, Saisai
Did you configure the Spark master as local? It should be local[n], n > 1, for local mode. Besides, there’s a Kafka word count example in the Spark Streaming examples; you can try that. I’ve tested with the latest master, it’s OK. Thanks Jerry From: Tobias Pfeiffer [mailto:t...@preferred.jp] Sent: Thursday, Nov

Re: "overloaded method value updateStateByKey ... cannot be applied to ..." when Key is a Tuple2

2014-11-12 Thread Steve Reinhardt
I'm missing something simpler (I think). That is, why do I need a Some instead of a Tuple2? Because a Some might or might not be there, but a Tuple2 must be there? Or something like that? From: Adrian Mocanu <amoc...@verticalscope.com> You are correct; the filtering I’m talking about i

Using data in RDD to specify HDFS directory to write to

2014-11-12 Thread jschindler
I am having trouble figuring out how to solve a problem. I would like to stream events from Kafka to my Spark Streaming app and write the contents of each RDD out to an HDFS directory. Each event that comes into the app via Kafka will be JSON and have an event field with the name of the

Re: Imbalanced shuffle read

2014-11-12 Thread ankits
I have made some progress - the partitioning is very uneven, and everything goes to one partition. I see that spark partitions by key, so I tried this: //partitioning is done like partitionIdx = f(key) % numPartitions //we use random keys to get even partitioning val uniform = other_st

flatMap followed by mapPartitions

2014-11-12 Thread Debasish Das
Hi, I am doing a flatMap followed by mapPartitions to do some blocked operation... flatMap is shuffling data, but this shuffle is strictly shuffling to disk and not over the network, right? Thanks. Deb

Re: No module named pyspark - latest built

2014-11-12 Thread Andrew Or
Hey Jamborta, What Java version did you build the jar with? 2014-11-12 16:48 GMT-08:00 jamborta : > I have figured out that building the fat jar with sbt does not seem to > include the pyspark scripts using the following command: > > sbt/sbt -Pdeb -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -Phi

Re: Imbalanced shuffle read

2014-11-12 Thread ankits
Adding a call to rdd.repartition() after randomizing the keys has no effect either. code - //partitioning is done like partitionIdx = f(key) % numPartitions //we use random keys to get even partitioning val uniform = other_stream.transform(rdd => { rdd.map({ kv => val

Re: MEMORY_ONLY_SER question

2014-11-12 Thread Mohit Jaggi
Thanks Jerry and Tathagata. Does anyone know how Kryo compresses data? Are there any other serializers that work with Spark and have good compression for basic data types? On Tue, Nov 4, 2014 at 10:29 PM, Shao, Saisai wrote: > From my understanding, the Spark code uses Kryo in a streaming manner
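[Editor's note: serialization and compression are separate knobs in Spark; a sketch of combining them (Scala, Spark 1.x; "rdd" is a placeholder):

    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.rdd.compress", "true")       // compress serialized partitions with spark.io.compression.codec

    // later, on some RDD:
    rdd.persist(StorageLevel.MEMORY_ONLY_SER)  // stored as serialized (and, with the flag above, compressed) bytes
]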

Cannot submit Spark app to cluster, stuck on “UNDEFINED”

2014-11-12 Thread brother rain
I use this command to submit a *spark application* to a *yarn cluster*: export YARN_CONF_DIR=conf bin/spark-submit --class "Mining" --master yarn-cluster --executor-memory 512m ./target/scala-2.10/mining-assembly-0.1.jar *In the Web UI, it is stuck on* UNDEFINED *I

Re: No module named pyspark - latest built

2014-11-12 Thread Xiangrui Meng
You need to use maven to include python files. See https://github.com/apache/spark/pull/1223 . -Xiangrui On Wed, Nov 12, 2014 at 4:48 PM, jamborta wrote: > I have figured out that building the fat jar with sbt does not seem to > included the pyspark scripts using the following command: > > sbt/sb

Re: Spark streaming cannot receive any message from Kafka

2014-11-12 Thread Bill Jay
Hi all, Thanks for the information. I am running Spark streaming in a yarn cluster and the configuration should be correct. I followed the KafkaWordCount to write the current code three months ago. It has been working for several months. The messages are in json format. Actually, this code worked

Re: No module named pyspark - latest built

2014-11-12 Thread Tamas Jambor
Thanks. Will it work with sbt at some point? On Thu, 13 Nov 2014 01:03 Xiangrui Meng wrote: > You need to use maven to include python files. See > https://github.com/apache/spark/pull/1223 . -Xiangrui > > On Wed, Nov 12, 2014 at 4:48 PM, jamborta wrote: > > I have figured out that building the

Re: Getting py4j.protocol.Py4JError: An error occurred while calling o39.predict. while doing batch prediction using decision trees

2014-11-12 Thread rprabhu
Hey Thanks for responding so fast. I ran the code with the fix and it works great. Regards, Rahul -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Getting-py4j-protocol-Py4JError-An-error-occurred-while-calling-o39-predict-while-doing-batch-predics-tp18730p

Re: No module named pyspark - latest built

2014-11-12 Thread jamborta
I have figured out that building the fat jar with sbt does not seem to include the pyspark scripts when using the following command: sbt/sbt -Pdeb -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -Phive clean publish-local assembly However, the maven command works OK: mvn -Pdeb -Pyarn -Phadoop-2.3 -Dhadoop

Re: Spark streaming cannot receive any message from Kafka

2014-11-12 Thread Tobias Pfeiffer
Bill, However, now that I am using Spark 1.1.0, the Spark streaming job > cannot receive any messages from Kafka. I have not made any change to the > code. > Do you see any suspicious messages in the log output? Tobias

Spark streaming cannot receive any message from Kafka

2014-11-12 Thread Bill Jay
Hi all, I have a Spark streaming job which constantly receives messages from Kafka. I was using Spark 1.0.2 and the job had been running for a month. However, now that I am using Spark 1.1.0, the Spark streaming job cannot receive any messages from Kafka. I have not made any change to the co

Map output statuses exceeds frameSize

2014-11-12 Thread pouryas
Hey all, I am doing a groupBy on nearly 2 TB of data and I am getting this error: 2014-11-13 00:25:30 ERROR org.apache.spark.MapOutputTrackerMasterActor - Map output statuses were 32163619 bytes which exceeds spark.akka.frameSize (10485760 bytes). org.apache.spark.SparkException: Map output statuse
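[Editor's note: the usual workarounds are to raise the Akka frame size and/or reduce the number of map/reduce partitions so the status map shrinks; a minimal config sketch:

    import org.apache.spark.SparkConf

    // spark.akka.frameSize is specified in MB (default 10)
    val conf = new SparkConf().set("spark.akka.frameSize", "128")
]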

spark.parallelize seems broken on type

2014-11-12 Thread mod0
Interesting result here. I'm trying to parallelize a list for some simple tests with Spark and Ganglia. It seems that spark.parallelize doesn't create partitions except on the master node of our cluster. The image below shows the CPU utilization per node over three tests. The first two compute
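[Editor's note: one thing to check is the slice count; a sketch with an explicit number of partitions so the work can spread across executors:

    // ask for 64 slices explicitly instead of relying on the default parallelism
    val rdd = sc.parallelize(1 to 1000000, 64)
    rdd.map(x => math.pow(x, 2)).count()   // any non-trivial work, just to watch it spread across nodes
]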

Re: MLLIB usage: BLAS dependency warning

2014-11-12 Thread Xiangrui Meng
That means the "-Pnetlib-lgpl" option didn't work. Could you use sbt to build the assembly jar and see whether the ".so" file is inside the assembly jar? Which system and Java version are you using? -Xiangrui On Wed, Nov 12, 2014 at 2:22 PM, jpl wrote: > Hi Xiangrui, thank you very much for your

Cache sparkSql data without uncompressing it in memory

2014-11-12 Thread Sadhan Sood
We noticed while caching data from our hive tables which contain data in compressed sequence file format that it gets uncompressed in memory when getting cached. Is there a way to turn this off and cache the compressed data as is ?

How (in Java) do I create an Accumulator of type Long

2014-11-12 Thread Steve Lewis
JavaSparkContext currentContext = ...; Accumulator accumulator = currentContext.accumulator(0, "MyAccumulator"); will create an Accumulator of Integers. For many large Data problems Integer is too small and Long is a better type. I see a call like the following AccumulatorPar

Re: Too many failed collects when trying to cache a table in SparkSQL

2014-11-12 Thread Sadhan Sood
: > This is the log output: > > 2014-11-12 19:07:16,561 INFO thriftserver.SparkExecuteStatementOperation > (Logging.scala:logInfo(59)) - Running query 'CACHE TABLE xyz_cached AS > SELECT * FROM xyz where date_prefix = 20141112' > > 2014-11-12 19:07:17,455

Re: No module named pyspark - latest built

2014-11-12 Thread jamborta
Forgot to mention that this setup works in Spark standalone mode; the only problem is when I run on yarn. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/No-module-named-pyspark-latest-built-tp18740p18777.html Sent from the Apache Spark User List mailing list arch

Re: Spark SQL Lazy Schema Evaluation

2014-11-12 Thread Michael Armbrust
There are a few things you can do here: - Infer the schema on a subset of the data, pass that inferred schema (schemaRDD.schema) as the second argument of jsonRDD. - Hand construct a schema and pass it as the second argument including the fields you are interested in. - Instead load the data as
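[Editor's note: a sketch of the first suggestion — infer on a sample, reuse the schema for the full data (the sample fraction, seed and table name are arbitrary):

    // infer the schema from a small sample of the JSON strings, then apply it to the full RDD[String]
    val sample = hiveContext.jsonRDD(jsonStrings.sample(withReplacement = false, fraction = 0.01, seed = 42L))
    val full   = hiveContext.jsonRDD(jsonStrings, sample.schema)
    full.registerTempTable("events")
]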

Re: MLLIB usage: BLAS dependency warning

2014-11-12 Thread jpl
Hi Xiangrui, thank you very much for your response. I looked for the .so as you suggested. It is not here: $ jar tf assembly/target/spark-assembly_2.10-1.1.0-dist/spark-assembly-1.1.0-hadoop2.4.0.jar | grep netlib-native_system-linux-x86_64.so or here: $ jar tf assembly/target/spark-assembly

RE: "overloaded method value updateStateByKey ... cannot be applied to ..." when Key is a Tuple2

2014-11-12 Thread Adrian Mocanu
You are correct; the filtering I’m talking about is done implicitly. You don’t have to do it yourself. Spark will do it for you and remove those entries from the state collection. From: Yana Kadiyska [mailto:yana.kadiy...@gmail.com] Sent: November-12-14 3:50 PM To: Adrian Mocanu Cc: spr; u...@sp

Spark SQL Lazy Schema Evaluation

2014-11-12 Thread Corey Nolet
I'm loading sequence files containing json blobs in the value, transforming them into RDD[String] and then using hiveContext.jsonRDD(). It looks like Spark reads the files twice -- once when I define the jsonRDD() and then again when I actually make my call to hiveContext.sql(). Looking @ the code

Re: Wildly varying "aggregate" performance depending on code location

2014-11-12 Thread Jim Carroll
Well it looks like this is a scala problem after all. I loaded the file using pure scala and ran the exact same Processors without Spark and I got 20 seconds (with the code in the same file as the 'main') vs 30 seconds (with the exact same code in a different file) on the 500K rows. -- View thi

ec2 script and SPARK_LOCAL_DIRS not created

2014-11-12 Thread Darin McBeath
I'm using Spark 1.1 and the provided ec2 scripts to start my cluster (r3.8xlarge machines). From the spark-shell, I can verify that the environment variables are set: scala> System.getenv("SPARK_LOCAL_DIRS") res0: String = /mnt/spark,/mnt2/spark However, when I look on the workers, the directories

Re: Reading from Hbase using python

2014-11-12 Thread Ted Yu
Looking at HBaseResultToStringConverter : override def convert(obj: Any): String = { val result = obj.asInstanceOf[Result] Bytes.toStringBinary(result.value()) } Here is the code for Result.value(): public byte [] value() { if (isEmpty()) { return null; } retur

Re: SVMWithSGD default threshold

2014-11-12 Thread Xiangrui Meng
regParam=1.0 may penalize too much, because we use the average loss instead of total loss. I just sent a PR to lower the default: https://github.com/apache/spark/pull/3232 You can try LogisticRegressionWithLBFGS (and configure parameters through its optimizer), which should converge faster than SG
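[Editor's note: a sketch of the LBFGS variant with a gentler regParam (parameter names as of Spark 1.1; trainingData is a placeholder RDD[LabeledPoint]):

    import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS

    val lr = new LogisticRegressionWithLBFGS()
    lr.optimizer
      .setRegParam(0.01)      // far gentler than the old default of 1.0
      .setNumIterations(100)
    val model = lr.run(trainingData)
]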

Re: Imbalanced shuffle read

2014-11-12 Thread ankits
I tried that, but that did not resolve the problem. All the executors for partitions except one have no shuffle reads and finish within 20-30 ms. one executor has a complete shuffle read of the previous stage. Any other ideas on debugging this? -- View this message in context: http://apache-spa

How can my java code executing on a slave find the task id?

2014-11-12 Thread Steve Lewis
I am trying to determine how effective partitioning is at parallelizing my tasks. So far I suspect that all work is done in one task. My plan is to create a number of accumulators - one for each task - and have functions increment the accumulator for the appropriate task (or slave); the values cou

Re: Reading from Hbase using python

2014-11-12 Thread Ted Yu
To my knowledge, Spark 1.1 comes with HBase 0.94. To utilize HBase 0.98, you will need: https://issues.apache.org/jira/browse/SPARK-1297 You can apply the patch and build Spark yourself. Cheers On Wed, Nov 12, 2014 at 12:57 PM, Alan Prando wrote: > Hi Ted! Thanks for answering... > > Maybe I di

Re: "overloaded method value updateStateByKey ... cannot be applied to ..." when Key is a Tuple2

2014-11-12 Thread Yana Kadiyska
Adrian, do you know if this is documented somewhere? I was also under the impression that setting a key's value to None would cause the key to be discarded (without any explicit filtering on the user's part) but can not find any official documentation to that effect On Wed, Nov 12, 2014 at 2:43 PM

Re: Building spark targz

2014-11-12 Thread Sadhan Sood
I think you can provide -Pbigtop-dist to build the tar. On Wed, Nov 12, 2014 at 3:21 PM, Sean Owen wrote: > mvn package doesn't make tarballs. It creates artifacts that will > generally appear in target/ and subdirectories, and likewise within > modules. Look at make-distribution.sh > > On Wed,

Re: Reading from Hbase using python

2014-11-12 Thread Ted Yu
Can you give us a bit more detail: the HBase release you're using, and whether you can reproduce the issue using the HBase shell. I did the following using the HBase shell against 0.98.4: hbase(main):001:0> create 'test', 'f1' 0 row(s) in 2.9140 seconds => Hbase::Table - test hbase(main):002:0> put 'test', 'row1', 'f1:1

Re: Building spark targz

2014-11-12 Thread Sean Owen
mvn package doesn't make tarballs. It creates artifacts that will generally appear in target/ and subdirectories, and likewise within modules. Look at make-distribution.sh On Wed, Nov 12, 2014 at 8:14 PM, Ashwin Shankar wrote: > Hi, > I just cloned spark from the github and I'm trying to build t

Re: Building spark targz

2014-11-12 Thread Ashwin Shankar
Yes, I'm looking at assembly/target. I don't see the tar ball. I only see scala-2.10/spark-assembly-1.2.0-SNAPSHOT-hadoop2.4.0.jar ,classes,test-classes, maven-shared-archive-resources,spark-test-classpath.txt. On Wed, Nov 12, 2014 at 12:16 PM, Sadhan Sood wrote: > Just making sure but are you l

Re: spark streaming: stderr does not roll

2014-11-12 Thread Nguyen, Duc
I've also tried setting the aforementioned properties using System.setProperty() as well as on the command line while submitting the job using --conf key=value, all with no success. When I go to the Spark UI and click on that particular streaming job and then the "Environment" tab, I can see the prop

Re: Building spark targz

2014-11-12 Thread Sadhan Sood
Just making sure but are you looking for the tar in assembly/target dir ? On Wed, Nov 12, 2014 at 3:14 PM, Ashwin Shankar wrote: > Hi, > I just cloned spark from the github and I'm trying to build to generate a > tar ball. > I'm doing : mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive > -Ds

Re: Spark and Play

2014-11-12 Thread Donald Szeto
Hi Akshat, If your application is to serve results directly from a SparkContext, you may want to take a look at http://prediction.io. It integrates Spark with spray.io (another REST/web toolkit by Typesafe). Some heavy lifting is done here: https://github.com/PredictionIO/PredictionIO/blob/develop

Building spark targz

2014-11-12 Thread Ashwin Shankar
Hi, I just cloned spark from the github and I'm trying to build to generate a tar ball. I'm doing : mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -DskipTests clean package Although the build is successful, I don't see the targz generated. Am I running the wrong command ? -- Thanks, Ashw

Re: Too many failed collects when trying to cache a table in SparkSQL

2014-11-12 Thread Sadhan Sood
This is the log output: 2014-11-12 19:07:16,561 INFO thriftserver.SparkExecuteStatementOperation (Logging.scala:logInfo(59)) - Running query 'CACHE TABLE xyz_cached AS SELECT * FROM xyz where date_prefix = 20141112' 2014-11-12 19:07:17,455 INFO Configuration.d

Re: using RDD result in another RDD

2014-11-12 Thread Sean Owen
You can't use RDDs inside of RDDs, so this won't work anyway. You could collect the result of RDD1 and broadcast it, perhaps. collect() blocks. On Wed, Nov 12, 2014 at 6:41 PM, Adrian Mocanu wrote: > Hi > > I’d like to use the result of one RDD1 in another RDD2. Normally I would > use something
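[Editor's note: a sketch of the collect-and-broadcast approach (names are placeholders):

    val small   = rdd1.collect()          // blocks until RDD1 is fully computed
    val bcSmall = sc.broadcast(small)

    // RDD2 can now use RDD1's result through the broadcast, not through a nested RDD
    val rdd2 = otherRdd.map(x => process(x, bcSmall.value))   // process/otherRdd are placeholders
]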

RE: "overloaded method value updateStateByKey ... cannot be applied to ..." when Key is a Tuple2

2014-11-12 Thread Adrian Mocanu
My understanding is that the reason you have an Option is so you could filter out tuples when None is returned. This way your state data won't grow forever. -Original Message- From: spr [mailto:s...@yarcdata.com] Sent: November-12-14 2:25 PM To: u...@spark.incubator.apache.org Subject: R

Reading from Hbase using python

2014-11-12 Thread Alan Prando
Hi all, I'm trying to read an HBase table using an example from GitHub ( https://github.com/apache/spark/blob/master/examples/src/main/python/hbase_inputformat.py), however I have two qualifiers in a column family. Ex.: ROW COLUMN+CELL row1 column=f1:1, timestamp=1401883411986, value=valu

Wildly varying "aggregate" performance depending on code location

2014-11-12 Thread Jim Carroll
Hello all, I have a really strange thing going on. I have a test data set with 500K lines in a gzipped csv file. I have an array of "column processors," one for each column in the dataset. A Processor tracks aggregate state and has a method "process(v: String)". I'm calling: val processors:

Re: "overloaded method value updateStateByKey ... cannot be applied to ..." when Key is a Tuple2

2014-11-12 Thread spr
After comparing with previous code, I got it to work by making the return a Some instead of a Tuple2. Perhaps some day I will understand this. spr wrote > --code > > val updateDnsCount = (values: Seq[(Int, Time)], state: Option[(Int, > Time)]) => { > val currentCount = if (va
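[Editor's note: for reference, a minimal sketch of the signature in question — the update function returns Option[S], and returning None drops the key from the state (pairStream is a placeholder DStream[(String, Int)]):

    // running count per key; returning None removes the key from the state entirely
    val updateFunc = (values: Seq[Int], state: Option[Int]) => {
      val newCount = state.getOrElse(0) + values.sum
      if (newCount == 0) None else Some(newCount)
    }
    val counts = pairStream.updateStateByKey[Int](updateFunc)
]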

Re: pyspark get column family and qualifier names from hbase table

2014-11-12 Thread freedafeng
Hi Nick, I saw that the HBase API has experienced lots of changes. If I remember correctly, the default HBase in Spark 1.1.0 is 0.94.6. The one I am using is 0.98.1. To get the column family names and qualifier names, we need to call different methods for these two different versions. I don't know how

using RDD result in another RDD

2014-11-12 Thread Adrian Mocanu
Hi, I'd like to use the result of one RDD (RDD1) in another RDD (RDD2). Normally I would use something like a barrier to make the 2nd RDD wait till the computation of the 1st RDD is done, then include the result from RDD1 in the closure for RDD2. Currently I create another RDD, RDD3, out of the result of RDD1

Re: join 2 tables

2014-11-12 Thread Rishi Yadav
please use join syntax. On Wed, Nov 12, 2014 at 8:57 AM, Franco Barrientos < franco.barrien...@exalitica.com> wrote: > I have 2 tables in a hive context, and I want to select one field of each > table where id’s of each table are equal. For example, > > > > *val tmp2=sqlContext.sql("select a.ult_
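[Editor's note: i.e. something along the lines of the following, with table and column names taken from the original query:

    val tmp2 = sqlContext.sql(
      "SELECT a.ult_fecha, b.pri_fecha " +
      "FROM fecha_ult_compra_u3m a JOIN fecha_pri_compra_u3m b ON a.id = b.id")
]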

Re: Is there a way to clone a JavaRDD without persisting it

2014-11-12 Thread Daniel Siegmann
As far as I know you basically have two options: let partitions be recomputed (possibly caching / persisting memory only), or persist to disk (and memory) and suffer the cost of writing to disk. The question is which will be more expensive in your case. My experience is you're better off letting th

Re: Question about textFileStream

2014-11-12 Thread Rishi Yadav
Yes, you can always specify a minimum number of partitions and that would force some parallelism (assuming you have enough cores). On Wed, Nov 12, 2014 at 9:36 AM, Saiph Kappa wrote: > What if the window is of 5 seconds, and the file takes longer than 5 > seconds to be completely scanned? It will

Re: pyspark get column family and qualifier names from hbase table

2014-11-12 Thread freedafeng
Hi, This is my code, import org.apache.hadoop.hbase.CellUtil /** * JF: convert a Result object into a string with column family and qualifier names. Sth like * 'columnfamily1:columnqualifier1:value1;columnfamily2:columnqualifier2:value2' etc. * k-v pairs are separated by ';'. different colum

Re: Getting py4j.protocol.Py4JError: An error occurred while calling o39.predict. while doing batch prediction using decision trees

2014-11-12 Thread Davies Liu
This is a bug, will be fixed by https://github.com/apache/spark/pull/3230 On Wed, Nov 12, 2014 at 7:20 AM, rprabhu wrote: > Hello, > I'm trying to run a classification task using mllib decision trees. After > successfully training the model, I was trying to test the model using some > sample rows

Re: SVMWithSGD default threshold

2014-11-12 Thread Sean Owen
OK, it's not class imbalance. Yes, 100 iterations. My other guess is that the stepSize of 1 is way too big for your data. I'd suggest you look at the weights / intercept of the resulting model to see if it makes any sense. You can call clearThreshold on the model, and then it will 'predict' the S
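[Editor's note: a sketch of inspecting the model and scoring with raw margins instead of 0/1 predictions (training/test are placeholder RDD[LabeledPoint]s):

    import org.apache.spark.mllib.classification.SVMWithSGD

    val model = SVMWithSGD.train(training, 100)          // numIterations = 100
    println(s"weights: ${model.weights}  intercept: ${model.intercept}")

    model.clearThreshold()                               // predict() now returns the raw margin, not 0/1
    val scored = test.map(p => (model.predict(p.features), p.label))
]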

Re: SVMWithSGD default threshold

2014-11-12 Thread Caron
Sean, Thanks a lot for your reply! A few follow up questions: 1. numIterations should be 100, not 100*trainingSetSize, right? 2. My training set has 90k positive data points (with label 1) and 60k negative data points (with label 0). I set my numIterations to 100 as default. I still got the same

No module named pyspark - latest built

2014-11-12 Thread jamborta
Hi all, I am trying to run spark with the latest build (from branch-1.2), as far as I can see, all the paths are set and SparkContext starts up OK, however, I cannot run anything that goes to the nodes. I get the following error: Error from python worker: /usr/bin/python2.7: No module named pys

Re: Question about textFileStream

2014-11-12 Thread Saiph Kappa
What if the window is of 5 seconds, and the file takes longer than 5 seconds to be completely scanned? It will still attempt to load the whole file? On Mon, Nov 10, 2014 at 6:24 PM, Soumitra Kumar wrote: > Entire file in a window. > > On Mon, Nov 10, 2014 at 9:20 AM, Saiph Kappa > wrote: > >> H

Too many failed collects when trying to cache a table in SparkSQL

2014-11-12 Thread Sadhan Sood
We are running spark on yarn with combined memory > 1TB and when trying to cache a table partition(which is < 100G), seeing a lot of failed collect stages in the UI and this never succeeds. Because of the failed collect, it seems like the mapPartitions keep getting resubmitted. We have more than en

RE: Snappy error with Spark SQL

2014-11-12 Thread Kapil Malik
Hi, Try adding this in spark-env.sh export JAVA_LIBRARY_PATH=$JAVA_LIBRARY_PATH:/usr/lib/hadoop-0.20-mapreduce/lib/native/Linux-amd64-64 export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib/hadoop-0.20-mapreduce/lib/native/Linux-amd64-64 export SPARK_LIBRARY_PATH=$SPARK_LIBRARY_PATH:/usr/lib/hadoo

Re: How to kill a Spark job running in cluster mode ?

2014-11-12 Thread Ritesh Kumar Singh
I remember there was some issue with the above command in previous versions of Spark. It's nice that it's working now :) On Wed, Nov 12, 2014 at 5:50 PM, Tao Xiao wrote: > Thanks for your replies. > > Actually we can kill a driver by the command "bin/spark-class > org.apache.spark.deploy.Client k

join 2 tables

2014-11-12 Thread Franco Barrientos
I have 2 tables in a hive context, and I want to select one field from each table where the ids of the tables are equal. For example, val tmp2=sqlContext.sql("select a.ult_fecha,b.pri_fecha from fecha_ult_compra_u3m as a, fecha_pri_compra_u3m as b where a.id=b.id") but I get an error: F

Getting py4j.protocol.Py4JError: An error occurred while calling o39.predict. while doing batch prediction using decision trees

2014-11-12 Thread rprabhu
Hello, I'm trying to run a classification task using mllib decision trees. After successfully training the model, I was trying to test the model using some sample rows when I hit this exception. The code snippet that caused this error is : model = DecisionTree.trainClassifier(parsedData, numClasse

Re: Status of MLLib exporting models to PMML

2014-11-12 Thread Villu Ruusmann
Hi DB, DB Tsai wrote > I also worry about that the author of JPMML changed the license of > jpmml-evaluator due to his interest of his commercial business, and he > might change the license of jpmml-model in the future. I am the principal author of the said Java PMML API projects and I want to a

Snappy error with Spark SQL

2014-11-12 Thread Naveen Kumar Pokala
HI, I am facing the following problem when I am trying to save my RDD as parquet File. 14/11/12 07:43:59 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 1.0 (TID 48,): org.xerial.snappy.SnappyError: [FAILED_TO_LOAD_NATIVE_LIBRARY] null org.xerial.snappy.SnappyLoader.load(SnappyLo

Re: ISpark class not found

2014-11-12 Thread Laird, Benjamin
Sounds like an IPython notebook issue, not an ISpark one. Might want to reinstall: "pip install ipython[notebook]", which will grab the necessary notebook components like tornado. Try spinning up the ISpark console instead of the notebook to see if the ISpark kernel is functioning: ipython console --profile

why flatmap has shuffle

2014-11-12 Thread qinwei
Hi, everyone! I consider flatMap a narrow dependency, but why does it have a shuffle, as shown on the web UI? My code is as below: val transferRDD = sc.textFile("hdfs://host:port/path") val rdd = transferRDD.map(line => { val trunks = line.split("\t")

RE: Spark SQL configurations

2014-11-12 Thread Naveen Kumar Pokala
Thanks Akhil. -Naveen From: Akhil Das [mailto:ak...@sigmoidanalytics.com] Sent: Wednesday, November 12, 2014 6:38 PM To: Naveen Kumar Pokala Cc: user@spark.apache.org Subject: Re: Spark SQL configurations JavaSQLContext.sqlContext.setConf is available. Thanks Best Regards On Wed, Nov 12, 2014 a

Java client connection

2014-11-12 Thread Eduardo Cusa
Hi guys, I'm starting to work with Spark from Java and when I run the following code: SparkConf conf = new SparkConf().setMaster("spark://10.0.2.20:7077").setAppName("SparkTest"); JavaSparkContext sc = new JavaSparkContext(conf); I received the following error and the Java process exits:

Re: Pass RDD to functions

2014-11-12 Thread qinwei
I think it's OK; feel free to treat an RDD like a common object. qinwei From: Deep Pradhan Date: 2014-11-12 18:24 To: user@spark.apache.org Subject: Pass RDD to functions Hi, Can we pass RDDs to functions? Like, can we do the following? def func (temp: RDD[String]): RDD[String] = { //body of the functio
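[Editor's note: a trivial sketch of passing an RDD to a function (path and body are placeholders):

    import org.apache.spark.rdd.RDD

    def func(temp: RDD[String]): RDD[String] =
      temp.filter(_.nonEmpty)    // any transformation; an RDD is just an (immutable) object

    val result = func(sc.textFile("hdfs://host:port/path"))
]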

Re: Spark SQL configurations

2014-11-12 Thread Akhil Das
JavaSQLContext.sqlContext.setConf is available. Thanks Best Regards On Wed, Nov 12, 2014 at 5:14 PM, Naveen Kumar Pokala < npok...@spcapitaliq.com> wrote: > > > Hi, > > > > How to set the above properties on JavaSQLContext. I am not able to see > setConf method on JavaSQLContext Object. > > >

Re: Nested Complex Type Data Parsing and Transforming to table

2014-11-12 Thread Shixiong Zhu
Could you give an example of your data? This line is wrong. p(1).trim.map(_.toString.split("\002")).map(s => s.map(_.toString.split("\003")).map(t => StructField1( For example, p(1) is a String, so in p(1).trim.map(x => x.toString.split("\002")), x is a Char. That should not be what you want. I
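[Editor's note: in other words, split the String first rather than mapping over its characters; a guess at the intent, using the same delimiters as the original code:

    // split the whole field on \002, then each piece on \003
    val groups: Array[Array[String]] = p(1).trim.split("\002").map(_.split("\003"))
]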

Re: How to kill a Spark job running in cluster mode ?

2014-11-12 Thread Tao Xiao
Thanks for your replies. Actually we can kill a driver by the command "bin/spark-class org.apache.spark.deploy.Client kill " if you know the driver id. 2014-11-11 22:35 GMT+08:00 Ritesh Kumar Singh : > There is a property : >spark.ui.killEnabled > which needs to be set true for killing appl
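[Editor's note: for the archives, the standalone-mode kill looks roughly like this; the master URL and driver ID are placeholders (the driver ID appears in the Master web UI):

    bin/spark-class org.apache.spark.deploy.Client kill spark://master-host:7077 driver-20141112123456-0001
]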

Spark SQL configurations

2014-11-12 Thread Naveen Kumar Pokala
[inline image: Spark SQL configuration properties] Hi, How to set the above properties on JavaSQLContext? I am not able to see a setConf method on the JavaSQLContext object. I have added the spark core jar and spark assembly jar to my build path. And I am using spark 1.1.0 and hadoop 2.4.0 --Naveen
