Debugging Spark job in Eclipse

2015-08-05 Thread Deepesh Maheshwari
Hi, a Spark job is executed when you run the start() method of JavaStreamingContext. All the jobs like map and flatMap are already defined earlier, but even though you put breakpoints in the function, the breakpoint doesn't stop there, so how can I debug the Spark jobs? JavaDStream words = lines.flatMap(new

Implementing algorithms in GraphX pregel

2015-08-05 Thread Krish
Hi, I was recently looking into Spark GraphX as one of the frameworks that can help me solve some graph-related problems. The 'think-like-a-vertex' paradigm is something new to me, and I cannot wrap my head around how to implement simple algorithms like Depth First or Breadth First or even getting a

RE: Spark SQL support for Hive 0.14

2015-08-05 Thread Ishwardeep Singh
Thanks Steve and Michael for your response. Is there a tentative release date for Spark 1.5? From: Michael Armbrust [mailto:mich...@databricks.com] Sent: Tuesday, August 4, 2015 11:53 PM To: Steve Loughran Cc: Ishwardeep Singh ; user@spark.apache.org Subject: Re: Spark SQL support for Hive 0.14

Fwd: MLlib MulticlassMetrics Unable to find class key

2015-08-05 Thread Hayri Volkan Agun
Hi everyone, after a successful prediction, MulticlassMetrics returns an "unable to find the class key" error when computing precision for a class. Is there a way to check whether the metrics contain the class key? -- Hayri Volkan Agun PhD. Student - Anadolu University

Re: Combining Spark Files with saveAsTextFile

2015-08-05 Thread Igor Berman
It seems that coalesce does work; see the following thread: https://www.mail-archive.com/user%40spark.apache.org/msg00928.html On 5 August 2015 at 09:47, Igor Berman wrote: > using coalesce might be dangerous, since 1 worker process will need to > handle the whole file, and if the file is huge you'll get OOM,
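A minimal Scala sketch of the trade-off being discussed, assuming an existing RDD named rdd and a hypothetical output path; coalescing to a single partition funnels the whole dataset through one task, which is exactly where the OOM risk comes from:

    // Produce one output file instead of one per partition.
    // Risky for huge datasets: a single task handles all the data.
    val combined = rdd.coalesce(1)
    combined.saveAsTextFile("hdfs:///output/combined")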

Spark Streaming - CheckPointing issue

2015-08-05 Thread Sadaf
Hi, I've done Twitter streaming using Twitter's streaming user API and Spark Streaming. This runs successfully on my local machine, but when I run this program on the cluster in local mode, it only runs successfully the very first time; later on it gives the following exception. "Exception in t

Re: About memory leak in spark 1.4.1

2015-08-05 Thread Sea
No one helped me... I helped myself: I split the cluster into two clusters, 1.4.1 and 1.3.0 -- -- From: "Ted Yu"; Date: Tue, Aug 4, 2015 10:28; To: "Igor Berman"; Cc: "Sea"<261810...@qq.com>; "Barak Gitsis"; "user@spark.apache.org"; "r

Label based MLlib MulticlassMetrics is buggy

2015-08-05 Thread Hayri Volkan Agun
The results in MulticlassMetrics are totally wrong; they are improperly calculated. The confusion matrix may be correct, I don't know, but the per-label scores are wrong. -- Hayri Volkan Agun PhD. Student - Anadolu University

Re: large scheduler delay in pyspark

2015-08-05 Thread gen tang
Hi, thanks a lot for your reply. It seems that it is because of the slowness of the second piece of code. I rewrote the code as list(set([i.items for i in a] + [i.items for i in b])) and the program returned to normal. By the way, I find that while the computation is running, the UI shows scheduler delay. However, i

Re: Spark Streaming - CheckPointing issue

2015-08-05 Thread Saisai Shao
Hi, which Spark version do you use? It looks like a problem of configuration recovery; I'm not sure whether it is a Twitter-streaming-specific problem. I tried Kafka streaming with checkpointing enabled on my local machine and saw no such issue. Did you try to set these configurations somewhere? Thanks, Saisai

Re: Debugging Spark job in Eclipse

2015-08-05 Thread Eugene Morozov
Deepesh, you have to call an action to start actual processing. words.count() would do the trick. On 05 Aug 2015, at 11:42, Deepesh Maheshwari wrote: > Hi, > > As spark job is executed when you run start() method of JavaStreamingContext. > All the job like map, flatMap is already defined ea
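More precisely, for DStreams it is output operations (print(), saveAsTextFiles(), foreachRDD()) registered before ssc.start() that trigger job execution on each batch. A minimal Scala sketch, with the socket source and port as assumptions:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(1))
    val lines = ssc.socketTextStream("localhost", 9999)
    val words = lines.flatMap(_.split(" "))
    // Output operation: without one, no job is scheduled and
    // breakpoints inside flatMap are never hit.
    words.print()
    ssc.start()
    ssc.awaitTermination()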

Best practices to call hiveContext in DataFrame.foreach in executor program or how to have a for loop in driver program

2015-08-05 Thread unk1102
Hi, I have the following code, which fires hiveContext.sql() most of the time. My task is to create a few tables and insert values into them after processing, for all Hive table partitions. So I first fire "show partitions" and, using its output in a for loop, I call a few methods which create a table if not e

Spark SQL Hive - merge small files

2015-08-05 Thread Brandon White
Hello, I would love to have Hive merge the small files in my managed Hive context after every query. Right now, I am setting the Hive configuration in my Spark job configuration, but Hive is not merging the files. Do I need to set the Hive fields in another place? How do you set Hive configurations

Memory allocation error with Spark 1.5

2015-08-05 Thread Alexis Seigneurin
Hi, I'm receiving a memory allocation error with a recent build of Spark 1.5: java.io.IOException: Unable to acquire 67108864 bytes of memory at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:348) at org.apache.spark.util.coll

Re: No Twitter Input from Kafka to Spark Streaming

2015-08-05 Thread narendra
Thanks Akash for the answer. I added endpoint to the listener and now it is working. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/No-Twitter-Input-from-Kafka-to-Spark-Streaming-tp24131p24142.html Sent from the Apache Spark User List mailing list archive a

Re: Spark SQL support for Hive 0.14

2015-08-05 Thread Steve Loughran
On 5 Aug 2015, at 02:08, Ishwardeep Singh mailto:ishwardeep.si...@impetus.co.in>> wrote: Thanks Steve and Michael for your response. Is there a tentative release date for Spark 1.5? The branch is off and going through its bug stabilisation/test phase. This is where you can help contribute to t

Re: Unable to load native-hadoop library for your platform

2015-08-05 Thread Steve Loughran
> On 4 Aug 2015, at 12:26, Sean Owen wrote: > > Oh good point, does the Windows integration need native libs for > POSIX-y file system access? I know there are some binaries shipped for > this purpose but wasn't sure if that's part of what's covered in the > native libs message. On unix its a w

Re: Label based MLlib MulticlassMetrics is buggy

2015-08-05 Thread Feynman Liang
Hi Hayri, Can you provide a sample of the expected and actual results? Feynman On Wed, Aug 5, 2015 at 6:19 AM, Hayri Volkan Agun wrote: > The results in MulticlassMetrics is totally wrong. They are improperly > calculated. > Confusion matrix may be true I don't know but for each label scores a

Re: Label based MLlib MulticlassMetrics is buggy

2015-08-05 Thread Feynman Liang
Also, what version of Spark are you using? On Wed, Aug 5, 2015 at 9:57 AM, Feynman Liang wrote: > Hi Hayri, > > Can you provide a sample of the expected and actual results? > > Feynman > > On Wed, Aug 5, 2015 at 6:19 AM, Hayri Volkan Agun > wrote: > >> The results in MulticlassMetrics is totall

Re: Spark SQL Hive - merge small files

2015-08-05 Thread Michael Armbrust
This feature isn't currently supported. On Wed, Aug 5, 2015 at 8:43 AM, Brandon White wrote: > Hello, > > I would love to have hive merge the small files in my managed hive context > after every query. Right now, I am setting the hive configuration in my > Spark Job configuration but hive is not

Re: Memory allocation error with Spark 1.5

2015-08-05 Thread Reynold Xin
In Spark 1.5, we have a new way to manage memory (part of Project Tungsten). The default unit of memory allocation is 64MB, which is way too high when you have 1G of memory allocated in total and have more than 4 threads. We will reduce the default page size before releasing 1.5. For now, you can
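A sketch of the suggested workaround; the property name spark.buffer.pageSize is an assumption based on the pre-release 1.5 builds discussed here and may change in the final release:

    import org.apache.spark.SparkConf

    // Hypothetical workaround: shrink Tungsten's per-allocation page
    // size so several task threads can share a small (e.g. 1G) heap.
    val conf = new SparkConf()
      .set("spark.buffer.pageSize", "16m")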

Re: Spark SQL Hive - merge small files

2015-08-05 Thread Brandon White
So there is no good way to merge Spark files in a managed Hive table right now? On Wed, Aug 5, 2015 at 10:02 AM, Michael Armbrust wrote: > This feature isn't currently supported. > > On Wed, Aug 5, 2015 at 8:43 AM, Brandon White > wrote: > >> Hello, >> >> I would love to have hive merge the smal

Re: Is SPARK-3322 fixed in latest version of Spark?

2015-08-05 Thread Jim Green
We tested Spark 1.2 and 1.3, and this issue is gone. I know that starting from 1.2, Spark uses netty instead of nio. Do you mean that bypasses this issue? Another question is, why did this error message not show in Spark 0.9 or older versions? On Tue, Aug 4, 2015 at 11:01 PM, Aaron Davidson wrote: >

Streaming and calculated-once semantics

2015-08-05 Thread Dimitris Kouzis - Loukas
Hello, here's a simple program that demonstrates my problem: ssc = StreamingContext(sc, 1) input = [ [("k1",12), ("k2",14)], [("k1",22)] ] rawData = ssc.queueStream([sc.parallelize(d, 1) for d in input]) runningRawData = rawData.updateStateByKey(lambda nv, prev: reduce(sum, nv, prev or 0)) de

Starting Spark SQL thrift server from within a streaming app

2015-08-05 Thread Daniel Haviv
Hi, is it possible to start the Spark SQL thrift server from within a streaming app so the streamed data can be queried as it goes in? Thank you. Daniel

Spark MLlib v/s SparkR

2015-08-05 Thread praveen S
I was wondering when one should go for MLlib or SparkR. What criteria should be considered before choosing either of the solutions for data analysis? Or what are the advantages of Spark MLlib over SparkR, or of SparkR over MLlib?

Re: Spark MLlib v/s SparkR

2015-08-05 Thread Benamara, Ali
SparkR doesn't support all the ML algorithms yet; the upcoming 1.5 will have more support, but still not all the algorithms that are currently supported in MLlib. SparkR is more of a convenience for R users to get acquainted with Spark at this point. -Ali From: praveen S mailto:mylogi...@gmail.com>

Re: Spark MLlib v/s SparkR

2015-08-05 Thread Krishna Sankar
A few points to consider: a) SparkR gives the union of R_in_a_single_machine and the distributed_computing_of_Spark; b) it also gives the ability to wrangle data in R that is in the Spark ecosystem; c) coming to MLlib, the question is MLlib and R (not MLlib or R) - depending on the scale, dat

spark hangs at broadcasting during a filter

2015-08-05 Thread AlexG
I'm trying to load a 1 TB file whose lines i,j,v represent the values of a matrix given as A_{ij} = v, so I can convert it to a Parquet file. Only some of the rows of A are relevant, so the following code first loads the triplets as text, splits them into Tuple3[Int, Int, Double], drops triplets wh

Re: Spark MLlib v/s SparkR

2015-08-05 Thread Charles Earl
What machine learning algorithms are you interested in exploring or using? Start from there, or better yet from the problem you are trying to solve, and then the selection may be evident. On Wednesday, August 5, 2015, praveen S wrote: > I was wondering when one should go for MLlib or SparkR. What is t

Set Job Descriptions for Scala application

2015-08-05 Thread Rares Vernica
Hello, My Spark application is written in Scala and submitted to a Spark cluster in standalone mode. The Spark Jobs for my application are listed in the Spark UI like this: Job Id Description ... 6 saveAsTextFile at Foo.scala:202 5 saveAsTextFile at Foo.scala:201 4

Re: Set Job Descriptions for Scala application

2015-08-05 Thread Mark Hamstra
SparkContext#setJobDescription or SparkContext#setJobGroup On Wed, Aug 5, 2015 at 12:29 PM, Rares Vernica wrote: > Hello, > > My Spark application is written in Scala and submitted to a Spark cluster > in standalone mode. The Spark Jobs for my application are listed in the > Spark UI like this:
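A short sketch of both calls; the description strings and the cleaned RDD are made up:

    // Shown in the Description column of the Spark UI jobs table.
    sc.setJobDescription("Save cleaned records")
    cleaned.saveAsTextFile("hdfs:///out/cleaned")

    // Or group several jobs under one id and description.
    sc.setJobGroup("etl-run-1", "Nightly ETL")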

Re: spark hangs at broadcasting during a filter

2015-08-05 Thread Philip Weaver
How big is droprows? Try explicitly broadcasting it like this: val broadcastDropRows = sc.broadcast(dropRows) val valsrows = ... .filter(x => !broadcastDropRows.value.contains(x._1)) - Philip On Wed, Aug 5, 2015 at 11:54 AM, AlexG wrote: > I'm trying to load a 1 Tb file whose lines i,j,
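Assembled as a self-contained sketch (the triplet parsing and input path are assumptions; dropRows follows the thread). Converting the integer array to a Set before broadcasting also keeps the contains() check constant-time:

    val dropRowsSet: Set[Int] = dropRows.toSet
    // Ship the lookup set to each executor once, not with every task.
    val broadcastDropRows = sc.broadcast(dropRowsSet)

    val valsrows = sc.textFile("hdfs:///data/triplets")
      .map(_.split(","))
      .map(t => (t(0).toInt, (t(1).toInt, t(2).toDouble)))
      .filter(x => !broadcastDropRows.value.contains(x._1))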

HiveContext error

2015-08-05 Thread Stefan Panayotov
Hello, I am trying to define an external Hive table from Spark HiveContext like the following: import org.apache.spark.sql.hive.HiveContext val hiveCtx = new HiveContext(sc) hiveCtx.sql(s"""CREATE EXTERNAL TABLE IF NOT EXISTS Rentrak_Ratings (Version string, Gen_Date string, Market_Number s

Re: Label based MLlib MulticlassMetrics is buggy

2015-08-05 Thread Feynman Liang
1.5 has not yet been released; what is the commit hash that you are building? On Wed, Aug 5, 2015 at 10:29 AM, Hayri Volkan Agun wrote: > Hi, > > In Spark 1.5 I saw a result of precision 1.0 and recall 0.01 for decision > tree classification. > While precision is a hundred, the recall shouldn't be

SparkConf "ignoring" keys

2015-08-05 Thread Corey Nolet
I've been using SparkConf on my project for quite some time now to store configuration information for its various components. This has worked very well thus far in situations where I have control over the creation of the SparkContext & the SparkConf. I have run into a bit of a problem trying to i

Re: large scheduler delay in pyspark

2015-08-05 Thread ayan guha
It seems you want to dedupe your data after the merge, so set(a+b) should also work... you may ditch the list comprehension operation. On 5 Aug 2015 23:55, "gen tang" wrote: > Hi, > Thanks a lot for your reply. > > > It seems that it is because of the slowness of the second code. > I rewrite code a

Window function examples in pyspark

2015-08-05 Thread jegordon
Hi all, I'm trying to use some window functions (ntile and percentRank) with a DataFrame, but I don't know how to use them. Can anyone help me with this, please? In the Python API documentation there are no examples of them. Specifically, I'm trying to get quantiles of a numeric field in my

Re: Starting Spark SQL thrift server from within a streaming app

2015-08-05 Thread Todd Nist
Hi Daniel, it is possible to create an instance of the Spark SQL Thrift server; however, it seems this project may be what you are looking for: https://github.com/Intel-bigdata/spark-streamingsql Not 100% sure what your use case is, but you can always convert the data into a DF and then issue a query ag
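For the first option, a sketch of serving a streaming app's data through the Thrift server; the record type, table name, and per-batch registration are assumptions:

    import org.apache.spark.sql.hive.HiveContext
    import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

    case class Event(id: Long, value: String) // hypothetical record type

    val sqlContext = new HiveContext(sc)
    // Start the JDBC/ODBC server inside this application's context.
    HiveThriftServer2.startWithContext(sqlContext)

    // Assuming `stream` is a DStream[Event]: re-register each batch
    // so external clients always query the latest data.
    stream.foreachRDD { rdd =>
      sqlContext.createDataFrame(rdd).registerTempTable("live_events")
    }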

Re: trying to understand yarn-client mode

2015-08-05 Thread nir
Hi DB Tsai-2, I am trying to run a singleton SparkContext in my container (a Spring Boot Tomcat container). When my application bootstraps, I create the SparkContext and keep the reference for future job submissions. I got it working with standalone Spark perfectly, but I am having trouble with yarn m

Newbie question: can shuffle avoid writing and reading from disk?

2015-08-05 Thread Muler
Hi, consider I'm running WordCount with 100m of data on a 4-node cluster. Assuming my RAM size on each node is 200g and I'm giving my executors 100g (just enough memory for the 100m of data): 1. If I have enough memory, can Spark 100% avoid writing to disk? 2. During shuffle, where results have to b

Re: Newbie question: can shuffle avoid writing and reading from disk?

2015-08-05 Thread Saisai Shao
Hi Muler, shuffle data will be written to disk no matter how much memory you have; large memory can alleviate shuffle spill, where temporary files are generated when memory is not enough. Yes, each node writes shuffle data to file, and it is pulled from disk in the reduce stage through the network framework (d

Pause Spark Streaming reading or sampling streaming data

2015-08-05 Thread Heath Guo
Hi, I have a question about sampling Spark Streaming data, or getting part of the data. For every minute, I only want the data read in during the first 10 seconds, and discard all data in the next 50 seconds. Is there any way to pause reading and discard data in that period? I'm doing this to sa

Re: Newbie question: can shuffle avoid writing and reading from disk?

2015-08-05 Thread Muler
Thanks. So if I have large enough memory (with enough spark.shuffle.memory), then shuffle spill (in-memory shuffle) doesn't happen (per node), but shuffle data still has to be ultimately written to disk so that the reduce stage can pull it across the network? On Wed, Aug 5, 2015 at 4:40 PM, Saisai Shao wrote:

Re: Newbie question: can shuffle avoid writing and reading from disk?

2015-08-05 Thread Saisai Shao
Yes, finally shuffle data will be written to disk for the reduce stage to pull, no matter how large you set the shuffle memory fraction. Thanks Saisai On Thu, Aug 6, 2015 at 7:50 AM, Muler wrote: > thanks, so if I have enough large memory (with enough > spark.shuffle.memory) then shuffle (in-memory

Very high latency to initialize a DataFrame from partitioned parquet database.

2015-08-05 Thread Philip Weaver
I have a parquet directory that was produced by partitioning by two keys, e.g. like this: df.write.partitionBy("a", "b").parquet("asdf") There are 35 values of "a", and about 1100-1200 values of "b" for each value of "a", for a total of over 40,000 partitions. Before running any transformations

Re: Pause Spark Streaming reading or sampling streaming data

2015-08-05 Thread Dimitris Kouzis - Loukas
What driver do you use? Sounds like something you should do before the driver... On Thu, Aug 6, 2015 at 12:50 AM, Heath Guo wrote: > Hi, I have a question about sampling Spark Streaming data, or getting part > of the data. For every minute, I only want the data read in during the > first 10 seco

Re: Pause Spark Streaming reading or sampling streaming data

2015-08-05 Thread Heath Guo
Hi Dimitris, Thanks for your reply. Just wondering – are you asking about my streaming input source? I implemented a custom receiver and have been using that. Thanks. From: Dimitris Kouzis - Loukas mailto:look...@gmail.com>> Date: Wednesday, August 5, 2015 at 5:27 PM To: Heath Guo mailto:heath..

PySpark in Pycharm- unable to connect to remote server

2015-08-05 Thread Ashish Dutt
Use Case: I want to use my laptop (using Win 7 Professional) to connect to the CentOS 6.4 master server using PyCharm. Objective: To write the code in Pycharm on the laptop and then send the job to the server which will do the processing and should then return the result back to the laptop or to a

Reliable Streaming Receiver

2015-08-05 Thread Sourabh Chandak
Hi, I am trying to replicate the Kafka streaming receiver for a custom version of Kafka and want to create a reliable receiver. The current implementation uses BlockGenerator, which is a private class inside Spark Streaming, hence I can't use it in my code. Can someone help me with some resources

How to connect to remote HDFS programmatically to retrieve data, analyse it and then write the data back to HDFS?

2015-08-05 Thread Ashish Dutt
*Use Case:* To automate the process of data extraction (HDFS), data analysis (pySpark/sparkR) and saving the data back to HDFS programmatically. *Prospective solutions:* 1. Create a remote server connectivity program in an IDE like pyCharm or RStudio and use it to retrieve the data from HDFS or e

Re: Reliable Streaming Receiver

2015-08-05 Thread Tathagata Das
You could very easily strip out the BlockGenerator code from the Spark source code and use it directly in the same way the Reliable Kafka Receiver uses it. BTW, you should know that we will be deprecating the receiver based approach for the Direct Kafka approach. That is quite flexible, can give e
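For reference, the direct approach (available since Spark 1.3) looks roughly like this in Scala; the broker address and topic name are placeholders:

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
    // No receiver: each batch reads offset ranges straight from Kafka.
    val stream = KafkaUtils.createDirectStream[
      String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("events"))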

Re: Reliable Streaming Receiver

2015-08-05 Thread Sourabh Chandak
Thanks Tathagata. I tried that but BlockGenerator internally uses SystemClock which is again private. We are using DSE so stuck with Spark 1.2 hence can't use the receiver-less version. Is it possible to use the same code as a separate API with 1.2? Thanks, Sourabh On Wed, Aug 5, 2015 at 6:13 PM

Re: Newbie question: can shuffle avoid writing and reading from disk?

2015-08-05 Thread Muler
Thanks! On Wed, Aug 5, 2015 at 5:24 PM, Saisai Shao wrote: > Yes, finally shuffle data will be written to disk for reduce stage to > pull, no matter how large you set to shuffle memory fraction. > > Thanks > Saisai > > On Thu, Aug 6, 2015 at 7:50 AM, Muler wrote: > >> thanks, so if I have enoug

Re: Reliable Streaming Receiver

2015-08-05 Thread Dibyendu Bhattacharya
Hi, you can try this Kafka consumer for Spark, which is also part of Spark Packages: https://github.com/dibbhatt/kafka-spark-consumer Regards, Dibyendu On Thu, Aug 6, 2015 at 6:48 AM, Sourabh Chandak wrote: > Thanks Tathagata. I tried that but BlockGenerator internally uses > SystemClock which

Re: How to connect to remote HDFS programmatically to retrieve data, analyse it and then write the data back to HDFS?

2015-08-05 Thread Ted Yu
Please see the comments at the tail of SPARK-2356 Cheers On Wed, Aug 5, 2015 at 6:04 PM, Ashish Dutt wrote: > *Use Case:* To automate the process of data extraction (HDFS), data > analysis (pySpark/sparkR) and saving the data back to HDFS > programmatically. > > *Prospective solutions:* > > 1.

How to read gzip data in Spark - Simple question

2015-08-05 Thread ๏̯͡๏
I have CSV data that is gzip-compressed on HDFS. *With Pig* a = load '/user/zeppelin/aggregatedsummary/2015/08/03/regular/part-m-3.gz' using PigStorage(); b = limit a 10 (2015-07-27,12459,,31243,6,Daily,-999,2099-01-01,2099-01-02,4,0,0.1,0,1,203,4810370.0,1.4090459061723766,1.01

Re: How to read gzip data in Spark - Simple question

2015-08-05 Thread Philip Weaver
The parallelize method does not read the contents of a file. It simply takes a collection and distributes it to the cluster. In this case, the String is a collection of 67 characters. Use sc.textFile instead of sc.parallelize, and it should work as you want. On Wed, Aug 5, 2015 at 8:12 PM, ÐΞ€ρ@Ҝ (๏
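A sketch of the fix; textFile decompresses gzip transparently based on the .gz extension (the path is the one from the original message). Note that a gzip file is not splittable, so it arrives as a single partition:

    // Each line of the decompressed CSV becomes one RDD element.
    val lines = sc.textFile(
      "/user/zeppelin/aggregatedsummary/2015/08/03/regular/part-m-3.gz")
    lines.take(10).foreach(println)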

Re: How to read gzip data in Spark - Simple question

2015-08-05 Thread ๏̯͡๏
That seems to be working; however, I see a new exception. Code: def formatStringAsDate(dateStr: String) = new SimpleDateFormat("yyyy-MM-dd").parse(dateStr) //(2015-07-27,12459,,31242,6,Daily,-999,2099-01-01,2099-01-02,1,0,0.1,0,1,-1,isGeo,,,204,694.0,1.9236856708701322E-4,0.0,-4.48,0.0,0.0,0.0,) val

Re: How to read gzip data in Spark - Simple question

2015-08-05 Thread Philip Weaver
This message means that java.util.Date is not supported by Spark DataFrame. You'll need to use java.sql.Date, I believe. On Wed, Aug 5, 2015 at 8:29 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote: > That seem to be working. however i see a new exception > > Code: > def formatStringAsDate(dateStr: String) = new > Simpl
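A sketch of the conversion, mirroring the helper used in this thread: parse with SimpleDateFormat, then wrap the epoch millis in java.sql.Date:

    import java.text.SimpleDateFormat

    def formatStringAsDate(dateStr: String): java.sql.Date =
      new java.sql.Date(
        new SimpleDateFormat("yyyy-MM-dd").parse(dateStr).getTime)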

Re: Upgrade of Spark-Streaming application

2015-08-05 Thread Shushant Arora
Hi, for checkpointing and using the fromOffsets argument: say the first time my app starts I don't have any previous state stored and I want to start consuming from the largest offset. 1. Is it possible to specify that in the fromOffsets API? I don't want to use the other API, which returns JavaPairInputDS
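For reference, the fromOffsets overload takes explicit per-partition starting offsets plus a message handler; to start from the largest offset on a first run, the simpler overload consults auto.offset.reset instead. A Scala sketch with broker, topic, and offsets as placeholders:

    import kafka.common.TopicAndPartition
    import kafka.message.MessageAndMetadata
    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    val kafkaParams = Map(
      "metadata.broker.list" -> "broker1:9092",
      "auto.offset.reset" -> "largest") // used by the overload without offsets

    val fromOffsets = Map(TopicAndPartition("events", 0) -> 0L)
    val stream = KafkaUtils.createDirectStream[
      String, String, StringDecoder, StringDecoder, (String, String)](
      ssc, kafkaParams, fromOffsets,
      (m: MessageAndMetadata[String, String]) => (m.key, m.message))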

Re: Very high latency to initialize a DataFrame from partitioned parquet database.

2015-08-05 Thread Cheng Lian
We've fixed this issue in 1.5 https://github.com/apache/spark/pull/7396 Could you give it a shot to see whether it helps in your case? We've observed ~50x performance boost with schema merging turned on. Cheng On 8/6/15 8:26 AM, Philip Weaver wrote: I have a parquet directory that was produce
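In 1.5 the behavior is also exposed per read; the option name mergeSchema is my reading of the linked change, so treat it as an assumption:

    // Only merge part-file schemas when explicitly requested; with the
    // linked fix the merge itself is also much faster.
    val df = sqlContext.read
      .option("mergeSchema", "true")
      .parquet("asdf")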

Re: Very high latency to initialize a DataFrame from partitioned parquet database.

2015-08-05 Thread Philip Weaver
Absolutely, thanks! On Wed, Aug 5, 2015 at 9:07 PM, Cheng Lian wrote: > We've fixed this issue in 1.5 https://github.com/apache/spark/pull/7396 > > Could you give it a shot to see whether it helps in your case? We've > observed ~50x performance boost with schema merging turned on. > > Cheng > >

Re: How to read gzip data in Spark - Simple question

2015-08-05 Thread ๏̯͡๏
How do I persist the RDD to HDFS? On Wed, Aug 5, 2015 at 8:32 PM, Philip Weaver wrote: > This message means that java.util.Date is not supported by Spark > DataFrame. You'll need to use java.sql.Date, I believe. > > On Wed, Aug 5, 2015 at 8:29 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote: > >> That seem to be work

Re: How to read gzip data in Spark - Simple question

2015-08-05 Thread ๏̯͡๏
Code: val summary = rowStructText.map(s => s.split(",")).map( { s => Summary(formatStringAsDate(s(0)), s(1).replaceAll("\"", "").toLong, s(3).replaceAll("\"", "").toLong, s(4).replaceAll("\"", "").toInt, s(5).replaceAll("\"", ""),

Re: How to read gzip data in Spark - Simple question

2015-08-05 Thread Philip Weaver
I encourage you to find the answer to this on your own :). On Wed, Aug 5, 2015 at 9:43 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote: > Code: > > val summary = rowStructText.map(s => s.split(",")).map( > { > s => > Summary(formatStringAsDate(s(0)), > s(1).replaceAll("\"", "").toLong, >

RE: How to read gzip data in Spark - Simple question

2015-08-05 Thread Ganelin, Ilya
Have you tried reading the spark documentation? http://spark.apache.org/docs/latest/programming-guide.html Thank you, Ilya Ganelin -Original Message- From: ÐΞ€ρ@Ҝ (๏̯͡๏) [deepuj...@gmail.com] Sent: Thursday, August 06, 2015 12:41 AM Eastern Standard Time T
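For the record, the guide's answer to the underlying question is an RDD action; a minimal sketch using the summary RDD from this thread (the path is a placeholder):

    // Writes one text file per partition under the given directory.
    summary.saveAsTextFile("hdfs:///user/example/summary")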

spark on mesos with docker from private repository

2015-08-05 Thread Eyal Fink
Hi, my Spark setup is a cluster on top of Mesos using Docker containers. I want to pull the Docker images from a private repository (currently gcr.io), and I can't get the authentication to work. I know how to generate a .dockercfg file (running on GCE, using gcloud docker -a). My problem is th

subscribe

2015-08-05 Thread Franc Carter
subscribe

Re: How to read gzip data in Spark - Simple question

2015-08-05 Thread ๏̯͡๏
I got it running by myself. On Wed, Aug 5, 2015 at 10:27 PM, Ganelin, Ilya wrote: > Have you tried reading the spark documentation? > > http://spark.apache.org/docs/latest/programming-guide.html > > > > Thank you, > Ilya Ganelin > > > > > -Original Message- > *From: *ÐΞ€ρ@Ҝ (๏̯͡๏) [deepuj

spark job not accepting resources from worker

2015-08-05 Thread Kushal Chokhani
Hi, I have a Spark/Cassandra setup where I am using the Spark Cassandra Java connector to query a table. So far, I have 1 Spark master node (2 cores) and 1 worker node (4 cores). Both of them have the following spark-env.sh under conf/: #!/usr/bin/env bash export SPARK_LOCAL_IP=127.0.0.1

Re: Extremely poor predictive performance with RF in mllib

2015-08-05 Thread Yanbo Liang
I can reproduce this issue, so it looks like a bug in Random Forest; I will try to find some clues. 2015-08-05 1:34 GMT+08:00 Patrick Lam : > Yes, I rechecked and the label is correct. As you can see in the code > posted, I used the exact same features for all the classifiers, so unless rf > somehow s

Re: spark hangs at broadcasting during a filter

2015-08-05 Thread Alex Gittens
Thanks. Repartitioning to a smaller number of partitions seems to fix my issue, but I'll keep broadcasting in mind (droprows is an integer array with about 4 million entries). On Wed, Aug 5, 2015 at 12:34 PM, Philip Weaver wrote: > How big is droprows? > > Try explicitly broadcasting it like thi

Unable to persist RDD to HDFS

2015-08-05 Thread ๏̯͡๏
Code: import java.text.SimpleDateFormat import java.util.Calendar import java.sql.Date import org.apache.spark.storage.StorageLevel def formatStringAsDate(dateStr: String) = new java.sql.Date(new SimpleDateFormat("yyyy-MM-dd").parse(dateStr).getTime()) //(2015-07-27,12459,,31242,6,Daily,-999,2099

Re: Unable to persist RDD to HDFS

2015-08-05 Thread Philip Weaver
This isn't really a Spark question. You're trying to parse a string to an integer, but it contains an invalid character. The exception message explains this. On Wed, Aug 5, 2015 at 11:34 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote: > Code: > import java.text.SimpleDateFormat > import java.util.Calendar > import jav
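A defensive-parsing sketch in Scala that drops malformed rows instead of failing the job; rowStructText follows the thread, the field index is an assumption:

    import scala.util.Try

    // Keep only rows whose second field is a valid Long; rows with
    // stray characters (e.g. quotes, headers) are silently skipped.
    val parsed = rowStructText
      .map(_.split(","))
      .flatMap(f => Try(f(1).replaceAll("\"", "").toLong).toOption)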

Re: Multiple UpdateStateByKey Functions in the same job?

2015-08-05 Thread Akhil Das
I think you can. Give it a try and see. Thanks Best Regards On Tue, Aug 4, 2015 at 7:02 AM, swetha wrote: > Hi, > > Can I use multiple UpdateStateByKey Functions in the Streaming job? Suppose > I need to maintain the state of the user session in the form of a Json and > counts of various other
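Conceptually, each updateStateByKey call just yields another stateful DStream, so several can coexist on the same input; a sketch with an invented (userId, jsonEvent) pair stream:

    // Hypothetical input: pairs is a DStream[(String, String)].
    def updateSession(events: Seq[String], state: Option[String]): Option[String] =
      Some(state.getOrElse("{}")) // real merge logic would fold events in

    def updateCount(events: Seq[String], state: Option[Long]): Option[Long] =
      Some(state.getOrElse(0L) + events.size)

    val sessions = pairs.updateStateByKey(updateSession _)
    val counts = pairs.updateStateByKey(updateCount _)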

Re: Twitter Connector-Spark Streaming

2015-08-05 Thread Akhil Das
Hi Sadaf, which version of Spark are you using? And what's in the spark-env.sh file? I think you are using both SPARK_CLASSPATH (which is deprecated) and spark.executor.extraClasspath (may be set in the spark-defaults.conf file). Thanks Best Regards On Wed, Aug 5, 2015 at 6:22 PM, Sadaf

Talk on Deep dive in Spark Dataframe API

2015-08-05 Thread madhu phatak
Hi, recently I gave a talk with a deep dive into the DataFrame API and SQL Catalyst. A video of the talk is available on YouTube, with slides and code. Please have a look if you are interested: http://blog.madhukaraphatak.com/anatomy-of-spark-dataframe-api/

Re: NoSuchMethodError : org.apache.spark.streaming.scheduler.StreamingListenerBus.start()V

2015-08-05 Thread Akhil Das
For some reason you have two different versions of the Spark jars in your classpath. Thanks Best Regards On Tue, Aug 4, 2015 at 12:37 PM, Deepesh Maheshwari < deepesh.maheshwar...@gmail.com> wrote: > Hi, > > I am trying to read data from kafka and process it using spark. > i have attached my s