Problem reading from S3 in standalone application

2014-08-06 Thread sparkuser2345
Hi, I'm running Spark in an EMR cluster and I'm able to read from S3 using REPL without problems: val input_file = "s3:///test_data.txt" val rawdata = sc.textFile(input_file) val test = rawdata.collect but when I try to run a simple standalone application reading the same data, I get an erro

Re: Spark streaming at-least once guarantee

2014-08-06 Thread lalit1303
Hi TD, Thanks a lot for your reply :) I am already looking into creating a new DStream for SQS messages. It would be very helpful if you could provide some guidance on this. The main motive for integrating SQS with Spark Streaming is to make my jobs highly available. As of no

Re: Problem reading from S3 in standalone application

2014-08-06 Thread sparkuser2345
I'm getting the same "Input path does not exist" error even after setting the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables and using the format "s3:///test_data.txt" for the input file. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/P

Re: Problem reading from S3 in standalone application

2014-08-06 Thread Evan Sparks
Try s3n:// > On Aug 6, 2014, at 12:22 AM, sparkuser2345 wrote: > > I'm getting the same "Input path does not exist" error also after setting the > AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables and using > the format "s3:///test_data.txt" for the input file. > > > > -- >

Re: Setting spark.executor.memory problem

2014-08-06 Thread Grzegorz Białek
Hi Andrew, Thank you very much for your solution, it works like a charm, and for very clear explanation. Grzegorz

Re: master=local vs master=local[*]

2014-08-06 Thread grzegorz-bialek
Hi André, thank you very much for your explanation. I have 4 cores so 128MB for the executor was too little. I thought that when spark.executor.memory is set to 512m each worker gets 512MB, but when I run the application through spark-submit I have to provide the --driver-memory argument for spark-submit (setting sp

[GraphX] Can't zip RDDs with unequal numbers of partitions

2014-08-06 Thread Bin
Hi All, Finally I found that the problem occured when I called the graphx lib: " Exception in thread "main" java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of partitions at org.apache.spark.rdd.ZippedPartitionsBaseRDD.getPartitions(ZippedPartitionsRDD.scala:56)

fail to run LBFGS in 5G KDD data in spark 1.0.1?

2014-08-06 Thread Lizhengbing (bing, BIPA)
1) I don't use spark-submit to run my program; I use a SparkContext directly: val conf = new SparkConf() .setMaster("spark://123d101suse11sp3:7077") .setAppName("LBFGS") .set("spark.executor.memory", "30g") .set("spark.akka.frameSize","20") val sc =

Spark GraphX remembering old message

2014-08-06 Thread Arun Kumar
Hi I am trying to find belief for a graph using the GraphX Pregel implementation. My use case is: if vertices 2, 3, 4 send messages m2, m3, m4 to vertex 6, then in vertex 6 I will multiply all the messages (m2*m3*m4) = m6, and then from vertex 6 the message (m6/m2) will be sent to vertex 2, m6/m3 to vert

Re: Spark Streaming fails - where is the problem?

2014-08-06 Thread durin
Hi Tathagata, I got rid of parts of this problem today. It was embarrassingly simple: somehow, my synchronisation script had failed and one of the two jars wasn't sent out to all slave nodes. The Java error disappears; however, I still don't receive any tweets. This might well be firewall relate

can't submit my application on standalone spark cluster

2014-08-06 Thread Andres Gomez Ferrer
Hi all, My name is Andres and I'm starting to use Apache Spark. I try to submit my spark.jar to my cluster using this: spark-submit --class "net.redborder.spark.RedBorderApplication" --master spark://pablo02:7077 redborder-spark-selfcontained.jar But when I did it, my worker died, and my dr

Re: can't submit my application on standalone spark cluster

2014-08-06 Thread Akhil Das
Looks like a Netty conflict there; most likely you have multiple versions of the Netty jars (e.g. netty-3.6.6.Final.jar, netty-3.2.2.Final.jar, netty-all-4.0.13.Final.jar). You only require 3.6.6, I believe; a quick fix would be to remove the rest of them. Thanks Best Regards On Wed, Aug 6, 2014
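For illustration, a hedged sbt sketch of that kind of cleanup: the dependency name and version below are placeholders (not from this thread); the point is excluding the transitively pulled-in Netty artifacts so only one version reaches the classpath.

  // build.sbt fragment -- hypothetical dependency; adjust to whichever artifact
  // drags the extra Netty versions onto your classpath.
  libraryDependencies += ("org.apache.hbase" % "hbase" % "0.94.6")
    .exclude("io.netty", "netty")          // drops the 3.x artifact published under io.netty
    .exclude("org.jboss.netty", "netty")   // drops the old org.jboss.netty artifact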

Re: Problem reading from S3 in standalone application

2014-08-06 Thread sparkuser2345
Evan R. Sparks wrote > Try s3n:// Thanks, that works! In the REPL, I can successfully load the data using both s3:// and s3n://, so why the difference? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Problem-reading-from-S3-in-standalone-application-tp11524p11537.

Spark SQL (version 1.1.0-SNAPSHOT) should allow SELECT with duplicated columns

2014-08-06 Thread Jianshi Huang
Spark reported error java.lang.IllegalArgumentException with messages: java.lang.IllegalArgumentException: requirement failed: Found fields with the same name. at scala.Predef$.require(Predef.scala:233) at org.apache.spark.sql.catalyst.types.StructType.(dataTypes.scala:317)

Re: Spark stream data from kafka topics and output as parquet file on HDFS

2014-08-06 Thread Mahebub Sayyed
Hello, I have referred to the link "https://github.com/dibbhatt/kafka-spark-consumer" and I have successfully consumed tuples from Kafka. The tuples are JSON objects and I want to store those objects in HDFS in Parquet format. Please suggest a sample example for that. Thanks in advance. On Tue, Aug

SparkR : lapplyPartition transforms the data in vertical format

2014-08-06 Thread Pranay Dave
Hello As per the documentation, lapply works on single records and lapplyPartition works on a partition; however, the format of the output should not change. When I use lapplyPartition, the data is converted to a vertical format. Here is my code library(SparkR) sc <- sparkR.init("local") lines <- textFile(sc,"/

Spark Hbase job taking long time

2014-08-06 Thread Amit Singh Hora
Hi All, I am trying to run a SQL query on HBase using a Spark job. So far I am able to get the desired results, but as the data set size increases the Spark job takes a long time. I believe I am doing something wrong, as after going through the documentation and videos discussing Spark performance

how spark reads hdfs data with kerberos

2014-08-06 Thread Zhanfeng Huo
Hi, all: My YARN cluster is managed by Kerberos. Could someone give me an example of how to configure a standalone Spark cluster with this HDFS? In addition, I have configured spark.authenticate and spark.authenticate.secret in spark-defaults.conf as follows. But it

Re: Save an RDD to a SQL Database

2014-08-06 Thread Ron Gonzalez
Hi Vida, It's possible to save an RDD as a hadoop file using hadoop output formats. It might be worthwhile to investigate using DBOutputFormat and see if this will work for you. I haven't personally written to a db, but I'd imagine this would be one way to do it. Thanks, Ron Sent from my i

Re: Problem reading from S3 in standalone application

2014-08-06 Thread Nicholas Chammas
See here: https://wiki.apache.org/hadoop/AmazonS3 s3:// refers to a block storage system and is deprecated. Use s3n:// for "regular" files you can see in the S3 web console. Nick On Wed, Aug 6, 2014
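For illustration, a minimal standalone-app sketch of the s3n:// approach; the bucket name is a placeholder and the credentials are assumed to be in the environment.

  import org.apache.spark.{SparkConf, SparkContext}

  object S3ReadExample {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("S3ReadExample"))

      // Setting the keys on the Hadoop configuration avoids embedding them in the path.
      sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
      sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))

      val rawdata = sc.textFile("s3n://my-bucket/test_data.txt")  // my-bucket is a placeholder
      println(rawdata.count())
      sc.stop()
    }
  }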

[Streaming] updateStateByKey trouble

2014-08-06 Thread Yana Kadiyska
Hi folks, hoping someone who works with Streaming can help me out. I have the following snippet: val stateDstream = data.map(x => (x, 1)) .updateStateByKey[State](updateFunc) stateDstream.saveAsTextFiles(checkpointDirectory, "partitions_test") where data is an RDD of case class Stat

Re: trouble with jsonRDD and jsonFile in pyspark

2014-08-06 Thread Nicholas Chammas
Nice catch Brad and thanks to Yin and Davies for getting on it so quickly. On Wed, Aug 6, 2014 at 2:45 AM, Davies Liu wrote: > There is a PR to fix this: https://github.com/apache/spark/pull/1802 > > On Tue, Aug 5, 2014 at 10:11 PM, Brad Miller > wrote: > > I concur that printSchema works; it

Re: fail to run LBFGS in 5G KDD data in spark 1.0.1?

2014-08-06 Thread Xiangrui Meng
Do you mind testing 1.1-SNAPSHOT and allocating more memory to the driver? I think the problem is with the feature dimension. KDD data has more than 20M features and in v1.0.1, the driver collects the partial gradients one by one, sums them up, does the update, and then sends the new weights back t

Re: Save an RDD to a SQL Database

2014-08-06 Thread Yana
Hi Vida, I am writing to a DB -- or trying to :). I believe the best practice for this (you can search the mailing list archives) is to do a combination of mapPartitions and use a grouped iterator. Look at this thread, esp. the comment from A. Boisvert and Matei's comment above it: https://groups
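For illustration, a hedged sketch of that pattern (the table name, columns, and JDBC URL are assumptions): one connection per partition, batched inserts driven by a grouped iterator.

  import java.sql.DriverManager
  import org.apache.spark.rdd.RDD

  def saveToDb(rdd: RDD[(String, Int)], jdbcUrl: String): Unit = {
    rdd.foreachPartition { partition =>
      // One connection and prepared statement per partition, not per record.
      val conn = DriverManager.getConnection(jdbcUrl)
      val stmt = conn.prepareStatement("INSERT INTO results (key, value) VALUES (?, ?)")
      try {
        partition.grouped(500).foreach { batch =>
          batch.foreach { case (k, v) =>
            stmt.setString(1, k)
            stmt.setInt(2, v)
            stmt.addBatch()
          }
          stmt.executeBatch()
        }
      } finally {
        stmt.close()
        conn.close()
      }
    }
  }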

Runnning a Spark Shell locally against EC2

2014-08-06 Thread Gary Malouf
We have Spark 1.0.1 on Mesos deployed as a cluster in EC2. Our Devops lead tells me that Spark jobs can not be submitted from local machines due to the complexity of opening the right ports to the world etc. Are other people running the shell locally in a production environment?

PySpark, numpy arrays and binary data

2014-08-06 Thread Rok Roskar
Hello, I'm interested in getting started with Spark to scale our scientific analysis package (http://pynbody.github.io) to larger data sets. The package is written in Python and makes heavy use of numpy/scipy and related frameworks. I've got a couple of questions that I have not been able to fi

Regarding tooling/performance vs RedShift

2014-08-06 Thread Gary Malouf
My company is leaning towards moving much of their analytics work from our own Spark/Mesos/HDFS/Cassandra set up to RedShift. To date, I have been the internal advocate for using Spark for analytics, but a number of good points have been brought up to me. The reasons being pushed are: - RedShift

Re: Spark shell creating a local SparkContext instead of connecting to connecting to Spark Master

2014-08-06 Thread Aniket Bhatnagar
Thanks. This worked :). I am thinking I should add this in spark-env.sh so that spark-shell always connects to the master by default. On Aug 6, 2014 12:04 AM, "Akhil Das" wrote: > You can always start your spark-shell by specifying the master as > > MASTER=spark://*whatever*:7077 $SPARK_HOME/bin/spa

Re: Regarding tooling/performance vs RedShift

2014-08-06 Thread Nicholas Chammas
> > 1) We get tooling out of the box from RedShift (specifically, stable JDBC > access) - Spark we often are waiting for devops to get the right combo of > tools working or for libraries to support sequence files. The arguments about JDBC access and simpler setup definitely make sense. My first n

Re: PySpark, numpy arrays and binary data

2014-08-06 Thread Davies Liu
numpy arrays can only support basic types, so we cannot use them during collect() by default. Could you give a short example of how numpy arrays are used in your project? On Wed, Aug 6, 2014 at 8:41 AM, Rok Roskar wrote: > Hello, > > I'm interested in getting started with Spark to scale our scien

Re: Spark Streaming fails - where is the problem?

2014-08-06 Thread durin
Update: I can get it to work by disabling iptables temporarily. However, I cannot figure out which port I have to accept traffic on. 4040 and the Master or Worker ports mentioned in the previous post don't work. Can it be one of the randomly assigned ones in the 30k to 60k range? Those ap

RE: Regarding tooling/performance vs RedShift

2014-08-06 Thread Daniel, Ronald (ELS-SDG)
Just to point out that the benchmark you point to has Redshift running on HDD machines instead of SSD, and it is still faster than Shark in all but one case. Like Gary, I'm also interested in replacing something we have on Redshift with Spark SQL, as it will give me much greater capability to pr

Re: can't submit my application on standalone spark cluster

2014-08-06 Thread Andrew Or
Hi Andres, If you're using the EC2 scripts to start your standalone cluster, you can use "~/spark-ec2/copy-dir --delete ~/spark" to sync your jars across the cluster. Note that you will need to restart the Master and the Workers afterwards through "sbin/start-all.sh" and "sbin/stop-all.sh". If you

Re: Submitting to a cluster behind a VPN, configuring different IP address

2014-08-06 Thread nunarob
Hi, I'm having the exact same problem - I'm on a VPN and I'm trying to set the properties spark.httpBroadcast.uri and spark.fileserver.uri so that they bind to my VPN IP instead of my regular network IP. Were you ever able to get this working? Cheers, -Rob -- View this message in conte

Re: Spark Streaming fails - where is the problem?

2014-08-06 Thread Andrew Or
Hi Simon, The drivers and executors currently choose random ports to talk to each other, so the Spark nodes will have to have full TCP access to each other. This is changed in a very recent commit, where all of these random ports will become configurable: https://github.com/apache/spark/commit/09f

GraphX Pagerank application

2014-08-06 Thread AlexanderRiggers
I want to run PageRank on a 3GB text file, which contains a bipartite edge list with variables "id" and "brand". Example: id,brand 86246,15343 86246,27873 86246,14647 86246,55172 86246,3293 86246,2820 86246,3830 86246,2820 86246,5603 86246,72482 To perform the PageRank I have to create a graph object
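For illustration, a hedged sketch of that setup (the file path, header handling, and treating both columns as numeric vertex ids are assumptions): build the graph from the edge list and run GraphX PageRank.

  import org.apache.spark.SparkContext
  import org.apache.spark.graphx.Graph

  def brandPageRank(sc: SparkContext): Unit = {
    val edges = sc.textFile("hdfs:///data/id_brand.csv")
      .filter(line => !line.startsWith("id"))      // skip the header row
      .map { line =>
        val Array(id, brand) = line.split(",")
        (id.toLong, brand.toLong)                  // GraphX vertex ids must be Long
      }

    val graph = Graph.fromEdgeTuples(edges, defaultValue = 1)
    val ranks = graph.pageRank(0.0001).vertices    // (vertexId, rank) pairs

    // Print the ten highest-ranked vertices.
    ranks.map(_.swap).top(10).foreach(println)
  }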

Re: Runnning a Spark Shell locally against EC2

2014-08-06 Thread Andrew Or
Hi Gary, This has indeed been a limitation of Spark, in that drivers and executors use random ephemeral ports to talk to each other. If you are submitting a Spark job from your local machine in client mode (meaning, the driver runs on your machine), you will need to open up all TCP ports from your
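For illustration, a hedged sketch of pinning those ports once they become configurable: the property names below follow the port-configurability change mentioned above and exist only in newer Spark versions, and the host and port values are placeholders.

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .setMaster("spark://ec2-master-host:7077")         // placeholder master URL
    .setAppName("ShellAgainstEC2")
    .set("spark.driver.host", "my-public-hostname")    // address the executors can reach
    .set("spark.driver.port", "50001")
    .set("spark.fileserver.port", "50002")
    .set("spark.broadcast.port", "50003")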

Re: Runnning a Spark Shell locally against EC2

2014-08-06 Thread Gary Malouf
This will be awesome - it's been one of the major issues for our analytics team as they hope to use their own python libraries. On Wed, Aug 6, 2014 at 2:40 PM, Andrew Or wrote: > Hi Gary, > > This has indeed been a limitation of Spark, in that drivers and executors > use random ephemeral ports

Re: issue with spark and bson input

2014-08-06 Thread Dmitriy Selivanov
Finally I made it work. The trick was the "asSubclass" method: val mongoRDD = sc.newAPIHadoopFile("file:///root/jobs/dump/input.bson", classOf[BSONFileInputFormat].asSubclass(classOf[org.apache.hadoop.mapreduce.lib.input.FileInputFormat[Object, BSONObject]]), classOf[Object], classOf[BSONObject], co

Re: Spark Streaming fails - where is the problem?

2014-08-06 Thread durin
Hi Andrew, for this test I only have one machine, which provides the master and the only worker. So all I'd need is communication to the Internet to access the Twitter API. I've tried assigning a specific port to the driver and creating iptables rules for this port, but that didn't work. Best rega

heterogeneous cluster hardware

2014-08-06 Thread anthonyjschu...@gmail.com
I'm sure this must be a fairly common use-case for spark, yet I have not found a satisfactory discussion of it on the spark website or forum: I work at a company with a lot of previous-generation server hardware sitting idle-- I want to add this hardware to my spark cluster to increase performance

Spark memory management

2014-08-06 Thread Gary Malouf
I have a few questions about managing Spark memory: 1) In a standalone setup, is there any CPU prioritization across users running jobs? If so, what is the behavior here? 2) With Spark 1.1, users will more easily be able to run drivers/shells from remote locations that do not cause firewall head

Spark build error

2014-08-06 Thread Priya Ch
Hi, I am trying to build jars using the command : mvn -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0 -DskipTests clean package Execution of the above command is throwing the following error: [INFO] Spark Project Core . FAILURE [ 0.295 s] [INFO] Spark Project Bagel .

Re: Unit Test for Spark Streaming

2014-08-06 Thread JiajiaJing
Thank you TD, I have worked around that problem and now the test compiles. However, I don't actually see that test running: when I do "mvn test", it just says "BUILD SUCCESS", without any TEST section on stdout. Are we supposed to use "mvn test" to run the test? Are there any other methods can

Re: Regarding tooling/performance vs RedShift

2014-08-06 Thread Gary Malouf
Forgot to cc the mailing list :) On Wed, Aug 6, 2014 at 3:41 PM, Daniel, Ronald (ELS-SDG) < r.dan...@elsevier.com> wrote: > Agreed. Being able to use SQL to make a table, pass it to a graph > algorithm, pass that output to a machine learning algorithm, being able to > invoke user defined python

Re: Regarding tooling/performance vs RedShift

2014-08-06 Thread Nicholas Chammas
On Wed, Aug 6, 2014 at 3:41 PM, Daniel, Ronald (ELS-SDG)< r.dan...@elsevier.com> wrote: > Mostly I was just objecting to " Redshift does very well, but Shark is on > par or better than it in most of the tests " when that was not how I read > the results, and Redshift was on HDDs. My bad. You are

Re: Regarding tooling/performance vs RedShift

2014-08-06 Thread Gary Malouf
Also, regarding something like redshift not having MLlib built in, much of that could be done on the derived results. On Aug 6, 2014 4:07 PM, "Nicholas Chammas" wrote: > On Wed, Aug 6, 2014 at 3:41 PM, Daniel, Ronald (ELS-SDG)< > r.dan...@elsevier.com> wrote: > >> Mostly I was just objecting to "

Help with debugging a performance issue

2014-08-06 Thread Sameer Tilak
Hi All, I am running my spark job using spark-submit (yarn-client mode). In this PoC phase, I am loading a ~200MB TSV file and then doing computation over strings. I generate a few small files (KBs in size). The goal is then to load an input file of around 250GB rather than 200 MB and run the analytics. In

UpdateStateByKey - How to improve performance?

2014-08-06 Thread Venkat Subramanian
The method def updateStateByKey[S: ClassTag](updateFunc: (Seq[V], Option[S]) => Option[S]): DStream[(K, S)] takes a DStream[(K, V)] and produces a DStream[(K, S)] in Spark Streaming. We have an input DStream[(K, V)] that has 40,000 elements. We update on average 1,000 elements of them every 3 secon
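For illustration, a minimal self-contained sketch of that per-key form of updateStateByKey, keeping a running count per key; the input source, batch interval, and checkpoint directory are placeholders.

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}
  import org.apache.spark.streaming.StreamingContext._   // pair DStream operations

  object RunningCount {
    def main(args: Array[String]): Unit = {
      val ssc = new StreamingContext(new SparkConf().setAppName("RunningCount"), Seconds(3))
      ssc.checkpoint("/tmp/streaming-checkpoint")   // required for stateful operations

      val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))
      val pairs = words.map(word => (word, 1))

      // Called once per key on every batch; newValues is empty for keys with no new data.
      val updateFunc = (newValues: Seq[Int], state: Option[Int]) =>
        Some(state.getOrElse(0) + newValues.sum)

      val counts = pairs.updateStateByKey[Int](updateFunc)
      counts.print()

      ssc.start()
      ssc.awaitTermination()
    }
  }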

RE: Regarding tooling/performance vs RedShift

2014-08-06 Thread Daniel, Ronald (ELS-SDG)
Well yes, MLlib-like routines or pretty much anything else could be run on the derived results, but you have to unload the results from Redshift and then load them into some other tool. So it's nicer to leave them in memory and operate on them there. Major architectural advantage to Spark. Ron

Re: Writing to RabbitMQ

2014-08-06 Thread Tathagata Das
Yeah, I have observed this common problem in the design a number of times in the mailing list. Tobias, what you linked to is also an additional problem, that occurs with mapPartitions, but not with foreachPartitions (which is relevant here). But I do get your point. I think there was an attempt mad

Re: Regarding tooling/performance vs RedShift

2014-08-06 Thread Nicholas Chammas
On Wed, Aug 6, 2014 at 4:30 PM, Daniel, Ronald (ELS-SDG) < r.dan...@elsevier.com> wrote: > Major architectural advantage to Spark. Amen to that. For a really cool and succinct demonstration of this, check out Aaron's demo at the Hadoop Summit earlier this e

Re: SparkR : lapplyPartition transforms the data in vertical format

2014-08-06 Thread Shivaram Venkataraman
The output of lapply and lapplyPartition should be the same by design -- the only difference is that in lapply the user-defined function returns a row, while it returns a list in lapplyPartition. Could you give an example of a small input and the output that you expect to see for the above program? Sh

Re: Spark streaming at-least once guarantee

2014-08-06 Thread Tathagata Das
Why can't you keep a persistent queue of S3 files to process? The process that you are running has two threads. Thread 1: continuously gets SQS messages and writes to the queue. This queue is persisted to reliable storage, like HDFS / S3. Thread 2: peeks into the queue, and whenever there is any me

Re: UpdateStateByKey - How to improve performance?

2014-08-06 Thread Michael Malak
Depending on the density of your keys, the alternative signature def updateStateByKey[S](updateFunc: (Iterator[(K, Seq[V], Option[S])]) => Iterator[(K, S)], partitioner: Partitioner, rememberPartitioner: Boolean)(implicit arg0: ClassTag[S]): DStream[(K, S)] at least iterates by key rather than
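For illustration, a hedged fragment of that iterator-based variant (the key/value types and the partitioner are assumptions); note that keys the function does not emit are dropped from the state.

  import org.apache.spark.HashPartitioner

  // pairs: DStream[(String, Int)] from earlier in the job (assumed)
  val iterUpdateFunc = (it: Iterator[(String, Seq[Int], Option[Int])]) =>
    it.map { case (key, newValues, state) =>
      (key, state.getOrElse(0) + newValues.sum)   // emit every key to keep its state
    }

  // val counts = pairs.updateStateByKey(iterUpdateFunc, new HashPartitioner(8), rememberPartitioner = true)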

Re: UpdateStateByKey - How to improve performance?

2014-08-06 Thread Tathagata Das
Hello Venkat, Your thoughts are quite spot on. The current implementation was designed to allow the functionality of timing out a state. For this to be possible, the update function needs to be called for each key even if there is no new data, so that the function can check things like "last update tim

Re: Spark memory management

2014-08-06 Thread Andrew Or
Hey Gary, The answer to both of your questions is that much of it is up to the application. For (1), the standalone master can set "spark.deploy.defaultCores" to limit the number of cores each application can grab. However, the application can override this with the application-specific "spark.c
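For illustration, a minimal sketch of the application-side settings (the values are placeholders):

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .setAppName("MyApp")
    .set("spark.cores.max", "8")           // per-application cap; overrides spark.deploy.defaultCores
    .set("spark.executor.memory", "4g")    // memory per executor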

Re: Unit Test for Spark Streaming

2014-08-06 Thread Tathagata Das
Does it not show the name of the testsuite on stdout, showing that it has passed? Can you try writing a small "test" unit-test, in the same way as your kafka unit test, and with print statements on stdout ... to see whether it works? I believe it is some configuration issue in maven, which is hard

Re: [Streaming] updateStateByKey trouble

2014-08-06 Thread Tathagata Das
This should not be happening :| Can you please reproduce this in a smaller program with a dataset that I can use to debug? I can't imagine how this could be happening. TD On Wed, Aug 6, 2014 at 7:51 AM, Yana Kadiyska wrote: > Hi folks, hoping someone who works with Streaming can help me out. > > I

Re: Stopping StreamingContext does not kill receiver

2014-08-06 Thread lbustelo
I'm running on spark 1.0.0 and I see a similar problem when using the socketTextStream receiver. The ReceiverTracker task sticks around after a ssc.stop(false). -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Stopping-StreamingContext-does-not-kill-receiver-

Re: Spark stream data from kafka topics and output as parquet file on HDFS

2014-08-06 Thread Tathagata Das
You can use Spark SQL for that very easily. You can convert the RDDs you get from the Kafka input stream to RDDs of case classes and save them as Parquet files. More information here: https://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files On Wed, Aug 6, 2014 at 5:23 A
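For illustration, a hedged sketch of that approach; the case-class fields, the comma-separated parsing, and the output path are assumptions (real JSON messages would be parsed with a JSON library).

  import org.apache.spark.SparkContext
  import org.apache.spark.sql.SQLContext
  import org.apache.spark.streaming.dstream.DStream

  case class Event(id: String, value: Double)

  def saveAsParquet(sc: SparkContext, messages: DStream[String]): Unit = {
    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD     // implicitly turns RDD[Event] into a SchemaRDD

    messages.foreachRDD { rdd =>
      val events = rdd.map { line =>
        val Array(id, value) = line.split(",")   // placeholder parsing
        Event(id, value.toDouble)
      }
      // One Parquet directory per micro-batch.
      events.saveAsParquetFile(s"hdfs:///events/batch-${System.currentTimeMillis}")
    }
  }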

Re: Spark Streaming multiple streams problem

2014-08-06 Thread Tathagata Das
You probably have only 10 cores in your cluster on which you are executing your job. Since each DStream / receiver takes one core, the system is not able to start all of them and so everything is blocked. On Wed, Aug 6, 2014 at 3:08 AM, Laeeq Ahmed wrote: > Hi, > > I am reading multiple str

Re: Stopping StreamingContext does not kill receiver

2014-08-06 Thread Tathagata Das
Can you give the stack trace? This was the fix for the twitter stream. https://github.com/apache/spark/pull/1577/files You could try doing the same. TD On Wed, Aug 6, 2014 at 2:41 PM, lbustelo wrote: > I'm running on spark 1.0.0 and I see a similar problem when using the > socketTextStream

Spark Streaming + reduceByWindow(reduceFunc, invReduceFunc, windowDuration, slideDuration

2014-08-06 Thread salemi
Hi, I have a DStream called eventData and it contains a set of Data objects defined as follows: case class Data(startDate: Long, endDate: Long, className: String, id: String, state: String) What would the reducer and inverse reducer functions look like if I would like to add the data for current

Naive Bayes parameters

2014-08-06 Thread SK
1) How is the minPartitions parameter in the NaiveBayes example used? What is the default value? 2) Why is numFeatures specified as a parameter? Can this not be obtained from the data? This parameter is not specified for the other MLlib algorithms. thanks -- View this message in context:

Re: Spark Streaming + reduceByWindow(reduceFunc, invReduceFunc, windowDuration, slideDuration

2014-08-06 Thread Tathagata Das
Why isn't a simple window function sufficient? eventData.window(Minutes(15), Seconds(3)) will keep generating RDDs every 3 seconds, each containing the last 15 minutes of data. TD On Wed, Aug 6, 2014 at 3:43 PM, salemi wrote: > Hi, > I have a DStream called eventData and it contains set of Data o

Trying to make sense of the actual executed code

2014-08-06 Thread Tom
Hi, I am trying to look at, for instance, the following SQL query in Spark 1.1: SELECT table.key, table.value, table2.value FROM table2 JOIN table WHERE table2.key = table.key When I look at the output, I see that there are several stages, and several tasks per stage. The tasks have a TID; I do not

Re: Stopping StreamingContext does not kill receiver

2014-08-06 Thread lbustelo
Sorry about the screenshot… but that is what I have handy right now. You can see that we get a WARN and it ultimately says that it stopped successfully. When looking at the application in the Spark UI, it still shows the ReceiverTracker task running. It is easy to recreate. On the spark repl we are r

Re: Stopping StreamingContext does not kill receiver

2014-08-06 Thread Tathagata Das
Okay let me give it a shot. On Wed, Aug 6, 2014 at 3:57 PM, lbustelo wrote: > Sorry about the screenshot… but that is what I have handy right now. You > can > see that we get a WARN and it ultimately say that it stopped successfully. > When looking that the application in Spark UI, it still sho

Using Python IDE for Spark Application Development

2014-08-06 Thread Sathish Kumaran Vairavelu
Hello, I am trying to use the python IDE PyCharm for Spark application development. How can I use pyspark with Python IDE? Can anyone help me with this? Thanks Sathish

Re: Using Python IDE for Spark Application Development

2014-08-06 Thread Mohit Singh
My naive setup: adding os.environ['SPARK_HOME'] = "/path/to/spark" and sys.path.append("/path/to/spark/python") at the top of my script. from pyspark import SparkContext from pyspark import SparkConf Execution works from within PyCharm... Though my next step is to figure out autocompletion, and I bet the

Hive 11 / CDH 4.6/ Spark 0.9.1 dilemmna

2014-08-06 Thread Anurag Tangri
I posted this to the cdh-user mailing list yesterday and think this should have been the right audience for it: = Hi All, Not sure if anyone else has faced this same issue or not. We installed CDH 4.6, which uses Hive 0.10. And we have Spark 0.9.1, which comes with Hive 0.11. Now our hive jobs that wo

Re: Hive 11 / CDH 4.6/ Spark 0.9.1 dilemmna

2014-08-06 Thread Sean Owen
I haven't tried any of this, mind you, but my guess is that your options, from least painful and most likely to work onwards, are: - Get Spark / Shark to compile against Hive 0.10 - Shade Hive 0.11 into Spark - Update to CDH 5.0+ I don't think there will be more updated releases of Shark or Sp

Regularization parameters

2014-08-06 Thread SK
Hi, I tried different regularization parameter values with Logistic Regression for binary classification of my dataset and would like to understand the following results: regType = L2, regParam = 0.0 , I am getting AUC = 0.80 and accuracy of 80% regType = L1, regParam = 0.0 , I am getting AUC =

Re: RDD to DStream

2014-08-06 Thread Tathagata Das
Hey Aniket, Great thoughts! I understand the use case. But as you have realized yourself, it is not trivial to cleanly stream an RDD as a DStream. Since RDD operations are defined to be scan based, it is not efficient to define an RDD based on slices of data within a partition of another RDD, using pure
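Not the slicing approach discussed above, but for cases where replaying each existing RDD as one whole batch is acceptable, a minimal sketch using queueStream (names are placeholders):

  import scala.collection.mutable
  import org.apache.spark.rdd.RDD
  import org.apache.spark.streaming.StreamingContext
  import org.apache.spark.streaming.dstream.DStream

  def replayAsStream(ssc: StreamingContext, rdds: Seq[RDD[String]]): DStream[String] = {
    val queue = mutable.Queue(rdds: _*)
    // oneAtATime = true pushes one queued RDD per batch interval.
    ssc.queueStream(queue, oneAtATime = true)
  }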

Re: Trying to make sense of the actual executed code

2014-08-06 Thread Michael Armbrust
This is maybe not exactly what you are asking for, but you might consider looking at the queryExecution (a developer API that shows how the query is analyzed / executed) sql("...").queryExecution On Wed, Aug 6, 2014 at 3:55 PM, Tom wrote: > Hi, > > I am trying to look at for instance the follo

Fwd: Trying to make sense of the actual executed code

2014-08-06 Thread Tobias Pfeiffer
(Forgot to include the mailing list in my reply. Here it is.) Hi, On Thu, Aug 7, 2014 at 7:55 AM, Tom wrote: > > When I look at the output, I see that there are several stages, and several > tasks per stage. The tasks have a TID, I do not see such a thing for a > stage. They should have. In m

Re: Naive Bayes parameters

2014-08-06 Thread Burak Yavuz
Hi, Could you please send the link for the example you are talking about? minPartitions and numFeatures do not exist in the current API for NaiveBayes as far as I know. So, I don't know how to answer your second question. Regarding your first question, guessing blindly, it should be related to

Re: Stopping StreamingContext does not kill receiver

2014-08-06 Thread Tathagata Das
I narrowed down the error. Unfortunately this is not quick fix. I have opened a JIRA for this. https://issues.apache.org/jira/browse/SPARK-2892 On Wed, Aug 6, 2014 at 3:59 PM, Tathagata Das wrote: > Okay let me give it a shot. > > > On Wed, Aug 6, 2014 at 3:57 PM, lbustelo wrote: > >> Sorry ab

Re: Regularization parameters

2014-08-06 Thread Burak Yavuz
Hi, That is interesting. Would you please share some code on how you are setting the regularization type, regularization parameters and running Logistic Regression? Thanks, Burak - Original Message - From: "SK" To: u...@spark.incubator.apache.org Sent: Wednesday, August 6, 2014 6:18:4

PySpark + executor lost

2014-08-06 Thread Avishek Saha
Hi, I get a lot of executor lost errors for "saveAsTextFile" with PySpark and Hadoop 2.4. For small datasets this error occurs, but since the dataset is small it eventually gets written to the file. For large datasets, it takes forever to write the final output. Any help is appreciated. Avishek -

Re: Using Python IDE for Spark Application Development

2014-08-06 Thread Sathish Kumaran Vairavelu
Mohit, this doesn't seem to be working; can you please provide more details? When I use "from pyspark import SparkContext" it is disabled in PyCharm. I use the PyCharm community edition. Where should I set the environment variables, in the same Python script or a different Python script? Also, should I run a

memory issue on standalone master

2014-08-06 Thread BQ
Hi There, I'm starting to use Spark and hit a rookie problem. I used standalone mode with the master only, and here is what I did: ./sbin/start-master.sh ./bin/pyspark When I tried the example of wordcount.py, where my input file is a bit big, I got the out of memory error, which I excerpted a

Re: Regularization parameters

2014-08-06 Thread Mohit Singh
One possible straightforward explanation might be that your solution(s) are stuck in local minima, and depending on your weights initialization you are getting different parameters. Maybe use the same initial weights for both runs... or I would probably test the execution with a synthetic dataset

Re: Spark Streaming + reduceByWindow(reduceFunc, invReduceFunc, windowDuration, slideDuration

2014-08-06 Thread salemi
Hi, The reason I am looking to do it differently is because the latency and batch processing times are bad, about 40 sec. I took the times from the Streaming UI. As you suggested, I tried the window as below and still the times are bad. val dStream = KafkaUtils.createStream(ssc, zkQuorum, group, t

Column width limits?

2014-08-06 Thread Daniel, Ronald (ELS-SDG)
Assume I want to make a PairRDD whose keys are S3 URLs and whose values are Strings holding the contents of those (UTF-8) files, but NOT split into lines. Are there length limits on those files/Strings? 1 MB? 16 MB? 4 GB? 1 TB? Similarly, can such a thing be registered as a table so that I can us

RE: fail to run LBFGS in 5G KDD data in spark 1.0.1?

2014-08-06 Thread Lizhengbing (bing, BIPA)
I have tested it in spark-1.1.0-SNAPSHOT. It is OK now. From: Xiangrui Meng [mailto:men...@gmail.com] Sent: August 6, 2014 23:12 To: Lizhengbing (bing, BIPA) Cc: user@spark.apache.org Subject: Re: fail to run LBFGS in 5G KDD data in spark 1.0.1? Do you mind testing 1.1-SNAPSHOT and allocating more memory to th

Re: Column width limits?

2014-08-06 Thread Mayur Rustagi
Spark breaks data across machines at the partition level, so the realistic limit is on the partition size. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi On Thu, Aug 7, 2014 at 8:41 AM, Daniel, Ronald (ELS-SDG) < r.dan...@elsevier

spark-cassandra-connector issue

2014-08-06 Thread Gary Zhao
Hello I'm trying to modify the Spark sample app to integrate with Cassandra; however, I saw an exception when submitting the app. Does anyone know why it happens? Exception in thread "main" java.lang.NoClassDefFoundError: com/datastax/spark/connector/rdd/reader/RowReaderFactory at SimpleApp.main(SimpleApp.sc

Re: SparkR : lapplyPartition transforms the data in vertical format

2014-08-06 Thread Pranay Dave
Hello Shivaram Thanks for your reply. Here is a simple input data set. This data is in a file called "/sparkdev/datafiles/covariance.txt" 1,1 2,2 3,3 4,4 5,5 6,6 7,7 8,8 9,9 10,10 The output I would like to see is a total of the columns. It can be done with reduce, but I wanted to test lapply. Output I wa

Spark SQL

2014-08-06 Thread vdiwakar.malladi
Hi, I'm new to Spark. I configured Spark in standalone mode on my laptop. Right now I'm in the process of executing the samples to understand more of the details of the Spark components and how they work. When I'm trying to run the example program given for Spark SQL using the Java API, I'm unable to reso

Re: Using Python IDE for Spark Application Development

2014-08-06 Thread Mohit Singh
Take a look at this gist https://gist.github.com/bigaidream/40fe0f8267a80e7c9cf8 That worked for me. On Wed, Aug 6, 2014 at 7:32 PM, Sathish Kumaran Vairavelu < vsathishkuma...@gmail.com> wrote: > Mohit, This doesn't seems to be working can you please provide more > details? when I use "from pys

Re: Spark Streaming + reduceByWindow(reduceFunc, invReduceFunc, windowDuration, slideDuration

2014-08-06 Thread Tathagata Das
Okay, going back to your original question, it wasn't clear what reduce function you are trying to implement. Going by the 2nd example using the window() operation, followed by a count+filter (using SQL), I am guessing you are trying to maintain a count of all the active "states" in the l