Re: java.io.IOException: No space left on device--regd.

2015-07-06 Thread Akhil Das
You can also set these in the spark-env.sh file : export SPARK_WORKER_DIR="/mnt/spark/" export SPARK_LOCAL_DIR="/mnt/spark/" Thanks Best Regards On Mon, Jul 6, 2015 at 12:29 PM, Akhil Das wrote: > While the job is running, just look in the directory and see whats the > root cause of it (is i

How does Spark streaming move data around ?

2015-07-06 Thread Sela, Amit
I know that Spark uses data parallelism over, say, HDFS - optimally running computations on local data (aka data locality). I was wondering how Spark Streaming moves data (messages) around, since the data is streamed in as DStreams and is not on a distributed FS like HDFS. Thanks!

Re: Unable to start spark-sql

2015-07-06 Thread Akhil Das
It's complaining about a missing JDBC driver. Add it to your driver classpath like: ./bin/spark-sql --driver-class-path /home/akhld/sigmoid/spark/lib/mysql-connector-java-5.1.32-bin.jar Thanks Best Regards On Mon, Jul 6, 2015 at 11:42 AM, sandeep vura wrote: > Hi Sparkers, > > I am unable to start spark

Re: Unable to start spark-sql

2015-07-06 Thread sandeep vura
oK Let me try On Mon, Jul 6, 2015 at 12:38 PM, Akhil Das wrote: > Its complaining for a jdbc driver. Add it in your driver classpath like: > > ./bin/spark-sql --driver-class-path > /home/akhld/sigmoid/spark/lib/mysql-connector-java-5.1.32-bin.jar > > > Thanks > Best Regards > > On Mon, Jul 6, 2

Re: Benchmark results between Flink and Spark

2015-07-06 Thread Jan-Paul Bultmann
I would guess the opposite is true for highly iterative benchmarks (common in graph processing and data science). Spark has a pretty large overhead per iteration, and more optimisations and planning only make this worse. Sure, people have implemented things like Dijkstra's algorithm in Spark (a problem

Re: Benchmark results between Flink and Spark

2015-07-06 Thread Jan-Paul Bultmann
Sorry, that should be shortest path, and diameter of the graph. I shouldn't write emails before I get my morning coffee... > On 06 Jul 2015, at 09:09, Jan-Paul Bultmann wrote: > > I would guess the opposite is true for highly iterative benchmarks (common in > graph processing and data-science).

Spark SQL queries hive table, real time ?

2015-07-06 Thread spierki
Hello, I'm actually asking myself about the performance of using Spark SQL with Hive to do real-time analytics. I know that Hive has been created for batch processing, and Spark is used to do fast queries. But will using Spark SQL with Hive allow me to do real-time queries? Or it just will make fas

Re: Unable to start spark-sql

2015-07-06 Thread sandeep vura
Thanks a lot, Akhil. On Mon, Jul 6, 2015 at 12:57 PM, sandeep vura wrote: > It Works !!! > > On Mon, Jul 6, 2015 at 12:40 PM, sandeep vura > wrote: > >> oK Let me try >> >> >> On Mon, Jul 6, 2015 at 12:38 PM, Akhil Das >> wrote: > >>> Its complaining for a jdbc driver. Add it in your driver clas

Re: Unable to start spark-sql

2015-07-06 Thread sandeep vura
It Works !!! On Mon, Jul 6, 2015 at 12:40 PM, sandeep vura wrote: > oK Let me try > > > On Mon, Jul 6, 2015 at 12:38 PM, Akhil Das > wrote: > >> Its complaining for a jdbc driver. Add it in your driver classpath like: >> >> ./bin/spark-sql --driver-class-path >> /home/akhld/sigmoid/spark/lib/my

Split RDD into two in a single pass

2015-07-06 Thread Anand Nalya
Hi, I have an RDD which I want to split into two disjoint RDDs with a boolean function. I can do this with the following: val rdd1 = rdd.filter(f) val rdd2 = rdd.filter(fnot) I'm assuming that each of the above statements will traverse the RDD once, thus resulting in 2 passes. Is there a way of do
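A common workaround (not a true single pass, but it avoids recomputing the lineage for the second filter) is to cache the parent RDD before filtering twice; a minimal sketch:

val cached = rdd.cache()              // materialized once, on the first action
val rdd1 = cached.filter(f)           // elements where f is true
val rdd2 = cached.filter(x => !f(x))  // the complement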

com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException in spark with mysql database

2015-07-06 Thread Hafiz Mujadid
Hi! I am trying to load data from my MySQL database using the following code: val query="select * from "+table+" " val url = "jdbc:mysql://" + dataBaseHost + ":" + dataBasePort + "/" + dataBaseName + "?user=" + db_user + "&password=" + db_pass val sc = new SparkContext(new SparkConf().setAppNam

Re: java.lang.IllegalArgumentException: A metric named ... already exists

2015-07-06 Thread Juan Rodríguez Hortalá
Hi, I haven't been able to reproduce the error reliably, I will open a JIRA as soon as I can Greetings, Juan 2015-06-23 21:57 GMT+02:00 Tathagata Das : > Aaah this could be potentially major issue as it may prevent metrics from > restarted streaming context be not published. Can you make it a

Re: java.lang.IllegalArgumentException: A metric named ... already exists

2015-07-06 Thread Tathagata Das
I have already opened a JIRA about this. https://issues.apache.org/jira/browse/SPARK-8743 On Mon, Jul 6, 2015 at 1:02 AM, Juan Rodríguez Hortalá < juan.rodriguez.hort...@gmail.com> wrote: > Hi, > > I haven't been able to reproduce the error reliably, I will open a JIRA as > soon as I can > > Gre

Re: Split RDD into two in a single pass

2015-07-06 Thread Daniel Darabos
This comes up so often. I wonder if the documentation or the API could be changed to answer this question. The solution I found is from http://stackoverflow.com/questions/23995040/write-to-multiple-outputs-by-key-spark-one-spark-job. You basically write the items into two directories in a single p

How to shut down spark web UI?

2015-07-06 Thread luohui20001
Hello there, I heard that there is some way to shut down the Spark web UI; is there a configuration to support this? Thank you. Thanks&Best regards! San.Luo

Spark-CSV: Multiple delimiters and Null fields support

2015-07-06 Thread Anas Sherwani
Hi all, Apparently, we can only specify character delimiter for tokenizing data using Spark-CSV. But what if we have a log file with multiple delimiters or even a multi-character delimiter? e.g. (field1,field2:field3) with delimiters [,:] and (field1::field2::field3) with a single multi-character

Re: How to shut down spark web UI?

2015-07-06 Thread Shixiong Zhu
You can set "spark.ui.enabled" to "false" to disable the Web UI. Best Regards, Shixiong Zhu 2015-07-06 17:05 GMT+08:00 : > Hello there, > >I heard that there is some way to shutdown Spark WEB UI, is there a > configuration to support this? > > Thank you. > > ---

[SPARK-SQL] Re-use col alias in the select clause to avoid sub query

2015-07-06 Thread Hao Ren
Hi, I want to re-use a column alias in the select clause to avoid a sub query. For example: select check(key) as b, abs(b) as abs, value1, value2, ..., value30 from test The query above does not work, because b is not defined in test's schema. Instead, I should change the query to the followi
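A sketch of the usual rewrite: compute the alias in an inner query so the outer select can reference it (column names come from the example above; value3..value30 are elided, and the abs alias is renamed abs_b here to avoid shadowing the function name):

val result = sqlContext.sql("""
  SELECT b, abs(b) AS abs_b, value1, value2
  FROM (SELECT check(key) AS b, value1, value2 FROM test) t
""")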

Application jar file not found exception when submitting application

2015-07-06 Thread bit1...@163.com
Hi, I have the following shell script that submits the application to the cluster. But whenever I start the application, I encounter FileNotFoundException; after retrying several times, I can successfully submit it! SPARK=/data/software/spark-1.3.1-bin-2.4.0 APP_HOME=/data/software/spark-

[SparkR] Float type coercion with hiveContext

2015-07-06 Thread Evgeny Sinelnikov
Hello, I've got a problem with float type coercion on SparkR with hiveContext. > result <- sql(hiveContext, "SELECT offset, percentage from data limit 100") > show(result) DataFrame[offset:float, percentage:float] > head(result) Error in as.data.frame.default(x[[i]], optional = TRUE) : canno

Re: Application jar file not found exception when submitting application

2015-07-06 Thread Shixiong Zhu
Before running your script, could you confirm that " /data/software/spark-1.3.1-bin-2.4.0/applications/pss.am.core-1.0-SNAPSHOT-shaded.jar" exists? You might forget to build this jar. Best Regards, Shixiong Zhu 2015-07-06 18:14 GMT+08:00 bit1...@163.com : > Hi, > I have following shell script th

Spark's equivalent for Analytical functions in Oracle

2015-07-06 Thread Gireesh Puthumana
Hi there, I would like to check with you whether there is any equivalent functions of Oracle's analytical functions in Spark SQL. For example, if I have following data set (table T): *EID|DEPT* *101|COMP* *102|COMP* *103|COMP* *104|MARK* In Oracle, I can do something like *select EID, DEPT, coun

Re: Re: Application jar file not found exception when submitting application

2015-07-06 Thread bit1...@163.com
Thanks Shixiong for the reply. Yes, I confirm that the file exists there; a simple check with ls -l /data/software/spark-1.3.1-bin-2.4.0/applications/pss.am.core-1.0-SNAPSHOT-shaded.jar bit1...@163.com From: Shixiong Zhu Date: 2015-07-06 18:41 To: bit1...@163.com CC: user Subject: Re: Applica

Spark equivalent for Oracle's analytical functions

2015-07-06 Thread gireeshp
Is there any equivalent of Oracle's *analytical functions* in Spark SQL. For example, if I have following data set (say table T): /EID|DEPT 101|COMP 102|COMP 103|COMP 104|MARK/ In Oracle, I can do something like /select EID, DEPT, count(1) over (partition by DEPT) CNT from T;/ to get: /EID|DEPT|

Re: 1.4.0 regression: out-of-memory errors on small data

2015-07-06 Thread Denny Lee
I went ahead and tested your file and the results from the tests can be seen in the gist: https://gist.github.com/dennyglee/c933b5ae01c57bd01d94. Basically, when running {Java 7, MaxPermSize = 256} or {Java 8, default} the query ran without any issues. I was able to recreate the issue with {Java

writing to kafka using spark streaming

2015-07-06 Thread Shushant Arora
I have a requirement to write to a Kafka queue from a Spark Streaming application. I am using Spark 1.2 streaming. Since different executors in Spark are allocated at each run, instantiating a new Kafka producer at each run seems a costly operation. Is there a way to reuse objects in processing ex

Re: Spark SQL queries hive table, real time ?

2015-07-06 Thread Denny Lee
Within the context of your question, Spark SQL utilizing the Hive context is primarily about very fast queries. If you want to use real-time queries, I would utilize Spark Streaming. A couple of great resources on this topic include Guest Lecture on Spark Streaming in Stanford CME 323: Distribute

Re: [SparkR] Float type coercion with hiveContext

2015-07-06 Thread Evgeny Sinelnikov
I used spark 1.4.0 binaries from official site: http://spark.apache.org/downloads.html And running it on: * Hortonworks HDP 2.2.0.0-2041 * with Hive 0.14 * with disabled hooks for Application Timeline Servers (ATSHook) in hive-site.xml (commented hive.exec.failure.hooks, hive.exec.post.hooks, hive

Re: Spark's equivalent for Analytical functions in Oracle

2015-07-06 Thread ayan guha
It's available in Spark 1.4 under DataFrame window operations. Apparently the programming doc does not mention it; you need to look at the APIs. On Mon, Jul 6, 2015 at 8:50 PM, Gireesh Puthumana < gireesh.puthum...@augmentiq.in> wrote: > Hi there, > > I would like to check with you whether there is any
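For reference, a minimal sketch of the Spark 1.4 DataFrame window API for the count(*) over (partition by DEPT) example, assuming a DataFrame df with EID and DEPT columns (window functions require a HiveContext in 1.4):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.count

val w = Window.partitionBy("DEPT")
val withCnt = df.select(df("EID"), df("DEPT"), count(df("EID")).over(w).as("CNT"))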

kafka offset commit in spark streaming 1.2

2015-07-06 Thread Shushant Arora
In Spark Streaming 1.2, are the offsets of consumed Kafka messages updated in ZooKeeper only after writing to the WAL (if WAL and checkpointing are enabled), or does it depend on the kafkaParams used while initializing the KafkaDStream? Map kafkaParams = new HashMap(); kafkaParams.put("zookeeper.connect","ip:2181");

RE: kafka offset commit in spark streaming 1.2

2015-07-06 Thread Shao, Saisai
If you’re using WAL with Kafka, Spark Streaming will ignore this configuration(autocommit.enable) and explicitly call commitOffset to update offset to Kafka AFTER WAL is done. No matter what you’re setting with autocommit.enable, internally Spark Streaming will set it to false to turn off autoc

Re: kafka offset commit in spark streaming 1.2

2015-07-06 Thread Shushant Arora
And what if I disable WAL and use replication of receiver data with StorageLevel.MEMORY_ONLY_2()? Will it commit the offset after replicating the message, or will it use the autocommit.enable value? And if it uses this value, what if autocommit.enable is set to false - then when does the receiver call commitOffse

Re: Spark custom streaming receiver not storing data reliably?

2015-07-06 Thread Ajit Bhingarkar
The inconsistency is resolved; I can see rules getting fired consistently and reliably across a file-based source, a stream (of file data), and a JMS stream. I am running more tests up to 50M facts/events, but it looks like it is working now. Regards, Ajit On Mon, Jul 6, 2015 at 11:59 AM, Ajit Bhi

RE: kafka offset commit in spark streaming 1.2

2015-07-06 Thread Shao, Saisai
If you disable WAL, Spark Streaming itself will not manage any offset-related things. If auto commit is enabled (true), Kafka itself will update offsets in a time-based way; if auto commit is disabled, nothing will call commitOffset and you need to call this API yourself. Also Kafka’s offset

Re: kafka offset commit in spark streaming 1.2

2015-07-06 Thread Shushant Arora
So if WAL is disabled, how can a developer commit offsets explicitly in a Spark Streaming app, since we don't write code that will be executed in the receiver? Plus, since offset committing is asynchronous, is it possible that the last offset is not committed yet and the next stream batch started on rece

How Will Spark Execute below Code - Driver and Executors

2015-07-06 Thread Ashish Soni
Hi All, If someone can help me understand which portion of the code below gets executed on the driver and which portion will be executed on the executors, it would be a great help. I have to load data from 10 tables and then use that data in various manipulations, and I am using Spark SQL f

RE: kafka offset commit in spark streaming 1.2

2015-07-06 Thread Shao, Saisai
Please see the inline comments. From: Shushant Arora [mailto:shushantaror...@gmail.com] Sent: Monday, July 6, 2015 8:51 PM To: Shao, Saisai Cc: user Subject: Re: kafka offset commit in spark streaming 1.2 So If WAL is disabled, how developer can commit offset explicitly in spark streaming app si

Re: writing to kafka using spark streaming

2015-07-06 Thread Cody Koeninger
Use foreachPartition, and allocate whatever the costly resource is once per partition. On Mon, Jul 6, 2015 at 6:11 AM, Shushant Arora wrote: > I have a requirement to write in kafka queue from a spark streaming > application. > > I am using spark 1.2 streaming. Since different executors in spark
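A minimal sketch of that pattern, assuming the DStream elements are already strings, a Kafka 0.8.2+ producer on the classpath, and placeholder broker/topic names:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

stream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092")   // placeholder
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)  // one producer per partition, not per record
    records.foreach(msg => producer.send(new ProducerRecord[String, String]("output-topic", msg)))
    producer.close()
  }
}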

Converting spark JDBCRDD to DataFrame

2015-07-06 Thread Hafiz Mujadid
Hi all! what is the most efficient way to convert jdbcRDD to DataFrame. any example? Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Converting-spark-JDBCRDD-to-DataFrame-tp23647.html Sent from the Apache Spark User List mailing list archive at Nab

Re: Restarting Spark Streaming Application with new code

2015-07-06 Thread Cody Koeninger
You shouldn't rely on being able to restart from a checkpoint after changing code, regardless of whether the change was explicitly related to serialization. If you are relying on checkpoints to hold state, specifically which offsets have been processed, that state will be lost if you can't recover

Re: How Will Spark Execute below Code - Driver and Executors

2015-07-06 Thread ayan guha
Join happens on executor. Else spark would not be much of a distributed computing engine :) Reads happen on executor too. Your options are passed to executors and conn objects are created in executors. On 6 Jul 2015 22:58, "Ashish Soni" wrote: > Hi All , > > If some one can help me understand as

Re: DESCRIBE FORMATTED doesn't work in Hive Thrift Server?

2015-07-06 Thread Ted Yu
What version of Hive and Spark are you using ? Cheers On Sun, Jul 5, 2015 at 10:53 PM, Rex Xiong wrote: > Hi, > > I try to use for one table created in spark, but it seems the results are > all empty, I want to get metadata for table, what's other options? > > Thanks > > +--

Re: 1.4.0 regression: out-of-memory errors on small data

2015-07-06 Thread Yin Huai
Hi Sim, I think the right way to set the PermGen Size is through driver extra JVM options, i.e. --conf "spark.driver.extraJavaOptions=-XX:MaxPermSize=256m" Can you try it? Without this conf, your driver's PermGen size is still 128m. Thanks, Yin On Mon, Jul 6, 2015 at 4:07 AM, Denny Lee wrote

Re: 1.4.0 regression: out-of-memory errors on small data

2015-07-06 Thread Simeon Simeonov
Yin, that did the trick. I'm curious what was the effect of the environment variable, however, as the behavior of the shell changed from hanging to quitting when the env var value got to 1g. /Sim Simeon Simeonov, Founder & CTO, Swoop @simeons | b

Re: Streaming: updating broadcast variables

2015-07-06 Thread Conor Fennell
Hi James, The code below shows one way you can update the broadcast variable on the executors: // ... events stream setup var startTime = new Date().getTime() var hashMap = HashMap("1" -> ("1", 1), "2" -> ("2", 2)) var hashMapBroadcast = stream.context.sparkContext.broadcas
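For context, a minimal sketch of that rebroadcast pattern (ssc is the StreamingContext; loadReferenceData, process and the refresh interval are hypothetical): the swap happens on the driver inside foreachRDD, and executors pick up whichever broadcast the batch's closure captured.

var reference = ssc.sparkContext.broadcast(loadReferenceData())   // loadReferenceData is hypothetical
var lastRefresh = System.currentTimeMillis()

stream.foreachRDD { rdd =>
  if (System.currentTimeMillis() - lastRefresh > 60000) {          // runs on the driver each batch
    reference.unpersist(blocking = false)
    reference = rdd.sparkContext.broadcast(loadReferenceData())
    lastRefresh = System.currentTimeMillis()
  }
  rdd.foreachPartition { records =>
    val lookup = reference.value                                   // executors read the current value
    records.foreach(r => process(r, lookup))                       // process is a placeholder
  }
}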

Re: 1.4.0 regression: out-of-memory errors on small data

2015-07-06 Thread Yin Huai
You meant "SPARK_REPL_OPTS"? I did a quick search. Looks like it has been removed since 1.0. I think it did not affect the behavior of the shell. On Mon, Jul 6, 2015 at 9:04 AM, Simeon Simeonov wrote: > Yin, that did the trick. > > I'm curious what was the effect of the environment variable,

How to call hiveContext.sql() on all the Hive partitions in parallel?

2015-07-06 Thread kachau
Hi, I have to fire a few insert into queries which use Hive partitions. I have two Hive partitions named server and date. Now I execute the insert into queries using hiveContext as shown below, and the query works fine: hiveContext.sql("insert into summary1 partition(server='a1',date='2015-05-22') select from sou

How do we control output part files created by Spark job?

2015-07-06 Thread kachau
Hi, I have a couple of Spark jobs which process thousands of files every day. File size may vary from MBs to GBs. After finishing a job I usually save using the following code: finalJavaRDD.saveAsParquetFile("/path/in/hdfs"); OR dataFrame.write.format("orc").save("/path/in/hdfs") //storing as ORC

How to create a LabeledPoint RDD from a Data Frame

2015-07-06 Thread Sourav Mazumder
Hi, I have a DataFrame which I want to use for creating a RandomForest model using MLlib. The RandomForest model needs an RDD of LabeledPoints. Wondering how I can convert the DataFrame to a LabeledPoint RDD. Regards, Sourav
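A minimal sketch of the direct conversion, assuming the label is the first column and the remaining columns are all numeric:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val labeledPoints = df.map { row =>
  val label = row.getDouble(0)                                  // assumes label is column 0
  val features = (1 until row.length).map(row.getDouble).toArray
  LabeledPoint(label, Vectors.dense(features))
}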

Re: writing to kafka using spark streaming

2015-07-06 Thread Tathagata Das
Yeah, creating a new producer at the granularity of partitions may not be that costly. On Mon, Jul 6, 2015 at 6:40 AM, Cody Koeninger wrote: > Use foreachPartition, and allocate whatever the costly resource is once > per partition. > > On Mon, Jul 6, 2015 at 6:11 AM, Shushant Arora > wrote: > >

Cluster sizing for recommendations

2015-07-06 Thread Danny Yates
Hi, I'm having trouble building a recommender and would appreciate a few pointers. I have 350,000,000 events which are stored in roughly 500,000 S3 files and are formatted as semi-structured JSON. These events are not all relevant to making recommendations. My code is (roughly): case class Even

Re: Converting spark JDBCRDD to DataFrame

2015-07-06 Thread Michael Armbrust
Use the built in JDBC data source: https://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases On Mon, Jul 6, 2015 at 6:42 AM, Hafiz Mujadid wrote: > Hi all! > > what is the most efficient way to convert jdbcRDD to DataFrame. > > any example? > > Thanks > > > > -- >
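A minimal sketch of that source with the Spark 1.4 DataFrameReader (url, table and credentials are placeholders):

val df = sqlContext.read.format("jdbc").options(Map(
  "url"     -> "jdbc:mysql://host:3306/db?user=u&password=p",
  "dbtable" -> "my_table",
  "driver"  -> "com.mysql.jdbc.Driver"
)).load()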

Re: writing to kafka using spark streaming

2015-07-06 Thread Shushant Arora
What's the difference between foreachPartition and mapPartitions for a DStream? Both work at partition granularity. One is a transformation and the other is an action, but if I also call an action after mapPartitions, which one is more efficient and recommended? On Tue, Jul 7, 2015 at 12:21 AM, Tath

Re: writing to kafka using spark streaming

2015-07-06 Thread Tathagata Das
Both have same efficiency. The primary difference is that one is a transformation (hence is lazy, and requires another action to actually execute), and the other is an action. But it may be a slightly better design in general to have "transformations" be purely functional (that is, no external side

Re: How do we control output part files created by Spark job?

2015-07-06 Thread Sathish Kumaran Vairavelu
Try the coalesce function to limit the number of part files. On Mon, Jul 6, 2015 at 1:23 PM kachau wrote: > Hi I am having couple of Spark jobs which processes thousands of files > every > day. File size may very from MBs to GBs. After finishing job I usually save > using the following code > > finalJavaRDD.s
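A minimal sketch for the DataFrame case from the question (10 is an arbitrary target count; the same coalesce call exists on RDDs):

// fewer partitions at write time => fewer part files under the output path
dataFrame.coalesce(10).write.format("orc").save("/path/in/hdfs")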

Master doesn't start, no logs

2015-07-06 Thread maxdml
Hi, I've been compiling Spark 1.4.0 with SBT, from the source tarball available on the official website. I cannot run Spark's master, even though I have built and run several other instances of Spark on the same machine (spark 1.3, master branch, pre-built 1.4, ...). starting org.apache.spark.deploy.

Re: Are Spark Streaming RDDs always processed in order?

2015-07-06 Thread Tathagata Das
Yes, RDD of batch t+1 will be processed only after RDD of batch t has been processed. Unless there are errors where the batch completely fails to get processed, in which case the point is moot. Just reinforcing the concept further. Additional information: This is true in the default configuration.

User Defined Functions - Execution on Clusters

2015-07-06 Thread Eskilson,Aleksander
Hi there, I’m trying to get a feel for how User Defined Functions from SparkSQL (as written in Python and registered using the udf function from pyspark.sql.functions) are run behind the scenes. Trying to grok the source it seems that the native Python function is serialized for distribution to

Spark standalone cluster - Output file stored in temporary directory in worker

2015-07-06 Thread MorEru
I have a Spark standalone cluster with 2 workers - Master and one slave thread run on a single machine -- Machine 1 Another slave running on a separate machine -- Machine 2 I am running a spark shell in the 2nd machine that reads a file from hdfs and does some calculations on them and stores the

Spark application with a RESTful API

2015-07-06 Thread Sagi r
Hi, I've been researching Spark for a couple of months now, and I strongly believe it can solve our problem. We are developing an application that allows the user to analyze various sources of information. We are dealing with non-technical users, so simply giving them an interactive shell won't

How to create empty RDD

2015-07-06 Thread ๏̯͡๏
I need to return an empty RDD of type val output: RDD[(DetailInputRecord, VISummary)] This does not work: val output: RDD[(DetailInputRecord, VISummary)] = new RDD() as RDD is an abstract class. How do I create an empty RDD? -- Deepak

Re: How to create empty RDD

2015-07-06 Thread Richard Marscher
This should work: val output: RDD[(DetailInputRecord, VISummary)] = sc.parallelize(Seq.empty[(DetailInputRecord, VISummary)]) On Mon, Jul 6, 2015 at 5:11 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote: > I need to return an empty RDD of type > > val output: RDD[(DetailInputRecord, VISummary)] > > > > This does not wor

Re: How to recover in case user errors in streaming

2015-07-06 Thread Tathagata Das
1. onBatchError is not a bad idea. 2. It works for the Kafka direct API and files as well. They all have batches. However, you will not get the number of records for the file stream. 3. Mind giving an example of the exception you would like to see caught? TD On Wed, Jul 1, 2015 at 10:35 AM, Am

How does executor cores change the spark job behavior ?

2015-07-06 Thread ๏̯͡๏
I have a simple job that reads data => union => filter => map and then counts. 1 job started; 2402 tasks read 149G of input. I started the job with different numbers of executors: 1) 1 --> 8.3 mins 2) 2 --> 5.6 mins 3) 3 --> 3.1 mins 1) Why is increasing the cores speeding up this app? 2) I start

Job consistently failing after leftOuterJoin() - oddly sized / non-uniform partitions

2015-07-06 Thread Mohammed Omer
Afternoon all, Really loving this project and the community behind it. Thank you all for your hard work. This past week, though, I've been having a hard time getting my first deployed job to run without failing at the same point every time: Right after a leftOuterJoin, most partitions (600 total

Random Forest in MLLib

2015-07-06 Thread Sourav Mazumder
Hi, Is there a way to get variable importance for a RandomForest model created using MLlib? This way one can know, among multiple features, which ones contribute the most to the dependent variable. Regards, Sourav

Re: Random Forest in MLLib

2015-07-06 Thread Feynman Liang
Not yet, though work on this feature has begun (SPARK-5133) On Mon, Jul 6, 2015 at 4:46 PM, Sourav Mazumder wrote: > Hi, > > Is there a way to get variable importance for RandomForest model created > using MLLib ? This way one can know among mul

Re: User Defined Functions - Execution on Clusters

2015-07-06 Thread Davies Liu
Currently, Python UDFs run in Python instances and are MUCH slower than Scala ones (from 10 to 100x). There is a JIRA to improve the performance: https://issues.apache.org/jira/browse/SPARK-8632. After that, they will still be much slower than Scala ones (because Python is slower and the overhead for c

How to debug java.io.OptionalDataException issues

2015-07-06 Thread Yana Kadiyska
Hi folks, suffering from a pretty strange issue: Is there a way to tell what object is being successfully serialized/deserialized? I have a maven-installed jar that works well when fat jarred within another, but shows the following stack when marked as provided and copied to the runtime classpath.

Re: Spark standalone cluster - Output file stored in temporary directory in worker

2015-07-06 Thread maxdml
Can you share your Hadoop configuration files please? - etc/hadoop/core-site.xml - etc/hadoop/hdfs-site.xml - etc/hadoop/hadoop-env.sh AFAIK, the following properties should be configured: hadoop.tmp.dir, dfs.namenode.name.dir, dfs.datanode.data.dir and dfs.namenode.checkpoint.dir Otherwise, an H

Spark Unit tests - RDDBlockId not found

2015-07-06 Thread Malte
I am running unit tests on Spark 1.3.1 with sbt test and besides the unit tests being incredibly slow I keep running into java.lang.ClassNotFoundException: org.apache.spark.storage.RDDBlockId issues. Usually this means a dependency issue, but I wouldn't know from where... Any help is greatly appre

JVM is not ready after 10 seconds.

2015-07-06 Thread Ashish Dutt
Hi, I am trying to connect a worker to the master. The spark master is on cloudera manager and I know the master IP address and port number. I downloaded the spark binary for CDH4 on the worker machine and then when I try to invoke the command > sc = sparkR.init("master="ip address:port number") I

Re: Job consistently failing after leftOuterJoin() - oddly sized / non-uniform partitions

2015-07-06 Thread ayan guha
You can bump up the number of partitions with a parameter in the join operator. However, you have a data skew problem which you need to resolve using a reasonable partitioning function. On 7 Jul 2015 08:57, "Mohammed Omer" wrote: > Afternoon all, > > Really loving this project and the community behind it. Tha

Re: JVM is not ready after 10 seconds

2015-07-06 Thread Shivaram Venkataraman
When I've seen this error before it has been due to the spark-submit file (i.e. `C:\spark-1.4.0\bin/bin/spark-submit.cmd`) not having execute permissions. You can try to set execute permission and see if it fixes things. Also we have a PR open to fix a related problem at https://github.com/apache/

RE: Spark application with a RESTful API

2015-07-06 Thread Mohammed Guller
It is not a bad idea. Many people use this approach. Mohammed -Original Message- From: Sagi r [mailto:stsa...@gmail.com] Sent: Monday, July 6, 2015 1:58 PM To: user@spark.apache.org Subject: Spark application with a RESTful API Hi, I've been researching spark for a couple of months no

RE: How to create a LabeledPoint RDD from a Data Frame

2015-07-06 Thread Mohammed Guller
Have you looked at the new Spark ML library? You can use a DataFrame directly with the Spark ML API. https://spark.apache.org/docs/latest/ml-guide.html Mohammed From: Sourav Mazumder [mailto:sourav.mazumde...@gmail.com] Sent: Monday, July 6, 2015 10:29 AM To: user Subject: How to create a Labe

Re: JVM is not ready after 10 seconds

2015-07-06 Thread Ashish Dutt
Hello Shivaram, Thank you for your response. Being a novice at this stage, can you also tell me how to configure or set the execute permission for the spark-submit file? Thank you for your time. Sincerely, Ashish Dutt On Tue, Jul 7, 2015 at 9:21 AM, Shivaram Venkataraman < shiva...@eecs.berkeley.ed

Re: Are Spark Streaming RDDs always processed in order?

2015-07-06 Thread Khaled Hammouda
Great! That's what I gathered from the thread titled "Serial batching with Spark Streaming", but thanks for confirming this again. On 6 July 2015 at 15:31, Tathagata Das wrote: > Yes, RDD of batch t+1 will be processed only after RDD of batch t has been > processed. Unless there are errors where

RE: How do we control output part files created by Spark job?

2015-07-06 Thread Mohammed Guller
You could repartition the dataframe before saving it. However, that would impact the parallelism of the next jobs that read these files from HDFS. Mohammed -Original Message- From: kachau [mailto:umesh.ka...@gmail.com] Sent: Monday, July 6, 2015 10:23 AM To: user@spark.apache.org Subje

Re: Why Kryo Serializer is slower than Java Serializer in TeraSort

2015-07-06 Thread Gylfi
Hi. Just a few quick comments on your question. If you drill in (click the link of the subtasks) you can get a more detailed view of the tasks. One of the things reported is the time for serialization. If that is your dominant factor it should be reflected there, right? Are you sure the inpu

Re: JVM is not ready after 10 seconds

2015-07-06 Thread Ashish Dutt
Hi, These are the settings in my spark-conf file on the worker machine from where I am trying to access the Spark server. I think I need to first configure the spark-submit file too, but I do not know how. Can somebody advise me? # Default system properties included when running spark-submit

Re: How do we control output part files created by Spark job?

2015-07-06 Thread Gylfi
Hi. Have you tried to repartition the finalRDD before saving? This link might help. http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter3/save_the_rdd_to_files.html Regards, Gylfi. -- View this message in context: http://apache-spark-user-

Please add the Chicago Spark Users' Group to the community page

2015-07-06 Thread Dean Wampler
Here's our home page: http://www.meetup.com/Chicago-Spark-Users/ Thanks, Dean Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition (O'Reilly) Typesafe @deanwampler http://polyglotprogra

RE: Spark SQL queries hive table, real time ?

2015-07-06 Thread Mohammed Guller
Hi Florian, It depends on a number of factors. How much data are you querying? Where is the data stored (HDD, SSD or DRAM)? What is the file format (Parquet or CSV)? In theory, it is possible to use Spark SQL for real-time queries, but cost increases as the data size grows. If you can store all

Re: how to black list nodes on the cluster

2015-07-06 Thread Gylfi
Hi. Have you tried to enable speculative execution? This will allow Spark to run the same sub-task of the job on other available slots when slow tasks are encountered. This can be passed at execution time; the params are: spark.speculation spark.speculation.interval spark.spe
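A minimal sketch of enabling it via SparkConf (the same keys can be passed as --conf options at submit time; the values shown are the usual defaults, and the interval is in milliseconds in Spark 1.x):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.speculation", "true")            // enable speculative execution
  .set("spark.speculation.interval", "100")    // how often to check for slow tasks (ms)
  .set("spark.speculation.multiplier", "1.5")  // how many times slower than the median counts as slow
  .set("spark.speculation.quantile", "0.75")   // fraction of tasks that must finish before speculating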

The auxService:spark_shuffle does not exist

2015-07-06 Thread roy
I am getting the following error for a simple Spark job. I am running the following command: spark-submit --class org.apache.spark.examples.SparkPi --deploy-mode cluster --master yarn /opt/cloudera/parcels/CDH/lib/spark/lib/spark-examples-1.2.0-cdh5.3.1-hadoop2.5.0-cdh5.3.1.jar but the job doesn't show any p

Re: Re: How to shut down spark web UI?

2015-07-06 Thread luohui20001
Got it, thanks. Thanks&Best regards! San.Luo - Original Message - From: Shixiong Zhu To: 罗辉 Cc: user Subject: Re: How to shut down spark web UI? Date: 2015-07-06 17:31 You can set "spark.ui.enabled" to "false" to disable the Web UI. Best Regards, Shixiong Zhu 2015-07-

Re: Spark SQL queries hive table, real time ?

2015-07-06 Thread Jörn Franke
Hive using tez has recently (1.2.0) become much faster (if you use the ORC format), so that for most of the use cases it will be sufficient. Alternatively you could use as well SparkSQL (if you have the memory) or apache phoenix. The latter one has - currently - a little bit less SQL support and re

Re: Please add the Chicago Spark Users' Group to the community page

2015-07-06 Thread Denny Lee
Hey Dean, Sure, will take care of this. HTH, Denny On Tue, Jul 7, 2015 at 10:07 Dean Wampler wrote: > Here's our home page: http://www.meetup.com/Chicago-Spark-Users/ > > Thanks, > Dean > > Dean Wampler, Ph.D. > Author: Programming Scala, 2nd Edition >

Re: com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException in spark with mysql database

2015-07-06 Thread Sathish Kumaran Vairavelu
Try including alias in the query. val query="(select * from "+table+") a" On Mon, Jul 6, 2015 at 3:38 AM Hafiz Mujadid wrote: > Hi! > I am trying to load data from my sql database using following code > > val query="select * from "+table+" " > val url = "jdbc:mysql://" + dataBaseHost + "
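A minimal sketch of the fix with the DataFrame JDBC source (url and table come from the question; the JDBC source needs a table name or an aliased subquery as dbtable):

val query = "(select * from " + table + ") a"   // aliased subquery
val df = sqlContext.read.format("jdbc")
  .options(Map("url" -> url, "dbtable" -> query))
  .load()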

Re: How to recover in case user errors in streaming

2015-07-06 Thread Li,ZhiChao
Hi Cody and TD, Just trying to understand this under the hood, but I cannot find any place for this specific logic: "once you reach max failures the whole stream will stop". If possible, could you point me in the right direction? For my understanding, the exception thrown from the job would n

Re: writing to kafka using spark streaming

2015-07-06 Thread Shushant Arora
On using foreachPartition, the jobs that get created are not displayed on the driver console but are visible in the web UI. On the driver it prints some stage statistics of the form [Stage 2:> (0 + 2) / 5] which then disappear. I am using foreachPartition as: kafkaStream.foreachRDD

Re: How to create empty RDD

2015-07-06 Thread Wei Zhou
I used val output: RDD[(DetailInputRecord, VISummary)] = sc.emptyRDD[(DetailInputRecord, VISummary)] to create an empty RDD before. Give it a try, it might work for you too. 2015-07-06 14:11 GMT-07:00 ÐΞ€ρ@Ҝ (๏̯͡๏) : > I need to return an empty RDD of type > > val output: RDD[(DetailInputRecord, VI

Hibench build fail

2015-07-06 Thread luohui20001
Hi Grace, recently I have been trying HiBench to evaluate my Spark cluster; however, I got a problem building HiBench. Would you help take a look? Thanks. It fails at building SparkBench, and you may check the attached pic for more info. My Spark version: 1.3.1, Hadoop version: 2.7.0 and