S3 vs HDFS

2015-07-08 Thread Brandon White
Are there any significant performance differences between reading text files from S3 versus HDFS?

Re: Is there a way to shutdown the derby in hive context in spark shell?

2015-07-08 Thread Akhil Das
Did you try sc.shutdown and creating a new one? Thanks Best Regards On Wed, Jul 8, 2015 at 8:12 PM, Terry Hole wrote: > I am using spark 1.4.1rc1 with default hive settings > > Thanks > - Terry > > Hi All, > > I'd like to use the hive context in spark shell, i need to recreate the > hive meta d
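
A minimal sketch of the "stop and recreate" idea (assuming Spark 1.4 in spark-shell; the documented SparkContext call is stop(), not shutdown()):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

sc.stop()                                              // stop the context created by the shell
val sc2 = new SparkContext(new SparkConf().setAppName("fresh-hive-context"))
val hiveContext = new HiveContext(sc2)                 // new HiveContext bound to the new context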

Re: Using Hive UDF in spark

2015-07-08 Thread ayan guha
You are most likely confused because you are using the UDF using HiveContext. In your case, you are using Spark UDF, not Hive UDF. For a naive scenario, I can use spark UDFs without any hive installation in my cluster. sqlContext.udf.register is for UDF in spark. Hive UDFs are stored in Hive and y
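
A hedged sketch of the distinction (function and class names are hypothetical): a UDF registered through sqlContext.udf.register exists only for that context, while a Hive UDF is declared through the HiveContext or in Hive itself:

// Spark UDF: lives only for the lifetime of this SQLContext
sqlContext.udf.register("to_upper", (s: String) => s.toUpperCase)

// Hive UDF: declared through HiveContext (temporary) or registered in Hive itself;
// the class below is a hypothetical UDF packaged in a jar already on the classpath
hiveContext.sql("CREATE TEMPORARY FUNCTION my_upper AS 'com.example.hive.MyUpperUDF'")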

Re: PySpark without PySpark

2015-07-08 Thread Ashish Dutt
Hi Sujit, Thanks for your response. So I opened a new notebook using the command ipython notebook --profile spark and tried the sequence of commands. I am getting errors. Attached is the screenshot of the same. Also I am attaching the 00-pyspark-setup.py for your reference. Looks like, I have wri

Re: Spark query

2015-07-08 Thread Brandon White
Convert the column to a column of java Timestamps. Then you can do the following import java.sql.Timestamp import java.util.Calendar def date_trunc(timestamp:Timestamp, timeField:String) = { timeField match { case "hour" => val cal = Calendar.getInstance() cal.setTimeInMillis
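
The preview cuts the snippet off; a minimal completed sketch of the same idea (the "day" branch and the return type are assumptions):

import java.sql.Timestamp
import java.util.Calendar

def dateTrunc(timestamp: Timestamp, timeField: String): Timestamp = {
  val cal = Calendar.getInstance()
  cal.setTimeInMillis(timestamp.getTime)
  timeField match {
    case "hour" =>
      cal.set(Calendar.MINUTE, 0); cal.set(Calendar.SECOND, 0); cal.set(Calendar.MILLISECOND, 0)
    case "day" =>
      cal.set(Calendar.HOUR_OF_DAY, 0); cal.set(Calendar.MINUTE, 0)
      cal.set(Calendar.SECOND, 0); cal.set(Calendar.MILLISECOND, 0)
  }
  new Timestamp(cal.getTimeInMillis)   // truncated to the start of the hour or day
}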

Using Hive UDF in spark

2015-07-08 Thread vinod kumar
Hi everyone, Can we use a UDF defined in Hive from Spark SQL? I've created a UDF in Spark and registered it using sqlContext.udf.register, but when I restarted the service the UDF was no longer available. I've heard that Hive UDFs are permanently stored in Hive (please correct me if I am wrong). Thanks

Re: Writing data to hbase using Sparkstreaming

2015-07-08 Thread Ted Yu
bq. return new Tuple2(new ImmutableBytesWritable(), put); I don't think Put is serializable. FYI On Fri, Jun 12, 2015 at 6:40 AM, Vamshi Krishna wrote: > Hi I am trying to write data that is produced from kafka commandline > producer for
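
A hedged sketch of the usual workaround (HBase 0.98-era client API assumed; table and column names are hypothetical): create the Put inside foreachPartition so it is built on the executor and never serialized with the closure:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{HConnectionManager, Put}
import org.apache.hadoop.hbase.util.Bytes

// rdd is assumed to be an existing RDD[(String, String)] of (rowKey, value) pairs
rdd.foreachPartition { records =>
  val conn = HConnectionManager.createConnection(HBaseConfiguration.create())
  val table = conn.getTable("events")
  records.foreach { case (rowKey, value) =>
    val put = new Put(Bytes.toBytes(rowKey))   // built per record, on the executor
    put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(value))
    table.put(put)
  }
  table.close()
  conn.close()
}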

Re: Spark query

2015-07-08 Thread Harish Butani
try the spark-datetime package: https://github.com/SparklineData/spark-datetime Follow this example https://github.com/SparklineData/spark-datetime#a-basic-example to get the different attributes of a DateTime. On Wed, Jul 8, 2015 at 9:11 PM, prosp4300 wrote: > As mentioned in Spark sQL programm

SparkR dataFrame read.df fails to read from aws s3

2015-07-08 Thread Ben Spark
I have Spark 1.4 deployed on AWS EMR, but the SparkR DataFrame read.df method cannot load data from AWS S3. 1) "read.df" error message: read.df(sqlContext,"s3://some-bucket/some.json","json") 15/07/09 04:07:01 ERROR r.RBackendHandler: loadDF on org.apache.spark.sql.api.r.SQLUtils failed jav

Re:Spark query

2015-07-08 Thread prosp4300
As mentioned in the Spark SQL programming guide, Spark SQL supports Hive UDFs. Please take a look at the built-in date UDFs of Hive below; getting the day of year should be as simple as in an existing RDBMS. https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-DateFunctions At 2015-07-09 12:

Re: PySpark without PySpark

2015-07-08 Thread Bhupendra Mishra
Very interesting and well organized post. Thanks for sharing On Wed, Jul 8, 2015 at 10:29 PM, Sujit Pal wrote: > Hi Julian, > > I recently built a Python+Spark application to do search relevance > analytics. I use spark-submit to submit PySpark jobs to a Spark cluster on > EC2 (so I don't use th

Re: how to use DoubleRDDFunctions on mllib Vector?

2015-07-08 Thread prosp4300
Seems what Feynman mentioned is the source code instead of documentation, vectorMean is private, see https://github.com/apache/spark/blob/v1.3.0/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala At 2015-07-09 10:10:58, "诺铁" wrote: thanks, I understand now. but I c

Spark query

2015-07-08 Thread Ravisankar Mani
Hi everyone, I can't get 'day of year' when using a Spark query. Can you suggest any way to get the day of year? Regards, Ravi

Re: Spark program throws NIO Buffer over flow error (TDigest - Ted Dunning lib)

2015-07-08 Thread Ted Yu
Doesn't seem to be a Spark problem, assuming TDigest comes from Mahout. Cheers On Wed, Jul 8, 2015 at 7:49 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote: > Same exception with different values of compression (10,100) > > var digest: TDigest = TDigest.createAvlTreeDigest(100) > > On Wed, Jul 8, 2015 at 6:50 PM, ÐΞ€ρ@

Re: PySpark without PySpark

2015-07-08 Thread Sujit Pal
Hi Ashish, >> Nice post. Agreed, kudos to the author of the post, Benjamin Benfort of District Labs. >> Following your post, I get this problem; Again, not my post. I did try setting up IPython with the Spark profile for the edX Intro to Spark course (because I didn't want to use the Vagrant con

Re: RDD saveAsTextFile() to local disk

2015-07-08 Thread Vijay Pawnarkar
Thanks for the help. Following are the folders I was trying to write to: saveAsTextFile("file:///home/someuser/test2/testupload/20150708/0/") saveAsTextFile("file:///home/someuser/test2/testupload/20150708/1/") saveAsTextFile("file:///home/someuser/te

DLL load failed: %1 is not a valid win32 application on invoking pyspark

2015-07-08 Thread ashishdutt
Hi, I get the error, "DLL load failed: %1 is not a valid win32 application" whenever I invoke pyspark. Attached is the screenshot of the same. Is there any way I can get rid of it? Still being new to PySpark, I have had a not-so-pleasant experience so far, most probably because I am on a Windows

DLL load failed: %1 is not a valid win32 application on invoking pyspark

2015-07-08 Thread Ashish Dutt
Hi, I get the error, "DLL load failed: %1 is not a valid win32 application" whenever I invoke pyspark. Attached is the screenshot of the same. Is there any way I can get rid of it? Still being new to PySpark, I have had a not-so-pleasant experience so far, most probably because I am on a Windows

Re: Spark program throws NIO Buffer over flow error (TDigest - Ted Dunning lib)

2015-07-08 Thread ๏̯͡๏
Same exception with different values of compression (10,100) var digest: TDigest = TDigest.createAvlTreeDigest(100) On Wed, Jul 8, 2015 at 6:50 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote: > Any suggestions ? > Code: > > val dimQuantiles = genericRecordsAndKeys > > .map { > > case (keyToOutput,

Re: how to use DoubleRDDFunctions on mllib Vector?

2015-07-08 Thread 诺铁
thanks, I understand now. but I can't find mllib.clustering.GaussianMixture#vectorMean , what version of spark do you use? On Thu, Jul 9, 2015 at 1:16 AM, Feynman Liang wrote: > A RDD[Double] is an abstraction for a large collection of doubles, > possibly distributed across multiple nodes. The

Error while taking union

2015-07-08 Thread anshu shukla
Hi all, I want to create a union of 2 DStreams: in one of them an *RDD is created every 1 second*, the other has RDDs generated by reduceByKeyAndWindow with *window duration set to 60 sec* (slide duration also 60 sec). - Main idea is to do some analysis on every minute's data and emitting the union

Spark program throws NIO Buffer over flow error (TDigest - Ted Dunning lib)

2015-07-08 Thread ๏̯͡๏
Any suggestions ? Code: val dimQuantiles = genericRecordsAndKeys .map { case (keyToOutput, rec) => var digest: TDigest = TDigest.createAvlTreeDigest(1) val fpPaidGMB = rec.get("fpPaidGMB").asInstanceOf[Double] digest.add(fpPaidGMB)

Re: PySpark without PySpark

2015-07-08 Thread Ashish Dutt
Hi Sujit, Nice post. Exactly what I had been looking for. I am relatively a beginner with Spark and real-time data processing. We have a server with CDH 5.4 with 4 nodes. The Spark version on our server is 1.3.0. On my laptop I have Spark 1.3.0 too, and it's using a Windows 7 environment. As per point 5

Re: Communication between driver, cluster and HiveServer

2015-07-08 Thread Eric Pederson
A couple of other things. The Spark Notebook application does have hive-site.xml in its classpath. It is a copy of the original $SPARK_HOME/conf/hive-site.xml that worked for spark-shell originally. After the security tweaks were made to $SPARK_HOME/conf/hive-site.xml, Spark Notebook started work

Re: RDD saveAsTextFile() to local disk

2015-07-08 Thread canan chen
wing function > > saveAsTextFile("file:home/someuser/dir2/testupload/20150708/") > > The dir (/home/someuser/dir2/testupload/) was created before running the > job. The error message is misleading. > > > org.apache.spark.SparkException: Job aborted due to stage

Re: pause and resume streaming app

2015-07-08 Thread Tathagata Das
Currently the only way to pause it is to stop it. The way I would do this is use the Direct Kafka API to access the Kafka offsets, and save them to a data store as batches finish. If you see a batch job failing because downstream is down, stop the context. When it comes back up, start a new streami
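
A minimal sketch of the offset-tracking part (spark-streaming-kafka for Spark 1.3+ assumed; directStream is a stream created with KafkaUtils.createDirectStream and saveOffset is a hypothetical helper writing to your own store):

import org.apache.spark.streaming.kafka.HasOffsetRanges

directStream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... process the batch, then persist where each partition ended
  offsetRanges.foreach(o => saveOffset(o.topic, o.partition, o.untilOffset))
}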

What does RDD lineage refer to ?

2015-07-08 Thread canan chen
Lots of places refer to RDD lineage; I'd like to know what it refers to exactly. My understanding is that it means the RDD dependencies and the intermediate MapOutput info in MapOutputTracker. Correct me if I am wrong. Thanks

Re: Remote spark-submit not working with YARN

2015-07-08 Thread Sandy Ryza
Strange. Does the application show up at all in the YARN web UI? Does application_1436314873375_0030 show up at all in the YARN ResourceManager logs? -Sandy On Wed, Jul 8, 2015 at 3:32 PM, Juan Gordon wrote: > Hello Sandy, > > Yes I'm sure that YARN has the enought resources, i checked it in t

Re: foreachRDD vs. forearchPartition ?

2015-07-08 Thread Tathagata Das
This is also discussed in the programming guide. http://spark.apache.org/docs/latest/streaming-programming-guide.html#design-patterns-for-using-foreachrdd On Wed, Jul 8, 2015 at 8:25 AM, Dmitry Goldenberg wrote: > Thanks, Sean. > > "are you asking about foreach vs foreachPartition? that's quite

Re: Problem in Understanding concept of Physical Cores

2015-07-08 Thread Tathagata Das
There are several levels of indirection going on here, let me clarify. In the local mode, Spark runs tasks (which includes receivers) using the number of threads defined in the master (either local, or local[2], or local[*]). local or local[1] = single thread, so only one task at a time local[2] =
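
A small sketch of the consequence for a receiver-based streaming job: with master "local" or "local[1]" the single thread is occupied by the receiver and no batch ever gets processed, so use at least local[2]:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("receiver-demo")
val ssc = new StreamingContext(conf, Seconds(1))   // one thread for the receiver, one for task processing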

Re: How to change hive database?

2015-07-08 Thread Arun Luthra
Thanks, it works. On Tue, Jul 7, 2015 at 11:15 AM, Ted Yu wrote: > See this thread http://search-hadoop.com/m/q3RTt0NFls1XATV02 > > Cheers > > On Tue, Jul 7, 2015 at 11:07 AM, Arun Luthra > wrote: > >> >> https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.hive.HiveCo
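
For reference, one way to switch databases through a HiveContext (a sketch; the database name is hypothetical):

hiveContext.sql("USE my_database")        // switch the current database
hiveContext.sql("SHOW TABLES").show()     // subsequent queries resolve against my_database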

Re: FW: MLLIB (Spark) Question.

2015-07-08 Thread DB Tsai
Hi Dhar, Disabling the `standardization` feature was just merged into master. https://github.com/apache/spark/commit/57221934e0376e5bb8421dc35d4bf91db4deeca1 Let us know your feedback. Thanks. Sincerely, DB Tsai -- Blog: https://www.dbtsai.com P

Re: Remote spark-submit not working with YARN

2015-07-08 Thread Sandy Ryza
Hi JG, One way this can occur is that YARN doesn't have enough resources to run your job. Have you verified that it does? Are you able to submit using the same command from a node on the cluster? -Sandy On Wed, Jul 8, 2015 at 3:19 PM, jegordon wrote: > I'm trying to submit a spark job from a

Remote spark-submit not working with YARN

2015-07-08 Thread jegordon
I'm trying to submit a Spark job from a different server outside of my Spark cluster (running Spark 1.4.0, Hadoop 2.4.0 and YARN) using the spark-submit script: spark/bin/spark-submit --master yarn-client --executor-memory 4G myjobScript.py The thing is that my application never passes from the ac

Re: Best practice for using singletons on workers (seems unanswered) ?

2015-07-08 Thread Dmitry Goldenberg
Richard, It seems whether I'm doing a foreachRDD or foreachPartition, I'm able to create per-worker/per-JVM singletons. With 4 workers, I've got 4 singletons created. I wouldn't be able to use broadcast vars because the 3rd party objects are not serializable. The shuffling effect is basically whe
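
A hedged sketch of that per-worker/per-JVM singleton pattern (ExpensiveClient is a hypothetical stand-in for the non-serializable 3rd-party object):

// hypothetical non-serializable 3rd-party client
class ExpensiveClient(url: String) { def send(msg: String): Unit = println(s"POST $url: $msg") }

object ClientHolder {
  // lazy val: initialized once per executor JVM on first use, never shipped with a closure
  lazy val client = new ExpensiveClient("http://downstream:8080")
}

// usage inside the streaming job, e.g.:
// stream.foreachRDD(rdd => rdd.foreachPartition(_.foreach(msg => ClientHolder.client.send(msg))))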

Re: Connecting to nodes on cluster

2015-07-08 Thread Ashish Dutt
The error is JVM has not responded after 10 seconds. On 08-Jul-2015 10:54 PM, "ayan guha" wrote: > What's the error you are getting? > On 9 Jul 2015 00:01, "Ashish Dutt" wrote: > >> Hi, >> >> We have a cluster with 4 nodes. The cluster uses CDH 5.4 for the past two >> days I have been trying to

Re: Disable heartbeat messages in REPL

2015-07-08 Thread Andrew Or
Hi Lincoln, I've noticed this myself. I believe it's a new issue that only affects local mode. I've filed a JIRA to track it: https://issues.apache.org/jira/browse/SPARK-8911 2015-07-08 14:20 GMT-07:00 Lincoln Atkinson : > Brilliant! Thanks. > > > > *From:* Feynman Liang [mailto:fli...@databrick

Re: Real-time data visualization with Zeppelin

2015-07-08 Thread Brandon White
Can you use a cron job to update it every X minutes? On Wed, Jul 8, 2015 at 2:23 PM, Ganelin, Ilya wrote: > Hi all – I’m just wondering if anyone has had success integrating Spark > Streaming with Zeppelin and actually dynamically updating the data in near > real-time. From my investigation, it s

Real-time data visualization with Zeppelin

2015-07-08 Thread Ganelin, Ilya
Hi all – I’m just wondering if anyone has had success integrating Spark Streaming with Zeppelin and actually dynamically updating the data in near real-time. From my investigation, it seems that Zeppelin will only allow you to display a snapshot of data, not a continuously updating table. Has an

Requirement failed: Some of the DStreams have different slide durations

2015-07-08 Thread anshu shukla
Hi all, I want to create a union of 2 DStreams: in one of them an *RDD is created every 1 second*, the other has RDDs generated by reduceByKeyAndWindow with *window duration set to 60 sec* (slide duration also 60 sec). - Main idea is to do some analysis on every minute's data and emitting the union

RE: Disable heartbeat messages in REPL

2015-07-08 Thread Lincoln Atkinson
Brilliant! Thanks. From: Feynman Liang [mailto:fli...@databricks.com] Sent: Wednesday, July 08, 2015 2:15 PM To: Lincoln Atkinson Cc: user@spark.apache.org Subject: Re: Disable heartbeat messages in REPL I was thinking the same thing! Try sc.setLogLevel("ERROR") On Wed, Jul 8, 2015 at 2:01 PM, L

Re: Disable heartbeat messages in REPL

2015-07-08 Thread Feynman Liang
I was thinking the same thing! Try sc.setLogLevel("ERROR") On Wed, Jul 8, 2015 at 2:01 PM, Lincoln Atkinson wrote: > “WARN Executor: Told to re-register on heartbeat” is logged repeatedly > in the spark shell, which is very distracting and corrupts the display of > whatever set of commands I’m
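
For reference, the call mentioned above (available on SparkContext since Spark 1.4):

sc.setLogLevel("ERROR")   // silences the WARN-level heartbeat messages in the shell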

Disable heartbeat messages in REPL

2015-07-08 Thread Lincoln Atkinson
"WARN Executor: Told to re-register on heartbeat" is logged repeatedly in the spark shell, which is very distracting and corrupts the display of whatever set of commands I'm currently typing out. Is there an option to disable the logging of this message? Thanks, -Lincoln

Re: PySpark without PySpark

2015-07-08 Thread Sujit Pal
You are welcome Davies. Just to clarify, I didn't write the post (not sure if my earlier post gave that impression, apologize if so), although I agree its great :-). -sujit On Wed, Jul 8, 2015 at 10:36 AM, Davies Liu wrote: > Great post, thanks for sharing with us! > > On Wed, Jul 8, 2015 at 9

Re: Create RDD from output of unix command

2015-07-08 Thread Richard Marscher
As a distributed data processing engine, Spark should be fine with millions of lines. It's built with the idea of massive data sets in mind. Do you have more details on how you anticipate the output of a unix command interacting with a running Spark application? Do you expect Spark to be continuous

RDD saveAsTextFile() to local disk

2015-07-08 Thread spok20nn
Getting an exception when writing an RDD to local disk using the following function saveAsTextFile("file:home/someuser/dir2/testupload/20150708/") The dir (/home/someuser/dir2/testupload/) was created before running the job. The error message is misleading. org.apache.spark.SparkExce

Re: Best practice for using singletons on workers (seems unanswered) ?

2015-07-08 Thread Richard Marscher
Ah, I see this is streaming. I haven't any practical experience with that side of Spark. But the foreachPartition idea is a good approach. I've used that pattern extensively, even though not for singletons, but just to create non-serializable objects like API and DB clients on the executor side. I
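
A minimal sketch of that pattern (DbClient is a hypothetical non-serializable client; stream is assumed to be an existing DStream[String]): create the client once per partition on the executor, use it for the whole batch, then close it:

class DbClient(url: String) { def insert(row: String): Unit = (); def close(): Unit = () }

stream.foreachRDD { rdd =>
  rdd.foreachPartition { rows =>
    val client = new DbClient("jdbc:hypothetical://db:5432/app")   // one client per partition, built on the executor
    rows.foreach(client.insert)
    client.close()
  }
}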

RDD saveAsTextFile() to local disk

2015-07-08 Thread Vijay Pawnarkar
Getting an exception when writing an RDD to local disk using the following function saveAsTextFile("file:home/someuser/dir2/testupload/20150708/") The dir (/home/someuser/dir2/testupload/) was created before running the job. The error message is misleading. org.apache.spark.SparkExce
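
A note on the path above: a local-filesystem URI needs the file:/// scheme with three slashes (scheme plus an absolute path); "file:home/..." is not an absolute URI, which likely explains the exception. A corrected call would look like:

rdd.saveAsTextFile("file:///home/someuser/dir2/testupload/20150708/")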

Job completed successfully without processing anything

2015-07-08 Thread ๏̯͡๏
My job completed in 40 seconds, which is not correct as there is no output. I see Exception in thread "main" akka.actor.ActorNotFound: Actor not found for: ActorSelection[Anchor(akka.tcp://sparkDriver@10.115.86.24:54737/), Path(/user/OutputCommitCoordinator)] at akka.actor.ActorSelection$$anonfun$

pause and resume streaming app

2015-07-08 Thread Shushant Arora
Is it possible to pause and resume a streaming app? I have a streaming app which reads events from Kafka and posts to some external source. I want to pause the app when the external source is down and resume it automatically when it comes back. Is it possible to pause the app, and is it possible to p

Re: spark benchmarking

2015-07-08 Thread Stephen Boesch
One option is the databricks/spark-perf project https://github.com/databricks/spark-perf 2015-07-08 11:23 GMT-07:00 MrAsanjar . : > Hi all, > What is the most common used tool/product to benchmark spark job? >

Communication between driver, cluster and HiveServer

2015-07-08 Thread Eric Pederson
All: I recently ran into a scenario where spark-shell could communicate with Hive but another application of mine (Spark Notebook) could not. When I tried to get a reference to a table using sql.table("tab") Spark Notebook threw an exception but spark-shell did not. I was trying to track down th

spark benchmarking

2015-07-08 Thread MrAsanjar .
Hi all, What is the most commonly used tool/product to benchmark a Spark job?

Re: spark core/streaming doubts

2015-07-08 Thread Tathagata Das
Responses inline. On Wed, Jul 8, 2015 at 10:26 AM, Shushant Arora wrote: > 1.Does creation of read only singleton object in each map function is same > as broadcast object as singleton never gets garbage collected unless > executor gets shutdown ? Aim is to avoid creation of complex object at ea

Re: SnappyCompressionCodec on the master

2015-07-08 Thread Josh Rosen
Can you file a JIRA? https://issues.apache.org/jira/browse/SPARK On Wed, Jul 8, 2015 at 12:47 AM, nizang wrote: > hi, > > I'm running spark standalone cluster (1.4.0). I have some applications > running with scheduler every hour. I found that on one of the executions, > the job got to be FINISH

Re: [SPARK-SQL] libgplcompression.so already loaded in another classloader

2015-07-08 Thread Michael Armbrust
Here's a related JIRA: https://issues.apache.org/jira/browse/SPARK-7819 Typically you can work around this by making sure that the classes are shared across the isolation boundary, as discussed in the comments. On Tue, Jul 7, 2015 at 3:29 AM, Sea

Create RDD from output of unix command

2015-07-08 Thread foobar
What's the best practice of creating RDD from some external unix command output? I assume if the output size is large (say millions of lines), creating RDD from an array of all lines is not a good idea? Thanks! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.co

Re: Streaming checkpoints and logic change

2015-07-08 Thread Tathagata Das
Hey Jong, No I did answer the right question. What I explained did not change the JVM classes (that is the function is the same) but it still ensures that computation is different (the filters get updated with time). So you can checkpoint this and recover from it. This is ONE possible way to do dy

Re: Reading Avro files from Streaming

2015-07-08 Thread harris
Resolved that compilation issue using AvroKey and AvroKeyInputFormat. val avroDs = ssc.fileStream[AvroKey[GenericRecord], NullWritable, AvroKeyInputFormat[GenericRecord]](input) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Reading-Avro-files-f

Re: Streaming checkpoints and logic change

2015-07-08 Thread Jong Wook Kim
Hi TD, you answered the wrong question. If you read the subject, mine was specifically about checkpointing. I'll elaborate. The checkpoint, which is a serialized DStream DAG, contains all the metadata and *logic*, like the function passed to e.g. DStream.transform() This is serialized as an anonymous

Re: PySpark without PySpark

2015-07-08 Thread Davies Liu
Great post, thanks for sharing with us! On Wed, Jul 8, 2015 at 9:59 AM, Sujit Pal wrote: > Hi Julian, > > I recently built a Python+Spark application to do search relevance > analytics. I use spark-submit to submit PySpark jobs to a Spark cluster on > EC2 (so I don't use the PySpark shell, hopefu

spark core/streaming doubts

2015-07-08 Thread Shushant Arora
1. Does creating a read-only singleton object in each map function work the same as a broadcast object, since the singleton never gets garbage collected unless the executor gets shut down? The aim is to avoid creating a complex object at each batch interval of a Spark streaming app. 2. Why does JavaStreamingContext's sc ()

Re: Getting started with spark-scala developemnt in eclipse.

2015-07-08 Thread Feynman Liang
Take a look at https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-Eclipse On Wed, Jul 8, 2015 at 7:47 AM, Daniel Siegmann wrote: > To set up Eclipse for Spark you should install the Scala IDE plugins: > http://scala-ide.org/download/current.html > > De

Re: how to use DoubleRDDFunctions on mllib Vector?

2015-07-08 Thread Feynman Liang
A RDD[Double] is an abstraction for a large collection of doubles, possibly distributed across multiple nodes. The DoubleRDDFunctions are there for performing mean and variance calculations across this distributed dataset. In contrast, a Vector is not distributed and fits on your local machine. Yo
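
A small sketch of the contrast (values illustrative): distributed statistics come from DoubleRDDFunctions on an RDD[Double], while a local mllib Vector can simply be handled through its array:

import org.apache.spark.mllib.linalg.Vectors

val xs = sc.parallelize(Seq(1.0, 2.0, 3.0, 4.0))
println(s"mean=${xs.mean()} variance=${xs.variance()}")   // DoubleRDDFunctions, computed across the cluster

val v = Vectors.dense(1.0, 2.0, 3.0, 4.0)                 // local vector, no RDD machinery needed
println(v.toArray.sum / v.size)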

Re: Streaming checkpoints and logic change

2015-07-08 Thread Tathagata Das
You can use DStream.transform for some stuff. Transform takes a RDD => RDD function that allow arbitrary RDD operations to be done on RDDs of a DStream. This function gets evaluated on the driver on every batch interval. If you are smart about writing the function, it can do different stuff at diff
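
A hedged sketch of what this looks like (lines is assumed to be an existing DStream[String]; currentKeywords is a hypothetical driver-side variable): the RDD => RDD function is re-evaluated on the driver at every batch interval, so it can pick up state that changes over time:

@volatile var currentKeywords = Set("error", "timeout")

val filtered = lines.transform { rdd =>
  val keywords = currentKeywords                       // read the latest value at batch-generation time
  rdd.filter(line => keywords.exists(k => line.contains(k)))
}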

Streaming checkpoints and logic change

2015-07-08 Thread Jong Wook Kim
I just asked this question at the streaming webinar that just ended, but the speakers didn't answer, so throwing it here: AFAIK checkpoints are the only recommended method for running Spark streaming without data loss. But it involves serializing the entire dstream graph, which prohibits any logic c

Re: PySpark without PySpark

2015-07-08 Thread Sujit Pal
Hi Julian, I recently built a Python+Spark application to do search relevance analytics. I use spark-submit to submit PySpark jobs to a Spark cluster on EC2 (so I don't use the PySpark shell, hopefully thats what you are looking for). Can't share the code, but the basic approach is covered in this

Re: Restarting Spark Streaming Application with new code

2015-07-08 Thread Vinoth Chandar
Thanks for the clarification, Cody! On Mon, Jul 6, 2015 at 6:44 AM, Cody Koeninger wrote: > You shouldn't rely on being able to restart from a checkpoint after > changing code, regardless of whether the change was explicitly related to > serialization. > > If you are relying on checkpoints to ho

Re: (de)serialize DStream

2015-07-08 Thread Shixiong Zhu
DStream must be Serializable because of metadata checkpointing. But you can use KryoSerializer for data checkpointing. Data checkpointing uses RDD.checkpoint, and its serializer can be set via spark.serializer. Best Regards, Shixiong Zhu 2015-07-08 3:43 GMT+08:00 Chen Song : > In Spark Streaming, when using upd
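
A minimal sketch of enabling Kryo for data serialization (the DStream graph itself, i.e. metadata checkpointing, still goes through Java serialization):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("streaming-app")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")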

Re: Parallelizing multiple RDD / DataFrame creation in Spark

2015-07-08 Thread ayan guha
Do you have a benchmark to show that running these two statements as they are will be slower than what you suggest? On 9 Jul 2015 01:06, "Brandon White" wrote: > The point of running them in parallel would be faster creation of the > tables. Has anybody been able to efficiently parallelize something like

Re: Kryo Serializer on Worker doesn't work by default.

2015-07-08 Thread Eugene Morozov
What I don't seem to get is how my code ends up on the Worker node. My understanding was that the jar file which I use to start the job should automatically be copied to the Worker nodes and added to the classpath. It seems that is not the case. But if my jar is not copied to the Worker nodes, then how

Re: foreachRDD vs. forearchPartition ?

2015-07-08 Thread Dmitry Goldenberg
Thanks, Sean. "are you asking about foreach vs foreachPartition? that's quite different. foreachPartition does not give more parallelism but lets you operate on a whole batch of data at once, which is nice if you need to allocate some expensive resource to do the processing" This is basically wha

Re: foreachRDD vs. forearchPartition ?

2015-07-08 Thread Dmitry Goldenberg
Thanks, Cody. The "good boy" comment wasn't from me :) I was the one asking for help. On Wed, Jul 8, 2015 at 10:52 AM, Cody Koeninger wrote: > Sean already answered your question. foreachRDD and foreachPartition are > completely different, there's nothing fuzzy or insufficient about that > a

Re: [SparkR] Float type coercion with hiveContext

2015-07-08 Thread Evgeny Sinelnikov
Thank you, Ray, but it is already created and almost fixed: https://issues.apache.org/jira/browse/SPARK-8840 On Wed, Jul 8, 2015 at 4:04 PM, Sun, Rui wrote: > Hi, Evgeny, > > I reported a JIRA issue for your problem: > https://issues.apache.org/jira/browse/SPARK-8897. You can track it to see >

Jobs with unknown origin.

2015-07-08 Thread Jan-Paul Bultmann
Hey, I have quite a few jobs appearing in the web-ui with the description "run at ThreadPoolExecutor.java:1142". Are these generated by SparkSQL internally? There are so many that they cause a RejectedExecutionException when the thread-pool runs out of space for them. RejectedExecutionExceptio

Re: Parallelizing multiple RDD / DataFrame creation in Spark

2015-07-08 Thread Srikanth
Your tableLoad() APIs are not actions. File will be read fully only when an action is performed. If the action is something like table1.join(table2), then I think both files will be read in parallel. Can you try that and look at the execution plan or in 1.4 this is shown in Spark UI. Srikanth On

Re: foreachRDD vs. forearchPartition ?

2015-07-08 Thread Sean Owen
@Evo There is no foreachRDD operation on RDDs; it is a method of DStream. It gives each RDD in the stream. RDD has a foreach, and foreachPartition. These give elements of an RDD. What do you mean it 'works' to call foreachRDD on an RDD? @Dmitry are you asking about foreach vs foreachPartition? tha

Re: Parallelizing multiple RDD / DataFrame creation in Spark

2015-07-08 Thread Brandon White
The point of running them in parallel would be faster creation of the tables. Has anybody been able to efficiently parallelize something like this in Spark? On Jul 8, 2015 12:29 AM, "Akhil Das" wrote: > Whats the point of creating them in parallel? You can multi-thread it run > it in parallel tho
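
One hedged way to multi-thread the loads from the driver (paths are hypothetical; whether it is actually faster depends on available cluster capacity): fire each load in a Scala Future and force materialization with an action:

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

val f1 = Future { sqlContext.read.json("s3n://bucket/table1/").cache().count() }   // each Future submits its own jobs
val f2 = Future { sqlContext.read.json("s3n://bucket/table2/").cache().count() }
Await.result(Future.sequence(Seq(f1, f2)), 1.hour)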

Re: Connecting to nodes on cluster

2015-07-08 Thread ayan guha
What's the error you are getting? On 9 Jul 2015 00:01, "Ashish Dutt" wrote: > Hi, > > We have a cluster with 4 nodes. The cluster uses CDH 5.4 for the past two > days I have been trying to connect my laptop to the server using spark > but its been unsucessful. > The server contains data that nee

Re: foreachRDD vs. forearchPartition ?

2015-07-08 Thread Cody Koeninger
Sean already answered your question. foreachRDD and foreachPartition are completely different, there's nothing fuzzy or insufficient about that answer. The fact that you can call foreachPartition on an rdd within the scope of foreachRDD should tell you that they aren't in any way comparable. I'm

Re: Is there a way to shutdown the derby in hive context in spark shell?

2015-07-08 Thread Terry Hole
I am using spark 1.4.1rc1 with default hive settings Thanks - Terry Hi All, I'd like to use the hive context in spark shell; I need to recreate the hive meta database in the same location, so I want to close the derby connection previously created in the spark shell. Is there any way to do this?

Re: Getting started with spark-scala developemnt in eclipse.

2015-07-08 Thread Daniel Siegmann
To set up Eclipse for Spark you should install the Scala IDE plugins: http://scala-ide.org/download/current.html Define your project in Maven with Scala plugins configured (you should be able to find documentation online) and import as an existing Maven project. The source code should be in src/ma

PySpark without PySpark

2015-07-08 Thread Julian
Hey. Is there a resource that has written up what the necessary steps are for running PySpark without using the PySpark shell? I can reverse engineer (by following the tracebacks and reading the shell source) what the relevant Java imports needed are, but I would assume someone has attempted this

Kryo Serializer on Worker doesn't work by default.

2015-07-08 Thread Eugene Morozov
Hello. I have an issue with CustomKryoRegistrator, which causes ClassNotFound on the Worker. The issue is resolved if I call SparkConf.setJars with the path to the same jar I run. It is a workaround, but it requires specifying the same jar file twice. The first time I use it to actually run the job, and
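
A hedged sketch of the workaround described (the jar path and registrator class are placeholders; the SparkConf method is setJars):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("kryo-job")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "com.example.CustomKryoRegistrator")
  .setJars(Seq("/path/to/my-application.jar"))   // ships the jar to executors so the registrator class can be loaded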

Re: foreachRDD vs. forearchPartition ?

2015-07-08 Thread Dmitry Goldenberg
"These are quite different operations. One operates on RDDs in DStream and one operates on partitions of an RDD. They are not alternatives." Sean, different operations as they are, they can certainly be used on the same data set. In that sense, they are alternatives. Code can be written using on

Re: PySpark MLlib: py4j cannot find trainImplicitALSModel method

2015-07-08 Thread Ashish Dutt
Hello Sooraj, Thank you for your response. It indeed gives me a ray of hope now. Can you please suggest any good tutorials for installing and working with an IPython notebook server on the node? Thank you Ashish On 08-Jul-2015 6:16 PM, "sooraj" wrote: > > Hi Ashish, > > I am running ipython notebook s

RE: foreachRDD vs. forearchPartition ?

2015-07-08 Thread Evo Eftimov
That was a) fuzzy b) insufficient – one can certainly use foreach (only) on DStream RDDs – it works, as an empirical observation As another empirical observation: foreachPartition results in having one instance of the lambda/closure per partition when e.g. publishing to output systems like

Connecting to nodes on cluster

2015-07-08 Thread Ashish Dutt
Hi, We have a cluster with 4 nodes. The cluster uses CDH 5.4. For the past two days I have been trying to connect my laptop to the server using Spark, but it's been unsuccessful. The server contains data that needs to be cleaned and analysed. The cluster and the nodes are on a Linux environment. To con

Re: foreachRDD vs. forearchPartition ?

2015-07-08 Thread Sean Owen
These are quite different operations. One operates on RDDs in DStream and one operates on partitions of an RDD. They are not alternatives. On Wed, Jul 8, 2015, 2:43 PM dgoldenberg wrote: > Is there a set of best practices for when to use foreachPartition vs. > foreachRDD? > > Is it generally tr

RE: foreachRDD vs. forearchPartition ?

2015-07-08 Thread Evo Eftimov
foreachPartition results in having one instance of the lambda/closure per partition when e.g. publishing to output systems like message brokers, databases and file systems - that increases the level of parallelism of your output processing -Original Message- From: dgoldenberg [mailto:dg

Re: Getting started with spark-scala developemnt in eclipse.

2015-07-08 Thread Ashish Dutt
Hello Prateek, I started by getting the pre-built binaries so as to skip the hassle of building them from scratch. I am not familiar with Scala so I can't comment on it. I have documented my experiences on my blog www.edumine.wordpress.com Perhaps it might be useful to you. On 08-Jul-2015 9:39 PM,

foreachRDD vs. forearchPartition ?

2015-07-08 Thread dgoldenberg
Is there a set of best practices for when to use foreachPartition vs. foreachRDD? Is it generally true that using foreachPartition avoids some of the over-network data shuffling overhead? When would I definitely want to use one method vs. the other? Thanks. -- View this message in context: h

Re: Best practice for using singletons on workers (seems unanswered) ?

2015-07-08 Thread Dmitry Goldenberg
My singletons do in fact stick around. They're one per worker, looks like. So with 4 workers running on the box, we're creating one singleton per worker process/jvm, which seems OK. Still curious about foreachPartition vs. foreachRDD though... On Tue, Jul 7, 2015 at 11:27 AM, Richard Marscher wr

Getting started with spark-scala developemnt in eclipse.

2015-07-08 Thread Prateek .
Hi, I am a beginner to Scala and Spark. I am trying to set up an Eclipse environment to develop Spark programs in Scala, then take the resulting jar for spark-submit. How shall I start? To start, my task includes setting up Eclipse for Scala and Spark, getting dependencies resolved, and building the project using m

Re: Best practice for using singletons on workers (seems unanswered) ?

2015-07-08 Thread Dmitry Goldenberg
Richard, That's exactly the strategy I've been trying, which is a wrapper singleton class. But I was seeing the inner object being created multiple times. I wonder if the problem has to do with the way I'm processing the RDD's. I'm using JavaDStream to stream data (from Kafka). Then I'm processin

RE: [SparkR] Float type coercion with hiveContext

2015-07-08 Thread Sun, Rui
Hi, Evgeny, I reported a JIRA issue for your problem: https://issues.apache.org/jira/browse/SPARK-8897. You can track it to see how it will be solved. Ray -Original Message- From: Evgeny Sinelnikov [mailto:esinelni...@griddynamics.com] Sent: Monday, July 6, 2015 7:27 PM To: huangzheng

RE: Out of Memory Errors on less number of cores in proportion to Partitions in Data

2015-07-08 Thread Evo Eftimov
Also try to increase the number of partitions gradually – not in one big jump from 20 to 100 but adding e.g. 10 at a time and see whether there is a correlation with adding more RAM to the executors From: Evo Eftimov [mailto:evo.efti...@isecc.com] Sent: Wednesday, July 8, 2015 1:26 PM To: 'An

Announcement of the webinar in the newsletter and on the site

2015-07-08 Thread Oleh Rozvadovskyy
Hi there, My name is Oleh Rozvadovskyy. I represent CyberVision Inc., the IoT company and the developer of the Kaa IoT platform, which is open-source middleware for smart devices and servers. In two weeks we're going to run a webinar *"IoT data ingestion in Spark Streaming using Kaa" on Thu, J

RE: Out of Memory Errors on less number of cores in proportion to Partitions in Data

2015-07-08 Thread Evo Eftimov
Are you sure you have actually increased the RAM (how exactly did you do that, and does it show in the Spark UI)? Also use the Spark UI and the driver console to check the RAM allocated for each RDD and RDD partition in each of the scenarios. Re b) the general rule is num of partitions = 2 x nu

Re: UDF in spark

2015-07-08 Thread vinod kumar
Thank you for the quick response Vishnu, I have the following doubts too. 1. Is there any way to upload files to HDFS programmatically using C#? 2. Is there any way to automatically load a Scala block of code (for a UDF) when I start the Spark service? -Vinod On Wed, Jul 8, 2015 at 7:57 AM, VI
