PySpark MLlib: py4j cannot find trainImplicitALSModel method

2015-07-08 Thread sooraj
Hi, I am using MLlib collaborative filtering API on an implicit preference data set. From a pySpark notebook, I am iteratively creating the matrix factorization model with the aim of measuring the RMSE for each combination of parameters for this API like the rank, lambda and alpha. After the code

Re: unable to bring up cluster with ec2 script

2015-07-08 Thread Akhil Das
It's showing connection refused; for some reason it was not able to connect to the machine, either because of the machine's start-up time or because of the security group. Thanks Best Regards On Wed, Jul 8, 2015 at 2:04 AM, Pagliari, Roberto wrote: > > > > > I'm following the tutorial about Apache Spark o

Re: Parallelizing multiple RDD / DataFrame creation in Spark

2015-07-08 Thread Akhil Das
What's the point of creating them in parallel? You can multi-thread it and run it in parallel, though. Thanks Best Regards On Wed, Jul 8, 2015 at 5:34 AM, Brandon White wrote: > Say I have a spark job that looks like following: > > def loadTable1() { > val table1 = sqlContext.jsonFile(s"s3://textfi

Re: Parallelizing multiple RDD / DataFrame creation in Spark

2015-07-08 Thread Ashish Dutt
Thanks for your reply Akhil. How do you multithread it? Sincerely, Ashish Dutt On Wed, Jul 8, 2015 at 3:29 PM, Akhil Das wrote: > Whats the point of creating them in parallel? You can multi-thread it run > it in parallel though. > > Thanks > Best Regards > > On Wed, Jul 8, 2015 at 5:34 AM, Bra

Re: RE: HiBench build fail

2015-07-08 Thread luohui20001
Hi Ted and Grace, Retried with Spark 1.4.0, still failed with the same phenomenon. Here is a log, FYI. What other details may help? BTW, is it a necessary step to run the HiBench test for my Spark cluster? I also tried to skip building HiBench and execute "bin/run-all.sh", but also got er

Re: Parallelizing multiple RDD / DataFrame creation in Spark

2015-07-08 Thread Akhil Das
Have a look http://alvinalexander.com/scala/how-to-create-java-thread-runnable-in-scala, create two threads and call thread1.start(), thread2.start() Thanks Best Regards On Wed, Jul 8, 2015 at 1:06 PM, Ashish Dutt wrote: > Thanks for your reply Akhil. > How do you multithread it? > > Sincerely,
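A minimal sketch of the threading approach suggested above, assuming an existing SQLContext named sqlContext; the S3 paths and table names are placeholders, not from the thread:

    val loadTable1 = new Thread(new Runnable {
      def run(): Unit = {
        // each thread triggers its own load independently of the other
        val table1 = sqlContext.read.json("s3n://mybucket/table1")
        table1.registerTempTable("table1")
      }
    })
    val loadTable2 = new Thread(new Runnable {
      def run(): Unit = {
        val table2 = sqlContext.read.json("s3n://mybucket/table2")
        table2.registerTempTable("table2")
      }
    })
    loadTable1.start(); loadTable2.start()
    loadTable1.join(); loadTable2.join()   // wait for both loads before querying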

Word2Vec distributed?

2015-07-08 Thread Carsten Schnober
Hi, I've been experimenting with the Spark Word2Vec implementation in the MLLib package. It seems to me that only the preparatory steps are actually performed in a distributed way, i.e. stages 0-2 that prepare the data. In stage 3 (mapPartitionsWithIndex at Word2Vec.scala:312), only one node seems

SnappyCompressionCodec on the master

2015-07-08 Thread nizang
hi, I'm running spark standalone cluster (1.4.0). I have some applications running with scheduler every hour. I found that on one of the executions, the job got to be FINISHED after very few seconds (instead of ~5 minutes), and in the logs on the master, I can see the following exception: org.apa

Re: spark - redshift !!!

2015-07-08 Thread shahab
Hi, I did some experiment with loading data from s3 into spark. I loaded data from s3 using sc.textFile(). Have a look at the following code snippet: val csv = sc.textFile("s3n://mybucket/myfile.csv") val rdd = csv.map(line => line.split(",").map(elem => elem.trim)) // my data format is i
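A hedged continuation of the snippet above into a DataFrame; the schema (id, name, value) is purely illustrative and not taken from the post:

    import sqlContext.implicits._   // assumes an existing SQLContext named sqlContext

    val csv = sc.textFile("s3n://mybucket/myfile.csv")
    val rdd = csv.map(line => line.split(",").map(_.trim))

    case class Record(id: Long, name: String, value: Double)   // assumed column layout
    val df = rdd.map(a => Record(a(0).toLong, a(1), a(2).toDouble)).toDF()
    df.registerTempTable("records")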

Re: spark - redshift !!!

2015-07-08 Thread spark user
Hi, I am looking at how to load data into Redshift. Thanks. On Wednesday, July 8, 2015 12:47 AM, shahab wrote: Hi, I did some experiment with loading data from s3 into spark. I loaded data from s3 using sc.textFile(). Have a look at the following code snippet: val csv = sc.textFile(

Re: How to submit streaming application and exit

2015-07-08 Thread Bin Wang
Thanks. Actually I've found the way. I'm using spark-submit to submit the job to a YARN cluster with --master yarn-cluster (so the spark-submit process is not the driver). So I can set "spark.yarn.submit.waitAppCompletion" to "false" so that the process will exit after the job is submitted. ayan

How to upgrade Spark version in CDH 5.4

2015-07-08 Thread Ashish Dutt
Hi, I need to upgrade spark version 1.3 to version 1.4 on CDH 5.4. I checked the documentation here

Re: Parallelizing multiple RDD / DataFrame creation in Spark

2015-07-08 Thread Ashish Dutt
Thank you, Akhil, for the link. Sincerely, Ashish Dutt PhD Candidate Department of Information Systems University of Malaya, Lembah Pantai, 50603 Kuala Lumpur, Malaysia On Wed, Jul 8, 2015 at 3:43 PM, Akhil Das wrote: > Have a look > http://alvinalexander.com/scala/how-to-create-java-thread-runn

Re: Re: RE: HiBench build fail

2015-07-08 Thread luohui20001
Should I add dependencies for "spark-core_2.10,spark-yarn_2.10,spark-streaming_2.10, org.apache.spark:spark-mllib_2.10,:spark-hive_2.10,:spark-graphx_2.10" in pom.xml? If yes, there are 7 pom.xml files in HiBench, listed below; which one should I modify? [root@spark-study HiBench-master]# find ./ -name pom

Re: spark - redshift !!!

2015-07-08 Thread shahab
Sorry, I misunderstood. best, /Shahab On Wed, Jul 8, 2015 at 9:52 AM, spark user wrote: > Hi 'I am looking how to load data in redshift . > Thanks > > > > On Wednesday, July 8, 2015 12:47 AM, shahab > wrote: > > > Hi, > > I did some experiment with loading data from s3 into spark. I loaded d

Re: is it possible to disable -XX:OnOutOfMemoryError=kill %p for the executors?

2015-07-08 Thread Konstantinos Kougios
Seems you're correct: 2015-07-07 17:21:27,245 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Container [pid=38506,containerID=container_1436262805092_0022_01_03] is running beyond virtual memory limits. Current usage: 4.3 GB of 4.5 GB physic

UDF in spark

2015-07-08 Thread vinod kumar
Hi Everyone, I am new to Spark. May I know how to define and use a User Defined Function in Spark SQL? I want to use the defined UDF in SQL queries. My Environment: Windows 8, Spark 1.3.1. Warm Regards, Vinod

Day of year

2015-07-08 Thread Ravisankar Mani
Hi everyone, I can't get 'day of year' when using a Spark SQL query. Can you suggest any way to achieve day of year? Regards, Ravi

Using different users with spark thriftserver

2015-07-08 Thread Zalzberg, Idan (Agoda)
Hi, We are using spark thrift server as a hive replacement. One of the things we have with hive, is that different users can connect with their own usernames/passwords and get appropriate permissions. So on the same server, one user may have a query that will have permissions to run, while the o

Re: PySpark MLlib: py4j cannot find trainImplicitALSModel method

2015-07-08 Thread sooraj
That turned out to be a silly data type mistake. At one point in the iterative call, I was passing an integer value for the parameter 'alpha' of the ALS train API, which was expecting a Double. So, py4j in fact complained that it cannot find a method that takes an integer value for that parameter.

Re: PySpark MLlib: py4j cannot find trainImplicitALSModel method

2015-07-08 Thread Ashish Dutt
Hello Sooraj, I see you are using ipython notebook. Can you tell me are you on Windows OS or Linux based OS? I am using Windows 7 and I am new to Spark. I am trying to connect ipython with my local cluster based on CDH5.4. I followed these tutorials here but they are written on linux environment an

Re: PySpark MLlib: py4j cannot find trainImplicitALSModel method

2015-07-08 Thread Ashish Dutt
My apologies for double posting, but I missed the web links that I followed, which are 1, 2, 3

Re: PySpark MLlib: py4j cannot find trainImplicitALSModel method

2015-07-08 Thread sooraj
Hi Ashish, I am running ipython notebook server on one of the nodes of the cluster (HDP). Setting it up was quite straightforward, and I guess I followed the same references that you linked to. Then I access the notebook remotely from my development PC. Never tried to connect a local ipython (on a

Is there a way to shutdown the derby in hive context in spark shell?

2015-07-08 Thread Terry Hole
Hi All, I'd like to use the hive context in the spark shell. I need to recreate the hive meta database in the same location, so I want to close the derby connection previously created in the spark shell; is there any way to do this? I tried this, but it does not work: DriverManager.getConnection("jdbc:de

Re: thrift-server does not load jars files (Azure HDInsight)

2015-07-08 Thread Daniel Haviv
Hi, Just updating back that setting spark.driver.extraClassPath worked. Thanks, Daniel On Fri, Jul 3, 2015 at 5:35 PM, Ted Yu wrote: > Alternatively, setting spark.driver.extraClassPath should work. > > Cheers > > On Fri, Jul 3, 2015 at 2:59 AM, Steve Loughran > wrote: > >> >>> On Thu, Jul 2,

Re: UDF in spark

2015-07-08 Thread VISHNU SUBRAMANIAN
Hi, sqlContext.udf.register("udfname", functionname _) example: def square(x:Int):Int = { x * x} register udf as below sqlContext.udf.register("square",square _) Thanks, Vishnu On Wed, Jul 8, 2015 at 2:23 PM, vinod kumar wrote: > Hi Everyone, > > I am new to spark.may I know how to define
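A compact sketch of Vishnu's suggestion end-to-end; the table and column names are illustrative:

    def square(x: Int): Int = x * x
    sqlContext.udf.register("square", square _)

    // the UDF can now be called from SQL text
    sqlContext.sql("SELECT id, square(id) AS id_squared FROM my_table").show()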

Problem in Understanding concept of Physical Cores

2015-07-08 Thread Aniruddh Sharma
Hi I am new to Spark. Following is the problem that I am facing Test 1) I ran a VM on CDH distribution with only 1 core allocated to it and I ran simple Streaming example in spark-shell with sending data on port and trying to read it. With 1 core allocated to this nothing happens in my strea

Out of Memory Errors on less number of cores in proportion to Partitions in Data

2015-07-08 Thread Aniruddh Sharma
Hi, I am new to Spark. I have done the following tests and I am confused by the conclusions. I have 2 queries. Following is the detail of the test. Test 1) Used an 11-node cluster where each machine has 64 GB RAM and 4 physical cores. I ran an ALS algorithm using MLlib on a 1.6 GB data set. I ran 10 executors and

RE: Out of Memory Errors on less number of cores in proportion to Partitions in Data

2015-07-08 Thread Evo Eftimov
This is most likely due to the internal implementation of ALS in MLlib. Probably for each parallel unit of execution (partition in Spark terms) the implementation allocates and uses a RAM buffer where it keeps interim results during the ALS iterations. If we assume that the size of that intern

Re: UDF in spark

2015-07-08 Thread vinod kumar
Thanks Vishnu. When I restart the service, the UDF is not accessible by my query; I need to run the mentioned block again to use the UDF. Is there any way to keep the UDF in sqlContext permanently? Thanks, Vinod On Wed, Jul 8, 2015 at 7:16 AM, VISHNU SUBRAMANIAN < johnfedrickena...@gmail.com> wr

Re: akka.remote.transport.Transport$InvalidAssociationException: The remote system terminated the association because it is shutting down

2015-07-08 Thread maxdml
Same feedback with spark 1.4.0 and hadoop 2.5.2. Workload is completing tho. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/akka-remote-transport-Transport-InvalidAssociationException-The-remote-system-terminated-the-associan-tp20071p23713.html Sent from t

Re: UDF in spark

2015-07-08 Thread VISHNU SUBRAMANIAN
Hi Vinod, Yes, if you want to use a Scala or Python function you need that block of code. Only Hive UDFs are available permanently. Thanks, Vishnu On Wed, Jul 8, 2015 at 5:17 PM, vinod kumar wrote: > Thanks Vishnu, > > When restart the service the UDF was not accessible by my query.I need to >

Re: UDF in spark

2015-07-08 Thread vinod kumar
Thank you for the quick response, Vishnu. I have the following doubts too. 1. Is there any way to upload files to HDFS programmatically using the C# language? 2. Is there any way to automatically load the Scala block of code (for the UDF) when I start the Spark service? -Vinod On Wed, Jul 8, 2015 at 7:57 AM, VI

RE: Out of Memory Errors on less number of cores in proportion to Partitions in Data

2015-07-08 Thread Evo Eftimov
Are you sure you have actually increased the RAM (how exactly did you do that, and does it show in the Spark UI)? Also use the Spark UI and the driver console to check the RAM allocated for each RDD and RDD partition in each of the scenarios. Re b) the general rule is num of partitions = 2 x nu

Announcement of the webinar in the newsletter and on the site

2015-07-08 Thread Oleh Rozvadovskyy
Hi there, My name is Oleh Rozvadovskyy. I represent CyberVision Inc., the IoT company and the developer of Kaa IoT platform, which is open-source middleware for smart devices and servers. In a 2 weeks period we're going to run a webinar *"IoT data ingestion in Spark Streaming using Kaa" on Thu, J

RE: Out of Memory Errors on less number of cores in proportion to Partitions in Data

2015-07-08 Thread Evo Eftimov
Also try to increase the number of partitions gradually, not in one big jump from 20 to 100 but adding e.g. 10 at a time, and see whether there is a correlation with adding more RAM to the executors. From: Evo Eftimov [mailto:evo.efti...@isecc.com] Sent: Wednesday, July 8, 2015 1:26 PM To: 'An

RE: [SparkR] Float type coercion with hiveContext

2015-07-08 Thread Sun, Rui
Hi, Evgeny, I reported a JIRA issue for your problem: https://issues.apache.org/jira/browse/SPARK-8897. You can track it to see how it will be solved. Ray -Original Message- From: Evgeny Sinelnikov [mailto:esinelni...@griddynamics.com] Sent: Monday, July 6, 2015 7:27 PM To: huangzheng

Re: Best practice for using singletons on workers (seems unanswered) ?

2015-07-08 Thread Dmitry Goldenberg
Richard, That's exactly the strategy I've been trying, which is a wrapper singleton class. But I was seeing the inner object being created multiple times. I wonder if the problem has to do with the way I'm processing the RDDs. I'm using JavaDStream to stream data (from Kafka). Then I'm processin

Getting started with spark-scala development in eclipse.

2015-07-08 Thread Prateek .
Hi, I am a beginner to Scala and Spark. I am trying to set up an Eclipse environment to develop a Spark program in Scala, then take its jar for spark-submit. How shall I start? To start, my task includes setting up Eclipse for Scala and Spark, getting dependencies resolved, building the project using m

Re: Best practice for using singletons on workers (seems unanswered) ?

2015-07-08 Thread Dmitry Goldenberg
My singletons do in fact stick around. They're one per worker, looks like. So with 4 workers running on the box, we're creating one singleton per worker process/jvm, which seems OK. Still curious about foreachPartition vs. foreachRDD though... On Tue, Jul 7, 2015 at 11:27 AM, Richard Marscher wr

foreachRDD vs. foreachPartition ?

2015-07-08 Thread dgoldenberg
Is there a set of best practices for when to use foreachPartition vs. foreachRDD? Is it generally true that using foreachPartition avoids some of the over-network data shuffling overhead? When would I definitely want to use one method vs. the other? Thanks. -- View this message in context: h

Re: Getting started with spark-scala development in eclipse.

2015-07-08 Thread Ashish Dutt
Hello Prateek, I started with getting the pre built binaries so as to skip the hassle of building them from scratch. I am not familiar with scala so can't comment on it. I have documented my experiences on my blog www.edumine.wordpress.com Perhaps it might be useful to you. On 08-Jul-2015 9:39 PM,

RE: foreachRDD vs. foreachPartition ?

2015-07-08 Thread Evo Eftimov
Using foreachPartition results in one instance of the lambda/closure per partition when e.g. publishing to output systems like message brokers, databases and file systems; that increases the level of parallelism of your output processing. -Original Message- From: dgoldenberg [mailto:dg
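A sketch of the per-partition pattern described here; ExternalClient is a hypothetical stand-in for a broker or database client, and dstream is an existing DStream:

    dstream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        val client = new ExternalClient()        // one client per partition, created on the executor
        records.foreach(record => client.send(record))
        client.close()
      }
    }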

Re: foreachRDD vs. foreachPartition ?

2015-07-08 Thread Sean Owen
These are quite different operations. One operates on RDDs in DStream and one operates on partitions of an RDD. They are not alternatives. On Wed, Jul 8, 2015, 2:43 PM dgoldenberg wrote: > Is there a set of best practices for when to use foreachPartition vs. > foreachRDD? > > Is it generally tr

Connecting to nodes on cluster

2015-07-08 Thread Ashish Dutt
Hi, We have a cluster with 4 nodes. The cluster uses CDH 5.4. For the past two days I have been trying to connect my laptop to the server using Spark, but it's been unsuccessful. The server contains data that needs to be cleaned and analysed. The cluster and the nodes are on a Linux environment. To con

RE: foreachRDD vs. foreachPartition ?

2015-07-08 Thread Evo Eftimov
That was a) fuzzy b) insufficient; one can certainly use foreach (only) on DStream RDDs, and it works, as an empirical observation. As another empirical observation: foreachPartition results in one instance of the lambda/closure per partition when e.g. publishing to output systems like

Re: PySpark MLlib: py4j cannot find trainImplicitALSModel method

2015-07-08 Thread Ashish Dutt
Hello Sooraj, Thank you for your response. It indeed gives me a ray of hope now. Can you please suggest any good tutorials for installing and working with the IPython notebook server on the node? Thank you Ashish On 08-Jul-2015 6:16 PM, "sooraj" wrote: > > Hi Ashish, > > I am running ipython notebook s

Re: foreachRDD vs. foreachPartition ?

2015-07-08 Thread Dmitry Goldenberg
"These are quite different operations. One operates on RDDs in DStream and one operates on partitions of an RDD. They are not alternatives." Sean, different operations as they are, they can certainly be used on the same data set. In that sense, they are alternatives. Code can be written using on

Kryo Serializer on Worker doesn't work by default.

2015-07-08 Thread Eugene Morozov
Hello. I have an issue with CustomKryoRegistrator, which causes ClassNotFound on the Worker. The issue is resolved if I call SparkConf.setJars with the path to the same jar I run. It is a workaround, but it requires specifying the same jar file twice. The first time I use it to actually run the job, and

PySpark without PySpark

2015-07-08 Thread Julian
Hey. Is there a resource that has written up what the necessary steps are for running PySpark without using the PySpark shell? I can reverse engineer (by following the tracebacks and reading the shell source) what the relevant Java imports needed are, but I would assume someone has attempted this

Re: Getting started with spark-scala developemnt in eclipse.

2015-07-08 Thread Daniel Siegmann
To set up Eclipse for Spark you should install the Scala IDE plugins: http://scala-ide.org/download/current.html Define your project in Maven with Scala plugins configured (you should be able to find documentation online) and import as an existing Maven project. The source code should be in src/ma

Re: Is there a way to shutdown the derby in hive context in spark shell?

2015-07-08 Thread Terry Hole
I am using spark 1.4.1rc1 with default hive settings Thanks - Terry Hi All, I'd like to use the hive context in spark shell, i need to recreate the hive meta database in the same location, so i want to close the derby connection previous created in the spark shell, is there any way to do this?

Re: foreachRDD vs. foreachPartition ?

2015-07-08 Thread Cody Koeninger
Sean already answered your question. foreachRDD and foreachPartition are completely different, there's nothing fuzzy or insufficient about that answer. The fact that you can call foreachPartition on an rdd within the scope of foreachRDD should tell you that they aren't in any way comparable. I'm

Re: Connecting to nodes on cluster

2015-07-08 Thread ayan guha
What's the error you are getting? On 9 Jul 2015 00:01, "Ashish Dutt" wrote: > Hi, > > We have a cluster with 4 nodes. The cluster uses CDH 5.4 for the past two > days I have been trying to connect my laptop to the server using spark > but its been unsucessful. > The server contains data that nee

Re: Parallelizing multiple RDD / DataFrame creation in Spark

2015-07-08 Thread Brandon White
The point of running them in parallel would be faster creation of the tables. Has anybody been able to efficiently parallelize something like this in Spark? On Jul 8, 2015 12:29 AM, "Akhil Das" wrote: > Whats the point of creating them in parallel? You can multi-thread it run > it in parallel tho

Re: foreachRDD vs. foreachPartition ?

2015-07-08 Thread Sean Owen
@Evo There is no foreachRDD operation on RDDs; it is a method of DStream. It gives each RDD in the stream. RDD has a foreach, and foreachPartition. These give elements of an RDD. What do you mean it 'works' to call foreachRDD on an RDD? @Dmitry are you asking about foreach vs foreachPartition? tha

Re: Parallelizing multiple RDD / DataFrame creation in Spark

2015-07-08 Thread Srikanth
Your tableLoad() calls are not actions; the files will be read fully only when an action is performed. If the action is something like table1.join(table2), then I think both files will be read in parallel. Can you try that and look at the execution plan? In 1.4 this is shown in the Spark UI. Srikanth On
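A short illustration of the lazy-read point using plain text files; the paths are placeholders and the join key is assumed to be the first CSV column:

    val lines1 = sc.textFile("s3n://mybucket/table1")   // lazy: nothing read yet
    val lines2 = sc.textFile("s3n://mybucket/table2")   // lazy: nothing read yet
    val byKey1 = lines1.map(l => (l.split(",")(0), l))
    val byKey2 = lines2.map(l => (l.split(",")(0), l))
    val joined = byKey1.join(byKey2)
    joined.count()                                       // only this action triggers the actual file scans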

Jobs with unknown origin.

2015-07-08 Thread Jan-Paul Bultmann
Hey, I have quite a few jobs appearing in the web-ui with the description "run at ThreadPoolExecutor.java:1142". Are these generated by SparkSQL internally? There are so many that they cause a RejectedExecutionException when the thread-pool runs out of space for them. RejectedExecutionExceptio

Re: [SparkR] Float type coercion with hiveContext

2015-07-08 Thread Evgeny Sinelnikov
Thank you, Ray, but it is already created and almost fixed: https://issues.apache.org/jira/browse/SPARK-8840 On Wed, Jul 8, 2015 at 4:04 PM, Sun, Rui wrote: > Hi, Evgeny, > > I reported a JIRA issue for your problem: > https://issues.apache.org/jira/browse/SPARK-8897. You can track it to see >

Re: foreachRDD vs. foreachPartition ?

2015-07-08 Thread Dmitry Goldenberg
Thanks, Cody. The "good boy" comment wasn't from me :) I was the one asking for help. On Wed, Jul 8, 2015 at 10:52 AM, Cody Koeninger wrote: > Sean already answered your question. foreachRDD and foreachPartition are > completely different, there's nothing fuzzy or insufficient about that > a

Re: foreachRDD vs. foreachPartition ?

2015-07-08 Thread Dmitry Goldenberg
Thanks, Sean. "are you asking about foreach vs foreachPartition? that's quite different. foreachPartition does not give more parallelism but lets you operate on a whole batch of data at once, which is nice if you need to allocate some expensive resource to do the processing" This is basically wha

Re: Kryo Serializer on Worker doesn't work by default.

2015-07-08 Thread Eugene Morozov
What I don't seem to get is how my code ends up on the Worker node. My understanding was that the jar file which I use to start the job should automatically be copied to the Worker nodes and added to the classpath. It seems that is not the case. But if my jar is not copied to the Worker nodes, then how

Re: Parallelizing multiple RDD / DataFrame creation in Spark

2015-07-08 Thread ayan guha
Do you have a benchmark showing that running these two statements as they are will be slower than what you suggest? On 9 Jul 2015 01:06, "Brandon White" wrote: > The point of running them in parallel would be faster creation of the > tables. Has anybody been able to efficiently parallelize something like

Re: (de)serialize DStream

2015-07-08 Thread Shixiong Zhu
A DStream must be Serializable because of metadata checkpointing. But you can use KryoSerializer for data checkpointing: data checkpointing uses RDD.checkpoint, whose serializer can be set via spark.serializer. Best Regards, Shixiong Zhu 2015-07-08 3:43 GMT+08:00 Chen Song : > In Spark Streaming, when using upd
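A minimal sketch of the setting being described; the app name and checkpoint path are illustrative:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("checkpoint-with-kryo")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint("hdfs:///tmp/streaming-checkpoints")   // enables metadata and data checkpointing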

Re: Restarting Spark Streaming Application with new code

2015-07-08 Thread Vinoth Chandar
Thanks for the clarification, Cody! On Mon, Jul 6, 2015 at 6:44 AM, Cody Koeninger wrote: > You shouldn't rely on being able to restart from a checkpoint after > changing code, regardless of whether the change was explicitly related to > serialization. > > If you are relying on checkpoints to ho

Re: PySpark without PySpark

2015-07-08 Thread Sujit Pal
Hi Julian, I recently built a Python+Spark application to do search relevance analytics. I use spark-submit to submit PySpark jobs to a Spark cluster on EC2 (so I don't use the PySpark shell, hopefully thats what you are looking for). Can't share the code, but the basic approach is covered in this

Streaming checkpoints and logic change

2015-07-08 Thread Jong Wook Kim
I just asked this question at the streaming webinar that just ended, but the speakers didn't answer, so throwing it out here: AFAIK checkpoints are the only recommended method for running Spark Streaming without data loss. But it involves serializing the entire dstream graph, which prohibits any logic c

Re: Streaming checkpoints and logic change

2015-07-08 Thread Tathagata Das
You can use DStream.transform for some stuff. Transform takes an RDD => RDD function that allows arbitrary RDD operations to be done on the RDDs of a DStream. This function gets evaluated on the driver on every batch interval. If you are smart about writing the function, it can do different stuff at diff
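A sketch of that pattern; lines is an existing DStream[String] and loadCurrentBlacklist() is a hypothetical helper that re-reads some external state:

    val cleaned = lines.transform { rdd =>
      val blacklist = loadCurrentBlacklist()      // evaluated on the driver, once per batch
      rdd.filter(line => !blacklist.contains(line))
    }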

Re: how to use DoubleRDDFunctions on mllib Vector?

2015-07-08 Thread Feynman Liang
An RDD[Double] is an abstraction for a large collection of doubles, possibly distributed across multiple nodes. The DoubleRDDFunctions are there for performing mean and variance calculations across this distributed dataset. In contrast, a Vector is not distributed and fits on your local machine. Yo
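A small illustration of that distinction; the vector values are arbitrary:

    import org.apache.spark.mllib.linalg.Vectors

    val v = Vectors.dense(1.0, 2.0, 3.0, 4.0)
    val localMean = v.toArray.sum / v.size          // local: no cluster needed

    val distributed = sc.parallelize(v.toArray)     // RDD[Double]
    val stats = distributed.stats()                 // DoubleRDDFunctions: mean, variance, count, ...
    println(s"mean=${stats.mean} variance=${stats.variance}")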

Re: Getting started with spark-scala development in eclipse.

2015-07-08 Thread Feynman Liang
Take a look at https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-Eclipse On Wed, Jul 8, 2015 at 7:47 AM, Daniel Siegmann wrote: > To set up Eclipse for Spark you should install the Scala IDE plugins: > http://scala-ide.org/download/current.html > > De

spark core/streaming doubts

2015-07-08 Thread Shushant Arora
1. Is creating a read-only singleton object in each map function the same as using a broadcast object, given that the singleton never gets garbage collected unless the executor shuts down? The aim is to avoid creating a complex object at each batch interval of a Spark Streaming app. 2. Why does JavaStreamingContext's sc()
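A hedged illustration of the two options compared in question 1; buildLookup() is a hypothetical expensive constructor and events is an existing DStream[String]:

    object LookupHolder {                                  // per-executor-JVM singleton
      lazy val lookup: Map[String, Int] = buildLookup()    // built at most once per JVM
    }

    val lookupBc = sc.broadcast(buildLookup())             // shipped once per executor, not per batch

    val tagged = events.map { e =>
      (e, LookupHolder.lookup.get(e), lookupBc.value.get(e))
    }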

Re: PySpark without PySpark

2015-07-08 Thread Davies Liu
Great post, thanks for sharing with us! On Wed, Jul 8, 2015 at 9:59 AM, Sujit Pal wrote: > Hi Julian, > > I recently built a Python+Spark application to do search relevance > analytics. I use spark-submit to submit PySpark jobs to a Spark cluster on > EC2 (so I don't use the PySpark shell, hopefu

Re: Streaming checkpoints and logic change

2015-07-08 Thread Jong Wook Kim
Hi TD, you answered the wrong question. If you read the subject, mine was specifically about checkpointing. I'll elaborate. The checkpoint, which is a serialized DStream DAG, contains all the metadata and *logic*, like the function passed to e.g. DStream.transform(). This is serialized as an anonymous

Re: Reading Avro files from Streaming

2015-07-08 Thread harris
Resolved that compilation issue using AvroKey and AvroKeyInputFormat. val avroDs = ssc.fileStream[AvroKey[GenericRecord], NullWritable, AvroKeyInputFormat[GenericRecord]](input) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Reading-Avro-files-f

Re: Streaming checkpoints and logic change

2015-07-08 Thread Tathagata Das
Hey Jong, No I did answer the right question. What I explained did not change the JVM classes (that is the function is the same) but it still ensures that computation is different (the filters get updated with time). So you can checkpoint this and recover from it. This is ONE possible way to do dy

Create RDD from output of unix command

2015-07-08 Thread foobar
What's the best practice for creating an RDD from the output of some external unix command? I assume that if the output size is large (say millions of lines), creating the RDD from an array of all the lines is not a good idea? Thanks! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.co

Re: [SPARK-SQL] libgplcompression.so already loaded in another classloader

2015-07-08 Thread Michael Armbrust
Here's a related JIRA: https://issues.apache.org/jira/browse/SPARK-7819 Typically you can work around this by making sure that the classes are shared across the isolation boundary, as discussed in the comments. On Tue, Jul 7, 2015 at 3:29 AM, Sea

Re: SnappyCompressionCodec on the master

2015-07-08 Thread Josh Rosen
Can you file a JIRA? https://issues.apache.org/jira/browse/SPARK On Wed, Jul 8, 2015 at 12:47 AM, nizang wrote: > hi, > > I'm running spark standalone cluster (1.4.0). I have some applications > running with scheduler every hour. I found that on one of the executions, > the job got to be FINISH

Re: spark core/streaming doubts

2015-07-08 Thread Tathagata Das
Responses inline. On Wed, Jul 8, 2015 at 10:26 AM, Shushant Arora wrote: > 1.Does creation of read only singleton object in each map function is same > as broadcast object as singleton never gets garbage collected unless > executor gets shutdown ? Aim is to avoid creation of complex object at ea

spark benchmarking

2015-07-08 Thread MrAsanjar .
Hi all, What is the most common used tool/product to benchmark spark job?

Communication between driver, cluster and HiveServer

2015-07-08 Thread Eric Pederson
All: I recently ran into a scenario where spark-shell could communicate with Hive but another application of mine (Spark Notebook) could not. When I tried to get a reference to a table using sql.table("tab") Spark Notebook threw an exception but spark-shell did not. I was trying to track down th

Re: spark benchmarking

2015-07-08 Thread Stephen Boesch
One option is the databricks/spark-perf project https://github.com/databricks/spark-perf 2015-07-08 11:23 GMT-07:00 MrAsanjar . : > Hi all, > What is the most common used tool/product to benchmark spark job? >

pause and resume streaming app

2015-07-08 Thread Shushant Arora
Is it possible to pause and resume a streaming app? I have a streaming app which reads events from Kafka and posts them to some external source. I want to pause the app when the external source is down and resume it automatically when it comes back. Is it possible to pause the app, and is it possible to p

Job completed successfully without processing anything

2015-07-08 Thread ๏̯͡๏
My job completed in 40 seconds, which is not correct, as there is no output. I see: Exception in thread "main" akka.actor.ActorNotFound: Actor not found for: ActorSelection[Anchor(akka.tcp://sparkDriver@10.115.86.24:54737/), Path(/user/OutputCommitCoordinator)] at akka.actor.ActorSelection$$anonfun$

RDD saveAsTextFile() to local disk

2015-07-08 Thread Vijay Pawnarkar
Getting an exception when writing an RDD to local disk using the following call: saveAsTextFile("file:home/someuser/dir2/testupload/20150708/") The dir (/home/someuser/dir2/testupload/) was created before running the job. The error message is misleading. org.apache.spark.SparkExce
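The error is truncated, so the cause isn't certain, but one common issue with local-filesystem output is the shape of the URI. A hedged sketch, assuming rdd is the RDD[String] in question, and noting that on a multi-node cluster each executor writes its partitions to its own local disk:

    // absolute path with the file:// scheme (three slashes)
    rdd.saveAsTextFile("file:///home/someuser/dir2/testupload/20150708/")

    // to collect everything onto the driver's local disk instead (small data only)
    import java.io.PrintWriter
    val out = new PrintWriter("/home/someuser/dir2/testupload/output.txt")
    rdd.collect().foreach(line => out.println(line))
    out.close()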

Re: Best practice for using singletons on workers (seems unanswered) ?

2015-07-08 Thread Richard Marscher
Ah, I see this is streaming. I haven't any practical experience with that side of Spark. But the foreachPartition idea is a good approach. I've used that pattern extensively, even though not for singletons, but just to create non-serializable objects like API and DB clients on the executor side. I

RDD saveAsTextFile() to local disk

2015-07-08 Thread spok20nn
Getting an exception when writing an RDD to local disk using the following call: saveAsTextFile("file:home/someuser/dir2/testupload/20150708/") The dir (/home/someuser/dir2/testupload/) was created before running the job. The error message is misleading. org.apache.spark.SparkExce

Re: Create RDD from output of unix command

2015-07-08 Thread Richard Marscher
As a distributed data processing engine, Spark should be fine with millions of lines. It's built with the idea of massive data sets in mind. Do you have more details on how you anticipate the output of a unix command interacting with a running Spark application? Do you expect Spark to be continuous

Re: PySpark without PySpark

2015-07-08 Thread Sujit Pal
You are welcome, Davies. Just to clarify, I didn't write the post (not sure if my earlier post gave that impression; apologies if so), although I agree it's great :-). -sujit On Wed, Jul 8, 2015 at 10:36 AM, Davies Liu wrote: > Great post, thanks for sharing with us! > > On Wed, Jul 8, 2015 at 9

Disable heartbeat messages in REPL

2015-07-08 Thread Lincoln Atkinson
"WARN Executor: Told to re-register on heartbeat" is logged repeatedly in the spark shell, which is very distracting and corrupts the display of whatever set of commands I'm currently typing out. Is there an option to disable the logging of this message? Thanks, -Lincoln

Re: Disable heartbeat messages in REPL

2015-07-08 Thread Feynman Liang
I was thinking the same thing! Try sc.setLogLevel("ERROR") On Wed, Jul 8, 2015 at 2:01 PM, Lincoln Atkinson wrote: > “WARN Executor: Told to re-register on heartbeat” is logged repeatedly > in the spark shell, which is very distracting and corrupts the display of > whatever set of commands I’m

RE: Disable heartbeat messages in REPL

2015-07-08 Thread Lincoln Atkinson
Brilliant! Thanks. From: Feynman Liang [mailto:fli...@databricks.com] Sent: Wednesday, July 08, 2015 2:15 PM To: Lincoln Atkinson Cc: user@spark.apache.org Subject: Re: Disable heartbeat messages in REPL I was thinking the same thing! Try sc.setLogLevel("ERROR") On Wed, Jul 8, 2015 at 2:01 PM, L

Requirement failed: Some of the DStreams have different slide durations

2015-07-08 Thread anshu shukla
Hi all, I want to create a union of 2 DStreams: in one of them an RDD is created every 1 second; the other has RDDs generated by reduceByKeyAndWindow with the window duration set to 60 sec (slide duration also 60 sec). The main idea is to do some analysis over every minute's data and emit the union
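A hedged sketch of one way to align the two streams so the union is legal (both sides must have the same slide duration); the socket source and word-count logic are placeholders:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(1))              // 1-second batches
    val perSecond = ssc.socketTextStream("localhost", 9999).map(w => (w, 1L))

    // 60-second aggregation, sliding every 60 seconds
    val perMinute = perSecond.reduceByKeyAndWindow(_ + _, Seconds(60), Seconds(60))

    // re-window the 1-second stream so both DStreams slide every 60 seconds
    val combined = perSecond.window(Seconds(60), Seconds(60)).union(perMinute)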

Real-time data visualization with Zeppelin

2015-07-08 Thread Ganelin, Ilya
Hi all – I’m just wondering if anyone has had success integrating Spark Streaming with Zeppelin and actually dynamically updating the data in near real-time. From my investigation, it seems that Zeppelin will only allow you to display a snapshot of data, not a continuously updating table. Has an

Re: Real-time data visualization with Zeppelin

2015-07-08 Thread Brandon White
Can you use a cron job to update it every X minutes? On Wed, Jul 8, 2015 at 2:23 PM, Ganelin, Ilya wrote: > Hi all – I'm just wondering if anyone has had success integrating Spark > Streaming with Zeppelin and actually dynamically updating the data in near > real-time. From my investigation, it s

Re: Disable heartbeat messages in REPL

2015-07-08 Thread Andrew Or
Hi Lincoln, I've noticed this myself. I believe it's a new issue that only affects local mode. I've filed a JIRA to track it: https://issues.apache.org/jira/browse/SPARK-8911 2015-07-08 14:20 GMT-07:00 Lincoln Atkinson : > Brilliant! Thanks. > > > > *From:* Feynman Liang [mailto:fli...@databrick

Re: Connecting to nodes on cluster

2015-07-08 Thread Ashish Dutt
The error is JVM has not responded after 10 seconds. On 08-Jul-2015 10:54 PM, "ayan guha" wrote: > What's the error you are getting? > On 9 Jul 2015 00:01, "Ashish Dutt" wrote: > >> Hi, >> >> We have a cluster with 4 nodes. The cluster uses CDH 5.4 for the past two >> days I have been trying to

Re: Best practice for using singletons on workers (seems unanswered) ?

2015-07-08 Thread Dmitry Goldenberg
Richard, It seems whether I'm doing a foreachRDD or foreachPartition, I'm able to create per-worker/per-JVM singletons. With 4 workers, I've got 4 singletons created. I wouldn't be able to use broadcast vars because the 3rd party objects are not serializable. The shuffling effect is basically whe
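A sketch of the per-JVM holder pattern being discussed; ThirdPartyClient stands in for a hypothetical non-serializable client and dstream for the Kafka-backed stream:

    object ClientHolder {
      lazy val client = new ThirdPartyClient()   // initialised at most once per executor JVM
    }

    dstream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        records.foreach(r => ClientHolder.client.send(r))   // nothing non-serializable in the closure
      }
    }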

Remote spark-submit not working with YARN

2015-07-08 Thread jegordon
I'm trying to submit a spark job from a different server outside of my Spark Cluster (running spark 1.4.0, hadoop 2.4.0 and YARN) using the spark-submit script : spark/bin/spark-submit --master yarn-client --executor-memory 4G myjobScript.py The thing is that my application never passes from the ac

Re: Remote spark-submit not working with YARN

2015-07-08 Thread Sandy Ryza
Hi JG, One way this can occur is that YARN doesn't have enough resources to run your job. Have you verified that it does? Are you able to submit using the same command from a node on the cluster? -Sandy On Wed, Jul 8, 2015 at 3:19 PM, jegordon wrote: > I'm trying to submit a spark job from a
