Re: RDDs join problem: incorrect result

2015-07-28 Thread ๏̯͡๏
What is the size of each RDD? What is the size of your cluster, and which Spark configurations did you try? On Tue, Jul 28, 2015 at 9:54 PM, ponkin wrote: > Hi, Alice > > Did you find a solution? > I have exactly the same problem. > > > > -- > View this message in context: > http://apache-spark-user-list.100156

Spark Interview Questions

2015-07-28 Thread Mishra, Abhishek
Hello, please share links or documents for Apache Spark interview questions and answers, and likewise for the related tools that questions might be asked about. Thanking you all. Sincerely, Abhishek

Re: RDDs join problem: incorrect result

2015-07-28 Thread ponkin
Hi, Alice Did you find a solution? I have exactly the same problem. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/RDDs-join-problem-incorrect-result-tp19928p24049.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Is SPARK is the right choice for traditional OLAP query processing?

2015-07-28 Thread Jörn Franke
You may check out Apache Phoenix on top of HBase for this. However, it does not have ODBC drivers, only JDBC ones. Maybe Hive 1.2 with a new version of Tez will also serve your purpose. You should run some proof of concept with these technologies using real or generated data. About how much data are

Re: Is SPARK is the right choice for traditional OLAP query processing?

2015-07-28 Thread Ruslan Dautkhanov
>> We want these user actions to respond within 2 to 5 seconds. I think this goal is a stretch for Spark. Some queries may run faster than that on a large dataset, but in general you can't put an SLA like this on it. For example, if you have to join some huge datasets, you'll likely be well over that. S

Does spark-submit support file transfering from local to cluster?

2015-07-28 Thread Anh Hong
Hi, I'm using spark-submit cluster mode to submit a job from a local machine to a Spark cluster. There are input files, output files, and job log files that I need to transfer in and out between the local machine and the Spark cluster. Are there any recommended methods for transferring the files? Is there any future plan

SparkR does not include SparkContext

2015-07-28 Thread Siegfried Bilstein
Hi, I'm starting R on Spark via the sparkR script, but I can't access the SparkContext as described in the programming guide. Any ideas? Thanks, Siegfried

Authentication Support with spark-submit cluster mode

2015-07-28 Thread Anh Hong
Hi, I'd like to run spark-submit remotely from a local machine to submit a job to a Spark cluster (cluster mode). What method do I use to authenticate myself to the cluster? For example, how do I pass a user id, password, or private key to the cluster? Any help is appreciated.

Fwd: Writing streaming data to cassandra creates duplicates

2015-07-28 Thread Priya Ch
Hi TD, Thanks for the info. I have a scenario like this: I am reading the data from a Kafka topic. Let's say Kafka has 3 partitions for the topic. In my streaming application, I would configure 3 receivers with 1 thread each such that they would receive 3 DStreams (from 3 partitions of Kafka to

Re: NO Cygwin Support in bin/spark-class in Spark 1.4.0

2015-07-28 Thread Proust GZ Feng
Thanks Vanzin, spark-submit.cmd works. Thanks, Proust From: Marcelo Vanzin To: Proust GZ Feng/China/IBM@IBMCN Cc: Sean Owen, user Date: 07/29/2015 10:35 AM Subject: Re: NO Cygwin Support in bin/spark-class in Spark 1.4.0 Can you run the windows batch files (e.g. spark-s

Re: NO Cygwin Support in bin/spark-class in Spark 1.4.0

2015-07-28 Thread Marcelo Vanzin
Can you run the Windows batch files (e.g. spark-submit.cmd) from the Cygwin shell? On Tue, Jul 28, 2015 at 7:26 PM, Proust GZ Feng wrote: > Hi, Owen > > Adding back the cygwin classpath detection can get past the issue mentioned > before, but there seems to be a lack of further support in the launch lib, see >

Re: NO Cygwin Support in bin/spark-class in Spark 1.4.0

2015-07-28 Thread Proust GZ Feng
Although I'm not sure how valuable Cygwin support is, at the least the release notes should mention that Cygwin is not supported by design as of 1.4.0. From the description of the changeset, it looks like removing the support was not intended by the author. Thanks Proust From: Sachin Naik To:

Job hang when running random forest

2015-07-28 Thread Andy Zhao
Hi guys, A job hung for about 16 hours when I ran the random forest algorithm, and I don't know why that happened. I use Spark 1.4.0 on YARN; here is the code, and the following picture is from the Spark UI

Re: NO Cygwin Support in bin/spark-class in Spark 1.4.0

2015-07-28 Thread Proust GZ Feng
Hi, Owen Adding back the cygwin classpath detection can get past the issue mentioned before, but there seems to be a lack of further support in the launch lib; see the stack trace below LAUNCH_CLASSPATH: C:\spark-1.4.0-bin-hadoop2.3\lib\spark-assembly-1.4.0-hadoop2.3.0.jar java -cp C:\spark-1.4.0-bin-hadoop2.3\l

Job hang when running random forest

2015-07-28 Thread Andy Zhao
Hi guys, When I ran the random forest algorithm, a job hung for 15.8 h, and I cannot figure out why that happened. Here is the code. And I use Spark 1.4.0 on YARN <

Re: Spark Streaming Json file groupby function

2015-07-28 Thread Tathagata Das
If you are trying to keep such long-term state, it will be more robust in the long term to use a dedicated data store (Cassandra/HBase/etc.) that is designed for long-term storage. On Tue, Jul 28, 2015 at 4:37 PM, swetha wrote: > > > Hi TD, > > We have a requirement to maintain the user sessio

Re: restart from last successful stage

2015-07-28 Thread Tathagata Das
Okay, maybe I am confused by the phrase "would be useful to *restart* from the output of stage 0" ... did the OP mean a restart by the user, or a restart automatically by the system? On Tue, Jul 28, 2015 at 3:43 PM, ayan guha wrote: > Hi > > I do not think op asks about attempt failure but stage failure

Spark and Speech Recognition

2015-07-28 Thread Peter Wolf
Hello, I am writing a Spark application to use speech recognition to transcribe a very large number of recordings. I need some help configuring Spark. My app is basically a transformation with no side effects: recording URL --> transcript. The input is a huge file with one URL per line, and the

Re: Has anybody ever tried running Spark Streaming on 500 text streams?

2015-07-28 Thread Tathagata Das
@Ashwin: You could append the topic to the data. val kafkaStreams = topics.map { topic => KafkaUtils.createDirectStream(topic...).map { x => (x, topic) } } val unionedStream = context.union(kafkaStreams) @Brandon: I don't recommend it, but you could do something crazy like use the foreach
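A minimal, self-contained version of that tagging sketch, assuming ssc, topics (a Seq[String]), and kafkaParams are already in scope (all placeholder names):

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    // one direct stream per topic, each record tagged with its source topic
    val kafkaStreams = topics.map { topic =>
      KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
        ssc, kafkaParams, Set(topic))
        .map { case (_, value) => (value, topic) }
    }
    // a single unioned stream; the second tuple element identifies the topic
    val unionedStream = ssc.union(kafkaStreams)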

Re: broadcast variable question

2015-07-28 Thread Jonathan Coveney
That's great! Thanks. On Tuesday, July 28, 2015, Ted Yu wrote: > If I understand correctly, there would be one value in the executor. > > Cheers > > On Tue, Jul 28, 2015 at 4:23 PM, Jonathan Coveney > wrote: > >> I am running in coarse-grained mode, let's say with 8 cores per executor. >

Re: unsubscribe

2015-07-28 Thread Ted Yu
Take a look at the first section here: http://spark.apache.org/community.html On Tue, Jul 28, 2015 at 5:03 PM, Harshvardhan Chauhan wrote: > > > -- > *Harshvardhan Chauhan* | Software Engineer > *GumGum* | *Ads that stick* > 310-260-9666 | ha...@gumgum.com >

Re: Getting the number of slaves

2015-07-28 Thread amkcom
try sc.getConf.getInt("spark.executor.instances", 1) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Getting-the-number-of-slaves-tp10604p24043.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: unsubscribe

2015-07-28 Thread Brandon White
NO! On Tue, Jul 28, 2015 at 5:03 PM, Harshvardhan Chauhan wrote: > > > -- > *Harshvardhan Chauhan* | Software Engineer > *GumGum* | *Ads that stick* > 310-260-9666 | ha...@gumgum.com >

unsubscribe

2015-07-28 Thread Harshvardhan Chauhan
-- *Harshvardhan Chauhan* | Software Engineer *GumGum* | *Ads that stick* 310-260-9666 | ha...@gumgum.com

Re: broadcast variable question

2015-07-28 Thread Ted Yu
If I understand correctly, there would be one value in the executor. Cheers On Tue, Jul 28, 2015 at 4:23 PM, Jonathan Coveney wrote: > i am running in coarse grained mode, let's say with 8 cores per executor. > > If I use a broadcast variable, will all of the tasks in that executor > share the
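A small sketch of what that means in practice (the lookup table and names are hypothetical): the broadcast value is deserialized once per executor and then shared, read-only, by all tasks running there.

    val lookup: Map[String, Int] = loadLookupTable()  // hypothetical driver-side data
    val bcLookup = sc.broadcast(lookup)               // shipped once per executor

    val enriched = rdd.map { key =>
      // all 8 tasks on this executor read the same in-memory copy
      (key, bcLookup.value.getOrElse(key, -1))
    }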

Re: Spark - Eclipse IDE - Maven

2015-07-28 Thread Carol McDonald
I agree, I found this book very useful for getting started with Spark and Eclipse. On Tue, Jul 28, 2015 at 11:10 AM, Petar Zecevic wrote: > > Sorry about self-promotion, but there's a really nice tutorial for setting > up Eclipse for Spark in the "Spark in Action" book: > http://www.manning.com/bona

Re: Spark Streaming Json file groupby function

2015-07-28 Thread swetha
Hi TD, We have a requirement to maintain the user session state and to maintain/update the metrics for minute, day and hour granularities for a user session in our Streaming job. Can I keep those granularities in the state and recalculate each time there is a change? How would the performance

broadcast variable question

2015-07-28 Thread Jonathan Coveney
I am running in coarse-grained mode, let's say with 8 cores per executor. If I use a broadcast variable, will all of the tasks in that executor share the same value, or will each task broadcast its own value? I.e., in this case, would there be one value in the executor shared by the 8 tasks, or would

Re: Has anybody ever tried running Spark Streaming on 500 text streams?

2015-07-28 Thread Brandon White
Thank you Tathagata. My main use case for the 500 streams is to append new elements into their corresponding Spark SQL tables. Every stream is mapped to a table, so I'd like to use the streams to append the new RDDs to the table. If I union all the streams, appending new elements becomes a nightma

Re: Has anybody ever tried running Spark Streaming on 500 text streams?

2015-07-28 Thread Ashwin Giridharan
@Das, Is there any way to identify a Kafka topic when we have a unified stream? As of now, I create a dedicated DStream for each topic and use foreachRDD on each of these streams. If I have, say, 100 Kafka topics, then how can I use a unified stream and still take topic-specific actions inside foreachRDD ?

Re: restart from last successful stage

2015-07-28 Thread ayan guha
Hi, I do not think the OP asks about attempt failure, but about stage failure finally leading to job failure. In that case, the RDD info from the last run is gone, even from the cache, isn't it? Ayan On 29 Jul 2015 07:01, "Tathagata Das" wrote: > If you are using the same RDDs in both the attempts to run the

Re: Has anybody ever tried running Spark Streaming on 500 text streams?

2015-07-28 Thread Tathagata Das
I don't think anyone has really run 500 text streams. And the par sequences do nothing there; you are only parallelizing the setup code, which does not really compute anything. Also, it sets up 500 foreachRDD operations that will get executed sequentially in each batch, so it does not make sense. The writ

Has anybody ever tried running Spark Streaming on 500 text streams?

2015-07-28 Thread Brandon White
val ssc = new StreamingContext(sc, Minutes(10)) //500 textFile streams watching S3 directories val streams = streamPaths.par.map { path => ssc.textFileStream(path) } streams.par.foreach { stream => stream.foreachRDD { rdd => //do something } } ssc.start() Would something like this sca

Re: Do I really need to build Spark for Hive/Thrift Server support?

2015-07-28 Thread ReeceRobinson
I am building an analytics environment based on Spark and want to use Hive in multi-user mode, i.e. not using the embedded Derby database but Postgres and HDFS instead. I am using the included Spark Thrift Server to process queries using Spark SQL. The documentation gives me the impression that I

Re: [ Potential bug ] Spark terminal logs say that job has succeeded even though job has failed in Yarn cluster mode

2015-07-28 Thread Marcelo Vanzin
First, it's kinda confusing to change subjects in the middle of a thread... On Tue, Jul 28, 2015 at 1:44 PM, Elkhan Dadashov wrote: > @Marcelo > *Question1*: > Do you know why launching Spark job through SparkLauncher in Java, stdout > logs (i.e., INFO Yarn.Client) are written into error stream

Re: NO Cygwin Support in bin/spark-class in Spark 1.4.0

2015-07-28 Thread Sachin Naik
I agree with Sean - using VirtualBox on Windows with a Linux VM is a lot easier than trying to circumvent the Cygwin oddities. A lot of functionality might not work in Cygwin, and you will end up trying to do back patches. Unless there is a compelling reason, Cygwin support seems not require

Re: restart from last successful stage

2015-07-28 Thread Tathagata Das
If you are using the same RDDs in both the attempts to run the job, the previous stage outputs generated in the previous job will indeed be reused. This applies to core, though. For DataFrames, depending on what you do, the physical plan may get generated again, leading to new RDDs, which may caus

Re: [Spark ML] HasInputCol, etc.

2015-07-28 Thread Feynman Liang
Unfortunately, AFAIK custom transformers are not part of the public API so you will have to continue with what you're doing. On Tue, Jul 28, 2015 at 1:32 PM, Matt Narrell wrote: > Hey, > > Our ML ETL pipeline has several complex steps that I’d like to address > with custom Transformers in an ML

Re: [ Potential bug ] Spark terminal logs say that job has succeeded even though job has failed in Yarn cluster mode

2015-07-28 Thread Elkhan Dadashov
I run Spark in yarn-cluster mode, and yes, log aggregation is enabled. In the YARN aggregated logs I can see the job status correctly. The issue is that the YARN client logs (which are written to stdout in the terminal) state that the job has succeeded even though the job has failed. As the user is not testing if the YARN RM succe

[Spark ML] HasInputCol, etc.

2015-07-28 Thread Matt Narrell
Hey, Our ML ETL pipeline has several complex steps that I’d like to address with custom Transformers in an ML Pipeline. Looking at the Tokenizer and HashingTF transformers I see these handy traits (HasInputCol, HasLabelCol, HasOutputCol, etc.) but they have strict access modifiers. How can I

Re: NO Cygwin Support in bin/spark-class in Spark 1.4.0

2015-07-28 Thread Sean Owen
That's for the Windows interpreter rather than bash-running Cygwin. I don't know it's worth doing a lot of legwork for Cygwin, but, if it's really just a few lines of classpath translation in one script, seems reasonable. On Tue, Jul 28, 2015 at 9:13 PM, Steve Loughran wrote: > > there's a spark-

Re: Getting java.net.BindException when attempting to start Spark master on EC2 node with public IP

2015-07-28 Thread Steve Loughran
try looking at the causes and steps here: https://wiki.apache.org/hadoop/BindException On 28 Jul 2015, at 09:22, Wayne Song wrote: I made this message with the Nabble web interface; I included the stack trace there, but I guess it didn't show up in the emails.

Re: [ Potential bug ] Spark terminal logs say that job has succeeded even though job has failed in Yarn cluster mode

2015-07-28 Thread Corey Nolet
On Tue, Jul 28, 2015 at 2:17 PM, Elkhan Dadashov wrote: > Thanks Corey for your answer, > > Do you mean that "final status : SUCCEEDED" in terminal logs means that > YARN RM could clean the resources after the application has finished > (application finishing does not necessarily mean succeeded o

Re: NO Cygwin Support in bin/spark-class in Spark 1.4.0

2015-07-28 Thread Steve Loughran
there's a spark-submit.cmd file for Windows. Does that work? On 27 Jul 2015, at 21:19, Proust GZ Feng wrote: Hi, Spark Users Looks like Spark 1.4.0 cannot work with Cygwin due to the removal of Cygwin support in bin/spark-class. The changeset is https://github.com/

Re: Actor not found for: ActorSelection

2015-07-28 Thread Haseeb
The problem was that I was trying to start the example app in standalone cluster mode by passing in *-Dspark.master=spark://myhost:7077* as an argument to the JVM. I launched the example app locally using -*Dspark.master=local* and it worked. -- View this message in context: http://apache-spark

Re: Fighting against performance: JDBC RDD badly distributed

2015-07-28 Thread shenyan zhen
Saif, I am guessing at your use case, but am not sure. Are you retrieving the entire table into Spark? If yes, do you have a primary key on your table? If also yes, then JdbcRDD should be efficient. DataFrameReader.jdbc gives you more options; again, it depends on your use case. Is it possible for you to describe

restart from last successful stage

2015-07-28 Thread Alex Nastetsky
Is it possible to restart the job from the last successful stage instead of from the beginning? For example, if your job has stages 0, 1, and 2, and stage 0 takes a long time and is successful, but the job fails on stage 1, it would be useful to be able to restart from the output of stage 0 inste
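Assuming there is no built-in resume-from-stage (which is what the replies in this thread suggest), one hedged workaround sketch is to materialize the expensive intermediate result to durable storage so a rerun can pick it up instead of recomputing stage 0 (the path and element type below are hypothetical):

    // first run: persist the expensive stage-0 output
    stage0Output.saveAsObjectFile("hdfs:///tmp/myjob/stage0")

    // rerun after a failure: resume from the saved output instead of recomputing
    val resumed = sc.objectFile[(String, Long)]("hdfs:///tmp/myjob/stage0")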

Re: DataFrame DAG recomputed even though DataFrame is cached?

2015-07-28 Thread Michael Armbrust
We will try to address this before Spark 1.5 is released: https://issues.apache.org/jira/browse/SPARK-9141 On Tue, Jul 28, 2015 at 11:50 AM, Kristina Rogale Plazonic wrote: > Hi, > > I'm puzzling over the following problem: when I cache a small sample of a > big dataframe, the small dataframe is

Re: [ Potential bug ] Spark terminal logs say that job has succeeded even though job has failed in Yarn cluster mode

2015-07-28 Thread Elkhan Dadashov
Thanks a lot for feedback, Marcelo. I've filed a bug just now - SPARK-9416 On Tue, Jul 28, 2015 at 12:14 PM, Marcelo Vanzin wrote: > BTW this is most probably caused by this line in PythonRunner.scala: > > System.exit(process.waitFor()) >

RE: Fighting against performance: JDBC RDD badly distributed

2015-07-28 Thread Saif.A.Ellafi
Thank you for your response, Zhen. I am using some vendor-specific JDBC driver JAR file (honestly, I don't know where it came from). Its API is not like JdbcRDD's; instead, it is more like jdbc from DataFrameReader: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameRea
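For reference, a sketch of the partitioned variant of DataFrameReader.jdbc in Spark 1.4; the URL, credentials, and the numeric column "id" are placeholders:

    import java.util.Properties

    val props = new Properties()
    props.setProperty("user", "dbuser")
    props.setProperty("password", "secret")

    // Spark issues one query per partition, splitting [lowerBound, upperBound]
    // of the given numeric column across numPartitions
    val df = sqlContext.read.jdbc(
      "jdbc:vendor://host/db", "mytable",
      "id",          // numeric partition column
      1L, 1000000L,  // lower and upper bound of that column
      32,            // number of partitions
      props)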

Re: Fighting against performance: JDBC RDD badly distributed

2015-07-28 Thread shenyan zhen
Hi Saif, Are you using JdbcRDD directly from Spark? If yes, then the poor distribution could be due to the bound key you used. See the JdbcRDD Scala doc at https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.JdbcRDD : sql the text of the query. The query must contain t
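A minimal JdbcRDD sketch showing the two required "?" placeholders that the bounds get substituted into (the connection string, table, and bounds are hypothetical):

    import java.sql.{DriverManager, ResultSet}
    import org.apache.spark.rdd.JdbcRDD

    val rows = new JdbcRDD(
      sc,
      () => DriverManager.getConnection("jdbc:postgresql://host/db", "user", "pass"),
      "SELECT id, name FROM people WHERE id >= ? AND id <= ?",
      1L, 1000000L, 32,  // lower bound, upper bound, number of partitions
      (rs: ResultSet) => (rs.getLong(1), rs.getString(2)))

If the values of the bound column are heavily skewed, the partitions will be uneven, which matches the poor distribution described.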

Re: [ Potential bug ] Spark terminal logs say that job has succeeded even though job has failed in Yarn cluster mode

2015-07-28 Thread Marcelo Vanzin
BTW this is most probably caused by this line in PythonRunner.scala: System.exit(process.waitFor()) The YARN backend doesn't like applications calling System.exit(). On Tue, Jul 28, 2015 at 12:00 PM, Marcelo Vanzin wrote: > This might be an issue with how pyspark propagates the error back

Re: [ Potential bug ] Spark terminal logs say that job has succeeded even though job has failed in Yarn cluster mode

2015-07-28 Thread Elkhan Dadashov
But then how can we find out programmatically (in Java) whether the job is making progress, or whether it has failed or succeeded? Is looking at the application log files the only way of knowing the job's final status (failed/succeeded)? Because when a job fails, the Job History Server does not have much info about

Re: [ Potential bug ] Spark terminal logs say that job has succeeded even though job has failed in Yarn cluster mode

2015-07-28 Thread Marcelo Vanzin
This might be an issue with how pyspark propagates the error back to the AM. I'm pretty sure this does not happen for Scala / Java apps. Have you filed a bug? On Tue, Jul 28, 2015 at 11:17 AM, Elkhan Dadashov wrote: > Thanks Corey for your answer, > > Do you mean that "final status : SUCCEEDED"

DataFrame DAG recomputed even though DataFrame is cached?

2015-07-28 Thread Kristina Rogale Plazonic
Hi, I'm puzzling over the following problem: when I cache a small sample of a big DataFrame, the small DataFrame is recomputed when selecting a column (but not if show() or count() is invoked). Why is that so, and how can I avoid recomputation of the small sample DataFrame? More details: - I hav
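A sketch of the pattern under discussion (names are hypothetical); one common workaround is to force the cached sample to materialize with count() before selecting columns:

    val small = big.sample(false, 0.01).cache()
    small.count()                          // materializes the sample into the cache
    val col = small.select("some_column")  // intended to hit the cached data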

Fighting against performance: JDBC RDD badly distributed

2015-07-28 Thread Saif.A.Ellafi
Hi all, I am experimenting and learning about performance on big tasks locally, with a 32-core node and more than 64 GB of RAM; data is loaded from a database through a JDBC driver, and heavy computations are launched against it. I am presented with two questions: 1. My RDD is poorly distributed. I

Re: [ Potential bug ] Spark terminal logs say that job has succeeded even though job has failed in Yarn cluster mode

2015-07-28 Thread Elkhan Dadashov
Thanks Corey for your answer, Do you mean that "final status : SUCCEEDED" in terminal logs means that YARN RM could clean the resources after the application has finished (application finishing does not necessarily mean succeeded or failed) ? With that logic it totally makes sense. Basically the

Actor not found for: ActorSelection

2015-07-28 Thread Haseeb
I just cloned the master repository of Spark from GitHub. I am running it on OS X 10.9, Spark 1.4.1, and Scala 2.10.4. I just tried to run the SparkPi example program using IntelliJ IDEA but get the error: akka.actor.ActorNotFound: Actor not found for: ActorSelection[Anchor(akka.tcp://sparkMaster@my

Re: Data from PostgreSQL to Spark

2015-07-28 Thread Jörn Franke
Can you put some transparent cache in front of the database? Or some JDBC proxy? On Tue, Jul 28, 2015 at 19:34, Jeetendra Gangele wrote: > can the source write to Kafka/Flume/Hbase in addition to Postgres? no > it can't write ,this is due to the fact that there are many applications > those a

Re: Data from PostgreSQL to Spark

2015-07-28 Thread Jeetendra Gangele
Can the source write to Kafka/Flume/HBase in addition to Postgres? No, it can't; this is due to the fact that there are many applications producing this PostgreSQL data, and I can't really ask all the teams to start writing to some other source. The velocity of the application is too high

Re: Which directory contains third party libraries for Spark

2015-07-28 Thread Burak Yavuz
Hey Stephen, In case these libraries exist on the client in the form of Maven artifacts, you can use --packages to ship the library and all its dependencies, without building an uber jar. Best, Burak On Tue, Jul 28, 2015 at 10:23 AM, Marcelo Vanzin wrote: > Hi Stephen, > > There is no such direct

Re: Which directory contains third party libraries for Spark

2015-07-28 Thread Marcelo Vanzin
Hi Stephen, There is no such directory currently. If you want to add an existing jar to every app's classpath, you need to modify two config values: spark.driver.extraClassPath and spark.executor.extraClassPath. On Mon, Jul 27, 2015 at 10:22 PM, Stephen Boesch wrote: > when using spark-submit:
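For reference, these two values are typically set in conf/spark-defaults.conf (the jar path below is hypothetical), since the driver JVM is already running by the time application code could set them:

    spark.driver.extraClassPath   /opt/libs/thirdparty.jar
    spark.executor.extraClassPath /opt/libs/thirdparty.jar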

Re: hive.contrib.serde2.RegexSerDe not found

2015-07-28 Thread Gianluca Privitera
Try using org.apache.hadoop.hive.serde2.RegexSerDe GP On 27 Jul 2015, at 09:35, ZhuGe wrote: Hi all: I am testing the performance of Hive on Spark SQL. The existing table is created with ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' WITH SERDEPROPE
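A hedged sketch of what that change looks like, with a made-up table and regex (the point is the serde2, non-contrib, class name), issued through a HiveContext:

    // hypothetical table and regex; only the SERDE class name is the suggestion
    sqlContext.sql("""
      CREATE TABLE access_logs (host STRING, request STRING)
      ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
      WITH SERDEPROPERTIES ("input.regex" = "([^ ]*) (.*)")
    """)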

Re: Data from PostgreSQL to Spark

2015-07-28 Thread santoshv98
Sqoop’s incremental data fetch will reduce the data size you need to pull from the source, but then by the time that incremental fetch is complete, will the data not be out of date again if its velocity is high? Maybe you can put a trigger in Postgres to send data to the big data cluster as soon

Re: Getting java.net.BindException when attempting to start Spark master on EC2 node with public IP

2015-07-28 Thread Wayne Song
I made this message with the Nabble web interface; I included the stack trace there, but I guess it didn't show up in the emails. Anyways, here's the stack trace: 15/07/27 17:04:09 ERROR NettyTransport: failed to bind to /54.xx.xx.xx:7093, shutting down Netty transport Exception in thread "main"

Re: Generalised Spark-HBase integration

2015-07-28 Thread Michal Haris
Cool, will revisit. Is your latest code visible publicly somewhere? On 28 July 2015 at 17:14, Ted Malaska wrote: > Yup you should be able to do that with the APIs that are going into HBase. > > Let me know if you need to chat about the problem and how to implement it > with the HBase apis. > >

Re: Generalised Spark-HBase integration

2015-07-28 Thread Michal Haris
Oops, yes, I'm still messing with the repo on a daily basis.. fixed On 28 July 2015 at 17:11, Ted Yu wrote: > I got a compilation error: > > [INFO] /home/hbase/s-on-hbase/src/main/scala:-1: info: compiling > [INFO] Compiling 18 source files to /home/hbase/s-on-hbase/target/classes > at 143809956

Re: Generalised Spark-HBase integration

2015-07-28 Thread Michal Haris
Hi Ted, yes, the Cloudera blog and your code were my starting point - but I needed something more Spark-centric rather than HBase-centric. Basically doing a lot of ad hoc transformations with RDDs that were based on HBase tables and then mutating them after a series of iterative (BSP-like) steps. On 28 July 2

Re: Generalised Spark-HBase integration

2015-07-28 Thread Ted Yu
I got a compilation error: [INFO] /home/hbase/s-on-hbase/src/main/scala:-1: info: compiling [INFO] Compiling 18 source files to /home/hbase/s-on-hbase/target/classes at 1438099569598 [ERROR] /home/hbase/s-on-hbase/src/main/scala/org/apache/spark/hbase/examples/simple/HBaseTableSimple.scala:36: err

Generalised Spark-HBase integration

2015-07-28 Thread Michal Haris
Hi all, for the last couple of months I've been working on large graph analytics, and along the way I have written from scratch an HBase-Spark integration, as none of the ones out there worked either in terms of scale or in the way they integrated with the RDD interface. This week I have generalised it into a

spark-csv number of partitions

2015-07-28 Thread Srikanth
Hello, I'm using spark-csv instead of sc.textFile() to work with CSV files. How can I set the number of partitions that will be created when reading a CSV? Basically, an equivalent of minPartitions in textFile(): val myrdd = sc.textFile("my.csv", 24) Srikanth
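Assuming spark-csv exposes no minPartitions equivalent (the premise of the question), one hedged workaround is to repartition right after the read; the option names below follow the spark-csv README:

    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .load("my.csv")
      .repartition(24)  // approximates textFile's minPartitions after the fact

Note that repartition incurs a shuffle, unlike setting the partition count at read time.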

Re: Is spark suitable for real time query

2015-07-28 Thread Petar Zecevic
You can try out a few tricks employed by folks at Lynx Analytics... Daniel Darabos gave some details at Spark Summit: https://www.youtube.com/watch?v=zt1LdVj76LU&index=13&list=PL-x35fyliRwhP52fwDqULJLOnqnrN5nDs On 22.7.2015. 17:00, Louis Hust wrote: My code like below: Map t11opt

Re: Spark - Eclipse IDE - Maven

2015-07-28 Thread Petar Zecevic
Sorry about the self-promotion, but there's a really nice tutorial for setting up Eclipse for Spark in the "Spark in Action" book: http://www.manning.com/bonaci/ On 24.7.2015. 7:26, Siva Reddy wrote: Hi All, I am trying to set up Eclipse (Luna) with Maven so that I can create Maven projects

sc.parallelise to work more like a producer/consumer?

2015-07-28 Thread Kostas Kougios
Hi, I am using sc.parallelise(...32k of items) several times for one job. Each executor takes x amount of time to process its items, but this results in some executors finishing quickly and staying idle till the others catch up. Only after all executors complete the first 32k batch is the next batch

PySpark MLlib Numpy Dependency

2015-07-28 Thread Eskilson,Aleksander
The documentation for the NumPy dependency for MLlib seems somewhat vague [1]. Is NumPy only a dependency for the driver node, or must it also be installed on every worker node? Thanks, Alek [1] -- http://spark.apache.org/docs/latest/mllib-guide.html#dependencies

projection optimization?

2015-07-28 Thread Eric Friedman
If I have a Hive table with six columns and create a DataFrame (Spark 1.4.1) using a sqlContext.sql("select * from ...") query, the resulting physical plan shown by explain reflects the goal of returning all six columns. If I then call select("one_column") on that first DataFrame, the resulting Da

Re: Resume checkpoint failed with Spark Streaming Kafka via createDirectStream under heavy reprocessing

2015-07-28 Thread Cody Koeninger
That stack trace looks like an out-of-heap-space error on the driver while writing a checkpoint, not on the worker nodes. How much memory are you giving the driver? How big are your stored checkpoints? On Tue, Jul 28, 2015 at 9:30 AM, Nicolas Phung wrote: > Hi, > > After using KafkaUtils.createDirectSt

Re: Resume checkpoint failed with Spark Streaming Kafka via createDirectStream under heavy reprocessing

2015-07-28 Thread Nicolas Phung
Hi, After using KafkaUtils.createDirectStream[Object, Object, KafkaAvroDecoder, KafkaAvroDecoder, Option[AnalyticEventEnriched]](ssc, kafkaParams, map, messageHandler), I'm encountering the following issue: 15/07/28 00:29:57 ERROR actor.ActorSystemImpl: Uncaught fatal error from thread [sparkDriv

Re: Spark - Eclipse IDE - Maven

2015-07-28 Thread Petar Zecevic
Sorry about the self-promotion, but there's a really nice tutorial for setting up Eclipse for Spark in the "Spark in Action" book: http://www.manning.com/bonaci/ On 27.7.2015. 10:22, Akhil Das wrote: You can follow this doc https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#Use

Checkpoint issue in spark streaming

2015-07-28 Thread Sadaf
Hi all. I am writing a Twitter connector using Spark Streaming. I have written the following code to maintain the checkpoint: val ssc=StreamingContext.getOrCreate("hdfs://192.168.23.109:9000/home/cloud9/twitterCheckpoint",()=> { managingContext() }) def managingContext():StreamingContext = {

Re: Multiple operations on same DStream in Spark Streaming

2015-07-28 Thread Dean Wampler
Is this average supposed to be across all partitions? If so, it will require one of the reduce operations in every batch interval. If that's too slow for the data rate, I would investigate using PairDStreamFunctions.updateStateByKey to compute the sum + count of the 2nd integers, per 1st integ
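A rough sketch of that updateStateByKey approach, keeping a running (sum, count) per key (the stream and pair names are hypothetical; a checkpoint directory must be set for stateful operations):

    // pairs: DStream[(Int, Int)] of (key, value) from the source stream
    val sumCounts = pairs.updateStateByKey[(Long, Long)] {
      (values: Seq[Int], state: Option[(Long, Long)]) =>
        val (sum, count) = state.getOrElse((0L, 0L))
        Some((sum + values.sum, count + values.size))
    }
    val averages = sumCounts.mapValues { case (sum, count) =>
      if (count == 0) 0.0 else sum.toDouble / count
    }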

Re: Clustetr setup for SPARK standalone application:

2015-07-28 Thread Dean Wampler
When you say you installed Spark, did you install the master and slave services for standalone mode as described here? If you intended to run Spark on Hadoop, see here. It looks l

Re: log file directory

2015-07-28 Thread Ted Yu
The path to the log file should be displayed when you launch the master, e.g. /mnt/var/log/apps/spark-hadoop-org.apache.spark.deploy.master.Master-MACHINENAME.out On Mon, Jul 27, 2015 at 11:28 PM, Jack Yang wrote: > Hi all, > > > > I have questions with regard to the log file directory. > > > > Tha

Re: Checkpoints in SparkStreaming

2015-07-28 Thread Cody Koeninger
Yes, you need to follow the documentation. Configure your stream, including the transformations made to it, inside the getOrCreate function. On Tue, Jul 28, 2015 at 3:14 AM, Guillermo Ortiz wrote: > I'm using SparkStreaming and I want to configure checkpoint to manage > fault-tolerance. > I've
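A skeletal sketch of that pattern: the stream and all its transformations are defined inside the factory function handed to getOrCreate (paths, source, and batch interval are placeholders):

    def createContext(): StreamingContext = {
      val ssc = new StreamingContext(conf, Seconds(10))
      ssc.checkpoint("hdfs:///user/me/checkpoint")
      // stream setup and every transformation live in here
      val lines = ssc.socketTextStream("somehost", 9999)
      lines.flatMap(_.split(" ")).print()
      ssc
    }

    val ssc = StreamingContext.getOrCreate("hdfs:///user/me/checkpoint", createContext _)
    ssc.start()
    ssc.awaitTermination()

On a clean start the factory runs; on recovery, the context and its whole DStream graph are rebuilt from the checkpoint instead.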

Re: spark streaming get kafka individual message's offset and partition no

2015-07-28 Thread Cody Koeninger
You don't have to use some other package in order to get access to the offsets. Shushant, have you read the available documentation at http://spark.apache.org/docs/latest/streaming-kafka-integration.html https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md or watched https:/
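From the documentation linked above, the core pattern is casting each RDD of the direct stream to HasOffsetRanges (a Scala sketch of what the docs describe; the OP's code is Java):

    import org.apache.spark.streaming.kafka.HasOffsetRanges

    directKafkaStream.foreachRDD { rdd =>
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      offsetRanges.foreach { o =>
        println(s"topic=${o.topic} partition=${o.partition} " +
          s"offsets=[${o.fromOffset}, ${o.untilOffset})")
      }
    }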

Re: Which directory contains third party libraries for Spark

2015-07-28 Thread Ted Yu
Can you show us a snippet of the exception stack? Thanks > On Jul 27, 2015, at 10:22 PM, Stephen Boesch wrote: > > when using spark-submit: which directory contains third party libraries that > will be loaded on each of the slaves? I would like to scp one or more > libraries to each of t

Re: Getting java.net.BindException when attempting to start Spark master on EC2 node with public IP

2015-07-28 Thread Ted Yu
Can you show the full stack trace? Which Spark release are you using? Thanks > On Jul 27, 2015, at 10:07 AM, Wayne Song wrote: > > Hello, > > I am trying to start a Spark master for a standalone cluster on an EC2 node. > The CLI command I'm using looks like this: > > > > Note that I'm

Re: spark streaming get kafka individual message's offset and partition no

2015-07-28 Thread Dibyendu Bhattacharya
If you want the offsets of individual Kafka messages, you can use this consumer from Spark Packages: http://spark-packages.org/package/dibbhatt/kafka-spark-consumer Regards, Dibyendu On Tue, Jul 28, 2015 at 6:18 PM, Shushant Arora wrote: > Hi > > I am processing kafka messages using spark str

spark streaming get kafka individual message's offset and partition no

2015-07-28 Thread Shushant Arora
Hi, I am processing Kafka messages using Spark Streaming 1.3, using the mapPartitions function to process them. How can I access the offset number and partition of each individual message being processed? JavaPairInputDStream directKafkaStream = KafkaUtils.createDirectStream(..); directKafkaStream.mapPa

Re: *Metrics API is odd in MLLib

2015-07-28 Thread Sam
Hi Xiangrui & Spark People, I recently got round to writing an evaluation framework for Spark that I was hoping to PR into MLlib, and this would solve some of the aforementioned issues. I have put the code on GitHub in a separate repo for now, as I would like to get some sandboxed feedback. The re

Iterating over values by Key

2015-07-28 Thread gulyasm
I have K/V pairs where V is an Iterable (from a previous groupBy). I use the Java API. What I want is to iterate over the values by key and, on every element, set the previousElementId attribute, that is, the id of the previous element in the sorted list. I try to do this with mapValues. I create an arra
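A hedged Scala sketch of one way to do it (the OP is on the Java API; the id and ordering fields below are hypothetical): sort each group, then zip it against its own id sequence shifted by one.

    // grouped: RDD[(K, Iterable[V])] from the earlier groupBy
    val withPrev = grouped.mapValues { vs =>
      val sorted = vs.toSeq.sortBy(_.sortKey)            // hypothetical ordering field
      val prevIds = None +: sorted.map(v => Some(v.id))  // id of the preceding element
      sorted.zip(prevIds)                                // (element, previousElementId)
    }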

Messages are not stored for actorStream when using RoundRobinRouter

2015-07-28 Thread Juan Rodríguez Hortalá
Hi, I'm using a simple akka actor to create a actorStream. The actor just forwards the messages received to the stream by calling super[ActorHelper].store(msg). This works ok when I create the stream with ssc.actorStream[A](Props(new ProxyReceiverActor[A]), receiverActorName) but when I try to u

Re: NO Cygwin Support in bin/spark-class in Spark 1.4.0

2015-07-28 Thread Sean Owen
Does adding back the cygwin detection and this clause make it work? if $cygwin; then CLASSPATH="`cygpath -wp "$CLASSPATH"`" fi If so I imagine that's fine to bring back, if that's still needed. On Tue, Jul 28, 2015 at 9:49 AM, Proust GZ Feng wrote: > Thanks Owen, the problem under Cygwin is w

Spark SQL ArrayOutofBoundsException Question

2015-07-28 Thread tranan
Hello all, I am currently having an error with Spark SQL accessing Elasticsearch using the Elasticsearch Spark integration. Below is the series of commands I issued along with the stack trace. I am unclear on what the error could mean. I can print the schema correctly but get an error if I try to display a f

Re: Data from PostgreSQL to Spark

2015-07-28 Thread Jeetendra Gangele
I am trying to do that, but there will always be a data mismatch, since by the time Sqoop is fetching, the main database will have received many updates. There is something called incremental data fetch using Sqoop, but that hits the database rather than reading the WAL edits. On 28 July 2015 at 02:52, wrote: > Why can

Re: Data from PostgreSQL to Spark

2015-07-28 Thread Jeetendra Gangele
Hi Ayan, Thanks for the reply. It's around 5 GB across 10 tables... this data changes very frequently, with a few updates every minute, so it's difficult to keep this data in Spark. If any updates happen on the main tables, how can I refresh the Spark data? On 28 July 2015 at 02:11, ayan guha wrote: > You can call dB

RE: java.lang.ArrayIndexOutOfBoundsException: 0 on Yarn Client

2015-07-28 Thread Manohar Reddy
Yes, got it. Thanks Akhil. From: Akhil Das [mailto:ak...@sigmoidanalytics.com] Sent: Tuesday, July 28, 2015 2:47 PM To: Manohar Reddy Cc: user@spark.apache.org Subject: Re: java.lang.ArrayIndexOutOfBoundsException: 0 on Yarn Client That happens when your batch duration is less than your processing

Re: java.lang.ArrayIndexOutOfBoundsException: 0 on Yarn Client

2015-07-28 Thread Akhil Das
That happens when your batch duration is less than your processing time; you need to set the StorageLevel to MEMORY_AND_DISK. If you are using the latest version of Spark and you are just exploring things, then you can go with the Kafka consumers that come with Spark itself. You will not have this issu
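For the receiver-based consumer, that storage level is passed straight to createStream; a sketch with placeholder connection details:

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.kafka.KafkaUtils

    val stream = KafkaUtils.createStream(
      ssc, "zk-host:2181", "my-consumer-group",
      Map("mytopic" -> 1),               // topic -> number of receiver threads
      StorageLevel.MEMORY_AND_DISK_SER)  // spill to disk rather than drop blocks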

RE: java.lang.ArrayIndexOutOfBoundsException: 0 on Yarn Client

2015-07-28 Thread Manohar Reddy
Thanks Akhil, that solved it, but below is the new stack trace. I am looking into it, but if you have the answer at your fingertips, please share: 15/07/28 09:03:31 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 5.0 (TID 77, ip-10-252-7-70.us-west-2.compute.internal): java.lang.Exception: Could n
