Re: Problem Run Spark Example HBase Code Using Spark-Submit

2015-06-26 Thread Akhil Das
Try to add them in the SPARK_CLASSPATH in your conf/spark-env.sh file Thanks Best Regards On Thu, Jun 25, 2015 at 9:31 PM, Bin Wang wrote: > I am trying to run the Spark example code HBaseTest from command line > using spark-submit instead run-example, in that case, I can learn more how > to ru

Re: Performing sc.parallelize (..) in workers not in the driver program

2015-06-26 Thread Akhil Das
Why do you want to do that? Thanks Best Regards On Thu, Jun 25, 2015 at 10:16 PM, shahab wrote: > Hi, > > Apparently, the sc.parallelize (..) operation is performed in the driver > program, not in the workers! Is it possible to do this in a worker process > for the sake of scalability? > > best > /Sh

Re: Recent spark sc.textFile needs hadoop for folders?!?

2015-06-26 Thread Akhil Das
You just need to set your HADOOP_HOME, which appears to be null in the stack trace. If you don't have winutils.exe, you can download it and put it there. Thanks Best Regards On Thu, Jun 25, 2015 at 11:30 PM, Ashic M

Re: Recent spark sc.textFile needs hadoop for folders?!?

2015-06-26 Thread Sean Owen
Yes, Spark Core depends on Hadoop libs, and there is this unfortunate twist on Windows. You'll still need HADOOP_HOME set appropriately since Hadoop needs some special binaries to work on Windows. On Fri, Jun 26, 2015 at 11:06 AM, Akhil Das wrote: > You just need to set your HADOOP_HOME which app
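A minimal sketch of the usual Windows workaround, assuming winutils.exe has been dropped into C:\hadoop\bin (the path is an assumption, not something Spark mandates):

    // Point the Hadoop libs at a directory whose bin\ contains winutils.exe.
    // The paths below are illustrative only.
    System.setProperty("hadoop.home.dir", "C:\\hadoop")

    val conf = new org.apache.spark.SparkConf().setAppName("folder-read").setMaster("local[*]")
    val sc = new org.apache.spark.SparkContext(conf)
    println(sc.textFile("D:\\folder\\").count())   // folder reads now go through the Hadoop code path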

Re: Spark for distributed dbms cluster

2015-06-26 Thread Akhil Das
Which distributed database are you referring here? Spark can connect with almost all those databases out there (You just need to pass the Input/Output Format classes or there are a bunch of connectors also available). Thanks Best Regards On Fri, Jun 26, 2015 at 12:07 PM, louis.hust wrote: > Hi,

Re: Spark 1.4 RDD to DF fails with toDF()

2015-06-26 Thread Akhil Das
It's a Scala version conflict, can you paste your build.sbt file? Thanks Best Regards On Fri, Jun 26, 2015 at 7:05 AM, stati wrote: > Hello, > > When I run a spark job with spark-submit it fails with below exception for > code line >/*val webLogDF = webLogRec.toDF().select("ip", "date",

Re: Recent spark sc.textFile needs hadoop for folders?!?

2015-06-26 Thread ayan guha
It's a problem since 1.3 I think On 26 Jun 2015 04:00, "Ashic Mahtab" wrote: > Hello, > Just trying out spark 1.4 (we're using 1.1 at present). On Windows, I've > noticed the following: > > * On 1.4, sc.textFile("D:\\folder\\").collect() fails from both > spark-shell.cmd and when running a scala

RE: Recent spark sc.textFile needs hadoop for folders?!?

2015-06-26 Thread Ashic Mahtab
Thanks for the replies, guys. Is this a permanent change as of 1.3, or will it go away at some point? Also, does it require an entire Hadoop installation, or just WinUtils.exe? Thanks, Ashic. Date: Fri, 26 Jun 2015 18:22:03 +1000 Subject: Re: Recent spark sc.textFile needs hadoop for folders?!? Fr

Re: Problem Run Spark Example HBase Code Using Spark-Submit

2015-06-26 Thread ayan guha
Try removing the .jar part from the driver classpath On 26 Jun 2015 17:39, "Akhil Das" wrote: > Try to add them in the SPARK_CLASSPATH in your conf/spark-env.sh file > > Thanks > Best Regards > > On Thu, Jun 25, 2015 at 9:31 PM, Bin Wang wrote: > >> I am trying to run the Spark example code HBaseTest fr

Problem after enabling Hadoop native libraries

2015-06-26 Thread Arunabha Ghosh
Hi, I'm having trouble reading Bzip2 compressed sequence files after I enabled hadoop native libraries in spark. Running LD_LIBRARY_PATH=$HADOOP_HOME/lib/native/ $SPARK_HOME/bin/spark-submit --class gives the following error 5/06/26 00:48:02 INFO CodecPool: Got brand-new decompressor [.

Re: Spark GraphX memory requirements + java.lang.OutOfMemoryError: GC overhead limit exceeded

2015-06-26 Thread Roman Sokolov
Ok, but what does it mean? I did not change the core files of Spark, so is it a bug there? PS: on small datasets (<500 Mb) I have no problem. On 25.06.2015 at 18:02, "Ted Yu" wrote: > The assertion failure from TriangleCount.scala corresponds with the > following lines: > > g.outerJoinVertices

Re: Recent spark sc.textFile needs hadoop for folders?!?

2015-06-26 Thread Steve Loughran
On 26 Jun 2015, at 09:29, Ashic Mahtab mailto:as...@live.com>> wrote: Thanks for the replies, guys. Is this a permanent change as of 1.3, or will it go away at some point? Don't blame the spark team, complain to the hadoop team for being slow to embrace the java 1.7 APIs for low-level filesys

Re: Spark GraphX memory requirements + java.lang.OutOfMemoryError: GC overhead limit exceeded

2015-06-26 Thread Robin East
You’ll get this issue if you just take the first 2000 lines of that file. The problem is triangleCount() expects srcId < dstId, which is not the case in the file (e.g. vertex 28). You can get round this by calling graph.convertToCanonicalEdges(), which removes bi-directional edges and ensures sr

[Spark 1.3.1] Spark HiveQL -> CDH 5.3 Hive 0.13 UDF's

2015-06-26 Thread Mike Frampton
Hi I have a five node CDH 5.3 cluster running on CentOS 6.5, I also have a separate install of Spark 1.3.1. ( The CDH 5.3 install has Spark 1.2 but I wanted a newer version. ) I managed to write some Scala based code using a Hive Context to connect to Hive and create/populate tables etc. I

Re: Time is ugly in Spark Streaming....

2015-06-26 Thread Gerard Maas
Are you sharing the SimpleDateFormat instance? This looks a lot more like the non-thread-safe behaviour of SimpleDateFormat (that has claimed many unsuspecting victims over the years), than any 'ugly' Spark Streaming. Try writing the timestamps in millis to Kafka and compare. -kr, Gerard. On Fri,
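A sketch of the thread-safe pattern being suggested, assuming dataStream is the DStream from the original post: build the formatter inside the per-partition closure instead of sharing one instance.

    import java.text.SimpleDateFormat
    import java.util.Date

    dataStream.foreachRDD { (rdd, time) =>
      rdd.foreachPartition { records =>
        // One formatter per partition, so no SimpleDateFormat instance is shared across threads.
        val fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
        records.foreach { record =>
          val ts = fmt.format(new Date(time.milliseconds))
          // write (ts, record) to Kafka here
        }
      }
    }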

The usage of OpenBLAS

2015-06-26 Thread Tsai Li Ming
Hi, I found out that the instructions for OpenBLAS has been changed by the author of netlib-java in: https://github.com/apache/spark/pull/4448 since Spark 1.3.0 In that PR, I asked whether there’s still a need to compile OpenBLAS with USE_THREAD=0, and also about Intel MKL. Is it still applica

RE: Performing sc.parallelize (..) in workers not in the driver program

2015-06-26 Thread prajod.vettiyattil
This is how it works, I think: sc.parallelize(..) takes the variable inside the (..) and returns a “distributable equivalent” of that variable. That is, an RDD is returned. This RDD can be worked on by multiple workers threads in _parallel_. The parallelize(..) has to be done on the driver runn
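A small sketch of that flow, with an illustrative driver-side collection:

    // The Seq exists only in the driver; parallelize() turns it into a 4-partition RDD.
    val names = Seq("alice", "bob", "carol", "dave")
    val rdd = sc.parallelize(names, 4)

    // These transformations and the action run on the workers, one task per partition, in parallel.
    val totalChars = rdd.map(_.length).reduce(_ + _)
    println(totalChars)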

Re: Spark 1.4 RDD to DF fails with toDF()

2015-06-26 Thread Srikanth
Thanks Akhil for checking this out. Here is my build.sbt. name := "Weblog Analysis" version := "1.0" scalaVersion := "2.11.5" javacOptions ++= Seq("-source", "1.7", "-target", "1.7") libraryDependencies ++= Seq( "org.apache.spark" %% "spark-core" % "1.4.0" % "provided", "org.apache.spark"

spark streaming with kafka reset offset

2015-06-26 Thread Shushant Arora
I am using spark streaming 1.2. If the processing executors crash, will the receiver reset the offset back to the last processed offset? If the receiver itself crashed, is there a way to reset the offset without restarting the streaming application, other than to smallest or largest? Is spark streaming 1.3 whi

Time is ugly in Spark Streaming....

2015-06-26 Thread Sea
Hi, all I find a problem in spark streaming, when I use the time in function foreachRDD... I find the time is very interesting. val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet) dataStream.map(x => createGroup(x._2, dimensio

Re: Spark GraphX memory requirements + java.lang.OutOfMemoryError: GC overhead limit exceeded

2015-06-26 Thread Ted Yu
See SPARK-4917 which went into Spark 1.3.0 On Fri, Jun 26, 2015 at 2:27 AM, Robin East wrote: > You’ll get this issue if you just take the first 2000 lines of that file. > The problem is triangleCount() expects srcId < dstId which is not the case > in the file (e.g. vertex 28). You can get round

Re: [SparkScore]Performance portal for Apache Spark - WW26

2015-06-26 Thread Nan Zhu
Hi, Jie, Thank you very much for this work! Very helpful! I just would like to confirm that I understand the numbers correctly: if we take the running time of 1.2 release as 100s 9.1% - means the running time is 109.1 s? -4% - means it comes 96s? If that’s the true meaning of the numbers, w

Re: Spark 1.4 RDD to DF fails with toDF()

2015-06-26 Thread Akhil Das
Are those provided Spark libraries compatible with Scala 2.11? Thanks Best Regards On Fri, Jun 26, 2015 at 4:48 PM, Srikanth wrote: > Thanks Akhil for checking this out. Here is my build.sbt. > > name := "Weblog Analysis" > > version := "1.0" > > scalaVersion := "2.11.5" > > javacOptions ++= Se
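If the cluster binaries are the stock Spark 1.4 download (compiled against Scala 2.10), one way to rule out the mismatch is to build the job against 2.10 as well. A sketch of the relevant build.sbt lines; the versions are assumptions:

    scalaVersion := "2.10.4"   // match the Scala version the Spark binaries were built with

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "1.4.0" % "provided",   // %% now resolves to _2.10 artifacts
      "org.apache.spark" %% "spark-sql"  % "1.4.0" % "provided"
    )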

Re: Spark GraphX memory requirements + java.lang.OutOfMemoryError: GC overhead limit exceeded

2015-06-26 Thread Roman Sokolov
Yep, I already found it. So I added 1 line: val graph = GraphLoader.edgeListFile(sc, "", ...) val newgraph = graph.convertToCanonicalEdges() and could successfully count triangles on "newgraph". Next will test it on bigger (several Gb) networks. I am using Spark 1.3 and 1.4 but haven't seen

RE: [SparkScore]Performance portal for Apache Spark - WW26

2015-06-26 Thread Huang, Jie
Correct. Your calculation is right! We have been aware of that kmeans performance drop also. According to our observation, it is caused by some unbalanced execution among different tasks, even though we used the same test data between different versions (i.e., it is not caused by data skew). And the c

Re: [SparkScore]Performance portal for Apache Spark - WW26

2015-06-26 Thread Nan Zhu
Thank you, Jie! Very nice work! -- Nan Zhu http://codingcat.me On Friday, June 26, 2015 at 8:17 AM, Huang, Jie wrote: > Correct. Your calculation is right! > > We have been aware of that kmeans performance drop also. According to our > observation, it is caused by some unbalanced execut

Spark driver hangs on start of job

2015-06-26 Thread Sjoerd Mulder
Hi, I have a really annoying issue that I cannot replicate consistently; still, it happens every +- 100 submissions (it's a job that's running every 3 minutes). I already reported an issue for this: https://issues.apache.org/jira/browse/SPARK-8592 Here are the thread dumps of the Driver and the Exec

RE: [SparkScore]Performance portal for Apache Spark - WW26

2015-06-26 Thread Huang, Jie
Thanks. In general, we can see a stable trend in Spark master branch and latest release. And we are also considering to add more benchmarks/workloads into this automation perf tool. Any comment and feedback is warmly welcomed. Thank you && Best Regards, Grace (Huang Jie) From: Nan Zhu [mailto:

Kafka Direct Stream - Custom Serialization and Deserilization

2015-06-26 Thread Ashish Soni
Hi, if I have the below data format, how can I use the Kafka direct stream to deserialize it? I am not able to understand all the parameters I need to pass. Can someone explain what the arguments will be, as I am not clear about this: JavaPairInputDStream , V > org.apache.spark.streaming.kafk

Dependency Injection with Spark Java

2015-06-26 Thread Michal Čizmazia
How to use Dependency Injection with Spark Java? Please could you point me to any articles/frameworks? Thanks!

Re: Kafka Direct Stream - Custom Serialization and Deserilization

2015-06-26 Thread Akhil Das
​JavaPairInputDStream messages = KafkaUtils.createDirectStream( jssc, String.class, String.class, StringDecoder.class, StringDecoder.class, kafkaParams, topicsSet ); Here: jssc => JavaStreamingContext String.class => Key , Value classes

Re: Time is ugly in Spark Streaming....

2015-06-26 Thread Sea
Yes, I make it. ------ Original Message ------ From: "Gerard Maas"; Date: Fri, Jun 26, 2015, 5:40; To: "Sea"<261810...@qq.com>; Cc: "user"; "dev"; Subject: Re: Time is ugly in Spark Streaming Are you sharing the SimpleDateFormat instance? This looks a lo

spark streaming - checkpoint

2015-06-26 Thread ram kumar
Hi, - JavaStreamingContext ssc = new JavaStreamingContext(conf, new Duration(1)); ssc.checkpoint(checkPointDir); JavaStreamingContextFactory factory = new JavaStreamingContextFactory() { public JavaStreamingContext create() {
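For comparison, a minimal Scala sketch of the recovery pattern, where every DStream is defined inside the factory passed to getOrCreate (the checkpoint directory is illustrative):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val checkpointDir = "hdfs:///tmp/streaming-checkpoint"   // assumed location

    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("checkpointed-app")
      val ssc = new StreamingContext(conf, Seconds(10))
      ssc.checkpoint(checkpointDir)
      // define all input DStreams and transformations here, inside the factory
      ssc
    }

    // Clean start: createContext() runs. Restart after a failure: the context is rebuilt from the checkpoint.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()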

Time series data

2015-06-26 Thread Caio Cesar Trucolo
Hi everyone! I am working with multiple time series data and in summary I have to adjust each time series (like inserting average values in data gaps) and then training regression models with mllib for each time series. The adjustment step I did with the adjustment function being mapped for each

Re: Spark 1.4 RDD to DF fails with toDF()

2015-06-26 Thread Roberto Coluccio
I got a similar issue. Might your as well be related to this https://issues.apache.org/jira/browse/SPARK-8368 ? On Fri, Jun 26, 2015 at 2:00 PM, Akhil Das wrote: > Those provided spark libraries are compatible with scala 2.11? > > Thanks > Best Regards > > On Fri, Jun 26, 2015 at 4:48 PM, Srikan

Spark 1.4.0 - Using SparkR on EC2 Instance

2015-06-26 Thread RedOakMark
Good morning, I am having a bit of trouble finalizing the installation and usage of the newest Spark version 1.4.0, deploying to an Amazon EC2 instance and using RStudio to run on top of it. Using these instructions ( http://spark.apache.org/docs/latest/ec2-scripts.html

Re: sparkR could not find function "textFile"

2015-06-26 Thread Eskilson,Aleksander
Yeah, I ask because you might notice that by default the column types for CSV tables read in by read.df() are only strings (due to limitations in type inferencing in the DataBricks package). There was a separate discussion about schema inferencing, and Shivaram recently merged support for specif

How to recover in case user errors in streaming

2015-06-26 Thread Amit Assudani
Problem: how do we recover from user errors (connectivity issues / storage service down / etc.)? Environment: Spark streaming using Kafka Direct Streams Code Snippet: HashSet topicsSet = new HashSet(Arrays.asList("kafkaTopic1")); HashMap kafkaParams = new HashMap(); kafkaParams.put("metadata.brok

Re: Failed stages and dropped executors when running implicit matrix factorization/ALS

2015-06-26 Thread Ravi Mody
1. These are my settings: rank = 100 iterations = 12 users = ~20M items = ~2M training examples = ~500M-1B (I'm running into the issue even with 500M training examples) 2. The memory storage never seems to go too high. The user blocks may go up to ~10Gb, and each executor will have a few GB used o

Re:

2015-06-26 Thread ๏̯͡๏
All of these throw a compilation error at newAPIHadoopFile 1) val hadoopConfiguration = new Configuration() hadoopConfiguration.set("mapreduce.input.fileinputformat.split.maxsize", "67108864") sc.newAPIHadoopFile[AvroKey, NullWritable, AvroKeyInputFormat](path + "/*.avro", classOf[AvroKey],

Re: Failed stages and dropped executors when running implicit matrix factorization/ALS

2015-06-26 Thread Ravi Mody
Forgot to mention: rank of 100 usually works ok, 120 consistently cannot finish. On Fri, Jun 26, 2015 at 10:18 AM, Ravi Mody wrote: > 1. These are my settings: > rank = 100 > iterations = 12 > users = ~20M > items = ~2M > training examples = ~500M-1B (I'm running into the issue even with 500M >

Re: Kafka Direct Stream - Custom Serialization and Deserilization

2015-06-26 Thread Ashish Soni
My question is why there are two similar parameters, String.class and StringDecoder.class. What is the difference between them? Ashish On Fri, Jun 26, 2015 at 8:53 AM, Akhil Das wrote: > JavaPairInputDStream messages = > KafkaUtils.createDirectStream( > jssc, > String.class, >

Re: Master dies after program finishes normally

2015-06-26 Thread Yifan LI
Hi, I just encountered the same problem, when I run a PageRank program which has lots of stages (iterations)… The master was lost after my program was done. And the issue still remains even after I increased the driver memory. Any ideas? e.g. how to increase the master memory? Thanks. Best, Yifan LI

Re: Kafka Direct Stream - Custom Serialization and Deserilization

2015-06-26 Thread Benjamin Fradet
There is one for the key of your Kafka message and one for its value. On 26 Jun 2015 4:21 pm, "Ashish Soni" wrote: > my question is why there are similar two parameter String.Class and > StringDecoder.class what is the difference each of them ? > > Ashish > > On Fri, Jun 26, 2015 at 8:53 AM, Akhi
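In other words, the four classes pair up as (key type, key decoder) and (value type, value decoder). A Scala sketch, assuming ssc is your StreamingContext and with the broker and topic values made up:

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")   // assumed broker
    val topics = Set("kafkaTopic1")

    // Type parameters: [key type, value type, key decoder, value decoder]
    val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)
    // For a custom value format, replace the second String and StringDecoder with your own type and Decoder.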

Re: hadoop input/output format advanced control

2015-06-26 Thread ๏̯͡๏
I am trying the very same thing to configure min split size with Spark 1.3.1 and I get a compilation error. Code: val hadoopConfiguration = new Configuration(sc.hadoopConfiguration) hadoopConfiguration.set("mapreduce.input.fileinputformat.split.maxsize", "67108864") sc.newAPIHadoopFile

Re:

2015-06-26 Thread Silvio Fiorito
Make sure you’re importing the right namespace for Hadoop v2.0. This is what I tried: import org.apache.hadoop.io.{LongWritable, Text} import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, TextInputFormat} val hadoopConf = new org.apache.hadoop.conf.Configuration() hadoopConf.setLong(Fi

Re: Spark driver hangs on start of job

2015-06-26 Thread Richard Marscher
We've seen this issue as well in production. We also aren't sure what causes it, but have just recently shaded some of the Spark code in TaskSchedulerImpl that we use to effectively bubble up an exception from Spark instead of zombie in this situation. If you are interested I can go into more detai

Re:

2015-06-26 Thread ๏̯͡๏
My imports: import org.apache.avro.generic.GenericData import org.apache.avro.generic.GenericRecord import org.apache.avro.mapred.AvroKey import org.apache.avro.Schema import org.apache.hadoop.io.NullWritable import org.apache.avro.mapreduce.AvroKeyInputFormat import org.apache.hadoop.conf.C

Re:

2015-06-26 Thread ๏̯͡๏
Is it that it's not supported with Avro? Unlikely. On Fri, Jun 26, 2015 at 8:01 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote: > My imports: > > import org.apache.avro.generic.GenericData > > import org.apache.avro.generic.GenericRecord > > import org.apache.avro.mapred.AvroKey > > import org.apache.avro.Schema > > impor

Re:

2015-06-26 Thread ๏̯͡๏
Same code of yours works for me as well On Fri, Jun 26, 2015 at 8:02 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote: > Is that its not supported with Avro. Unlikely. > > On Fri, Jun 26, 2015 at 8:01 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) > wrote: > >> My imports: >> >> import org.apache.avro.generic.GenericData >> >> import org.apache.av

Re:

2015-06-26 Thread ๏̯͡๏
Dependencies: org.apache.avro:avro:1.7.7 (provided), com.databricks:spark-avro_2.10:1.0.0, org.apache.avro:avro-mapred:1.7.7 (classifier hadoop2, provided) On Fri, Jun 26, 2015 at 8:02 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote: > Same code of yours works for me as well > > On Fri, Jun 26, 2015 at 8:02 AM,

Re: Failed stages and dropped executors when running implicit matrix factorization/ALS

2015-06-26 Thread Ayman Farahat
I use the mllib not the ML. Does that make a difference ? Sent from my iPhone > On Jun 26, 2015, at 7:19 AM, Ravi Mody wrote: > > Forgot to mention: rank of 100 usually works ok, 120 consistently cannot > finish. > >> On Fri, Jun 26, 2015 at 10:18 AM, Ravi Mody wrote: >> 1. These are my set

Re: spark streaming with kafka reset offset

2015-06-26 Thread Cody Koeninger
The receiver-based kafka createStream in spark 1.2 uses zookeeper to store offsets. If you want finer-grained control over offsets, you can update the values in zookeeper yourself before starting the job. createDirectStream in spark 1.3 is still marked as experimental, and subject to change. Tha
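For reference, the 1.3 direct stream also has an overload that starts from explicit offsets, which is how finer-grained control usually looks. A sketch, assuming ssc is the StreamingContext and with the topic, partition and offset values made up (e.g. read back from ZooKeeper before starting the job):

    import kafka.common.TopicAndPartition
    import kafka.message.MessageAndMetadata
    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
    val fromOffsets = Map(TopicAndPartition("events", 0) -> 12345L)   // illustrative values

    // Type parameters: [key, value, key decoder, value decoder, record type produced by the handler]
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder,
        (String, String)](
      ssc, kafkaParams, fromOffsets,
      (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))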

Re: How to recover in case user errors in streaming

2015-06-26 Thread Cody Koeninger
If you're consistently throwing exceptions and thus failing tasks, once you reach max failures the whole stream will stop. It's up to you to either catch those exceptions, or restart your stream appropriately once it stops. Keep in mind that if you're relying on checkpoints, and fixing the error

Re: How to recover in case user errors in streaming

2015-06-26 Thread Amit Assudani
Thanks for the quick response. My question here is: how do I know that the max retries are done (because in my code I never know whether it is the failure of the first try or the last try) and I need to handle this message; is there any callback? Also, I know the limitation of checkpoint in upgrading the

Re: Failed stages and dropped executors when running implicit matrix factorization/ALS

2015-06-26 Thread Xiangrui Meng
Please see my comments inline. It would be helpful if you can attach the full stack trace. -Xiangrui On Fri, Jun 26, 2015 at 7:18 AM, Ravi Mody wrote: > 1. These are my settings: > rank = 100 > iterations = 12 > users = ~20M > items = ~2M > training examples = ~500M-1B (I'm running into the issue

Re: Failed stages and dropped executors when running implicit matrix factorization/ALS

2015-06-26 Thread Xiangrui Meng
No, they use the same implementation. On Fri, Jun 26, 2015 at 8:05 AM, Ayman Farahat wrote: > I use the mllib not the ML. Does that make a difference ? > > Sent from my iPhone > > On Jun 26, 2015, at 7:19 AM, Ravi Mody wrote: > > Forgot to mention: rank of 100 usually works ok, 120 consistently

Re: How to recover in case user errors in streaming

2015-06-26 Thread Cody Koeninger
TaskContext has an attemptNumber method on it. If you want to know which messages failed, you have access to the offsets, and can do whatever you need to with them. On Fri, Jun 26, 2015 at 10:21 AM, Amit Assudani wrote: > Thanks for quick response, > > My question here is how do I know that t
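A small sketch of reading the attempt number inside a task (Spark 1.3+); rdd stands for an RDD from the stream, and handleRecord is a placeholder for whatever per-record logic applies:

    import org.apache.spark.TaskContext

    rdd.mapPartitions { records =>
      val attempt = TaskContext.get().attemptNumber()   // 0 on the first try, increments on retries
      records.map { record =>
        // e.g. only divert to a dead-letter path once the last allowed attempt is reached
        handleRecord(record, attempt)
      }
    }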

Re: How to recover in case user errors in streaming

2015-06-26 Thread Amit Assudani
Also, what I understand is, max failures doesn’t stop the entire stream, it fails the job created for the specific batch, but the subsequent batches still proceed, isn’t it right ? And question still remains, how to keep track of those failed batches ? From: amit assudani mailto:aassud...@impet

Re: How to recover in case user errors in streaming

2015-06-26 Thread Cody Koeninger
No, if you have a bad message that you are continually throwing exceptions on, your stream will not progress to future batches. On Fri, Jun 26, 2015 at 10:28 AM, Amit Assudani wrote: > Also, what I understand is, max failures doesn’t stop the entire stream, > it fails the job created for the sp

Re: How to recover in case user errors in streaming

2015-06-26 Thread Amit Assudani
Hmm, not sure why, but when I run this code, it always keeps consuming from Kafka and proceeds, ignoring the previous failed batches. Also, now that I get the attempt number from TaskContext and I have the max-retries information, I am supposed to handle it in the try/catch block, but does it

Re: Kryo serialization of classes in additional jars

2015-06-26 Thread patcharee
Hi, I am having this problem on spark 1.4. Do you have any ideas how to solve it? I tried to use spark.executor.extraClassPath, but it did not help BR, Patcharee On 04. mai 2015 23:47, Imran Rashid wrote: Oh, this seems like a real pain. You should file a jira, I didn't see an open issue --
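One thing worth double-checking is how the classes are registered with Kryo up front; a sketch, where MyCustomClass is a placeholder for a class shipped in the extra jar:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("kryo-example")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // MyCustomClass stands in for the classes living in the additional jar
      .registerKryoClasses(Array(classOf[MyCustomClass]))

Shipping the jar with --jars (so it reaches both the driver and the executors) is usually safer than relying on spark.executor.extraClassPath alone.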

Accessing Kerberos Secured HDFS Resources from Spark on Mesos

2015-06-26 Thread Dave Ariens
I understand that Kerberos support for accessing Hadoop resources in Spark only works when running Spark on YARN. However, I'd really like to hack something together for Spark on Mesos running alongside a secured Hadoop cluster. My simplified application (gist: https://gist.github.com/ariens

Re: GraphX - ConnectedComponents (Pregel) - longer and longer interval between jobs

2015-06-26 Thread Thomas Gerber
Note that this problem is probably NOT caused directly by GraphX, but GraphX reveals it because as you go further down the iterations, you get further and further away of a shuffle you can rely on. On Thu, Jun 25, 2015 at 7:43 PM, Thomas Gerber wrote: > Hello, > > We run GraphX ConnectedComponen

Re: Problem after enabling Hadoop native libraries

2015-06-26 Thread Marcelo Vanzin
What master are you using? If this is not a "local" master, you'll need to set LD_LIBRARY_PATH on the executors also (using spark.executor.extraLibraryPath). If you are using local, then I don't know what's going on. On Fri, Jun 26, 2015 at 1:39 AM, Arunabha Ghosh wrote: > Hi, > I'm having

Multiple dir support : newApiHadoopFile

2015-06-26 Thread Bahubali Jain
Hi, How do we read files from multiple directories using newApiHadoopFile () ? Thanks, Baahu -- Twitter:http://twitter.com/Baahu

Re: Spark 1.4.0 - Using SparkR on EC2 Instance

2015-06-26 Thread Shivaram Venkataraman
We don't have a documented way to use RStudio on EC2 right now. We have a ticket open at https://issues.apache.org/jira/browse/SPARK-8596 to discuss work-arounds and potential solutions for this. Thanks Shivaram On Fri, Jun 26, 2015 at 6:27 AM, RedOakMark wrote: > Good morning, > > I am having

Re: Accessing Kerberos Secured HDFS Resources from Spark on Mesos

2015-06-26 Thread Timothy Chen
Hi Dave, I don't understand Kerberos much, but if you know the exact steps that need to happen I can see how we can make that happen with the Spark framework. Tim > On Jun 26, 2015, at 8:49 AM, Dave Ariens wrote: > > I understand that Kerberos support for accessing Hadoop resources in Spark

Re: Multiple dir support : newApiHadoopFile

2015-06-26 Thread Ted Yu
See this related thread: http://search-hadoop.com/m/q3RTtiYm8wgHego1 On Fri, Jun 26, 2015 at 9:43 AM, Bahubali Jain wrote: > > Hi, > How do we read files from multiple directories using newApiHadoopFile () ? > > Thanks, > Baahu > -- > Twitter:http://twitter.com/Baahu > >

Re:

2015-06-26 Thread Silvio Fiorito
OK, here’s how I did it, using just the built-in Avro libraries with Spark 1.3: import org.apache.avro.generic.{GenericData, GenericRecord} import org.apache.avro.mapred.AvroKey import org.apache.avro.mapreduce.AvroKeyInputFormat import org.apache.hadoop.io.NullWritable import org.apache.hadoop.ma
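Filling in the call itself, a sketch along those lines (the path and split size are illustrative):

    import org.apache.avro.generic.GenericRecord
    import org.apache.avro.mapred.AvroKey
    import org.apache.avro.mapreduce.AvroKeyInputFormat
    import org.apache.hadoop.io.NullWritable
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat

    val hadoopConf = new org.apache.hadoop.conf.Configuration()
    hadoopConf.setLong(FileInputFormat.SPLIT_MAXSIZE, 67108864L)   // 64 MB max split

    // Argument order: path, input format class, key class, value class, configuration.
    val records = sc.newAPIHadoopFile(
      "/data/input/*.avro",
      classOf[AvroKeyInputFormat[GenericRecord]],
      classOf[AvroKey[GenericRecord]],
      classOf[NullWritable],
      hadoopConf)

    println(records.map(_._1.datum()).count())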

Re: Multiple dir support : newApiHadoopFile

2015-06-26 Thread Eugen Cepoi
Comma separated paths works only with spark 1.4 and up 2015-06-26 18:56 GMT+02:00 Eugen Cepoi : > You can comma separate them or use globbing patterns > > 2015-06-26 18:54 GMT+02:00 Ted Yu : > >> See this related thread: >> http://search-hadoop.com/m/q3RTtiYm8wgHego1 >> >> On Fri, Jun 26, 2015 at
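A short sketch of both options (the paths are assumptions):

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    // Spark 1.4+: comma-separated directories in a single call
    val multi = sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat]("/data/dir1,/data/dir2")

    // Globbing, which matches several directories with one pattern
    val globbed = sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat]("/data/2015-06-*")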

Re: Multiple dir support : newApiHadoopFile

2015-06-26 Thread Eugen Cepoi
You can comma separate them or use globbing patterns 2015-06-26 18:54 GMT+02:00 Ted Yu : > See this related thread: > http://search-hadoop.com/m/q3RTtiYm8wgHego1 > > On Fri, Jun 26, 2015 at 9:43 AM, Bahubali Jain wrote: > >> >> Hi, >> How do we read files from multiple directories using newApiHa

Re: How to recover in case user errors in streaming

2015-06-26 Thread Amit Assudani
Also, I get TaskContext.get() null when used in foreach function below ( I get it when I use it in map, but the whole point here is to handle something that is breaking in action ). Please help. :( From: amit assudani mailto:aassud...@impetus.com>> Date: Friday, June 26, 2015 at 11:41 AM To: Cod

Re: Spark 1.4.0 - Using SparkR on EC2 Instance

2015-06-26 Thread Mark Stephenson
Thanks! In your demo video, were you using RStudio to hit a separate EC2 Spark cluster? It looked from your browser as if you were on EC2 at the time, so I was just curious. It appears that might be one of the possible workarounds: fire up a separate EC2 instance with RStudio

spilling in-memory map of 5.1 MB to disk (272 times so far)

2015-06-26 Thread igor.berman
Hi, I wanted to get some advice regarding tuning a Spark application. For some of the tasks I see many log entries like this: Executor task launch worker-38 ExternalAppendOnlyMap: Thread 239 spilling in-memory map of 5.1 MB to disk (272 times so far) (especially when inputs are considerable) I understan
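If the spills turn out to be a plain memory-budget issue, the usual 1.x knobs are the shuffle memory fraction and the number of partitions; a sketch with assumed values:

    val conf = new org.apache.spark.SparkConf()
      .set("spark.shuffle.memoryFraction", "0.4")   // default is 0.2 in Spark 1.x

    // More partitions also means a smaller in-memory map per task:
    val reduced = pairs.reduceByKey(_ + _, 400)     // "pairs" and 400 partitions are illustrative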

Re: Spark 1.4.0 - Using SparkR on EC2 Instance

2015-06-26 Thread Shivaram Venkataraman
I was using RStudio on the master node of the same cluster in the demo. However I had installed Spark under the user `rstudio` (i.e. /home/rstudio) and that will make the permissions work correctly. You will need to copy the config files from /root/spark/conf after installing Spark though and it mi

Re: Dependency Injection with Spark Java

2015-06-26 Thread Igor Berman
I asked myself the same question today... it actually depends on what you are trying to do. If you want injection into worker code, I think it will be a bit hard... If it's only in code that the driver executes, i.e. in main, it's straightforward IMHO: just create your classes from the injector (e.g. Spring's application c

Re: Unable to specify multiple directories as input

2015-06-26 Thread ๏̯͡๏
So for each directory you create one RDD and then union them all. On Fri, Jun 26, 2015 at 10:05 AM, Bahubali Jain wrote: > oh..my use case is not very straight forward. > The input can have multiple directories... > > On Fri, Jun 26, 2015 at 9:30 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) > wrote: > >> Yes, only workin
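A quick sketch of that approach, with an assumed list of directories:

    val dirs = Seq("/data/dirA", "/data/dirB", "/data/dirC")   // assumed input directories
    val perDir = dirs.map(dir => sc.textFile(dir))
    val all = sc.union(perDir)                                 // one RDD spanning every directory
    println(all.count())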

Re: Spark 1.4.0 - Using SparkR on EC2 Instance

2015-06-26 Thread mark
So you created an EC2 instance with RStudio installed first, then installed Spark under that same username?  That makes sense, I just want to verify your work flow. Thank you again for your willingness to help! On Fri, Jun 26, 2015 at 10:13 AM -0700, "Shivaram Venkataraman" wrote:

Spark SQL - Setting YARN Classpath for primordial class loader

2015-06-26 Thread Kumaran Mani
Hi, The response to the below thread for making yarn-client mode work by adding the JDBC driver JAR to spark.{driver,executor}.extraClassPath works fine. http://mail-archives.us.apache.org/mod_mbox/spark-user/201504.mbox/%3CCAAOnQ7vHeBwDU2_EYeMuQLyVZ77+N_jDGuinxOB=sff2lkc...@mail.gmail.com%3E Bu

Re:

2015-06-26 Thread ๏̯͡๏
Silvio, Thanks for your responses and patience. It worked after I reshuffled the arguments and removed the Avro dependencies. On Fri, Jun 26, 2015 at 9:55 AM, Silvio Fiorito < silvio.fior...@granturing.com> wrote: > OK, here’s how I did it, using just the built-in Avro libraries with > Spark 1.3: >

Re: Executors requested are way less than what i actually got

2015-06-26 Thread ๏̯͡๏
These are my YARN queue configurations: Queue State: RUNNING, Used Capacity: 206.7%, Absolute Used Capacity: 3.1%, Absolute Capacity: 1.5%, Absolute Max Capacity: 10.0%, Used Resources:, Num Schedulable Applications: 7, Num Non-Schedulable Applications: 0, Num Containers: 390, Max Applications: 45, Max Applications Per User: 27

Re:

2015-06-26 Thread Silvio Fiorito
No worries, glad to help! It also helped me as I had not worked directly with the Hadoop APIs for controlling splits. From: "ÐΞ€ρ@Ҝ (๏̯͡๏)" Date: Friday, June 26, 2015 at 1:31 PM To: Silvio Fiorito Cc: user Subject: Re: Silvio, Thanks for your responses and patience. It worked after i reshuffled

YARN worker out of disk memory

2015-06-26 Thread Tarun Garg
Hi, I am running a spark job over YARN; after 2-3 hr of execution, workers start dying, and I found a lot of files at /tmp/hadoop-root/nm-local-dir/usercache/root/appcache/application_1435184713615_0008/blockmgr-333f0ade-2474-43a6-9960-f08a15bcc7b7/3f named temp_shuffle. My job is kafkastream.map

Re: Failed stages and dropped executors when running implicit matrix factorization/ALS

2015-06-26 Thread Ayman Farahat
Hello; I checked my partitions/storage and here is what I have: 80 executors, 5 G per executor. Do I need to set additional params, say cores? spark.serializer org.apache.spark.serializer.KryoSerializer # spark.driver.memory 5g # spark.executor.extraJavaOp

How to concatenate two csv files into one RDD?

2015-06-26 Thread Rex X
With Python Pandas, it is easy to do concatenation of dataframes by combining pandas.concat and pandas.read_csv pd.concat([pd.read_csv(os.path.join(Path_to_csv_files, f)) for f in csvfiles]) where "csvfiles" is the list o

Re: Spark driver hangs on start of job

2015-06-26 Thread Richard Marscher
Hi, we are on 1.3.1 right now so in case there are differences in the Spark files I'll walk through the logic of what we did and post a couple gists at the end. We haven't committed to forking Spark for our own deployments yet, so right now we shadow some Spark classes in our application code with

RE: Recent spark sc.textFile needs hadoop for folders?!?

2015-06-26 Thread Ashic Mahtab
Thanks for the awesome response, Steve. As you say, it's not ideal, but the clarification greatly helps. Cheers, everyone :) -Ashic. Subject: Re: Recent spark sc.textFile needs hadoop for folders?!? From: ste...@hortonworks.com To: as...@live.com CC: guha.a...@gmail.com; user@spark.apache.org Date

Re: What are the likely causes of org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle?

2015-06-26 Thread XianXing Zhang
Do we have any update on this thread? Has anyone met and solved similar problems before? Any pointers will be greatly appreciated! Best, XianXing On Mon, Jun 15, 2015 at 11:48 PM, Jia Yu wrote: > Hi Peng, > > I got exactly same error! My shuffle data is also very large. Have you > figured out

Cannot iterate items in rdd.mapPartition()

2015-06-26 Thread Wang, Ningjun (LNG-NPV)
In rdd.mapPartitions(...), if I try to iterate through the items in the partition, everything screws up. For example val rdd = sc.parallelize(1 to 1000, 3) val count = rdd.mapPartitions(iter => { println(iter.length) iter }).count() The count is 0. This is incorrect. The count should be 1000. If

Re: Cannot iterate items in rdd.mapPartition()

2015-06-26 Thread Mark Hamstra
Do you want to transform the RDD, or just produce some side effect with its contents? If the latter, you want foreachPartition, not mapPartitions. On Fri, Jun 26, 2015 at 11:52 AM, Wang, Ningjun (LNG-NPV) < ningjun.w...@lexisnexis.com> wrote: > In rdd.mapPartition(…) if I try to iterate through
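For reference, a sketch of both options; the partition iterator is single-pass, so calling length on it inside mapPartitions consumes it before count() ever sees the elements:

    val rdd = sc.parallelize(1 to 1000, 3)

    // Transformation: buffer the partition if you need both its size and its elements.
    val count = rdd.mapPartitions { iter =>
      val items = iter.toArray        // fine for small partitions; avoids consuming the iterator twice
      println(items.length)
      items.iterator
    }.count()                         // 1000

    // Side effect only: foreachPartition is the better fit.
    rdd.foreachPartition(iter => println(iter.length))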

RE: Accessing Kerberos Secured HDFS Resources from Spark on Mesos

2015-06-26 Thread Dave Ariens
Hi Timothy, Because I'm running Spark on Mesos alongside a secured Hadoop cluster, I need to ensure that my tasks running on the slaves perform a Kerberos login before accessing any HDFS resources. To login, they just need the name of the principal (username) and a keytab file. Then they just
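The login itself is only a couple of lines against the Hadoop security API; a sketch, where the principal and keytab path are assumptions and the keytab must already be present on the slave:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.security.UserGroupInformation

    val hadoopConf = new Configuration()
    hadoopConf.set("hadoop.security.authentication", "kerberos")
    UserGroupInformation.setConfiguration(hadoopConf)

    // Illustrative principal and keytab location.
    UserGroupInformation.loginUserFromKeytab("someuser@EXAMPLE.COM", "/path/to/someuser.keytab")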

Re: Accessing Kerberos Secured HDFS Resources from Spark on Mesos

2015-06-26 Thread Olivier Girardot
I would pretty much need exactly this kind of feature too On Fri, Jun 26, 2015 at 21:17, Dave Ariens wrote: > Hi Timothy, > > > > Because I'm running Spark on Mesos alongside a secured Hadoop cluster, I > need to ensure that my tasks running on the slaves perform a Kerberos login > before acc

Re: Failed stages and dropped executors when running implicit matrix factorization/ALS

2015-06-26 Thread Xiangrui Meng
So you have 100 partitions (blocks). This might be too many for your dataset. Try setting a smaller number of blocks, e.g., 32 or 64. When ALS starts iterations, you can see the shuffle read/write size from the "stages" tab of the Spark WebUI. Vary the number of blocks and check the numbers there. Kryo ser

Re: Failed stages and dropped executors when running implicit matrix factorization/ALS

2015-06-26 Thread Ravi Mody
I set the number of partitions on the input dataset at 50. The number of CPU cores I'm using is 84 (7 executors, 12 cores). I'll look into getting a full stack trace. Any idea what my errors mean, and why increasing memory causes them to go away? Thanks. On Fri, Jun 26, 2015 at 11:26 AM, Xiangrui

Re: Accessing Kerberos Secured HDFS Resources from Spark on Mesos

2015-06-26 Thread Tim Chen
So correct me if I'm wrong, sounds like all you need is a principal user name and also a keytab file downloaded right? I'm adding support from spark framework to download additional files along side your executor and driver, and one workaround is to specify a user principal and keytab file that ca

Re: Failed stages and dropped executors when running implicit matrix factorization/ALS

2015-06-26 Thread Ayman Farahat
How do I set these partitions? Is this the call to ALS: model = ALS.trainImplicit(ratings, rank, numIterations)? On Jun 26, 2015, at 12:33 PM, Xiangrui Meng wrote: > So you have 100 partitions (blocks). This might be too many for your dataset. > Try setting a smaller number of blocks, e.g.,
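In MLlib's ALS the block count is an extra argument to trainImplicit; a Scala sketch with illustrative values, where ratings is the input RDD[Rating] from the existing code:

    import org.apache.spark.mllib.recommendation.ALS

    // rank, iterations, lambda, blocks, alpha; 32 blocks is the kind of smaller value being suggested
    val model = ALS.trainImplicit(ratings, rank = 100, iterations = 12,
      lambda = 0.01, blocks = 32, alpha = 0.01)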

Re: Accessing Kerberos Secured HDFS Resources from Spark on Mesos

2015-06-26 Thread Marcelo Vanzin
On Fri, Jun 26, 2015 at 1:13 PM, Tim Chen wrote: > So correct me if I'm wrong, sounds like all you need is a principal user > name and also a keytab file downloaded right? > I'm not familiar with Mesos so don't know what kinds of features it has, but at the very least it would need to start cont

Re: What are the likely causes of org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle?

2015-06-26 Thread Eugen Cepoi
Are you using YARN? If yes, increase the YARN memory overhead option. YARN is probably killing your executors. On Jun 26, 2015 at 20:43, "XianXing Zhang" wrote: > Do we have any update on this thread? Has anyone met and solved similar > problems before? > > Any pointers will be greatly appreciated
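The setting in question is spark.yarn.executor.memoryOverhead (in MB); a sketch, value assumed:

    val conf = new org.apache.spark.SparkConf()
      .set("spark.yarn.executor.memoryOverhead", "1024")   // MB of off-heap headroom per executor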

spark streaming job fails to restart after checkpointing due to DStream initialization errors

2015-06-26 Thread Ashish Nigam
I bring up spark streaming job that uses Kafka as input source. No data to process and then shut it down. And bring it back again. This time job does not start because it complains that DStream is not initialized. 15/06/26 01:10:44 ERROR yarn.ApplicationMaster: User class threw exception: org.apac
