Spark Language / Data Base Question

2015-06-25 Thread Sinha, Ujjawal (SFO-MAP)
Hi guys, I am very new to Spark and I have 2 questions. 1) Which language is best to use/learn Spark? a) Scala b) Java or c) Python 2) Database - which database is the best fit for Spark? Once processing is done I need to store the data: a) Cassandra b) Redshift c) HDFS/Hive d) HBase or

Re:

2015-06-25 Thread Akhil Das
Look in the tuning section; also you need to figure out what's taking time and where your bottleneck is, etc. If everything is tuned properly, then you will need to throw more cores at it :) Thanks Best Regards On Thu, Jun 25, 2015 at 12:19 AM, ÐΞ€ρ@Ҝ (๏̯

Re: Can Spark1.4 work with CDH4.6

2015-06-25 Thread Akhil Das
You can look into the spark.driver.userClassPathFirst flag. spark.driver.userClassPathFirst (default: false, Experimental) - Whether to give user-added jars precedence over Spark's own jars when loading classes in the driver. This feature can be used to mitigate conflicts between Spark's dependencies and user d
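
A minimal sketch (app name and values are assumptions, not from the thread) of turning this flag on through a SparkConf in Scala; it can equally be passed as --conf spark.driver.userClassPathFirst=true on spark-submit:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("user-classpath-first-example")        // hypothetical app name
      .set("spark.driver.userClassPathFirst", "true")    // prefer user jars when the driver loads classes
      .set("spark.executor.userClassPathFirst", "true")  // companion executor-side flag, also experimental
    val sc = new SparkContext(conf)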

Re: Spark Language / Data Base Question

2015-06-25 Thread pandees waran
There's no single best answer to these questions. The question could be refined with a specific use case, and then we could say which data store is the best fit for it. > On Jun 25, 2015, at 12:02 AM, Sinha, Ujjawal (SFO-MAP) > wrote: > > Hi Guys > > > I am very new for spark , I have 2 question > > > 1) which lan

spark1.4 sparkR usage

2015-06-25 Thread 1106944...@qq.com
Hi all I have installed Spark 1.4 and want to use SparkR. Assume the Spark master IP = node1; how do I start SparkR and submit a job to the Spark cluster? Can anyone help me, or point me to a blog/doc? Thank you very much 1106944...@qq.com

JDBCRDD sync with mssql

2015-06-25 Thread Manohar753
Hi Team, in my use case I need to sync the data with MSSQL for any operation in MSSQL. But as per my Spark knowledge we have JdbcRDD; it will read data from RDBMS tables with upper and lower limits. Someone please help: is there any API to sync data automatically from a single RDBMS table for any DML h
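
A minimal sketch (connection string, table and column names are assumptions) of the bounded JdbcRDD read the question refers to; note there is no built-in API that keeps an RDD in sync with later DML on the source table - the RDD is a snapshot taken when the query runs:

    import java.sql.{DriverManager, ResultSet}
    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.JdbcRDD

    def readOrders(sc: SparkContext): JdbcRDD[(Int, String)] =
      new JdbcRDD(
        sc,
        () => DriverManager.getConnection(
          "jdbc:sqlserver://dbhost:1433;databaseName=shop", "user", "pass"),  // hypothetical MSSQL connection
        "SELECT id, status FROM orders WHERE id >= ? AND id <= ?",            // the two '?' bound placeholders are required
        lowerBound = 1L,
        upperBound = 1000000L,
        numPartitions = 10,
        mapRow = (rs: ResultSet) => (rs.getInt("id"), rs.getString("status"))
      )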

Re: bugs in Spark PageRank implementation

2015-06-25 Thread Sean Owen
#2 is not a bug. Have a search through JIRA. It is merely unformalized. I think that is how (one of?) the original PageRank papers does it. On Thu, Jun 25, 2015, 7:39 AM Kelly, Terence P (HP Labs Researcher) < terence.p.ke...@hp.com> wrote: > Hi, > > Colleagues and I have found that the PageRank

Re: spark1.4 sparkR usage

2015-06-25 Thread Akhil Das
Here you go https://amplab-extras.github.io/SparkR-pkg/ Thanks Best Regards On Thu, Jun 25, 2015 at 12:39 PM, 1106944...@qq.com <1106944...@qq.com> wrote: > Hi all >I have installed spark1.4, then want to use sparkR . assueme spark > master ip= node1, how to start sparkR ? and summit job t

Re: Akka failures: Driver Disassociated

2015-06-25 Thread Akhil Das
Can you look in the worker logs and see what's going on? It may happen that you ran out of disk space etc. Thanks Best Regards On Thu, Jun 25, 2015 at 12:08 PM, barmaley wrote: > I'm running Spark 1.3.1 on AWS... Having long-running application (spark > context) which accepts and completes jobs

Re: Killing Long running tasks (stragglers)

2015-06-25 Thread Akhil Das
That totally depends on the way you extract the data. It will be helpful if you can paste your code so that we will understand it better. Thanks Best Regards On Wed, Jun 24, 2015 at 2:32 PM, William Ferrell wrote: > Hello - > > I am using Apache Spark 1.2.1 via pyspark. Thanks to any developers

Re: spark1.4 sparkR usage

2015-06-25 Thread Jean-Charles RISCH
Hello, Is this the official R package? It is written: "*NOTE: The API from the upcoming Spark release (1.4) will not have the same API as described here. *" Thanks, JC 2015-06-25 10:55 GMT+02:00 Akhil Das : > Here you go https://amplab-extras.github.io/SparkR-pkg/ > > Thanks > Best Regards

Re: spark1.4 sparkR usage

2015-06-25 Thread Akhil Das
It won't change too much, it will get you started. Further details you can read from the official website itself https://spark.apache.org/docs/latest/sparkr.html Thanks Best Regards On Thu, Jun 25, 2015 at 2:38 PM, Jean-Charles RISCH < risch.jeanchar...@gmail.com> wrote: > Hello, > > Is this the

map vs mapPartitions

2015-06-25 Thread Shushant Arora
Does mapPartitions keep complete partitions in memory of the executor as an iterable? JavaRDD<String> rdd = jsc.textFile("path"); JavaRDD<Integer> output = rdd.mapPartitions(new FlatMapFunction<Iterator<String>, Integer>() { public Iterable<Integer> call(Iterator<String> input) throws Exception { List<Integer> output = new ArrayList<Integer>(); while(input.hasNext()){ o

Re: spark1.4 sparkR usage

2015-06-25 Thread Jean-Charles RISCH
Thank you. But it's a bit scary because when I compare the official API ( https://spark.apache.org/docs/1.4.0/api/R/index.html) and the AMPLab API ( https://amplab-extras.github.io/SparkR-pkg/rdocs/1.2/index.html), they look very different. JC 2015-06-25 11:10 GMT+02:00 Akhil Das : > It won't change

Re: Problem with version compatibility

2015-06-25 Thread Sean Owen
-dev +user That all sounds fine except are you packaging Spark classes with your app? that's the bit I'm wondering about. You would mark it as a 'provided' dependency in Maven. On Thu, Jun 25, 2015 at 5:12 AM, jimfcarroll wrote: > Hi Sean, > > I'm running a Mesos cluster. My driver app is built
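
Marking Spark as 'provided' means its classes are available at compile time but are not bundled into the application jar, so the cluster's own Spark is used at runtime. A sketch of the same idea in a build.sbt (the sbt equivalent of the Maven advice above; project name and versions are assumptions):

    name := "my-spark-app"       // hypothetical project name
    scalaVersion := "2.10.4"

    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.4.0" % "provided"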

Re: Parsing a tsv file with key value pairs

2015-06-25 Thread anshu shukla
Can you be more specific, or can you provide a sample file? On Thu, Jun 25, 2015 at 11:00 AM, Ravikant Dindokar wrote: > Hi Spark user, > > I am new to spark so forgive me for asking a basic question. I'm trying to > import my tsv file into spark. This file has key and value separated by a > \t pe

Re: Parsing a tsv file with key value pairs

2015-06-25 Thread Ravikant Dindokar
So I have a file where each line represents an edge in the graph & has two values separated by a tab. Both values are vertex ids (source and sink). I want to parse this file as a dictionary in a Spark RDD. So my question is: how do I get these values in the form of a dictionary in an RDD? sample file (tab-separated pairs): 1 2, 1 5

Re: map vs mapPartitions

2015-06-25 Thread Sean Owen
No, or at least, it depends on how the source of the partitions was implemented. On Thu, Jun 25, 2015 at 12:16 PM, Shushant Arora wrote: > Does mapPartitions keep complete partitions in memory of executor as > iterable. > > JavaRDD rdd = jsc.textFile("path"); > JavaRDD output = rdd.mapPartitions(

Re: Parquet problems

2015-06-25 Thread Anders Arpteg
Yes, both the driver and the executors. Works a little bit better with more space, but there is still a leak that will cause failure after a number of reads. There are about 700 different data sources that need to be loaded, lots of data... tor 25 jun 2015 08:02 Sabarish Sasidharan skrev: > Did you try

Re: Exception in thread "main" java.lang.NoSuchMethodError: com.google.common.base.Stopwatch.elapsedMillis()J

2015-06-25 Thread Steve Loughran
You are using a Guava version on the classpath which your version of Hadoop can't handle. Try a version < 15 or build Spark against Hadoop 2.7.0 > On 24 Jun 2015, at 19:03, maxdml wrote: > >Exception in thread "main" java.lang.NoSuchMethodError: > com.google.common.base.Stopwatch.elapsedMil

Re: Loss of data due to congestion

2015-06-25 Thread ayan guha
Then you should see checkpointing ( https://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing ) On Thu, Jun 25, 2015 at 3:33 PM, anshu shukla wrote: > Thaks, > I am talking about streaming. > On 25 Jun 2015 5:37 am, "ayan guha" wrote: > >> Can you elaborate little mor
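
A minimal sketch (paths and batch interval are assumptions) of the checkpointing the linked guide describes, so streaming state and metadata survive a driver restart:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val checkpointDir = "hdfs:///tmp/streaming-checkpoint"   // hypothetical checkpoint location

    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("checkpoint-example")
      val ssc = new StreamingContext(conf, Seconds(10))
      ssc.checkpoint(checkpointDir)                          // enable metadata/state checkpointing
      // ... define DStreams and transformations here ...
      ssc
    }

    // On restart, rebuild the context from the checkpoint if one exists
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)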

How to create correct data frame for classification in Spark ML?

2015-06-25 Thread dusan
Hi, I am trying to run random forest classification using the Spark ML API but I am having issues with creating the right data frame input for the pipeline. Here is sample data: age,hours_per_week,education,sex,salaryRange 38,40,"hs-grad","male","A" 28,40,"bachelors","female","A" 52,45,"hs-grad","
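
A minimal sketch (column handling and names are assumptions, not the poster's code) of one way to build such a DataFrame and pipeline in Scala: Spark ML expects a numeric label column and a Vector features column, so the categorical strings are indexed first:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.RandomForestClassifier
    import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

    case class Person(age: Double, hours_per_week: Double,
                      education: String, sex: String, salaryRange: String)

    val df = sqlContext.createDataFrame(Seq(
      Person(38, 40, "hs-grad", "male", "A"),
      Person(28, 40, "bachelors", "female", "A"),
      Person(52, 45, "hs-grad", "male", "B")))

    val eduIdx   = new StringIndexer().setInputCol("education").setOutputCol("eduIdx")
    val sexIdx   = new StringIndexer().setInputCol("sex").setOutputCol("sexIdx")
    val labelIdx = new StringIndexer().setInputCol("salaryRange").setOutputCol("label")
    val features = new VectorAssembler()
      .setInputCols(Array("age", "hours_per_week", "eduIdx", "sexIdx"))
      .setOutputCol("features")
    val rf = new RandomForestClassifier().setLabelCol("label").setFeaturesCol("features")

    val model = new Pipeline()
      .setStages(Array(eduIdx, sexIdx, labelIdx, features, rf))
      .fit(df)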

Re: map vs mapPartitions

2015-06-25 Thread Shushant Arora
Say the source is HDFS and the file is divided into 10 partitions, so what will the input contain? public Iterable call(Iterator input) Say I have 10 executors in the job, each having a single partition: will it have some part of the partition or the complete partition? And if only a part, when I call input.next() - will it fetch the rest o

Re: Compiling Spark 1.4 (and/or Spark 1.4.1-rc1) with CDH 5.4.1/2

2015-06-25 Thread Aaron
Sorry about not supplying the error..that would make things helpful you'd think :) [INFO] [INFO] Building Spark Project SQL 1.4.1 [INFO] [INFO] [INFO

Re: Compiling Spark 1.4 (and/or Spark 1.4.1-rc1) with CDH 5.4.1/2

2015-06-25 Thread Sean Owen
Hm that looks like a Parquet version mismatch then. I think Spark 1.4 uses 1.6? You might well get away with 1.6 here anyway. On Thu, Jun 25, 2015 at 3:13 PM, Aaron wrote: > Sorry about not suppling the error..that would make things helpful you'd > think :) > > [INFO] > --

Re: Compiling Spark 1.4 (and/or Spark 1.4.1-rc1) with CDH 5.4.1/2

2015-06-25 Thread Aaron
Yep! That was it. Using the 1.6.0rc3 that comes with spark, rather than using the 1.5.0-cdh5.4.2 version. Thanks for the help! Cheers, Aaron On Thu, Jun 25, 2015 at 8:24 AM, Sean Owen wrote: > Hm that looks like a Parquet version mismatch then. I think Spark 1.4 > uses 1.6? You might w

Re: SparkR parallelize not found with 1.4.1?

2015-06-25 Thread Eskilson,Aleksander
Hi there, Parallelize is part of the RDD API which was made private for Spark v. 1.4.0. Some functions in the RDD API were considered too low-level to expose, so only most of the DataFrame API is currently public. The original rationale for this decision can be found on the issue's JIRA [1]. The d

Re: java.lang.OutOfMemoryError: PermGen space

2015-06-25 Thread Roberto Coluccio
Glad it worked! Actually I got similar issues even with Spark Streaming v1.2.x based drivers. Think also that the default config in Spark on EMR is 512m ! Roberto On Thu, Jun 25, 2015 at 1:20 AM, Srikanth wrote: > That worked. Thanks! > > I wonder what changed in 1.4 to cause this. It wouldn
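
A sketch (the exact sizes are assumptions) of raising the PermGen limit the thread is about. Driver JVM options generally have to be supplied at launch time (spark-submit --conf or spark-defaults.conf) rather than after the driver JVM has started; the executor option can also go on a SparkConf:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("permgen-example")
      .set("spark.driver.extraJavaOptions",   "-XX:MaxPermSize=256m")  // effective only if set at launch time
      .set("spark.executor.extraJavaOptions", "-XX:MaxPermSize=256m")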

Re: How to Map and Reduce in sparkR

2015-06-25 Thread Eskilson,Aleksander
The simple answer is that SparkR does support map/reduce operations over RDD’s through the RDD API, but since Spark v 1.4.0, those functions were made private in SparkR. They can still be accessed by prepending the function with the namespace, like SparkR:::lapply(rdd, func). It was thought tho

Re: map vs mapPartitions

2015-06-25 Thread Daniel Darabos
Spark creates a RecordReader and uses next() on it when you call input.next(). (See https://github.com/apache/spark/blob/v1.4.0/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L215) How the RecordReader works is an HDFS question, but it's safe to say there is no difference between using ma

Re: map vs mapPartitions

2015-06-25 Thread Shushant Arora
Then how is the performance of mapPartitions faster than map? On Thu, Jun 25, 2015 at 6:40 PM, Daniel Darabos < daniel.dara...@lynxanalytics.com> wrote: > Spark creates a RecordReader and uses next() on it when you call > input.next(). (See > https://github.com/apache/spark/blob/v1.4.0/core/src/main/

Spark GraphX memory requirements + java.lang.OutOfMemoryError: GC overhead limit exceeded

2015-06-25 Thread Roman Sokolov
Hello! I am trying to compute the number of triangles with GraphX, but I get a memory or heap size error even though the dataset is very small (1Gb). I run the code in spark-shell on a 16Gb RAM machine (also tried with 2 workers on separate machines, 8Gb RAM each). So I have 15x more memory than the dat

Re: map vs mapPartitions

2015-06-25 Thread Hao Ren
It's not the number of executors that matters, but the # of the CPU cores of your cluster. Each partition will be loaded on a core for computing. e.g. A cluster of 3 nodes has 24 cores, and you divide the RDD in 24 partitions (24 tasks for narrow dependency). Then all the 24 partitions will be lo

Re: Debugging Apache Spark clustered application from Eclipse

2015-06-25 Thread Yana Kadiyska
Pass that debug string to your executor like this: --conf spark.executor.extraJavaOptions="-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=7761". When your executor is launched it will send debug information on port 7761. When you attach the Eclipse debugger, you need to have the IP

Spark Meetup Istanbul

2015-06-25 Thread Şafak Serdar Kapçı
Hello, I created a Meetup and LinkedIn group in Istanbul. If it is possible, can you add it to the list as the Istanbul Meetup? There is no official Meetup in Istanbul. I am a full-time developer and edX student and Spark learner. I am taking both courses: BerkeleyX: CS100.1x Introduction to Big Data with Apac

Re: SparkR parallelize not found with 1.4.1?

2015-06-25 Thread Felix C
Thanks! It's good to know --- Original Message --- From: "Eskilson,Aleksander" Sent: June 25, 2015 5:57 AM To: "Felix C" , user@spark.apache.org Subject: Re: SparkR parallelize not found with 1.4.1? Hi there, Parallelize is part of the RDD API which was made private for Spark v. 1.4.0. Some fu

Re: NaiveBayes for MLPipeline is absent

2015-06-25 Thread Xiangrui Meng
FYI, I made a JIRA for this: https://issues.apache.org/jira/browse/SPARK-8600. -Xiangrui On Fri, Jun 19, 2015 at 3:01 PM, Xiangrui Meng wrote: > Hi Justin, > > We plan to add it in 1.5, along with some other estimators. We are now > preparing a list of JIRAs, but feel free to create a JIRA for th

Re: SparkR parallelize not found with 1.4.1?

2015-06-25 Thread Eskilson,Aleksander
I forgot to mention that if you need to access these functions for some reason, you can prepend the function call with the SparkR private namespace, like so, SparkR:::lapply(rdd, func). On 6/25/15, 9:30 AM, "Felix C" wrote: >Thanks! It's good to know > >--- Original Message --- > >From: "Eskilso

Re: Spark Meetup Istanbul

2015-06-25 Thread ayan guha
BTW is there an active Spark community around Melbourne? Kindly ping me if any enthusiast wants to partner with me to create one... On 26 Jun 2015 00:17, "Şafak Serdar Kapçı" wrote: > Hello, > I create a Meetup and Linkedin group in Istanbul. If it is possible can > you add to list as Istanbul Meetu

Re: Spark Meetup Istanbul

2015-06-25 Thread Paco Nathan
Hi Ayan, Yes, there is -- quite active Check the Spark global events listing to see about meetups and other Spark-related talks in Melbourne: https://docs.google.com/spreadsheets/d/1HKb_uwpQOOtBihRH8nBhgOHrsuy1nsGNlKwG32_qA3Y/edit#gid=0 ...and many other locations :) Paco On Thu, Jun 25, 2015

Re: com.esotericsoftware.kryo.KryoException: java.io.IOException: failed to read chunk

2015-06-25 Thread Josh Rosen
Which Spark version are you using? AFAIK the corruption bugs in sort-based shuffle should have been fixed in newer Spark releases. On Wed, Jun 24, 2015 at 12:25 PM, Piero Cinquegrana < pcinquegr...@marketshare.com> wrote: > Switching spark.shuffle.manager from sort to hash fixed this issue as >

Re: Exception in thread "main" java.lang.NoSuchMethodError: com.google.common.base.Stopwatch.elapsedMillis()J

2015-06-25 Thread Max Demoulin
I see, thank you! -- Henri Maxime Demoulin 2015-06-25 5:54 GMT-04:00 Steve Loughran : > you are using a guava version on the classpath which your version of > Hadoop can't handle. try a version < 15 or build spark against Hadoop 2.7.0 > > > On 24 Jun 2015, at 19:03, maxdml wrote: > > > >Exc

Re: map vs mapPartitions

2015-06-25 Thread Shushant Arora
Yes, 1 partition per core, and mapPartitions applies the function to each partition. The question is: does the complete partition load into memory so that the function can be applied to it, or is it an iterator where iterator.next() loads the next record? And if the latter, how is it more efficient than map, which also works on 1 recor

Can I access the Decision Tree Output

2015-06-25 Thread Dempsey, Robert
Hi Spark Version 1.4 I am trying to replicate a DecisionTree in Spark having already produced the same in scikit-learn. In scikit-learn I can access the model to pick out certain details. Is there any way to do that in Spark? I know you can save the model, which looks to go into a parquet file. H

Problem Run Spark Example HBase Code Using Spark-Submit

2015-06-25 Thread Bin Wang
I am trying to run the Spark example code HBaseTest from command line using spark-submit instead run-example, in that case, I can learn more how to run spark code in general. However, it told me CLASS_NOT_FOUND about htrace since I am using CDH5.4. I successfully located the htrace jar file but I

Re: Spark GraphX memory requirements + java.lang.OutOfMemoryError: GC overhead limit exceeded

2015-06-25 Thread Ted Yu
The assertion failure from TriangleCount.scala corresponds with the following lines: g.outerJoinVertices(counters) { (vid, _, optCounter: Option[Int]) => val dblCount = optCounter.getOrElse(0) // double count should be even (divisible by two) assert((dblCount & 1)

Re: spark1.4 sparkR usage

2015-06-25 Thread Shivaram Venkataraman
The Apache Spark API docs for SparkR https://spark.apache.org/docs/1.4.0/api/R/index.html represent what has been released with Spark 1.4. The AMPLab version is no longer under active development and I'd recommend users to use the version in the Apache project. Thanks Shivaram On Thu, Jun 25, 201

Re: map vs mapPartitions

2015-06-25 Thread Corey Nolet
I don't know exactly what's going on under the hood but I would not assume that just because a whole partition is not being pulled into memory at one time, that means each record is being pulled one at a time. That's the beauty of exposing Iterators & Iterables in an API rather than collections- the
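
A minimal Scala sketch (not the poster's Java code) of the point being made: mapPartitions hands the function the partition's iterator, so records can stream through it without materializing the whole partition, unless the function itself collects them into a list:

    val rdd = sc.textFile("path")

    // map: the function is applied to one record at a time
    val lengths = rdd.map(line => line.length)

    // mapPartitions: the function sees the partition's iterator; staying lazy
    // (iter.map) avoids holding the full partition in memory, and per-partition
    // setup (e.g. opening one connection) happens only once per partition
    val lengthsToo = rdd.mapPartitions { iter =>
      iter.map(line => line.length)
    }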

Re: mllib from sparkR

2015-06-25 Thread Shivaram Venkataraman
Not yet - We are working on it as a part of https://issues.apache.org/jira/browse/SPARK-6805 and you can follow the JIRA for more information On Wed, Jun 24, 2015 at 2:30 AM, escardovi wrote: > Hi, > I was wondering if it is possible to use MLlib function inside SparkR, as > outlined at the Spar

Re: How to Map and Reduce in sparkR

2015-06-25 Thread Shivaram Venkataraman
In addition to Aleksander's point please let us know what use case would use RDD-like API in https://issues.apache.org/jira/browse/SPARK-7264 -- We are hoping to have a version of this API in upcoming releases. Thanks Shivaram On Thu, Jun 25, 2015 at 6:02 AM, Eskilson,Aleksander < alek.eskil...@c

Re: map vs mapPartitions

2015-06-25 Thread Corey Nolet
Also, I've noticed that .map() actually creates a MapPartitionsRDD under the hood. So I think the real difference is just in the API that's being exposed. You can do a map() and not have to think about the partitions at all, or you can do a .mapPartitions() and be able to do things like chunking of

Performing sc.paralleize (..) in workers not in the driver program

2015-06-25 Thread shahab
Hi, Apparently, the sc.parallelize(..) operation is performed in the driver program, not in the workers! Is it possible to do this in the worker process for the sake of scalability? best /Shahab

RE: Performing sc.paralleize (..) in workers not in the driver program

2015-06-25 Thread Ganelin, Ilya
The parallelize operation accepts as input a data structure in memory. When you call it, you are necessarily operating in the memory space of the driver, since that is where user code executes. Until you have an RDD, you can't really operate in a distributed way. If your files are stored in a di
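
A minimal sketch (paths are assumptions) of the distinction being drawn: parallelize ships a driver-local collection out to the cluster, whereas textFile (or any Hadoop-backed input) lets executors read their own partitions directly from distributed storage, which is what scales:

    val localData = Seq(1, 2, 3, 4, 5)
    val fromDriver = sc.parallelize(localData, numSlices = 4)    // data originates in driver memory

    val fromStorage = sc.textFile("hdfs:///data/events/*.log")   // hypothetical path; executors read their own splits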

Fwd: map vs mapPartitions

2015-06-25 Thread Hao Ren
-- Forwarded message -- From: Hao Ren Date: Thu, Jun 25, 2015 at 7:03 PM Subject: Re: map vs mapPartitions To: Shushant Arora In fact, map and mapPartitions produce RDD of the same type: MapPartitionsRDD. Check RDD api source code below: def map[U: ClassTag](f: T => U): RDD[U]

Re: Exception in thread "main" java.lang.NoSuchMethodError: com.google.common.base.Stopwatch.elapsedMillis()J

2015-06-25 Thread Max Demoulin
Can I actually include another version of guava in the classpath when launching the example through spark submit? -- Henri Maxime Demoulin 2015-06-25 10:57 GMT-04:00 Max Demoulin : > I see, thank you! > > -- > Henri Maxime Demoulin > > 2015-06-25 5:54 GMT-04:00 Steve Loughran : > >> you are usin

Re: Parsing a tsv file with key value pairs

2015-06-25 Thread Don Drake
Use this package: https://github.com/databricks/spark-csv and change the delimiter to a tab. The documentation is pretty straightforward, you'll get a DataFrame back from the parser. -Don On Thu, Jun 25, 2015 at 4:39 AM, Ravikant Dindokar wrote: > So I have a file where each line represents
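
A minimal Scala sketch (path and option values are assumptions; the option names come from the spark-csv docs) of reading a tab-separated file into a DataFrame with that package:

    val edges = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("delimiter", "\t")       // tab-separated instead of comma
      .option("header", "false")
      .option("inferSchema", "true")
      .load("hdfs:///data/edges.tsv")  // hypothetical path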

assign unique ID (Long Value) to each line in RDD

2015-06-25 Thread Ravikant Dindokar
I have a file containing one line for each edge in the graph, with two vertex ids (source & sink). sample: 1 2 (here 1 is the source and 2 is the sink node for the edge) 1 5 2 3 4 2 4 3 I want to assign a unique id (Long value) to each edge, i.e. for each line of the file. How to ensure assign

Re: How to get the memory usage infomation of a spark application

2015-06-25 Thread maxdml
You can see the amount of memory consumed by each executor in the web ui (go to the application page, and click on the executor tab). Otherwise, for a finer grained monitoring, I can only think of correlating a system monitoring tool like Ganglia, with the event timeline of your job. -- View th

Using Spark on Azure Blob Storage

2015-06-25 Thread Daniel Haviv
Hi, I'm trying to use spark over Azure's HDInsight but the spark-shell fails when starting: java.io.IOException: No FileSystem for scheme: wasb at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2584) at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.

Re: assign unique ID (Long Value) to each line in RDD

2015-06-25 Thread neel choudhury
Hi Ravi, you can do one thing. You can create an RDD with the edges and then do zipWithIndex. Let a = sc.parallelize(['9:8','1:2','1:2','3,5']) a.zipWithIndex().collect() gives [('9:8', 0), ('1:2', 1), ('1:2', 2), ('3,5', 3)] Let me know if you have any other queries On Thu, Jun 25, 2015 at
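
For reference, a Scala sketch of the same zipWithIndex approach applied to the tab-separated edge file from the question (path and parsing are assumptions): each line gets a unique Long index.

    val edges = sc.textFile("hdfs:///data/edges.tsv")   // hypothetical path, one "src<TAB>dst" pair per line
    val edgesWithId = edges.zipWithIndex().map { case (line, id) =>
      val Array(src, dst) = line.split("\t")
      (id, (src.toLong, dst.toLong))                     // (edgeId, (sourceVertex, sinkVertex))
    }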

Re: Using Spark on Azure Blob Storage

2015-06-25 Thread Peter Rudenko
Hi Daniel, yes it is supported, however you need to add hadoop-azure.jar to the classpath of spark-shell (http://search.maven.org/#search%7Cga%7C1%7Chadoop-azure - it's available only for hadoop-2.7.0). Try to find it on your node and run: export CLASSPATH=$CLASSPATH:hadoop-azure.jar && spark-shell

Recent spark sc.textFile needs hadoop for folders?!?

2015-06-25 Thread Ashic Mahtab
Hello, Just trying out spark 1.4 (we're using 1.1 at present). On Windows, I've noticed the following: * On 1.4, sc.textFile("D:\\folder\\").collect() fails from both spark-shell.cmd and when running a scala application referencing the spark-core package from maven. * sc.textFile("D:\\folder\\file

java.io.NotSerializableException: org.apache.spark.SparkContext

2015-06-25 Thread ๏̯͡๏
Spark Version: 1.3.1 How can SparkContext not be serializable? Any suggestions to resolve this issue? I included a trait + implementation (the implementation has a method that takes SC as an argument) and I started seeing this exception trait DetailDataProvider[T1 <: Data] extends java.io.Serializable

Re: java.io.NotSerializableException: org.apache.spark.SparkContext

2015-06-25 Thread ๏̯͡๏
Ok. I modified the code to remove sc as sc is never serializable and must not be passed to map functions. On Thu, Jun 25, 2015 at 11:11 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote: > Spark Version: 1.3.1 > > How can SparkContext not be serializable. > Any suggestions to resolve this issue ? > > I included a trait +

Re: Using Spark on Azure Blob Storage

2015-06-25 Thread Silvio Fiorito
Hi Daniel, As Peter pointed out you need the hadoop-azure JAR as well as the Azure storage SDK for Java (com.microsoft.azure:azure-storage). Even though the WASB driver is built for 2.7, I was still able to use the hadoop-azure JAR with Spark built for older Hadoop versions, back to 2.4 I think

Scala/Python or Java

2015-06-25 Thread spark user
Hi All, I am new to Spark, I just want to know which technology is good/best for Spark learning? 1) Scala 2) Java 3) Python I know Spark supports all 3 languages, but which one is best? Thanks su

Re: How to run kmeans.py Spark example in yarn-cluster ?

2015-06-25 Thread Elkhan Dadashov
Hi all, Does Spark 1.4 version support Python applications on Yarn-cluster ? (--master yarn-cluster) Does Spark 1.4 version support Python applications with deploy-mode cluster ? (--deploy-mode cluster) How can we ship 3rd party Python dependencies with Python Spark job ? (running on Yarn cluste

Re:

2015-06-25 Thread ๏̯͡๏
How can I increase the number of tasks from 174 to 500 without running repartition? The input size is 512.0 MB (hadoop) / 4159106. Can this be reduced to 64 MB so as to increase the number of tasks? Similar to the split size that increases the number of mappers in Hadoop M/R. On Thu, Jun 25, 2015 at

Has anyone run Python Spark application on Yarn-cluster mode ? (which has 3rd party Python modules to be shipped with)

2015-06-25 Thread Elkhan Dadashov
In addition to previous emails, when i try to execute this command from command line: ./bin/spark-submit --verbose --master yarn-cluster --py-files mypython/libs/numpy-1.9.2.zip --deploy-mode cluster mypython/scripts/kmeans.py /kmeans_data.txt 5 1.0 - numpy-1.9.2.zip - is downloaded numpy packa

Spark 1.4.0, Secure YARN Cluster, Application Master throws 500 connection refused

2015-06-25 Thread Nachiketa
Spark 1.4.0 - Custom built from source against Hortonworks HDP 2.2 (hadoop 2.6.0+) HDP 2.2 Cluster (Secure, kerberos) spark-shell (--master yarn-client) launches fine and the prompt shows up. Clicking on the Application Master url on the YARN RM UI, throws 500 connect error. The same build works

Re: Has anyone run Python Spark application on Yarn-cluster mode ? (which has 3rd party Python modules to be shipped with)

2015-06-25 Thread Marcelo Vanzin
That sounds like SPARK-5479 which is not in 1.4... On Thu, Jun 25, 2015 at 12:17 PM, Elkhan Dadashov wrote: > In addition to previous emails, when i try to execute this command from > command line: > > ./bin/spark-submit --verbose --master yarn-cluster --py-files > mypython/libs/numpy-1.9.2.zip

Re:

2015-06-25 Thread Silvio Fiorito
Hi Deepak, Have you tried specifying the minimum partitions when you load the file? I haven’t tried that myself against HDFS before, so I’m not sure if it will affect data locality. Ideally not, it should still maintain data locality but just more partitions. Once your job runs, you can check i
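
A minimal sketch of the suggestion: textFile (and the other Hadoop input methods) take a minimum-partitions hint, so a 512 MB input can be split into more tasks without an explicit repartition. For the Avro/newAPIHadoopFile path mentioned later in the thread, an alternative (an assumption, since the exact job config isn't shown) is to cap the Hadoop split size:

    val rdd = sc.textFile("hdfs:///data/input", minPartitions = 500)

    // Alternative for Hadoop-based inputs: cap the input split size at 64 MB
    sc.hadoopConfiguration.setLong("mapreduce.input.fileinputformat.split.maxsize", 64L * 1024 * 1024)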

Re: Has anyone run Python Spark application on Yarn-cluster mode ? (which has 3rd party Python modules to be shipped with)

2015-06-25 Thread Elkhan Dadashov
Thanks Marcelo. But my case is different. My mypython/libs/numpy-1.9.2.zip is in *local directory* (can also put in HDFS), but still fails. But SPARK-5479 is : PySpark on yarn mode need to support *non-local* python files. The job fails only whe

Re: Has anyone run Python Spark application on Yarn-cluster mode ? (which has 3rd party Python modules to be shipped with)

2015-06-25 Thread Naveen Madhire
Hi Marcelo, Quick Question. I am using Spark 1.3 and using Yarn Client mode. It is working well, provided I manually pip-install all the 3rd party libraries like numpy etc. on the executor nodes. So does the SPARK-5479 fix in 1.5 which you mentioned fix this as well? Thanks. On Thu, Jun 25,

Re: Has anyone run Python Spark application on Yarn-cluster mode ? (which has 3rd party Python modules to be shipped with)

2015-06-25 Thread Marcelo Vanzin
Please take a look at the pull request with the actual fix; that will explain why it's the same issue. On Thu, Jun 25, 2015 at 12:51 PM, Elkhan Dadashov wrote: > Thanks Marcelo. > > But my case is different. My mypython/libs/numpy-1.9.2.zip is in *local > directory* (can also put in HDFS), but s

Re: Spark 1.4.0, Secure YARN Cluster, Application Master throws 500 connection refused

2015-06-25 Thread Nachiketa
A few other observations. 1. Spark 1.3.1 (custom built against HDP 2.2) was running fine against the same cluster and same hadoop configuration (hence seems like regression). 2. HA is enabled for YARN RM and HDFS (not sure if this would impact anything but wanted to share anyway). 3. Found this

Re: Scala/Python or Java

2015-06-25 Thread Ted Yu
The answer depends on the user's experience with these languages as well as the most commonly used language in the production environment. Learning Scala requires some time. If you're very comfortable with Java / Python, you can go with that while at the same time familiarizing yourself with Scala

Re: Spark 1.4.0, Secure YARN Cluster, Application Master throws 500 connection refused (Resolved)

2015-06-25 Thread Nachiketa
Setting the yarn.resourcemanager.webapp.address.rm1 and yarn.resourcemanager.webapp.address.rm2 in yarn-site.xml seems to have resolved the issue. Appreciate any comments about the regression from 1.3.1 ? Thanks. Regards, Nachiketa On Fri, Jun 26, 2015 at 1:28 AM, Nachiketa wrote: > A few othe

Re: Scala/Python or Java

2015-06-25 Thread Saurabh Agrawal
Greetings, Even I am a beginner and currently learning Spark. I found Python + Spark combination to be easiest to learn given my past experience with Python, but yes, it depends on the user. Here is some reference documentation: https://spark.apache.org/docs/latest/programming-guide.html Regards

Re: Scala/Python or Java

2015-06-25 Thread ayan guha
I am a Python fan so I use Python. But what I've noticed is that some features are typically 1-2 releases behind for Python. So I strongly agree with Ted: start with the language you are most familiar with and plan to move to Scala eventually On 26 Jun 2015 06:07, "Ted Yu" wrote: > The answer depends on the

sparkR could not find function "textFile"

2015-06-25 Thread Wei Zhou
Hi all, I am exploring sparkR by activating the shell and following the tutorial here https://amplab-extras.github.io/SparkR-pkg/ And when I tried to read in a local file with textFile(sc, "file_location"), it gives an error could not find function "textFile". By reading through sparkR doc for 1

Re: sparkR could not find function "textFile"

2015-06-25 Thread Eskilson,Aleksander
Hi there, The tutorial you’re reading there was written before the merge of SparkR for Spark 1.4.0 For the merge, the RDD API (which includes the textFile() function) was made private, as the devs felt many of its functions were too low level. They focused instead on finishing the DataFrame API

Re: sparkR could not find function "textFile"

2015-06-25 Thread Wei Zhou
Hi Alek, Thanks for the explanation, it is very helpful. Cheers, Wei 2015-06-25 13:40 GMT-07:00 Eskilson,Aleksander : > Hi there, > > The tutorial you’re reading there was written before the merge of SparkR > for Spark 1.4.0 > For the merge, the RDD API (which includes the textFile() function

Re: sparkR could not find function "textFile"

2015-06-25 Thread Wei Zhou
Hi Alek, Just a follow up question. This is what I did in the sparkR shell: lines <- SparkR:::textFile(sc, "./README.md") head(lines) And I am getting the error: "Error in x[seq_len(n)] : object of type 'S4' is not subsettable" I'm wondering what I did wrong. Thanks in advance. Wei 2015-06-25 13:

sql dataframe internal representation

2015-06-25 Thread Koert Kuipers
I noticed in DataFrame that to get the RDD out of it some conversions are done: val converter = CatalystTypeConverters.createToScalaConverter(schema) rows.map(converter(_).asInstanceOf[Row]) Does this mean DataFrame internally does not use the standard Scala types? Why not?

Re: sparkR could not find function "textFile"

2015-06-25 Thread Shivaram Venkataraman
The `head` function is not supported for the RRDD that is returned by `textFile`. You can run `take(lines, 5L)`. I should add a warning here that the RDD API in SparkR is private because we might not support it in the upcoming releases. So if you can use the DataFrame API for your application you s

Re: sparkR could not find function "textFile"

2015-06-25 Thread Eskilson,Aleksander
Yeah, that's probably because the head() you're invoking there is defined for SparkR DataFrames [1] (note how you don't have to use the SparkR::: namespace in front of it), but SparkR:::textFile() returns an RDD object, which is more like a distributed list data structure the way you're applying

Re: Scala/Python or Java

2015-06-25 Thread spark user
Spark is based on Scala and it is written in Scala. To debug and fix issues, I guess learning Scala is good for the long term? Any advice? On Thursday, June 25, 2015 1:26 PM, ayan guha wrote: I am a python fan so I use python. But what I noticed some features are typically 1-2 release be

Re: sparkR could not find function "textFile"

2015-06-25 Thread Wei Zhou
Thanks to both Shivaram and Alek. Then if I want to create DataFrame from comma separated flat files, what would you recommend me to do? One way I can think of is first reading the data as you would do in r, using read.table(), and then create spark DataFrame out of that R dataframe, but it is obvi

Re: sparkR could not find function "textFile"

2015-06-25 Thread Shivaram Venkataraman
You can use the Spark CSV reader to read in flat CSV files to a data frame. See https://gist.github.com/shivaram/d0cd4aa5c4381edd6f85 for an example Shivaram On Thu, Jun 25, 2015 at 2:15 PM, Wei Zhou wrote: > Thanks to both Shivaram and Alek. Then if I want to create DataFrame from > comma s

Re: sparkR could not find function "textFile"

2015-06-25 Thread Eskilson,Aleksander
Sure, I had a similar question that Shivaram was able to answer quickly for me; the solution is implemented using a separate Databricks library. Check out this thread from the email archives [1], and the read.df() command [2]. CSV files can be a bit tricky, especially with inferring their schemas. Are you u

Re: sparkR could not find function "textFile"

2015-06-25 Thread Wei Zhou
Thanks Shivaram, this is exactly what I am looking for. 2015-06-25 14:22 GMT-07:00 Shivaram Venkataraman : > You can use the Spark CSV reader to do read in flat CSV files to a data > frame. See https://gist.github.com/shivaram/d0cd4aa5c4381edd6f85 for an > example > > Shivaram > > On Thu, Jun 25,

Re: sparkR could not find function "textFile"

2015-06-25 Thread Wei Zhou
I tried out the solution using the spark-csv package, and it works fine now :) Thanks. Yes, I'm playing with a file with all columns as String, but the real data I want to process are all doubles. I'm just exploring what SparkR can do versus regular Scala Spark, as I am by heart an R person. 2015-06-2

Re:

2015-06-25 Thread ๏̯͡๏
I use sc.newAPIHadoopFile[AvroKey[GenericRecord], NullWritable, AvroKeyInputFormat[GenericRecord]](path + "/*.avro") https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/SparkContext.html#newAPIHadoopFile(java.lang.String, java.lang.Class, java.lang.Class, java.lang.Class, org.apache.ha

Re: How to Map and Reduce in sparkR

2015-06-25 Thread Wei Zhou
Hi Shivaram/Alek, I understand that a better way to import data is as a DataFrame rather than an RDD. If one wants to do a map-like transformation on each row in SparkR, one could use SparkR:::lapply(), but is there a counterpart row operation on DataFrames? The use case I am working on requires compli

Re: Using Spark on Azure Blob Storage

2015-06-25 Thread Daniel Haviv
Thank you guys for the helpful answers. Daniel > On 25 June 2015, at 21:23, Silvio Fiorito > wrote: > > Hi Daniel, > > As Peter pointed out you need the hadoop-azure JAR as well as the Azure > storage SDK for Java (com.microsoft.azure:azure-storage). Even though the > WASB driver is built

Re: How to Map and Reduce in sparkR

2015-06-25 Thread Shivaram Venkataraman
We don't support UDFs on DataFrames in SparkR in the 1.4 release. The existing functionality can be seen as a pre-processing step which you can do and then collect data back to the driver to do more complex processing. Along with the RDD API ticket, we are also working on UDF support. You can see t

RE: Using Spark on Azure Blob Storage

2015-06-25 Thread Jacob Kim
Below is the link for step by step guide in how to setup and use Spark in HDInsight. https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-spark-install/ Jacob From: Daniel Haviv [mailto:daniel.ha...@veracity-group.com] Sent: Thursday, June 25, 2015 3:19 PM To: Silvio Fiorit

reduceByKey - add values to a list

2015-06-25 Thread Kannappan Sirchabesan
Hi, I am trying to see what is the best way to reduce the values of an RDD of (key,value) pairs into (key,ListOfValues) pairs. I know various ways of achieving this, but I am looking for an efficient, elegant one-liner if there is one. Example: Input RDD: (USA, California), (UK, Yorkshire), (US
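
A minimal Scala sketch (sample data adapted from the question) of the usual options: groupByKey is the one-liner, while aggregateByKey builds the lists on the map side, which is generally cheaper when keys repeat often:

    val pairs = sc.parallelize(Seq(("USA", "California"), ("UK", "Yorkshire"), ("USA", "Texas")))

    val grouped = pairs.groupByKey()                   // RDD[(String, Iterable[String])]

    val asLists = pairs.aggregateByKey(List.empty[String])(
      (acc, v) => v :: acc,                            // add a value within a partition
      (a, b)   => a ::: b                              // merge partial lists across partitions
    )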
