Porting R code to SparkR

2015-11-11 Thread Sanjay Subramanian
Hi guys. This is possibly going to sound like a vague, stupid question but I have a problem to solve and I need help, so any which way I go is only up :-) I have a bunch of R scripts (I am not an R expert) and we are currently evaluating how to translate these R scripts to SparkR data frame syntax

Re: Is it possible to run SparkR on 2 nodes without HDFS

2015-11-10 Thread Sanjay Subramanian
slave will report back if this works! thanks sanjay From: shenLiu To: Sanjay Subramanian ; User Sent: Monday, November 9, 2015 10:23 PM Subject: RE: Is it possible to run SparkR on 2 nodes without HDFS

Is it possible to run SparkR on 2 nodes without HDFS

2015-11-09 Thread Sanjay Subramanian
hey guys I have a 2 node SparkR cluster (1 master, 1 slave) on AWS using spark-1.5.1-bin-without-hadoop.tgz. Running the SparkR job on the master node: /opt/spark-1.5.1-bin-hadoop2.6/bin/sparkR --master spark://ip-xx-ppp-vv-ddd:7077 --packages com.databricks:spark-csv_2.10:1.2.0 --executor-cores
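On the title question: without HDFS, any input path has to resolve on every node. A hedged Scala-side illustration (the SparkR equivalent behaves the same way; the path is made up):

// With no shared filesystem, a file:// input must exist at the identical path
// on the master and the slave, or partitions scheduled remotely will fail
val rows = sc.textFile("file:///data/input/sample.csv")
println(rows.count())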

Re: Spark-sql versus Impala versus Hive

2015-06-19 Thread Sanjay Subramanian
know if this was Hive on Tez. - Steve From: Sanjay Subramanian Reply-To: Sanjay Subramanian Date: Thursday, June 18, 2015 at 11:08 To: "user@spark.apache.org" Subject: Spark-sql versus Impala versus Hive I just published results of my findings here: https://bigdatalatte.wordpress.com/2

Spark-sql versus Impala versus Hive

2015-06-18 Thread Sanjay Subramanian
I just published results of my findings here: https://bigdatalatte.wordpress.com/2015/06/18/spark-sql-versus-impala-versus-hive/

Re: spark-sql from CLI --->EXCEPTION: java.lang.OutOfMemoryError: Java heap space

2015-06-17 Thread Sanjay Subramanian
aers; create table unique_aers_demo as select distinct isr,event_dt,age,age_cod,sex,year,quarter from aers.aers_demo_view " --driver-memory 4G --total-executor-cores 12 --executor-memory 4G thanks From: Sanjay Subramanian To: "user@spark.apache.org" Sent: Thursday, J

spark-sql CLI options does not work --master yarn --deploy-mode client

2015-06-16 Thread Sanjay Subramanian
hey guys, I have CDH 5.3.3 with Spark 1.2.0 (on Yarn). This does not work:
/opt/cloudera/parcels/CDH/lib/spark/bin/spark-sql --deploy-mode client --master yarn --driver-memory 1g -e "select j.person_id, p.first_name, p.last_name, count(*) from (select person_id from cdr.cdr_mjp_joborder where pers

HDFS not supported by databricks cloud :-(

2015-06-16 Thread Sanjay Subramanian
hey guys After day one at the spark-summit SFO, I realized sadly that (indeed) HDFS is not supported by Databricks cloud. My speed bottleneck is to transfer ~1TB of snapshot HDFS data (250+ external hive tables) to S3 :-( I want to use databricks cloud but this to me is a starting disabler. The ha

Re: spark-sql from CLI --->EXCEPTION: java.lang.OutOfMemoryError: Java heap space

2015-06-16 Thread Sanjay Subramanian
ing Spark 1.4.0 with SQL code generation turned on; this should make a > huge difference. > >> On Sat, Jun 13, 2015 at 5:08 PM, Sanjay Subramanian >> wrote: >> hey guys >> >> I tried the following settings as well. No luck >> >> --total-executor
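For anyone trying the suggestion above on 1.x, a minimal sketch (not from the thread) of flipping the experimental code-generation flag; the conf key and its effect varied by version, so treat this as an assumption to verify:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("codegen-sketch"))
val sqlContext = new SQLContext(sc)
// Compile SQL expressions to JVM bytecode at runtime instead of interpreting them
sqlContext.setConf("spark.sql.codegen", "true")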

Re: spark-sql from CLI --->EXCEPTION: java.lang.OutOfMemoryError: Java heap space

2015-06-13 Thread Sanjay Subramanian
:-) to my questions on all CDH groups, Spark, Hive best regards sanjay   From: Josh Rosen To: Sanjay Subramanian Cc: "user@spark.apache.org" Sent: Friday, June 12, 2015 7:15 AM Subject: Re: spark-sql from CLI --->EXCEPTION: java.lang.OutOfMemoryError: Java heap space

spark-sql from CLI --->EXCEPTION: java.lang.OutOfMemoryError: Java heap space

2015-06-11 Thread Sanjay Subramanian
hey guys. Using Hive and Impala daily, intensively. Want to transition to spark-sql in CLI mode. Currently in my sandbox I am using Spark (standalone mode) in the CDH distribution (starving developer version 5.3.3): 3-datanode hadoop cluster, 32GB RAM per node, 8 cores per node | spark | 1.2.0+cdh5.3

Can't figure out spark-sql errors - switching to Impala - sorry guys

2015-06-02 Thread Sanjay Subramanian
Can't figure out spark-sql errors - switching to Hive and Impala for now - sorry guys, no hard feelings From: Sanjay Subramanian To: Sanjay Subramanian ; user Sent: Saturday, May 30, 2015 1:52 PM Subject: Re: spark-sql errors any ideas guys? How to solve this? From

Re: spark-sql errors

2015-05-30 Thread Sanjay Subramanian
any ideas guys? How to solve this? From: Sanjay Subramanian To: user Sent: Friday, May 29, 2015 5:29 PM Subject: spark-sql errors https://groups.google.com/a/cloudera.org/forum/#!topic/cdh-user/6SqGuYemnbc

Re: Is anyone using Amazon EC2? (second attempt!)

2015-05-29 Thread Sanjay Subramanian
I use spark on EC2 but it's a CDH 5.3.3 distribution (starving developer version) installed thru Cloudera Manager. Spark is configured to run on Yarn. Regards Sanjay Sent from my iPhone > On May 29, 2015, at 6:16 PM, roni wrote: > > Hi, > Any update on this? > I am not sure if the issue I

spark-sql errors

2015-05-29 Thread Sanjay Subramanian
https://groups.google.com/a/cloudera.org/forum/#!topic/cdh-user/6SqGuYemnbc

Re: Pointing SparkSQL to existing Hive Metadata with data file locations in HDFS

2015-05-28 Thread Sanjay Subramanian
"SQL File" mode - /opt/cloudera/parcels/CDH/lib/spark/bin/spark-sql -f get_names.hql From: Andrew Otto To: Sanjay Subramanian Cc: user Sent: Thursday, May 28, 2015 7:26 AM Subject: Re: Pointing SparkSQL to existing Hive Metadata with data file locations in HDFS

Pointing SparkSQL to existing Hive Metadata with data file locations in HDFS

2015-05-27 Thread Sanjay Subramanian
hey guys. On the Hive/Hadoop ecosystem we have (Cloudera distribution CDH 5.2.x), there are about 300+ hive tables. The data is stored as text (moving slowly to Parquet) on HDFS. I want to use SparkSQL and point to the Hive metadata and be able to define JOINS etc using a programming structure
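For readers landing on this thread: the standard Spark 1.x answer is HiveContext with hive-site.xml on the classpath. A minimal sketch (the table and column names below are illustrative, not from the thread):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("hive-metastore-demo"))
// Picks up hive-site.xml from the classpath and talks to the existing metastore
val hiveContext = new HiveContext(sc)

// Tables already registered in Hive are queryable directly, joins included
val joined = hiveContext.sql(
  "SELECT a.person_id, b.first_name FROM mydb.joborder a JOIN mydb.person b ON a.person_id = b.person_id")
joined.collect().foreach(println)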

Re: MappedRDD signature

2015-01-28 Thread Sanjay Subramanian
Thanks Sean. That works, and I started the join of this mappedRDD to another one I have. I have to internalize the use of map versus flatMap. Thinking Map Reduce Java Hadoop code often blinds me :-) From: Sean Owen To: Sanjay Subramanian Cc: Cheng Lian ; Jorge Lopez-Malla ; "
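On the map-versus-flatMap point, a two-line illustration (assuming a SparkContext sc, as in the shell):

val lines = sc.parallelize(Seq("a b", "c"))
lines.map(_.split(" ")).count()      // 2 -- one Array[String] per input line
lines.flatMap(_.split(" ")).count()  // 3 -- the arrays are flattened into "a", "b", "c"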

MappedRDD signature

2015-01-28 Thread Sanjay Subramanian
hey guys, I am not following why this happens.
DATASET
=======
Tab separated values (164 columns)
Spark command 1
val mjpJobOrderRDD = sc.textFile("/data/cdr/cdr_mjp_joborder_raw")
val mjpJobOrderColsPairedRDD = mjpJobOrderRDD.map(line => { val tokens = line.split("\t");(tokens(23),to
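A hedged reconstruction of where that snippet is headed — pairing a key column with a value column so the RDD can be joined (the completion after the truncation, including the value index, is my guess):

val mjpJobOrderRDD = sc.textFile("/data/cdr/cdr_mjp_joborder_raw")
val mjpJobOrderColsPairedRDD = mjpJobOrderRDD.map { line =>
  val tokens = line.split("\t")
  (tokens(23), tokens(0))  // (key column 23, an assumed value column)
}
// mjpJobOrderColsPairedRDD is an RDD[(String, String)], joinable against another pair RDD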

Re: FlatMapValues

2015-01-05 Thread Sanjay Subramanian
cool let me adapt that. thanks a ton regards sanjay From: Sean Owen To: Sanjay Subramanian Cc: "user@spark.apache.org" Sent: Monday, January 5, 2015 3:19 AM Subject: Re: FlatMapValues For the record, the solution I was suggesting was about like this: inputRDD.flatM

Re: A spark newbie question

2015-01-04 Thread Sanjay Subramanian
val sconf = new SparkConf().setMaster("local").setAppName("MedicalSideFx-CassandraLogsMessageTypeCount")
val sc = new SparkContext(sconf)
val inputDir = "/path/to/cassandralogs.txt"
sc.textFile(inputDir).map(line => line.replace("\"", "")).map(line => (line.split(' ')(0) + " " + line.split(' ')(2
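A hedged guess at how that truncated pipeline ends — keying on fields 0 and 2 and counting per key (the reduceByKey tail is an assumption, not from the thread):

sc.textFile(inputDir)
  .map(line => line.replace("\"", ""))
  .map(line => (line.split(' ')(0) + " " + line.split(' ')(2), 1)) // assumed: (date + " " + messageType, 1)
  .reduceByKey(_ + _)
  .collect()
  .foreach(println)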

Re: Joining by values

2015-01-03 Thread Sanjay Subramanian
s hopefully answered :-)
(2,List(1001,1000,1002,1003, 1004,1001,1006,1007))
(3,List(1011,1012,1013,1010, 1007,1009,1005,1008))
(1,List(1001,1000,1002,1003, 1011,1012,1013,1010, 1004,1001,1006,1007, 1007,1009,1005,1008))
From: Shixiong Zhu To: Sanjay Subramanian Cc: dcmovva ; "user@

Re: Joining by values

2015-01-03 Thread Sanjay Subramanian
6,1007))
(3,CompactBuffer(1011,1012,1013,1010, 1007,1009,1005,1008))
(1,CompactBuffer(1001,1000,1002,1003, 1011,1012,1013,1010, 1004,1001,1006,1007, 1007,1009,1005,1008))
From: Sanjay Subramanian To: dcmovva ; "user@spark.apache.org" Sent: Saturday, January 3, 2015 12:19 PM Subject: Re:

Re: Joining by values

2015-01-03 Thread Sanjay Subramanian
This is my design. Now let me try and code it in Spark.
rdd1.txt
========
1~4,5,6,7
2~4,5
3~6,7
rdd2.txt
========
4~1001,1000,1002,1003
5~1004,1001,1006,1007
6~1007,1009,1005,1008
7~1011,1012,1013,1010
TRANSFORM 1
===========
map each value to key (like an inverted index)
4~1
5~1
6~1
7~1
5~2
4~2
6~3
7~3
TRANSFOR
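A sketch of that design in Spark — not the poster's final code; the file paths and the regrouping step are assumptions:

// Invert rdd1 so each value points at its key, then join against rdd2 on that value
val rdd1 = sc.textFile("rdd1.txt").map { line =>
  val Array(k, vs) = line.split('~'); (k, vs.split(','))
}
val inverted = rdd1.flatMap { case (k, vs) => vs.map(v => (v, k)) }  // ("4","1"), ("5","1"), ...

val rdd2 = sc.textFile("rdd2.txt").map { line =>
  val Array(k, vs) = line.split('~'); (k, vs)                        // ("4","1001,1000,1002,1003"), ...
}

// Join on the shared value, then regroup by the original rdd1 key
val joined = inverted.join(rdd2)
  .map { case (_, (k, vals)) => (k, vals) }
  .groupByKey()
joined.collect().foreach(println)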

Re: saveAsTextFile

2015-01-03 Thread Sanjay Subramanian
@laila Based on the error you mentioned in the nabble link below, it seems like there are no permissions to write to HDFS. So this is possibly why saveAsTextFile is failing. From: Pankaj Narang To: user@spark.apache.org Sent: Saturday, January 3, 2015 4:07 AM Subject: Re: saveAsTextFile

Re: FlatMapValues

2015-01-02 Thread Sanjay Subramanian
else { ("") } }).flatMap(str => str.split('\t')).filter(line => line.toString.length() > 0).saveAsTextFile("/data/vaers/msfx/reac/" + outFile) From: Sanjay Subramanian To: Hitesh Khamesra Cc:

Re: FlatMapValues

2015-01-01 Thread Sanjay Subramanian
thanks let me try that out From: Hitesh Khamesra To: Sanjay Subramanian Cc: Kapil Malik ; Sean Owen ; "user@spark.apache.org" Sent: Thursday, January 1, 2015 9:46 AM Subject: Re: FlatMapValues How about this: apply flatMap per line, and in that function, parse each

Re: FlatMapValues

2014-12-31 Thread Sanjay Subramanian
,Injection site oedema
025005,Injection site reaction
thanks sanjay From: Kapil Malik To: Sean Owen ; Sanjay Subramanian Cc: "user@spark.apache.org" Sent: Wednesday, December 31, 2014 9:35 AM Subject: RE: FlatMapValues Hi Sanjay, Oh yes .. on flatMapValues, it

Re: FlatMapValues

2014-12-31 Thread Sanjay Subramanian
else {     ("","")   }   }).filter(pair => pair._1.length() > 0).flatMapValues(skus => skus.split('\t')).saveAsTextFile("/data/vaers/msfx/reac/" + outFile) Please note that this too saves lines like (025126,Chills), i.e. with opening and closing bracke

FlatMapValues

2014-12-31 Thread Sanjay Subramanian
hey guys, My dataset is like this
025126,Chills,8.10,Injection site oedema,8.10,Injection site reaction,8.10,Malaise,8.10,Myalgia,8.10
Intended output is
==================
025126,Chills
025126,Injection site oedema
025126,Injection site reaction
025126,Malaise
025126,Myalgia
My code is as follo
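The shape of the solution Sean suggests in the 2015-01-05 reply quoted above is a single flatMap per line; a sketch of that shape, assuming inputRDD: RDD[String] holds lines like the sample and outFile is as elsewhere in the thread:

// Each line: key, then alternating (term, score) fields; emit one "key,term" line per term
val out = inputRDD.flatMap { line =>
  val tokens = line.split(',')
  val key = tokens(0)
  tokens.drop(1).grouped(2).map(pair => key + "," + pair(0))  // pair(1) is the 8.10-style score, dropped
}
out.saveAsTextFile("/data/vaers/msfx/reac/" + outFile)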

Re: How to identify erroneous input record ?

2014-12-24 Thread Sanjay Subramanian
at I do at www.medicalsidefx.org. Primarily an iPhone app, but underlying it is Lucene, Hadoop and hopefully soon in 2015 - Spark :-) From: Sean Owen To: Sanjay Subramanian Cc: "user@spark.apache.org" Sent: Wednesday, December 24, 2014 8:56 AM Subject: Re: How to identify er

Re: How to identify erroneous input record ?

2014-12-24 Thread Sanjay Subramanian
lter.map(line => {
  if (line.split('$').length >= 13){
    line.split('$')(0) + "~" + line.split('$')(5) + "~" + line.split('$')(11) + "~" + line.split('$')(12)
  }
})
From: Sanjay Subramanian To: "use

How to identify erroneous input record ?

2014-12-24 Thread Sanjay Subramanian
hey guys, One of my input records has a problem that makes the code fail.
var demoRddFilter = demoRdd.filter(line => !line.contains("ISR$CASE$I_F_COD$FOLL_SEQ") || !line.contains("primaryid$caseid$caseversion"))
var demoRddFilterMap = demoRddFilter.map(line => line.split('$')(0) + "~" + line.s
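A common way to surface the offending record is to filter on the expected field count before mapping; a sketch reusing the 13-field check from the reply above:

// Inspect malformed records instead of letting the map blow up
val bad = demoRddFilter.filter(line => line.split('$').length < 13)
bad.take(10).foreach(println)

// Then map only the well-formed lines
val good = demoRddFilter
  .filter(line => line.split('$').length >= 13)
  .map { line =>
    val t = line.split('$')
    t(0) + "~" + t(5) + "~" + t(11) + "~" + t(12)
  }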

Re: Spark or MR, Scala or Java?

2014-11-23 Thread Sanjay Subramanian
Thanks a ton Ashish. sanjay From: Ashish Rangole To: Sanjay Subramanian Cc: Krishna Sankar ; Sean Owen ; Guillermo Ortiz ; user Sent: Sunday, November 23, 2014 11:03 AM Subject: Re: Spark or MR, Scala or Java? This being a very broad topic, a discussion can quickly get subjective

Re: Spark or MR, Scala or Java?

2014-11-23 Thread Sanjay Subramanian
I am a newbie as well to Spark. Been Hadoop/Hive/Oozie programming extensively before this. I use Hadoop (Java MR code)/Hive/Impala/Presto on a daily basis. To get me jumpstarted into Spark I started this GitHub where there is "IntelliJ-ready-to-run" code (simple examples of join, sparksql etc) and

Re: Extracting values from a Collection

2014-11-22 Thread Sanjay Subramanian
quot;,")(0), x.split(",")(1))).reduceByKey((v1,v2) => v1+"|"+v2) file1Rdd.collect().foreach(println) file2Rdd.collect().foreach(println) file1Rdd.join(file2Rdd).collect().foreach( e => println(e.toString.replace("(","").replace(")","

Re: Extracting values from a Collection

2014-11-22 Thread Sanjay Subramanian
Thanks Jey regards sanjay From: Jey Kottalam To: Sanjay Subramanian Cc: Arun Ahuja ; Andrew Ash ; user Sent: Friday, November 21, 2014 10:07 PM Subject: Extracting values from a Collection Hi Sanjay, These are instances of the standard Scala collection type "Set"

Re: Extracting values from a Collection

2014-11-21 Thread Sanjay Subramanian
(4,(ringo,Set(With a Little Help From My Friends, Octopus's Garden)))
(2,(john,Set(Julia, Nowhere Man)))
(3,(george,Set(While My Guitar Gently Weeps, Norwegian Wood)))
(1,(paul,Set(Yesterday, Michelle)))
Again the question is how do I extract values from the Set? thanks sanjay From

Extracting values from a Collection

2014-11-21 Thread Sanjay Subramanian
hey guys
names.txt
=========
1,paul
2,john
3,george
4,ringo
songs.txt
=========
1,Yesterday
2,Julia
3,While My Guitar Gently Weeps
4,With a Little Help From My Friends
1,Michelle
2,Nowhere Man
3,Norwegian Wood
4,Octopus's Garden
What I want to do is real simple
Desired Output
==============
(4,(With a Litt
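One way to get there — join names against the grouped songs, then flatten the collection yourself with mkString; a sketch with my own variable names (assuming a shell-style sc):

val names = sc.textFile("names.txt").map { l => val t = l.split(','); (t(0), t(1)) }
val songs = sc.textFile("songs.txt").map { l => val t = l.split(','); (t(0), t(1)) }

// groupByKey yields (id, Iterable[song]); the join attaches the artist name
val byArtist = names.join(songs.groupByKey())  // (id, (name, songs))

byArtist
  .map { case (id, (name, ss)) => (id, name, ss.mkString(", ")) }  // extract values from the collection
  .collect()
  .foreach(println)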

Re: Code works in Spark-Shell but Fails inside IntelliJ

2014-11-20 Thread Sanjay Subramanian
s" to quickly test , experiment and debug code. From: Jay Vyas To: Sanjay Subramanian Cc: "user@spark.apache.org" Sent: Thursday, November 20, 2014 4:53 PM Subject: Re: Code works in Spark-Shell but Fails inside IntelliJ This seems pretty standard: your IntelliJ classp

Re: Code works in Spark-Shell but Fails inside IntelliJ

2014-11-20 Thread Sanjay Subramanian
Subramanian Cc: "user@spark.apache.org" Sent: Thursday, November 20, 2014 4:49 PM Subject: Re: Code works in Spark-Shell but Fails inside IntelliJ Looks like IntelliJ might be trying to load the wrong version of spark? On Thu, Nov 20, 2014 at 4:35 PM, Sanjay Subramanian wrote:

Code works in Spark-Shell but Fails inside IntelliJ

2014-11-20 Thread Sanjay Subramanian
hey guys I am at AmpCamp 2014 at UCB right now :-) Funny issue... This code works in Spark-Shell but throws a funny exception in IntelliJ.
CODE
====
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
sqlContext.setConf("spark.sql.parquet.binaryAsString", "true")
val wikiData = sqlContext.parq
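The code cuts off at sqlContext.parq…; a hedged completion based on the AmpCamp-era Spark SQL API (the Parquet path is a placeholder). As the replies above note, the actual failure was IntelliJ loading a mismatched Spark version, not this code:

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
sqlContext.setConf("spark.sql.parquet.binaryAsString", "true")
// Spark 1.1-style Parquet loading; path is illustrative
val wikiData = sqlContext.parquetFile("data/wiki_parquet")
wikiData.registerTempTable("wikiData")
sqlContext.sql("SELECT COUNT(*) FROM wikiData").collect().foreach(println)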

Can't start spark-shell in CDH Spark Standalone 1.1.0+cdh5.2.0+56

2014-10-27 Thread Sanjay Subramanian
hey guys Anyone using CDH Spark Standalone? I installed Spark standalone thru Cloudera Manager.
$ spark-shell --total-executor-cores 8
/opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/bin/../lib/spark/bin/spark-shell: line 44: /opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/spark/bin/utils.sh:

Re: Spark inside Eclipse

2014-10-03 Thread Sanjay Subramanian
adClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
... 12 more
I am gonna keep working to solve this. Meanwhile if you can provide some guidance that would be cool. sanjay From: Daniel Siegmann To: Ashish Jain Cc: Sanjay Subramanian ; "user@spark.apache.o

Re: Spark inside Eclipse

2014-10-03 Thread Sanjay Subramanian
cool thanks will set this up and report back how things went regards sanjay From: Daniel Siegmann To: Ashish Jain Cc: Sanjay Subramanian ; "user@spark.apache.org" Sent: Thursday, October 2, 2014 6:52 AM Subject: Re: Spark inside Eclipse You don't need to do anything

Re: Multiple spark shell sessions

2014-10-01 Thread Sanjay Subramanian
: Error was: Failure(java.net.BindException: Address already in use)
14/10/01 17:34:38 INFO SparkUI: Started SparkUI at http://hadoop02:4041
sanjay From: Matei Zaharia To: Sanjay Subramanian Cc: "user@spark.apache.org" Sent: Wednesday, October 1, 2014 5:19 PM Subject: Re: Mult

Spark inside Eclipse

2014-10-01 Thread Sanjay Subramanian
hey guys Is there a way to run Spark in local mode from within Eclipse? I am running Eclipse Kepler on a MacBook Pro with Mavericks. Like one can run hadoop map/reduce applications from within Eclipse and debug and learn. thanks sanjay
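As Daniel's reply above says, nothing special is needed: build the context with a local master and run it as a plain JVM application, breakpoints and all. A minimal sketch:

import org.apache.spark.{SparkConf, SparkContext}

object LocalSparkApp {
  def main(args: Array[String]): Unit = {
    // local[*] uses all cores of the laptop; debug it like any other JVM program
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("eclipse-local"))
    println(sc.parallelize(1 to 100).count())
    sc.stop()
  }
}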

Multiple spark shell sessions

2014-10-01 Thread Sanjay Subramanian
hey guys I am using spark 1.0.0+cdh5.1.0+41. When two users try to run "spark-shell", the first guy's spark-shell shows active in the 18080 Web UI but the second user shows WAITING, and the shell has a bunch of errors but does get to the spark-shell prompt, and "sc.master" seems to point to the correct master
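A hedged note on what is usually behind this: in standalone mode the first application grabs every core, so the second sits in WAITING until cores free up (its UI also moves to port 4041, as the reply above shows). Capping each shell's share lets both run; the value below is illustrative:

// Either pass a cap on the command line, as seen elsewhere in this archive:
//   spark-shell --total-executor-cores 4
// or cap it in code when building a context yourself:
val conf = new org.apache.spark.SparkConf()
  .setAppName("second-shell")
  .set("spark.cores.max", "4")  // leave the remaining cores for other applications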