[Spark 1.5.2]: Iterate through Dataframe columns and put it in map

2016-03-02 Thread Divya Gehlot
Hi, I need to iterate through a DataFrame based on a certain condition and put the values into a map. Dataset: Column1 Column2; Car Model1; Bike Model2; Car Model2; Bike Model2. I want to iterate through the above DataFrame and build a map where Car is the key and Model1 and Model2
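A minimal sketch of one way to build such a map, assuming a DataFrame df with two string columns named Column1 and Column2 (hypothetical names); it collects to the driver and groups the second column by the first, which is fine for small data sets:

    // collect the two columns and group the second by the first
    val grouped: Map[String, Seq[String]] =
      df.select("Column1", "Column2")
        .collect()
        .map(r => (r.getString(0), r.getString(1)))
        .groupBy(_._1)
        .mapValues(_.map(_._2).toSeq)
        .toMap
    // grouped("Car") would then contain Seq("Model1", "Model2")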

[Issue:] Getting null values for numeric types while accessing Hive tables (registered on HBase, created through Phoenix)

2016-03-03 Thread Divya Gehlot
Hi, I am registering a Hive table on HBase: CREATE EXTERNAL TABLE IF NOT EXISTS TEST(NAME STRING,AGE INT) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,0:AGE") TBLPROPERTIES ("hbase.table.name" = "TEST", "hbase.mapre

Spark 1.5.2 - Read custom schema from file

2016-03-03 Thread Divya Gehlot
Hi, I have defined a custom schema as shown below : val customSchema = StructType( > StructField("year", IntegerType, true), > StructField("make", StringType, true), > StructField("model", StringType, true), > StructField("comment", StringType, true), StructField("blank", Stri
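For reference, a minimal sketch of a complete custom schema in Spark 1.5 (the field names mirror the excerpt); StructType takes a sequence or array of StructField, so the fields need to be wrapped:

    import org.apache.spark.sql.types._

    val customSchema = StructType(Array(
      StructField("year", IntegerType, nullable = true),
      StructField("make", StringType, nullable = true),
      StructField("model", StringType, nullable = true),
      StructField("comment", StringType, nullable = true),
      StructField("blank", StringType, nullable = true)))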

Spark 1.5.2: change datatype in programmatically generated schema

2016-03-03 Thread Divya Gehlot
Hi, I am generating a schema programmatically as shown below: val schemaFile = sc.textFile("/TestDivya/Spark/cars.csv") val schemaString = schemaFile.first() val schema = StructType(Array( schemaString.split(" ").map(fieldName => StructField(fieldName, StringType(), true I want to
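A minimal sketch of programmatic schema generation where the data type is chosen per field name (the type rule here is hypothetical); note that StringType is an object, so it is used without parentheses:

    import org.apache.spark.sql.types._

    val schemaString = sc.textFile("/TestDivya/Spark/cars.csv").first()
    val schema = StructType(schemaString.split(" ").map { fieldName =>
      // pick the type per field instead of defaulting everything to StringType
      val dataType = if (fieldName == "year") IntegerType else StringType
      StructField(fieldName, dataType, nullable = true)
    })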

Spark 1.5.2 -Better way to create custom schema

2016-03-04 Thread Divya Gehlot
Hi, I have a data set in HDFS. Is there any better way to define a custom schema for a data set having 100+ fields of different data types? Thanks, Divya
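One approach for wide data sets (a sketch under the assumption that the field names and types can be kept in a small "name:type" spec, e.g. in its own file) is to generate the StructType rather than write 100+ StructFields by hand:

    import org.apache.spark.sql.types._

    // hypothetical spec string; in practice this could be read from a file
    val spec = "year:int,make:string,model:string,price:double"

    def toType(t: String): DataType = t match {
      case "int"    => IntegerType
      case "double" => DoubleType
      case "date"   => DateType
      case _        => StringType
    }

    val schema = StructType(spec.split(",").map { field =>
      val Array(name, tpe) = field.split(":")
      StructField(name, toType(tpe), nullable = true)
    })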

Steps to Run Spark Scala job from Oozie on EC2 Hadoop cluster

2016-03-07 Thread Divya Gehlot
Hi, Could somebody help me by providing the steps /redirect me to blog/documentation on how to run Spark job written in scala through Oozie. Would really appreciate the help. Thanks, Divya

[Error]Run Spark job as hdfs user from oozie workflow

2016-03-09 Thread Divya Gehlot
Hi, I have a non-secure Hadoop 2.7.2 cluster on EC2 with Spark 1.5.2. I am submitting my Spark Scala script through a shell script using an Oozie workflow. I submit the job as the hdfs user, but it runs as user "yarn", so all the output gets stored under the user/yarn directory only. When I

Spark with Yarn Client

2016-03-11 Thread Divya Gehlot
Hi, I am trying to understand the behaviour/configuration of Spark with the YARN client on a Hadoop cluster. Can somebody help me or point me to documents/blogs/books that give a deeper understanding of the two? Thanks, Divya

append rows to dataframe

2016-03-13 Thread Divya Gehlot
Hi, please bear with me for asking such a naive question. I have a list of conditions (dynamic SQLs) sitting in an HBase table. I need to iterate through those dynamic SQLs and add the data to DataFrames. As we know, DataFrames are immutable; when I try to iterate in a for loop as shown below I get only the last d
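A minimal sketch of one way to accumulate the results, assuming a sequence of SQL strings and a known result schema (loadConditionsFromHBase and expectedSchema are hypothetical); unionAll returns a new DataFrame, so the accumulator has to be reassigned on every iteration:

    import org.apache.spark.sql.Row

    val sqls: Seq[String] = loadConditionsFromHBase()   // hypothetical helper
    var accumulated = sqlContext.createDataFrame(sc.emptyRDD[Row], expectedSchema)

    for (q <- sqls) {
      // reassign, otherwise only the last iteration's result survives
      accumulated = accumulated.unionAll(sqlContext.sql(q))
    }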

[How To :]Custom Logging of Spark Scala scripts

2016-03-14 Thread Divya Gehlot
Hi, can somebody point out how I can configure custom logs for my Spark (Scala) scripts, so that I can see at which level my script failed and why? Thanks, Divya
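A minimal sketch of script-level logging with Log4j, which Spark 1.5 already ships with; the logger name and messages are arbitrary, and where the output goes is whatever log4j.properties points it at:

    import org.apache.log4j.Logger

    object MyJob {
      @transient lazy val log = Logger.getLogger("MyJob")   // configured via log4j.properties

      def main(args: Array[String]): Unit = {
        log.info("starting job")
        try {
          // ... build DataFrames, run transformations ...
        } catch {
          case e: Exception =>
            log.error("job failed while processing the input", e)
            throw e
        }
      }
    }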

convert row to map of key as int and values as arrays

2016-03-15 Thread Divya Gehlot
Hi, as I can't add columns from another DataFrame, I am planning to convert my row columns to a map of keys and arrays. As I am new to Scala and Spark, I am trying the approach below: // create an empty map import scala.collection.mutable.{ArrayBuffer => mArrayBuffer} var map = Map[Int,mArrayBuffer[Any]]() def addNode(

[Error] : dynamically union All + adding new column

2016-03-19 Thread Divya Gehlot
Hi, I am dynamically doing union all and adding new column too val dfresult = > dfAcStamp.select("Col1","Col1","Col3","Col4","Col5","Col6","col7","col8","col9") > val schemaL = dfresult.schema > var dffiltered = sqlContext.createDataFrame(sc.emptyRDD[Row], schemaL) > for ((key,values) <- lcrMap) {

[Spark-1.5.2]Column renaming with withColumnRenamed has no effect

2016-03-19 Thread Divya Gehlot
Hi, I am adding a new column and renaming it at the same time, but the renaming doesn't have any effect. dffiltered = > dffiltered.unionAll(dfresult.withColumn("Col1",lit("value1").withColumn("Col2",lit("value2")).cast("int")).withColumn("Col3",lit("values3")).withColumnRenamed("Col1","Col1Rename").drop
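For reference, a minimal sketch of the intended chaining with balanced parentheses, so that each withColumn/withColumnRenamed is called on the DataFrame rather than nested inside a Column expression (column names mirror the excerpt):

    import org.apache.spark.sql.functions.lit

    val renamed = dfresult
      .withColumn("Col1", lit("value1"))
      .withColumn("Col2", lit("value2").cast("int"))
      .withColumn("Col3", lit("values3"))
      .withColumnRenamed("Col1", "Col1Rename")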

Get the number of days dynamically in withColumn

2016-03-20 Thread Divya Gehlot
I have a time-stamping table which has data like: No of Days, ID: 1 1D, 2 2D, and so on till 30 days. I have another DataFrame with a start date and an end date. I need to get the difference between these two dates, get the ID from the time-stamping table, and do withColumn. The

declare constant as date

2016-03-20 Thread Divya Gehlot
Hi, in Spark 1.5.2, do we have any utility which converts a constant value as shown below, or can we declare a date variable like val start_date :Date = "2015-03-02" or val start_date = "2015-03-02" toDate, like how we convert with toInt or toString? I searched for it but couldn't find it. Thanks, Divya

Re: declare constant as date

2016-03-21 Thread Divya Gehlot
Oh my, I am so silly: I can declare it as a string and cast it to a date. My apologies for spamming the mailing list. Thanks, Divya On 21 March 2016 at 14:51, Divya Gehlot wrote: > Hi, > In Spark 1.5.2 > Do we have any utility which converts a constant value as shown below > or can
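A minimal sketch of that approach, both as a column expression and as a plain Scala value:

    import org.apache.spark.sql.functions.lit

    // as a DataFrame column: a string literal cast to a date
    val startDateCol = lit("2015-03-02").cast("date")

    // as a plain Scala value
    val startDate = java.sql.Date.valueOf("2015-03-02")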

find the matching and get the value

2016-03-22 Thread Divya Gehlot
Hi, I am using Spark 1.5.2. My requirement is as below: df.withColumn("NoOfDays",lit(datediff(df("Start_date"),df("end_date" Now I have to add one more column where my datediff(Start_date,end_date) should match the map keys. The map looks like MyMap(1->1D, 2->2D, 3->3M, 4->4W). I want to do s

[Spark-1.5.2] Dynamic creation of caseWhen expression

2016-03-23 Thread Divya Gehlot
Hi, I have a map collection. I am trying to build a when condition based on the key values, like df.withColumn("ID", when( condition with map keys ,values of map ). How can I do that dynamically? Currently I am iterating over keysIterator and getting the values: val keys = myMap.keysIterator.toArray Lik
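A minimal sketch of building the chained expression by folding over the map, so no key has to be written out by hand (the column name and map contents are assumptions):

    import org.apache.spark.sql.Column
    import org.apache.spark.sql.functions.{col, lit, when}

    val myMap = Map(1 -> "1D", 2 -> "2D", 3 -> "3M", 4 -> "4W")

    // start from a null literal and wrap one when(...) around it per entry
    val idExpr: Column = myMap.foldLeft(lit(null).cast("string")) {
      case (acc, (k, v)) => when(col("NoOfDays") === k, v).otherwise(acc)
    }

    val withId = df.withColumn("ID", idExpr)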

Re: [SQL] Two columns in output vs one when joining DataFrames?

2016-03-28 Thread Divya Gehlot
Hi Jacek, the difference is mentioned in the Spark doc itself: Note that if you perform a self-join using this function without aliasing the input [[DataFrame]]s, you will NOT be able to reference any columns after the join, since there is no way to disambiguate which side of the join you w

Change TimeZone Setting in Spark 1.5.2

2016-03-28 Thread Divya Gehlot
Hi, the Spark setup is on a Hadoop cluster. How can I set the Spark timezone to sync with the server timezone? Any ideas? Thanks, Divya
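One commonly used approach (an assumption about the setup, not the only option) is to pin the JVM timezone of both the driver and the executors through the extra Java options, e.g. in spark-defaults.conf or via --conf on spark-submit; the timezone ID below is a placeholder:

    spark.driver.extraJavaOptions    -Duser.timezone=Asia/Singapore
    spark.executor.extraJavaOptions  -Duser.timezone=Asia/Singapore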

[Spark-1.5.2] Spark Memory Issue while Saving to HDFS and Phoenix both

2016-04-01 Thread Divya Gehlot
Hi, I have a Hortonworks Hadoop 3-node cluster on EC2 with Hadoop version 2.7.x, Spark version 1.5.2, Phoenix version 4.4, HBase version 1.1.x. Cluster statistics: Data Node 1 OS: redhat7 (x86_64), Cores (CPU): 2 (2), Disk: 20.69GB/99.99GB (20.69% used), Memory: 7.39GB, Data

[Spark-1.5.2] Spark Memory Issue while Saving to HDFS and Phoenix both

2016-04-01 Thread Divya Gehlot
Forgot to mention that I am using the DataFrame API for all the operations instead of SQL. -- Forwarded message -- From: Divya Gehlot Date: 1 April 2016 at 18:35 Subject: [Spark-1.5.2] Spark Memory Issue while Saving to HDFS and Phoenix both To: "user @spark"

[HELP:]Save Spark Dataframe in Phoenix Table

2016-04-07 Thread Divya Gehlot
Hi, I have a Hortonworks Hadoop cluster having the configuration below: Spark 1.5.2, HBase 1.1.x, Phoenix 4.4. I am able to connect to Phoenix through a JDBC connection and read the Phoenix tables, but while writing the data back to a Phoenix table I am getting the error below: org.apache.spark.sql.
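For reference, a minimal sketch of the phoenix-spark DataFrame write path (the table name and ZooKeeper quorum are placeholders, and the phoenix-spark client jar is assumed to be on the classpath); the connector expects SaveMode.Overwrite:

    import org.apache.spark.sql.SaveMode

    df.write
      .format("org.apache.phoenix.spark")
      .mode(SaveMode.Overwrite)
      .options(Map("table" -> "TEST", "zkUrl" -> "zookeeper-host:2181"))
      .save()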

[Error:] When writing To Phoenix 4.4

2016-04-11 Thread Divya Gehlot
Hi, I am getting an error when I try to write data to Phoenix. Software configuration: Spark 1.5.2, Phoenix 4.4, HBase 1.1. Spark Scala script: val dfLCR = readTable(sqlContext, "", "TEST") val schemaL = dfLCR.schema val lcrReportPath = "/TestDivya/Spark/Results/TestData/" val dfReadReport= sqlCon

[ASK]: Dataframe number of columns limit in Spark 1.5.2

2016-04-12 Thread Divya Gehlot
Hi, I would like to know whether the Spark Dataframe API has a limit on the number of columns that can be created? Thanks, Divya

Memory needs when using expensive operations like groupBy

2016-04-13 Thread Divya Gehlot
Hi, I am using Spark 1.5.2 with Scala 2.10. My Spark jobs run fine except for one job, which uses unionAll and a groupBy operation on multiple columns and keeps failing with exit code 143. Please advise me on the options to optimize it. The one option which I am using now is --conf spark.executor.extraJavaOp

[Help]:Strange Issue :Debug Spark Dataframe code

2016-04-15 Thread Divya Gehlot
Hi, I am using Spark 1.5.2 with Scala 2.10. Is there any option apart from "explain(true)" to debug Spark Dataframe code? I am facing a strange issue. I have a lookup DataFrame and am using it to join another DataFrame on different columns. I am getting an Analysis Exception on the third join. When

Fwd: [Help]:Strange Issue :Debug Spark Dataframe code

2016-04-17 Thread Divya Gehlot
Reposting again as unable to find the root cause where things are going wrong. Experts please help . -- Forwarded message -- From: Divya Gehlot Date: 15 April 2016 at 19:13 Subject: [Help]:Strange Issue :Debug Spark Dataframe code To: "user @spark" Hi, I am using S

[Spark 1.5.2] Log4j Configuration for executors

2016-04-18 Thread Divya Gehlot
Hi, I tried configuring logging to write to a file for the Spark driver and executors. I have two separate log4j properties files for the Spark driver and executor respectively. It writes the log for the Spark driver, but for the executor logs I am getting the error below: java.io.FileNotFoundException: /home/hdfs/spa
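A commonly used pattern for this (an assumption about the setup, not a verified fix for the path above) is to ship the executor properties file with --files so it lands in each executor's working directory, and then reference it by bare file name; the paths, class and jar below are placeholders:

    spark-submit \
      --files /local/path/log4j-executor.properties \
      --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j-executor.properties" \
      --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:/local/path/log4j-driver.properties" \
      --class com.example.MyJob my-job.jar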

[Ask :]Best Practices - Application logging in Spark 1.5.2 + Scala 2.10

2016-04-21 Thread Divya Gehlot
Hi, I am using Spark on a Hadoop 2.7 cluster. I need to write all my print statements and/or any errors to a file, for instance some info if it passed some level, or some error if something is missing in my Spark Scala script. Can somebody help me or redirect me to a tutorial, blog, or book? What's the best way to a

Re: Spark DataFrame sum of multiple columns

2016-04-21 Thread Divya Gehlot
An easy way of doing it: newdf = df.withColumn('total', sum(df[col] for col in df.columns)) On 22 April 2016 at 11:51, Naveen Kumar Pokala wrote: > Hi, > Do we have any way to perform row-level operations in Spark dataframes? > For example, I have a dataframe with columns f
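For completeness, a minimal Scala sketch of the same idea, summing every column into a new total column (all columns are assumed numeric):

    import org.apache.spark.sql.functions.col

    // build a single Column expression that adds all existing columns
    val total = df.columns.map(col).reduce(_ + _)
    val newDf = df.withColumn("total", total)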

[Spark 1.5.2]All data being written to only one part file rest part files are empty

2016-04-24 Thread Divya Gehlot
Hi, after joining two DataFrames I am saving the result using Spark CSV. But all the result data is being written to only one part file, whereas 200 part files are being created and the remaining 199 part files are empty. What is the cause of the uneven partitioning? How can I distribute the data evenly? Would
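A minimal sketch of one workaround, assuming the skew comes from the 200 default shuffle partitions after the join: repartition to a sensible count before writing (the count and output path are placeholders):

    // spread the joined rows across partitions before saving
    val balanced = joinedDf.repartition(20)

    balanced.write
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .save("/output/path")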

Can't join same dataframe twice?

2016-04-25 Thread Divya Gehlot
Hi, I am using Spark 1.5.2. I have a use case where I need to join the same DataFrame twice on two different columns, and I am getting a missing-columns error. For instance: val df1 = df2.join(df3,"Column1") The line below throws the missing columns error: val df4 = df1.join(df3,"Column2") Is this a bug or valid sc
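A minimal sketch of the usual workaround, giving each use of the reused DataFrame its own alias so the join columns can be qualified (the aliases are illustrative):

    import org.apache.spark.sql.functions.col

    val first  = df2.join(df3.as("d3a"), df2("Column1") === col("d3a.Column1"))
    val second = first.join(df3.as("d3b"), df2("Column2") === col("d3b.Column2"))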

Re: Can't join same dataframe twice?

2016-04-26 Thread Divya Gehlot
ar. > > Thought? > > // maropu > > > > > > > > On Wed, Apr 27, 2016 at 6:09 AM, Prasad Ravilla > wrote: > >> Also, check the column names of df1 ( after joining df2 and df3 ). >> >> Prasad. >> >> From: Ted Yu >> Date: Monday,

Re: removing header from csv file

2016-04-26 Thread Divya Gehlot
Yes, you can remove the header by removing the first row; you can use first() or head() to do that. Thanks, Divya On 27 April 2016 at 13:24, Ashutosh Kumar wrote: > I see there is a library spark-csv which can be used for removing header > and processing of csv files. But it seems it works with sqlcont
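A minimal sketch of that approach on a plain RDD (the path is a placeholder):

    val lines = sc.textFile("/path/to/file.csv")
    val header = lines.first()
    // keep every line except the header row
    val data = lines.filter(line => line != header)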

getting ClassCastException when calling UDF

2016-04-27 Thread Divya Gehlot
Hi, I am using Spark 1.5.2 and have defined the UDF below: import org.apache.spark.sql.functions.udf > val myUdf = (wgts : Int , amnt :Float) => { > (wgts*amnt)/100.asInstanceOf[Float] > } > val df2 = df1.withColumn("WEIGHTED_AMOUNT",callUDF(udfcalWghts, FloatType,col("RATE"),col("AMOUNT"))) In my sc
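For reference, a minimal sketch of the same computation registered through the udf() helper, which infers the return type; note that in the excerpt asInstanceOf binds only to the literal 100, so dividing by 100f is the safer spelling. Whether RATE and AMOUNT really are Int and Float columns is an assumption; the UDF parameter types must match the actual column types, otherwise a ClassCastException is exactly what shows up:

    import org.apache.spark.sql.functions.{col, udf}

    // divide by a Float literal so the whole expression stays a Float
    val weighted = udf((wgts: Int, amnt: Float) => (wgts * amnt) / 100f)

    val df2 = df1.withColumn("WEIGHTED_AMOUNT", weighted(col("RATE"), col("AMOUNT")))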

Re: Can't join same dataframe twice?

2016-04-27 Thread Divya Gehlot
df2 = Seq((1, 1), (2, 2), (3, 3)).toDF("a", "b") >> val df3 = df1.join(df2, "a").select($"a", df1("b").as("1-b"), >> df2("b").as("2-b")) >> val df4 = df3.join(df2, df3("2-b") === df2("b&quo

Re: [Spark 1.5.2]All data being written to only one part file rest part files are empty

2016-04-28 Thread Divya Gehlot
-04-25 9:34 GMT+07:00 Divya Gehlot : > >> Hi, >> >> After joining two dataframes, saving dataframe using Spark CSV. >> But all the result data is being written to only one part file whereas >> there are 200 part files being created, rest 199 part files are empty. >

[Spark 1.5.2] Spark dataframes vs sql query -performance parameter ?

2016-05-03 Thread Divya Gehlot
Hi, I am interested to know on which parameters we can say Spark DataFrames are better than SQL queries. I would be grateful if somebody could explain this to me with use cases. Thanks, Divya

Re: spark 1.6.1 build failure of : scala-maven-plugin

2016-05-03 Thread Divya Gehlot
Hi, even I am getting a similar error, Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.2:compile, when I try to build the Phoenix project using Maven. Maven version: 3.3, Java version: 1.7_67, Phoenix: downloaded latest master from GitHub. If anybody finds the resolution, please

Re: spark 1.6.1 build failure of : scala-maven-plugin

2016-05-04 Thread Divya Gehlot
Divya On 4 May 2016 at 21:31, sunday2000 <2314476...@qq.com> wrote: > Check your javac version, and update it. > > > -- Original Message ------ > *From:* "Divya Gehlot";; > *Sent:* Wednesday, 4 May 2016, 11:25 AM > *To:* "sunday2000"<2314476

package for data quality in Spark 1.5.2

2016-05-05 Thread Divya Gehlot
Hi, is there any package or project in Spark/Scala which supports data quality checks, for instance checking null values or foreign key constraints? I would really appreciate it if somebody who has already done this is happy to share, or knows of any open source package. Thanks, Divya

Fwd: package for data quality in Spark 1.5.2

2016-05-05 Thread Divya Gehlot
http://blog.cloudera.com/blog/2015/07/how-to-do-data-quality-checks-using-apache-spark-dataframes/ I am looking for something similar to above solution . -- Forwarded message -- From: "Divya Gehlot" Date: May 5, 2016 6:51 PM Subject: package for data quality in Spar

[Spark 1.5.2 ]-how to set and get Storage level for Dataframe

2016-05-05 Thread Divya Gehlot
Hi, How can I get and set storage level for Dataframes like RDDs , as mentioned in following book links https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-rdd-caching.html Thanks, Divya

Re: [Spark 1.5.2 ]-how to set and get Storage level for Dataframe

2016-05-05 Thread Divya Gehlot
, Ted Yu wrote: > I am afraid there is no such API. > > When persisting, you can specify StorageLevel : > > def persist(newLevel: StorageLevel): this.type = { > > Can you tell us your use case ? > > Thanks > > On Thu, May 5, 2016 at 8:06 PM, Divya Gehlot >
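A minimal sketch of the workaround implied by the reply: choose the level through persist, and, for a registered temp table, at least check whether it is cached (the table name is a placeholder):

    import org.apache.spark.storage.StorageLevel

    df.persist(StorageLevel.MEMORY_AND_DISK)   // set the level explicitly

    df.registerTempTable("my_df")
    val cached = sqlContext.isCached("my_df")  // true once the table is cached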

Found Data Quality check package for Spark

2016-05-06 Thread Divya Gehlot
Hi, I just stumbled upon a data quality check package for Spark: https://github.com/FRosner/drunken-data-quality Has anybody used it? I would really appreciate the feedback. Thanks, Divya

best fit - Dataframe and spark sql use cases

2016-05-09 Thread Divya Gehlot
Hi, I would like to know the use cases where DataFrames are the best fit and the use cases where Spark SQL is the best fit, based on one's experience. Thanks, Divya

[Spark 1.5.2]Check Foreign Key constraint

2016-05-11 Thread Divya Gehlot
Hi, I am using Spark 1.5.2 with Apache Phoenix 4.4. As Spark 1.5.2 doesn't support subqueries in where conditions (https://issues.apache.org/jira/browse/SPARK-4226), is there any alternative way to check foreign key constraints? Would really appreciate the help. Thanks, Divya
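A minimal sketch of one subquery-free way to find violating rows, assuming a child DataFrame with a foreign-key column fk and a parent DataFrame with a primary-key column pk (all names hypothetical): left-outer-join and keep the rows with no match:

    // rows in child whose fk value has no matching pk in parent
    val orphans = child
      .join(parent, child("fk") === parent("pk"), "left_outer")
      .filter(parent("pk").isNull)
      .select(child.columns.map(child(_)): _*)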

Re: Error joining dataframes

2016-05-18 Thread Divya Gehlot
Can you try var df_join = df1.join(df2,df1( "Id") ===df2("Id"), "fullouter").drop(df1("Id")) On May 18, 2016 2:16 PM, "ram kumar" wrote: I tried scala> var df_join = df1.join(df2, "Id", "fullouter") :27: error: type mismatch; found : String("Id") required: org.apache.spark.sql.Column

find two consective points

2016-07-14 Thread Divya Gehlot
Hi, I have huge data set like similar below : timestamp,fieldid,point_id 1468564189,89,1 1468564090,76,4 1468304090,89,9 1468304090,54,6 1468304090,54,4 Have configuration file of consecutive points -- 1,9 4,6 like 1 and 9 are consecutive points similarly 4,6 are consecutive points Now I need

Dynamically get value based on Map key in Spark Dataframe

2016-07-18 Thread Divya Gehlot
Hi, I have created a map by reading a text file: val keyValueMap = file_read.map(t => t.getString(0) -> t.getString(4)).collect().toMap Now I have another DataFrame where I need to dynamically replace all the keys of the map with their values: val df_input = reading the file as dataframe val df_replacekeys =

Re: Dynamically get value based on Map key in Spark Dataframe

2016-07-18 Thread Divya Gehlot
hanks, Divya On 18 July 2016 at 23:06, Jacek Laskowski wrote: > See broadcast variable. > > Or (just a thought) do join between DataFrames. > > Jacek > > On 18 Jul 2016 9:24 a.m., "Divya Gehlot" wrote: > >> Hi, >> >> I have created a map by r
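A minimal sketch of the broadcast-variable approach from the reply (the column name is hypothetical): broadcast the small map and look values up inside a UDF:

    import org.apache.spark.sql.functions.{col, udf}

    val broadcastMap = sc.broadcast(keyValueMap)

    // return the mapped value, or the original key when it is not in the map
    val replaceKey = udf((key: String) => broadcastMap.value.getOrElse(key, key))

    val df_replacekeys = df_input.withColumn("key_col", replaceKey(col("key_col")))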

write and call UDF in spark dataframe

2016-07-20 Thread Divya Gehlot
Hi, could somebody share an example of writing and calling a UDF which converts a Unix timestamp to a date-time? Thanks, Divya
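A minimal sketch of both options, the built-in from_unixtime function and an equivalent hand-written UDF (the column name and format are assumptions):

    import java.text.SimpleDateFormat
    import java.util.Date
    import org.apache.spark.sql.functions.{col, from_unixtime, udf}

    // built-in: epoch seconds -> formatted string
    val withBuiltin = df.withColumn("date_time",
      from_unixtime(col("unix_ts"), "yyyy-MM-dd HH:mm:ss"))

    // hand-written UDF doing the same conversion
    val toDateTime = udf((ts: Long) =>
      new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(new Date(ts * 1000L)))
    val withUdf = df.withColumn("date_time", toDateTime(col("unix_ts")))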

difference between two consecutive rows of same column + spark + dataframe

2016-07-20 Thread Divya Gehlot
Hi, I have a dataset of times as shown below: Time1 07:30:23 07:34:34 07:38:23 07:39:12 07:45:20 I need to find the difference between two consecutive rows. I googled and found that the lag function in Spark helps in finding it, but it is not giving me null in the result set. Would really appreciate th

Re: write and call UDF in spark dataframe

2016-07-20 Thread Divya Gehlot
f you can minimize your use of > UDFs. Spark 2.0/ tungsten does a lot of code generation. It will have to > treat your UDF as a block box. > > Andy > > From: Rishabh Bhardwaj > Date: Wednesday, July 20, 2016 at 4:22 AM > To: Rabin Banerjee > Cc: Divya Gehlot , "user @sp

getting null when calculating time diff with unix_timestamp + spark 1.6

2016-07-20 Thread Divya Gehlot
Hi, val lags=sqlContext.sql("select *,(unix_timestamp(time1,'$timeFmt') - lag(unix_timestamp(time2,'$timeFmt'))) as time_diff from df_table"); Instead of the time difference in seconds I am getting null. Would really appreciate the help. Thanks, Divya
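Two things stand out in the excerpt, both hedged guesses: the query string has no s prefix, so '$timeFmt' is passed to unix_timestamp literally (which yields null), and lag needs an OVER clause. A minimal corrected sketch, assuming a HiveContext and that time1 gives the row ordering:

    val timeFmt = "HH:mm:ss"   // assumed format
    val lags = sqlContext.sql(
      s"""select *,
         |  unix_timestamp(time1, '$timeFmt') -
         |  lag(unix_timestamp(time2, '$timeFmt')) over (order by time1) as time_diff
         |from df_table""".stripMargin)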

calculate time difference between consecutive rows

2016-07-20 Thread Divya Gehlot
I have a dataset of times as shown below: Time1 07:30:23 07:34:34 07:38:23 07:39:12 07:45:20 I need to find the difference between two consecutive rows. I googled and found that the lag function in Spark helps in finding it, but it is giving me null in the result set. Would really appreciate the help.
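A minimal sketch of the DataFrame-API version with an explicit ordering, which is what lag needs (window functions in Spark 1.5/1.6 require a HiveContext); only the first row's difference is null, by design:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, lag, unix_timestamp}

    val w = Window.orderBy("Time1")
    val secs = unix_timestamp(col("Time1"), "HH:mm:ss")

    // difference in seconds between each row and the previous one
    val withDiff = df.withColumn("diff_seconds", secs - lag(secs, 1).over(w))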

add hours to from_unixtime

2016-07-21 Thread Divya Gehlot
Hi, I need to add 8 hours to from_unixtime: df.withColumn(from_unixtime(col("unix_timestamp"),fmt)) as "date_time" I am trying a Joda-Time function: def unixToDateTime (unix_timestamp : String) : DateTime = { val utcTS = new DateTime(unix_timestamp.toLong * 1000L)+ 8.hours return utcTS } Its
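A minimal column-level sketch that avoids the UDF entirely, assuming the column holds epoch seconds: shift by 8 hours' worth of seconds before formatting:

    import org.apache.spark.sql.functions.{col, from_unixtime}

    val fmt = "yyyy-MM-dd HH:mm:ss"   // assumed output format
    val withLocalTime = df.withColumn("date_time",
      from_unixtime(col("unix_timestamp") + 8 * 3600, fmt))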

Create dataframe column from list

2016-07-22 Thread Divya Gehlot
Hi, can somebody help me with creating a DataFrame column from a Scala list? Would really appreciate the help. Thanks, Divya
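A minimal sketch of the most common reading of this, turning a Scala list into a single-column DataFrame (the exact intent in the thread is truncated, so this is an assumption):

    import sqlContext.implicits._

    val values = List("a", "b", "c")
    val listDf = values.toDF("new_col")   // one row per list element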

[Error] : Save dataframe to csv using Spark-csv in Spark 1.6

2016-07-24 Thread Divya Gehlot
Hi, I am getting below error when I am trying to save dataframe using Spark-CSV > > final_result_df.write.format("com.databricks.spark.csv").option("header","true").save(output_path) java.lang.NoSuchMethodError: > scala.Predef$.$conforms()Lscala/Predef$$less$colon$less; > at > com.databricks.spa

FileUtil.fullyDelete does ?

2016-07-26 Thread Divya Gehlot
Hi, when I am using the FileUtil.copyMerge function: val file = "/tmp/primaryTypes.csv" FileUtil.fullyDelete(new File(file)) val destinationFile= "/tmp/singlePrimaryTypes.csv" FileUtil.fullyDelete(new File(destinationFile)) val counts = partitions. reduceByKey {case (x,y) => x + y}.

Re: FileUtil.fullyDelete does ?

2016-07-26 Thread Divya Gehlot
.io.File) > > On Tue, Jul 26, 2016 at 12:09 PM, Divya Gehlot > wrote: > >> Resending to right list >> -- Forwarded message -- >> From: "Divya Gehlot" >> Date: Jul 26, 2016 6:51 PM >> Subject: FileUtil.fullyDelete does ?

Spark GraphFrames

2016-08-01 Thread Divya Gehlot
Hi, has anybody worked with GraphFrames? Please let me know, as I need to know the real-world scenarios where it can be used. Thanks, Divya

[Spark1.6]:compare rows and add new column based on lookup

2016-08-04 Thread Divya Gehlot
Hi, I am working with Spark 1.6 with Scala and using the DataFrame API. I have a use case where I need to compare two rows and add an entry in the new column based on a lookup table, for example: My DF looks like: col1 col2 newCol1 street1 person1 street2 person1

Re: [Spark1.6]:compare rows and add new column based on lookup

2016-08-04 Thread Divya Gehlot
based on the time stamp column On 5 August 2016 at 10:43, ayan guha wrote: > How do you know person1 is moving from street1 to street2 and not other > way around? Basically, how do you ensure the order of the rows as you have > written them? > > On Fri, Aug 5, 2016 at 12:16 P

[Spark1.6] Or (||) operator not working in DataFrame

2016-08-07 Thread Divya Gehlot
Hi, I have a use case where I need to use the or (||) operator in a filter condition. It seems it is not working: it takes the condition before the operator and ignores the other filter condition after the or operator. Has anybody faced a similar issue? Pseudo code: df.filter(col("colName").notEqual("no_value"
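A minimal sketch of combining two conditions with ||, keeping each condition in its own parentheses so operator precedence cannot split them (column names and values are placeholders):

    import org.apache.spark.sql.functions.col

    val filtered = df.filter(
      (col("colName").notEqual("no_value")) || (col("otherCol").notEqual("other_value")))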

Re: [Spark1.6] Or (||) operator not working in DataFrame

2016-08-07 Thread Divya Gehlot
gBxianrbJd6zP6AcPCCdOABUrV8Pw >> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* >> >> >> http://talebzadehmich.wordpress.com >> >> *Disclaimer:* Use it at your own risk. Any and all responsibility for >> any loss, da

[Spark 1.6]-increment value column based on condition + Dataframe

2016-08-09 Thread Divya Gehlot
Hi, I have a column with values like: Value 30 12 56 23 12 16 12 89 12 5 6 4 8 I need to create another column, if col("value") > 30 then 1 else col("value") < 30: newColValue 0 1 0 1 2 3 4 0 1 2 3 4 5 How can I create such an increment column? The grouping is happening based on some other cols which

Re: Getting a TreeNode Exception while saving into Hadoop

2016-08-17 Thread Divya Gehlot
Can you please check the column order of all the data sets in the unionAll operations? Are they in the same order? On 9 August 2016 at 02:47, max square wrote: > Hey guys, > > I'm trying to save Dataframe in CSV format after performing unionAll > operations on it. > But I get this exception - > > Exception in t
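A minimal sketch of aligning column order before the union, since unionAll in Spark 1.x matches columns by position rather than by name (the DataFrame names are placeholders):

    // reorder dfB's columns to match dfA before the positional union
    val aligned = dfB.select(dfA.columns.map(dfB(_)): _*)
    val unioned = dfA.unionAll(aligned)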

org.apache.spark.SparkException: Task failed while writing rows.+ Spark output data to hive table

2015-12-10 Thread Divya Gehlot
Hi, I am using HDP2.3.2 with Spark 1.4.1 and trying to insert data in hive table using hive context. Below is the sample code 1. spark-shell --master yarn-client --driver-memory 512m --executor-memory 512m 2. //Sample code 3. import org.apache.spark.sql.SQLContext 4. import sqlCon

Pros and cons -Saving spark data in hive

2015-12-15 Thread Divya Gehlot
Hi, I am new to Spark and I am exploring the options and the pros and cons of which will work best with Spark and HiveContext. My dataset inputs are CSV files; I am using Spark to process my data and saving it in Hive using HiveContext. 1) Process the CSV file using the spark-csv package and create temptab

Difference between Local Hive Metastore server and A Hive-based Metastore server

2015-12-17 Thread Divya Gehlot
Hi, I am new to Spark and using 1.4.1. I got confused between a local metastore server and a Hive-based metastore server. Can somebody share the use cases for when to use which one, and the pros and cons? I am using HDP 2.3.2, in which hive-site.xml is already in the Spark configuration directory; that means H

custom schema in spark throwing error

2015-12-20 Thread Divya Gehlot
1. scala> import org.apache.spark.sql.hive.HiveContext 2. import org.apache.spark.sql.hive.HiveContext 3. 4. scala> import org.apache.spark.sql.hive.orc._ 5. import org.apache.spark.sql.hive.orc._ 6. 7. scala> import org.apache.spark.sql.types.{StructType, StructField, Strin

configure spark for hive context

2015-12-21 Thread Divya Gehlot
Hi, I am trying to configure Spark for HiveContext (please don't mistake this for Hive on Spark). I placed hive-site.xml in Spark's conf directory. Now when I run spark-shell I am getting the error below. Versions I am using: Hadoop 2.6.2, Spark 1.5.2, Hive 1.2.1. Welcome to >

error while defining custom schema in Spark 1.5.0

2015-12-22 Thread Divya Gehlot
Hi, I am new to Apache Spark, using the CDH 5.5 Quickstart VM with Spark 1.5.0. I am working on a custom schema and getting an error: import org.apache.spark.sql.hive.HiveContext >> >> scala> import org.apache.spark.sql.hive.orc._ >> import org.apache.spark.sql.hive.orc._ >> >> scala> import org.apache

error creating custom schema

2015-12-23 Thread Divya Gehlot
Hi, I am trying to create custom schema but its throwing below error scala> import org.apache.spark.sql.hive.HiveContext > import org.apache.spark.sql.hive.HiveContext > > scala> import org.apache.spark.sql.hive.orc._ > import org.apache.spark.sql.hive.orc._ > > scala> val hiveContext = new org.a

DataFrame Vs RDDs ... Which one to use When ?

2015-12-27 Thread Divya Gehlot
Hi, I am new to Spark and a bit confused about RDDs and DataFrames in Spark. Can somebody explain to me, with use cases, which one to use when? Would really appreciate the clarification. Thanks, Divya

DataFrame Save is writing just column names while saving

2015-12-27 Thread Divya Gehlot
Hi, I am trying to join two DataFrames and am able to display the results in the console after the join. I am saving the joined data in CSV format using the spark-csv API, but it is saving just the column names, not the data at all. Below is the sample code for reference: spark-shell

Re: DataFrame Save is writing just column names while saving

2015-12-27 Thread Divya Gehlot
. .. 15/12/28 00:25:40 INFO YarnScheduler: Removed TaskSet 6.0, whose tasks have all completed, from pool 15/12/28 00:25:40 INFO DAGScheduler: Job 2 finished: saveAsTextFile at package.scala:157, took 9.293578 s P.S. : Atta

Re: DataFrame Save is writing just column names while saving

2015-12-27 Thread Divya Gehlot
) are getting created just for small data set Attaching the dataset files too. On 28 December 2015 at 13:29, Divya Gehlot wrote: > yes > Sharing the execution flow > > 15/12/28 00:19:15 INFO SessionState: No Tez session required at this > point. hive.execution.engine=mr. > 15/1

returns empty result set when using TimestampType and NullType as StructType +DataFrame +Scala + Spark 1.4.1

2015-12-28 Thread Divya Gehlot
> > SQL context available as sqlContext. > > scala> import org.apache.spark.sql.hive.HiveContext > import org.apache.spark.sql.hive.HiveContext > > scala> import org.apache.spark.sql.hive.orc._ > import org.apache.spark.sql.hive.orc._ > > scala> val hiveContext = new org.apache.spark.sql.hive.HiveC

returns empty result set when using TimestampType and NullType as StructType +DataFrame +Scala + Spark 1.4.1

2015-12-28 Thread Divya Gehlot
SQL context available as sqlContext. > > scala> import org.apache.spark.sql.hive.HiveContext > import org.apache.spark.sql.hive.HiveContext > > scala> import org.apache.spark.sql.hive.orc._ > import org.apache.spark.sql.hive.orc._ > > scala> val hiveContext = new org.apache.spark.sql.hive.HiveCont

Timestamp datatype in dataframe + Spark 1.4.1

2015-12-28 Thread Divya Gehlot
Hi, I have an input data set which is a CSV file with date columns. My output will also be a CSV file, and I will be using this output CSV file for Hive table creation. I have a few queries: 1. I tried a custom schema using Timestamp, but it is returning an empty result set when querying the dataframe

[Spark 1.4.1] StructField for date column in CSV file while creating custom schema

2015-12-28 Thread Divya Gehlot
Hi, I am new to Spark; my apologies for such a naive question. I am using Spark 1.4.1 and writing code in Scala. I have input data as a CSV file which I am parsing using the spark-csv package. I am creating a custom schema to process the CSV file. Now my query is which datatype, or rather which StructFie

map spark.driver.appUIAddress IP to different IP

2015-12-28 Thread Divya Gehlot
Hi, I have an HDP 2.3.2 cluster installed on Amazon EC2. I want to update the IP address of spark.driver.appUIAddress, which is currently mapped to the private IP of the EC2 instance. I searched in the Spark config in Ambari and could find the spark.driver.appUIAddress property. Because of this private IP mapping, the Spark web UI p

Re: Timestamp datatype in dataframe + Spark 1.4.1

2015-12-29 Thread Divya Gehlot
e post, it appears that you resolved this issue. Great job! > > I would recommend putting the solution here as well so that it helps > another developer down the line. > > Thanks > > > On Monday, December 28, 2015 8:56 PM, Divya Gehlot < > divya.htco...@gmail.com>

Error while starting Zeppelin Service in HDP2.3.2

2015-12-30 Thread Divya Gehlot
Hi, I am getting following error while starting the Zeppelin service from ambari server . /var/lib/ambari-agent/data/errors-2408.txt Traceback (most recent call last): File "/var/lib/ambari-agent/cache/stacks/HDP/2.3/services/ZEPPELIN/package/scripts/master.py", line 295, in Master().exec

Group by Dynamically

2016-01-24 Thread Divya Gehlot
Hi, I have two files File1 Group by Condition Field1 Y Field 2 N Field3 Y File2 is data file having field1,field2,field3 etc.. field1 field2 field3 field4 field5 data1 data2 data3 data4 data 5 data11 data22 data33 data44 data 55 Now my requirement is to group b

Dynamic sql in Spark 1.5

2016-02-02 Thread Divya Gehlot
Hi, does Spark support dynamic SQL? Would really appreciate the help if anyone could share some references/examples. Thanks, Divya

how to calculate -- executor-memory,num-executors,total-executor-cores

2016-02-02 Thread Divya Gehlot
Hi, I would like to know how to calculate how much executor-memory we should allocate, and how many num-executors and total-executor-cores we should give, while submitting Spark jobs. Is there any formula for it? Thanks, Divya
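There is no Spark-mandated formula, but a commonly cited rule of thumb works like this (a worked example under assumed hardware): take a cluster of 3 worker nodes with 16 cores and 64 GB RAM each. Reserve about 1 core and 1 GB per node for the OS and Hadoop daemons, leaving 15 cores and 63 GB. With roughly 5 cores per executor for good HDFS throughput, that is 15 / 5 = 3 executors per node and 9 in total; keep one slot for the YARN application master, giving --num-executors 8 and --executor-cores 5. Memory per executor is 63 GB / 3 = 21 GB, minus roughly 7% for YARN overhead, so about --executor-memory 19G. Note that --total-executor-cores applies to standalone/Mesos deployments rather than YARN.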

Re: Dynamic sql in Spark 1.5

2016-02-02 Thread Divya Gehlot
uld be best to use the Dataframe API for creating dynamic SQL > queries. See > http://spark.apache.org/docs/1.5.2/sql-programming-guide.html for details. > > On Feb 2, 2016, at 6:49 PM, Divya Gehlot wrote: > > Hi, > Does Spark supports dyamic sql ? > Would really apprecia

add new column in the schema + Dataframe

2016-02-04 Thread Divya Gehlot
Hi, I am a beginner in Spark and am using Spark 1.5.2 on YARN (HDP 2.3.4). I have a use case where I have to read two input files and, based on certain conditions in the second input file, add a new column to the first input file and save it. I am using spark-csv to read my input files. Would reall

pass one dataframe column value to another dataframe filter expression + Spark 1.5 + scala

2016-02-04 Thread Divya Gehlot
Hi, I have two input datasets. The first input dataset is as below: year,make,model,comment,blank > "2012","Tesla","S","No comment", > 1997,Ford,E350,"Go get one now they are going fast", > 2015,Chevy,Volt The second input dataset: TagId,condition > 1997_cars,year = 1997 and
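A minimal sketch of one way to drive filters on the first dataset from the conditions in the second: collect the small conditions table to the driver and apply each condition string with expr (the DataFrame names and the saving step are assumptions):

    import org.apache.spark.sql.functions.expr

    // (TagId, condition) pairs, small enough to collect to the driver
    val conditions = conditionsDf.select("TagId", "condition").collect()

    conditions.foreach { row =>
      val tagId   = row.getString(0)
      val cond    = row.getString(1)              // e.g. "year = 1997"
      val matched = carsDf.filter(expr(cond))     // rows of the first dataset for this tag
      // ... save `matched` tagged with tagId as needed ...
    }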

Spark : Unable to connect to Oracle

2016-02-10 Thread Divya Gehlot
Hi, I am new to Spark and using Spark version 1.5.2. I am trying to connect to an Oracle DB using the Spark API and getting errors. Steps I followed: Step 1 - I placed the ojdbc6.jar in /usr/hdp/2.3.4.0-3485/spark/lib/ojdbc6.jar Step 2 - Registered the jar file: sc.addJar("/usr/hdp/2.3.4.0-3485/spark/lib/o
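For reference, a minimal sketch of an Oracle JDBC read in Spark 1.5 (the host, credentials and table are placeholders); the ojdbc jar has to be visible to both the driver and the executors, e.g. via --jars plus --driver-class-path on spark-submit:

    val options = Map(
      "url"      -> "jdbc:oracle:thin:@db-host:1521:ORCL",
      "driver"   -> "oracle.jdbc.OracleDriver",
      "dbtable"  -> "MY_TABLE",
      "user"     -> "username",
      "password" -> "password")

    val oracleDf = sqlContext.read.format("jdbc").options(options).load()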

Passing a dataframe to where clause + Spark SQL

2016-02-10 Thread Divya Gehlot
Hi, //Loading all the DB Properties val options1 = Map("url" -> "jdbc:oracle:thin:@xx.xxx.xxx.xx:1521:dbname","user"->"username","password"->"password","dbtable" -> "TESTCONDITIONS") val testCond = sqlContext.load("jdbc",options1 ) val condval = testCond.select("Cond") testCond.show() val options

IllegalStateException : When use --executor-cores option in YARN

2016-02-14 Thread Divya Gehlot
Hi, I am starting spark-shell with following options : spark-shell --properties-file /TestDivya/Spark/Oracle.properties --jars /usr/hdp/2.3.4.0-3485/spark/lib/ojdbc6.jar --driver-class-path /usr/hdp/2.3.4.0-3485/spark/lib/ojdbc6.jar --packages com.databricks:spark-csv_2.10:1.1.0 --master yarn-cl

Difference between spark-shell and spark-submit.Which one to use when ?

2016-02-14 Thread Divya Gehlot
Hi, I would like to know difference between spark-shell and spark-submit in terms of real time scenarios. I am using Hadoop cluster with Spark on EC2. Thanks, Divya

which master option to view current running job in Spark UI

2016-02-14 Thread Divya Gehlot
Hi, I have a Hortonworks 2.3.4 cluster on EC2 and have Spark jobs as Scala files. I am a bit confused between the master options; I want to execute this Spark job on YARN. Currently running as: spark-shell --properties-file /TestDivya/Spark/Oracle.properties --jars /usr/hdp/2.3.4.0-3485/spark/lib/o

Need help :Does anybody has HDP cluster on EC2?

2016-02-15 Thread Divya Gehlot
Hi, I have a Hadoop cluster set up in EC2. I am unable to view application logs in the Web UI, as it is using the internal IP, like below: http://ip-xxx-xx-xx-xxx.ap-southeast-1.compute.internal:8042 How can I change this to the external one or redir

Re: Need help :Does anybody has HDP cluster on EC2?

2016-02-15 Thread Divya Gehlot
icMapReduce/latest/DeveloperGuide/emr-ssh-tunnel.html > > Regards > Sab > > On Mon, Feb 15, 2016 at 1:55 PM, Divya Gehlot > wrote: > >> Hi, >> I have hadoop cluster set up in EC2. >> I am unable to view application logs in Web UI as its taking internal IP
