DropNa in Spark for Columns

2021-02-26 Thread Chetan Khatri
Hi Users, What is the equivalent of Pandas df.dropna(axis='columns') in Spark/Scala? Thanks
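Editor's note: Spark has no direct equivalent of Pandas df.dropna(axis='columns'); DataFrame.na.drop only drops rows. A minimal sketch of one workaround (the dropNullColumns helper below is illustrative, not a built-in API): count nulls per column and keep only the columns that contain none.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, count, when}

// Drop every column that contains at least one null value.
def dropNullColumns(df: DataFrame): DataFrame = {
  // Single aggregate row holding the null count of each column.
  val nullCounts = df.select(df.columns.map(c => count(when(col(c).isNull, c)).alias(c)): _*).first()
  val keep = df.columns.filter(c => nullCounts.getAs[Long](c) == 0L)
  df.select(keep.map(col): _*)
}
```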

Performance Improvement: Collect in spark taking huge time

2021-05-05 Thread Chetan Khatri
Hi All, Collect in Spark is taking a huge amount of time. I want to get the list of values of one column into a Scala collection. How can I do this? val newDynamicFieldTablesDF = cachedPhoenixAppMetaDataForCreateTableDF .select(col("reporting_table")).except(clientSchemaDF) logger.info(s"###
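Editor's note: a minimal sketch of pulling one column into a Scala collection (column name taken from the snippet; assumes the result fits in driver memory). The collect itself is usually cheap; the time typically goes into the upstream except/shuffle, so caching or checkpointing the inputs is where tuning pays off.

```scala
import spark.implicits._

// Collect the single selected column as a typed Dataset, then bring it to the driver.
val reportingTables: List[String] = newDynamicFieldTablesDF
  .select("reporting_table")
  .as[String]
  .collect()
  .toList
```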

Re: Performance Improvement: Collect in spark taking huge time

2021-05-05 Thread Chetan Khatri
May 5, 2021 at 10:15 PM Chetan Khatri wrote: > Hi All, Collect in spark is taking huge time. I want to get list of values > of one column to Scala collection. How can I do this? > val newDynamicFieldTablesDF = cachedPhoenixAppMetaDataForCreateTableDF > .select(col("

Usage of DropDuplicate in Spark

2021-06-22 Thread Chetan Khatri
Hi Spark Users, I want to use dropDuplicates, but I would like to log the records that it discards to an instrumentation table. What would be the best approach to do that? Thanks
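Editor's note: there is no built-in that returns the rows dropDuplicates discards. A sketch of the common workaround (key and table names assumed; full-row comparison, so it is an approximation): diff the original frame against the deduplicated one and write that difference to the instrumentation table.

```scala
// Keep one row per key, then log everything that was dropped.
val deduped   = inputDF.dropDuplicates("business_key")
val discarded = inputDF.exceptAll(deduped)            // Spark 2.4+; except() on older versions dedupes the diff

discarded.write.mode("append").saveAsTable("audit.dropped_records")
deduped.write.mode("overwrite").saveAsTable("target_table")
```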

Re: Usage of DropDuplicate in Spark

2021-06-22 Thread Chetan Khatri
> > hope this helps > > Thanks > Sachit > > On Tue, Jun 22, 2021, 22:23 Chetan Khatri > wrote: > >> Hi Spark Users, >> >> I want to use DropDuplicate, but those records which I discard. I >> would like to log to the instrumental table. >> >> What would be the best approach to do that? >> >> Thanks >> >

Re: Usage of DropDuplicate in Spark

2021-06-22 Thread Chetan Khatri
I am looking for a built-in API, if one exists at all. On Tue, Jun 22, 2021 at 1:16 PM Chetan Khatri wrote: > this has been very slow > > On Tue, Jun 22, 2021 at 1:15 PM Sachit Murarka > wrote: > >> Hi Chetan, >> >> You can substract the data frame or use excep

Need help on migrating Spark on Hortonworks to Kubernetes Cluster

2022-05-08 Thread Chetan Khatri
Hi Everyone, I need help with my Airflow DAG which has a Spark Submit task. I now have a Kubernetes cluster instead of the Hortonworks Linux distributed Spark cluster. My existing spark-submit goes through a BashOperator as below: calculation1 = '/usr/hdp/2.6.5.0-292/spark2/bin/spark-submit --conf spark.yarn.maxAppA

to find Difference of locations in Spark Dataframe rows

2022-06-07 Thread Chetan Khatri
Hi Dear Spark Users, It has been many years since I last worked on Spark, please help me. Thanks much. I have different cities and their coordinates in a DataFrame[Row]; I want to find the distance in km and then show only those records/cities which are 10 km apart. I have a function created that can
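Editor's note: a sketch of one way to do this, assuming a cities DataFrame with city/lat/lon columns: a haversine UDF applied over a self cross-join, then a filter on the distance.

```scala
import spark.implicits._
import org.apache.spark.sql.functions.udf

// Haversine distance in kilometres between two (lat, lon) points.
val haversineKm = udf { (lat1: Double, lon1: Double, lat2: Double, lon2: Double) =>
  val r    = 6371.0
  val dLat = math.toRadians(lat2 - lat1)
  val dLon = math.toRadians(lon2 - lon1)
  val a = math.pow(math.sin(dLat / 2), 2) +
    math.cos(math.toRadians(lat1)) * math.cos(math.toRadians(lat2)) * math.pow(math.sin(dLon / 2), 2)
  2 * r * math.asin(math.sqrt(a))
}

// Pair every city with every other city and keep pairs at least 10 km apart.
val pairs = cities.as("a").crossJoin(cities.as("b"))
  .filter($"a.city" =!= $"b.city")
  .withColumn("distance_km", haversineKm($"a.lat", $"a.lon", $"b.lat", $"b.lon"))
  .filter($"distance_km" >= 10)
```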

Is there any Job/Career channel

2023-01-15 Thread Chetan Khatri
Hi Spark Users, Is there any Job/Career channel for Apache Spark? Thank you

About Error while reading large JSON file in Spark

2016-10-18 Thread Chetan Khatri
va:135) at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67) at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:237) What would be the resolution for this? Thanks in advance! -- Yours Aye, Chetan Khatri.

Re: About Error while reading large JSON file in Spark

2016-10-18 Thread Chetan Khatri
ray is one object, it cannot be split into multiple > partition. > > > On Tue, Oct 18, 2016 at 3:44 PM Chetan Khatri > wrote: > >> Hello Community members, >> >> I am getting error while reading large JSON file in spark, >> >> *Code:* >> >

About Reading Parquet - failed to read single gz parquet - failed entire transformation

2016-10-21 Thread Chetan Khatri
apply(SparkPlan.scala:130) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130) -- Yours Aye, Chetan Khatri.

Writing to Parquet Job turns to wait mode after even completion of job

2016-10-21 Thread Chetan Khatri
. then it clears broadcast, accumulator shared variables. Can we speed this up? Thanks. -- Yours Aye, Chetan Khatri. M.+91 7 80574 Data Science Researcher INDIA Statement of Confidentiality The contents of this e-mail message and any attachments are

Re: Writing to Parquet Job turns to wait mode after even completion of job

2016-10-21 Thread Chetan Khatri
f you are appending a small amount of data to a > large existing Parquet dataset. > > If that's the case, you may disable Parquet summary files by setting > Hadoop configuration " parquet.enable.summary-metadata" to false. > > We've disabled it by default since 1

Re: Writing to Parquet Job turns to wait mode after even completion of job

2016-10-28 Thread Chetan Khatri
resolve current issue. It takes more time to clear broadcast, accumulator etc. Can we tune this with the Spark 1.6.1 MapR distribution? On Oct 27, 2016 2:34 PM, "Mehrez Alachheb" wrote: > I think you should just shut down your SparkContext at the end. > sc.stop() > > 201

About Spark Multiple Shared Context with Spark 2.0

2016-12-13 Thread Chetan Khatri
Hello Guys, What would be the approach to accomplish multiple shared Spark contexts without Alluxio and with Alluxio, and what would be the best practice to achieve parallelism and concurrency for Spark jobs? Thanks. -- Yours Aye, Chetan Khatri. M.+91 7 80574 Data Science Researcher INDIA

Approach: Incremental data load from HBASE

2016-12-21 Thread Chetan Khatri
batch where flag is 0 or 1. I am looking for a best-practice approach with any distributed tool. Thanks. - Chetan Khatri

Re: Approach: Incremental data load from HBASE

2016-12-21 Thread Chetan Khatri
> > > On Wed, Dec 21, 2016 at 2:28 AM, Chetan Khatri < > chetan.opensou...@gmail.com> wrote: > >> Hello Guys, >> >> I would like to understand different approach for Distributed Incremental >> load from HBase, Is there any *tool / incubactor tool* which

Best Practice for Spark Job Jar Generation

2016-12-22 Thread Chetan Khatri
h for Uber Less Jar. Guys, can you please explain the industry-standard best practice for the same? Thanks, Chetan Khatri.

Dependency Injection and Microservice development with Spark

2016-12-23 Thread Chetan Khatri
Hello Community, The current approach I am using for Spark job development is Scala + SBT and an uber jar, with a YAML properties file to pass configuration parameters. But if I would like to use dependency injection and microservice development (like Spring Boot features) in Scala, then what would be the stan

Re: Best Practice for Spark Job Jar Generation

2016-12-23 Thread Chetan Khatri
us). > > --- > Regards, > Andy > > On Fri, Dec 23, 2016 at 6:44 AM, Chetan Khatri < > chetan.opensou...@gmail.com> wrote: > >> Hello Spark Community, >> >> For Spark Job Creation I use SBT Assembly to build Uber("Super") Jar and >>

Re: Best Practice for Spark Job Jar Generation

2016-12-23 Thread Chetan Khatri
dy > > On Fri, Dec 23, 2016 at 6:00 PM, Chetan Khatri < > chetan.opensou...@gmail.com> wrote: > >> Andy, Thanks for reply. >> >> If we download all the dependencies at separate location and link with >> spark job jar on spark cluster, is it best way to execute

Re: Approach: Incremental data load from HBASE

2016-12-23 Thread Chetan Khatri
> After such rows are obtained, it is up to you how the result of processing > is delivered to hbase. > > Cheers > > On Wed, Dec 21, 2016 at 8:00 AM, Chetan Khatri < > chetan.opensou...@gmail.com> wrote: > >> Ok, Sure will ask. >> >> But what would be

Apache Hive with Spark Configuration

2016-12-28 Thread Chetan Khatri
Hello Users / Developers, I am using Hive 2.0.1 with MySQL as a metastore. Can you tell me which version is most compatible with Spark 2.0.2? Thanks

Error: at sqlContext.createDataFrame with RDD and Schema

2016-12-28 Thread Chetan Khatri
Hello Spark Community, I am reading an HBase table from Spark and getting an RDD, but now I want to convert that RDD of Spark Rows to a DataFrame. *Source Code:* bin/spark-shell --packages it.nerdammer.bigdata:spark-hbase-connector_2.10:1.0.3 --conf spark.hbase.host=127.0.0.1 import it.nerdamme
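Editor's note: the general pattern is an RDD[Row] plus an explicit StructType passed to createDataFrame; a sketch with assumed column names and a placeholder mapping from the connector's record type.

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val schema = StructType(Seq(
  StructField("row_key", StringType, nullable = false),
  StructField("payload", StringType, nullable = true)
))

// Map each HBase record into a Row matching the schema; adapt to the connector's record shape.
val rowRDD = hbaseRDD.map { rec => Row(rec._1, rec._2) }

val df = sqlContext.createDataFrame(rowRDD, schema)   // spark.createDataFrame on a Spark 2.x session
```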

Re: Error: at sqlContext.createDataFrame with RDD and Schema

2016-12-28 Thread Chetan Khatri
, unable to tell from the error what exactly it is. Thanks. On Wed, Dec 28, 2016 at 9:00 PM, Chetan Khatri wrote: > Hello Spark Community, > > I am reading HBase table from Spark and getting RDD but now i wants to > convert RDD of Spark Rows and want to convert to DF. >

Re: Apache Hive with Spark Configuration

2017-01-04 Thread Chetan Khatri
nd we've found (from having different > versions as well) that older versions are mostly compatible. Some things > fail occasionally, but we haven't had too many problems running different > versions with the same metastore in practice. > > rb > > On Wed, Dec 28

Re: Dependency Injection and Microservice development with Spark

2017-01-04 Thread Chetan Khatri
tlS, https://freebusy.io/la...@mapflat.com > > > On Fri, Dec 23, 2016 at 11:56 AM, Chetan Khatri > wrote: > > Hello Community, > > > > Current approach I am using for Spark Job Development with Scala + SBT > and > > Uber Jar with yml properties file to pass config

Re: Approach: Incremental data load from HBASE

2017-01-04 Thread Chetan Khatri
using org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles to load the > data into hbase. > > For your use case, the producer needs to find rows where the flag is 0 or > 1. > After such rows are obtained, it is up to you how the result of processing > is delivered to hbase. > > Cheers > > On Wed, De

Re: Approach: Incremental data load from HBASE

2017-01-06 Thread Chetan Khatri
t at Row level. > > On Wed, Jan 4, 2017 at 10:37 PM, Chetan Khatri < > chetan.opensou...@gmail.com> wrote: > >> Ted Yu, >> >> You understood wrong, i said Incremental load from HBase to Hive, >> individually you can say Incremental Import f

Re: Approach: Incremental data load from HBASE

2017-01-06 Thread Chetan Khatri
Ayan, Thanks. Correct, I am not thinking in RDBMS terms, I am wearing NoSQL glasses! On Fri, Jan 6, 2017 at 3:23 PM, ayan guha wrote: > IMHO you should not "think" HBase in RDMBS terms, but you can use > ColumnFilters to filter out new records > > On Fri, Jan 6, 2017 at

About saving DataFrame to Hive 1.2.1 with Spark 2.0.1

2017-01-16 Thread Chetan Khatri
Hello Community, I am struggling to save a DataFrame to a Hive table. Versions: Hive 1.2.1 Spark 2.0.1 *Working code:* /* @Author: Chetan Khatri Description: This Scala script has been written for the HBase to Hive module, which reads a table from HBase and dumps it out to Hive

Re: About saving DataFrame to Hive 1.2.1 with Spark 2.0.1

2017-01-16 Thread Chetan Khatri
chema.struct); stdDf: org.apache.spark.sql.DataFrame = [stid: string, name: string ... 3 more fields] Thanks. On Tue, Jan 17, 2017 at 12:48 AM, Chetan Khatri wrote: > Hello Community, > > I am struggling to save Dataframe to Hive Table, > > Versions: > > Hive 1.2.

Weird experience Hive with Spark Transformations

2017-01-16 Thread Chetan Khatri
Hello, I have following services are configured and installed successfully: Hadoop 2.7.x Spark 2.0.x HBase 1.2.4 Hive 1.2.1 *Installation Directories:* /usr/local/hadoop /usr/local/spark /usr/local/hbase *Hive Environment variables:* #HIVE VARIABLES START export HIVE_HOME=/usr/local/hive expo

Re: anyone from bangalore wants to work on spark projects along with me

2017-01-19 Thread Chetan Khatri
Connect with Bangalore - Spark Meetup group. On Thu, Jan 19, 2017 at 3:07 PM, Deepak Sharma wrote: > Yes. > I will be there before 4 PM . > Whats your contact number ? > Thanks > Deepak > > On Thu, Jan 19, 2017 at 2:38 PM, Sirisha Cheruvu > wrote: > >> Are we meeting today?! >> >> On Jan 18, 20

HBaseContext with Spark

2017-01-25 Thread Chetan Khatri
Hello Spark Community Folks, Currently I am using HBase 1.2.4 and Hive 1.2.1; I am looking for bulk load from HBase to Hive. I have seen a couple of good examples in the HBase GitHub repo: https://github.com/apache/hbase/tree/master/hbase-spark If I would like to use HBaseContext with HBase 1.2.4, how

Re: HBaseContext with Spark

2017-01-25 Thread Chetan Khatri
Yu wrote: > Though no hbase release has the hbase-spark module, you can find the > backport patch on HBASE-14160 (for Spark 1.6) > > You can build the hbase-spark module yourself. > > Cheers > > On Wed, Jan 25, 2017 at 3:32 AM, Chetan Khatri < > chetan.opensou...@gmai

Re: outdated documentation? SparkSession

2017-01-27 Thread Chetan Khatri
Not outdated at all; there are other methods that depend on SparkContext, so you still have to create it. For example, https://gist.github.com/chetkhatri/f75c2b743e6cb2d7066188687448c5a1 On Fri, Jan 27, 2017 at 2:06 PM, Wojciech Indyk wrote: > Hi! > In this doc http://spark.apache.org/d

Re: HBaseContext with Spark

2017-01-27 Thread Chetan Khatri
use Hive EXTERNAL TABLE > with > > STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'. > > > Try this if you problem can be solved > > > https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration > > > Regards > > Amrit > > >

Re: HBaseContext with Spark

2017-01-27 Thread Chetan Khatri
TotalOrderPartitioner (sorts data, producing a large number of region files) Import HFiles into HBase HBase can merge files if necessary On Sat, Jan 28, 2017 at 11:32 AM, Chetan Khatri wrote: > @Ted, I dont think so. > > On Thu, Jan 26, 2017 at 6:35 AM, Ted Yu wrote: > >> Does t

Re: issue with running Spark streaming with spark-shell

2017-01-28 Thread Chetan Khatri
if you are using any other package give it as argument --packages On Sat, Jan 28, 2017 at 8:14 PM, Jacek Laskowski wrote: > Hi, > > How did you start spark-shell? > > Jacek > > On 28 Jan 2017 11:20 a.m., "Mich Talebzadeh" > wrote: > >> >> Hi, >> >> My spark-streaming application works fine whe

Error Saving Dataframe to Hive with Spark 2.0.0

2017-01-29 Thread Chetan Khatri
Hello Spark Users, I am getting an error while saving a Spark DataFrame to a Hive table: Hive 1.2.1 Spark 2.0.0 Local environment. Note: the job executes successfully and does what I want, but the exception is still raised. *Source Code:* package com.chetan.poc.hbase /** * Created by chetan on 24/1/17.

Re: Error Saving Dataframe to Hive with Spark 2.0.0

2017-01-29 Thread Chetan Khatri
> since. > > Jacek > > > On 29 Jan 2017 9:24 a.m., "Chetan Khatri" > wrote: > > Hello Spark Users, > > I am getting error while saving Spark Dataframe to Hive Table: > Hive 1.2.1 > Spark 2.0.0 > Local environment. > Note: Job is getting execut

Spark Job Performance monitoring approaches

2017-02-15 Thread Chetan Khatri
Hello All, What would be the best approaches to monitor Spark performance? Are there any tools for Spark job performance monitoring? Thanks.

Re: Spark Job Performance monitoring approaches

2017-02-15 Thread Chetan Khatri
> github.com/SparkMonitor/varOne https://github.com/groupon/sparklint > > Chetan Khatri schrieb am Do., 16. Feb. 2017 > um 06:15 Uhr: > >> Hello All, >> >> What would be the best approches to monitor Spark Performance, is there >> any tools for Spark Job Performance monitoring ? >> >> Thanks. >> >

Issues: Generate JSON with null values in Spark 2.0.x

2017-03-07 Thread Chetan Khatri
Hello Dev / Users, I am working on migrating PySpark code to Scala. With Python, iterating Spark with a dictionary and generating JSON with nulls is possible with json.dumps(), which will be converted to SparkSQL[Row]; but in Scala, how can we generate JSON with null values as a DataFrame? Thanks.
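Editor's note: Spark's JSON generator omits null-valued fields by default, and the 2.0.x writer has no switch for this; from Spark 3.0 onward the ignoreNullFields option (and the spark.sql.jsonGenerator.ignoreNullFields conf) keeps them. A sketch on a newer version:

```scala
import org.apache.spark.sql.functions.{col, struct, to_json}

// Write JSON keeping null fields (Spark 3.0+).
df.write.option("ignoreNullFields", "false").json("/tmp/json_with_nulls")

// Same idea when generating a JSON column in place.
val jsonCol = to_json(struct(df.columns.map(col): _*), Map("ignoreNullFields" -> "false"))
```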

Re: Issues: Generate JSON with null values in Spark 2.0.x

2017-03-20 Thread Chetan Khatri
ot in omitted form, like: > > { > "first_name": "Dongjin" > } > > right? > > - Dongjin > > On Wed, Mar 8, 2017 at 5:58 AM, Chetan Khatri > wrote: > >> Hello Dev / Users, >> >> I am working with PySpark Code migration to

Flatten JSON to multiple columns in Spark

2017-07-17 Thread Chetan Khatri
Hello Spark Devs, Can you please guide me on how to flatten JSON to multiple columns in Spark? *Example:* Sr No Title ISBN Info 1 Calculus Theory 1234567890 [{"cert":[{ "authSbmtr":"009415da-c8cd-418d-869e-0a19601d79fa", 009415da-c8cd-418d-869e-0a19601d79fa "certUUID":"03ea5a1a-5530-4fa3-8871-9d1
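Editor's note: the usual pattern, sketched with guessed column names and a partial schema for the Info field: parse the JSON string with from_json against an explicit schema, explode the arrays, then select the nested fields as top-level columns.

```scala
import org.apache.spark.sql.functions.{col, explode, from_json}
import org.apache.spark.sql.types._

// Partial schema for the Info column, covering only the fields shown in the example.
val infoSchema = ArrayType(StructType(Seq(
  StructField("cert", ArrayType(StructType(Seq(
    StructField("authSbmtr", StringType),
    StructField("certUUID", StringType)
  ))))
)))

val flattened = booksDF
  .withColumn("info", from_json(col("Info"), infoSchema))
  .withColumn("info_item", explode(col("info")))
  .withColumn("cert", explode(col("info_item.cert")))
  .select(col("Title"), col("ISBN"), col("cert.authSbmtr"), col("cert.certUUID"))
```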

Re: Flatten JSON to multiple columns in Spark

2017-07-17 Thread Chetan Khatri
Georg, Thank you for the reply; it throws an error because the field is coming in as a string. On Tue, Jul 18, 2017 at 11:38 AM, Georg Heiler wrote: > df.select ($"Info.*") should help > Chetan Khatri schrieb am Di. 18. Juli 2017 > um 08:06: > >> Hello Spark Dev's, >

Re: Flatten JSON to multiple columns in Spark

2017-07-17 Thread Chetan Khatri
Explode is not working in this scenario; the error says a string column cannot be used with explode, which expects an array or map type in Spark. On Tue, Jul 18, 2017 at 11:39 AM, 刘虓 wrote: > Hi, > have you tried to use explode? > > Chetan Khatri wrote on Tue, Jul 18, 2017 at 2:06 PM: > >> Hello Spark Dev's, >

Re: Flatten JSON to multiple columns in Spark

2017-07-18 Thread Chetan Khatri
t; <https://github.com/bazaarvoice/jolt> >> >> >> >> >> On Monday, July 17, 2017, 11:18:24 PM PDT, Chetan Khatri < >> chetan.opensou...@gmail.com> wrote: >> >> >> Explode is not working in this scenario with error - string cannot be &

Re: Flatten JSON to multiple columns in Spark

2017-07-19 Thread Chetan Khatri
>> schemas and I didn't want to own the transforms. >>>> >>>> I also recommend persisting anything that isn't part of your schema in >>>> an 'extras field' So when you parse out your json, if you've got anything >>>> lef

Re: Flatten JSON to multiple columns in Spark

2017-07-19 Thread Chetan Khatri
mRole", StringType) .add("pgmUUID", StringType) .add("regUUID", StringType) .add("rtlrsSbmtd", StringType) On Wed, Jul 19, 2017 at 6:42 PM, Jules Damji wrote: > > Another tutorial that complements and shows how to work and extract data > from nest

Issue: Hive Table Stored as col(array) instead of Columns with Spark

2017-07-20 Thread Chetan Khatri
Hello All, I am facing an issue with storing a DataFrame to a Hive table with partitioning; without partitioning it works fine. *Spark 2.0.1* finalDF.write.mode(SaveMode.Overwrite).partitionBy("week_end_date").saveAsTable(OUTPUT_TABLE.get) and added below configuration too: spark.sqlContext.setConf("h

Re: Issue: Hive Table Stored as col(array) instead of Columns with Spark

2017-07-20 Thread Chetan Khatri
Has anyone faced the same kind of issue with Spark 2.0.1? On Thu, Jul 20, 2017 at 2:08 PM, Chetan Khatri wrote: > Hello All, > I am facing issue with storing Dataframe to Hive table with partitioning , > without partitioning it works good. > > *Spark 2.0.1* > > finalDF.write.mo

Support Dynamic Partition Inserts params with SET command in Spark 2.0.1

2017-07-28 Thread Chetan Khatri
Hey Dev / User, I am working with Spark 2.0.1 and with dynamic partitioning with Hive, facing the issue below: org.apache.hadoop.hive.ql.metadata.HiveException: Number of dynamic partitions created is 1344, which is more than 1000. To solve this try to set hive.exec.max.dynamic.partitions to at least 1
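Editor's note: the two approaches discussed in this thread, sketched (behaviour varies by version, see SPARK-19881): set the Hive conf through SQL at runtime, or pass it when the session is built.

```scala
// At runtime, via SQL:
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
spark.sql("SET hive.exec.max.dynamic.partitions=2000")

// Or at session construction:
val session = org.apache.spark.sql.SparkSession.builder()
  .appName("dynamic-partition-load")
  .config("hive.exec.max.dynamic.partitions", "2000")
  .enableHiveSupport()
  .getOrCreate()
```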

Re: Support Dynamic Partition Inserts params with SET command in Spark 2.0.1

2017-07-28 Thread Chetan Khatri
Jorn, Both are same. On Fri, Jul 28, 2017 at 4:18 PM, Jörn Franke wrote: > Try sparksession.conf().set > > On 28. Jul 2017, at 12:19, Chetan Khatri > wrote: > > Hey Dev/ USer, > > I am working with Spark 2.0.1 and with dynamic partitioning with H

Re: Support Dynamic Partition Inserts params with SET command in Spark 2.0.1

2017-07-28 Thread Chetan Khatri
I think it will be same, but let me try that FYR - https://issues.apache.org/jira/browse/SPARK-19881 On Fri, Jul 28, 2017 at 4:44 PM, ayan guha wrote: > Try running spark.sql("set yourconf=val") > > On Fri, 28 Jul 2017 at 8:51 pm, Chetan Khatri > wrote: > >> Jo

Reparitioning Hive tables - Container killed by YARN for exceeding memory limits

2017-08-02 Thread Chetan Khatri
Hello Spark Users, I am reading an HBase table and writing to a Hive managed table, where I applied partitioning by a date column, which worked fine, but it has generated a large number of files across almost 700 partitions. I wanted to use repartition to reduce file I/O by reducing the number of files inside each
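Editor's note: a sketch of the repartition-before-write pattern being asked about (column and table names assumed); repartitioning by the partition column collapses each Hive partition to roughly one file per shuffle partition.

```scala
import org.apache.spark.sql.functions.col

finalDF
  .repartition(col("record_date"))          // shuffle so rows of one partition land together
  .write
  .mode("overwrite")
  .partitionBy("record_date")
  .saveAsTable("warehouse.hbase_snapshot")
```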

Re: Reparitioning Hive tables - Container killed by YARN for exceeding memory limits

2017-08-02 Thread Chetan Khatri
Can anyone please guide me with the above issue? On Wed, Aug 2, 2017 at 6:28 PM, Chetan Khatri wrote: > Hello Spark Users, > > I have Hbase table reading and writing to Hive managed table where i > applied partitioning by date column which worked fine but it has generate > more num

Re: Reparitioning Hive tables - Container killed by YARN for exceeding memory limits

2017-08-02 Thread Chetan Khatri
ill be used for Spark execution, not reserved whatever is > consuming it and causing the OOM. (If Spark's memory is too low, you'll see > other problems like spilling too much to disk.) > > rb > > On Wed, Aug 2, 2017 at 9:02 AM, Chetan Khatri > wrote: > >

Re: Reparitioning Hive tables - Container killed by YARN for exceeding memory limits

2017-08-03 Thread Chetan Khatri
stly most people > find this number for their job "experimentally" (e.g. they try a few > different things). > > On Wed, Aug 2, 2017 at 1:52 PM, Chetan Khatri > wrote: > >> Ryan, >> Thank you for reply. >> >> For 2 TB of Data what should be the value of

Re: Write only one output file in Spark SQL

2017-08-11 Thread Chetan Khatri
What you can do is create the partition column in Hive (for example, date), use val finalDf = df.repartition(df.col("date_column")), and later run insert overwrite tablename partition(date_column) select * from tempview. That would work as expected. On 11-Aug-2017 11:03 PM, "KhajaAsmath Mohammed" wro
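Editor's note: a sketch of that suggestion in code (table and column names are placeholders):

```scala
val finalDf = df.repartition(df.col("date_column"))
finalDf.createOrReplaceTempView("tempview")

spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
spark.sql(
  """INSERT OVERWRITE TABLE target_table PARTITION (date_column)
    |SELECT * FROM tempview""".stripMargin)
```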

Re: Spark - Partitions

2017-10-12 Thread Chetan Khatri
Use repartition On 13-Oct-2017 9:35 AM, "KhajaAsmath Mohammed" wrote: > Hi, > > I am reading hive query and wiriting the data back into hive after doing > some transformations. > > I have changed setting spark.sql.shuffle.partitions to 2000 and since then > job completes fast but the main problem

Re: No space left on device

2017-10-17 Thread Chetan Khatri
Process data in micro batch On 18-Oct-2017 10:36 AM, "Chetan Khatri" wrote: > Your hard drive don't have much space > On 18-Oct-2017 10:35 AM, "Mina Aslani" wrote: > >> Hi, >> >> I get "No space left on device" error in my spark wo

Re: No space left on device

2017-10-17 Thread Chetan Khatri
Your hard drive doesn't have much space. On 18-Oct-2017 10:35 AM, "Mina Aslani" wrote: > Hi, > > I get "No space left on device" error in my spark worker: > > Error writing stream to file /usr/spark-2.2.0/work/app-.../0/stderr > java.io.IOException: No space left on device > > In my spark cluste

Spark Writing to parquet directory : java.io.IOException: Disk quota exceeded

2017-11-21 Thread Chetan Khatri
Hello Spark Users, I am getting the error below when I try to write a dataset to a Parquet location. I have enough disk space available. The last time I faced this kind of error it was resolved by increasing the number of cores in the hyperparameters. Currently the result set data size is almost 400 GB w

Re: Spark Writing to parquet directory : java.io.IOException: Disk quota exceeded

2017-11-22 Thread Chetan Khatri
Can anybody reply on this? On Tue, Nov 21, 2017 at 3:36 PM, Chetan Khatri wrote: > > Hello Spark Users, > > I am getting below error, when i am trying to write dataset to parquet > location. I have enough disk space available. Last time i was facing same > kind of error whic

Re: NLTK with Spark Streaming

2017-11-26 Thread Chetan Khatri
But you can still use the Stanford NLP library and distribute it through Spark, right? On Sun, Nov 26, 2017 at 3:31 PM, Holden Karau wrote: > So it’s certainly doable (it’s not super easy mind you), but until the > arrow udf release goes out it will be rather slow. > > On Sun, Nov 26, 2017 at 8:01 AM a

Livy Failed error on Yarn with Spark

2018-05-09 Thread Chetan Khatri
All, I am running on Hortonworks HDP Hadoop with Livy and Spark 2.2.0. When I run the same Spark job using spark-submit it succeeds, with all transformations done. When I try to submit using Livy, the Spark job is invoked and succeeds, but Yar

Bulk / Fast Read and Write with MSSQL Server and Spark

2018-05-23 Thread Chetan Khatri
All, I am looking for an approach to do bulk read/write with MS SQL Server and Apache Spark 2.2; please let me know if there is any library/driver for the same. Thank you. Chetan
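Editor's note: absent a dedicated bulk-copy connector, the plain JDBC source with the Microsoft driver is the baseline. A sketch with placeholder connection details: read in parallel with partitioning options and batch the writes.

```scala
val jdbcUrl = "jdbc:sqlserver://host:1433;databaseName=source_db"   // placeholder

val sourceDF = spark.read
  .format("jdbc")
  .option("url", jdbcUrl)
  .option("dbtable", "dbo.source_table")
  .option("user", "user").option("password", "password")
  .option("numPartitions", "8")                  // parallel reads
  .option("partitionColumn", "id")
  .option("lowerBound", "1").option("upperBound", "1000000")
  .load()

sourceDF.write
  .format("jdbc")
  .option("url", "jdbc:sqlserver://host:1433;databaseName=target_db")
  .option("dbtable", "dbo.target_table")
  .option("user", "user").option("password", "password")
  .option("batchsize", "10000")                  // larger JDBC batches for faster inserts
  .mode("append")
  .save()
```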

Re: Bulk / Fast Read and Write with MSSQL Server and Spark

2018-05-23 Thread Chetan Khatri
Try this https://docs.microsoft.com/en-us/azure/sql-database/sql- > database-spark-connector > > > > > > *From: *Chetan Khatri > *Date: *Wednesday, May 23, 2018 at 7:47 AM > *To: *user > *Subject: *Bulk / Fast Read and Write with MSSQL Server and Spark > > > >

Re: Bulk / Fast Read and Write with MSSQL Server and Spark

2018-05-23 Thread Chetan Khatri
Super, just giving a high-level idea of what I want to do. I have one source schema, which is MS SQL Server 2008, and the target is also MS SQL Server 2008. Currently there is a C#-based ETL application which does extract, transform and load into a customer-specific schema, including indexing etc. Thanks On Wed, M

Re: Bulk / Fast Read and Write with MSSQL Server and Spark

2018-05-25 Thread Chetan Khatri
park slaves tend to send lots of data at once >> to SQL and that slows down the latency of the rest of the system. We >> overcame this by using sqoop and running it in a controlled environment. >> >> On Wed, May 23, 2018 at 7:32 AM Chetan Khatri < >> chetan.opensou...@g

GroupBy in Spark / Scala without Agg functions

2018-05-29 Thread Chetan Khatri
All, I have a scenario like this in MS SQL Server SQL where I need to do a groupBy without an agg function: Pseudocode: select m.student_id, m.student_name, m.student_std, m.student_group, m.student_dob from student as m inner join general_register g on m.student_id = g.student_id group by m.student_
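Editor's note: a GROUP BY over all selected columns with no aggregates is effectively a de-duplication. A sketch of the Scala/DataFrame equivalents (frame and column names follow the pseudocode):

```scala
import org.apache.spark.sql.functions.col

val joined = student.as("m")
  .join(generalRegister.as("g"), col("m.student_id") === col("g.student_id"))
  .select("m.student_id", "m.student_name", "m.student_std", "m.student_group", "m.student_dob")

val resultA = joined.distinct()                       // same as grouping by all selected columns
val resultB = joined.dropDuplicates("student_id")     // if one row per student is the real intent
```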

Re: GroupBy in Spark / Scala without Agg functions

2018-05-29 Thread Chetan Khatri
this the same as select distinct? > > Chetan Khatri schrieb am Di., 29. Mai 2018 > um 20:21 Uhr: > >> All, >> >> I have scenario like this in MSSQL Server SQL where i need to do groupBy >> without Agg function: >> >> Pseudocode: >> >

Re: GroupBy in Spark / Scala without Agg functions

2018-05-29 Thread Chetan Khatri
18 at 12:08 AM, Georg Heiler > > wrote: >> >>> Why do you group if you do not want to aggregate? >>> Isn't this the same as select distinct? >>> >>> Chetan Khatri schrieb am Di., 29. Mai >>> 2018 um 20:21 Uhr: >>> >>&g

Re: GroupBy in Spark / Scala without Agg functions

2018-05-29 Thread Chetan Khatri
Georg, sorry for the dumb question. Help me to understand: if I do DF.select(A, B, C, D).distinct(), would that be the same as the above groupBy without agg in SQL? On Wed, May 30, 2018 at 12:17 AM, Chetan Khatri wrote: > I don't want to get any aggregation, just want to know rather saying &g

Re: 答复: GroupBy in Spark / Scala without Agg functions

2018-05-29 Thread Chetan Khatri
ct(): Dataset[T] = dropDuplicates() > > … > > def dropDuplicates(colNames: Seq[String]): Dataset[T] = withTypedPlan { > > … > > Aggregate(groupCols, aggCols, logicalPlan) > } > > > > > > > > > > *发件人**:* Chetan Khatri [mailto:chetan.opensou.

Apply Core Java Transformation UDF on DataFrame

2018-06-04 Thread Chetan Khatri
All, I would like to apply a core Java transformation UDF on a DataFrame created from a table or flat files and return a new DataFrame object. Any suggestions, with respect to Spark internals? Thanks.

Re: Apply Core Java Transformation UDF on DataFrame

2018-06-05 Thread Chetan Khatri
Can anyone throw light on this? It would be helpful. On Tue, Jun 5, 2018 at 1:41 AM, Chetan Khatri wrote: > All, > > I would like to Apply Java Transformation UDF on DataFrame created from > Table, Flat Files and retrun new Data Frame Object. Any suggestions, with > respect to

Run Python User Defined Functions / code in Spark with Scala Codebase

2018-07-03 Thread Chetan Khatri
Hello Dear Spark User / Dev, I would like to pass a Python user-defined function to a Spark job developed using Scala, and have the return value of that function come back to the DataFrame / Dataset API. Can someone please guide me on the best approach to do this? The Python function would be mostly transfor
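Editor's note: a sketch of the RDD.pipe route suggested later in this thread (script path and record format are placeholders): serialize each row, stream it through the external Python process, and parse what comes back.

```scala
import org.apache.spark.sql.Encoders

// One JSON string per row in, one transformed JSON string per row out.
val piped = df.toJSON.rdd.pipe("python /opt/jobs/transform.py")

// Rebuild a DataFrame from the script's output (json(Dataset[String]) needs Spark 2.2+).
val resultDF = spark.read.json(spark.createDataset(piped)(Encoders.STRING))
```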

Re: Run Python User Defined Functions / code in Spark with Scala Codebase

2018-07-04 Thread Chetan Khatri
Can someone please advise? Thanks. On Tue 3 Jul, 2018, 5:28 PM Chetan Khatri, wrote: > Hello Dear Spark User / Dev, > > I would like to pass Python user defined function to Spark Job developed > using Scala and return value of that function would be returned to DF / > Datas

Re: Run Python User Defined Functions / code in Spark with Scala Codebase

2018-07-05 Thread Chetan Khatri
Prem, sure. Thanks for the suggestion. On Wed, Jul 4, 2018 at 8:38 PM, Prem Sure wrote: > try .pipe(.py) on RDD > > Thanks, > Prem > > On Wed, Jul 4, 2018 at 7:59 PM, Chetan Khatri > wrote: > >> Can someone please suggest me , thanks >> >> On Tue 3 J

Re: Run Python User Defined Functions / code in Spark with Scala Codebase

2018-07-09 Thread Chetan Khatri
Pandas Dataframe for processing and finally write the > results back. > > In the Spark/Scala/Java code, you get an RDD of string, which we convert > back to a Dataframe. > > Feel free to ping me directly in case of questions. > > Thanks, > Jayant > > > On Thu, Jul 5

Re: Run Python User Defined Functions / code in Spark with Scala Codebase

2018-07-15 Thread Chetan Khatri
n.html > > We will continue adding more there. > > Feel free to ping me directly in case of questions. > > Thanks, > Jayant > > > On Mon, Jul 9, 2018 at 9:56 PM, Chetan Khatri > wrote: > >> Hello Jayant, >> >> Thank you so much for suggestion.

How to do efficient self join with Spark-SQL and Scala

2018-09-21 Thread Chetan Khatri
Dear Spark Users, I came across a slightly weird MS SQL query to replace with Spark, and I have no clue how to do it in an efficient way with Scala + Spark SQL. Can someone please throw light on this? I can create a view of the DataFrame and do it as spark.sql(query), but I would like to do it with Scala + Spark
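Editor's note: for reference, a plain DataFrame self-join with aliases (names are placeholders) is the Scala-side equivalent of registering a view and calling spark.sql; repartitioning or bucketing on the join key is usually what makes it efficient.

```scala
import org.apache.spark.sql.functions.col

val a = ordersDF.as("a")
val b = ordersDF.as("b")

// Pair different orders of the same customer.
val selfJoined = a.join(b,
    col("a.customer_id") === col("b.customer_id") && col("a.order_id") =!= col("b.order_id"))
  .select(col("a.customer_id"), col("a.order_id"), col("b.order_id").as("other_order_id"))
```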

Spark 2.3.0 with HDP Got completely successfully but status FAILED with error

2018-11-21 Thread Chetan Khatri
Hello Spark Users, I am working with Spark 2.3.0 on the HDP distribution, where my Spark job completed successfully but the final job status is FAILED with the error below. What is the best way to prevent this kind of error? Thanks 8/11/21 17:38:15 INFO ApplicationMaster: Final app status: SUCCEEDED, ex

How to Keep Null values in Parquet

2018-11-21 Thread Chetan Khatri
Hello Spark Users, I have a DataFrame with some null values; when I am writing it to Parquet it fails with the error below: Caused by: java.lang.RuntimeException: Unsupported data type NullType. at scala.sys.package$.error(package.scala:27) at org.apache.spark.sql.execution.data
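Editor's note: this error usually comes from columns whose type was inferred as NullType (for example a bare lit(null) without a cast), which Parquet cannot store. A sketch of the usual fix: cast every NullType column to a concrete type before writing.

```scala
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{NullType, StringType}

val fixedDF = df.schema.fields.foldLeft(df) { (acc, field) =>
  if (field.dataType == NullType) acc.withColumn(field.name, col(field.name).cast(StringType))
  else acc
}

fixedDF.write.mode("overwrite").parquet("/tmp/output_parquet")
```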

Re: How to Keep Null values in Parquet

2018-11-21 Thread Chetan Khatri
gt; > See also https://issues.apache.org/jira/browse/SPARK-10943. > > — Soumya > > > On Nov 21, 2018, at 9:29 PM, Chetan Khatri > wrote: > > Hello Spark Users, > > I have a Dataframe with some of Null Values, When I am writing to parquet > it is failing with below error

Increase time for Spark Job to be in Accept mode in Yarn

2019-01-22 Thread Chetan Khatri
Hello Spark Users, Can you please tell me how to increase the time for Spark job to be in *Accept* mode in Yarn. Thank you. Regards, Chetan

Re: Increase time for Spark Job to be in Accept mode in Yarn

2019-01-23 Thread Chetan Khatri
wrote: > Hi , please tell me why you need to increase the time? > > > > > > At 2019-01-22 18:38:29, "Chetan Khatri" > wrote: > > Hello Spark Users, > > Can you please tell me how to increase the time for Spark job to be in > *Accept* mode in Yarn. > > Thank you. Regards, > Chetan > > > > >

dropDuplicate on timestamp based column unexpected output

2019-04-03 Thread Chetan Khatri
Hello Dear Spark Users, I am using dropDuplicates on a DataFrame generated from a large Parquet file (from HDFS), deduplicating on a timestamp-based column; every time I run it, it drops different rows for the same timestamp. What I tried and what worked: val wSpec = Window.partition
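Editor's note: dropDuplicates keeps an arbitrary row among ties, which is why the output changes between runs. A sketch of the deterministic alternative discussed in this thread (key, ordering and tie-breaker columns assumed): rank rows per key with a window and keep the first.

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

val wSpec = Window.partitionBy("invoice_id")
  .orderBy(col("update_time").desc, col("row_id").desc)   // second column breaks timestamp ties

val latestPerInvoice = df
  .withColumn("rn", row_number().over(wSpec))
  .filter(col("rn") === 1)
  .drop("rn")
```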

Re: dropDuplicate on timestamp based column unexpected output

2019-04-03 Thread Chetan Khatri
27;wanted_time').dropDuplicates('invoice_id', 'update_time') > > The min() is faster than doing an orderBy() and a row_number(). > And the dropDuplicates at the end ensures records with two values for the > same 'update_time' don't cause issues. >

Re: dropDuplicate on timestamp based column unexpected output

2019-04-04 Thread Chetan Khatri
eems like it's meant for cases where you > literally have redundant duplicated data. And not for filtering to get > first/last etc. > > > On Thu, Apr 4, 2019 at 11:46 AM Chetan Khatri > wrote: > >> Hello Abdeali, Thank you for your response. >> >> Can you

Re: dropDuplicate on timestamp based column unexpected output

2019-04-04 Thread Chetan Khatri
g that is faster. When I ran is on my data ~8-9GB > I think it took less than 5 mins (don't remember exact time) > > On Thu, Apr 4, 2019 at 1:09 PM Chetan Khatri > wrote: > >> Thanks for awesome clarification / explanation. >> >> I have cases where update_time can

Re: dropDuplicate on timestamp based column unexpected output

2019-04-04 Thread Chetan Khatri
t; wrote: > >> How much memory do you have per partition? >> >> On Thu, Apr 4, 2019 at 7:49 AM Chetan Khatri >> wrote: >> >>> I will get the information and will share with you. >>> >>> On Thu, Apr 4, 2019 at 5:03 PM Abdeali Kothari

How to print DataFrame.show(100) to text file at HDFS

2019-04-13 Thread Chetan Khatri
Hello Users, In Spark, when I have a DataFrame and do .show(100), I want to save the printed output as-is to a text file in HDFS. How can I do this? Thanks
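Editor's note: show() only prints to stdout, so to keep a preview on HDFS you have to write the rows yourself. A sketch (paths are placeholders):

```scala
// Tab-separated rendering of the first 100 rows, written as a single text file.
df.limit(100)
  .rdd
  .map(_.mkString("\t"))
  .coalesce(1)
  .saveAsTextFile("hdfs:///tmp/df_preview")

// Or let the CSV writer handle quoting and headers.
df.limit(100).coalesce(1).write.option("header", "true").csv("hdfs:///tmp/df_preview_csv")
```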

Re: How to print DataFrame.show(100) to text file at HDFS

2019-04-14 Thread Chetan Khatri
el("OFF") > > > spark.table("").show(100,truncate=false) > > But is there any specific reason you want to write it to hdfs? Is this for > human consumption? > > Regards, > Nuthan > > On Sat, Apr 13, 2019 at 6:41 PM Chetan Khatri > wrote: > >

Usage of Explicit Future in Spark program

2019-04-21 Thread Chetan Khatri
Hello Spark Users, Someone has suggested breaking 5-5 unpredictable transformation blocks into Future[ONE STRING ARGUMENT] blocks and claims this can tune the performance. I am wondering, is this a valid use of explicit Futures in Spark? Sample code is below: def writeData( tableName: String): Future[String]
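Editor's note: for context, a sketch of how such Future-returning blocks are usually driven (table names are placeholders): the only gain is that independent Spark jobs get submitted concurrently; any single job runs no faster.

```scala
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
import scala.concurrent.{Await, Future}

// Kick off the independent writes, then wait for all of them to finish.
val jobs: Seq[Future[String]] = Seq("table_a", "table_b", "table_c").map(writeData)
val results: Seq[String]      = Await.result(Future.sequence(jobs), Duration.Inf)
```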
