Hi Users,
What is the equivalent of Pandas' df.dropna(axis='columns') in Spark/Scala?
Thanks
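There is no single built-in equivalent; a minimal sketch (an assumption, not a
Spark API) that drops every column containing at least one null could look like:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, count, when}

// Drop every column that contains at least one null, mimicking
// Pandas' df.dropna(axis='columns').
def dropColumnsWithNulls(df: DataFrame): DataFrame = {
  // Count nulls per column in a single pass over the data.
  val nullCounts = df.select(df.columns.map(c =>
    count(when(col(c).isNull, c)).alias(c)): _*).head()
  val keep = df.columns.filter(c => nullCounts.getAs[Long](c) == 0L)
  df.select(keep.map(col): _*)
}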
Hi All, Collect in Spark is taking a huge amount of time. I want to get the list
of values of one column into a Scala collection. How can I do this?
val newDynamicFieldTablesDF = cachedPhoenixAppMetaDataForCreateTableDF
.select(col("reporting_table")).except(clientSchemaDF)
logger.info(s"###
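A minimal sketch of getting one column back to the driver, assuming the
newDynamicFieldTablesDF from the snippet above and that the distinct set of
values fits in driver memory:

// Collect a single column into a Scala collection on the driver.
// Only safe when the (distinct) values comfortably fit in driver memory.
val reportingTables: Array[String] = newDynamicFieldTablesDF
  .select("reporting_table")
  .distinct()
  .collect()
  .map(_.getString(0))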
May 5, 2021 at 10:15 PM Chetan Khatri
wrote:
> Hi All, Collect in spark is taking huge time. I want to get list of values
> of one column to Scala collection. How can I do this?
> val newDynamicFieldTablesDF = cachedPhoenixAppMetaDataForCreateTableDF
> .select(col("
Hi Spark Users,
I want to use dropDuplicates, but I would like to log the records that I
discard to an instrumentation table.
What would be the best approach to do that?
Thanks
>
> hope this helps
>
> Thanks
> Sachit
>
> On Tue, Jun 22, 2021, 22:23 Chetan Khatri
> wrote:
>
>> Hi Spark Users,
>>
>> I want to use DropDuplicate, but those records which I discard. I
>> would like to log to the instrumental table.
>>
>> What would be the best approach to do that?
>>
>> Thanks
>>
>
I am looking for a built-in API, if one exists at all.
On Tue, Jun 22, 2021 at 1:16 PM Chetan Khatri
wrote:
> this has been very slow
>
> On Tue, Jun 22, 2021 at 1:15 PM Sachit Murarka
> wrote:
>
>> Hi Chetan,
>>
>> You can subtract the data frame or use except
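Building on the subtract/except suggestion above, a minimal sketch (table and
column names are placeholders) for logging the rows that dropDuplicates discards:

// Deduplicate, then log whatever was dropped to an audit table.
val deduped   = inputDF.dropDuplicates("business_key")
val discarded = inputDF.except(deduped)   // rows removed by dropDuplicates
// Note: rows that are exact copies of the kept row will not show up here,
// since except compares whole rows and removes duplicates.
discarded.write
  .mode("append")
  .saveAsTable("instrumentation.discarded_duplicates")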
Hi Everyone, I need help with my Airflow DAG which has a Spark submit; I now
have a Kubernetes cluster instead of the Hortonworks Linux distributed Spark
cluster. My existing spark-submit goes through a BashOperator, as below:
calculation1 = '/usr/hdp/2.6.5.0-292/spark2/bin/spark-submit --conf
spark.yarn.maxAppA
Hi Dear Spark Users,
It has been many years since I have worked with Spark; please help me. Thanks
much.
I have different cities and their coordinates in a DataFrame[Row]. I want to
find the distance in km and then show only those records/cities which are
within 10 km.
I have a function created that can
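The original function is truncated here; a minimal sketch of a haversine-based
approach (not the poster's function, and the column names are assumptions):

import org.apache.spark.sql.functions.{col, udf}

// Great-circle distance in kilometres between two (lat, lon) pairs.
val haversineKm = udf { (lat1: Double, lon1: Double, lat2: Double, lon2: Double) =>
  val r = 6371.0 // mean Earth radius in km
  val dLat = math.toRadians(lat2 - lat1)
  val dLon = math.toRadians(lon2 - lon1)
  val a = math.pow(math.sin(dLat / 2), 2) +
    math.cos(math.toRadians(lat1)) * math.cos(math.toRadians(lat2)) *
    math.pow(math.sin(dLon / 2), 2)
  2 * r * math.asin(math.sqrt(a))
}

val nearby = cityPairsDF
  .withColumn("distance_km",
    haversineKm(col("lat1"), col("lon1"), col("lat2"), col("lon2")))
  .filter(col("distance_km") <= 10)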
Hi Spark Users,
Is there any Job/Career channel for Apache Spark?
Thank you
va:135)
at
org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:237)
What would be the resolution for this?
Thanks in Advance !
--
Yours Aye,
Chetan Khatri.
ray is one object, it cannot be split into multiple
> partition.
>
>
> On Tue, Oct 18, 2016 at 3:44 PM Chetan Khatri
> wrote:
>
>> Hello Community members,
>>
>> I am getting error while reading large JSON file in spark,
>>
>> *Code:*
>>
>
apply(SparkPlan.scala:130)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
--
Yours Aye,
Chetan Khatri.
then it clears broadcast and accumulator shared variables.
Can we speed this up?
Thanks.
--
Yours Aye,
Chetan Khatri.
M.+91 7 80574
Data Science Researcher
INDIA
f you are appending a small amount of data to a
> large existing Parquet dataset.
>
> If that's the case, you may disable Parquet summary files by setting
> Hadoop configuration " parquet.enable.summary-metadata" to false.
>
> We've disabled it by default since 1
resolve the current issue.
It takes more time to clear broadcasts, accumulators, etc.
Can we tune this with the Spark 1.6.1 MapR distribution?
On Oct 27, 2016 2:34 PM, "Mehrez Alachheb" wrote:
> I think you should just shut down your SparkContext at the end.
> sc.stop()
>
> 201
Hello Guys,
What would be the approach to accomplish a Spark multiple shared context, both
without Alluxio and with Alluxio, and what would be the best practice to
achieve parallelism and concurrency for Spark jobs?
Thanks.
--
Yours Aye,
Chetan Khatri.
M.+91 7 80574
Data Science Researcher
INDIA
batch where flag is 0 or 1.
I am looking for a best-practice approach with any distributed tool.
Thanks.
- Chetan Khatri
>
>
> On Wed, Dec 21, 2016 at 2:28 AM, Chetan Khatri <
> chetan.opensou...@gmail.com> wrote:
>
>> Hello Guys,
>>
>> I would like to understand different approach for Distributed Incremental
>> load from HBase, Is there any *tool / incubactor tool* which
h for an Uber-less jar. Guys, can you please explain to me the best-practice
industry standard for the same?
Thanks,
Chetan Khatri.
Hello Community,
The current approach I am using for Spark job development is Scala + SBT and
an uber jar, with a YAML properties file to pass configuration parameters. But
if I would like to use dependency injection and microservice development (like
Spring Boot features) in Scala, then what would be the stan
us).
>
> ---
> Regards,
> Andy
>
> On Fri, Dec 23, 2016 at 6:44 AM, Chetan Khatri <
> chetan.opensou...@gmail.com> wrote:
>
>> Hello Spark Community,
>>
>> For Spark Job Creation I use SBT Assembly to build Uber("Super") Jar and
>>
dy
>
> On Fri, Dec 23, 2016 at 6:00 PM, Chetan Khatri <
> chetan.opensou...@gmail.com> wrote:
>
>> Andy, Thanks for reply.
>>
>> If we download all the dependencies at separate location and link with
>> spark job jar on spark cluster, is it best way to execute
> After such rows are obtained, it is up to you how the result of processing
> is delivered to hbase.
>
> Cheers
>
> On Wed, Dec 21, 2016 at 8:00 AM, Chetan Khatri <
> chetan.opensou...@gmail.com> wrote:
>
>> Ok, Sure will ask.
>>
>> But what would be
Hello Users / Developers,
I am using Hive 2.0.1 with MySQL as the metastore; can you tell me which
version is most compatible with Spark 2.0.2?
Thanks
Hello Spark Community,
I am reading an HBase table from Spark and getting an RDD, but now I want to
convert that RDD of Spark Rows to a DataFrame.
*Source Code:*
bin/spark-shell --packages
it.nerdammer.bigdata:spark-hbase-connector_2.10:1.0.3 --conf
spark.hbase.host=127.0.0.1
import it.nerdamme
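A minimal sketch of the RDD-to-DataFrame step being asked about, assuming an
RDD[Row] named hbaseRowRDD and a hand-written schema (both placeholders):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Supply an explicit schema and let SparkSession build the DataFrame.
val schema = StructType(Seq(
  StructField("rowkey", StringType, nullable = false),
  StructField("name",   StringType, nullable = true)
))

val hbaseDF = spark.createDataFrame(hbaseRowRDD, schema)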
, unable to check from the error what exactly it is.
Thanks.
On Wed, Dec 28, 2016 at 9:00 PM, Chetan Khatri
wrote:
> Hello Spark Community,
>
> I am reading HBase table from Spark and getting RDD but now i wants to
> convert RDD of Spark Rows and want to convert to DF.
>
nd we've found (from having different
> versions as well) that older versions are mostly compatible. Some things
> fail occasionally, but we haven't had too many problems running different
> versions with the same metastore in practice.
>
> rb
>
> On Wed, Dec 28
tlS, https://freebusy.io/la...@mapflat.com
>
>
> On Fri, Dec 23, 2016 at 11:56 AM, Chetan Khatri
> wrote:
> > Hello Community,
> >
> > Current approach I am using for Spark Job Development with Scala + SBT
> and
> > Uber Jar with yml properties file to pass config
using org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles to load the
> data into hbase.
>
> For your use case, the producer needs to find rows where the flag is 0 or
> 1.
> After such rows are obtained, it is up to you how the result of processing
> is delivered to hbase.
>
> Cheers
>
> On Wed, De
t at Row level.
>
> On Wed, Jan 4, 2017 at 10:37 PM, Chetan Khatri <
> chetan.opensou...@gmail.com> wrote:
>
>> Ted Yu,
>>
>> You understood wrong, i said Incremental load from HBase to Hive,
>> individually you can say Incremental Import f
Ayan, thanks.
Correct, I am not thinking in RDBMS terms; I am wearing NoSQL glasses!
On Fri, Jan 6, 2017 at 3:23 PM, ayan guha wrote:
> IMHO you should not "think" HBase in RDMBS terms, but you can use
> ColumnFilters to filter out new records
>
> On Fri, Jan 6, 2017 at
Hello Community,
I am struggling to save a DataFrame to a Hive table.
Versions:
Hive 1.2.1
Spark 2.0.1
*Working code:*
/*
@Author: Chetan Khatri
/* @Author: Chetan Khatri Description: This Scala script has written for
HBase to Hive module, which reads table from HBase and dump it out to Hive
chema.struct);
stdDf: org.apache.spark.sql.DataFrame = [stid: string, name: string ... 3
more fields]
Thanks.
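For reference, a minimal sketch of the Hive write itself, assuming the stdDf
from above and a SparkSession built with Hive support (table name is a
placeholder):

// saveAsTable needs a Hive-enabled SparkSession.
val spark = org.apache.spark.sql.SparkSession.builder()
  .appName("hbase-to-hive")
  .enableHiveSupport()
  .getOrCreate()

stdDf.write
  .mode("overwrite")
  .saveAsTable("default.student")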
On Tue, Jan 17, 2017 at 12:48 AM, Chetan Khatri wrote:
> Hello Community,
>
> I am struggling to save Dataframe to Hive Table,
>
> Versions:
>
> Hive 1.2.
Hello,
I have the following services configured and installed successfully:
Hadoop 2.7.x
Spark 2.0.x
HBase 1.2.4
Hive 1.2.1
*Installation Directories:*
/usr/local/hadoop
/usr/local/spark
/usr/local/hbase
*Hive Environment variables:*
#HIVE VARIABLES START
export HIVE_HOME=/usr/local/hive
expo
Connect with Bangalore - Spark Meetup group.
On Thu, Jan 19, 2017 at 3:07 PM, Deepak Sharma
wrote:
> Yes.
> I will be there before 4 PM .
> Whats your contact number ?
> Thanks
> Deepak
>
> On Thu, Jan 19, 2017 at 2:38 PM, Sirisha Cheruvu
> wrote:
>
>> Are we meeting today?!
>>
>> On Jan 18, 20
Hello Spark Community Folks,
Currently I am using HBase 1.2.4 and Hive 1.2.1; I am looking for a bulk load
from HBase to Hive.
I have seen a couple of good examples in the HBase GitHub repo:
https://github.com/apache/hbase/tree/master/hbase-spark
If I would like to use HBaseContext with HBase 1.2.4, how
Yu wrote:
> Though no hbase release has the hbase-spark module, you can find the
> backport patch on HBASE-14160 (for Spark 1.6)
>
> You can build the hbase-spark module yourself.
>
> Cheers
>
> On Wed, Jan 25, 2017 at 3:32 AM, Chetan Khatri <
> chetan.opensou...@gmai
Not outdated at all, because other methods have dependencies on SparkContext,
so you have to create it.
For example,
https://gist.github.com/chetkhatri/f75c2b743e6cb2d7066188687448c5a1
On Fri, Jan 27, 2017 at 2:06 PM, Wojciech Indyk
wrote:
> Hi!
> In this doc http://spark.apache.org/d
use Hive EXTERNAL TABLE
> with
>
> STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'.
>
>
> Try this if you problem can be solved
>
>
> https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration
>
>
> Regards
>
> Amrit
>
>
>
TotalOrderPartitioner
(sorts data, producing a large number of region files)
Import HFiles into HBase
HBase can merge files if necessary
On Sat, Jan 28, 2017 at 11:32 AM, Chetan Khatri wrote:
> @Ted, I dont think so.
>
> On Thu, Jan 26, 2017 at 6:35 AM, Ted Yu wrote:
>
>> Does t
If you are using any other package, pass it as an argument with --packages.
On Sat, Jan 28, 2017 at 8:14 PM, Jacek Laskowski wrote:
> Hi,
>
> How did you start spark-shell?
>
> Jacek
>
> On 28 Jan 2017 11:20 a.m., "Mich Talebzadeh"
> wrote:
>
>>
>> Hi,
>>
>> My spark-streaming application works fine whe
Hello Spark Users,
I am getting an error while saving a Spark DataFrame to a Hive table:
Hive 1.2.1
Spark 2.0.0
Local environment.
Note: the job executes successfully and does what I want, but the exception is
still raised.
*Source Code:*
package com.chetan.poc.hbase
/**
* Created by chetan on 24/1/17.
> since.
>
> Jacek
>
>
> On 29 Jan 2017 9:24 a.m., "Chetan Khatri"
> wrote:
>
> Hello Spark Users,
>
> I am getting error while saving Spark Dataframe to Hive Table:
> Hive 1.2.1
> Spark 2.0.0
> Local environment.
> Note: Job is getting execut
Hello All,
What would be the best approaches to monitor Spark performance? Are there any
tools for Spark job performance monitoring?
Thanks.
> github.com/SparkMonitor/varOne https://github.com/groupon/sparklint
>
> Chetan Khatri wrote on Thu., 16 Feb. 2017
> at 06:15:
>
>> Hello All,
>>
>> What would be the best approches to monitor Spark Performance, is there
>> any tools for Spark Job Performance monitoring ?
>>
>> Thanks.
>>
>
Hello Dev / Users,
I am working on migrating PySpark code to Scala. With Python, iterating over a
dictionary and generating JSON with nulls is possible with json.dumps(), which
is then converted to a SparkSQL Row; but in Scala, how can we generate JSON
with null values as a DataFrame?
Thanks.
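A minimal sketch of one way to keep nulls in the generated JSON, assuming a
Spark version (3.x) where the JSON options include ignoreNullFields:

import org.apache.spark.sql.functions.{col, struct, to_json}

// By default to_json omits null fields; ignoreNullFields=false keeps them.
val withJson = df.withColumn(
  "json",
  to_json(struct(df.columns.map(col): _*), Map("ignoreNullFields" -> "false"))
)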
ot in omitted form, like:
>
> {
> "first_name": "Dongjin"
> }
>
> right?
>
> - Dongjin
>
> On Wed, Mar 8, 2017 at 5:58 AM, Chetan Khatri > wrote:
>
>> Hello Dev / Users,
>>
>> I am working with PySpark Code migration to
Hello Spark Dev's,
Can you please guide me on how to flatten JSON into multiple columns in Spark?
*Example:*
Sr No | Title           | ISBN       | Info
1     | Calculus Theory | 1234567890 | [{"cert":[{"authSbmtr":"009415da-c8cd-418d-869e-0a19601d79fa","certUUID":"03ea5a1a-5530-4fa3-8871-9d1
Georg,
Thank you for the reply; it throws an error because the column is coming in as
a string.
On Tue, Jul 18, 2017 at 11:38 AM, Georg Heiler
wrote:
> df.select ($"Info.*") should help
> Chetan Khatri wrote on Tue., 18 July 2017
> at 08:06:
>
>> Hello Spark Dev's,
>
Explode is not working in this scenario; the error is that a string cannot be
used in explode, only an array or map, in Spark.
On Tue, Jul 18, 2017 at 11:39 AM, 刘虓 wrote:
> Hi,
> have you tried to use explode?
>
> Chetan Khatri wrote on Tuesday, 18 July 2017 at 2:06 PM:
>
>> Hello Spark Dev's,
>
t; <https://github.com/bazaarvoice/jolt>
>>
>>
>>
>>
>> On Monday, July 17, 2017, 11:18:24 PM PDT, Chetan Khatri <
>> chetan.opensou...@gmail.com> wrote:
>>
>>
>> Explode is not working in this scenario with error - string cannot be
&
>> schemas and I didn't want to own the transforms.
>>>>
>>>> I also recommend persisting anything that isn't part of your schema in
>>>> an 'extras field' So when you parse out your json, if you've got anything
>>>> lef
mRole", StringType)
.add("pgmUUID", StringType)
.add("regUUID", StringType)
.add("rtlrsSbmtd", StringType)
On Wed, Jul 19, 2017 at 6:42 PM, Jules Damji wrote:
>
> Another tutorial that complements and shows how to work and extract data
> from nest
Hello All,
I am facing an issue with storing a DataFrame to a Hive table with
partitioning; without partitioning it works fine.
*Spark 2.0.1*
finalDF.write.mode(SaveMode.Overwrite).partitionBy("week_end_date").saveAsTable(OUTPUT_TABLE.get)
and added below configuration too:
spark.sqlContext.setConf("h
Has anyone faced the same kind of issue with Spark 2.0.1?
On Thu, Jul 20, 2017 at 2:08 PM, Chetan Khatri
wrote:
> Hello All,
> I am facing issue with storing Dataframe to Hive table with partitioning ,
> without partitioning it works good.
>
> *Spark 2.0.1*
>
> finalDF.write.mo
Hey Dev / User,
I am working with Spark 2.0.1 and dynamic partitioning with Hive, and I am
facing the issue below:
org.apache.hadoop.hive.ql.metadata.HiveException:
Number of dynamic partitions created is 1344, which is more than 1000.
To solve this try to set hive.exec.max.dynamic.partitions to at least 1
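A minimal sketch of the configuration change discussed below (exact values are
assumptions; they must be at least the number of partitions the job creates,
1344 here):

// Raise the dynamic-partition limits before writing the partitioned table.
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
spark.sql("SET hive.exec.max.dynamic.partitions=2000")
spark.sql("SET hive.exec.max.dynamic.partitions.pernode=2000")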
Jörn, both are the same.
On Fri, Jul 28, 2017 at 4:18 PM, Jörn Franke wrote:
> Try sparksession.conf().set
>
> On 28. Jul 2017, at 12:19, Chetan Khatri
> wrote:
>
> Hey Dev/ USer,
>
> I am working with Spark 2.0.1 and with dynamic partitioning with H
I think it will be same, but let me try that
FYR - https://issues.apache.org/jira/browse/SPARK-19881
On Fri, Jul 28, 2017 at 4:44 PM, ayan guha wrote:
> Try running spark.sql("set yourconf=val")
>
> On Fri, 28 Jul 2017 at 8:51 pm, Chetan Khatri
> wrote:
>
>> Jo
Hello Spark Users,
I am reading an HBase table and writing to a Hive managed table, where I
applied partitioning by a date column. That worked fine, but it generated a
large number of files across almost 700 partitions, and I wanted to use
repartition to reduce file I/O by reducing the number of files inside each
Can anyone please guide me on the above issue?
On Wed, Aug 2, 2017 at 6:28 PM, Chetan Khatri
wrote:
> Hello Spark Users,
>
> I have Hbase table reading and writing to Hive managed table where i
> applied partitioning by date column which worked fine but it has generate
> more num
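A minimal sketch of the repartition idea (names are placeholders): repartition
by the partition column before writing, so each Hive partition ends up with
roughly one file instead of many small ones.

import org.apache.spark.sql.functions.col

hiveDF
  .repartition(col("date_column"))
  .write
  .mode("overwrite")
  .partitionBy("date_column")
  .saveAsTable("db.target_table")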
ill be used for Spark execution, not reserved whatever is
> consuming it and causing the OOM. (If Spark's memory is too low, you'll see
> other problems like spilling too much to disk.)
>
> rb
>
> On Wed, Aug 2, 2017 at 9:02 AM, Chetan Khatri > wrote:
>
>
stly most people
> find this number for their job "experimentally" (e.g. they try a few
> different things).
>
> On Wed, Aug 2, 2017 at 1:52 PM, Chetan Khatri > wrote:
>
>> Ryan,
>> Thank you for reply.
>>
>> For 2 TB of Data what should be the value of
What you can do is create the Hive table with a partition column, for example
date, use val finalDf = df.repartition(df.col("date-column")), and later run
insert overwrite tablename partition(date-column) select * from tempview.
That would work as expected.
On 11-Aug-2017 11:03 PM, "KhajaAsmath Mohammed"
wro
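A minimal sketch of that suggestion (table and column names are placeholders):

import org.apache.spark.sql.functions.col

// Repartition by the partition column, register a temp view, and
// insert-overwrite into the pre-partitioned Hive table.
val finalDf = df.repartition(col("date_column"))
finalDf.createOrReplaceTempView("tempview")

spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
spark.sql(
  "INSERT OVERWRITE TABLE tablename PARTITION (date_column) " +
  "SELECT * FROM tempview")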
Use repartition
On 13-Oct-2017 9:35 AM, "KhajaAsmath Mohammed"
wrote:
> Hi,
>
> I am reading hive query and wiriting the data back into hive after doing
> some transformations.
>
> I have changed setting spark.sql.shuffle.partitions to 2000 and since then
> job completes fast but the main problem
Process data in micro batch
On 18-Oct-2017 10:36 AM, "Chetan Khatri"
wrote:
> Your hard drive don't have much space
> On 18-Oct-2017 10:35 AM, "Mina Aslani" wrote:
>
>> Hi,
>>
>> I get "No space left on device" error in my spark wo
Your hard drive doesn't have much space.
On 18-Oct-2017 10:35 AM, "Mina Aslani" wrote:
> Hi,
>
> I get "No space left on device" error in my spark worker:
>
> Error writing stream to file /usr/spark-2.2.0/work/app-.../0/stderr
> java.io.IOException: No space left on device
>
> In my spark cluste
Hello Spark Users,
I am getting the error below when I am trying to write a dataset to a Parquet
location. I have enough disk space available. Last time I was facing the same
kind of error, which was resolved by increasing the number of cores in the
hyperparameters. Currently the result set data size is almost 400 GB w
Can anybody reply on this?
On Tue, Nov 21, 2017 at 3:36 PM, Chetan Khatri
wrote:
>
> Hello Spark Users,
>
> I am getting below error, when i am trying to write dataset to parquet
> location. I have enough disk space available. Last time i was facing same
> kind of error whic
But you can still use the Stanford NLP library and distribute it through
Spark, right?
On Sun, Nov 26, 2017 at 3:31 PM, Holden Karau wrote:
> So it’s certainly doable (it’s not super easy mind you), but until the
> arrow udf release goes out it will be rather slow.
>
> On Sun, Nov 26, 2017 at 8:01 AM a
All,
I am running on Hortonworks HDP Hadoop with Livy and Spark 2.2.0. When I run
the same Spark job using spark-submit, it succeeds and all transformations are
done.
When I try to do the spark-submit using Livy, the Spark job is invoked and
succeeds, but Yar
All,
I am looking for an approach to do bulk read/write between MS SQL Server and
Apache Spark 2.2; please let me know if there is any library/driver for this.
Thank you.
Chetan
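A minimal sketch using Spark's generic JDBC data source with the SQL Server
JDBC driver (connection details are placeholders); the Azure SQL connector
mentioned in the reply below adds bulk-copy support on top of this:

val jdbcUrl = "jdbc:sqlserver://host:1433;databaseName=mydb"

// Parallel read, partitioned on a numeric key column.
val sourceDF = spark.read
  .format("jdbc")
  .option("url", jdbcUrl)
  .option("dbtable", "dbo.source_table")
  .option("user", "username")
  .option("password", "password")
  .option("partitionColumn", "id")
  .option("lowerBound", "1")
  .option("upperBound", "1000000")
  .option("numPartitions", "8")
  .load()

// Batched write back to SQL Server.
sourceDF.write
  .format("jdbc")
  .option("url", jdbcUrl)
  .option("dbtable", "dbo.target_table")
  .option("user", "username")
  .option("password", "password")
  .option("batchsize", "10000")
  .mode("append")
  .save()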
Try this https://docs.microsoft.com/en-us/azure/sql-database/sql-
> database-spark-connector
>
>
>
>
>
> *From: *Chetan Khatri
> *Date: *Wednesday, May 23, 2018 at 7:47 AM
> *To: *user
> *Subject: *Bulk / Fast Read and Write with MSSQL Server and Spark
>
>
>
>
Super, just giving a high-level idea of what I want to do. I have one source
schema which is MS SQL Server 2008, and the target is also MS SQL Server 2008.
Currently there is a C#-based ETL application which does extract, transform
and load into a customer-specific schema, including indexing etc.
Thanks
On Wed, M
park slaves tend to send lots of data at once
>> to SQL and that slows down the latency of the rest of the system. We
>> overcame this by using sqoop and running it in a controlled environment.
>>
>> On Wed, May 23, 2018 at 7:32 AM Chetan Khatri <
>> chetan.opensou...@g
All,
I have a scenario like this in MS SQL Server where I need to do a GROUP BY
without an aggregate function:
Pseudocode:
select m.student_id, m.student_name, m.student_std, m.student_group,
m.student_dob
from student as m inner join general_register g on m.student_id = g.student_id
group by m.student_
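As the replies note, a GROUP BY over all selected columns with no aggregate is
equivalent to a distinct; a minimal Spark sketch following the pseudocode above
(DataFrame names are assumptions):

import org.apache.spark.sql.functions.col

val result = studentDF.as("m")
  .join(generalRegisterDF.as("g"), col("m.student_id") === col("g.student_id"))
  .select("m.student_id", "m.student_name", "m.student_std",
          "m.student_group", "m.student_dob")
  .distinct()   // same effect as dropDuplicates() on the selected columns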
this the same as select distinct?
>
> Chetan Khatri wrote on Tue., 29 May 2018
> at 20:21:
>
>> All,
>>
>> I have scenario like this in MSSQL Server SQL where i need to do groupBy
>> without Agg function:
>>
>> Pseudocode:
>>
>
18 at 12:08 AM, Georg Heiler > > wrote:
>>
>>> Why do you group if you do not want to aggregate?
>>> Isn't this the same as select distinct?
>>>
>>> Chetan Khatri wrote on Tue., 29 May
>>> 2018 at 20:21:
>>>
>>&g
Georg, sorry for the dumb question. Help me to understand: if I do
DF.select(A, B, C, D).distinct(), would that be the same as the above GROUP BY
without an aggregate in SQL?
On Wed, May 30, 2018 at 12:17 AM, Chetan Khatri wrote:
> I don't want to get any aggregation, just want to know rather saying
&g
ct(): Dataset[T] = dropDuplicates()
>
> …
>
> def dropDuplicates(colNames: Seq[String]): Dataset[T] = withTypedPlan {
>
> …
>
> Aggregate(groupCols, aggCols, logicalPlan)
> }
>
>
>
>
>
>
>
>
>
> *From:* Chetan Khatri [mailto:chetan.opensou.
All,
I would like to apply a Java transformation UDF on a DataFrame created from a
table or flat files and return a new DataFrame object. Any suggestions, with
respect to Spark internals?
Thanks.
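A minimal sketch of one way to do this (the UDF body and names are
placeholders): implement Spark's Java UDF1 interface, register it, and apply it
to a column to get a new DataFrame back.

import org.apache.spark.sql.api.java.UDF1
import org.apache.spark.sql.functions.{callUDF, col}
import org.apache.spark.sql.types.StringType

// A Java-style UDF wrapped for registration from Scala.
val normalize = new UDF1[String, String] {
  override def call(value: String): String =
    if (value == null) null else value.trim.toUpperCase
}

spark.udf.register("normalize", normalize, StringType)

val transformedDF = inputDF.withColumn("name_norm", callUDF("normalize", col("name")))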
Can anyone throw light on this? It would be helpful.
On Tue, Jun 5, 2018 at 1:41 AM, Chetan Khatri
wrote:
> All,
>
> I would like to Apply Java Transformation UDF on DataFrame created from
> Table, Flat Files and retrun new Data Frame Object. Any suggestions, with
> respect to
Hello Dear Spark User / Dev,
I would like to pass a Python user-defined function to a Spark job developed
using Scala, and have the return value of that function returned to the DF /
Dataset API.
Can someone please guide me on the best approach to do this?
The Python function would be mostly transfor
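A minimal sketch of the RDD.pipe approach suggested later in this thread
(script path is a placeholder): stream each partition through an external
Python script and read its stdout back into a DataFrame.

import spark.implicits._

val inputRDD = df.toJSON.rdd                               // one JSON string per row
val pipedRDD = inputRDD.pipe("python3 /path/to/transform.py")
val resultDF = spark.read.json(pipedRDD.toDS())            // back to a DataFrame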
Can someone please suggest an approach? Thanks.
On Tue 3 Jul, 2018, 5:28 PM Chetan Khatri,
wrote:
> Hello Dear Spark User / Dev,
>
> I would like to pass Python user defined function to Spark Job developed
> using Scala and return value of that function would be returned to DF /
> Datas
Prem Sure, thanks for the suggestion.
On Wed, Jul 4, 2018 at 8:38 PM, Prem Sure wrote:
> try .pipe(.py) on RDD
>
> Thanks,
> Prem
>
> On Wed, Jul 4, 2018 at 7:59 PM, Chetan Khatri > wrote:
>
>> Can someone please suggest me , thanks
>>
>> On Tue 3 J
Pandas Dataframe for processing and finally write the
> results back.
>
> In the Spark/Scala/Java code, you get an RDD of string, which we convert
> back to a Dataframe.
>
> Feel free to ping me directly in case of questions.
>
> Thanks,
> Jayant
>
>
> On Thu, Jul 5
n.html
>
> We will continue adding more there.
>
> Feel free to ping me directly in case of questions.
>
> Thanks,
> Jayant
>
>
> On Mon, Jul 9, 2018 at 9:56 PM, Chetan Khatri > wrote:
>
>> Hello Jayant,
>>
>> Thank you so much for suggestion.
Dear Spark Users,
I came across a slightly weird MS SQL query to replace with Spark, and I have
no clue how to do it in an efficient way with Scala + Spark SQL. Can someone
please throw some light on this? I can create a view of the DataFrame and do
it as spark.sql(query), but I would like to do it with Scala + Spark
Hello Spark Users,
I am working with Spark 2.3.0 on the HDP distribution, where my Spark job
completed successfully but the final job status is failed with the error below.
What is the best way to prevent this kind of error? Thanks
8/11/21 17:38:15 INFO ApplicationMaster: Final app status: SUCCEEDED,
ex
Hello Spark Users,
I have a DataFrame with some null values; when I am writing it to Parquet it
is failing with the error below:
Caused by: java.lang.RuntimeException: Unsupported data type NullType.
at scala.sys.package$.error(package.scala:27)
at
org.apache.spark.sql.execution.data
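One common workaround, sketched below (not from this thread, which points to
SPARK-10943): Parquet cannot store NullType columns, so cast any all-null
columns to a concrete type before writing. The output path is a placeholder.

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{NullType, StringType}

// Replace every NullType column with a (nullable) StringType column.
val fixedDF = df.schema.fields.foldLeft(df) { (acc, field) =>
  if (field.dataType == NullType)
    acc.withColumn(field.name, col(field.name).cast(StringType))
  else acc
}

fixedDF.write.mode("overwrite").parquet("/path/to/output")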
>
> See also https://issues.apache.org/jira/browse/SPARK-10943.
>
> — Soumya
>
>
> On Nov 21, 2018, at 9:29 PM, Chetan Khatri
> wrote:
>
> Hello Spark Users,
>
> I have a Dataframe with some of Null Values, When I am writing to parquet
> it is failing with below error
Hello Spark Users,
Can you please tell me how to increase the time a Spark job can stay in the
ACCEPTED state in YARN?
Thank you. Regards,
Chetan
wrote:
> Hi , please tell me why you need to increase the time?
>
>
>
>
>
> At 2019-01-22 18:38:29, "Chetan Khatri"
> wrote:
>
> Hello Spark Users,
>
> Can you please tell me how to increase the time for Spark job to be in
> *Accept* mode in Yarn.
>
> Thank you. Regards,
> Chetan
>
>
>
>
>
Hello Dear Spark Users,
I am using dropDuplicates on a DataFrame generated from a large Parquet file
(from HDFS), with the dropDuplicates based on a timestamp column; every time I
run it, it drops different rows for the same timestamp.
What I tried and what worked:
val wSpec = Window.partition
27;wanted_time').dropDuplicates('invoice_id', 'update_time')
>
> The min() is faster than doing an orderBy() and a row_number().
> And the dropDuplicates at the end ensures records with two values for the
> same 'update_time' don't cause issues.
>
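A minimal Scala sketch of the min()-based approach described above (column
names follow the thread):

import org.apache.spark.sql.functions.min

// Keep, per invoice_id, the row with the smallest update_time, deterministically.
val earliest = df.groupBy("invoice_id")
  .agg(min("update_time").as("update_time"))

val deduped = df.join(earliest, Seq("invoice_id", "update_time"))
  .dropDuplicates("invoice_id", "update_time")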
eems like it's meant for cases where you
> literally have redundant duplicated data. And not for filtering to get
> first/last etc.
>
>
> On Thu, Apr 4, 2019 at 11:46 AM Chetan Khatri
> wrote:
>
>> Hello Abdeali, Thank you for your response.
>>
>> Can you
g that is faster. When I ran is on my data ~8-9GB
> I think it took less than 5 mins (don't remember exact time)
>
> On Thu, Apr 4, 2019 at 1:09 PM Chetan Khatri
> wrote:
>
>> Thanks for awesome clarification / explanation.
>>
>> I have cases where update_time can
t; wrote:
>
>> How much memory do you have per partition?
>>
>> On Thu, Apr 4, 2019 at 7:49 AM Chetan Khatri
>> wrote:
>>
>>> I will get the information and will share with you.
>>>
>>> On Thu, Apr 4, 2019 at 5:03 PM Abdeali Kothari
Hello Users,
In Spark, when I have a DataFrame and do .show(100), I want to save the
printed output as-is to a text file in HDFS.
How can I do this?
Thanks
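show() only prints to the console (its underlying showString is not public),
so a minimal sketch is to capture that console output and write it to HDFS
(paths are placeholders):

import java.io.{ByteArrayOutputStream, PrintStream}
import org.apache.hadoop.fs.{FileSystem, Path}

// Capture what df.show(100) would print.
val buffer = new ByteArrayOutputStream()
Console.withOut(new PrintStream(buffer)) {
  df.show(100, truncate = false)
}

// Write the captured text to a file on HDFS.
val fs  = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val out = fs.create(new Path("/tmp/df_show_output.txt"))
out.write(buffer.toByteArray)
out.close()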
el("OFF")
>
>
> spark.table("").show(100,truncate=false)
>
> But is there any specific reason you want to write it to hdfs? Is this for
> human consumption?
>
> Regards,
> Nuthan
>
> On Sat, Apr 13, 2019 at 6:41 PM Chetan Khatri
> wrote:
>
>
Hello Spark Users,
Someone has suggested breaking five or so unpredictable transformation blocks
into Future[ONE STRING ARGUMENT] and claims this can tune the performance. I
am wondering, is this a valid use of explicit Futures in Spark?
Sample code is below:
def writeData( tableName: String): Future[String]
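For context, a minimal sketch of the pattern being asked about (table names
and paths are placeholders): wrapping independent write actions in Futures only
helps by letting otherwise idle executors run several jobs concurrently; it
does not speed up any single job.

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

def writeData(tableName: String): Future[String] = Future {
  spark.table(tableName).write.mode("overwrite").parquet(s"/out/$tableName")
  tableName
}

// Kick off the independent writes concurrently and wait for all of them.
val jobs = Seq("table_a", "table_b", "table_c").map(writeData)
Await.result(Future.sequence(jobs), Duration.Inf)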