Hi,
I have multiple jobs running in the same driver with FAIR scheduling enabled.
Intermittently one of the stages gets stuck and does not complete even after a
long time.
Each job flow is something like this:
* Create JDBC RDD to load data from SQL Server
* Create temporary table
* Query Tem
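For reference, a minimal sketch of that flow as I understand it (Spark 1.6-style
API; the JDBC URL, credentials, table names and query below are placeholders,
not taken from the original mail):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)

    // 1. Load the table from SQL Server over JDBC
    val df = sqlContext.read.format("jdbc")
      .option("url", "jdbc:sqlserver://dbhost:1433;databaseName=mydb")
      .option("dbtable", "dbo.my_table")
      .option("user", "etl_user")
      .option("password", "secret")
      .load()

    // 2. Register a temporary table
    df.registerTempTable("my_table_tmp")

    // 3. Query the temporary table
    val result = sqlContext.sql("SELECT * FROM my_table_tmp")  // placeholder query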
Thanks Dhaval for the suggestion, but in the case I mentioned in the previous
mail data can still be missed as the row number will change.
-
Manjunath
From: Dhaval Patel
Sent: Monday, May 25, 2020 3:01 PM
To: Manjunath Shetty H
Subject: Re: Parallelising JDBC
Thanks Georg for the suggestion, but at this point changing the design is not
really an option.
Any other pointers would be helpful.
Thanks
Manjunath
From: Georg Heiler
Sent: Monday, May 25, 2020 11:52 AM
To: Manjunath Shetty H
Cc: Mike Artz ; user
Subject
Hi Georg,
Thanks for the response, can you please elaborate on what you mean by change
data capture?
Thanks
Manjunath
From: Georg Heiler
Sent: Monday, May 25, 2020 11:14 AM
To: Manjunath Shetty H
Cc: Mike Artz ; user
Subject: Re: Parallelising JDBC reads in spark
To: Manjunath Shetty H
Cc: user
Subject: Re: Parallelising JDBC reads in spark
Does anything different happen when you set the isolationLevel to do dirty
reads, i.e. "READ_UNCOMMITTED"?
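If the isolationLevel option does not take effect on the read path in your Spark
version, one SQL Server-specific way to get dirty reads is to push a NOLOCK hint
through the dbtable subquery; a rough sketch (connection details and table name
are placeholders, not from this thread):

    // Dirty reads via a SQL Server NOLOCK hint pushed down in the dbtable subquery
    val dirtyReads = sqlContext.read.format("jdbc")
      .option("url", "jdbc:sqlserver://dbhost:1433;databaseName=mydb")
      .option("dbtable", "(SELECT * FROM dbo.my_table WITH (NOLOCK)) AS t")
      .load()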
On Sun, May 24, 2020 at 7:50 PM Manjunath Shetty H
<manjunathshe...@live.com> wrote:
Hi,
We a
Hi,
We are writing an ETL pipeline using Spark that fetches the data from SQL Server
in batch mode (every 15 mins). The problem we are facing is how to parallelise
single-table reads into multiple tasks without missing any data.
We have tried this:
* Use `ROW_NUMBER` window function in th
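For context, the built-in way to split a single JDBC table read into parallel
tasks is the partitioning options of the JDBC source; a minimal sketch (the
column name, bounds and connection details are placeholders). On its own this
does not solve the consistency problem raised in this thread when the table
changes between tasks:

    // Spark issues numPartitions range queries on partitionColumn, one per task
    val df = sqlContext.read.format("jdbc")
      .option("url", "jdbc:sqlserver://dbhost:1433;databaseName=mydb")
      .option("dbtable", "dbo.my_table")
      .option("partitionColumn", "id")   // must be a numeric column in 1.6
      .option("lowerBound", "1")
      .option("upperBound", "1000000")
      .option("numPartitions", "8")
      .load()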
Hi,
I have a dataframe with some columns and data that is fetched from JDBC. As I
have to keep the schema consistent in the ORC file, I have to apply a different
schema to that dataframe. Column names will be the same, but the data or schema
may contain some extra columns.
Is there any way I can app
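One way this is often handled is to project the incoming dataframe onto a fixed
target schema, casting existing columns and adding missing ones as nulls; a
rough sketch (targetSchema is a hypothetical StructType you maintain yourself):

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.{col, lit}
    import org.apache.spark.sql.types.StructType

    def conformTo(df: DataFrame, targetSchema: StructType): DataFrame = {
      val cols = targetSchema.fields.map { f =>
        if (df.columns.contains(f.name)) col(f.name).cast(f.dataType).as(f.name)
        else lit(null).cast(f.dataType).as(f.name)   // column missing in this batch
      }
      df.select(cols: _*)   // extra columns in df are dropped
    }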
Hi All,
Is there any way to store the exact write timestamp in the ORC file through
Spark?
The use case is something like the `current_timestamp()` function in SQL.
Generating it in the program will not be equal to the actual write time of the
ORC/HDFS file.
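The closest workaround I know of is to stamp the rows just before the write and,
if the exact file time matters, take the modification time from the HDFS file
status afterwards; a rough sketch (the path is a placeholder, and writing ORC on
1.6 assumes a HiveContext):

    import org.apache.spark.sql.functions.current_timestamp

    // Approximate: evaluated when the write job runs, not the exact file commit time
    val stamped = df.withColumn("written_at", current_timestamp())
    stamped.write.orc("/data/mytable/batch_001")

    // For the exact time, read FileStatus.getModificationTime of the written
    // files from HDFS after the job (see the FileSystem.listStatus sketch below).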
Any suggestions will be helpful.
Thanks
Manjunath
Hi,
Is it possible to do ORC bucketed queries in Spark 1.6?
Folder structure is like this:
/
bucket1.orc
bucket2.orc
bucket3.orc
And the Spark SQL query will be like `select * from where partition =
partition1 and bucket = bucket1`,
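I'm not aware of automatic bucket pruning in Spark 1.6, but as a workaround you
could point the reader directly at the bucket file under the chosen partition
directory; a sketch (the path layout mirrors the listing above, and a
HiveContext is assumed for ORC):

    // Read only the one bucket file instead of the whole table
    val bucketDf = sqlContext.read.format("orc")
      .load("/warehouse/mytable/partition=partition1/bucket1.orc")
    bucketDf.registerTempTable("bucket1_tmp")
    val rows = sqlContext.sql("SELECT * FROM bucket1_tmp")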
Hi,
I am on Spark 1.6. I am getting an error if I try to run a Hive query in Spark
that involves joining ORC and non-ORC tables in Hive.
Find the error below; any help would be appreciated.
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
TungstenExchange hashpartition
Thanks for the suggestion Netanel,
Sorry for the lack of information; I am specifically looking for something
inside the Hadoop ecosystem.
-
Manjunath
From: Netanel Malka
Sent: Wednesday, March 18, 2020 5:26 PM
To: Manjunath Shetty H
Subject: Re: Saving Spark run stats and
Hi All,
We want to save each Spark batch run's stats (start, end, ID etc.) and the
watermark (last processed timestamp from the external data source).
We have tried Hive JDBC, but it is very slow due to the MR jobs it triggers. We
can't save to normal Hive tables as that will create lots of small files in
HDFS.
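Not something suggested in this thread, but one Hadoop-ecosystem store that
handles tiny, frequently updated state like this well is HBase; a rough sketch
using the plain HBase client (the table, column family and row key names are
made up):

    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
    import org.apache.hadoop.hbase.util.Bytes

    val conf = HBaseConfiguration.create()
    val connection = ConnectionFactory.createConnection(conf)
    val table = connection.getTable(TableName.valueOf("spark_batch_runs"))

    // One row per batch run, keyed by a batch id
    val put = new Put(Bytes.toBytes("batch_20200318_001"))
    put.addColumn(Bytes.toBytes("stats"), Bytes.toBytes("start"), Bytes.toBytes("2020-03-18T17:00:00"))
    put.addColumn(Bytes.toBytes("stats"), Bytes.toBytes("end"), Bytes.toBytes("2020-03-18T17:15:00"))
    put.addColumn(Bytes.toBytes("stats"), Bytes.toBytes("watermark"), Bytes.toBytes("2020-03-18T16:59:30"))
    table.put(put)

    table.close()
    connection.close()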
Ple
Mon., 16 March 2020 at 04:27, Manjunath Shetty H
<manjunathshe...@live.com> wrote:
Hi Georg,
Thanks for the suggestion. Can you please explain a bit more about what you
meant exactly?
Btw, I am on Spark 1.6.
-
Manjunath
From: Georg Heiler
Only partitioned, and the join keys are not sorted because those are written
incrementally with batch jobs.
From: Georg Heiler
Sent: Sunday, March 15, 2020 8:30:53 PM
To: Manjunath Shetty H
Cc: ayan guha ; Magnus Nilsson ; user
Subject: Re: Optimising multiple hive
... though. I really would like a native way to tell Catalyst not to reshuffle
just because you use more columns in the join condition.
On Sun, Mar 15, 2020 at 6:04 AM Manjunath Shetty H
<manjunathshe...@live.com> wrote:
Hi All,
We have 10 tables in data warehouse (hdfs/hive) written usi
Hi All,
We have 10 tables in the data warehouse (HDFS/Hive) written using the ORC
format. We are serving a use case on top of that by joining 4-5 tables using
Hive as of now. But it is not as fast as we want it to be, so we are thinking of
using Spark for this use case.
Any suggestions on this? Is it go
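If you do move the join to Spark, a minimal sketch of what that looks like on
1.6 with a HiveContext (the table and key names are placeholders):

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)

    // Each Hive/ORC table becomes a DataFrame and the join runs in Spark
    val facts = hiveContext.table("warehouse.fact_table")
    val dims = hiveContext.table("warehouse.dim_table")

    val joined = facts.join(dims, Seq("join_key"))
    joined.registerTempTable("serving_view")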
Or is there any way to provide a unique file name to the ORC write function
itself?
Any suggestions will be helpful.
Regards
Manjunath Shetty
From: Manjunath Shetty H
Sent: Wednesday, March 4, 2020 2:28 PM
To: user
Subject: Way to get the file name of the
Thanks Zohar,
Will try that
-
Manjunath
From: Zohar Stiro
Sent: Tuesday, March 3, 2020 1:49 PM
To: Manjunath Shetty H
Cc: user
Subject: Re: How to collect Spark dataframe write metrics
Hi,
to get DataFrame level write metrics you can take a look at the
Hi,
I wanted to know if there is any way to get the output file name that
`Dataframe.orc()` will write to? This is needed to track which file is written
by which job during incremental batch jobs.
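As far as I know the ORC writer does not report the generated part-file names
back to the caller, so a common workaround is to write each batch into its own
directory and list it afterwards; a sketch (the per-batch directory layout is an
assumption):

    import org.apache.hadoop.fs.{FileSystem, Path}

    val outputDir = "/data/mytable/batch_20200304_001"  // one directory per batch
    df.write.orc(outputDir)

    // List the actual part files Spark produced for this batch
    val fs = FileSystem.get(sc.hadoopConfiguration)
    val writtenFiles = fs.listStatus(new Path(outputDir))
      .map(_.getPath.toString)
      .filter(_.contains("part-"))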
Thanks
Manjunath
Hi all,
Basically my use case is to validate the DataFrame row count before and after
writing to HDFS. Is this even a good practice? Or should we rely on Spark for
guaranteed writes?
If it is a good practice to follow, then how do we get DataFrame-level write
metrics?
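A simple count-based sketch of that validation (note it recomputes the DataFrame
once and re-reads the output once, so it is not free; outputDir is a
placeholder):

    val expected = df.count()        // action: computes df once just for the count
    df.write.orc(outputDir)

    val written = sqlContext.read.format("orc").load(outputDir).count()
    require(written == expected,
      s"Row count mismatch: expected $expected, wrote $written")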
Any pointers would b
From: Enrico Minack
Sent: Thursday, February 27, 2020 8:51 PM
To: Manjunath Shetty H ; user@spark.apache.org
Subject: Re: Convert each partition of RDD to Dataframe
Manjunath,
You can define your DataFrame in parallel in a multi-threaded driver.
Enrico
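A minimal sketch of what Enrico describes, using Scala Futures in the driver so
each DataFrame/job is submitted from its own thread (the table names and output
paths are made up):

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration

    val tables = Seq("table_a", "table_b", "table_c")

    // Each Future defines and writes its own DataFrame; with the FAIR scheduler
    // enabled these jobs can run concurrently on the cluster.
    val jobs = tables.map { t =>
      Future {
        val df = sqlContext.sql(s"SELECT * FROM $t")
        df.write.orc(s"/output/$t")
      }
    }
    Await.result(Future.sequence(jobs), Duration.Inf)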
On 27.02.20 at 15:50, Manjunath Shetty H wrote:
Hi
different tables in the first place?
Enrico
On 27.02.20 at 14:53, Manjunath Shetty H wrote:
Hi Vinodh,
Thanks for the quick response. I didn't get what you meant exactly; any
reference or snippet would be helpful.
To explain the problem more:
* I have 10 partitions, each partition load
rame..
On Thu, Feb 27, 2020, 7:29 AM Manjunath Shetty H
<manjunathshe...@live.com> wrote:
Hello All,
In spark i am creating the custom partitions with Custom RDD, each partition
will have different schema. Now in the transformation step we need to get the
schema and run so
Hello All,
In Spark I am creating custom partitions with a custom RDD; each partition will
have a different schema. Now in the transformation step we need to get the
schema and run some DataFrame SQL queries per partition, because each
partition's data has a different schema.
How to get the Data