Hi Team,
I have a dataset like the one below in a .dat file:
13/07/2022abc
PWJ PWJABC 513213217ABC GM20 05. 6/20/39
#01000count
Now I want to extract the header and tail records, which I was able to do.
From the header, I need to extract the date and match it with the
current system date.
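A rough Scala sketch of what I have in mind (the file name and date format
are placeholders, and I am assuming the header is the first line and the
trailer the last):

    import java.time.LocalDate
    import java.time.format.DateTimeFormatter

    // read the whole file as lines; fine for a small single file
    val lines = spark.read.textFile("input.dat").collect()

    val header  = lines.head   // e.g. "13/07/2022abc"
    val trailer = lines.last   // e.g. "#01000count"

    // take the leading 10 characters of the header as the date
    // and compare with the current system date
    val headerDate = LocalDate.parse(header.take(10),
      DateTimeFormatter.ofPattern("dd/MM/yyyy"))
    val matchesToday = headerDate == LocalDate.now()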
Yeah, I understood that now.
Thanks for the explanation, Bjorn.
Sid
On Wed, Jul 6, 2022 at 1:46 AM Bjørn Jørgensen wrote:
> Ehh.. What is "*duplicate column*" ? I don't think Spark supports that.
>
> duplicate column = duplicate rows
>
>
> On Tue, Jul 5, 2022 at 22:13, Bjørn Jørgensen wrote <
>
Hello Ayan,
Thank you for the suggestion. But I would lose the correlation of the JSON
file with the other identifier fields. Also, if there are too many files,
will that be an issue? Plus, I may not have the same schema across all the
files.
Hello Enrico,
>how does RDD's mapPartitions make a difference regarding 1. and 2.
compared to Dataset's mapPartitions / map function?
Hello,
Spark adds an entry to the pending microbatches queue at each batch interval.
Is there a config to set the max size of the pending microbatches queue?
Thanks
Another option is:
1. collect the dataframe with the file paths
2. create a list of paths
3. create a new dataframe with spark.read.json and pass the list of paths
This will save you lots of headaches
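For example, a minimal sketch (assuming the dataframe is called df and the
column holding the file path is called "path"):

    import spark.implicits._

    // collect the paths to the driver as a plain list of strings
    val paths = df.select("path").as[String].collect().toSeq

    // spark.read.json accepts multiple paths in a single call
    val jsonDf = spark.read.json(paths: _*)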
Ayan
On Wed, Jul 13, 2022 at 7:35 AM Enrico Minack wrote:
> Hi,
>
> how does RDD's mapPartitions make a difference regarding 1. and 2.
> compared to Dataset's mapPartitions / map function?
Hi,
how does RDD's mapPartitions make a difference regarding 1. and 2.
compared to Dataset's mapPartitions / map function?
Enrico
On 12.07.22 at 22:13, Muthu Jayakumar wrote:
Hello Enrico,
Thanks for the reply. I found that I would have to use `mapPartitions`
API of RDD to perform this safely, as I have to read each file from GCS
using the HDFS FileSystem API and parse each JSON record in a safe manner.
Hello Enrico,
Thanks for the reply. I found that I would have to use the `mapPartitions` API
of RDD to perform this safely, as I have to:
1. Read each file from GCS using HDFS FileSystem API.
2. Parse each JSON record in a safe manner.
For (1) to work, I do have to broadcast the HadoopConfiguration from the
SparkContext.
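Roughly what I am sketching (pathsRdd is assumed to be an RDD[String] of GCS
file paths, and Jackson stands in for whatever parser is actually used):

    import scala.collection.JavaConverters._
    import scala.util.Try
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import com.fasterxml.jackson.databind.ObjectMapper

    // broadcast the Hadoop configuration as plain key/value pairs
    // so each executor can rebuild it
    val confMap = spark.sparkContext.hadoopConfiguration.iterator().asScala
      .map(e => e.getKey -> e.getValue).toMap
    val confBc = spark.sparkContext.broadcast(confMap)

    val parsed = pathsRdd.mapPartitions { paths =>
      // rebuild the Hadoop configuration on the executor
      val conf = new Configuration(false)
      confBc.value.foreach { case (k, v) => conf.set(k, v) }
      val mapper = new ObjectMapper()

      paths.map { p =>
        val path = new Path(p)
        val fs = path.getFileSystem(conf)
        val in = fs.open(path)
        val content = try scala.io.Source.fromInputStream(in).mkString
                      finally in.close()
        // Try keeps per-record parse failures from failing the whole partition
        (p, Try(mapper.readTree(content)).toOption)
      }
    }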
I have some problems, and I am trying to find out whether there is no solution
for them (due to the current implementation) or whether there is a way that I
am simply not aware of.
1)
Currently, we can enable and configure dynamic resource allocation based on
the documentation below.
https://spark.apache.org/docs/late
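For context, this is roughly how we enable it today (values are placeholders;
spark.dynamicAllocation.shuffleTracking.enabled is the Spark 3.x option for
running without the external shuffle service):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("dynamic-allocation-example")
      .config("spark.dynamicAllocation.enabled", "true")
      .config("spark.dynamicAllocation.minExecutors", "1")
      .config("spark.dynamicAllocation.maxExecutors", "20")
      .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
      .getOrCreate()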