How to use pattern matching in Spark

2022-07-12 Thread Sid
Hi Team, I have a dataset like the one below in a .dat file: 13/07/2022abc PWJ PWJABC 513213217ABC GM20 05. 6/20/39 #01000count I want to extract the header and tail records, which I was able to do. Now, from the header, I need to extract the date and match it with the current system date
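
A minimal sketch of one way to do the date check, assuming the header is the first line of the .dat file, the date sits at the start of the header in dd/MM/yyyy format, and the path and pattern names below are placeholders:

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

// Read the raw file as text; the header is assumed to be the first record.
val lines  = spark.read.textFile("/path/to/file.dat")   // placeholder path
val header = lines.first()

// Assumed layout: the header starts with a dd/MM/yyyy date (e.g. 13/07/2022).
val datePattern = """^(\d{2}/\d{2}/\d{4})""".r
val headerDate = datePattern.findFirstMatchIn(header).map { m =>
  LocalDate.parse(m.group(1), DateTimeFormatter.ofPattern("dd/MM/yyyy"))
}

// Compare the extracted header date with the current system date.
val matchesToday = headerDate.contains(LocalDate.now())
```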

Re: How reading works?

2022-07-12 Thread Sid
Yeah, I understood that now. Thanks for the explanation, Bjørn. Sid On Wed, Jul 6, 2022 at 1:46 AM Bjørn Jørgensen wrote: > Ehh.. What is "*duplicate column*"? I don't think Spark supports that. > > duplicate column = duplicate rows > > > Tue, 5 Jul 2022 at 22:13, Bjørn Jørgensen wrote:

Re: reading each JSON file from dataframe...

2022-07-12 Thread Muthu Jayakumar
Hello Ayan, Thank you for the suggestion. But I would lose the correlation of the JSON file with the other identifier fields. Also, if there are too many files, will it be an issue? Plus, I may not have the same schema across all the files. Hello Enrico, >how does RDD's mapPartitions make a difference

Spark streaming pending microbatches queue max length

2022-07-12 Thread Anil Dasari
Hello, Spark adds an entry to the pending microbatches queue at each batch interval. Is there a config to set the max size of the pending microbatches queue? Thanks

Re: reading each JSON file from dataframe...

2022-07-12 Thread ayan guha
Another option is: 1. collect the dataframe with the file paths 2. create a list of paths 3. create a new dataframe with spark.read.json and pass the list of paths This will save you lots of headaches Ayan On Wed, Jul 13, 2022 at 7:35 AM Enrico Minack wrote: > Hi, > > how does RDD's mapPartitions m
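
A rough sketch of that suggestion, assuming the paths live in a String column named "path" of a dataframe `df` (both names are illustrative); spark.read.json accepts multiple paths:

```scala
import spark.implicits._

// 1. + 2. Collect the file paths from the dataframe into a local list
//    (fine as long as the number of paths is manageable on the driver).
val paths: Seq[String] = df.select("path").as[String].collect().toSeq

// 3. Read all the JSON files into a single dataframe in one pass.
val jsonDf = spark.read.json(paths: _*)
```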

Re: reading each JSON file from dataframe...

2022-07-12 Thread Enrico Minack
Hi, how does RDD's mapPartitions make a difference regarding 1. and 2. compared to Dataset's mapPartitions / map function? Enrico On 12.07.22 at 22:13, Muthu Jayakumar wrote: Hello Enrico, Thanks for the reply. I found that I would have to use the `mapPartitions` API of RDD to perform this safely

Re: reading each JSON file from dataframe...

2022-07-12 Thread Muthu Jayakumar
Hello Enrico, Thanks for the reply. I found that I would have to use the `mapPartitions` API of RDD to perform this safely, as I have to 1. Read each file from GCS using the HDFS FileSystem API. 2. Parse each JSON record in a safe manner. For (1) to work, I do have to broadcast the HadoopConfiguration from the SparkContext
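
Not knowing the exact setup, here is a minimal sketch of that shape. It assumes the file paths sit in a String column named "path", and it broadcasts the Hadoop settings as a plain Map (rebuilding the Configuration per partition) rather than broadcasting the Configuration object itself; all names are illustrative:

```scala
import scala.collection.JavaConverters._
import scala.io.Source
import scala.util.Try
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import spark.implicits._

// Broadcast the driver's Hadoop settings as serializable key/value pairs.
val confMap = spark.sparkContext.hadoopConfiguration.iterator().asScala
  .map(e => e.getKey -> e.getValue).toMap
val confBc = spark.sparkContext.broadcast(confMap)

val contents = df.select("path").as[String].rdd.mapPartitions { paths =>
  // 1. Rebuild the Hadoop Configuration once per partition.
  val conf = new Configuration(false)
  confBc.value.foreach { case (k, v) => conf.set(k, v) }

  paths.map { p =>
    val path = new Path(p)
    val fs   = path.getFileSystem(conf)
    // 2. Read each file defensively; failures become None instead of failing the task.
    val body: Option[String] = Try {
      val in = fs.open(path)
      try Source.fromInputStream(in, "UTF-8").mkString finally in.close()
    }.toOption
    (p, body)
  }
}
```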

[Spark][Core] Resource Allocation

2022-07-12 Thread Amin Borjian
I have a few questions where I am trying to find out whether there is no solution (due to the current implementation) or whether there is a way I was simply not aware of. 1) Currently, we can enable and configure dynamic resource allocation based on the documentation below. https://spark.apache.org/docs/late
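
For context, dynamic allocation is usually switched on with a handful of settings; a minimal sketch with arbitrary values (shuffle tracking assumes Spark 3.0+, otherwise an external shuffle service is needed):

```scala
import org.apache.spark.sql.SparkSession

// Illustrative values only; tune the executor bounds for the actual cluster.
val spark = SparkSession.builder()
  .appName("dynamic-allocation-example")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.dynamicAllocation.minExecutors", "1")
  .config("spark.dynamicAllocation.maxExecutors", "20")
  .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
  // Spark 3.0+: allows dynamic allocation without an external shuffle service.
  .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
  .getOrCreate()
```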