Gzip files are not splittable. Hence, using very large (i.e.
non-partitioned) gzip files leads to contention when reading them, as
readers cannot scale beyond the number of gzip files.
Better to use a splittable compression format instead, to allow
frameworks to scale up, or to manually manage the splitting by writing
many smaller files.
Hello Stelios, a friendly reminder: could you share any sample code/repo?
Are you using a schema registry?
Thanks
Kiran
On Fri, Apr 8, 2022 at 4:37 PM Kiran Biswal wrote:
> Hello Stelios
>
> Just a gentle follow up if you can share any sample code/repo
>
> Regards
> Kiran
>
Yeah, Stelios. It worked. Could you please post it as an answer so
that I can accept it on the post and it can be of help to other people?
Thanks,
Sid
On Mon, May 30, 2022 at 4:42 PM Stelios Philippou wrote:
> Sid,
>
> According to the error I am seeing there, this is a date format
> issue.
Sid,
According to the error I am seeing there, this is a date format issue:
Text '5/1/2019 1:02:16' could not be parsed
But your time format is specified as
'M/dd/yyyy H:mm:ss'
You can see that the day in the input is /1/, but your format uses dd,
which expects two digits.
Please try the format 'M/d/yyyy H:mm:ss' instead.
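For reference, a minimal PySpark sketch of that fix (the column name
ts_str and the single sample row are made up for illustration):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("ts-parse-demo").getOrCreate()

# Hypothetical one-row DataFrame reproducing the failing input.
df = spark.createDataFrame([("5/1/2019 1:02:16",)], ["ts_str"])

# 'M/dd/yyyy H:mm:ss' rejects the single-digit day in '5/1/2019';
# 'M/d/yyyy H:mm:ss' accepts both one- and two-digit days.
df = df.withColumn("ts", F.to_timestamp("ts_str", "M/d/yyyy H:mm:ss"))
df.show(truncate=False)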
Hi Team,
I am able to convert to timestamp. However, when I try to filter the
records on a specific value, it gives the error mentioned in the
post. Could you please help me with this?
https://stackoverflow.com/questions/72422897/unable-to-format-timestamp-in-pyspark/72423394#72423394
Thanks.
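For illustration, a sketch of filtering on the converted column
(column names and the cutoff value are hypothetical; the linked post
has the exact details):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("ts-filter-demo").getOrCreate()

# Hypothetical sample data; parse first, then filter.
df = spark.createDataFrame(
    [("5/1/2019 1:02:16",), ("5/2/2019 9:15:00",)], ["ts_str"]
).withColumn("ts", F.to_timestamp("ts_str", "M/d/yyyy H:mm:ss"))

# Compare against a timestamp value, not a raw string, so the filter
# uses timestamp semantics rather than string comparison.
filtered = df.filter(F.col("ts") >= F.to_timestamp(F.lit("2019-05-02 00:00:00")))
filtered.show(truncate=False)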
Eventually the problem was solved. I am still not 100% sure what caused
it, but when I said the input was identical, I simplified a bit, because
it was not (sorry for misleading; I thought this information would just
be noise).
Explanation: the input to the EMR job was gzips created by Firehose