Re: How is Spark a memory based solution if it writes data to disk before shuffles?

2022-07-05 Thread Gourav Sengupta
Hi, SPARK is just one of the technologies out there now; there are several other technologies far outperforming SPARK, or at least as good as SPARK. Regards, Gourav On Sat, Jul 2, 2022 at 7:42 PM Sid wrote: > So as per the discussion, shuffle stage output is also stored on disk and > not in

Re: How is Spark a memory based solution if it writes data to disk before shuffles?

2022-07-05 Thread Apostolos N. Papadopoulos
First of all, define "far outperforming". For sure, there is no GOD system that does everything perfectly. Which use cases are you referring to? It would be interesting for the community to see some comparisons. a. On 5/7/22 12:29, Gourav Sengupta wrote: Hi, SPARK is just one of the tec
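The shuffle-to-disk behaviour under debate in this thread is easy to observe directly. A minimal PySpark sketch (the app name, scratch path, and key columns below are illustrative assumptions, not from the thread):

```python
# Sketch: a wide transformation (groupBy) triggers a shuffle, whose map
# output is written to local disk under spark.local.dir before reduce
# tasks fetch it -- regardless of how much memory is available.
from pyspark.sql import SparkSession
from pyspark.sql.functions import count

spark = (
    SparkSession.builder
    .appName("shuffle-to-disk-demo")
    # shuffle files land under this directory (hypothetical path;
    # honoured in local mode, may be overridden by the cluster manager)
    .config("spark.local.dir", "/tmp/spark-scratch")
    .getOrCreate()
)

df = spark.range(0, 1_000_000).withColumnRenamed("id", "key")

# groupBy forces an exchange: map tasks write sorted shuffle blocks to disk
result = df.groupBy((df.key % 100).alias("bucket")).agg(count("*"))
result.collect()

# While the job runs, /tmp/spark-scratch holds shuffle_*.data files.
# "Memory based" refers to caching and in-memory processing of partitions,
# not to the shuffle exchange itself.
```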

Reading snappy/lz4 compressed csv/json files

2022-07-05 Thread Yeachan Park
Hi all, We are trying to use Spark to read csv/json files that have been snappy/lz4 compressed. The files were compressed with the lz4 command line tool and the python snappy library. Neither succeeded, while other formats (bzip2 & gzip) worked fine. I've read in some places that the codec is not f
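A likely explanation, stated here as an assumption based on Hadoop codec behaviour: Spark delegates decompression of .lz4/.snappy text files to Hadoop's Lz4Codec/SnappyCodec, which use Hadoop-specific block framing, while the `lz4` CLI (LZ4 frame format) and python-snappy produce different framing. gzip and bzip2 are self-describing standard streams, so they work everywhere. A sketch (paths are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("codec-check").getOrCreate()

# Works: gzip is a standard, self-describing stream
df_ok = spark.read.json("s3://bucket/data/events.json.gz")  # hypothetical path

# Fails once executed: Hadoop's Lz4Codec expects its own block framing,
# not the LZ4 frame format produced by the lz4 CLI
df_bad = spark.read.json("s3://bucket/data/events.json.lz4")  # hypothetical path

# One workaround: let Spark write the compressed files itself, so the
# Hadoop codec produces framing it can read back later
df_ok.write.option("compression", "lz4").json("s3://bucket/data/out")
```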

Re: How reading works?

2022-07-05 Thread Sid
Hi Team, I still need help understanding how reading works exactly. Thanks, Sid On Mon, Jun 20, 2022 at 2:23 PM Sid wrote: > Hi Team, > > Can somebody help? > > Thanks, > Sid > > On Sun, Jun 19, 2022 at 3:51 PM Sid wrote: >> >> Hi, >> >> I already have a partitioned JSON dataset in s3 like

Re: How reading works?

2022-07-05 Thread Bjørn Jørgensen
"*but I am getting the issue of the duplicate column which was present in the old dataset.*" So you have answered your question! spark.read.option("multiline","true").json("path").filter( col("edl_timestamp")>last_saved_timestamp) As you have figured out, spark read all the json files in "path" t

Re: How reading works?

2022-07-05 Thread Bjørn Jørgensen
Ehh.. What is "*duplicate column*"? I don't think Spark supports that. duplicate column = duplicate rows tir. 5. jul. 2022 kl. 22:13 skrev Bjørn Jørgensen : > "*but I am getting the issue of the duplicate column which was present in > the old dataset.*" > > So you have answered your question!
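If "duplicate column" does mean duplicate *rows*, as suggested above, deduplicating after the incremental read is straightforward. A sketch (path and key column names are assumed for illustration):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dedupe").getOrCreate()
df = spark.read.json("s3://bucket/dataset/")  # placeholder path

# Drop rows that are identical across all columns
deduped = df.dropDuplicates()

# Or deduplicate on assumed key columns, keeping one row per key
deduped_by_key = df.dropDuplicates(["id", "edl_timestamp"])
```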

Reading parquet strips non-nullability from schema

2022-07-05 Thread Greg Kopff
Hi. I’ve spent the last couple of hours trying to chase down an issue with writing/reading parquet files. I was trying to save (and then read in) a parquet file with a schema that sets my non-nullability details correctly. After having no success for some time, I posted to Stack Overflow abou
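For context, Spark's SQL documentation states that when reading Parquet files, all columns are automatically converted to be nullable for compatibility reasons, so the written nullable=false is not round-tripped. A minimal reproduction (the output path is a placeholder):

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("nullability").getOrCreate()

schema = StructType([StructField("name", StringType(), nullable=False)])
df = spark.createDataFrame([("alice",)], schema=schema)
df.printSchema()  # name: string (nullable = false)

df.write.mode("overwrite").parquet("/tmp/nullability-demo")  # placeholder path
spark.read.parquet("/tmp/nullability-demo").printSchema()
# name: string (nullable = true)  <- non-nullability stripped on read
```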