Re: How is Spark a memory based solution if it writes data to disk before shuffles?

2022-07-05 Thread Gourav Sengupta
Hi, SPARK is just one of the technologies out there now; there are several other technologies far outperforming SPARK, or at least as good as SPARK. Regards, Gourav On Sat, Jul 2, 2022 at 7:42 PM Sid wrote: > So as per the discussion, shuffle stage output is also stored on disk and > not in

Re: How is Spark a memory based solution if it writes data to disk before shuffles?

2022-07-05 Thread Apostolos N. Papadopoulos
First of all, define "far outperforming". For sure, there is no GOD system that does everything perfectly. Which use cases are you referring to? It would be interesting for the community to see some comparisons. a. On 5/7/22 12:29, Gourav Sengupta wrote: Hi, SPARK is just one of the tec
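The shuffle-to-disk behaviour under debate in this thread is easy to observe directly. A minimal PySpark sketch (the app name, scratch path, and key columns below are illustrative assumptions, not from the thread):

```python
# Sketch: a wide transformation (groupBy) triggers a shuffle, whose map
# output is written to local disk under spark.local.dir before reduce
# tasks fetch it -- regardless of how much memory is available.
from pyspark.sql import SparkSession
from pyspark.sql.functions import count

spark = (
    SparkSession.builder
    .appName("shuffle-to-disk-demo")
    # shuffle files land under this directory (hypothetical path;
    # honoured in local mode, may be overridden by the cluster manager)
    .config("spark.local.dir", "/tmp/spark-scratch")
    .getOrCreate()
)

df = spark.range(0, 1_000_000).withColumnRenamed("id", "key")

# groupBy forces an exchange: map tasks write sorted shuffle blocks to disk
result = df.groupBy((df.key % 100).alias("bucket")).agg(count("*"))
result.collect()

# While the job runs, /tmp/spark-scratch holds shuffle_*.data files.
# "Memory based" refers to caching and in-memory processing of partitions,
# not to the shuffle exchange itself.
```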

Reading snappy/lz4 compressed csv/json files

2022-07-05 Thread Yeachan Park
Hi all, We are trying to use Spark to read csv/json files that have been snappy/lz4 compressed. The files were compressed with the lz4 command line tool and the python snappy library. Neither succeeded, while other formats (bzip2 & gzip) worked fine. I've read in some places that the codec is not f
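A likely explanation, stated here as an assumption based on Hadoop codec behaviour: Spark delegates decompression of .lz4/.snappy text files to Hadoop's Lz4Codec/SnappyCodec, which use Hadoop-specific block framing, while the `lz4` CLI (LZ4 frame format) and python-snappy produce different framing. gzip and bzip2 are self-describing standard streams, so they work everywhere. A sketch (paths are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("codec-check").getOrCreate()

# Works: gzip is a standard, self-describing stream
df_ok = spark.read.json("s3://bucket/data/events.json.gz")  # hypothetical path

# Fails once executed: Hadoop's Lz4Codec expects its own block framing,
# not the LZ4 frame format produced by the lz4 CLI
df_bad = spark.read.json("s3://bucket/data/events.json.lz4")  # hypothetical path

# One workaround: let Spark write the compressed files itself, so the
# Hadoop codec produces framing it can read back later
df_ok.write.option("compression", "lz4").json("s3://bucket/data/out")
```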

Re: How reading works?

2022-07-05 Thread Sid
Hi Team, I still need help understanding how reading works exactly. Thanks, Sid On Mon, Jun 20, 2022 at 2:23 PM Sid wrote: > Hi Team, > > Can somebody help? > > Thanks, > Sid > > On Sun, Jun 19, 2022 at 3:51 PM Sid wrote: >> >> Hi, >> >> I already have a partitioned JSON dataset in s3 like

Re: How reading works?

2022-07-05 Thread Bjørn Jørgensen
"*but I am getting the issue of the duplicate column which was present in the old dataset.*" So you have answered your question! spark.read.option("multiline","true").json("path").filter( col("edl_timestamp")>last_saved_timestamp) As you have figured out, spark read all the json files in "path" t

Re: How reading works?

2022-07-05 Thread Bjørn Jørgensen
Ehh.. What is "*duplicate column*"? I don't think Spark supports that. duplicate column = duplicate rows tir. 5. jul. 2022 kl. 22:13 skrev Bjørn Jørgensen : > "*but I am getting the issue of the duplicate column which was present in > the old dataset.*" > > So you have answered your question!
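If "duplicate column" does mean duplicate *rows*, as suggested above, deduplicating after the incremental read is straightforward. A sketch (path and key column names are assumed for illustration):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dedupe").getOrCreate()
df = spark.read.json("s3://bucket/dataset/")  # placeholder path

# Drop rows that are identical across all columns
deduped = df.dropDuplicates()

# Or deduplicate on assumed key columns, keeping one row per key
deduped_by_key = df.dropDuplicates(["id", "edl_timestamp"])
```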

Reading parquet strips non-nullability from schema

2022-07-05 Thread Greg Kopff
Hi. I’ve spent the last couple of hours trying to chase down an issue with writing/reading parquet files. I was trying to save (and then read in) a parquet file with a schema that sets my non-nullability details correctly. After having no success for some time, I posted to Stack Overflow abou
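For context, Spark's SQL documentation states that when reading Parquet files, all columns are automatically converted to be nullable for compatibility reasons, so the written nullable=false is not round-tripped. A minimal reproduction (the output path is a placeholder):

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("nullability").getOrCreate()

schema = StructType([StructField("name", StringType(), nullable=False)])
df = spark.createDataFrame([("alice",)], schema=schema)
df.printSchema()  # name: string (nullable = false)

df.write.mode("overwrite").parquet("/tmp/nullability-demo")  # placeholder path
spark.read.parquet("/tmp/nullability-demo").printSchema()
# name: string (nullable = true)  <- non-nullability stripped on read
```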