Thanks, you meant in a for loop. Could you please put pseudocode in Spark?
On Fri, Jun 19, 2020 at 8:39 AM Jörn Franke wrote:
> Make every JSON object a line and then read it as jsonlines, not as multiline.
>
> On 19.06.2020 at 14:37, Chetan Khatri wrote:
>
> All transactions in JSON, it is not a single array.
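A minimal sketch of that idea, assuming the file really is a sequence of (possibly pretty-printed) JSON objects and that input.json and flattened.jsonl are placeholder paths. The flattening step is a single-machine streaming pass (here with Jackson, which ships with Spark), only the final read is distributed, and it may not be exactly what Jörn had in mind:

import java.io.{File, PrintWriter}
import com.fasterxml.jackson.databind.{JsonNode, ObjectMapper}

val mapper = new ObjectMapper()
val parser = mapper.getFactory.createParser(new File("input.json"))   // placeholder path
val writer = new PrintWriter(new File("flattened.jsonl"))             // placeholder path

// Iterate over the concatenated top-level JSON values without loading the whole file.
val objects = mapper.readValues(parser, classOf[JsonNode])
while (objects.hasNext) {
  writer.println(objects.next().toString)  // toString emits compact, single-line JSON
}
writer.close()

// Every object is now one line, so the default (line-delimited) reader can split the work.
val df = spark.read.json("flattened.jsonl")

Once every object sits on its own line, the default line-delimited reader can split the file across executor tasks instead of parsing it in one piece.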
Make every JSON object a line and then read it as jsonlines, not as multiline.
> On 19.06.2020 at 14:37, Chetan Khatri wrote:
>
> All transactions in JSON, it is not a single array.
>
>> On Thu, Jun 18, 2020 at 12:55 PM Stephan Wehner
>> wrote:
>> It's an interesting problem. What is the structure of the file? One big
>> array? One hash with many key-value pairs?
All transactions in JSON, it is not a single array.
On Thu, Jun 18, 2020 at 12:55 PM Stephan Wehner
wrote:
> It's an interesting problem. What is the structure of the file? One big
> array? One hash with many key-value pairs?
>
> Stephan
>
> On Thu, Jun 18, 2020 at 6:12 AM Chetan Khatri
> wrote:
Yes
On Thu, Jun 18, 2020 at 12:34 PM Gourav Sengupta
wrote:
> Hi,
> So you have a single JSON record in multiple lines?
> And all the 50 GB is in one file?
>
> Regards,
> Gourav
>
> On Thu, 18 Jun 2020, 14:34 Chetan Khatri,
> wrote:
>
>> It is dynamically generated and written to an S3 bucket, not historical
>> data, so I guess it doesn't have the jsonlines format.
It's an interesting problem. What is the structure of the file? One big
array? One hash with many key-value pairs?
Stephan
On Thu, Jun 18, 2020 at 6:12 AM Chetan Khatri
wrote:
> Hi Spark Users,
>
> I have a 50 GB JSON file that I would like to read and persist to HDFS so it
> can be taken into the next transformation.
Hi,
So you have a single JSON record in multiple lines?
And all the 50 GB is in one file?
Regards,
Gourav
On Thu, 18 Jun 2020, 14:34 Chetan Khatri,
wrote:
> It is dynamically generated and written to an S3 bucket, not historical data,
> so I guess it doesn't have the jsonlines format.
>
> On Thu, Jun 18, 2020 at 9:16 AM Jörn Franke wrote:
It is dynamically generated and written to an S3 bucket, not historical data,
so I guess it doesn't have the jsonlines format.
On Thu, Jun 18, 2020 at 9:16 AM Jörn Franke wrote:
> Depends on the data types you use.
>
> Do you have it in jsonlines format? Then the amount of memory plays much less
> of a role.
>
The file is available in an S3 bucket.
On Thu, Jun 18, 2020 at 9:15 AM Patrick McCarthy
wrote:
> Assuming that the file can be easily split, I would divide it into a
> number of pieces and move those pieces to HDFS before using Spark at all,
> using `hdfs dfs` or similar. At that point you can use your executors to
> perform the reading instead of the driver.
Hi,
What is the size of one JSON document?
There is also the scan of your JSON to infer the schema; the overhead can be
huge.
Two solutions: define a schema and use it directly during the load, or ask
Spark to analyse only a small part of the JSON file (I don't remember how to
do it).
Regards,
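A rough sketch of both suggestions (the field names and the s3a:// path are placeholders, not the real transaction schema): Spark's JSON reader accepts an explicit schema, and its samplingRatio option limits schema inference to a fraction of the input.

import org.apache.spark.sql.types._

// Option 1: supply the schema up front so the full-file inference scan is skipped.
val txSchema = StructType(Seq(
  StructField("id", StringType),
  StructField("amount", DoubleType),
  StructField("timestamp", TimestampType)
))
val withSchema = spark.read.schema(txSchema).json("s3a://bucket/transactions/")

// Option 2: still infer the schema, but only from roughly 1% of the records.
val sampled = spark.read.option("samplingRatio", 0.01).json("s3a://bucket/transactions/")

With an explicit schema the inference pass disappears entirely; with samplingRatio only the sampled records are parsed during inference.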
Depends on the data types you use.
Do you have it in jsonlines format? Then the amount of memory plays much less of
a role.
Otherwise, if it is one large object or array, I would not recommend it.
> On 18.06.2020 at 15:12, Chetan Khatri wrote:
>
>
> Hi Spark Users,
>
> I have a 50 GB JSON file that I would like to read and persist to HDFS so it
> can be taken into the next transformation.
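For context on the two modes being contrasted here, a small sketch (paths are placeholders): the default reader treats each line as one record and can split the file across executors, while multiLine mode has to hand a whole file to a single parser.

// jsonlines / NDJSON: one object per line, so the input is splittable and a
// 50 GB file becomes many small tasks spread over the executors.
val lineDelimited = spark.read.json("s3a://bucket/transactions.jsonl")

// One big pretty-printed object or array: multiLine parses whole files, so a
// single 50 GB file lands in a single task and memory becomes the limit.
val multiline = spark.read.option("multiLine", true).json("s3a://bucket/transactions.json")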
Assuming that the file can be easily split, I would divide it into a number
of pieces and move those pieces to HDFS before using Spark at all, using
`hdfs dfs` or similar. At that point you can use your executors to perform
the reading instead of the driver.
On Thu, Jun 18, 2020 at 9:12 AM Chetan Khatri wrote:
Hi Spark Users,
I have a 50 GB JSON file that I would like to read and persist to HDFS so it
can be taken into the next transformation. I am trying to read it as
spark.read.json(path), but this gives an out-of-memory error on the driver.
Obviously, I can't afford 50 GB of driver memory. In general, what
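Pulling the thread's suggestions together, a minimal end-to-end sketch for the stated goal, assuming the data has first been made line-delimited (one JSON object per line); the paths and the choice of Parquet are illustrative, not from the original post:

// Read the line-delimited JSON; each executor parses its own slice of the file,
// so nothing close to 50 GB ever has to fit on the driver.
val transactions = spark.read.json("s3a://bucket/transactions.jsonl")

// Persist to HDFS in a columnar format so the next transformation can pick it
// up with the schema already resolved.
transactions.write.mode("overwrite").parquet("hdfs:///data/transactions")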