Thanks, you meant in a for loop. Could you please put pseudocode in Spark?
On Fri, Jun 19, 2020 at 8:39 AM Jörn Franke wrote:
> Make every JSON object a line and then read it as jsonlines, not as multiline.
>
> On 19.06.2020 at 14:37, Chetan Khatri wrote:
>
> All transactions in JSON, it is not a single array.
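A minimal sketch of that idea, assuming the file really is a sequence of (possibly pretty-printed) JSON objects and that input.json and flattened.jsonl are placeholder paths. The flattening step is a single-machine streaming pass (here with Jackson, which ships with Spark), only the final read is distributed, and it may not be exactly what Jörn had in mind:

import java.io.{File, PrintWriter}
import com.fasterxml.jackson.databind.{JsonNode, ObjectMapper}

val mapper = new ObjectMapper()
val parser = mapper.getFactory.createParser(new File("input.json"))   // placeholder path
val writer = new PrintWriter(new File("flattened.jsonl"))             // placeholder path

// Iterate over the concatenated top-level JSON values without loading the whole file.
val objects = mapper.readValues(parser, classOf[JsonNode])
while (objects.hasNext) {
  writer.println(objects.next().toString)  // toString emits compact, single-line JSON
}
writer.close()

// Every object is now one line, so the default (line-delimited) reader can split the work.
val df = spark.read.json("flattened.jsonl")

Once every object sits on its own line, the default line-delimited reader can split the file across executor tasks instead of parsing it in one piece.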
Make every JSON object a line and then read it as jsonlines, not as multiline.
> On 19.06.2020 at 14:37, Chetan Khatri wrote:
>
> All transactions in JSON, it is not a single array.
>
>> On Thu, Jun 18, 2020 at 12:55 PM Stephan Wehner
>> wrote:
>> It's an interesting problem. What is the structure of the file? One big
>> array? One hash with many key-value pairs?
All transactions in JSON, it is not a single array.
On Thu, Jun 18, 2020 at 12:55 PM Stephan Wehner
wrote:
> It's an interesting problem. What is the structure of the file? One big
> array? One hash with many key-value pairs?
>
> Stephan
>
> On Thu, Jun 18, 2020 at 6:12 AM Chetan Khatri
> wrote:
Yes
On Thu, Jun 18, 2020 at 12:34 PM Gourav Sengupta
wrote:
> Hi,
> So you have a single JSON record in multiple lines?
> And all the 50 GB is in one file?
>
> Regards,
> Gourav
>
> On Thu, 18 Jun 2020, 14:34 Chetan Khatri,
> wrote:
>
>> It is dynamically generated and written to an S3 bucket, not historical
>> data, so I guess it doesn't have the jsonlines format.
It's an interesting problem. What is the structure of the file? One big
array? One hash with many key-value pairs?
Stephan
On Thu, Jun 18, 2020 at 6:12 AM Chetan Khatri
wrote:
> Hi Spark Users,
>
> I have a 50 GB JSON file that I would like to read and persist to HDFS so it
> can be taken into the next transformation.
Hi,
So you have a single JSON record in multiple lines?
And all the 50 GB is in one file?
Regards,
Gourav
On Thu, 18 Jun 2020, 14:34 Chetan Khatri,
wrote:
> It is dynamically generated and written to an S3 bucket, not historical data,
> so I guess it doesn't have the jsonlines format.
>
> On Thu, Jun 18, 2020 at 9:16 AM Jörn Franke wrote:
It is dynamically generated and written to an S3 bucket, not historical data,
so I guess it doesn't have the jsonlines format.
On Thu, Jun 18, 2020 at 9:16 AM Jörn Franke wrote:
> Depends on the data types you use.
>
> Do you have it in jsonlines format? Then the amount of memory plays much less
> of a role.
>
The file is available in an S3 bucket.
On Thu, Jun 18, 2020 at 9:15 AM Patrick McCarthy
wrote:
> Assuming that the file can be easily split, I would divide it into a
> number of pieces and move those pieces to HDFS before using Spark at all,
> using `hdfs dfs` or similar. At that point you can use your executors to
> perform the reading instead of the driver.
Hi,
What is the size of one JSON document?
There is also the scan of your JSON to infer the schema; the overhead can be
huge.
Two solutions: define a schema and use it directly during the load, or ask
Spark to analyse only a small part of the JSON file (I don't remember how to
do it).
Regards,
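A rough sketch of both suggestions (the field names and the s3a:// path are placeholders, not the real transaction schema): Spark's JSON reader accepts an explicit schema, and its samplingRatio option limits schema inference to a fraction of the input.

import org.apache.spark.sql.types._

// Option 1: supply the schema up front so the full-file inference scan is skipped.
val txSchema = StructType(Seq(
  StructField("id", StringType),
  StructField("amount", DoubleType),
  StructField("timestamp", TimestampType)
))
val withSchema = spark.read.schema(txSchema).json("s3a://bucket/transactions/")

// Option 2: still infer the schema, but only from roughly 1% of the records.
val sampled = spark.read.option("samplingRatio", 0.01).json("s3a://bucket/transactions/")

With an explicit schema the inference pass disappears entirely; with samplingRatio only the sampled records are parsed during inference.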
Depends on the data types you use.
Do you have it in jsonlines format? Then the amount of memory plays much less of
a role.
Otherwise, if it is one large object or array, I would not recommend it.
> On 18.06.2020 at 15:12, Chetan Khatri wrote:
>
>
> Hi Spark Users,
>
> I have a 50 GB JSON file that I would like to read and persist to HDFS so it
> can be taken into the next transformation.
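For context on the two modes being contrasted here, a small sketch (paths are placeholders): the default reader treats each line as one record and can split the file across executors, while multiLine mode has to hand a whole file to a single parser.

// jsonlines / NDJSON: one object per line, so the input is splittable and a
// 50 GB file becomes many small tasks spread over the executors.
val lineDelimited = spark.read.json("s3a://bucket/transactions.jsonl")

// One big pretty-printed object or array: multiLine parses whole files, so a
// single 50 GB file lands in a single task and memory becomes the limit.
val multiline = spark.read.option("multiLine", true).json("s3a://bucket/transactions.json")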
Assuming that the file can be easily split, I would divide it into a number
of pieces and move those pieces to HDFS before using Spark at all, using
`hdfs dfs` or similar. At that point you can use your executors to perform
the reading instead of the driver.
On Thu, Jun 18, 2020 at 9:12 AM Chetan Khatri wrote:
Hi Spark Users,
I have a 50 GB JSON file that I would like to read and persist to HDFS so it
can be taken into the next transformation. I am trying to read it as
spark.read.json(path), but this gives an out-of-memory error on the driver.
Obviously, I can't afford 50 GB of driver memory. In general, what
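Pulling the thread's suggestions together, a minimal end-to-end sketch for the stated goal, assuming the data has first been made line-delimited (one JSON object per line); the paths and the choice of Parquet are illustrative, not from the original post:

// Read the line-delimited JSON; each executor parses its own slice of the file,
// so nothing close to 50 GB ever has to fit on the driver.
val transactions = spark.read.json("s3a://bucket/transactions.jsonl")

// Persist to HDFS in a columnar format so the next transformation can pick it
// up with the schema already resolved.
transactions.write.mode("overwrite").parquet("hdfs:///data/transactions")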