Hi everyone,
I have a huge dataframe with 1 billion rows, and each row is a nested list.
I want to train some ML models on this df, but due to its huge size I get an
out-of-memory error on one of my nodes when I run the fit function.
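For reference, this is roughly the kind of call that blows up (a minimal
sketch, not my actual code; spark.ml's LogisticRegression and the input path
are stand-ins):

from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("big-fit").getOrCreate()

# the real dataframe has ~1 billion rows; this schema is a stand-in, with a
# vector "features" column and a numeric "label" column
df = spark.read.parquet("/path/to/training_data")

lr = LogisticRegression(featuresCol="features", labelCol="label")
model = lr.fit(df)  # the step that fails with the out-of-memory error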
Currently, my configuration is:
144 cores, 16 cores for e
Hi Marco,
thanks a ton, I will surely use those alternatives.
Regards,
Gourav Sengupta
On Sun, Aug 6, 2017 at 3:45 PM, Marco Mistroni wrote:
> Sengupta
> Further to this, if you try the following notebook in Databricks cloud,
> it will read a .csv file, write to a parquet file and read it a
Sengupta
Further to this, if you try the following notebook in Databricks cloud, it
will read a .csv file, write to a parquet file and read it again (just to
count the number of rows stored).
Please note that the path to the csv file might differ for you.
So, what you will need to do is
1 - cre
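For reference, a minimal sketch of what such a notebook does (untested as
written here, and the paths are illustrative, so adjust them for your
environment):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

# read the source .csv file (this path will differ for you)
df = spark.read.option("header", "true").csv("/FileStore/tables/mydata.csv")

# write the data out as a parquet file
df.write.mode("overwrite").parquet("/tmp/mydata.parquet")

# read the parquet file back, just to count the number of rows stored
print(spark.read.parquet("/tmp/mydata.parquet").count())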
Uh, believe me, there are lots of people on this list who will send you code
snippets if you ask... 😀
Yes, that is what Steve pointed out, suggesting also that for that simple
exercise you should perform all operations on a Spark standalone instead
(or, alternatively, use an NFS on the cluster).
I'd agree with his sugg
Hi Marco,
For the first time in several years, FOR THE VERY FIRST TIME, I am seeing
someone actually executing code and providing a response. It feels wonderful
that at least someone considered responding back by executing code and did
not filter out each and every technical detail to brood only
Hi Marco,
I am sincerely obliged for your kind time and response. Can you please try
the solution that you have so kindly suggested?
It will be a lot of help if you could kindly execute the code that I have
given. I don't think that anyone has yet.
There are lots of fine responses to my question
I use CIFS; it works reasonably well, is easily cross-platform, and is well
documented...
> On Aug 4, 2017, at 6:50 AM, Steve Loughran wrote:
>
>
>> On 3 Aug 2017, at 19:59, Marco Mistroni wrote:
>>
>> Hello
>> my 2 cents here, hope it helps
>> If you want to just play around with Spark, I'd
> On 3 Aug 2017, at 19:59, Marco Mistroni wrote:
>
> Hello
> my 2 cents here, hope it helps
> If you want to just play around with Spark, I'd leave Hadoop out; it's an
> unnecessary dependency that you don't need for just running a Python script.
> Instead, do the following:
> - go to the roo
Hello
my 2 cents here, hope it helps
If you want to just play around with Spark, I'd leave Hadoop out; it's
an unnecessary dependency that you don't need for just running a Python
script.
Instead, do the following (a sketch of such a script follows below):
- go to the root of your master / slave node and create a directory
/root/pyscripts
-
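A rough sketch of what the script in /root/pyscripts could look like (the
file name and the tiny inline dataset are made up for illustration):

# /root/pyscripts/dataprocessing.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("standalone-test").getOrCreate()

# a tiny in-memory dataset, so no Hadoop or shared file system is needed
df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "value"])
print(df.count())

spark.stop()

You would then submit it with something like:
./bin/spark-submit --master spark://<your-master-host>:7077 /root/pyscripts/dataprocessing.py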
Hi Steve,
I love you mate, thanks a ton once again for ACTUALLY RESPONDING.
I am now going through the documentation (
https://github.com/steveloughran/hadoop/blob/s3guard/HADOOP-13786-committer/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/s3a_committer_architecture.md)
and it makes
On 2 Aug 2017, at 20:05, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
Hi Steve,
I have written a sincere note of apology to everyone in a separate email. I
sincerely request your kind forgiveness in advance if anything in my emails
sounds impolite.
Let me first
because the executor ran on
>> the driver itself. There is not much use to a setup where you don't have
>> some kind of distributed file system, so I would encourage you to use HDFS,
>> or a mounted file system shared by all nodes.
>>
>>
>>
>> Regards,
>>
>
>
>
> From: Gourav Sengupta [mailto:gourav.sengu...@gmail.com]
> Sent: Monday, July 31, 2017 9:54 PM
> To: Riccardo Ferrari
> Cc: user
> Subject: Re: SPARK Issue in Standalone cluster
>
>
>
> Hi Riccardo
Sent: Monday, July 31, 2017 9:54 PM
To: Riccardo Ferrari
Cc: user
Subject: Re: SPARK Issue in Standalone cluster
Hi Riccardo,
I am grateful for your kind response.
Also, I am sure that your answer is completely wrong and erroneous. SPARK must
have a method so that different executors do not pick up the
so I would encourage you to use HDFS,
> or a mounted file system shared by all nodes.
>
>
>
> Regards,
>
> Mahesh
>
>
>
>
>
> From: Gourav Sengupta [mailto:gourav.sengu...@gmail.com]
> Sent: Monday, July 31, 2017 9:54 PM
> To: Riccardo Ferrari
> Cc
, or a mounted
file system shared by all nodes.
Regards,
Mahesh
From: Gourav Sengupta [mailto:gourav.sengu...@gmail.com]
Sent: Monday, July 31, 2017 9:54 PM
To: Riccardo Ferrari
Cc: user
Subject: Re: SPARK Issue in Standalone cluster
Hi Riccardo,
I am grateful for your kind response.
Also I am
Hi Riccardo,
I am grateful for your kind response.
Also, I am sure that your answer is completely wrong and erroneous. SPARK
must have a method so that different executors do not pick up the same
files to process. You also did not answer the question of why the
processing was successful in SPAR
Hi Gourav,
The issue here is the location you're trying to write to / read from:
/Users/gouravsengupta/Development/spark/sparkdata/test1/p...
When dealing with clusters, all the paths and resources should be available
to all executors (and the driver), and that is the reason why you generally use
HDFS, S
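To make that concrete, a hedged sketch (the HDFS namenode host and paths
below are illustrative, not from the original code):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shared-storage-example").getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# writing to a driver-local path (e.g. somewhere under /Users/...) fails on a
# cluster, because executors on other nodes cannot see that file system

# a shared location works: HDFS here, and S3 behaves the same way, since it
# is reachable from every executor as well as the driver
df.write.mode("overwrite").parquet("hdfs://namenode:8020/user/gourav/test1")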