Dear all,

I recompiled Spark on Windows and it seems to work better. My problem with PySpark remains: https://issues.apache.org/jira/browse/SPARK-12261

I do not know how to debug this; it seems to be linked to Pickle and the garbage collector. I would like to clear the Spark context to see if I can gain anything.
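For reference, here is roughly what I mean by clearing the context (a minimal sketch only, assuming the plain local session started by bin\pyspark; the "repro" app name is just illustrative):

    sc.stop()  # stop the current SparkContext and release its resources

    from pyspark import SparkConf, SparkContext
    conf = SparkConf().setMaster("local[1]").setAppName("repro")
    sc = SparkContext(conf=conf)  # recreate a fresh context before re-running the test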
Christopher Bourez
06 17 17 50 60

On Mon, Jan 25, 2016 at 10:14 PM, Christopher Bourez <christopher.bou...@gmail.com> wrote:

> Here is a pic of the memory.
> If I add --conf spark.driver.memory=3g, it increases the displayed memory,
> but the problem remains... for a file that is only 13 MB.
>
> Christopher Bourez
> 06 17 17 50 60
>
> On Mon, Jan 25, 2016 at 10:06 PM, Christopher Bourez <christopher.bou...@gmail.com> wrote:
>
>> The same problem occurs on my desktop at work.
>> What's great with AWS WorkSpaces is that you can easily reproduce it.
>>
>> I created the test file with these commands:
>>
>> for i in {0..300000}; do
>>   VALUE="$RANDOM"
>>   for j in {0..6}; do
>>     VALUE="$VALUE;$RANDOM";
>>   done
>>   echo $VALUE >> test.csv
>> done
>>
>> Christopher Bourez
>> 06 17 17 50 60
>>
>> On Mon, Jan 25, 2016 at 10:01 PM, Christopher Bourez <christopher.bou...@gmail.com> wrote:
>>
>>> Josh,
>>>
>>> Thanks a lot!
>>>
>>> You can download a video I created:
>>> https://s3-eu-west-1.amazonaws.com/christopherbourez/public/video.mov
>>>
>>> I created a 13 MB sample file as explained:
>>> https://s3-eu-west-1.amazonaws.com/christopherbourez/public/test.csv
>>>
>>> Here are the steps I followed:
>>>
>>> I created an AWS WorkSpace with Windows 7 (which I can share with you if
>>> you'd like), Standard instance, 2 GiB RAM.
>>> On this instance I:
>>> downloaded Spark (1.5 or 1.6, same problem) with Hadoop 2.6,
>>> installed the Java 8 JDK,
>>> downloaded Python 2.7.8,
>>> downloaded the sample file
>>> https://s3-eu-west-1.amazonaws.com/christopherbourez/public/test.csv
>>>
>>> Then the commands I launch are:
>>> bin\pyspark --master local[1]
>>> sc.textFile("test.csv").take(1)
>>>
>>> As you can see, sc.textFile("test.csv", 2000).take(1) works well.
>>>
>>> Thanks a lot!
>>>
>>> Christopher Bourez
>>> 06 17 17 50 60
>>>
>>> On Mon, Jan 25, 2016 at 8:02 PM, Josh Rosen <joshro...@databricks.com> wrote:
>>>
>>>> Hi Christopher,
>>>>
>>>> What would be super helpful here is a standalone reproduction. Ideally
>>>> this would be a single Scala file or set of commands that I can run in
>>>> `spark-shell` in order to reproduce this. Ideally, this code would generate
>>>> a giant file, then try to read it in a way that demonstrates the bug. If
>>>> you have such a reproduction, could you attach it to that JIRA ticket?
>>>> Thanks!
>>>>
>>>> On Mon, Jan 25, 2016 at 7:53 AM Christopher Bourez <christopher.bou...@gmail.com> wrote:
>>>>
>>>>> Dear all,
>>>>>
>>>>> I would like to reopen a case for a potential bug (its current status is
>>>>> resolved, but it seems it is not):
>>>>>
>>>>> https://issues.apache.org/jira/browse/SPARK-12261
>>>>>
>>>>> I believe there is something wrong with the memory management under
>>>>> Windows.
>>>>>
>>>>> It does not make sense to only be able to work with files smaller than
>>>>> a few MB...
>>>>>
>>>>> Do not hesitate to ask me questions if you try to help and reproduce
>>>>> the bug.
>>>>>
>>>>> Best,
>>>>>
>>>>> Christopher Bourez
>>>>> 06 17 17 50 60
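PS: For the standalone reproduction requested above by Josh, something like the following is what I have in mind (a rough sketch only, in Python rather than Scala; the file size and the 2000 partition count come from my tests above; run it with bin\spark-submit repro.py):

    # repro.py - sketch of the reproduction described above (Spark 1.5/1.6 on Windows)
    import random

    # generate a ~13 MB CSV: 300001 lines of 8 random integers separated by ';'
    with open("test.csv", "w") as f:
        for i in range(300001):
            f.write(";".join(str(random.randint(0, 32767)) for _ in range(8)) + "\n")

    from pyspark import SparkConf, SparkContext
    sc = SparkContext(conf=SparkConf().setMaster("local[1]").setAppName("SPARK-12261-repro"))

    print(sc.textFile("test.csv").take(1))        # fails for me with the memory error
    print(sc.textFile("test.csv", 2000).take(1))  # works when forcing many small partitions

    sc.stop()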