>>> names(SALES)[which(names(SALES)=="div_no")]<-"DIV_NO"

This line only creates a new DataFrame. The memory overhead is just the new
DataFrame object itself, not its underlying data, so it should consume very
little memory.
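
If assigning to names() is not supported on a SparkR DataFrame in your
version, withColumnRenamed is the documented way to do the same thing. A
minimal sketch, using the column names from your code (the rename is lazy,
so the operation itself is cheap):

# withColumnRenamed returns a new DataFrame; the old object just becomes
# an unreferenced handle on the same underlying data
SALES <- withColumnRenamed(SALES, "div_no", "DIV_NO")
SALES <- withColumnRenamed(SALES, "store_no", "STORE_NO")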

Which line causes the OOM in your case?
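
Also, note that rm() only removes the R-side reference; R frees the memory
at its next garbage collection, and any Spark-side cache is separate. A
minimal sketch of releasing both, assuming the DataFrame was cached:

# drop the Spark-side cache first (only needed if you called cache/persist)
unpersist(SALES)
# remove the R binding, then trigger R's garbage collector
rm(SALES)
gc()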



On Mon, Nov 23, 2015 at 5:33 PM, Vipul Rai <vipulrai8...@gmail.com> wrote:

> Hi Jeff,
>
> This is only part of the actual code.
>
> My questions are mentioned in comments near the code.
>
> SALES <- SparkR::sql(hiveContext, "select * from sales")
> PRICING <- SparkR::sql(hiveContext, "select * from pricing")
>
>
> ## renaming of columns ##
> #sales file#
>
> # Is this right? Do we have to create a new DF for every column
> # addition to the original DF?
>
> # And if we do that, then what about the older DFs? Will they also take
> # up memory?
>
> names(SALES)[which(names(SALES)=="div_no")]<-"DIV_NO"
> names(SALES)[which(names(SALES)=="store_no")]<-"STORE_NO"
>
> #pricing file#
> names(PRICING)[which(names(PRICING)=="price_type_cd")]<-"PRICE_TYPE"
> names(PRICING)[which(names(PRICING)=="price_amt")]<-"PRICE_AMT"
>
> registerTempTable(SALES,"sales")
> registerTempTable(PRICING,"pricing")
>
> #merging sales and pricing file#
> merg_sales_pricing <- SparkR::sql(hiveContext, "select
> .....................")
>
> head(merg_sales_pricing)
>
>
> Thanks,
> Vipul
>
>
> On 23 November 2015 at 14:52, Jeff Zhang <zjf...@gmail.com> wrote:
>
>> If possible, could you share your code? What kind of operations are you
>> doing on the DataFrame?
>>
>> On Mon, Nov 23, 2015 at 5:10 PM, Vipul Rai <vipulrai8...@gmail.com>
>> wrote:
>>
>>> Hi Jeff,
>>>
>>> Thanks for the reply, but could you tell me why it is taking so much
>>> time? What could be wrong?
>>> Also, when I remove the DataFrame from memory using rm(), the object is
>>> deleted but the memory is not freed.
>>>
>>> Also, what about the R functions that are not supported in SparkR,
>>> like ddply?
>>>
>>> How do I access the nth row of a SparkR DataFrame?
>>>
>>> Regards,
>>> Vipul
>>>
>>> On 23 November 2015 at 14:25, Jeff Zhang <zjf...@gmail.com> wrote:
>>>
>>>> >>> Do I need to create a new DataFrame for every update to the
>>>> DataFrame, like the addition of a new column, or do I need to update the
>>>> original sales DataFrame?
>>>>
>>>> Yes, a DataFrame is immutable, and every mutation of a DataFrame
>>>> produces a new DataFrame.
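>>>>
>>>> For example, withColumn returns a new DataFrame and leaves the original
>>>> untouched; a minimal sketch against your sales table (lit builds a
>>>> literal column):
>>>>
>>>> # sales is unchanged; sales1 is a new DataFrame with the extra column
>>>> sales1 <- withColumn(sales, "C1", lit(607))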
>>>>
>>>>
>>>>
>>>> On Mon, Nov 23, 2015 at 4:44 PM, Vipul Rai <vipulrai8...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hello Rui,
>>>>>
>>>>> Sorry, what I meant was that adding a new column to the original
>>>>> DataFrame yields a new DataFrame.
>>>>>
>>>>> Please check this for more details:
>>>>>
>>>>> https://spark.apache.org/docs/1.5.1/api/R/index.html
>>>>>
>>>>> Check for withColumn.
>>>>>
>>>>>
>>>>> Thanks,
>>>>> Vipul
>>>>>
>>>>>
>>>>> On 23 November 2015 at 12:42, Sun, Rui <rui....@intel.com> wrote:
>>>>>
>>>>>> Vipul,
>>>>>>
>>>>>> Not sure if I understand your question. A DataFrame is immutable; you
>>>>>> can't update a DataFrame.
>>>>>>
>>>>>> Could you paste some log info for the OOM error?
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: vipulrai [mailto:vipulrai8...@gmail.com]
>>>>>> Sent: Friday, November 20, 2015 12:11 PM
>>>>>> To: user@spark.apache.org
>>>>>> Subject: SparkR DataFrame , Out of memory exception for very small
>>>>>> file.
>>>>>>
>>>>>> Hi Users,
>>>>>>
>>>>>> I have a general question regarding DataFrames in SparkR.
>>>>>>
>>>>>> I am trying to read a file from Hive, and it gets created as a DataFrame.
>>>>>>
>>>>>> sqlContext <- sparkRHive.init(sc)
>>>>>>
>>>>>> #DF
>>>>>> sales <- read.df(sqlContext, "hdfs://sample.csv", header = 'true',
>>>>>>                  source = "com.databricks.spark.csv",
>>>>>>                  inferSchema = 'true')
>>>>>>
>>>>>> registerTempTable(sales,"Sales")
>>>>>>
>>>>>> Do I need to create a new DataFrame for every update to the DataFrame,
>>>>>> like the addition of a new column, or do I need to update the original
>>>>>> sales DataFrame?
>>>>>>
>>>>>> sales1 <- SparkR::sql(sqlContext, "Select a.*, 607 as C1 from Sales
>>>>>> as a")
>>>>>>
>>>>>>
>>>>>> Please help me with this, as the original file is only 20MB but it
>>>>>> throws an out-of-memory exception on a cluster with a 4GB master and
>>>>>> two workers of 4GB each.
>>>>>>
>>>>>> Also, what is the logic with DataFrames? Do I need to register and
>>>>>> drop the tempTable after every update?
>>>>>>
>>>>>> Thanks,
>>>>>> Vipul
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> View this message in context:
>>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/SparkR-DataFrame-Out-of-memory-exception-for-very-small-file-tp25435.html
>>>>>> Sent from the Apache Spark User List mailing list archive at
>>>>>> Nabble.com.
>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For
>>>>>> additional commands, e-mail: user-h...@spark.apache.org
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Regards,
>>>>> Vipul Rai
>>>>> www.vipulrai.me
>>>>> +91-8892598819
>>>>> <http://in.linkedin.com/in/vipulrai/>
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Best Regards
>>>>
>>>> Jeff Zhang
>>>>
>>>
>>>
>>>
>>> --
>>> Regards,
>>> Vipul Rai
>>> www.vipulrai.me
>>> +91-8892598819
>>> <http://in.linkedin.com/in/vipulrai/>
>>>
>>
>>
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>>
>
>
>
> --
> Regards,
> Vipul Rai
> www.vipulrai.me
> +91-8892598819
> <http://in.linkedin.com/in/vipulrai/>
>



-- 
Best Regards

Jeff Zhang
