Do you mean the input data size is 10M, or the task result size?

>>> But my way is to set up a forever loop to handle continuously incoming
data. Not sure if it is the right way to use Spark
I am not sure what this means. Do you use Spark Streaming, or do you run a
batch job inside a forever loop?



On Wed, Apr 20, 2016 at 3:55 PM, 李明伟 <kramer2...@126.com> wrote:

> Hi Jeff
>
> The total size of my data is less than 10M. I already set the driver
> memory to 4GB.
>
> On 2016-04-20 13:42:25, "Jeff Zhang" <zjf...@gmail.com> wrote:
>
> It seems the OOM happens on the driver side, when fetching task results.
>
> You can try increasing spark.driver.memory and spark.driver.maxResultSize.
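>
> A minimal sketch of one way these settings might be applied from PySpark (the
> values are only illustrative; spark.driver.memory generally has to be set
> before the driver JVM starts, e.g. via spark-submit or spark-defaults.conf,
> so setting it inside an already running application may not take effect):
>
>     from pyspark import SparkConf, SparkContext
>
>     conf = (SparkConf()
>             .setAppName("report-job")                    # hypothetical app name
>             .set("spark.driver.memory", "4g")            # heap available to the driver
>             .set("spark.driver.maxResultSize", "2g"))    # cap on serialized task results
>     sc = SparkContext(conf=conf)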
>
> On Tue, Apr 19, 2016 at 4:06 PM, 李明伟 <kramer2...@126.com> wrote:
>
>> Hi Zhan Zhang
>>
>>
>> Please see the exception trace below. It reports a GC overhead limit
>> error.
>> I am not a Java or Scala developer, so it is hard for me to understand
>> this information. Reading a core dump is also too difficult for me.
>>
>> I am not sure if the way I am using Spark is correct. I understand that
>> Spark can do batch or stream computation, but my way is to set up a forever
>> loop to handle continuously incoming data.
>> I am not sure if it is the right way to use Spark.
>>
>>
>> 16/04/19 15:54:55 ERROR Utils: Uncaught exception in thread task-result-getter-2
>> java.lang.OutOfMemoryError: GC overhead limit exceeded
>>   at scala.collection.immutable.HashMap$HashTrieMap.updated0(HashMap.scala:328)
>>   at scala.collection.immutable.HashMap.updated(HashMap.scala:54)
>>   at scala.collection.immutable.HashMap$SerializationProxy.readObject(HashMap.scala:516)
>>   at sun.reflect.GeneratedMethodAccessor21.invoke(Unknown Source)
>>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>   at java.lang.reflect.Method.invoke(Method.java:606)
>>   at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>>   at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>>   at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>>   at java.io.ObjectInputStream.defaultReadObject(ObjectInputStream.java:500)
>>   at org.apache.spark.executor.TaskMetrics$$anonfun$readObject$1.apply$mcV$sp(TaskMetrics.scala:220)
>>   at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1204)
>>   at org.apache.spark.executor.TaskMetrics.readObject(TaskMetrics.scala:219)
>>   at sun.reflect.GeneratedMethodAccessor19.invoke(Unknown Source)
>>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>   at java.lang.reflect.Method.invoke(Method.java:606)
>>   at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>>   at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>>   at org.apache.spark.scheduler.DirectTaskResult$$anonfun$readExternal$1.apply$mcV$sp(TaskResult.scala:79)
>>   at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1204)
>>   at org.apache.spark.scheduler.DirectTaskResult.readExternal(TaskResult.scala:62)
>>   at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
>>   at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
>>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>>   at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
>>   at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:109)
>> Exception in thread "task-result-getter-2" java.lang.OutOfMemoryError: GC overhead limit exceeded
>>   at scala.collection.immutable.HashMap$HashTrieMap.updated0(HashMap.scala:328)
>>   at scala.collection.immutable.HashMap.updated(HashMap.scala:54)
>>   at scala.collection.immutable.HashMap$SerializationProxy.readObject(HashMap.scala:516)
>>   at sun.reflect.GeneratedMethodAccessor21.invoke(Unknown Source)
>>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>   at java.lang.reflect.Method.invoke(Method.java:606)
>>   at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>>   at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>>   at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>>   at java.io.ObjectInputStream.defaultReadObject(ObjectInputStream.java:500)
>>   at org.apache.spark.executor.TaskMetrics$$anonfun$readObject$1.apply$mcV$sp(TaskMetrics.scala:220)
>>   at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1204)
>>   at org.apache.spark.executor.TaskMetrics.readObject(TaskMetrics.scala:219)
>>   at sun.reflect.GeneratedMethodAccessor19.invoke(Unknown Source)
>>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>   at java.lang.reflect.Method.invoke(Method.java:606)
>>   at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>>   at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>>   at org.apache.spark.scheduler.DirectTaskResult$$anonfun$readExternal$1.apply$mcV$sp(TaskResult.scala:79)
>>   at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1204)
>>   at org.apache.spark.scheduler.DirectTaskResult.readExternal(TaskResult.scala:62)
>>   at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
>>   at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
>>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>>   at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
>>   at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:109)
>>
>>
>>
>>
>>
>> At 2016-04-19 13:10:20, "Zhan Zhang" <zzh...@hortonworks.com> wrote:
>>
>> What kind of OOM? Driver or executor side? You can use a core dump to find
>> what caused the OOM.
>>
>> Thanks.
>>
>> Zhan Zhang
>>
>> On Apr 18, 2016, at 9:44 PM, 李明伟 <kramer2...@126.com> wrote:
>>
>> Hi Samaga
>>
>> Thanks very much for your reply, and sorry for the delayed response.
>>
>> Cassandra or Hive is a good suggestion.
>> However, I am not sure it makes sense in my situation.
>>
>> My requirement is to generate a report on the most recent 24 hours of
>> data, every 5 minutes.
>> So if I use Cassandra or Hive, Spark will have to read 24 hours of data
>> every 5 minutes, and a big part of that data (23 hours or more) will be
>> read repeatedly.
>>
>> Windowing in Spark is for stream computing. I have not used it, but I will
>> consider it.
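>>
>> A minimal PySpark Streaming sketch of the windowing idea, assuming files
>> land in an HDFS directory; parse() and generate_report() are hypothetical
>> placeholders for your own logic:
>>
>>     from pyspark import SparkContext
>>     from pyspark.streaming import StreamingContext
>>
>>     sc = SparkContext(appName="rolling-24h-report")
>>     ssc = StreamingContext(sc, batchDuration=300)        # one micro-batch every 5 minutes
>>
>>     lines = ssc.textFileStream("hdfs:///incoming/")      # picks up new files as they arrive
>>     records = lines.map(parse)                           # hypothetical parsing function
>>
>>     # Keep the last 24 hours of records, re-evaluated every 5 minutes.
>>     windowed = records.window(24 * 3600, 300)
>>     windowed.foreachRDD(lambda rdd: generate_report(rdd))  # hypothetical report step
>>
>>     ssc.start()
>>     ssc.awaitTermination()
>>
>> Note that Spark holds the windowed data for the full 24-hour duration, so
>> the cluster still needs enough memory or disk for roughly one day of records.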
>>
>>
>> Thanks again
>>
>> Regards
>> Mingwei
>>
>>
>>
>>
>>
>> At 2016-04-11 19:09:48, "Lohith Samaga M" <lohith.sam...@mphasis.com> wrote:
>> >Hi Kramer,
>> >    Some options:
>> >    1. Store in Cassandra with TTL = 24 hours. When you read the full 
>> > table, you get the latest 24 hours data.
>> >    2. Store in Hive as ORC file and use timestamp field to filter out the 
>> > old data.
>> >    3. Try windowing in spark or flink (have not used either).
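>> >
>> >A rough PySpark sketch of option 2 (the ORC path and the event_ts column
>> >are hypothetical): read the ORC data and keep only the last 24 hours, so
>> >old rows are filtered out at read time:
>> >
>> >    import time
>> >    from pyspark import SparkContext
>> >    from pyspark.sql import HiveContext
>> >
>> >    sc = SparkContext(appName="last-24h-report")
>> >    sqlContext = HiveContext(sc)
>> >
>> >    cutoff = time.time() - 24 * 3600
>> >    df = sqlContext.read.orc("/warehouse/events_orc")   # hypothetical ORC location
>> >    recent = df.filter(df.event_ts >= cutoff)           # hypothetical timestamp column
>> >    recent.registerTempTable("recent_events")
>> >    report = sqlContext.sql("SELECT COUNT(*) FROM recent_events")  # your report query here
>> >
>> >If the table is also partitioned by time, Spark only has to touch the
>> >recent partitions instead of scanning the whole history.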
>> >
>> >
>> >Best regards / Mit freundlichen Grüßen / Sincères salutations
>> >M. Lohith Samaga
>> >
>> >
>> >-----Original Message-----
>> >From: kramer2...@126.com [mailto:kramer2...@126.com]
>> >Sent: Monday, April 11, 2016 16.18
>> >To: user@spark.apache.org
>> >Subject: Why Spark having OutOfMemory Exception?
>> >
>> >I use Spark to do some very simple calculations. The description is like 
>> >the pseudocode below:
>> >
>> >
>> >while True:  # one iteration every 5 minutes
>> >
>> >    df = read_hdf()  # read HDFS to get a new dataframe every 5 minutes
>> >
>> >    my_dict[timestamp] = df  # put the dataframe into a dict keyed by timestamp
>> >
>> >    delete_old_dataframe(my_dict)  # delete dataframes whose timestamp is more than 24 hours old
>> >
>> >    big_df = merge(my_dict)  # merge the most recent 24 hours of dataframes
>> >
>> >To explain:
>> >
>> >New files come in every 5 minutes, but I need to generate a report on the 
>> >most recent 24 hours of data.
>> >The 24-hour window means I need to delete the oldest dataframe every time 
>> >I put a new one in.
>> >So I maintain a dict (my_dict in the code above) that maps timestamp to
>> >dataframe. Every time I put a dataframe into the dict, I go through the 
>> >dict and delete the old dataframes whose timestamps are more than 24 hours 
>> >old.
>> >After deleting and inserting, I merge the dataframes in the dict into one 
>> >big dataframe and run SQL on it to get my report.
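>> >
>> >A rough PySpark sketch of that delete/merge step (delete_old_dataframe and
>> >the 24-hour cutoff follow the pseudocode above; the union is one way the
>> >merge might look):
>> >
>> >    from functools import reduce
>> >
>> >    def delete_old_dataframe(frames, now, max_age=24 * 3600):
>> >        # drop entries whose timestamp is more than 24 hours old
>> >        for ts in list(frames):
>> >            if now - ts > max_age:
>> >                del frames[ts]
>> >
>> >    my_dict[timestamp] = df
>> >    delete_old_dataframe(my_dict, timestamp)
>> >    big_df = reduce(lambda a, b: a.unionAll(b), my_dict.values())  # merge last 24 hours
>> >
>> >With 5-minute files kept for 24 hours, this union spans roughly 288
>> >dataframes, so the merged query plan grows quite large on every iteration.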
>> >
>> >*
>> >I want to know if anything is wrong with this model, because it becomes 
>> >very slow after running for a while and then hits OutOfMemory. I know that 
>> >my memory is enough, and the files are very small for test purposes, so 
>> >there should not be a memory problem.
>> >
>> >I am wondering if there is a lineage issue, but I am not sure.
>> >
>> >*
>> >
>> >
>> >
>> >--
>> >View this message in context: 
>> >http://apache-spark-user-list.1001560.n3.nabble.com/Why-Spark-having-OutOfMemory-Exception-tp26743.html
>> >Sent from the Apache Spark User List mailing list archive at Nabble.com.
>> >
>> >---------------------------------------------------------------------
>> >To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional 
>> >commands, e-mail: user-h...@spark.apache.org
>> >
>> >
>>
>
>
>
> --
> Best Regards
>
> Jeff Zhang
>
>
>
>
>



-- 
Best Regards

Jeff Zhang
