Hi All

I am using Spark to do some calculations.
The situation is:
1. New files arrive in a folder periodically.
2. I turn each new file into a DataFrame and union it with the previous
DataFrame.

The code is as follows:


    # Imports assumed by the snippet (WebHDFS client and PySpark SQL helpers)
    from hdfs import InsecureClient
    from pyspark.sql import Row
    from pyspark.sql import functions as func

    # Get the file list in the HDFS directory
    client = InsecureClient('http://10.79.148.184:50070')
    file_list = client.list('/test')

    df_total = None
    counter = 0
    for file in file_list:
        counter += 1

        # Turn each file (CSV format) into a DataFrame
        lines = sc.textFile("/test/%s" % file)
        parts = lines.map(lambda l: l.split(","))
        rows = parts.map(lambda p: Row(router=p[0], interface=int(p[1]),
                                       protocol=p[7], bit=int(p[10])))
        df = sqlContext.createDataFrame(rows)

        # Do some transformation: total bits per protocol
        df_protocol = df.groupBy(['protocol']).agg(func.sum('bit').alias('bit'))

        # Union the current DataFrame into the accumulated DataFrame
        if df_total is None:
            df_total = df_protocol
        else:
            df_total = df_total.unionAll(df_protocol)

        # Cache df_total and checkpoint its RDD every 5 files
        df_total.cache()
        if counter % 5 == 0:
            df_total.rdd.checkpoint()

        # Show df_total
        df_total.show()
 

I know that as time goes on, df_total could become big. But before it ever
gets that far, the code above already raises an exception.

After about 30 iterations of the loop, it throws a "GC overhead limit
exceeded" error. The files are very small, so even after 300 iterations the
total data would only be a few MB. I do not understand why it throws a GC
error.
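
For reference, here is the direction I am thinking of trying next (just a
sketch, not verified). My understanding is that df_total.rdd.checkpoint()
only marks a freshly created RDD, and since no action ever runs on that exact
RDD, the checkpoint may never be written and the unionAll lineage keeps
growing. The checkpoint directory below is only a placeholder:

    # Sketch (untested): force the checkpoint and rebuild df_total from the
    # checkpointed RDD, so the growing unionAll lineage is actually cut.
    sc.setCheckpointDir('/tmp/spark_checkpoints')  # placeholder, set once before the loop

    if counter % 5 == 0:
        rdd = df_total.rdd
        rdd.checkpoint()
        rdd.count()  # an action is needed to trigger the checkpoint
        df_total = sqlContext.createDataFrame(rdd, df_total.schema)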

The exception details are below:

        16/04/19 15:54:55 ERROR Utils: Uncaught exception in thread task-result-getter-2
        java.lang.OutOfMemoryError: GC overhead limit exceeded
                at scala.collection.immutable.HashMap$HashTrieMap.updated0(HashMap.scala:328)
                at scala.collection.immutable.HashMap.updated(HashMap.scala:54)
                at scala.collection.immutable.HashMap$SerializationProxy.readObject(HashMap.scala:516)
                at sun.reflect.GeneratedMethodAccessor21.invoke(Unknown Source)
                at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
                at java.lang.reflect.Method.invoke(Method.java:606)
                at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
                at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
                at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
                at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
                at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
                at java.io.ObjectInputStream.defaultReadObject(ObjectInputStream.java:500)
                at org.apache.spark.executor.TaskMetrics$$anonfun$readObject$1.apply$mcV$sp(TaskMetrics.scala:220)
                at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1204)
                at org.apache.spark.executor.TaskMetrics.readObject(TaskMetrics.scala:219)
                at sun.reflect.GeneratedMethodAccessor19.invoke(Unknown Source)
                at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
                at java.lang.reflect.Method.invoke(Method.java:606)
                at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
                at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
                at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
                at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
                at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
                at org.apache.spark.scheduler.DirectTaskResult$$anonfun$readExternal$1.apply$mcV$sp(TaskResult.scala:79)
                at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1204)
                at org.apache.spark.scheduler.DirectTaskResult.readExternal(TaskResult.scala:62)
                at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
                at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
                at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
                at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
                at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
                at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:109)
        Exception in thread "task-result-getter-2" java.lang.OutOfMemoryError: GC overhead limit exceeded
                at scala.collection.immutable.HashMap$HashTrieMap.updated0(HashMap.scala:328)
                at scala.collection.immutable.HashMap.updated(HashMap.scala:54)
                at scala.collection.immutable.HashMap$SerializationProxy.readObject(HashMap.scala:516)
                at sun.reflect.GeneratedMethodAccessor21.invoke(Unknown Source)
                at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
                at java.lang.reflect.Method.invoke(Method.java:606)
                at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
                at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
                at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
                at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
                at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
                at java.io.ObjectInputStream.defaultReadObject(ObjectInputStream.java:500)
                at org.apache.spark.executor.TaskMetrics$$anonfun$readObject$1.apply$mcV$sp(TaskMetrics.scala:220)
                at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1204)
                at org.apache.spark.executor.TaskMetrics.readObject(TaskMetrics.scala:219)
                at sun.reflect.GeneratedMethodAccessor19.invoke(Unknown Source)
                at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
                at java.lang.reflect.Method.invoke(Method.java:606)
                at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
                at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
                at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
                at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
                at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
                at org.apache.spark.scheduler.DirectTaskResult$$anonfun$readExternal$1.apply$mcV$sp(TaskResult.scala:79)
                at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1204)
                at org.apache.spark.scheduler.DirectTaskResult.readExternal(TaskResult.scala:62)
                at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
                at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
                at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
                at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
                at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
                at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:109)



