Hi all,

I use Spark to do some calculations. The situation is:
1. New files arrive in a folder periodically.
2. I turn each new file into a data frame and union it into the previously accumulated data frame.
The code is like below:

    from hdfs import InsecureClient
    from pyspark.sql import Row
    from pyspark.sql import functions as func

    # sc and sqlContext are the existing SparkContext / SQLContext
    # (e.g. from the pyspark shell)

    # Get the file list in the HDFS directory
    client = InsecureClient('http://10.79.148.184:50070')
    file_list = client.list('/test')

    df_total = None
    counter = 0
    for file in file_list:
        counter += 1

        # turn each file (CSV format) into a data frame
        lines = sc.textFile("/test/%s" % file)
        parts = lines.map(lambda l: l.split(","))
        rows = parts.map(lambda p: Row(router=p[0], interface=int(p[1]),
                                       protocol=p[7], bit=int(p[10])))
        df = sqlContext.createDataFrame(rows)

        # do some transformation on the data frame
        df_protocol = df.groupBy(['protocol']).agg(func.sum('bit').alias('bit'))

        # add the current data frame to the previous data frame set
        if not df_total:
            df_total = df_protocol
        else:
            df_total = df_total.unionAll(df_protocol)

        # cache df_total
        df_total.cache()
        if counter % 5 == 0:
            df_total.rdd.checkpoint()

        # get the df_total information
        df_total.show()

I know that as time goes on, df_total could get big. But before that point is reached, the code above already raises an exception: after about 30 iterations of the loop it throws a "GC overhead limit exceeded" error. The files are very small, so even after 300 iterations the total data size should only be a few MB. I do not know why it throws a GC error. The exception detail is below:

16/04/19 15:54:55 ERROR Utils: Uncaught exception in thread task-result-getter-2
java.lang.OutOfMemoryError: GC overhead limit exceeded
        at scala.collection.immutable.HashMap$HashTrieMap.updated0(HashMap.scala:328)
        at scala.collection.immutable.HashMap.updated(HashMap.scala:54)
        at scala.collection.immutable.HashMap$SerializationProxy.readObject(HashMap.scala:516)
        at sun.reflect.GeneratedMethodAccessor21.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
        at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
        at java.io.ObjectInputStream.defaultReadObject(ObjectInputStream.java:500)
        at org.apache.spark.executor.TaskMetrics$$anonfun$readObject$1.apply$mcV$sp(TaskMetrics.scala:220)
        at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1204)
        at org.apache.spark.executor.TaskMetrics.readObject(TaskMetrics.scala:219)
        at sun.reflect.GeneratedMethodAccessor19.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
        at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
        at org.apache.spark.scheduler.DirectTaskResult$$anonfun$readExternal$1.apply$mcV$sp(TaskResult.scala:79)
        at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1204)
        at org.apache.spark.scheduler.DirectTaskResult.readExternal(TaskResult.scala:62)
        at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
        at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
        at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
        at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:109)

Exception in thread "task-result-getter-2" java.lang.OutOfMemoryError: GC overhead limit exceeded
        at scala.collection.immutable.HashMap$HashTrieMap.updated0(HashMap.scala:328)
        at scala.collection.immutable.HashMap.updated(HashMap.scala:54)
        at scala.collection.immutable.HashMap$SerializationProxy.readObject(HashMap.scala:516)
        at sun.reflect.GeneratedMethodAccessor21.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
        at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
        at java.io.ObjectInputStream.defaultReadObject(ObjectInputStream.java:500)
        at org.apache.spark.executor.TaskMetrics$$anonfun$readObject$1.apply$mcV$sp(TaskMetrics.scala:220)
        at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1204)
        at org.apache.spark.executor.TaskMetrics.readObject(TaskMetrics.scala:219)
        at sun.reflect.GeneratedMethodAccessor19.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
        at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
        at org.apache.spark.scheduler.DirectTaskResult$$anonfun$readExternal$1.apply$mcV$sp(TaskResult.scala:79)
        at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1204)
        at org.apache.spark.scheduler.DirectTaskResult.readExternal(TaskResult.scala:62)
        at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
        at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
        at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
        at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:109)
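In case it helps to illustrate what I mean about df_total growing: below is a minimal sketch of the same union-in-a-loop pattern with made-up data (it assumes the same sc and sqlContext as above; the protocol and bit values are invented). Even though the data stays tiny, the plan printed by explain() gets longer on every iteration. I do not know whether that growing plan, or the TaskMetrics deserialization that shows up in the stack trace, is what uses up the memory; that is essentially my question.

    # Minimal sketch with made-up data: same sc / sqlContext as in the code
    # above; the 'protocol' and 'bit' values here are invented for illustration.
    from pyspark.sql import Row
    from pyspark.sql import functions as func

    df_total = None
    for i in range(10):
        # a few tiny rows standing in for one incoming CSV file
        rows = sc.parallelize([Row(protocol='tcp', bit=i),
                               Row(protocol='udp', bit=i),
                               Row(protocol='tcp', bit=i + 1)])
        df = sqlContext.createDataFrame(rows)
        df_protocol = df.groupBy(['protocol']).agg(func.sum('bit').alias('bit'))
        df_total = df_protocol if df_total is None else df_total.unionAll(df_protocol)
        df_total.cache()
        # the plan printed here grows by one more union each time through the loop
        df_total.explain()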