I use Spark to do a very simple calculation. The description is as below (pseudo code):
While timestamp == 5 minutes:
    df = read_hdfs()                 # Read HDFS to get a data frame every 5 minutes
    my_dict[timestamp] = df          # Put the data frame into a dict
    delete_old_dataframe(my_dict)    # Delete old data frames (timestamp is more than 24 hours before)
    big_df = merge(my_dict)          # Merge the recent 24 hours of data frames

To explain: new files come in every 5 minutes, but I need to generate a report on the most recent 24 hours of data. The 24-hour window means I have to delete the oldest data frame every time I put a new one in, so I maintain a dict (my_dict in the code above) that maps timestamp -> data frame. Every time I put a data frame into the dict, I go through the dict and delete the data frames whose timestamps are more than 24 hours old. After the delete and the insert, I merge the data frames in the dict into one big data frame and run SQL on it to get my report.

Is there anything wrong with this model? It becomes very slow after running for a while and eventually hits an OutOfMemory exception. I know my memory is enough, and the files are very small for test purposes, so there should not be a memory problem. I am wondering if there is a lineage issue, but I am not sure.
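In actual code the loop looks roughly like the sketch below (Spark 2.x DataFrame API; the parquet path scheme, the count(*) query, and the time-based scheduling are simplified placeholders for what I really do):

    import time
    from functools import reduce
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("rolling_24h_report").getOrCreate()

    WINDOW_SECONDS = 24 * 60 * 60
    my_dict = {}                                  # timestamp -> DataFrame

    def read_hdfs(ts):
        # Placeholder path scheme; the real directory layout differs.
        return spark.read.parquet("hdfs:///data/batches/%d" % ts)

    while True:
        ts = int(time.time())
        my_dict[ts] = read_hdfs(ts)               # new 5-minute batch

        # Delete data frames whose timestamp is more than 24 hours old.
        for old_ts in [t for t in my_dict if t < ts - WINDOW_SECONDS]:
            del my_dict[old_ts]

        # Merge the recent 24 hours of data frames and run the report SQL.
        big_df = reduce(lambda a, b: a.union(b), my_dict.values())
        big_df.createOrReplaceTempView("recent_24h")
        spark.sql("SELECT count(*) FROM recent_24h").show()   # placeholder query

        time.sleep(300)                           # wait for the next 5-minute batch

(With 5-minute batches the dict holds up to 288 data frames, so each big_df is a union of up to 288 inputs; that is part of why I suspect lineage, but I am not sure.)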