I use Spark to do a very simple calculation. The flow is roughly as follows
(pseudo code):


while True:  # one iteration every 5 minutes

    df = read_hdf()                # read the new HDFS files into a DataFrame

    my_dict[timestamp] = df        # put the DataFrame into the dict, keyed by its timestamp

    delete_old_dataframe(my_dict)  # delete old DataFrames (timestamp more than 24 hours before now)

    big_df = merge(my_dict)        # merge the recent 24 hours of DataFrames

To explain:

New files come in every 5 minutes, but I need to generate a report on the
most recent 24 hours of data.
Keeping a 24-hour window means I need to delete the oldest DataFrames every
time I add a new one.
So I maintain a dict (my_dict in the code above) that maps timestamp ->
DataFrame. Every time I put a new DataFrame into the dict, I go through the
dict and delete the DataFrames whose timestamps are more than 24 hours old.
After the delete and insert, I merge the DataFrames in the dict into one big
DataFrame and run SQL on it to get my report.
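
For reference, the two helpers are roughly this (a minimal PySpark sketch of
what I mean; "merge" is just a row-wise union, and I assume the dict keys are
epoch-second timestamps):

from functools import reduce
from pyspark.sql import DataFrame

def delete_old_dataframe(my_dict):
    # Drop every DataFrame whose timestamp is more than 24 hours behind the newest one.
    latest_ts = max(my_dict)
    for ts in list(my_dict):
        if latest_ts - ts > 24 * 3600:
            del my_dict[ts]

def merge(my_dict):
    # Union all DataFrames currently in the 24-hour window into one big DataFrame.
    return reduce(DataFrame.unionAll, my_dict.values())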

I want to know if there is anything wrong with this model, because it becomes
very slow after running for a while and eventually hits an OutOfMemory
exception. I know my memory is enough, and the files are very small (this is
just for testing), so there should not be a memory problem.

I am wondering if there is a lineage issue, but I am not sure.
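
If it is a lineage issue, my guess is that the lineage of the merged
DataFrame keeps growing with every union, so one thing I could try is
truncating it each iteration. Something like this (assuming a Spark version
where DataFrame.checkpoint() exists, i.e. 2.1+, and a placeholder checkpoint
directory):

# `spark` here is the SparkSession
spark.sparkContext.setCheckpointDir("hdfs:///tmp/report_checkpoint")  # placeholder path

big_df = merge(my_dict)
big_df = big_df.checkpoint()  # materializes the result and cuts the accumulated lineage
big_df.createOrReplaceTempView("recent_24h")
# then run the report SQL against the "recent_24h" view as before

Would truncating the lineage like this be the right fix, or is the model
itself the problem?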



