Hi Kramer,

Some options:

1. Store in Cassandra with a TTL of 24 hours. When you read the full table, you get only the latest 24 hours of data.
2. Store in Hive as an ORC file and use a timestamp field to filter out the old data.
3. Try windowing in Spark or Flink (I have not used either); a rough sketch follows below.
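As an illustration of option 3 only, here is a minimal, untested sketch assuming Spark Structured Streaming. The HDFS path, the schema, and the sum aggregation are placeholders, not anything from your job:

    # Untested sketch: rolling 24-hour aggregate with Structured Streaming.
    # Path, schema and aggregation below are placeholders.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("rolling-24h-report").getOrCreate()

    events = (spark.readStream
              .schema("event_time TIMESTAMP, value DOUBLE")   # placeholder schema
              .parquet("hdfs:///data/incoming"))              # new file every 5 minutes

    # A 24-hour window sliding every 5 minutes; the watermark lets Spark
    # eventually drop state for old windows instead of keeping it forever.
    report = (events
              .withWatermark("event_time", "24 hours")
              .groupBy(F.window("event_time", "24 hours", "5 minutes"))
              .agg(F.sum("value").alias("total")))

    query = (report.writeStream
             .outputMode("update")
             .format("console")
             .start())
    query.awaitTermination()

The point of the watermark plus window is that Spark manages the "forget data older than 24 hours" part for you, rather than you deleting DataFrames by hand.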
Best regards / Mit freundlichen Grüßen / Sincères salutations
M. Lohith Samaga

-----Original Message-----
From: kramer2...@126.com [mailto:kramer2...@126.com]
Sent: Monday, April 11, 2016 16:18
To: user@spark.apache.org
Subject: Why Spark having OutOfMemory Exception?

I use Spark to do some very simple calculation. The description is like below (pseudo code):

    every 5 minutes:
        df = read_hdf()                  # read HDFS to get a new DataFrame
        my_dict[timestamp] = df          # put the DataFrame into a dict keyed by timestamp
        delete_old_dataframe(my_dict)    # delete DataFrames whose timestamp is more than 24 hours old
        big_df = merge(my_dict)          # merge the recent 24 hours of DataFrames

To explain: new files come in every 5 minutes, but I need to generate a report on the most recent 24 hours of data. "Recent 24 hours" means I have to delete the oldest DataFrame every time I add a new one, so I maintain a dict (my_dict in the code above) mapping timestamp -> DataFrame. Every time I put a DataFrame into the dict, I go through it and delete the DataFrames whose timestamp is more than 24 hours old. After deleting and inserting, I merge the DataFrames in the dict into one big DataFrame and run SQL on it to get my report.

I want to know if anything is wrong with this model, because it gets very slow after running for a while and hits OutOfMemory. I know that my memory is enough, and the files are very small for test purposes, so there should not be a memory problem. I am wondering if there is a lineage issue, but I am not sure.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Why-Spark-having-OutOfMemory-Exception-tp26743.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
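For reference, here is a rough, untested PySpark sketch of the pattern described in the message above, with the two changes most relevant to the memory question: old DataFrames are unpersisted explicitly, and the merged frame is rebuilt from the cached pieces on each cycle so its lineage stays shallow. read_batch() and the HDFS path are hypothetical placeholders:

    import time
    from functools import reduce
    from pyspark.sql import SparkSession, DataFrame

    spark = SparkSession.builder.appName("rolling-24h").getOrCreate()
    WINDOW_SECONDS = 24 * 3600
    my_dict = {}  # timestamp -> DataFrame

    def read_batch(ts):
        # Hypothetical loader for the file that arrived at time `ts`.
        return spark.read.parquet("hdfs:///data/incoming/%d" % ts)

    def refresh(ts):
        # Drop and unpersist entries older than 24 hours so their
        # cached blocks are released instead of accumulating.
        cutoff = ts - WINDOW_SECONDS
        for old_ts in [k for k in my_dict if k < cutoff]:
            my_dict.pop(old_ts).unpersist()

        # Cache the new batch so repeated reports reuse it.
        my_dict[ts] = read_batch(ts).cache()

        # Rebuild the 24-hour view from scratch each cycle, so its
        # lineage is one union over cached inputs rather than growing.
        return reduce(DataFrame.union, my_dict.values())

    # Example driver loop: one refresh every 5 minutes.
    # while True:
    #     big_df = refresh(int(time.time()))
    #     big_df.createOrReplaceTempView("recent_24h")
    #     spark.sql("SELECT COUNT(*) FROM recent_24h").show()
    #     time.sleep(300)

If the merged DataFrame is ever derived from a previous merge rather than rebuilt from the dict, the lineage keeps growing across cycles; caching or checkpointing the merged frame before running the SQL is one way to truncate it.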