Hi Samaga,
Thanks very much for your reply, and sorry for the delayed response. Cassandra or Hive is a good suggestion, but I am not sure it makes sense in my situation. My requirement is to get the most recent 24 hours of data to generate a report, and the frequency is 5 minutes. So with Cassandra or Hive, Spark would have to read 24 hours of data every 5 minutes, and a large part of that data (23 hours or more) would be read repeatedly. The window in Spark is for stream computing; I have not used it, but I will consider it.

Thanks again

Regards
Mingwei

At 2016-04-11 19:09:48, "Lohith Samaga M" <lohith.sam...@mphasis.com> wrote:
>Hi Kramer,
>	Some options:
>	1. Store in Cassandra with TTL = 24 hours. When you read the full
>	   table, you get the latest 24 hours' data.
>	2. Store in Hive as an ORC file and use the timestamp field to filter
>	   out the old data.
>	3. Try windowing in Spark or Flink (I have not used either).
>
>
>Best regards / Mit freundlichen Grüßen / Sincères salutations
>M. Lohith Samaga
>
>
>-----Original Message-----
>From: kramer2...@126.com [mailto:kramer2...@126.com]
>Sent: Monday, April 11, 2016 16.18
>To: user@spark.apache.org
>Subject: Why Spark having OutOfMemory Exception?
>
>I use Spark to do a very simple calculation. The description is like below
>(pseudo code):
>
>
>while timestamp == 5 minutes
>
>    df = read_hdf()                 # read HDFS to get a dataframe every 5 minutes
>
>    my_dict[timestamp] = df         # put the dataframe into a dict
>
>    delete_old_dataframe(my_dict)   # delete dataframes whose timestamp is more than 24 hours old
>
>    big_df = merge(my_dict)         # merge the most recent 24 hours of dataframes
>
>To explain:
>
>I have new files coming in every 5 minutes, but I need to generate a report
>on the most recent 24 hours of data. The 24-hour constraint means I need to
>delete the oldest data frame every time I put a new one in. So I maintain a
>dict (my_dict in the code above) that maps timestamp to dataframe. Every
>time I put a dataframe into the dict, I go through the dict and delete the
>old dataframes whose timestamps are more than 24 hours ago. After the
>delete and insert, I merge the dataframes in the dict into one big
>dataframe and run SQL on it to get my report.
>
>I want to know if there is anything wrong with this model, because the job
>becomes very slow after running for a while and then hits OutOfMemory. I
>know that my memory is enough, and the files are very small for test
>purposes, so there should not be a memory problem.
>
>I am wondering if there is a lineage issue, but I am not sure.
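[Editor's note] For readers following the thread, here is a minimal runnable sketch of the model described in the quoted message, in Spark 1.x-era PySpark. The HDFS path, the 5-minute driver loop, and the helper name delete_old_dataframes are hypothetical stand-ins, not the poster's actual code:

    import time
    from functools import reduce
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="rolling-24h-report")
    sqlContext = SQLContext(sc)

    my_dict = {}  # timestamp -> DataFrame

    def delete_old_dataframes(frames, now, max_age=24 * 3600):
        # Drop entries older than 24 hours and unpersist them so the
        # executors can actually free the cached blocks.
        for ts in [t for t in frames if now - t > max_age]:
            frames.pop(ts).unpersist()

    while True:
        now = int(time.time())
        df = sqlContext.read.parquet("hdfs:///data/%d" % now)  # hypothetical path
        my_dict[now] = df.persist()
        delete_old_dataframes(my_dict, now)

        # unionAll chains the lineage of every 5-minute piece; at 5-minute
        # intervals over 24 hours that is ~288 pieces, so the query plan
        # grows very deep, which fits the reported slowdown.
        big_df = reduce(lambda a, b: a.unionAll(b), my_dict.values())
        big_df.registerTempTable("recent")
        sqlContext.sql("SELECT COUNT(*) FROM recent").show()

        time.sleep(5 * 60)  # wait for the next batch of files

If the lineage suspicion is right, persisting or checkpointing big_df each cycle (or re-reading the merged result from disk) would truncate the plan instead of letting it grow without bound.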
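[Editor's note] For completeness, the windowing idea in option 3 could look roughly like the following with Spark Streaming DStreams (current as of Spark 1.6). This is only a sketch under assumptions: the input directory is hypothetical, and a 24-hour window over 5-minute batches means Spark retains on the order of 288 batches, so memory use should be checked before adopting this:

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="windowed-24h-report")
    ssc = StreamingContext(sc, batchDuration=300)   # one batch every 5 minutes
    ssc.checkpoint("hdfs:///tmp/checkpoints")       # commonly enabled for windowed streams

    lines = ssc.textFileStream("hdfs:///data/incoming")  # hypothetical input directory

    # Keep the last 24 hours of batches, sliding forward every 5 minutes.
    windowed = lines.window(windowDuration=24 * 3600, slideDuration=300)

    def report(rdd):
        # Stand-in for the real report query.
        print("rows in last 24h: %d" % rdd.count())

    windowed.foreachRDD(report)

    ssc.start()
    ssc.awaitTermination()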