Hi
My environment is as follows:
There are 5 nodes, and each node generates a big CSV file every 5 minutes. I
need Spark Streaming to analyze these 5 files every five minutes to generate
some reports.
I am planning to do it in this way:
1. Put those 5 files into an HDFS directory called /data
2. Merge
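The post is cut off at step 2, but a minimal sketch of the read-and-merge idea
could look like the following. The glob path, table name, and report query are
placeholders, and since this thread runs Spark 1.6, CSV support is assumed to
come from the external spark-csv package:

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="five-minute-report")
    sqlContext = SQLContext(sc)

    # Spark can read a whole directory (or glob) in one call, so the five
    # files do not need to be merged by hand before the analysis.
    df = (sqlContext.read
          .format("com.databricks.spark.csv")
          .options(header="true", inferSchema="true")
          .load("hdfs:///data/*.csv"))

    df.registerTempTable("metrics")                        # hypothetical table name
    sqlContext.sql("SELECT COUNT(*) FROM metrics").show()  # placeholder report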
Hi
I have data frames created every 5 minutes. I use a dict to keep the most
recent 1 hour of data frames, so only 12 data frames can be kept in the dict.
As new data frames come in, old data frames pop out.
My question is: when I pop out an old data frame, do I have to call
dataframe.unpersist to release the
I use Spark to do some very simple calculations. The description is below
(pseudocode):
while True:  # wakes up every 5 minutes
    df = read_hdf()                # read HDFS to get a dataframe every 5 minutes
    my_dict[timestamp] = df        # put the data frame into a dict, keyed by time
    delete_old_dataframe(my_dict)  # evict frames older than 1 hour
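A minimal sketch of what delete_old_dataframe could do, assuming my_dict is
keyed by Unix timestamps and that each frame was persisted when created;
unpersist only releases storage that cache()/persist() pinned:

    import time

    ONE_HOUR = 3600  # 12 five-minute frames

    def delete_old_dataframe(my_dict, now=None):
        """Evict frames older than one hour and release their cached blocks."""
        if now is None:
            now = time.time()
        for ts in list(my_dict):        # copy the keys; we mutate while iterating
            if now - ts > ONE_HOUR:
                old_df = my_dict.pop(ts)
                # blocking=True waits until the blocks are actually freed
                old_df.unpersist(blocking=True)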
I have a standalone cluster running on one node.
The ps command shows that the Worker has 1 GB of memory and the Driver has
256 MB.
root 23182 1 0 Apr01 ? 00:19:30 java -cp
/opt/spark-1.6.0-bin-hadoop2.6/conf/:/opt/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.
Hi All
I use Spark to do some calculations.
The situation is:
1. New files come into a folder periodically.
2. I turn the new files into a data frame and insert it into the previous data
frame.
The code is like below:
# Get the file list in the HDFS directory
client = InsecureClient('h
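The client line is cut off, but a sketch of the list-and-union loop could look
like this, assuming the hdfs package's InsecureClient (which the truncated
line appears to construct), a hypothetical namenode address, and files that
all share one schema; on Spark 1.6 data frames are combined with unionAll:

    from hdfs import InsecureClient

    client = InsecureClient("http://namenode:50070")  # hypothetical address
    seen = set()                                      # names already loaded

    def merge_new_files(sqlContext, previous_df, directory="/data"):
        """Union any files not seen before into the running data frame."""
        df = previous_df
        for name in client.list(directory):           # file list in the HDFS directory
            if name in seen:
                continue
            seen.add(name)
            new_df = (sqlContext.read
                      .format("com.databricks.spark.csv")
                      .options(header="true")
                      .load("hdfs://" + directory + "/" + name))
            df = new_df if df is None else df.unionAll(new_df)
        return df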
I am using Python and Spark.
I think one problem might be communicating between Spark and third-party
products. For example, to combine Spark with Elasticsearch you have to use
Java or Scala; Python is not supported.
We have some streaming data that needs to be calculated, and we are
considering using Spark Streaming to do it.
We need to generate three kinds of reports. The reports are based on:
1. The last 5 minutes of data
2. The last 1 hour of data
3. The last 24 hours of data
The frequency of the reports is 5 minutes.
After reading the d
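The message is cut off, but one way the three reports could be expressed is
with Spark Streaming's window operation; the source directory and
generate_report are placeholders, and in PySpark window lengths are given in
seconds:

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="windowed-reports")
    ssc = StreamingContext(sc, 300)         # one batch every 5 minutes
    ssc.checkpoint("hdfs:///checkpoint")    # recommended when using long windows

    lines = ssc.textFileStream("hdfs:///data")  # hypothetical source directory

    def make_report(label):
        def report(rdd):                    # exactly one argument, as foreachRDD expects
            generate_report(label, rdd)     # generate_report is hypothetical
        return report

    # All three reports slide every 5 minutes; only the window length differs.
    for label, length in [("5min", 300), ("1hour", 3600), ("24hour", 86400)]:
        lines.window(length, 300).foreachRDD(make_report(label))

    ssc.start()
    ssc.awaitTermination()

Note that the 24-hour window keeps a full day of batches alive, which is worth
keeping in mind given the memory questions later in this thread.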
Hi
I wrote some Python code to do calculations on a Spark stream. The code works
fine for about half an hour, then the memory usage of the executor becomes
very high. I assign 4 GB in the submit command, but it is using 80% of my
physical memory, which is 16 GB. I see this from the top command. In this
situat
To add more info:
This is the setting in my spark-env.sh
[root@ES01 conf]# grep -v "#" spark-env.sh
SPARK_EXECUTOR_CORES=1
SPARK_EXECUTOR_INSTANCES=1
SPARK_DAEMON_MEMORY=4G
So I did not set the executor to use more memory here.
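For reference: SPARK_DAEMON_MEMORY only sizes the standalone master and worker
daemons. On a standalone cluster the executor memory is set per application,
for example:

./bin/spark-submit --master spark://ES01:7077 --executor-memory 4G ...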
Also here is the top output
KiB Mem : 16268156 total, 161116 fr
Hello.
My question here is: what does the Spark standalone cluster do here? Because
when we submit a program like below:
./bin/spark-submit --master spark://ES01:7077 --executor-memory 4G
--num-executors 1 --total-executor-cores 1 --conf
"spark.storage.memoryFraction=0.2"
We specified the resour
I submit my code to a Spark standalone cluster and find that the executor
process's memory usage keeps growing, which causes the program to crash.
I modified the code and submitted it several times, and found that the 4 lines
below may be causing the issue:
dataframe =
dataframe.groupBy(['router','interface']).agg(func.sum('bi
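A sketch of one way to keep this pattern from accumulating state across
iterations: materialize each aggregation, then explicitly drop the previous
frame. The summed column name is hypothetical, since the line above is
truncated after func.sum('bi:

    import pyspark.sql.functions as func

    # 'bits' is a hypothetical completion of the truncated column name above
    aggregated = (dataframe
                  .groupBy(['router', 'interface'])
                  .agg(func.sum('bits').alias('bits')))
    aggregated.persist()
    aggregated.count()       # force materialization before dropping the old frame
    dataframe.unpersist()    # release the previous iteration's cached blocks
    dataframe = aggregated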
After 8 hours, the memory usage became stable; the top command shows it at
75%, which means 12 GB of memory.
But it still does not make sense, because my workload is very small:
I use Spark to calculate on one CSV file every 20 seconds, and the size of the
CSV file is 1.3 MB.
So spark i
Sorry, I have to make a correction again. It may still be a memory leak,
because in the end the memory usage goes up again...
Eventually, the stream program crashed.
Hi Simon
Can you describe your problem in more detail?
I suspect that my problem is caused by the window function (or maybe the
groupBy/agg functions).
If yours is the same, maybe we should report a bug.
At 2016-05-11 23:46:49, "Simon Schiff [via Apache Spark User List]"
wrote:
I have t
It seems we hit the same issue.
There was a bug in 1.5.1 about a memory leak, but I am using 1.6.1.
Here is the link about the bug in 1.5.1:
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
At 2016-05-12 23:10:43, "Simon Schiff [via Apache Spark User List]"
wrote:
I
Sorry, the bug link in the previous mail was wrong.
Here is the real link:
http://apache-spark-developers-list.1001551.n3.nabble.com/Re-SQL-Memory-leak-with-spark-streaming-and-spark-sql-in-spark-1-5-1-td14603.html
At 2016-05-13 09:49:05, "ζζδΌ" wrote:
It seems we hit the same issue
I know the cache operation can cache data in memory/disk...
But I would like to know whether other operations do the same.
For example, I created a dataframe called df. The df is big, so when I run
some actions like:
df.sort(column_name).show()
df.collect()
it will throw an error like:
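The error text is cut off, but for what it's worth: actions like
sort().show() or collect() do not cache anything on their own; only
cache()/persist() do. collect() does, however, ship every row back to the
driver, which is the usual cause of out-of-memory errors on a big frame. A
minimal sketch:

    from pyspark import StorageLevel

    df.persist(StorageLevel.MEMORY_AND_DISK)  # caching is explicit and opt-in

    df.sort('column_name').show()  # brings only a handful of rows to the driver
    rows = df.collect()            # pulls the entire frame to the driver; risky when df is big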