Hi
My environment is as follows:
There are 5 nodes, and each node generates a big CSV file every 5 minutes. I
need Spark Streaming to analyze these 5 files every five minutes to generate
some reports.
I am planning to do it in this way:
1. Put those 5 files into an HDFS directory called /data
2. Merge
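The post is cut off at step 2, but a minimal sketch of the read-and-merge idea
could look like the following. The glob path, table name, and report query are
placeholders, and since this thread runs Spark 1.6, CSV support is assumed to
come from the external spark-csv package:

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="five-minute-report")
    sqlContext = SQLContext(sc)

    # Spark can read a whole directory (or glob) in one call, so the five
    # files do not need to be merged by hand before the analysis.
    df = (sqlContext.read
          .format("com.databricks.spark.csv")
          .options(header="true", inferSchema="true")
          .load("hdfs:///data/*.csv"))

    df.registerTempTable("metrics")                        # hypothetical table name
    sqlContext.sql("SELECT COUNT(*) FROM metrics").show()  # placeholder report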
Hi
I have data frames created every 5 minutes. I use a dict to keep the most
recent 1 hour of data frames, so only 12 data frames can be kept in the dict.
As new data frames come in, old data frames pop out.
My question is: when I pop out an old data frame, do I have to call
dataframe.unpersist to release the
I use Spark to do some very simple calculations. The description is below
(pseudocode):
while True:  # wakes up every 5 minutes
    df = read_hdf()                # read HDFS to get a dataframe every 5 minutes
    my_dict[timestamp] = df        # put the data frame into a dict, keyed by time
    delete_old_dataframe(my_dict)  # evict frames older than 1 hour
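A minimal sketch of what delete_old_dataframe could do, assuming my_dict is
keyed by Unix timestamps and that each frame was persisted when created;
unpersist only releases storage that cache()/persist() pinned:

    import time

    ONE_HOUR = 3600  # 12 five-minute frames

    def delete_old_dataframe(my_dict, now=None):
        """Evict frames older than one hour and release their cached blocks."""
        if now is None:
            now = time.time()
        for ts in list(my_dict):        # copy the keys; we mutate while iterating
            if now - ts > ONE_HOUR:
                old_df = my_dict.pop(ts)
                # blocking=True waits until the blocks are actually freed
                old_df.unpersist(blocking=True)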
I have a standalone cluster running on one node.
The ps command shows that the Worker has 1 GB of memory and the Driver has
256 MB.
root 23182 1 0 Apr01 ? 00:19:30 java -cp
/opt/spark-1.6.0-bin-hadoop2.6/conf/:/opt/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.
Hi All
I use Spark to do some calculations.
The situation is:
1. New files come into a folder periodically.
2. I turn the new files into a data frame and insert it into the previous data
frame.
The code is like below:
# Get the file list in the HDFS directory
client = InsecureClient('h
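The client line is cut off, but a sketch of the list-and-union loop could look
like this, assuming the hdfs package's InsecureClient (which the truncated
line appears to construct), a hypothetical namenode address, and files that
all share one schema; on Spark 1.6 data frames are combined with unionAll:

    from hdfs import InsecureClient

    client = InsecureClient("http://namenode:50070")  # hypothetical address
    seen = set()                                      # names already loaded

    def merge_new_files(sqlContext, previous_df, directory="/data"):
        """Union any files not seen before into the running data frame."""
        df = previous_df
        for name in client.list(directory):           # file list in the HDFS directory
            if name in seen:
                continue
            seen.add(name)
            new_df = (sqlContext.read
                      .format("com.databricks.spark.csv")
                      .options(header="true")
                      .load("hdfs://" + directory + "/" + name))
            df = new_df if df is None else df.unionAll(new_df)
        return df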
I am using Python and Spark.
I think one problem might be communicating between Spark and third-party
products. For example, to combine Spark with Elasticsearch you have to use
Java or Scala; Python is not supported.
We have some streaming data that needs to be calculated, and we are
considering using Spark Streaming to do it.
We need to generate three kinds of reports. The reports are based on:
1. The last 5 minutes of data
2. The last 1 hour of data
3. The last 24 hours of data
The frequency of the reports is 5 minutes.
After reading the d
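The message is cut off, but one way the three reports could be expressed is
with Spark Streaming's window operation; the source directory and
generate_report are placeholders, and in PySpark window lengths are given in
seconds:

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="windowed-reports")
    ssc = StreamingContext(sc, 300)         # one batch every 5 minutes
    ssc.checkpoint("hdfs:///checkpoint")    # recommended when using long windows

    lines = ssc.textFileStream("hdfs:///data")  # hypothetical source directory

    def make_report(label):
        def report(rdd):                    # exactly one argument, as foreachRDD expects
            generate_report(label, rdd)     # generate_report is hypothetical
        return report

    # All three reports slide every 5 minutes; only the window length differs.
    for label, length in [("5min", 300), ("1hour", 3600), ("24hour", 86400)]:
        lines.window(length, 300).foreachRDD(make_report(label))

    ssc.start()
    ssc.awaitTermination()

Note that the 24-hour window keeps a full day of batches alive, which is worth
keeping in mind given the memory questions later in this thread.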
Hi
I wrote some Python code to do calculations on a Spark stream. The code works
fine for about half an hour, then the memory usage of the executor becomes
very high. I assign 4 GB in the submit command, but it is using 80% of my
physical memory, which is 16 GB. I see this from the top command. In this
situat
To add more info:
This is the setting in my spark-env.sh
[root@ES01 conf]# grep -v "#" spark-env.sh
SPARK_EXECUTOR_CORES=1
SPARK_EXECUTOR_INSTANCES=1
SPARK_DAEMON_MEMORY=4G
So I did not set the executor to use more memory here.
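For reference: SPARK_DAEMON_MEMORY only sizes the standalone master and worker
daemons. On a standalone cluster the executor memory is set per application,
for example:

./bin/spark-submit --master spark://ES01:7077 --executor-memory 4G ...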
Also here is the top output
KiB Mem : 16268156 total, 161116 fr
Hello.
My question here is: what does the Spark standalone cluster do here? Because
when we submit a program like below:
./bin/spark-submit --master spark://ES01:7077 --executor-memory 4G
--num-executors 1 --total-executor-cores 1 --conf
"spark.storage.memoryFraction=0.2"
We specified the resour
I submit my code to a Spark standalone cluster and find that the executor
process's memory usage keeps growing, which causes the program to crash.
I modified the code and submitted it several times, and found that the 4 lines
below may be causing the issue:
dataframe =
dataframe.groupBy(['router','interface']).agg(func.sum('bi
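A sketch of one way to keep this pattern from accumulating state across
iterations: materialize each aggregation, then explicitly drop the previous
frame. The summed column name is hypothetical, since the line above is
truncated after func.sum('bi:

    import pyspark.sql.functions as func

    # 'bits' is a hypothetical completion of the truncated column name above
    aggregated = (dataframe
                  .groupBy(['router', 'interface'])
                  .agg(func.sum('bits').alias('bits')))
    aggregated.persist()
    aggregated.count()       # force materialization before dropping the old frame
    dataframe.unpersist()    # release the previous iteration's cached blocks
    dataframe = aggregated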
After 8 hours, the memory usage became stable; the top command shows it at
75%, which means 12 GB of memory.
But it still does not make sense, because my workload is very small:
I use Spark to calculate on one CSV file every 20 seconds, and the size of the
CSV file is 1.3 MB.
So spark i
Sorry, I have to make a correction again. It may still be a memory leak,
because in the end the memory usage goes up again...
Eventually, the stream program crashed.
Hi Simon
Can you describe your problem in more detail?
I suspect that my problem is caused by the window function (or maybe the
groupBy/agg functions).
If yours is the same, maybe we should report a bug.
At 2016-05-11 23:46:49, "Simon Schiff [via Apache Spark User List]"
wrote:
I have t
It seems we hit the same issue.
There was a bug in 1.5.1 about a memory leak, but I am using 1.6.1.
Here is the link about the bug in 1.5.1:
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
At 2016-05-12 23:10:43, "Simon Schiff [via Apache Spark User List]"
wrote:
I
Sorry, the bug link in the previous mail was wrong.
Here is the real link:
http://apache-spark-developers-list.1001551.n3.nabble.com/Re-SQL-Memory-leak-with-spark-streaming-and-spark-sql-in-spark-1-5-1-td14603.html
At 2016-05-13 09:49:05, "ζζδΌ" wrote:
It seems we hit the same issue
I know the cache operation can cache data in memory/disk...
But I would like to know whether other operations do the same.
For example, I created a dataframe called df. The df is big, so when I run
some actions like:
df.sort(column_name).show()
df.collect()
it will throw an error like:
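The error text is cut off, but for what it's worth: actions like
sort().show() or collect() do not cache anything on their own; only
cache()/persist() do. collect() does, however, ship every row back to the
driver, which is the usual cause of out-of-memory errors on a big frame. A
minimal sketch:

    from pyspark import StorageLevel

    df.persist(StorageLevel.MEMORY_AND_DISK)  # caching is explicit and opt-in

    df.sort('column_name').show()  # brings only a handful of rows to the driver
    rows = df.collect()            # pulls the entire frame to the driver; risky when df is big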