How to design the input source of spark stream

2016-03-30 Thread kramer2...@126.com
Hi

My environment is described like below:

There are 5 nodes, and each node generates a big CSV file every 5 minutes. I need Spark
Streaming to analyze these 5 files every five minutes and generate some
reports.

I am planning to do it this way:

1. Put those 5 files into an HDFS directory called /data
2. Merge them into one big file in that directory
3. Use the Spark Streaming constructor textFileStream('/data') to create my
input DStream (sketched below)
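
For step 3, a minimal sketch of what I have in mind, assuming a 5-minute batch
interval (the namenode URL and app name are placeholders); as far as I understand,
textFileStream monitors the directory and picks up every new file that appears there:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="csv_report")          # placeholder app name
ssc = StreamingContext(sc, 300)                  # 300 seconds = 5-minute batches

lines = ssc.textFileStream("hdfs://namenode:9000/data")   # placeholder namenode URL
rows = lines.map(lambda line: line.split(","))            # parse the CSV rows

rows.pprint()                                    # replace with the real report logic
ssc.start()
ssc.awaitTermination()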

The problem with this approach is that I do not know how to merge the 5 files in HDFS.
It seems very difficult to do in Python.

So my questions are:

1. Can you tell me how to merge files in HDFS with Python?
2. Do you know some other way to feed those files into Spark?
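
For question 1, one possibility is to let Spark itself do the merge rather than pure
Python (the paths below are placeholders; note that saveAsTextFile writes a directory
of part files, not a single named file):

# read all five CSVs as one RDD and write them back out as a single part file
merged = sc.textFile("hdfs://namenode:9000/data/*.csv")
merged.coalesce(1).saveAsTextFile("hdfs://namenode:9000/data_merged")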





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-design-the-input-source-of-spark-stream-tp26641.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



How to release data frame to avoid memory leak

2016-03-31 Thread kramer2...@126.com
Hi

I have data frames created every 5 minutes. I use a dict to keep the most recent
1 hour of data frames, so only 12 data frames are kept in the dict. When a new data
frame comes in, the oldest one is popped out.

My question is: when I pop out the old data frame, do I have to call
dataframe.unpersist to release the memory?

For example:

if currentTime == fiveMinutes:

    myDict[currentTime] = dataframe

    oldestDataFrame = myDict.pop(oldest)

Now, do I have to call oldestDataFrame.unpersist()? Because I think Python
will automatically release unused variables.
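
To make the question concrete, here is a minimal sketch of the eviction step using
the same names as above; my understanding is that Python garbage collection only
drops the local handle, while cached blocks on the executors stay around until
unpersist() is called, but please correct me if that is wrong:

myDict[currentTime] = dataframe.cache()

if len(myDict) > 12:                       # keep only the last hour (12 x 5 minutes)
    oldest = min(myDict)                   # assumes the keys sort chronologically
    oldestDataFrame = myDict.pop(oldest)
    oldestDataFrame.unpersist()            # explicitly release the cached blocks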






--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-release-data-frame-to-avoid-memory-leak-tp26656.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Why is Spark having an OutOfMemory exception?

2016-04-11 Thread kramer2...@126.com
I use Spark to do some very simple calculations. The process is described below
(pseudo code):


while timestamp == 5 minutes:

    df = read_hdf()                 # read HDFS to get a data frame every 5 minutes

    my_dict[timestamp] = df         # put the data frame into a dict

    delete_old_dataframe(my_dict)   # delete data frames whose timestamp is more than 24 hours old

    big_df = merge(my_dict)         # merge the most recent 24 hours of data frames

To explain: new files come in every 5 minutes, but I need to generate reports on
the most recent 24 hours of data.
Keeping 24 hours of data means I need to delete the oldest data frame every
time I put a new one in.
So I maintain a dict (my_dict in the code above) that maps timestamp to
dataframe. Every time I put a data frame into the dict, I go through the dict
and delete the old data frames whose timestamps are more than 24 hours old.
After deleting and inserting, I merge the data frames in the dict into one big one and
run SQL on it to get my report.
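
A slightly more concrete sketch of that loop body (the body of delete_old_dataframe
here is hypothetical, and I assume the timestamps are epoch seconds):

from functools import reduce

def delete_old_dataframe(frames, now, max_age=24 * 3600):
    # drop and release every frame whose timestamp is more than 24 hours old
    for ts in [t for t in frames if now - t > max_age]:
        frames.pop(ts).unpersist()

my_dict[timestamp] = df.cache()                # df is the frame read in this iteration
delete_old_dataframe(my_dict, timestamp)

# merging with unionAll keeps extending the logical plan every iteration,
# which is the lineage growth I wonder about below
big_df = reduce(lambda a, b: a.unionAll(b), my_dict.values())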

*
I want to know whether anything is wrong with this model, because it becomes very slow
after running for a while and then hits OutOfMemory. I know that my memory is
enough, and the files are very small for test purposes, so there should not be a
memory problem.

I am wondering if there is a lineage issue, but I am not sure.

*



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Why-Spark-having-OutOfMemory-Exception-tp26743.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Is it possible that a Spark worker requires more resources than the cluster has?

2016-04-18 Thread kramer2...@126.com
I have a standalone cluster running on one node.

The ps output below shows that the Worker and Master daemons each run with a
1 GB heap and a 256m MaxPermSize.

root 23182 1  0 Apr01 ?00:19:30 java -cp
/opt/spark-1.6.0-bin-hadoop2.6/conf/:/opt/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/opt/hadoop-2.6.2/etc/hadoop/
-Xms1g -Xmx1g -XX:MaxPermSize=256m org.apache.spark.deploy.worker.Worker
--webui-port 8081 spark://ES01:7077

root 23053 1  0 Apr01 ?00:25:00 java -cp
/opt/spark-1.6.0-bin-hadoop2.6/conf/:/opt/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/opt/hadoop-2.6.2/etc/hadoop/
-Xms1g -Xmx1g -XX:MaxPermSize=256m org.apache.spark.deploy.master.Master
--ip ES01 --port 7077 --webui-port 8080


But when I submit my application to the cluster, I can specify that the driver
uses 10 GB of memory and that the executor uses 10 GB as well.

So does it make sense that I can assign more memory to the application than the
cluster itself has?
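
For reference, a minimal SparkConf sketch that is roughly equivalent to requesting
those resources with the submit flags (the values are just the ones from my question):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster("spark://ES01:7077")
        .set("spark.executor.memory", "10g")    # same as --executor-memory 10G
        .set("spark.cores.max", "1"))           # same as --total-executor-cores 1
sc = SparkContext(conf=conf)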



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Is-it-possible-that-spark-worker-require-more-resource-than-the-cluster-tp26799.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Why does a very small workload cause GC overhead limit exceeded?

2016-04-19 Thread kramer2...@126.com
Hi All

I use Spark to do some calculations.
The situation is:
1. New files come into a folder periodically
2. I turn each new file into a data frame and union it into the previously
accumulated data frame.

The code is like below:


from hdfs import InsecureClient
from pyspark.sql import Row
from pyspark.sql import functions as func

# sc and sqlContext are assumed to already exist (e.g. from the pyspark shell)

# Get the file list in the HDFS directory
client = InsecureClient('http://10.79.148.184:50070')
file_list = client.list('/test')

df_total = None
counter = 0
for file in file_list:
    counter += 1

    # turn each file (CSV format) into a data frame
    lines = sc.textFile("/test/%s" % file)
    parts = lines.map(lambda l: l.split(","))
    rows = parts.map(lambda p: Row(router=p[0], interface=int(p[1]),
                                   protocol=p[7], bit=int(p[10])))
    df = sqlContext.createDataFrame(rows)

    # do some transformation on the data frame
    df_protocol = df.groupBy(['protocol']).agg(func.sum('bit').alias('bit'))

    # add the current data frame to the previous data frame set
    if df_total is None:
        df_total = df_protocol
    else:
        df_total = df_total.unionAll(df_protocol)

    # cache df_total
    df_total.cache()
    if counter % 5 == 0:
        df_total.rdd.checkpoint()

    # show the df_total information
    df_total.show()
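
In case it matters, my understanding of how that rdd.checkpoint() call is supposed to
be wired up is roughly the following (the checkpoint directory is a placeholder); the
checkpoint data is only written once an action runs on the checkpointed RDD:

sc.setCheckpointDir("hdfs://namenode:9000/spark_checkpoints")   # placeholder path

rdd = df_total.rdd
rdd.checkpoint()
rdd.count()        # an action on this RDD forces the checkpoint to be materialized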
 

I know that as time goes on, df_total could become big. But before that point is
reached, the code above already raises an exception.

After about 30 loop iterations, the code throws a GC overhead limit exceeded
exception. The files are very small, so even after 300 iterations the data size would
only be a few MB. I do not know why it throws a GC error.

The exception details are below:

16/04/19 15:54:55 ERROR Utils: Uncaught exception in thread task-result-getter-2
java.lang.OutOfMemoryError: GC overhead limit exceeded
        at scala.collection.immutable.HashMap$HashTrieMap.updated0(HashMap.scala:328)
        at scala.collection.immutable.HashMap.updated(HashMap.scala:54)
        at scala.collection.immutable.HashMap$SerializationProxy.readObject(HashMap.scala:516)
        at sun.reflect.GeneratedMethodAccessor21.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
        at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
        at java.io.ObjectInputStream.defaultReadObject(ObjectInputStream.java:500)
        at org.apache.spark.executor.TaskMetrics$$anonfun$readObject$1.apply$mcV$sp(TaskMetrics.scala:220)
        at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1204)
        at org.apache.spark.executor.TaskMetrics.readObject(TaskMetrics.scala:219)
        at sun.reflect.GeneratedMethodAccessor19.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
        at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
        at org.apache.spark.scheduler.DirectTaskResult$$anonfun$readExternal$1.apply$mcV$sp(TaskResult.scala:79)
        at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1204)
        at org.apache.spark.scheduler.DirectTaskResult.readExternal(TaskResult.scala:62)
        at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
        at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
        at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
        at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:109)
Exception in thread "task-result-getter-2" java.lang.OutOfMemoryError: GC overhead limit exceeded
        at scala.collection.immutable.HashMap$HashTrieMap.updated0(HashMap.scala:328)
        at scala.collection.immutable.HashMap.updated(Hash

Re: Scala vs Python for Spark ecosystem

2016-04-20 Thread kramer2...@126.com
I am using Python with Spark.

I think one problem might be integrating Spark with third-party products. For
example, to combine Spark with Elasticsearch you have to use Java or Scala;
Python is not supported.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Scala-vs-Python-for-Spark-ecosystem-tp26805p26806.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



How big can the Spark Streaming window be?

2016-05-08 Thread kramer2...@126.com
We have some streaming data that needs to be processed, and we are considering using
Spark Streaming to do it.

We need to generate three kinds of reports. The reports are based on:

1. The last 5 minutes of data
2. The last 1 hour of data
3. The last 24 hours of data

The reports are generated every 5 minutes.

After reading the docs, the most obvious way to solve this seems to be to set up
Spark Streaming with a 5-minute batch interval and two windows of 1 hour and
1 day, as sketched below.
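
A minimal sketch of that layout, assuming the input comes from a monitored HDFS
directory (the path is a placeholder):

from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, 300)                           # 5-minute batch interval
records = ssc.textFileStream("hdfs://namenode:9000/stream_in")

five_min = records                                        # each batch already covers 5 minutes
one_hour = records.window(3600, 300)                      # 1-hour window, sliding every 5 minutes
one_day = records.window(24 * 3600, 300)                  # 24-hour window, sliding every 5 minutes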


But I am worried that 1-hour and 1-day windows may be too big. I do not have much
experience with Spark Streaming, so what window lengths do you use in your
environments?

Are there any official docs discussing this?




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-big-the-spark-stream-window-could-be-tp26899.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Why do I have a memory leak with such simple Spark Streaming code?

2016-05-09 Thread kramer2...@126.com
Hi 


I wrote some Python code to do calculations on a Spark stream. The code works
fine for about half an hour, then the memory usage of the executor becomes
very high. I assigned 4 GB in the submit command, but it is using 80% of my
physical memory, which is 16 GB (I can see this with the top command). In this
situation the code just hangs.


You may say the workload is too big and that is why there is a memory issue, but it
is not. My stream interval is 30 seconds. The workload is one source generating a
file with 10,000 lines every 10 seconds, so one batch interval is about 30,000 lines
of CSV, only a few KB. So it cannot be the workload.


The cluster I use is a Spark standalone cluster on a single node.


The submit command I use is:

./bin/spark-submit --master spark://ES01:7077 --executor-memory 4G --num-executors 1 --total-executor-cores 1 ./latest5min.py 1>a.log 2>b.log


The code is all in the file latest5min.py. The logic is very simple and the
file contains fewer than 100 lines.


I will attach the file here: latest5min.py


I know it is not a pleasant experience to read other people's code, and I will try
to reduce my code to see where the problem is. But every time I need to wait half
an hour or longer to hit the error, so it will take some time.

Please help check the current code first if possible. I will be very happy to
answer any questions.

I really appreciate the help. This is a real headache problem; I have totally no
clue what is happening.







--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Why-I-have-memory-leaking-for-such-simple-spark-stream-code-tp26904.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Why do I have a memory leak with such simple Spark Streaming code?

2016-05-09 Thread kramer2...@126.com
To add more information:

This is the setting in my spark-env.sh
[root@ES01 conf]# grep -v "#" spark-env.sh
SPARK_EXECUTOR_CORES=1
SPARK_EXECUTOR_INSTANCES=1
SPARK_DAEMON_MEMORY=4G

So I did not set the executor to use more memory here.

Also here is the top output:

KiB Mem : 16268156 total,   161116 free, 15213076 used,   893964 buff/cache
KiB Swap:  6291452 total,  3332460 free,  2958992 used.   238788 avail Mem

  PID USER     PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 6629 root     20   0 49.990g 0.011t   5324 S   0.0 72.1  78:28.99 java


As you can see, process 6629, which is the executor, is using 72% of memory.


So I wonder why it is causing such high memory usage.




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Why-I-have-memory-leaking-for-such-simple-spark-stream-code-tp26904p26910.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



What does the Spark standalone cluster do?

2016-05-10 Thread kramer2...@126.com
Hello.

My question is: what does the Spark standalone cluster actually do here? Because
when we submit a program like below:

./bin/spark-submit --master spark://ES01:7077 --executor-memory 4G --num-executors 1 --total-executor-cores 1 --conf "spark.storage.memoryFraction=0.2"

we have specified the resource allocation manually and specified the configuration
manually.

So what does the cluster do here?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/What-does-the-spark-stand-alone-cluster-do-tp26920.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Will the HiveContext cause a memory leak?

2016-05-10 Thread kramer2...@126.com
I submitted my code to a Spark standalone cluster and found that the memory usage
of the executor process keeps growing, which causes the program to crash.

I modified the code and submitted it several times, and found that the four
statements below may be causing the issue (imports added here for completeness):

from pyspark.sql import functions as func
from pyspark.sql.window import Window

dataframe = dataframe.groupBy(['router', 'interface']).agg(func.sum('bits').alias('bits'))
windowSpec = Window.partitionBy(dataframe['router']).orderBy(dataframe['bits'].desc())
rank = func.dense_rank().over(windowSpec)
ret = dataframe.select(dataframe['router'], dataframe['interface'], dataframe['bits'],
                       rank.alias('rank')).filter("rank<=2")

It looks a little complicated, but it is just a window function on a data frame. I
use HiveContext because SQLContext does not support window functions yet. Without
these lines my code can run all night; adding them causes the memory leak, and the
program crashes within a few hours.
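
For reference, a minimal sketch of how the HiveContext itself is created (the app
name is illustrative):

from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="ForAsk")     # illustrative app name
sqlContext = HiveContext(sc)            # window functions need HiveContext in Spark 1.6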

I will provide the whole code (50 lines) here: ForAsk01.py
Please advise me if it is a bug.

Also here is the submit command 

nohup ./bin/spark-submit  \  
--master spark://ES01:7077 \
--executor-memory 4G \
--num-executors 1 \
--total-executor-cores 1 \
--conf "spark.storage.memoryFraction=0.2"  \
./ForAsk.py 1>a.log 2>b.log &





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Will-the-HiveContext-cause-memory-leak-tp26921.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Will the HiveContext cause a memory leak?

2016-05-11 Thread kramer2...@126.com
After 8 hours, the memory usage becomes stable. The top command shows it at 75%,
which means about 12 GB of memory.


But it still does not make sense, because my workload is very small.


I use this Spark job to run a calculation on one CSV file every 20 seconds. The size
of the CSV file is 1.3 MB.


So Spark is using almost 10,000 times more memory than my workload (roughly
12 GB / 1.3 MB ≈ 9,000). Does that mean I need to prepare 1 TB of RAM if the
workload is 100 MB?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Will-the-HiveContext-cause-memory-leak-tp26921p26927.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Will the HiveContext cause a memory leak?

2016-05-11 Thread kramer2...@126.com
Sorry, I have to correct myself again. It may still be a memory leak, because
eventually the memory usage goes up again...

Eventually, the streaming program crashed.





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Will-the-HiveContext-cause-memory-leak-tp26921p26933.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re:Re: Will the HiveContext cause a memory leak?

2016-05-11 Thread kramer2...@126.com
Hi Simon


Can you describe your problem in more detail?
I suspect that my problem is caused by the window function (or maybe the groupBy
and agg functions).
If yours is the same, maybe we should report a bug.






At 2016-05-11 23:46:49, "Simon Schiff [via Apache Spark User List]" 
 wrote:
I have the same Problem with Spark-2.0.0 Snapshot with Streaming. There I use 
Datasets instead of Dataframes. I hope you or someone will find a solution.





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Will-the-HiveContext-cause-memory-leak-tp26921p26934.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re:Re: Re:Re: Will the HiveContext cause a memory leak?

2016-05-12 Thread kramer2...@126.com
It seems we hit the same issue.


There was a bug in 1.5.1 about a memory leak, but I am using 1.6.1.


Here is the link about the bug in 1.5.1:
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark






At 2016-05-12 23:10:43, "Simon Schiff [via Apache Spark User List]" 
 wrote:
I read with Spark-Streaming from a Port. The incoming data consists of key and 
value pairs. Then I call forEachRDD on each window. There I create a Dataset 
from the window and do some SQL Querys on it. On the result i only do show, to 
see the content. It works well, but the memory usage increases. When it reaches 
the maximum nothing works anymore. When I use more memory. The Program runs 
some time longer, but the problem persists. Because I run a Programm which 
writes to the Port, I can control perfectly how much Data Spark has to Process. 
When I write every one ms one key and value Pair the Problem is the same as 
when i write only every second a key and value pair to the port.

When I dont create a Dataset in the foreachRDD and only count the Elements in 
the RDD, then everything works fine. I also use groupBy agg functions in the 
querys.





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Will-the-HiveContext-cause-memory-leak-tp26921p26946.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re:Re:Re: Re:Re: Will the HiveContext cause a memory leak?

2016-05-12 Thread kramer2...@126.com
Sorry, the bug link in the previous mail was wrong.


Here is the real link:


http://apache-spark-developers-list.1001551.n3.nabble.com/Re-SQL-Memory-leak-with-spark-streaming-and-spark-sql-in-spark-1-5-1-td14603.html











At 2016-05-13 09:49:05, "李明伟"  wrote:

It seems we hit the same issue.


There was a bug in 1.5.1 about a memory leak, but I am using 1.6.1.


Here is the link about the bug in 1.5.1:
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark






At 2016-05-12 23:10:43, "Simon Schiff [via Apache Spark User List]" 
 wrote:
I read with Spark-Streaming from a Port. The incoming data consists of key and 
value pairs. Then I call forEachRDD on each window. There I create a Dataset 
from the window and do some SQL Querys on it. On the result i only do show, to 
see the content. It works well, but the memory usage increases. When it reaches 
the maximum nothing works anymore. When I use more memory. The Program runs 
some time longer, but the problem persists. Because I run a Programm which 
writes to the Port, I can control perfectly how much Data Spark has to Process. 
When I write every one ms one key and value Pair the Problem is the same as 
when i write only every second a key and value pair to the port.

When I dont create a Dataset in the foreachRDD and only count the Elements in 
the RDD, then everything works fine. I also use groupBy agg functions in the 
querys.






 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Will-the-HiveContext-cause-memory-leak-tp26921p26947.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Will Spark swap memory out to disk if there is not enough memory?

2016-05-16 Thread kramer2...@126.com
I know the cache operation can cache data in memory/disk...

But I want to know whether other operations will do the same.

For example, I created a dataframe called df. The df is big, so when I run an
action like:

df.sort(column_name).show()
df.collect()

It throws an error like:
16/05/17 10:53:36 ERROR Executor: Managed memory leak detected; size = 2359296 bytes, TID = 15
16/05/17 10:53:36 ERROR Executor: Exception in task 0.0 in stage 12.0 (TID 15)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/opt/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/worker.py", line 111, in main
    process()
  File "/opt/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/worker.py", line 106, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/opt/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/serializers.py", line 263, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "", line 1, in 
IndexError: list index out of range


I want to know whether there is any way or configuration to let Spark swap memory
to disk in this situation.
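
For reference, the cache path I mentioned at the top looks roughly like this minimal
sketch (assuming df is the same dataframe as above); as far as I know this only
controls where persisted blocks live, which is why I am asking whether other
operations can spill too:

from pyspark import StorageLevel

df.persist(StorageLevel.MEMORY_AND_DISK)   # cached partitions that do not fit in memory spill to disk
df.sort(column_name).show()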



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Will-spark-swap-memory-out-to-disk-if-the-memory-is-not-enough-tp26968.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org