Hi,
I have 1 master and 4 slave nodes; the input data size is 14 GB.
Slave node config: 32 GB RAM, 16 cores.
I am trying to train a word-embedding model using Spark, and it is going out
of memory. How much memory do I need to train on 14 GB of data?
I have given 20 GB per executor, but the log below shows only 11.8 GB free
out of 20 GB:
BlockManagerInfo: Added broadcast_1_piece0 in memory on ip-.-.-.dev:35035
(size: 4.6 KB, free: 11.8 GB)
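For what it's worth, the 11.8 GB figure is not a mystery: with Spark's unified memory manager (Spark 1.6+), the BlockManager only reports Spark's unified memory region, not the whole JVM heap. Using the defaults (300 MB reserved, spark.memory.fraction = 0.6), a quick sketch reproduces the number in the log:

```python
# Why a 20 GB executor shows ~11.8 GB free in BlockManagerInfo.
# Defaults below are Spark's unified memory manager defaults.
heap_mb = 20 * 1024          # --executor-memory 20g
reserved_mb = 300            # fixed reserved memory
memory_fraction = 0.6        # spark.memory.fraction default

unified_gb = (heap_mb - reserved_mb) * memory_fraction / 1024
print(f"{unified_gb:.1f} GB")  # matches the 11.8 GB in the log
```

So the executor really does have the full 20 GB; only ~60% of it is available for caching and execution.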
This is the code:

from pyspark import SparkContext
from pyspark.mllib.feature import Word2Vec

if __name__ == "__main__":
    sc = SparkContext(appName="Word2VecExample")

    # Each line is one sentence; split it into a list of words.
    inp = (sc.textFile("s3://word2vec/data/word2vec_word_data.txt/")
             .map(lambda row: row.split(" ")))

    word2vec = Word2Vec()
    model = word2vec.fit(inp)
    model.save(sc, "s3://pysparkml/word2vecresult2/")
    sc.stop()
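One thing worth estimating before throwing more memory at fit(): MLlib's Word2Vec keeps two float matrices of size vocabulary x vectorSize (the input and output word vectors) and broadcasts them to every executor, so it is often the vocabulary size, not the 14 GB of input, that blows the heap. A rough sketch with assumed numbers (the vocabulary count is purely illustrative; 100 is the MLlib default vectorSize):

```python
# Back-of-envelope Word2Vec model size. The vocabulary size here is an
# assumption for illustration, not a number from the job above.
vocab_size = 1_000_000       # assumed: distinct words after min-count filtering
vector_size = 100            # MLlib Word2Vec default vectorSize
bytes_per_float = 4

# Two vocab_size x vector_size float matrices (input and output vectors)
# are held and broadcast per executor.
model_bytes = 2 * vocab_size * vector_size * bytes_per_float
print(f"~{model_bytes / 1024**3:.2f} GB per copy")
```

If your vocabulary is in the tens of millions, that estimate grows linearly and can exceed the executor heap on its own; capping the vocabulary (setMinCount) or reducing vectorSize shrinks it directly.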
Spark-submit command (the duplicated -XX:+UseG1GC flag removed):

spark-submit --master yarn \
  --conf 'spark.executor.extraJavaOptions=-XX:+HeapDumpOnOutOfMemoryError
  -XX:HeapDumpPath=/mnt/tmp -XX:+UseG1GC -XX:+PrintFlagsFinal
  -XX:+PrintReferenceGC -verbose:gc -XX:+PrintGCDetails
  -XX:+PrintGCTimeStamps -XX:+PrintAdaptiveSizePolicy
  -XX:+UnlockDiagnosticVMOptions -XX:+G1SummarizeConcMark' \
  --num-executors 4 --executor-cores 2 --executor-memory 20g \
  Word2VecExample.py
--
Selvam Raman
"Shun bribery; hold your head high"