I’ve been attempting to run a SparkR translation of a similar Scala job that identifies words from a corpus that do not appear in a newline-delimited dictionary. The R code is:
dict <- SparkR:::textFile(sc, src1)
corpus <- SparkR:::textFile(sc, src2)
words <- distinct(SparkR:::flatMap(corpus, function(line) {
  gsub("[[:punct:]]", "", tolower(strsplit(line, " |,|-")[[1]]))
}))
found <- subtract(words, dict)

(where src1 and src2 are locations on HDFS)

Then calling something like take(found, 10) or saveAsTextFile(found, dest) should realize the collection, but that stage of the DAG hangs in Scheduler Delay during the collectPartitions phase.

The equivalent Scala code, however,

val corpus = sc.textFile(src1).flatMap(_.split(" |,|-"))
val dict = sc.textFile(src2)
val words = corpus.map(word => word.filter(Character.isLetter(_))).distinct()
val found = words.subtract(dict)

performs as expected.

Any thoughts?

Thanks,
Alek Eskilson
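(For reference, one rough way to narrow down which stage is actually hanging is to realize the intermediate RDD before the subtract. The sketch below reuses the same sc, dict, and corpus defined above and only the SparkR calls already shown; it is a diagnostic sketch, not a fix.)

# Realize the intermediate RDD first: if this already hangs, the problem is in
# the flatMap/distinct stage rather than in subtract().
words <- distinct(SparkR:::flatMap(corpus, function(line) {
  gsub("[[:punct:]]", "", tolower(strsplit(line, " |,|-")[[1]]))
}))
take(words, 10)

# Only involve subtract() once the step above returns:
found <- subtract(words, dict)
take(found, 10)

If take(words, 10) already stalls in the same collectPartitions phase, then subtract() is not the culprit; if it returns promptly, the hang is isolated to the subtract stage.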