I have a job that runs fine on relatively small input datasets but then
reaches a threshold where I begin to consistently get "Fetch failure" as
the Failure Reason, late in the job, during a saveAsTextFile() operation.
The first error we are seeing on the "Details for Stage" page is
"ExecutorLostFailure".
On HDFS I created:
/one/one.txt # contains text "one"
/one/two/two.txt # contains text "two"
Then:
val data = sc.textFile("/one/*")
data.collect
This returned:
Array(one, two)
So the above path designation appears to automatically recurse for you.
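If you want to confirm exactly which files a wildcard picked up, one option (my own suggestion, not part of the experiment above) is sc.wholeTextFiles(), which returns (path, contents) pairs; this assumes the glob expands the same way as it does for textFile():
val byFile = sc.wholeTextFiles("/one/*")  // RDD[(String, String)] of (file path, file contents)
byFile.keys.collect.foreach(println)      // expected to list both /one/one.txt and /one/two/two.txt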
I recently had the same problem. I'm not an expert, but I'd suggest that you
concatenate your files into a smaller number of larger files, e.g. in Linux
with something like cat file1 file2 >> a_larger_file. This helped greatly.
Likely others better qualified will weigh in on this later, but that's
something to get you started.
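If reshaping the input files on HDFS isn't convenient, a related Spark-side option (my suggestion, not the cat-based approach above) is to coalesce after reading, so fewer partitions, and therefore fewer tasks and output files, are involved. A minimal sketch; the path and partition count are made up:
val data = sc.textFile("hdfs:///corpus/*")        // roughly one partition (and task) per small file
val fewer = data.coalesce(64)                     // collapse to ~64 partitions without a full shuffle
fewer.saveAsTextFile("hdfs:///corpus-coalesced")  // writes ~64 part files instead of thousands
Note that this doesn't remove the cost of opening the many small input files; concatenating them upstream, as suggested above, avoids that entirely.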
We have a very large RDD, and I need to create a new RDD whose values are
derived from each record of the original RDD, retaining only the few new
records that meet a criterion. I want to avoid creating a second large RDD
and then filtering it, since I believe this could tax system resources
unnecessarily.
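One way to do the derivation and the filtering in a single pass is flatMap, where each input record yields zero or one output records, so the large intermediate RDD never exists as a separate step. A minimal, self-contained sketch; the data, derivation, and criterion are all stand-ins:
val originalRdd = sc.parallelize(1 to 1000000)        // stand-in for the very large RDD
val kept = originalRdd.flatMap { record =>
  val derived = record * 2                            // stand-in for the real derivation
  if (derived % 100000 == 0) Some(derived) else None  // stand-in for the real criterion
}
kept.collect()                                        // only the few surviving records come back
Also worth noting: transformations are lazy and pipelined, so even a map followed by a filter would not materialize the intermediate RDD; flatMap just keeps it to one transformation.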
Thanks! zipWithIndex() works well. I had overlooked it because the name
'zip' is rather odd.
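For reference, a minimal sketch of the zipWithIndex() approach; the sample data is made up, and the 1-based (index, record) shape is the one asked for in the original question:
val records = sc.parallelize(Seq("a", "b", "c"))
val numbered = records.zipWithIndex.map { case (rec, idx) => (idx + 1, rec) }  // zipWithIndex is 0-based and puts the index second
numbered.collect()  // Array((1,a), (2,b), (3,c))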
I think this is sort of a newbie question, but I've checked the API closely
and don't see an obvious answer:
Given an RDD, how would I create a new RDD of tuples where the first tuple
value is an incrementing Int, e.g. 1, 2, 3, ..., and the second value of the
tuple is the original RDD record?
Thank you! I had known about the small-files problem in HDFS but didn't
realize that it affected sc.textFile().
Our job is creating what appears to be an inordinate number of very small
tasks, which blow out our OS inode and file limits. Rather than continually
upping those limits, we are seeking to understand whether our real problem
is that too many tasks are running, perhaps because we are misconfigured.
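One quick check (my suggestion; the path is made up) is the partition count of the input RDD, since each partition becomes at least one task per stage, and textFile() tends to create at least one partition per small file:
val data = sc.textFile("hdfs:///our/corpus/*")
println(data.partitions.length)  // with many small files, expect roughly one partition per file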
We are relatively new to Spark and so far, during our development process,
have been manually submitting single jobs at a time for ML training, using
spark-submit. Each job accepts a small user-submitted data set and compares
it to every data set in our HDFS corpus, which only changes incrementally.
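For what it's worth, one common shape for this kind of small-set-versus-large-corpus comparison (a sketch under my own assumptions, not necessarily how your job is structured) is to broadcast the small user-submitted data and filter over the corpus:
val userSet = Set("query one", "query two")              // hypothetical small user-submitted data
val userBc = sc.broadcast(userSet)                       // ship the small set to every executor once
val corpus = sc.textFile("hdfs:///corpus/*")             // hypothetical corpus location
val matches = corpus.filter(line => userBc.value.exists(q => line.contains(q)))
matches.take(10)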