I have a job that runs fine on relatively small input datasets but then
reaches a threshold where I begin to consistently get "Fetch failure" as
the Failure Reason, late in the job, during a saveAsTextFile() operation.
The first error we are seeing on the "Details for Stage" page is
"ExecutorLostFailure".
On HDFS I created:
/one/one.txt # contains text "one"
/one/two/two.txt # contains text "two"
Then:
val data = sc.textFile("/one/*")
data.collect
This returned:
Array(one, two)
So the above path designation appears to automatically recurse for you.
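If you want to confirm exactly which files a wildcard picked up, one option (my own suggestion, not part of the experiment above) is sc.wholeTextFiles(), which returns (path, contents) pairs; this assumes the glob expands the same way as it does for textFile():
val byFile = sc.wholeTextFiles("/one/*")  // RDD[(String, String)] of (file path, file contents)
byFile.keys.collect.foreach(println)      // expected to list both /one/one.txt and /one/two/two.txt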
I recently had the same problem. I'm not an expert, but I'd suggest that you
concatenate your files into a smaller number of larger files, e.g. in Linux
with something like cat file1 file2 >> a_larger_file. This helped greatly.
Likely others better qualified will weigh in on this later, but that's
something to get you started.
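If reshaping the input files on HDFS isn't convenient, a related Spark-side option (my suggestion, not the cat-based approach above) is to coalesce after reading, so fewer partitions, and therefore fewer tasks and output files, are involved. A minimal sketch; the path and partition count are made up:
val data = sc.textFile("hdfs:///corpus/*")        // roughly one partition (and task) per small file
val fewer = data.coalesce(64)                     // collapse to ~64 partitions without a full shuffle
fewer.saveAsTextFile("hdfs:///corpus-coalesced")  // writes ~64 part files instead of thousands
Note that this doesn't remove the cost of opening the many small input files; concatenating them upstream, as suggested above, avoids that entirely.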
We have a very large RDD, and I need to create a new RDD whose values are
derived from each record of the original RDD, retaining only the few new
records that meet a criterion. I want to avoid creating a second large RDD
and then filtering it, since I believe this could tax system resources
unnecessarily.
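One way to do the derivation and the filtering in a single pass is flatMap, where each input record yields zero or one output records, so the large intermediate RDD never exists as a separate step. A minimal, self-contained sketch; the data, derivation, and criterion are all stand-ins:
val originalRdd = sc.parallelize(1 to 1000000)        // stand-in for the very large RDD
val kept = originalRdd.flatMap { record =>
  val derived = record * 2                            // stand-in for the real derivation
  if (derived % 100000 == 0) Some(derived) else None  // stand-in for the real criterion
}
kept.collect()                                        // only the few surviving records come back
Also worth noting: transformations are lazy and pipelined, so even a map followed by a filter would not materialize the intermediate RDD; flatMap just keeps it to one transformation.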
Thanks! zipWithIndex() works well. I had overlooked it because the name
'zip' is rather odd.
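For reference, a minimal sketch of the zipWithIndex() approach; the sample data is made up, and the 1-based (index, record) shape is the one asked for in the original question:
val records = sc.parallelize(Seq("a", "b", "c"))
val numbered = records.zipWithIndex.map { case (rec, idx) => (idx + 1, rec) }  // zipWithIndex is 0-based and puts the index second
numbered.collect()  // Array((1,a), (2,b), (3,c))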
I think this is sort of a newbie question, but I've checked the API closely
and don't see an obvious answer:
Given an RDD, how would I create a new RDD of tuples where the first tuple
value is an incrementing Int, e.g. 1, 2, 3, ..., and the second value of the
tuple is the original RDD record?
Thank you! I had known about the small-files problem in HDFS but didn't
realize that it affected sc.textFile().
Our job is creating what appears to be an inordinate number of very small
tasks, which blow out our OS inode and file limits. Rather than continually
upping those limits, we are seeking to understand whether our real problem
is that too many tasks are running, perhaps because we are misconfigured.
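One quick check (my suggestion; the path is made up) is the partition count of the input RDD, since each partition becomes at least one task per stage, and textFile() tends to create at least one partition per small file:
val data = sc.textFile("hdfs:///our/corpus/*")
println(data.partitions.length)  // with many small files, expect roughly one partition per file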
We are relatively new to Spark and so far, during our development process,
have been manually submitting single jobs at a time for ML training, using
spark-submit. Each job accepts a small user-submitted data set and compares
it to every data set in our HDFS corpus, which only changes incrementally.
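For what it's worth, one common shape for this kind of small-set-versus-large-corpus comparison (a sketch under my own assumptions, not necessarily how your job is structured) is to broadcast the small user-submitted data and filter over the corpus:
val userSet = Set("query one", "query two")              // hypothetical small user-submitted data
val userBc = sc.broadcast(userSet)                       // ship the small set to every executor once
val corpus = sc.textFile("hdfs:///corpus/*")             // hypothetical corpus location
val matches = corpus.filter(line => userBc.value.exists(q => line.contains(q)))
matches.take(10)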