Re: wholeTextfiles not parallel, runs out of memory

2017-02-14 Thread Jörn Franke
Well, 1) the goal of wholeTextFiles is to have only one executor; 2) you use .gz, i.e. you will have at most one executor per file. > On 14 Feb 2017, at 09:36, Henry Tremblay wrote: > > When I use wholeTextFiles, Spark does not run in parallel, and YARN runs out > of memory. > I have documen
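
[A minimal sketch of the point above, with a hypothetical bucket path: gzip is not splittable, so each .gz file can only be decompressed by one task, and read parallelism comes from having many files; repartitioning after the read spreads the decompressed lines for heavier downstream work.]

import org.apache.spark.{SparkConf, SparkContext}

object GzipParallelism {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("gzip-parallelism"))

    // gzip is not splittable: each .gz file becomes a single partition,
    // so read parallelism is bounded by the number of files.
    val lines = sc.textFile("s3n://my-bucket/input/*.gz")   // hypothetical path

    // Spread the decompressed lines over more tasks before doing expensive work;
    // only the initial read stays at one task per file.
    val spread = lines.repartition(64)

    println(s"line count: ${spread.count()}")
    sc.stop()
  }
}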

Re: wholeTextFiles fails, but textFile succeeds for same path

2017-02-11 Thread Henry Tremblay
51,000 files at about 1/2 MB per file. I am wondering if I need this: http://docs.aws.amazon.com/emr/latest/ReleaseGuide/UsingEMR_s3distcp.html Although, if I understand you correctly, even if I copy the S3 files to HDFS on EMR and use wholeTextFiles, I am still only going to be able to u

Re: wholeTextFiles fails, but textFile succeeds for same path

2017-02-11 Thread Jörn Franke
Can you post more information about the number of files, their size, and the executor logs? A gzipped file is not splittable, i.e. only one executor can gunzip it (the unzipped data can then be processed in parallel). wholeTextFiles was designed to be executed only on one executor (e.g. for proce

Re: wholeTextFiles fails, but textFile succeeds for same path

2017-02-11 Thread Paul Tremblay
I've been working on this problem for several days (I am doing more to increase my knowledge of Spark). The code you linked to hangs because after reading in the file, I have to gunzip it. Another way that seems to be working is reading each file in using sc.textFile, and then writing it the H
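
[A rough sketch of the workaround described above, with hypothetical paths: read each S3 file individually with sc.textFile, which decompresses .gz on the fly, and write it back out to HDFS so the consolidated copy can be processed normally afterwards.]

import org.apache.spark.{SparkConf, SparkContext}

object CopyS3ToHdfs {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("copy-s3-to-hdfs"))

    // Hypothetical list of input files; in practice this would come from
    // listing the bucket or from a manifest.
    val inputs = Seq(
      "s3n://my-bucket/input/part-0001.gz",
      "s3n://my-bucket/input/part-0002.gz"
    )

    inputs.zipWithIndex.foreach { case (path, i) =>
      // sc.textFile handles the gunzip; each file is written out
      // under its own HDFS directory.
      sc.textFile(path).saveAsTextFile(s"hdfs:///user/me/unzipped/part-$i")
    }

    sc.stop()
  }
}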

Re: wholeTextFiles fails, but textFile succeeds for same path

2017-02-06 Thread Jon Gregg
Strange that it's working for some directories but not others. It looks like wholeTextFiles maybe doesn't work with S3? https://issues.apache.org/jira/browse/SPARK-4414 . If it's possible to load the data into EMR and run Spark from there, that may be a workaround. This blogspot shows a Python worka
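
[One workaround sometimes used when wholeTextFiles misbehaves against S3 is to distribute the object paths yourself and read each one inside a map via the Hadoop FileSystem API. A hedged sketch, not the workaround from the linked blog: the paths are hypothetical and it assumes the executors already have S3 credentials configured, e.g. in core-site.xml.]

import java.net.URI
import scala.io.Source

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.{SparkConf, SparkContext}

object ManualWholeFiles {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("manual-whole-files"))

    // Hypothetical object paths; normally obtained by listing the bucket.
    val paths = Seq(
      "s3n://my-bucket/docs/a.txt",
      "s3n://my-bucket/docs/b.txt"
    )

    // (path, content) pairs, read on the executors instead of via wholeTextFiles.
    val docs = sc.parallelize(paths, paths.size).map { p =>
      val fs = FileSystem.get(new URI(p), new Configuration())
      val in = fs.open(new Path(p))
      try (p, Source.fromInputStream(in, "UTF-8").mkString)
      finally in.close()
    }

    println(docs.first()._1)
    sc.stop()
  }
}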

Re: wholeTextFiles fails, but textFile succeeds for same path

2017-02-06 Thread Paul Tremblay
I've actually been able to trace the problem to the files being read in. If I change to a different directory, then I don't get the error. Is one of the executors running out of memory? On 02/06/2017 02:35 PM, Paul Tremblay wrote: When I try to create an RDD using wholeTextFiles, I get an i

Re: wholeTextFiles()

2016-12-12 Thread Jakob Odersky
Also, in case the issue was not due to the string length (though that limit is still real and may get you later), it may be due to some other indexing issues which are currently being worked on here: https://issues.apache.org/jira/browse/SPARK-6235 On Mon, Dec 12, 2016 at 8:18 PM, Jakob Odersky w

Re: wholeTextFiles()

2016-12-12 Thread Jakob Odersky
Hi Pradeep, I'm afraid you're running into a hard Java issue. Strings are indexed with signed integers and can therefore not be longer than approximately 2 billion characters. Could you use `textFile` as a workaround? It will give you an RDD of the files' lines instead. In general, this guide htt
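
[A small sketch of the suggested workaround, with a hypothetical path: textFile returns an RDD of the files' lines, so no single String has to hold an entire multi-gigabyte file.]

import org.apache.spark.{SparkConf, SparkContext}

object LinesInsteadOfWholeFiles {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("lines-not-whole-files"))

    // wholeTextFiles would try to build one String per file, which fails once
    // a file's text exceeds roughly 2 billion characters.
    // textFile yields the files' lines instead, one String per line.
    val lines = sc.textFile("hdfs:///user/me/huge-input/*.txt")   // hypothetical path

    println(s"lines: ${lines.count()}, longest: ${lines.map(_.length).max()}")
    sc.stop()
  }
}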

Re: wholeTextFiles("/x/*/*.txt") runs single threaded

2015-07-02 Thread Kostas Kougios
In the Spark UI I can see it creating 2 stages. I tried wholeTextFiles().repartition(32), but I get the same threading results. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/wholeTextFiles-x-txt-runs-single-threaded-tp23591p23593.html Sent from the Apache Spark User List
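
[repartition only splits the data after it has been read; if the goal is to get the read itself onto more tasks, wholeTextFiles accepts a minPartitions hint. A minimal sketch, using the glob from this thread:]

import org.apache.spark.{SparkConf, SparkContext}

object WholeTextFilesMinPartitions {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("whole-text-files-min-partitions"))

    // The second argument is a *minimum* number of partitions for the read;
    // small files are still grouped by CombineFileInputFormat, so the actual
    // partition count depends on file sizes and locations.
    val files = sc.wholeTextFiles("/x/*/*.txt", minPartitions = 32)

    println(s"partitions: ${files.partitions.length}, files: ${files.count()}")
    sc.stop()
  }
}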

Re: wholeTextFiles not working with HDFS

2014-08-22 Thread pierred
I forgot to say: I am using bin/spark-shell, spark-1.0.2. That host has Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_11). -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/wholeTextFiles-not-working-with-HDFS-tp7490p12678.html Sent from t

Re: wholeTextFiles not working with HDFS

2014-08-22 Thread pierred
I had the same issue with spark-1.0.2-bin-hadoop*1*, and indeed the issue seems related to Hadoop 1. After switching to spark-1.0.2-bin-hadoop*2*, the issue disappears. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/wholeTextFiles-not-working-with-HDF

Re: wholeTextFiles not working with HDFS

2014-07-23 Thread kmader
That worked for me as well. I was using Spark 1.0 compiled against Hadoop 1.0 and switched to 1.0.1 compiled against Hadoop 2. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/wholeTextFiles-not-working-with-HDFS-tp7490p10547.html Sent from the Apache Spark User

Re: wholeTextFiles not working with HDFS

2014-07-23 Thread kmader
I have the same issue.
val a = sc.textFile("s3n://MyBucket/MyFolder/*.tif")
a.first works perfectly fine, but
val d = sc.wholeTextFiles("s3n://MyBucket/MyFolder/*.tif")
does not work; d.first gives the following error message: java.io.FileNotFoundExceptio

Re: wholeTextFiles like for binary files ?

2014-06-25 Thread Akhil Das
You cannot read image files with wholeTextFiles because it uses CombineFileInputFormat, which cannot read gzipped files because they are not splittable (source proving it): override def createRecordReader( split: InputSplit, contex
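
[For binary inputs such as images, a commonly suggested alternative is sc.binaryFiles, which returns one PortableDataStream per file instead of forcing the bytes through a text decoder; this assumes a Spark version that provides it (1.2+) and a hypothetical path.]

import org.apache.spark.{SparkConf, SparkContext}

object ReadBinaryFiles {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("read-binary-files"))

    // (path, PortableDataStream) pairs; the stream is opened lazily on the
    // executor, so the file bytes are never interpreted as text.
    val images = sc.binaryFiles("hdfs:///user/me/images/*.png")   // hypothetical path

    val sizes = images.map { case (path, stream) => (path, stream.toArray().length) }
    sizes.take(5).foreach { case (path, bytes) => println(s"$path: $bytes bytes") }
    sc.stop()
  }
}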

Re: wholeTextFiles not working with HDFS

2014-06-17 Thread Sguj
I can write one if you'll point me to where I need to write it. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/wholeTextFiles-not-working-with-HDFS-tp7490p7737.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: wholeTextFiles not working with HDFS

2014-06-17 Thread Xusen Yin
Hi Sguj and littlebird, I'll try to fix it tomorrow evening and the day after tomorrow, because I am now busy preparing a talk (slides) for tomorrow. Sorry for the inconvenience. Would you mind writing an issue on Spark JIRA? 2014-06-17 20:55 GMT+08:00 Sguj : > I didn't fix the issue so muc

Re: wholeTextFiles not working with HDFS

2014-06-17 Thread Sguj
I didn't fix the issue so much as work around it. I was running my cluster locally, so using HDFS was just a preference. The code worked with the local file system, so that's what I'm using until I can get some help. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabb

Re: wholeTextFiles not working with HDFS

2014-06-16 Thread littlebird
Hi, I have the same exception. Can you tell me how did you fix it? Thank you! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/wholeTextFiles-not-working-with-HDFS-tp7490p7665.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: wholeTextFiles not working with HDFS

2014-06-13 Thread Sguj
My exception stack looks about the same.
java.io.FileNotFoundException: File /user/me/target/capacity-scheduler.xml does not exist.
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:397)
    at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFil

Re: wholeTextFiles() : java.lang.IncompatibleClassChangeError: Found class org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected

2014-06-13 Thread visenger
Hi guys, I ran into the same exception (while trying the same example), and after overriding the hadoop-client artifact in my pom.xml, I got another error (below). System config: Ubuntu 12.04, IntelliJ IDEA 13, Scala 2.10.3, Maven: org.apache.spark spark-core_2.10 1.

Re: wholeTextFiles not working with HDFS

2014-06-12 Thread yinxusen
Hi Sguj, could you give me the exception stack? I tested it on my laptop and found that it gets the wrong FileSystem. It should be DistributedFileSystem, but it finds the RawLocalFileSystem. If we get the same exception stack, I'll try to fix it. Here is my exception stack: java.io.FileNotFoundEx
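
[When the symptom is paths resolving through RawLocalFileSystem instead of DistributedFileSystem, one workaround sometimes tried (separate from the thread's eventual fix of using a Hadoop 2 build) is to force an HDFS URI explicitly. A hedged sketch with a hypothetical namenode address:]

import org.apache.spark.{SparkConf, SparkContext}

object ExplicitHdfsUri {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("explicit-hdfs-uri"))

    // Hypothetical namenode address; prevents falling back to the local filesystem.
    sc.hadoopConfiguration.set("fs.defaultFS", "hdfs://namenode:8020")

    // Alternatively, fully qualify the path itself:
    val files = sc.wholeTextFiles("hdfs://namenode:8020/user/me/data/*.txt")

    println(s"files: ${files.count()}")
    sc.stop()
  }
}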

Re: wholeTextFiles() : java.lang.IncompatibleClassChangeError: Found class org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected

2014-06-03 Thread Matei Zaharia
Yeah, unfortunately Hadoop 2 requires these binaries on Windows; Hadoop 1 runs just fine without them. Matei On Jun 3, 2014, at 10:33 AM, Sean Owen wrote: > I'd try the internet / SO first -- these are actually generic > Hadoop-related issues. Here I think you don't have HADOOP_HOME or > simila

Re: wholeTextFiles() : java.lang.IncompatibleClassChangeError: Found class org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected

2014-06-03 Thread Sean Owen
I'd try the internet / SO first -- these are actually generic Hadoop-related issues. Here I think you don't have HADOOP_HOME or similar set. http://stackoverflow.com/questions/19620642/failed-to-locate-the-winutils-binary-in-the-hadoop-binary-path On Tue, Jun 3, 2014 at 5:54 PM, toivoa wrote: >
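
[A common local workaround on Windows, assuming winutils.exe has been placed under a hypothetical directory like C:\hadoop\bin, is to point hadoop.home.dir at that directory before the SparkContext is created; this is equivalent to setting HADOOP_HOME as suggested above.]

import org.apache.spark.{SparkConf, SparkContext}

object WindowsHadoopHome {
  def main(args: Array[String]): Unit = {
    // Hypothetical location of winutils.exe: C:\hadoop\bin\winutils.exe
    System.setProperty("hadoop.home.dir", "C:\\hadoop")

    val conf = new SparkConf().setAppName("windows-hadoop-home").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val lines = sc.textFile("data/sample.txt")   // hypothetical local file
    println(s"lines: ${lines.count()}")
    sc.stop()
  }
}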

Re: wholeTextFiles() : java.lang.IncompatibleClassChangeError: Found class org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected

2014-06-03 Thread toivoa
Wow! What a quick reply! Adding org.apache.hadoop : hadoop-client : 2.4.0 solved the problem. But now I get:
14/06/03 19:52:50 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could

Re: wholeTextFiles() : java.lang.IncompatibleClassChangeError: Found class org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected

2014-06-03 Thread Sean Owen
"Found class org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected" is the classic error meaning "you compiled against Hadoop 1, but are running against Hadoop 2" I think you need to override the hadoop-client artifact that Spark depends on to be a Hadoop 2.x version. On Tue,