closure issues: wholeTextFiles

2018-03-27 Thread Gourav Sengupta
Hi, I can understand facing closure issues while executing this code: package spark // this package is about understanding closures, as mentioned in http://spark.apache.org/docs/latest/rdd-programming-guide. …
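
The pitfall that guide describes is easy to demonstrate. A minimal PySpark sketch (names are illustrative, not from the thread): a driver-side variable mutated inside an action is only a serialized copy on each executor, so the driver never sees the updates; an accumulator is the documented fix.

    # Broken: `counter` is captured into the task closure, so each executor
    # mutates its own deserialized copy; the driver's value stays 0.
    counter = 0
    rdd = sc.parallelize(range(10))

    def increment(x):
        global counter
        counter += x

    rdd.foreach(increment)
    print(counter)  # prints 0 in cluster mode

    # Fix from the guide: accumulators are merged back to the driver.
    acc = sc.accumulator(0)
    rdd.foreach(lambda x: acc.add(x))
    print(acc.value)  # 45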

small job runs out of memory using wholeTextFiles

2017-04-07 Thread Paul Tremblay
As part of my processing, I have the following code: rdd = sc.wholeTextFiles("s3://paulhtremblay/noaa_tmp/", 10) rdd.count() The S3 directory has about 8GB of data and 61,878 files. I am using Spark 2.1, running it on EMR with 15 m3.xlarge nodes. The job fails with this error: …
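
For context, a hedged sketch against the thread's own path: every wholeTextFiles record holds an entire file's contents, so each task must fit all of its assigned files in memory at once. Raising minPartitions well above the 10 used here shrinks that per-task share; the 200 below is a guess, not a tested value.

    # More partitions -> fewer whole files held in memory per task.
    rdd = sc.wholeTextFiles("s3://paulhtremblay/noaa_tmp/", minPartitions=200)
    rdd.count()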

Re: wholeTextfiles not parallel, runs out of memory

2017-02-14 Thread Jörn Franke
Well, 1) the point of wholeTextFiles is to read each whole file with a single executor, and 2) you use .gz, i.e. you will have at most one executor per file. On 14 Feb 2017, at 09:36, Henry Tremblay wrote: "When I use wholeTextFiles, Spark does not run in parallel, and YARN runs out of memory …"
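
A sketch of the consequence, with a hypothetical path: the read stage gets at most one task per .gz file because gzip is not splittable, so only a repartition after the read lets later stages fan out; nothing can parallelize the decompression of a single file.

    # Read stage: at most one task per .gz file.
    pairs = sc.wholeTextFiles("hdfs:///logs/*.gz")  # hypothetical path
    # Downstream stages can still run wide once the records are reshuffled.
    words = pairs.repartition(64).flatMap(lambda kv: kv[1].split())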

Re: wholeTextFiles fails, but textFile succeeds for same path

2017-02-11 Thread Henry Tremblay
51,000 files at about 1/2 MB per file. I am wondering if I need this: http://docs.aws.amazon.com/emr/latest/ReleaseGuide/UsingEMR_s3distcp.html Although, if I am understanding you correctly, even if I copy the S3 files to HDFS on EMR and use wholeTextFiles, I am still only going to be able to …

Re: wholeTextFiles fails, but textFile succeeds for same path

2017-02-11 Thread Jörn Franke
… file, and then writing it to HDFS, and then using wholeTextFiles for the HDFS result. But the bigger issue is that both methods are not executed in parallel. When I open my YARN manager, it shows that only one node is being used. Henry. On 02/06/2017 03:39 …

Re: wholeTextFiles fails, but textFile succeeds for same path

2017-02-11 Thread Paul Tremblay
… to the HDFS, and then using wholeTextFiles for the HDFS result. But the bigger issue is that both methods are not executed in parallel. When I open my YARN manager, it shows that only one node is being used. Henry. On 02/06/2017 03:39 PM, Jon Gregg wrote: "Strange that it's working …"

Re: wholeTextFiles fails, but textFile succeeds for same path

2017-02-06 Thread Jon Gregg
Strange that it's working for some directories but not others. Looks like wholeTextFiles maybe doesn't work with S3? https://issues.apache.org/jira/browse/SPARK-4414 If it's possible to load the data into EMR and run Spark from there, that may be a workaround. This blogspot …

Re: wholeTextFiles fails, but textFile succeeds for same path

2017-02-06 Thread Paul Tremblay
I've actually been able to trace the problem to the files being read in. If I change to a different directory, then I don't get the error. Is one of the executors running out of memory? On 02/06/2017 02:35 PM, Paul Tremblay wrote: "When I try to create an RDD using wholeTextFiles …"

wholeTextFiles fails, but textFile succeeds for same path

2017-02-06 Thread Paul Tremblay
When I try to create an RDD using wholeTextFiles, I get an incomprehensible error. But when I use the same path with sc.textFile, I get no error. I am using PySpark with Spark 2.1. in_path = 's3://commoncrawl/crawl-data/CC-MAIN-2016-50/segments/1480698542939.6/warc/' rdd = sc.wholeText…
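
For contrast, a sketch of both calls on a hypothetical prefix (the thread's own code is truncated above): textFile yields one small record per line, while wholeTextFiles yields one (path, entire contents) pair per file, a far heavier record with a different failure profile.

    lines = sc.textFile("s3://bucket/prefix/")        # RDD of lines (str)
    pairs = sc.wholeTextFiles("s3://bucket/prefix/")  # RDD of (path, whole file)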

Handling Input Error in wholeTextFiles

2017-01-11 Thread khwunchai jaengsawang
Hi all, I have a requirement to process multiple splittable gzip files, and the results need to include each individual file name. I ran into a problem when loading multiple gzip files using the wholeTextFiles method: some files are corrupted, causing an 'unexpected end of input stream' error …
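
One way to skip the corrupted archives while keeping each file name, sketched with binaryFiles and Python's gzip module; the path and the exact exception set are assumptions, not from the thread.

    import gzip
    import zlib

    def safe_gunzip(pair):
        path, data = pair  # PySpark's binaryFiles yields (path, bytes)
        try:
            return [(path, gzip.decompress(data).decode("utf-8"))]
        except (OSError, EOFError, zlib.error, UnicodeDecodeError):
            return []  # drop files with truncated or corrupt gzip streams

    rdd = sc.binaryFiles("hdfs:///data/*.gz").flatMap(safe_gunzip)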

Re: wholeTextFiles()

2016-12-12 Thread Jakob Odersky
Also, in case the issue was not due to the string length (though that limit is still real and may bite you later), it may be due to some other indexing issues which are currently being worked on here: https://issues.apache.org/jira/browse/SPARK-6235 On Mon, Dec 12, 2016 at 8:18 PM, Jakob Odersky wrote: …

Re: wholeTextFiles()

2016-12-12 Thread Jakob Odersky
Hi Pradeep, I'm afraid you're running into a hard Java issue. Strings are indexed with signed integers and can therefore not be longer than approximately 2 billion characters. Could you use `textFile` as a workaround? It will give you an RDD of the files' lines instead. In general, this guide …
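
A sketch of that workaround, under a hypothetical path: textFile materializes one String per line, so no single Java String ever has to hold the whole 2 GB file.

    lines = sc.textFile("hdfs:///big/two_gb_file.txt")  # hypothetical path
    lines.count()  # each record is one line, far below the String limit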

wholeTextFiles()

2016-12-12 Thread Pradeep
Hi, why is there a restriction on the maximum file size that can be read by the wholeTextFiles() method? I can read a 1.5 GB file but get an out-of-memory error for a 2 GB file. Also, how can I raise this as a defect in the Spark JIRA? Can someone please guide me. Thanks, Pradeep

Re: How does wholeTextFiles() work in Spark-Hadoop Cluster?

2016-09-21 Thread Nisha Menon
… would not work in cluster. I don't even see the print statements in `map` and `foreach` getting printed in cluster mode of execution. I notice a particular line in standalone output that I do NOT see in cluster exec…

Re: How does wholeTextFiles() work in Spark-Hadoop Cluster?

2016-09-21 Thread ayan guha
… do NOT see in cluster execution: 16/09/07 17:35:35 INFO WholeTextFileRDD: Input split: Paths:/user/cdhuser/inputFolder/data1.txt:0+657345,/user/cdhuser/inputFolder/data10.txt:0+657345,/user/cdhuser/inputFolder/data2.txt:0+657345,/user/cdhuser/inpu…

How does wholeTextFiles() work in Spark-Hadoop Cluster?

2016-09-21 Thread Nisha Menon
…ser/inputFolder/data10.txt:0+657345,/user/cdhuser/inputFolder/data2.txt:0+657345,/user/cdhuser/inputFolder/data3.txt:0+657345,/user/cdhuser/inputFolder/data4.txt:0+657345,/user/cdhuser/inputFolder/data5.txt:0+657345,/user/cdhuser/inputFolder/data6.txt:0+657345,/user/cdhuser/inputFolder/data7.txt:0+657…

Re: How does wholeTextFiles() work in Spark-Hadoop Cluster?

2016-09-08 Thread Sonal Goyal
…/user/cdhuser/inputFolder/data5.txt:0+657345,/user/cdhuser/inputFolder/data6.txt:0+657345,/user/cdhuser/inputFolder/data7.txt:0+657345,/user/cdhuser/inputFolder/data8.txt:0+657345,/user/cdhuser/inputFolder/data9.txt:0+657345 I had similar code with textFile() that worked earlier for …

How does wholeTextFiles() work in Spark-Hadoop Cluster?

2016-09-07 Thread Nisha Menon
…+657345,/user/cdhuser/inputFolder/data6.txt:0+657345,/user/cdhuser/inputFolder/data7.txt:0+657345,/user/cdhuser/inputFolder/data8.txt:0+657345,/user/cdhuser/inputFolder/data9.txt:0+657345 I had similar code with textFile() that worked earlier for individual files on the cluster. The issue is with wholeTextFiles …

Re: Issue with wholeTextFiles

2016-03-21 Thread Akhil Das
… file from HDFS and all its content has to be read in one shot. So I'm using the Spark context's wholeTextFiles API, passing the HDFS URL for the file. When I try this from a spark-shell it works as mentioned in the documentation, but when I try the …

Issue with wholeTextFiles

2016-03-21 Thread Sarath Chandra
I'm using Hadoop 1.0.4 and Spark 1.2.0, and I'm facing a strange issue. I have a requirement to read a small file from HDFS, and all its content has to be read in one shot. So I'm using the Spark context's wholeTextFiles API, passing the HDFS URL for the file. When I try this …

Re: wholeTextFiles("/x/*/*.txt") runs single threaded

2015-07-02 Thread Kostas Kougios
In the Spark UI I can see it creating 2 stages. I tried wholeTextFiles().repartition(32), but got the same threading results.
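
For what it's worth, repartition() only reshuffles after the read stage has already run; the read parallelism is fixed when the input splits are computed. A hedged sketch of the other knob, using the glob from the subject line:

    # minPartitions is a hint applied at split-computation time, unlike
    # repartition(), which cannot change the read stage's task count.
    rdd = sc.wholeTextFiles("/x/*/*.txt", minPartitions=32)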

wholeTextFiles("/x/*/*.txt") runs single threaded

2015-07-02 Thread Kostas Kougios

wholeTextFiles on 20 nodes

2014-11-23 Thread Simon Hafner
I have 20 nodes via EC2 and an application that reads the data via wholeTextFiles. I've tried to copy the data into Hadoop via copyFromLocal, and I get: 14/11/24 02:00:07 INFO hdfs.DFSClient: Exception in createBlockOutputStream 172.31.2.209:50010 java.io.IOException: Bad connect ack …

Re: wholeTextFiles not working with HDFS

2014-08-22 Thread pierred
I forgot to say, I am using bin/spark-shell, spark-1.0.2. That host has Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_11).

Re: wholeTextFiles not working with HDFS

2014-08-22 Thread pierred
I had the same issue with spark-1.0.2-bin-hadoop1, and indeed the issue seems related to Hadoop 1. When switching to spark-1.0.2-bin-hadoop2, the issue disappears.

Re: wholeTextFiles not working with HDFS

2014-07-23 Thread kmader
That worked for me as well. I was using Spark 1.0 compiled against Hadoop 1.0; switching to 1.0.1 compiled against Hadoop 2 fixed it.

Re: wholeTextFiles not working with HDFS

2014-07-23 Thread kmader
…following error message: java.io.FileNotFoundException: File /MyBucket/MyFolder.tif does not exist.

Re: wholeTextFiles like for binary files ?

2014-06-25 Thread Akhil Das
You cannot read image files with wholeTextFiles because it uses CombineFileInputFormat, which cannot read gzipped files because they are not splittable (source proving it: http://www.bigdataspeak.com/2013_01_01_archive.html): override def createRecordReader(split: InputSplit, …

wholeTextFiles like for binary files ?

2014-06-25 Thread Jaonary Rabarisoa
Is there an equivalent of wholeTextFiles for binary files, for example a set of images? Cheers, Jaonary
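
Spark 1.2, released after this thread, added exactly this: sc.binaryFiles, the wholeTextFiles analogue for binary data. A minimal sketch with a hypothetical path:

    images = sc.binaryFiles("hdfs:///images/")  # RDD of (path, raw bytes)
    path, data = images.first()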

wholeTextFiles and gzip

2014-06-25 Thread Nick Chammas
Interesting question on Stack Overflow: http://stackoverflow.com/questions/24402737/how-to-read-gz-files-in-spark-using-wholetextfiles Is it possible to read gzipped files using wholeTextFiles()? Alternatively, is it possible to read the source file names using textFile()?
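
On the second question, later Spark versions added input_file_name() in the DataFrame API, which tags each line with its source file even for .gz inputs. A sketch, assuming a SparkSession named spark and a hypothetical path:

    from pyspark.sql.functions import input_file_name

    df = (spark.read.text("s3://bucket/logs/*.gz")
               .withColumn("source_file", input_file_name()))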

Re: wholeTextFiles not working with HDFS

2014-06-17 Thread Sguj
I can write one if you'll point me to where I need to write it.

Re: wholeTextFiles not working with HDFS

2014-06-17 Thread Xusen Yin

Re: wholeTextFiles not working with HDFS

2014-06-17 Thread Sguj

Re: wholeTextFiles not working with HDFS

2014-06-16 Thread littlebird
Hi, I have the same exception. Can you tell me how you fixed it? Thank you!

Re: wholeTextFiles not working with HDFS

2014-06-13 Thread Sguj
… and everything else I've tried in Spark with that version has worked, so I doubt it's a version error.

Re: wholeTextFiles() : java.lang.IncompatibleClassChangeError: Found class org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected

2014-06-13 Thread visenger

Re: wholeTextFiles not working with HDFS

2014-06-12 Thread yinxusen

wholeTextFiles not working with HDFS

2014-06-12 Thread Sguj
I'm trying to get a list of every filename in a directory from HDFS using PySpark, and the only thing that seems like it would return the filenames is the wholeTextFiles function. My code for just trying to collect that data is this: files = sc.wholeTextFiles("hdfs://localhost:…
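
Assuming the goal is just the name list, a sketch with a hypothetical full path (the original is truncated): the key of each wholeTextFiles pair is the file path, so keys() yields the names, though the file bodies are still read before being discarded.

    files = sc.wholeTextFiles("hdfs://localhost:9000/mydir")  # hypothetical path
    names = files.keys().collect()  # paths only; contents are read, then dropped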

Re: wholeTextFiles() : java.lang.IncompatibleClassChangeError: Found class org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected

2014-06-03 Thread Matei Zaharia
…UserGroupInformation.setConfiguration(UserGroupInformation.java:283) at org.apache.spark.deploy.SparkHadoopUtil.<init>(SparkHadoopUtil.scala:36) at org.apache.spark.deploy.SparkHadoopUtil$.<init>(SparkHadoopUtil.scala:109) at org.apache.spark.deploy.S…

Re: wholeTextFiles() : java.lang.IncompatibleClassChangeError: Found class org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected

2014-06-03 Thread Sean Owen

Re: wholeTextFiles() : java.lang.IncompatibleClassChangeError: Found class org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected

2014-06-03 Thread toivoa
…at org.apache.spark.deploy.SparkHadoopUtil$.<init>(SparkHadoopUtil.scala:109) at org.apache.spark.deploy.SparkHadoopUtil$.<clinit>(SparkHadoopUtil.scala) Thanks, toivo

wholeTextFiles() : java.lang.IncompatibleClassChangeError: Found class org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected

2014-06-03 Thread toivoa
…expected at org.apache.spark.input.WholeTextFileRecordReader.<init>(WholeTextFileRecordReader.scala:40) ... 18 more Any idea? Thanks, toivo

Re: wholeTextFiles() : java.lang.IncompatibleClassChangeError: Found class org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected

2014-06-03 Thread Sean Owen
… but interface was expected at org.apache.spark.input.WholeTextFileRecordReader.<init>(WholeTextFileRecordReader.scala:40) ... 18 more …