To clarify one thing: is count() the first "action" (http://spark.apache.org/docs/latest/programming-guide.html#actions) you're attempting? As the programming guide explains, an action forces evaluation of the pipeline of RDDs; only then is the data actually read. So count() itself might not be the problem. The failure could come from an upstream step that reads the file.

As a sanity check, if you just read the text file and call count() without converting the strings, does that work? If so, it might be something about your JavaBean BERecord after all. Can you post its definition? Calling take(1) to grab the first element should also work, even if the RDD is empty. (In that case it returns an empty list rather than throwing an exception.)
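Something like this untested sketch, assuming your existing SparkContext sc and the same file:

    JavaRDD<String> lines = sc.textFile("file.txt");
    // If this count() works, reading the file is fine and the problem
    // is probably in the conversion to BERecord.
    System.out.println("Lines: " + lines.count());
    // take(1) returns a java.util.List; it is empty if the RDD is empty.
    System.out.println("First: " + lines.take(1));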
dean

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
Typesafe <http://typesafe.com>
@deanwampler <http://twitter.com/deanwampler>
http://polyglotprogramming.com

On Thu, Apr 2, 2015 at 10:16 AM, Ashley Rose <ashley.r...@telarix.com> wrote:

> That's precisely what I was trying to check. It should have 42577 records
> in it, because that's how many there were in the text file I read in.
>
>     // Load a text file and convert each line to a JavaBean.
>     JavaRDD<String> lines = sc.textFile("file.txt");
>     JavaRDD<BERecord> tbBER = lines.map(s -> convertToBER(s));
>
>     // Apply a schema to an RDD of JavaBeans and register it as a table.
>     schemaBERecords = sqlContext.createDataFrame(tbBER, BERecord.class);
>     schemaBERecords.registerTempTable("tbBER");
>
> The BERecord class is a standard JavaBean that implements Serializable,
> so that shouldn't be the issue. As you said, count() shouldn't fail like
> this even if the table were empty. I was able to print the schema of the
> DataFrame just fine with df.printSchema(); I just wanted to see whether
> the data was populated correctly.
>
> From: Dean Wampler [mailto:deanwamp...@gmail.com]
> Sent: Wednesday, April 01, 2015 6:05 PM
> To: Ashley Rose
> Cc: user@spark.apache.org
> Subject: Re: Spark 1.3.0 DataFrame count() method throwing java.io.EOFException
>
> Is it possible "tbBER" is empty? If so, it shouldn't fail like this, of
> course.
>
> Dean Wampler, Ph.D.
> Author: Programming Scala, 2nd Edition <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
> Typesafe <http://typesafe.com>
> @deanwampler <http://twitter.com/deanwampler>
> http://polyglotprogramming.com
>
> On Wed, Apr 1, 2015 at 5:57 PM, ARose <ashley.r...@telarix.com> wrote:
>
> Note: I am running Spark on Windows 7 in standalone mode.
>
> In my app, I run the following:
>
>     DataFrame df = sqlContext.sql("SELECT * FROM tbBER");
>     System.out.println("Count: " + df.count());
>
> tbBER is registered as a temp table in my SQLContext.
> When I try to print the number of rows in the DataFrame, the job fails
> and I get the following error message:
>
> java.io.EOFException
>     at java.io.ObjectInputStream$BlockDataInputStream.readFully(ObjectInputStream.java:2747)
>     at java.io.ObjectInputStream.readFully(ObjectInputStream.java:1033)
>     at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
>     at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
>     at org.apache.hadoop.io.UTF8.readChars(UTF8.java:216)
>     at org.apache.hadoop.io.UTF8.readString(UTF8.java:208)
>     at org.apache.hadoop.mapred.FileSplit.readFields(FileSplit.java:87)
>     at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:237)
>     at org.apache.hadoop.io.ObjectWritable.readFields(ObjectWritable.java:66)
>     at org.apache.spark.SerializableWritable$$anonfun$readObject$1.apply$mcV$sp(SerializableWritable.scala:43)
>     at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1137)
>     at org.apache.spark.SerializableWritable.readObject(SerializableWritable.scala:39)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:483)
>     at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>     at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1896)
>     at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>     at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>     at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
>     at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
>     at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>     at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>     at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
>     at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
>     at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>     at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>     at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
>     at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:68)
>     at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:94)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:185)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
>
> This only happens when I try to call df.count(). The rest runs fine. Is
> the count() function not supported in standalone mode? The stack trace
> makes it appear to be Hadoop functionality...
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-3-0-DataFrame-count-method-throwing-java-io-EOFException-tp22344.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
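For anyone who lands on this thread: below is an untested, self-contained sketch of the flow quoted above that runs entirely in local mode. The class names CountRepro and MinimalRecord are hypothetical; MinimalRecord stands in for the unposted BERecord and shows the shape createDataFrame(rdd, Class) expects (public bean, no-arg constructor, a getter/setter per column, Serializable). It uses in-memory data rather than textFile(), so if it runs, count() itself is fine and the EOFException appears to come from the file-split deserialization in the failing job, as the FileSplit.readFields frame in the trace suggests.

    import java.io.Serializable;
    import java.util.Arrays;

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.DataFrame;
    import org.apache.spark.sql.SQLContext;

    public class CountRepro {
        // The bean shape createDataFrame(rdd, Class) expects: public class,
        // no-arg constructor, getter/setter per column, and Serializable.
        public static class MinimalRecord implements Serializable {
            private String name;
            public MinimalRecord() {}
            public String getName() { return name; }
            public void setName(String name) { this.name = name; }
        }

        public static void main(String[] args) {
            JavaSparkContext sc = new JavaSparkContext("local[*]", "CountRepro");
            SQLContext sqlContext = new SQLContext(sc);

            // In-memory data instead of textFile(), so no Hadoop input split
            // is involved; this isolates count() from the file read.
            JavaRDD<String> lines = sc.parallelize(Arrays.asList("a", "b", "c"));
            JavaRDD<MinimalRecord> records = lines.map(s -> {
                MinimalRecord r = new MinimalRecord();
                r.setName(s);
                return r;
            });

            DataFrame df = sqlContext.createDataFrame(records, MinimalRecord.class);
            df.registerTempTable("tbMinimal");
            System.out.println("Count: " + sqlContext.sql("SELECT * FROM tbMinimal").count());

            sc.stop();
        }
    }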