To clarify one thing, is count() the first "action" (
http://spark.apache.org/docs/latest/programming-guide.html#actions) you're
attempting? As defined in the programming guide, an action forces
evaluation of the pipeline of RDDs. It's only then that reading the data
actually occurs. So count() itself might not be the problem; the failure may
really come from an upstream step that attempts to read the file.
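
For example, in a rough sketch along the lines of your snippet (same
file.txt, convertToBER, and BERecord names), nothing touches the file until
the action at the end:

        // No I/O yet: textFile and map only record the lineage.
        JavaRDD<String> lines = sc.textFile("file.txt");
        JavaRDD<BERecord> tbBER = lines.map(s -> convertToBER(s));

        // The read, the map, and any conversion errors all surface here,
        // at the first action on the pipeline.
        long n = tbBER.count();

So an exception reported at count() may really originate in textFile() or in
convertToBER().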

As a sanity check, if you just read the text file, skip the conversion to
BERecord, and then call count(), does that work? If it does, it might be
something about your JavaBean BERecord after all. Can you post its definition?
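
Something like this minimal sketch (same path as in your snippet) should tell
you whether the raw read works:

        JavaRDD<String> lines = sc.textFile("file.txt");
        System.out.println("Raw line count: " + lines.count());

If that prints 42577, the read itself is fine and the problem is more likely
in the map to BERecord.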

Calling take(1) to grab the first element should also work, even if the
RDD is empty. (In that case it returns an empty list rather than throwing
an exception.)
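
Roughly, assuming tbBER is the RDD from your snippet and java.util.List is
imported:

        // take(1) returns a local java.util.List, empty if the RDD has no
        // elements.
        List<BERecord> first = tbBER.take(1);
        System.out.println(first.isEmpty()
                ? "RDD is empty"
                : "First record: " + first.get(0));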

dean

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
<http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
Typesafe <http://typesafe.com>
@deanwampler <http://twitter.com/deanwampler>
http://polyglotprogramming.com

On Thu, Apr 2, 2015 at 10:16 AM, Ashley Rose <ashley.r...@telarix.com>
wrote:

>  That’s precisely what I was trying to check. It should have 42577
> records in it, because that’s how many there were in the text file I read
> in.
>
>
>
>         // Load a text file and convert each line to a JavaBean.
>         JavaRDD<String> lines = sc.textFile("file.txt");
>         JavaRDD<BERecord> tbBER = lines.map(s -> convertToBER(s));
>
>         // Apply a schema to an RDD of JavaBeans and register it as a table.
>         schemaBERecords = sqlContext.createDataFrame(tbBER, BERecord.class);
>         schemaBERecords.registerTempTable("tbBER");
>
>
>
> The BERecord class is a standard Java Bean that implements Serializable,
> so that shouldn’t be the issue. As you said, count() shouldn’t fail like
> this even if the table were empty. I was able to print the schema of the
> DataFrame just fine with df.printSchema(); I just wanted to check whether
> the data was populated correctly.
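>
> For what it’s worth, it follows the usual bean shape: Serializable, a
> public no-arg constructor, and public getters/setters (the field name
> below is illustrative only, not the real class):
>
>         public class BERecord implements java.io.Serializable {
>             private String someField;   // illustrative field, not the actual schema
>
>             public BERecord() {}        // no-arg constructor required for the bean
>
>             public String getSomeField() { return someField; }
>             public void setSomeField(String someField) { this.someField = someField; }
>         }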
>
>
>
> *From:* Dean Wampler [mailto:deanwamp...@gmail.com]
> *Sent:* Wednesday, April 01, 2015 6:05 PM
> *To:* Ashley Rose
> *Cc:* user@spark.apache.org
> *Subject:* Re: Spark 1.3.0 DataFrame count() method throwing
> java.io.EOFException
>
>
>
> Is it possible "tbBER" is empty? If so, it shouldn't fail like this, of
> course.
>
>
>   Dean Wampler, Ph.D.
>
> Author: Programming Scala, 2nd Edition
> <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
>
> Typesafe <http://typesafe.com>
> @deanwampler <http://twitter.com/deanwampler>
>
> http://polyglotprogramming.com
>
>
>
> On Wed, Apr 1, 2015 at 5:57 PM, ARose <ashley.r...@telarix.com> wrote:
>
> Note: I am running Spark on Windows 7 in standalone mode.
>
> In my app, I run the following:
>
>         DataFrame df = sqlContext.sql("SELECT * FROM tbBER");
>         System.out.println("Count: " + df.count());
>
> tbBER is registered as a temp table in my SQLContext. When I try to print
> the number of rows in the DataFrame, the job fails and I get the following
> error message:
>
>         java.io.EOFException
>         at java.io.ObjectInputStream$BlockDataInputStream.readFully(ObjectInputStream.java:2747)
>         at java.io.ObjectInputStream.readFully(ObjectInputStream.java:1033)
>         at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
>         at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
>         at org.apache.hadoop.io.UTF8.readChars(UTF8.java:216)
>         at org.apache.hadoop.io.UTF8.readString(UTF8.java:208)
>         at org.apache.hadoop.mapred.FileSplit.readFields(FileSplit.java:87)
>         at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:237)
>         at org.apache.hadoop.io.ObjectWritable.readFields(ObjectWritable.java:66)
>         at org.apache.spark.SerializableWritable$$anonfun$readObject$1.apply$mcV$sp(SerializableWritable.scala:43)
>         at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1137)
>         at org.apache.spark.SerializableWritable.readObject(SerializableWritable.scala:39)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:483)
>         at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>         at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1896)
>         at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>         at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>         at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
>         at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
>         at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>         at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>         at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
>         at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
>         at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>         at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>         at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
>         at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:68)
>         at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:94)
>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:185)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
>
> This only happens when I try to call df.count(). The rest runs fine. Is the
> count() function not supported in standalone mode? The stack trace makes it
> appear to be Hadoop functionality...
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-3-0-DataFrame-count-method-throwing-java-io-EOFException-tp22344.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
>
