I'm observing one anomalous behavior. With the 1.0.0 libraries, the script
uses the Hadoop/HDFS classes for file I/O, while the same script compiled
and run with 0.9.1 uses only local-mode file I/O.

The script is a variation of the Word Count script. Here are the "guts":

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._  // implicit conversions for reduceByKey

object WordCount2 {
  def main(args: Array[String]): Unit = {

    val sc = new SparkContext("local", "Word Count (2)")

    // Read a local file and normalize each line to lower case.
    val input = sc.textFile(".../some/local/file").map(line =>
      line.toLowerCase)
    input.cache()

    // Classic word count: split on non-word characters, emit (word, 1)
    // pairs, and sum the counts per word.
    val wc2 = input
      .flatMap(line => line.split("""\W+"""))
      .map(word => (word, 1))
      .reduceByKey((count1, count2) => count1 + count2)

    wc2.saveAsTextFile("output/some/directory")

    sc.stop()
  }
}

It works fine when compiled and executed with 0.9.1. If I recompile and run
it with 1.0.0-RC1 while the same output directory still exists, I get this
familiar Hadoop-ish exception:

[error] (run-main-0) org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory file:/Users/deanwampler/projects/typesafe/activator/activator-spark/output/kjv-wc already exists
org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory file:/Users/deanwampler/projects/typesafe/activator/activator-spark/output/kjv-wc already exists
  at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:121)
  at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:749)
  at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:662)
  at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:581)
  at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1057)
  at spark.activator.WordCount2$.main(WordCount2.scala:42)
  at spark.activator.WordCount2.main(WordCount2.scala)
...
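
For now I can work around it by deleting the output directory before the
save. A rough sketch (untested against the RC; it assumes Hadoop's
FileSystem API, which Spark already pulls in transitively):

import org.apache.hadoop.fs.{FileSystem, Path}

// Remove the stale output directory, if any, before calling saveAsTextFile.
val outputDir = new Path("output/some/directory")
val fs = FileSystem.get(sc.hadoopConfiguration)
if (fs.exists(outputDir)) fs.delete(outputDir, true)  // true = recursive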

Thoughts?


On Tue, Apr 29, 2014 at 3:05 AM, Patrick Wendell <pwend...@gmail.com> wrote:

> Hey All,
>
> This is not an official vote, but I wanted to cut an RC so that people can
> test against the Maven artifacts, test building with their configuration,
> etc. We are still chasing down a few issues and updating docs, etc.
>
> If you have issues or bug reports for this release, please send an e-mail
> to the Spark dev list and/or file a JIRA.
>
> Commit: d636772 (v1.0.0-rc3)
>
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=d636772ea9f98e449a038567b7975b1a07de3221
>
> Binaries:
> http://people.apache.org/~pwendell/spark-1.0.0-rc3/
>
> Docs:
> http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/
>
> Repository:
> https://repository.apache.org/content/repositories/orgapachespark-1012/
>
> == API Changes ==
> If you want to test building against Spark, there are some minor API
> changes. We'll get these written up for the final release, but I'm noting
> a few here (not comprehensive):
>
> changes to ML vector specification:
>
> http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/mllib-guide.html#from-09-to-10
>
> changes to the Java API:
>
> http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark
>
> coGroup and related functions now return Iterable[T] instead of Seq[T]
> ==> Call toSeq on the result to restore the old behavior
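>
> For example (a sketch; rdd1 and rdd2 are hypothetical pair RDDs of types
> RDD[(K, V)] and RDD[(K, W)]):
>
>   val grouped = rdd1.cogroup(rdd2)  // (K, (Iterable[V], Iterable[W])) in 1.0
>   val oldStyle = grouped.mapValues { case (vs, ws) => (vs.toSeq, ws.toSeq) }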
>
> SparkContext.jarOfClass returns Option[String] instead of Seq[String]
> ==> Call toSeq on the result to restore old behavior
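>
> For example (sketch):
>
>   val jars: Seq[String] = SparkContext.jarOfClass(this.getClass).toSeq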
>
> Streaming classes have been renamed:
> NetworkReceiver -> Receiver
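>
> A minimal skeleton of the renamed API (a sketch; MyReceiver is a
> hypothetical custom receiver):
>
>   import org.apache.spark.storage.StorageLevel
>   import org.apache.spark.streaming.receiver.Receiver
>
>   class MyReceiver extends Receiver[String](StorageLevel.MEMORY_ONLY) {
>     def onStart(): Unit = { /* start a thread that calls store(...) */ }
>     def onStop(): Unit = { /* stop that thread */ }
>   }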
>



-- 
Dean Wampler, Ph.D.
Typesafe
@deanwampler
http://typesafe.com
http://polyglotprogramming.com
