I'm observing one anomalous behavior. With the 1.0.0 libraries, my script uses HDFS classes for file I/O, while the same script compiled and run with 0.9.1 uses only local-mode file I/O.
The script is a variation of the Word Count script. Here are the "guts":

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._  // implicits needed for reduceByKey on pair RDDs

    object WordCount2 {
      def main(args: Array[String]) = {
        val sc = new SparkContext("local", "Word Count (2)")
        // Read a local file and normalize case.
        val input = sc.textFile(".../some/local/file").map(line => line.toLowerCase)
        input.cache

        // Split on non-word characters, then count occurrences of each word.
        val wc2 = input
          .flatMap(line => line.split("""\W+"""))
          .map(word => (word, 1))
          .reduceByKey((count1, count2) => count1 + count2)

        wc2.saveAsTextFile("output/some/directory")
        sc.stop()
      }
    }

It works fine compiled and executed with 0.9.1. If I recompile and run with 1.0.0-RC1, where the same output directory still exists, I get this familiar Hadoop-ish exception:

    [error] (run-main-0) org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory file:/Users/deanwampler/projects/typesafe/activator/activator-spark/output/kjv-wc already exists
    org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory file:/Users/deanwampler/projects/typesafe/activator/activator-spark/output/kjv-wc already exists
        at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:121)
        at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:749)
        at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:662)
        at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:581)
        at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1057)
        at spark.activator.WordCount2$.main(WordCount2.scala:42)
        at spark.activator.WordCount2.main(WordCount2.scala)
        ...

Thoughts?

On Tue, Apr 29, 2014 at 3:05 AM, Patrick Wendell <pwend...@gmail.com> wrote:

> Hey All,
>
> This is not an official vote, but I wanted to cut an RC so that people can
> test against the Maven artifacts, test building with their configuration,
> etc. We are still chasing down a few issues and updating docs, etc.
>
> If you have issues or bug reports for this release, please send an e-mail
> to the Spark dev list and/or file a JIRA.
>
> Commit: d636772 (v1.0.0-rc3)
>
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=d636772ea9f98e449a038567b7975b1a07de3221
>
> Binaries:
> http://people.apache.org/~pwendell/spark-1.0.0-rc3/
>
> Docs:
> http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/
>
> Repository:
> https://repository.apache.org/content/repositories/orgapachespark-1012/
>
> == API Changes ==
> If you want to test building against Spark there are some minor API
> changes. We'll get these written up for the final release but I'm noting a
> few here (not comprehensive):
>
> changes to ML vector specification:
>
> http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/mllib-guide.html#from-09-to-10
>
> changes to the Java API:
>
> http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark
>
> coGroup and related functions now return Iterable[T] instead of Seq[T]
> ==> Call toSeq on the result to restore the old behavior
>
> SparkContext.jarOfClass returns Option[String] instead of Seq[String]
> ==> Call toSeq on the result to restore old behavior
>
> Streaming classes have been renamed:
> NetworkReceiver -> Receiver
>

-- 
Dean Wampler, Ph.D.
Typesafe
@deanwampler
http://typesafe.com
http://polyglotprogramming.com
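
P.S. For anyone who hits the same FileAlreadyExistsException, one possible workaround is to remove the stale output directory before saving. This is only a minimal sketch: it assumes the output lives on the local filesystem and is safe to delete. The names OutputDirCleanup and deleteRecursively are hypothetical helpers, not part of Spark, and the code uses only java.io.File.

```scala
import java.io.File

// Hypothetical helper (not part of Spark): clears a local output directory
// so that a later save does not find a pre-existing directory.
object OutputDirCleanup {

  /** Recursively deletes `file`. Returns true if nothing exists at that path afterwards. */
  def deleteRecursively(file: File): Boolean = {
    if (file.isDirectory) {
      // listFiles returns null on an I/O error, so guard against that.
      Option(file.listFiles).getOrElse(Array.empty[File]).foreach(deleteRecursively)
    }
    file.delete()
    !file.exists
  }
}
```

In the script above, this would be called as OutputDirCleanup.deleteRecursively(new File("output/some/directory")) just before wc2.saveAsTextFile("output/some/directory").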