Very coincidentally, I ran into something equally puzzling yesterday: a value was bizarrely null when it couldn't have been, in a Spark program that extends App. I also changed it to use main() and it works fine. So there is definitely some issue here. If nobody files a JIRA before I get home, I'll do it. On Oct 29, 2014 11:20 PM, "Michael Albert" <m_albert...@yahoo.com.invalid> wrote:
> Greetings!
>
> This might be a documentation issue as opposed to a coding issue, in that
> perhaps the correct answer is "don't do that", but as this is not obvious,
> I am writing.
>
> The following code produces output most would not expect:
>
> package misc
>
> import org.apache.spark.SparkConf
> import org.apache.spark.SparkContext
> import org.apache.spark.SparkContext._
>
> object DemoBug extends App {
>   val conf = new SparkConf()
>   val sc = new SparkContext(conf)
>
>   val rdd = sc.parallelize(List("A","B","C","D"))
>   val str1 = "A"
>
>   val rslt1 = rdd.filter(x => { x != "A" }).count
>   val rslt2 = rdd.filter(x => { str1 != null && x != "A" }).count
>
>   println("DemoBug: rslt1 = " + rslt1 + " rslt2 = " + rslt2)
> }
>
> This produces the output:
> DemoBug: rslt1 = 3 rslt2 = 0
>
> Compiled with sbt:
> libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "1.1.0"
>
> Run on an EC2 EMR instance with a recent image (hadoop 2.4.0, spark 1.1.0).
>
> If instead there is a proper "main()", it works as expected.
>
> Thank you.
>
> Sincerely,
> Mike
>
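
For completeness, here is a sketch of the workaround Mike describes: the same program with an explicit main() instead of extends App. Object and variable names are kept from the original report; the trailing sc.stop() is my addition. The symptom looks consistent with the App trait's delayedInit deferring field initialization, so str1 can still be null when the serialized closure runs on the executors; with a plain main() the values are ordinary locals, initialized before the closures are created.

package misc

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object DemoBug {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    val sc = new SparkContext(conf)

    val rdd = sc.parallelize(List("A", "B", "C", "D"))
    val str1 = "A"

    // Both filters now drop only "A", so each count should be 3.
    val rslt1 = rdd.filter(x => x != "A").count
    val rslt2 = rdd.filter(x => str1 != null && x != "A").count

    // Expected: DemoBug: rslt1 = 3 rslt2 = 3
    println("DemoBug: rslt1 = " + rslt1 + " rslt2 = " + rslt2)

    sc.stop()
  }
}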