If you don't want to refactor your code, you can put your input into a test file and pass its path to your program. After the test runs, read the data back from the output file you specified (you probably want this to be a temp file that is deleted on exit). Of course, that is not really a unit test; Matei's suggestion is preferable (this is how we test). However, if you have a long and complex flow, you might unit test the different parts separately and then have an integration test that reads from files and tests the whole flow together (I do this as well).
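Here is a rough sketch of what such a file-based test could look like. It is only an illustration under some assumptions: UpperCaseJob is a made-up stand-in for a main()-only job like GetInfo, UpperCaseJobFileTest and runOnLines are hypothetical names, and I am assuming the job does not set a master itself, so the test supplies one through the spark.master system property.

```scala
import java.io.File
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import scala.io.Source
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical stand-in for a job like GetInfo: everything happens in main(),
// which reads from args(0) and writes to args(1). It upper-cases each line.
object UpperCaseJob {
  def main(args: Array[String]): Unit = {
    // No master is set here; it is picked up from the spark.master system property.
    val sc = new SparkContext(new SparkConf().setAppName("UpperCaseJob"))
    try sc.textFile(args(0)).map(_.toUpperCase).saveAsTextFile(args(1))
    finally sc.stop()
  }
}

object UpperCaseJobFileTest {
  // Deletes a file, or a directory together with everything under it.
  private def deleteRecursively(f: File): Unit = {
    Option(f.listFiles()).foreach(_.foreach(deleteRecursively))
    f.delete()
  }

  // Writes the input to a temp file, runs the job's main(), and returns the output lines.
  def runOnLines(input: Seq[String]): Set[String] = {
    val dir = Files.createTempDirectory("upper-case-job").toFile
    try {
      val in = new File(dir, "input.txt")
      val out = new File(dir, "output") // must not exist yet: saveAsTextFile creates this directory
      Files.write(in.toPath, input.mkString("\n").getBytes(StandardCharsets.UTF_8))

      System.setProperty("spark.master", "local")
      UpperCaseJob.main(Array(in.getPath, out.getPath))

      // The output is a directory of part-* files; skip _SUCCESS and checksum files.
      out.listFiles().filter(_.getName.startsWith("part-")).flatMap { f =>
        val src = Source.fromFile(f, "UTF-8")
        try src.getLines().toList finally src.close()
      }.toSet
    } finally deleteRecursively(dir)
  }

  def main(args: Array[String]): Unit =
    assert(runOnLines(Seq("line 1", "line 2")) == Set("LINE 1", "LINE 2"))
}
```

Comparing the output as a Set keeps the check independent of how many part files Spark writes and in which order they are read. The temp directory is removed in the finally block whether or not the job succeeds.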
On Fri, Jun 13, 2014 at 10:04 PM, Matei Zaharia <[email protected]> wrote:

> You need to factor your program so that it’s not just a main(). This is
> not a Spark-specific issue, it’s about how you’d unit test any program in
> general. In this case, your main() creates a SparkContext, so you can’t
> pass one from outside, and your code has to read data from a file and write
> it to a file. It would be better to move your code for transforming data
> into a new function:
>
> def processData(lines: RDD[String]): RDD[String] = {
>   // build and return your "res" variable
> }
>
> Then you can unit-test this directly on data you create in your program:
>
> val myLines = sc.parallelize(Seq("line 1", "line 2"))
> val result = GetInfo.processData(myLines).collect()
> assert(result.toSet === Set("res 1", "res 2"))
>
> Matei
>
> On Jun 13, 2014, at 2:42 PM, SK <[email protected]> wrote:
>
> > Hi,
> >
> > I have looked through some of the test examples and also the brief
> > documentation on unit testing at
> > http://spark.apache.org/docs/latest/programming-guide.html#unit-testing, but
> > still dont have a good understanding of writing unit tests using the Spark
> > framework. Previously, I have written unit tests using specs2 framework and
> > have got them to work in Scalding. I tried to use the specs2 framework with
> > Spark, but could not find any simple examples I could follow. I am open to
> > specs2 or Funsuite, whichever works best with Spark. I would like some
> > additional guidance, or some simple sample code using specs2 or Funsuite. My
> > code is provided below.
> >
> >
> > I have the following code in src/main/scala/GetInfo.scala. It reads a Json
> > file and extracts some data. It takes the input file (args(0)) and output
> > file (args(1)) as arguments.
> >
> > object GetInfo{
> >
> >   def main(args: Array[String]) {
> >     val inp_file = args(0)
> >     val conf = new SparkConf().setAppName("GetInfo")
> >     val sc = new SparkContext(conf)
> >     val res = sc.textFile(log_file)
> >       .map(line => { parse(line) })
> >       .map(json =>
> >         {
> >           implicit lazy val formats = org.json4s.DefaultFormats
> >           val aid = (json \ "d" \ "TypeID").extract[Int]
> >           val ts = (json \ "d" \ "TimeStamp").extract[Long]
> >           val gid = (json \ "d" \ "ID").extract[String]
> >           (aid, ts, gid)
> >         }
> >       )
> >       .groupBy(tup => tup._3)
> >       .sortByKey(true)
> >       .map(g => (g._1, g._2.map(_._2).max))
> >     res.map(tuple=> "%s, %d".format(tuple._1, tuple._2)).saveAsTextFile(args(1))
> >   }
> >
> >
> > I would like to test the above code. My unit test is in src/test/scala. The
> > code I have so far for the unit test appears below:
> >
> > import org.apache.spark._
> > import org.specs2.mutable._
> >
> > class GetInfoTest extends Specification with java.io.Serializable{
> >
> >   val data = List (
> >     ("d": {"TypeID" = 10, "Timestamp": 1234, "ID": "ID1"}),
> >     ("d": {"TypeID" = 11, "Timestamp": 5678, "ID": "ID1"}),
> >     ("d": {"TypeID" = 10, "Timestamp": 1357, "ID": "ID2"}),
> >     ("d": {"TypeID" = 11, "Timestamp": 2468, "ID": "ID2"})
> >   )
> >
> >   val expected_out = List(
> >     ("ID1",5678),
> >     ("ID2",2468),
> >   )
> >
> >   "A GetInfo job" should {
> >     //***** How do I pass "data" define above as input and output which GetInfo expects as arguments? ******
> >     val sc = new SparkContext("local", "GetInfo")
> >
> >     //*** how do I get the output ***
> >
> >     //assuming out_buffer has the output I want to match it to the expected output
> >     "match expected output" in {
> >       ( out_buffer == expected_out) must beTrue
> >     }
> >   }
> >
> > }
> >
> > I would like some help with the tasks marked with "****" in the unit test
> > code above. If specs2 is not the right way to go, I am also open to
> > FunSuite.
> > I would like to know how to pass the input while calling my
> > program from the unit test and get the output.
> >
> > Thanks for your help.
> >
> > --
> > View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/guidance-on-simple-unit-testing-with-Spark-tp7604.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.

--
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
E: [email protected] W: www.velos.io
