Hi Lars,

Very pragmatic ideas around testing Spark applications end-to-end!
-Shiva

On Fri, Mar 18, 2016 at 12:35 PM, Lars Albertsson <la...@mapflat.com> wrote:
> I would recommend against writing unit tests for Spark programs, and instead focus on integration tests of jobs or pipelines of several jobs. You can still use a unit test framework to execute them. Perhaps this is what you meant.
>
> You can use any of the popular unit test frameworks to drive your tests, e.g. JUnit, Scalatest, Specs2. I prefer Scalatest, since it gives you a choice of TDD vs BDD, and it is also well integrated with IntelliJ.
>
> I would also recommend against using testing frameworks tied to a processing technology, such as Spark Testing Base. Although it does seem well crafted, and makes it easy to get started with testing, there are drawbacks:
>
> 1. I/O routines are not tested. Bundled test frameworks typically do not materialise datasets on storage, but pass them directly in memory. (I have not verified this for Spark Testing Base, but it appears to be the case.) I/O routines are therefore not exercised, and they often hide bugs, e.g. related to serialisation.
>
> 2. You create a strong coupling between processing technology and your tests. If you decide to change processing technology (which can happen soon in this fast-paced world...), you need to rewrite your tests. During a migration, the tests can therefore neither detect bugs introduced by the migration nor help you migrate quickly.
>
> I recommend that you instead materialise input datasets on local disk, run your Spark job (which writes output datasets to local disk), read the output from disk, and verify the results. You can still use Spark routines to read and write the input and output datasets. A Spark context is expensive to create, so for speed, I would recommend reusing the Spark context between input generation, running the job, and reading the output.
>
> This is easy to set up, so you don't need a dedicated framework for it. Just put your common boilerplate in a shared test trait or base class.
>
> In the future, when you want to replace your Spark job with something shinier, you can still use the old tests and only replace the part that runs your job, giving you some protection from regression bugs.
>
> Testing Spark Streaming applications is a different beast, and you can probably not reuse much from your batch testing.
>
> For testing streaming applications, I recommend that you run your application inside a unit test framework, e.g. Scalatest, and have the test setup create a fixture that includes your input and output components. For example, if your streaming application consumes from Kafka and updates tables in Cassandra, spin up single-node instances of Kafka and Cassandra on your local machine, and connect your application to them. Then feed input to a Kafka topic, and wait for the result to appear in Cassandra.
>
> With this setup, your application still runs in Scalatest, the tests run without custom setup in maven/sbt/gradle, and you can easily run and debug inside IntelliJ.
>
> Docker is suitable for spinning up external components. If you use Kafka, the Docker image spotify/kafka is useful, since it bundles Zookeeper.
>
> When waiting for output to appear, don't sleep for a long time and then check, since that will slow down your tests. Instead, enter a loop where you poll for the results and sleep for a few milliseconds in between, with a long timeout (~30s) before the test fails.
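[Editor's note: as an illustration of the poll-and-sleep loop described above, a minimal Scala helper might look like the sketch below. The object and method names, the default timings, and the fetchRowFromCassandra call in the usage comment are illustrative assumptions, not part of the original mail.]

  import scala.annotation.tailrec
  import scala.concurrent.duration._

  object TestPolling {
    /** Polls `condition` every `interval` until it holds, failing after `timeout`. */
    def awaitTrue(timeout: FiniteDuration = 30.seconds,
                  interval: FiniteDuration = 50.millis)(condition: => Boolean): Unit = {
      val deadline = timeout.fromNow

      @tailrec
      def loop(): Unit =
        if (condition) ()
        else if (deadline.isOverdue()) throw new AssertionError(s"Condition not met within $timeout")
        else { Thread.sleep(interval.toMillis); loop() }

      loop()
    }
  }

  // Example use in a streaming test: publish a message to Kafka, then wait
  // for its effect to show up in Cassandra (fetchRowFromCassandra is hypothetical).
  // TestPolling.awaitTrue() { fetchRowFromCassandra(key).isDefined }

[Scalatest also ships a ready-made org.scalatest.concurrent.Eventually trait that provides similar polling, if you prefer not to hand-roll the loop.]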
> This poll-and-sleep strategy keeps tests quick in the successful case, yet robust to occasional delays. The strategy does not work if you want to test for absence, e.g. to ensure that a particular message is filtered out. You can work around that by adding another message afterwards and polling for its effect before testing for the absence of the first. Be aware, however, that messages can be processed out of order in Spark Streaming, depending on partitioning.
>
> I have tested Spark applications with both strategies described above, and it is straightforward to set up. Let me know if you want clarifications or assistance.
>
> Regards,
>
> Lars Albertsson
> Data engineering consultant
> www.mapflat.com
> +46 70 7687109
>
>
> On Wed, Mar 2, 2016 at 6:54 PM, SRK <swethakasire...@gmail.com> wrote:
> > Hi,
> >
> > What is a good unit testing framework for Spark batch/streaming jobs? I am using core Spark, Spark SQL with dataframes, and the Streaming API. Any good framework to cover unit tests for these APIs?
> >
> > Thanks!
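[Editor's note: for illustration, here is a rough Scalatest sketch of the batch-testing setup described in the quoted mail: a shared trait that reuses one Spark session, input materialised on local disk, the job run against it, and the output read back and verified. It assumes Spark 2.x's SparkSession (on 1.x the same pattern applies to SparkContext/SQLContext); WordCountJob.run and its read/write contract are hypothetical, not anything from the original thread.]

  import java.nio.file.Files

  import org.apache.spark.sql.SparkSession
  import org.scalatest.{BeforeAndAfterAll, FlatSpec, Matchers, Suite}

  // Shared boilerplate in a trait, as suggested above: one Spark session reused
  // for input generation, running the job, and reading the output.
  trait SharedSparkSession extends BeforeAndAfterAll { this: Suite =>
    lazy val spark: SparkSession =
      SparkSession.builder().master("local[2]").appName("integration-test").getOrCreate()

    override def afterAll(): Unit = {
      spark.stop()
      super.afterAll()
    }
  }

  class WordCountJobSpec extends FlatSpec with Matchers with SharedSparkSession {

    "WordCountJob" should "count words in a text dataset" in {
      val inputDir  = Files.createTempDirectory("wordcount-input").toString
      val outputDir = Files.createTempDirectory("wordcount-output").toString + "/result"

      // Materialise the input dataset on local disk.
      import spark.implicits._
      Seq("spark testing", "spark").toDS().write.mode("overwrite").text(inputDir)

      // Run the job under test. Assumed contract: read text from inputDir, write
      // (word, count) pairs as Parquet to outputDir. Because the job reads and
      // writes real files, I/O and serialisation paths are exercised.
      WordCountJob.run(spark, inputDir, outputDir)

      // Read the output back from disk and verify the results.
      val counts = spark.read.parquet(outputDir)
        .collect()
        .map(row => row.getString(0) -> row.getLong(1))
        .toMap

      counts shouldBe Map("spark" -> 2L, "testing" -> 1L)
    }
  }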