Hi Lars,

Very pragmatic ideas around testing Spark applications end-to-end!
-Shiva

On Fri, Mar 18, 2016 at 12:35 PM, Lars Albertsson <la...@mapflat.com> wrote:
> I would recommend against writing unit tests for Spark programs, and instead focus on integration tests of jobs or pipelines of several jobs. You can still use a unit test framework to execute them. Perhaps this is what you meant.
>
> You can use any of the popular unit test frameworks to drive your tests, e.g. JUnit, Scalatest, Specs2. I prefer Scalatest, since it gives you a choice of TDD vs BDD, and it is also well integrated with IntelliJ.
>
> I would also recommend against using testing frameworks tied to a processing technology, such as Spark Testing Base. Although it does seem well crafted, and makes it easy to get started with testing, there are drawbacks:
>
> 1. I/O routines are not tested. Bundled test frameworks typically do not materialise datasets on storage, but pass them directly in memory. (I have not verified this for Spark Testing Base, but it appears to be the case.) I/O routines are therefore not exercised, and they often hide bugs, e.g. related to serialisation.
>
> 2. You create a strong coupling between processing technology and your tests. If you decide to change processing technology (which can happen soon in this fast-paced world...), you need to rewrite your tests. During a migration, the tests can therefore neither detect bugs introduced by the migration nor help you migrate quickly.
>
> I recommend that you instead materialise input datasets on local disk, run your Spark job (which writes output datasets to local disk), read the output from disk, and verify the results. You can still use Spark routines to read and write the input and output datasets. A Spark context is expensive to create, so for speed, I would recommend reusing the Spark context between input generation, running the job, and reading the output.
>
> This is easy to set up, so you don't need a dedicated framework for it. Just put your common boilerplate in a shared test trait or base class.
>
> In the future, when you want to replace your Spark job with something shinier, you can still use the old tests and only replace the part that runs your job, giving you some protection from regression bugs.
>
> Testing Spark Streaming applications is a different beast, and you can probably not reuse much from your batch testing.
>
> For testing streaming applications, I recommend that you run your application inside a unit test framework, e.g. Scalatest, and have the test setup create a fixture that includes your input and output components. For example, if your streaming application consumes from Kafka and updates tables in Cassandra, spin up single-node instances of Kafka and Cassandra on your local machine, and connect your application to them. Then feed input to a Kafka topic, and wait for the result to appear in Cassandra.
>
> With this setup, your application still runs in Scalatest, the tests run without custom setup in maven/sbt/gradle, and you can easily run and debug inside IntelliJ.
>
> Docker is suitable for spinning up external components. If you use Kafka, the Docker image spotify/kafka is useful, since it bundles Zookeeper.
>
> When waiting for output to appear, don't sleep for a long time and then check, since that will slow down your tests. Instead, enter a loop where you poll for the results and sleep for a few milliseconds in between, with a long timeout (~30s) before the test fails.
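[Editor's note: as an illustration of the poll-and-sleep loop described above, a minimal Scala helper might look like the sketch below. The object and method names, the default timings, and the fetchRowFromCassandra call in the usage comment are illustrative assumptions, not part of the original mail.]

  import scala.annotation.tailrec
  import scala.concurrent.duration._

  object TestPolling {
    /** Polls `condition` every `interval` until it holds, failing after `timeout`. */
    def awaitTrue(timeout: FiniteDuration = 30.seconds,
                  interval: FiniteDuration = 50.millis)(condition: => Boolean): Unit = {
      val deadline = timeout.fromNow

      @tailrec
      def loop(): Unit =
        if (condition) ()
        else if (deadline.isOverdue()) throw new AssertionError(s"Condition not met within $timeout")
        else { Thread.sleep(interval.toMillis); loop() }

      loop()
    }
  }

  // Example use in a streaming test: publish a message to Kafka, then wait
  // for its effect to show up in Cassandra (fetchRowFromCassandra is hypothetical).
  // TestPolling.awaitTrue() { fetchRowFromCassandra(key).isDefined }

[Scalatest also ships a ready-made org.scalatest.concurrent.Eventually trait that provides similar polling, if you prefer not to hand-roll the loop.]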
> This poll-and-sleep strategy keeps tests quick in the successful case, yet robust to occasional delays. The strategy does not work if you want to test for absence, e.g. to ensure that a particular message is filtered out. You can work around that by adding another message afterwards and polling for its effect before testing for the absence of the first. Be aware, however, that messages can be processed out of order in Spark Streaming, depending on partitioning.
>
> I have tested Spark applications with both strategies described above, and it is straightforward to set up. Let me know if you want clarifications or assistance.
>
> Regards,
>
> Lars Albertsson
> Data engineering consultant
> www.mapflat.com
> +46 70 7687109
>
>
> On Wed, Mar 2, 2016 at 6:54 PM, SRK <swethakasire...@gmail.com> wrote:
> > Hi,
> >
> > What is a good unit testing framework for Spark batch/streaming jobs? I am using core Spark, Spark SQL with dataframes, and the Streaming API. Any good framework to cover unit tests for these APIs?
> >
> > Thanks!
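[Editor's note: for illustration, here is a rough Scalatest sketch of the batch-testing setup described in the quoted mail: a shared trait that reuses one Spark session, input materialised on local disk, the job run against it, and the output read back and verified. It assumes Spark 2.x's SparkSession (on 1.x the same pattern applies to SparkContext/SQLContext); WordCountJob.run and its read/write contract are hypothetical, not anything from the original thread.]

  import java.nio.file.Files

  import org.apache.spark.sql.SparkSession
  import org.scalatest.{BeforeAndAfterAll, FlatSpec, Matchers, Suite}

  // Shared boilerplate in a trait, as suggested above: one Spark session reused
  // for input generation, running the job, and reading the output.
  trait SharedSparkSession extends BeforeAndAfterAll { this: Suite =>
    lazy val spark: SparkSession =
      SparkSession.builder().master("local[2]").appName("integration-test").getOrCreate()

    override def afterAll(): Unit = {
      spark.stop()
      super.afterAll()
    }
  }

  class WordCountJobSpec extends FlatSpec with Matchers with SharedSparkSession {

    "WordCountJob" should "count words in a text dataset" in {
      val inputDir  = Files.createTempDirectory("wordcount-input").toString
      val outputDir = Files.createTempDirectory("wordcount-output").toString + "/result"

      // Materialise the input dataset on local disk.
      import spark.implicits._
      Seq("spark testing", "spark").toDS().write.mode("overwrite").text(inputDir)

      // Run the job under test. Assumed contract: read text from inputDir, write
      // (word, count) pairs as Parquet to outputDir. Because the job reads and
      // writes real files, I/O and serialisation paths are exercised.
      WordCountJob.run(spark, inputDir, outputDir)

      // Read the output back from disk and verify the results.
      val counts = spark.read.parquet(outputDir)
        .collect()
        .map(row => row.getString(0) -> row.getLong(1))
        .toMap

      counts shouldBe Map("spark" -> 2L, "testing" -> 1L)
    }
  }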