Thanks!

It is on my backlog to write a couple of blog posts on the topic, and
eventually some example code, but I am currently busy with clients.

Thanks for the pointer to Eventually - I was unaware. Fast exit on
exception would be a useful addition, indeed.

Lars Albertsson
Data engineering consultant
www.mapflat.com
+46 70 7687109

On Mon, Mar 28, 2016 at 2:00 PM, Steve Loughran <ste...@hortonworks.com>
wrote:
> This is a good summary - have you thought of publishing it at the end
> of a URL for others to refer to?
>
>> On 18 Mar 2016, at 07:05, Lars Albertsson <la...@mapflat.com> wrote:
>>
>> I would recommend against writing unit tests for Spark programs, and
>> instead focus on integration tests of jobs or pipelines of several
>> jobs. You can still use a unit test framework to execute them. Perhaps
>> this is what you meant.
>>
>> You can use any of the popular unit test frameworks to drive your
>> tests, e.g. JUnit, Scalatest, Specs2. I prefer Scalatest, since it
>> gives you a choice of TDD vs BDD styles, and it is also well
>> integrated with IntelliJ.
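>>
>> For instance, the two styles look roughly like this (a sketch using
>> the classic Scalatest class names):
>>
>> import org.scalatest.{FlatSpec, FunSuite, Matchers}
>>
>> // TDD style:
>> class WordCountSuite extends FunSuite {
>>   test("counts words") { /* ... */ }
>> }
>>
>> // BDD style:
>> class WordCountSpec extends FlatSpec with Matchers {
>>   "the word count job" should "count words" in { /* ... */ }
>> }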
>>
>> I would also recommend against using testing frameworks tied to a
>> processing technology, such as Spark Testing Base. Although it does
>> seem well crafted, and makes it easy to get started with testing,
>> there are drawbacks:
>>
>> 1. I/O routines are not tested. Bundled test frameworks typically do
>> not materialise datasets on storage, but pass them directly in memory.
>> (I have not verified this for Spark Testing Base, but it appears so.)
>> I/O routines are therefore not exercised, and they often hide bugs,
>> e.g. related to serialisation.
>>
>> 2. You create a strong coupling between the processing technology and
>> your tests. If you decide to change processing technology (which can
>> happen soon in this fast-paced world...), you need to rewrite your
>> tests. Therefore, during a migration, the tests can neither detect
>> bugs introduced by the migration nor help you migrate quickly.
>>
>> I recommend that you instead materialise input datasets on local
>> disk, run your Spark job (which writes its output datasets to local
>> disk), read the output back from disk, and verify the results. You
>> can still use Spark routines to read and write the input and output
>> datasets. A Spark context is expensive to create, so for speed, I
>> recommend reusing it between input generation, running the job, and
>> reading the output.
>>
>> This is easy to set up, so you don't need a dedicated framework for
>> it. Just put your common boilerplate in a shared test trait or base
>> class.
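>>
>> For illustration, a minimal sketch of such a trait and test. The
>> WordCountJob.run(sc, inputPath, outputPath) entry point is
>> hypothetical, assumed to write word<TAB>count text lines:
>>
>> import java.nio.file.Files
>> import org.apache.spark.{SparkConf, SparkContext}
>> import org.scalatest.{BeforeAndAfterAll, FlatSpec, Matchers, Suite}
>>
>> // Shared boilerplate: one Spark context, reused across input
>> // generation, running the job, and reading the output.
>> trait SparkJobTest extends BeforeAndAfterAll { this: Suite =>
>>   lazy val sc = new SparkContext(
>>     new SparkConf().setMaster("local[2]").setAppName("test"))
>>   override def afterAll(): Unit = sc.stop()
>> }
>>
>> class WordCountJobSpec extends FlatSpec with Matchers with SparkJobTest {
>>
>>   "WordCountJob" should "count words, disk to disk" in {
>>     val tmpDir = Files.createTempDirectory("wordcount-test").toString
>>     val inputDir = tmpDir + "/input"
>>     val outputDir = tmpDir + "/output"
>>
>>     // Materialise the input dataset on local disk.
>>     sc.parallelize(Seq("apple banana apple")).saveAsTextFile(inputDir)
>>
>>     // Run the (hypothetical) job under test; it reads and writes
>>     // local files.
>>     WordCountJob.run(sc, inputDir, outputDir)
>>
>>     // Read the output dataset back from disk and verify.
>>     val counts = sc.textFile(outputDir)
>>       .map(_.split("\t"))
>>       .map(fields => fields(0) -> fields(1).toLong)
>>       .collect().toMap
>>     counts should contain ("apple" -> 2L)
>>     counts should contain ("banana" -> 1L)
>>   }
>> }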
>>
>> In the future, when you want to replace your Spark job with something
>> shinier, you can still use the old tests, and only replace the part
>> that runs your job, giving you some protection from regression bugs.
>>
>>
>> Testing Spark Streaming applications is a different beast, and you
>> probably cannot reuse much from your batch testing.
>>
>> For testing streaming applications, I recommend that you run your
>> application inside a unit test framework, e.g, Scalatest, and have the
>> test setup create a fixture that includes your input and output
>> components. For example, if your streaming application consumes from
>> Kafka and updates tables in Cassandra, spin up single node instances
>> of Kafka and Cassandra on your local machine, and connect your
>> application to them. Then feed input to a Kafka topic, and wait for
>> the result to appear in Cassandra.
>>
>> With this setup, your application still runs in Scalatest, the tests
>> run without custom setup in maven/sbt/gradle, and you can easily run
>> and debug inside IntelliJ.
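>>
>> A skeleton of such a test could look like this (a sketch: it assumes
>> Kafka on localhost:9092 and Cassandra on localhost:9042, e.g. started
>> via Docker, the streaming application already running against them,
>> and a hypothetical readUserRows() helper that queries the Cassandra
>> table through its driver):
>>
>> import java.util.Properties
>> import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
>> import org.scalatest.{FlatSpec, Matchers}
>>
>> class EnrichmentStreamSpec extends FlatSpec with Matchers {
>>
>>   "the streaming job" should "update Cassandra from Kafka input" in {
>>     val props = new Properties()
>>     props.put("bootstrap.servers", "localhost:9092")
>>     props.put("key.serializer",
>>       "org.apache.kafka.common.serialization.StringSerializer")
>>     props.put("value.serializer",
>>       "org.apache.kafka.common.serialization.StringSerializer")
>>     val producer = new KafkaProducer[String, String](props)
>>
>>     // Feed input to the topic the application consumes from.
>>     producer.send(new ProducerRecord("user-events", "user1", """{"id": "user1"}"""))
>>     producer.flush()
>>
>>     // Poll Cassandra for the result, sleeping briefly in between.
>>     val deadline = System.currentTimeMillis() + 30000
>>     var rows = readUserRows("user1") // hypothetical helper
>>     while (rows.isEmpty && System.currentTimeMillis() < deadline) {
>>       Thread.sleep(50)
>>       rows = readUserRows("user1")
>>     }
>>     producer.close()
>>     rows should not be empty
>>   }
>> }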
>>
>> Docker is suitable for spinning up external components. If you use
>> Kafka, the Docker image spotify/kafka is useful, since it bundles
>> Zookeeper.
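>>
>> For example, something like this starts Kafka and Zookeeper in one
>> container (ADVERTISED_HOST/ADVERTISED_PORT are the image's documented
>> settings; adjust the host if Docker runs in a VM):
>>
>> docker run -d -p 2181:2181 -p 9092:9092 \
>>     --env ADVERTISED_HOST=localhost --env ADVERTISED_PORT=9092 \
>>     spotify/kafka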
>>
>> When waiting for output to appear, don't sleep for a long time and
>> then check, since it will slow down your tests. Instead enter a loop
>> where you poll for the results and sleep for a few milliseconds in
>> between, with a long timeout (~30s) before the test fails with a
>> timeout.
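>>
>> Factored into a reusable helper, the loop could look like this
>> (a sketch; the names are made up):
>>
>> import scala.annotation.tailrec
>>
>> object TestPolling {
>>   // Poll until `poll` yields a result, sleeping briefly in between;
>>   // give up with a test failure when the deadline passes.
>>   @tailrec
>>   def pollUntil[T](deadline: Long = System.currentTimeMillis() + 30000,
>>                    intervalMillis: Long = 50)(poll: => Option[T]): T =
>>     poll match {
>>       case Some(result) => result
>>       case None if System.currentTimeMillis() >= deadline =>
>>         throw new AssertionError("Timed out waiting for result")
>>       case None =>
>>         Thread.sleep(intervalMillis)
>>         pollUntil(deadline, intervalMillis)(poll)
>>     }
>> }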
>
> org.scalatest.concurrent.Eventually is your friend there
>
> eventually(stdTimeout, stdInterval) {
>   listRestAPIApplications(connector, webUI, true) should contain(expectedAppId)
> }
>
> It has good exponential backoff, for fast initial success without using
> too much CPU later, and is simple to use.
>
> If it has weaknesses in my tests, they are:
>
> 1. It retries on all exceptions, not just failed assertions. If there's
> a bug in the test code, it manifests as a timeout. (I think I could play
> with Suite.anExceptionThatShouldCauseAnAbort() here.)
> 2. Its timeout action is simply to rethrow the fault; I like to execute
> a closure to grab more diagnostics.
> 3. It doesn't support a fail-fast exception which your code can raise
> to indicate that the desired state is never going to be reached, so
> that the test should fail fast. Here a new exception and another entry
> in anExceptionThatShouldCauseAnAbort() may be the answer. I should sit
> down and play with that some more.
>
>
>>
>> This poll-and-sleep strategy makes tests both quick in successful
>> cases and robust to occasional delays. The strategy does not work,
>> however, if you want to test for absence, e.g. to ensure that a
>> particular message is filtered out. You can work around that by
>> sending another message afterwards and polling for its effect before
>> testing for the absence of the first. Be aware, though, that messages
>> can be processed out of order in Spark Streaming, depending on
>> partitioning.
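>>
>> As a sketch, with the hypothetical sendEvent() and readUserRows()
>> fixture helpers from above:
>>
>> it should "filter out blacklisted events" in {
>>   sendEvent("user-events", "blacklisted-user") // hypothetical helper
>>   sendEvent("user-events", "sentinel-user")
>>
>>   // Wait until the sentinel's effect is visible; by then the first
>>   // message has had its chance to be processed (modulo ordering
>>   // across partitions).
>>   TestPolling.pollUntil() {
>>     val rows = readUserRows("sentinel-user")
>>     if (rows.nonEmpty) Some(rows) else None
>>   }
>>
>>   readUserRows("blacklisted-user") shouldBe empty
>> }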
>>
>>
>> I have tested Spark applications with both strategies described above,
>> and it is straightforward to set up. Let me know if you want
>> clarifications or assistance.
>>
>> Regards,
>>
>>
>>
>> Lars Albertsson
>> Data engineering consultant
>> www.mapflat.com
>> +46 70 7687109
>>
>>
>> On Wed, Mar 2, 2016 at 6:54 PM, SRK <swethakasire...@gmail.com> wrote:
>>> Hi,
>>>
>>> What is a good unit testing framework for Spark batch/streaming jobs?
>>> I have core Spark, Spark SQL with dataframes, and the streaming API in
>>> use. Any good framework to cover unit tests for these APIs?
>>>
>>> Thanks!
>>
>
