On 11 Apr 2017, at 20:46, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:

And once again Java programmers are trying to solve a data analytics and data
warehousing problem using programming paradigms. It is genuinely a pain to see
this happen.



While I'm happy to be faulted for treating things as software processes, having
a fully automated mechanism for testing the latest code before production is
something I'd consider foundational today. This is what "Continuous Deployment"
was about when it was first conceived. Does it mean you should blindly deploy
that way? Well, not if you worry about security, but having that review process
and then a final manual "deploy" button can address that.

Cloud infrastructures let you integrate cluster instantiation into the process,
which helps you automate things like "stage the deployment in some new VMs, run
acceptance tests (*), then switch the load balancer over to the new cluster,
being ready to switch back if you need to". I've not tried that with streaming
apps though; I don't know how to do it there. Booting the new cluster off
checkpointed state requires deserialization to work, which can't be guaranteed
if you are changing the objects which get serialized.
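For what it's worth, here is a rough sketch of that staged-switchover flow in
Python with boto3. The target group ARN, instance IDs, and acceptance-test
command are invented placeholders, not a real setup:

import subprocess
import sys

import boto3

# Hypothetical identifiers -- fill in with whatever the deployment actually uses.
TARGET_GROUP_ARN = "arn:aws:elasticloadbalancing:eu-west-1:123456789012:targetgroup/app/abc123"
NEW_CLUSTER = [{"Id": "i-0newcluster1"}, {"Id": "i-0newcluster2"}]
OLD_CLUSTER = [{"Id": "i-0oldcluster1"}, {"Id": "i-0oldcluster2"}]


def acceptance_tests_pass() -> bool:
    # Placeholder command: run the acceptance suite against the freshly staged VMs.
    return subprocess.run(["./run-acceptance-tests.sh", "--env", "staging"]).returncode == 0


def main() -> int:
    if not acceptance_tests_pass():
        print("Acceptance tests failed; production is untouched.")
        return 1
    elb = boto3.client("elbv2")
    # Point the load balancer at the new cluster first...
    elb.register_targets(TargetGroupArn=TARGET_GROUP_ARN, Targets=NEW_CLUSTER)
    # ...then drain the old one, but keep its instances around so you can switch back.
    elb.deregister_targets(TargetGroupArn=TARGET_GROUP_ARN, Targets=OLD_CLUSTER)
    return 0


if __name__ == "__main__":
    sys.exit(main())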

I'd argue, then, that it's not a problem which has already been solved by data
analytics/warehousing, though if you've got pointers there, I'd be grateful.
Always good to see work by others. Indeed, the telecoms industry has led the
way in testing and HA deployment: if you look at Erlang you can see a system
designed with hot upgrades in mind, in a way that Java's "add a JAR to a web
server" model never was.

-Steve


(*) do always make sure this is the test cluster with a snapshot of test data, 
not production machines/data. There are always horror stories there.


Regards,
Gourav

On Tue, Apr 11, 2017 at 2:20 PM, Sam Elamin <hussam.ela...@gmail.com> wrote:
Hi Steve


Thanks for the detailed response. I think this problem doesn't have an
industry-standard solution as of yet, and I am sure a lot of people would
benefit from the discussion.

I realise now what you are saying, so thanks for clarifying. That said, let me
try to explain how we approached the problem.

There are two problems you highlighted: the first is moving the code from SCM
to prod, and the other is ensuring the data your code uses is correct (using
the latest data from prod).


"how do you get your code from SCM into production?"

We currently have our pipeline run via Airflow, with our DAGs in S3. With
regard to how we get our code from SCM to production:

1) A Jenkins build builds our Spark applications and runs tests
2) Once the first build is successful, we trigger another build to copy the
DAGs to an S3 folder

We then routinely sync this folder to the local Airflow DAGs folder every X
minutes.
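A minimal sketch of what such a periodic sync could look like, assuming the AWS
CLI is installed on the Airflow host; the bucket name, local path, and interval
below are illustrative placeholders rather than the actual setup:

import subprocess
import time

DAG_BUCKET = "s3://example-pipeline-dags/dags/"   # placeholder: bucket the Jenkins job copies DAGs into
AIRFLOW_DAGS_DIR = "/opt/airflow/dags/"           # placeholder: local Airflow DAGs folder
SYNC_INTERVAL_SECONDS = 5 * 60                    # "every X minutes"

while True:
    # `aws s3 sync` only copies new or changed files; --delete also removes
    # DAGs that have been dropped from the bucket.
    subprocess.run(
        ["aws", "s3", "sync", "--delete", DAG_BUCKET, AIRFLOW_DAGS_DIR],
        check=False,
    )
    time.sleep(SYNC_INTERVAL_SECONDS)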

Re test data
" but what's your strategy for test data: that's always the troublespot."

Our application uses versioning against the data, so we expect the source data
to be at a certain version and the output data to also be at a certain version.

We have a test resources folder that follows the same versioning convention -
this is the data that our application tests use to ensure that the data is in
the correct format.

So, for example, if we have Table X at version 1 that depends on data from
Tables A and B, also at version 1, we run our Spark application and then ensure
the transformed Table X has the correct columns and row values.

Then, when we have a new version 2 of the source data, or we add a new column
to Table X (version 2), we generate a new version of the data and ensure the
tests are updated.

That way we ensure any new version of the data has tests against it
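A rough sketch of what such a versioned-fixture test might look like in
PySpark; the paths, column names, and the transform_table_x() function are
stand-ins for whatever the real job does:

from pyspark.sql import DataFrame, SparkSession


def transform_table_x(table_a: DataFrame, table_b: DataFrame) -> DataFrame:
    # Placeholder for the real Spark transformation under test.
    return table_a.join(table_b, "id")


def test_table_x_v1():
    spark = SparkSession.builder.master("local[2]").appName("pipeline-tests").getOrCreate()

    # Versioned fixtures checked into the test resources folder.
    table_a = spark.read.parquet("src/test/resources/v1/table_a")
    table_b = spark.read.parquet("src/test/resources/v1/table_b")

    result = transform_table_x(table_a, table_b)

    # The transformed table should have the columns expected for version 1...
    assert set(result.columns) == {"id", "amount", "created_at"}   # placeholder column names

    # ...and its rows should match the expected output fixture for version 1.
    expected = spark.read.parquet("src/test/resources/v1/expected_table_x")
    assert result.subtract(expected).count() == 0
    assert expected.subtract(result).count() == 0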

"I've never seen any good strategy there short of "throw it at a copy of the 
production dataset"."

I agree, which is why we have a sample of the production data and version the
schemas that we expect the source and target data to conform to.
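One way such a versioned schema check could be expressed is to capture each
expected schema as JSON (e.g. via df.schema.json()) and validate the sampled
data against it; the file layout and table name below are made up for
illustration:

import json

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType

spark = SparkSession.builder.master("local[2]").appName("schema-check").getOrCreate()

# Expected schema for version 1 of table_x, captured earlier with df.schema.json()
# and checked into the repo alongside the test data.
with open("schemas/v1/table_x.json") as f:
    expected_schema = StructType.fromJson(json.load(f))

# Sample of the production data, validated against the versioned schema.
df = spark.read.parquet("sample-data/v1/table_x")

assert df.schema == expected_schema, "table_x no longer matches its v1 schema"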

If people are interested, I am happy to write a blog post about it in the hope
that it helps people build more reliable pipelines.


Love to see that.

Kind Regards
Sam
