S3 checkpointing in AWS in Frankfurt

Jonathan Share Tue, 22 Nov 2016 23:53:01 -0800

Hi,

I'm interested in hearing whether anyone else has experience with using
Amazon S3 as a state backend in the Frankfurt region. For political reasons
we've been asked to keep all European data in Amazon's Frankfurt region.
This causes a problem, as the S3 endpoint in Frankfurt requires the use of
AWS Signature Version 4 ("This new Region supports only Signature Version
4" [1]), and this doesn't appear to work with the Hadoop version that Flink
is built against [2].


After some hacking we have managed to create a Docker image with a build of
Flink 1.2 master, copying over jar files from the Hadoop 3.0.0-alpha1
package. This appears to work for the most part, but we still suffer from
some classpath problems (conflicts between the AWS API used in Hadoop and
the one we want to use in our streams for interacting with Kinesis), and
the whole thing feels a little fragile. Has anyone else tried this? Is
there a simpler solution?

As a follow-up question, we saw that with checkpointing on three relatively
simple streams set to 1 second, our S3 costs were higher than the EC2 costs
for our entire infrastructure. This seems slightly disproportionate. For
now we have increased the checkpointing interval to 10 seconds and that has
greatly improved the cost projections graphed via Amazon CloudWatch, but
I'm interested in hearing about other people's experience with this. Is
that the kind of billing level we can expect, or is this a symptom of a
misconfiguration? Is this a setup others are using? As we are using
Kinesis as the source for all streams, I don't see a huge risk with larger
checkpoint intervals, and our sinks are designed to mostly tolerate
duplicates (some improvements can be made).
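
To make the effect of the interval concrete, here is a rough
back-of-the-envelope sketch. It only counts how many checkpoints are
triggered per month and multiplies that by a request price. The number of
S3 requests per checkpoint and the price per 1,000 requests are
hypothetical placeholders, not measured or quoted figures, so substitute
values from your own bill.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def checkpoints_per_month(interval_seconds, streams, days=30):
    """Checkpoints triggered across all streams over `days` days."""
    return streams * days * SECONDS_PER_DAY // interval_seconds


def monthly_request_cost(interval_seconds, streams,
                         requests_per_checkpoint,
                         price_per_1000_requests, days=30):
    """Estimated S3 request cost for checkpointing only.

    requests_per_checkpoint and price_per_1000_requests are assumptions
    supplied by the caller; they are not taken from AWS or Flink.
    """
    requests = (checkpoints_per_month(interval_seconds, streams, days)
                * requests_per_checkpoint)
    return requests * price_per_1000_requests / 1000.0


# Three streams, as in our setup, at 1 s and 10 s intervals.
for interval in (1, 10):
    print(interval, checkpoints_per_month(interval, streams=3))
```

With three streams and a 30-day month, a 1 second interval works out to
7,776,000 checkpoints, and a 10 second interval to a tenth of that. Under
this simple model the request cost scales linearly with the checkpoint
frequency, which would be consistent with the improvement we saw.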

Thanks in advance
Jonathan


[1] https://aws.amazon.com/blogs/aws/aws-region-germany/
[2] https://issues.apache.org/jira/browse/HADOOP-13324
