EMRFS looks to *add* cost (and consistency). Storing an object in S3 costs "$0.005 per 1,000 requests", so $0.432/day per file at 1 Hz. Is the number of checkpoint files simply parallelism * number of operators? That could add up quickly.
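A rough sketch of that math (the parallelism and operator counts below are hypothetical placeholders, and the one-PUT-per-checkpoint-file model is exactly the open question above):

    // Back-of-the-envelope estimate, assuming one S3 PUT per checkpoint file
    // and the $0.005 per 1,000 PUT requests price quoted above. Whether
    // files ~= parallelism * stateful operators is the open question here.
    public class S3CheckpointCostEstimate {
        public static void main(String[] args) {
            double putPricePerRequest = 0.005 / 1000.0;  // USD per PUT
            int checkpointIntervalSeconds = 1;           // 1 Hz checkpointing
            int parallelism = 4;                         // hypothetical
            int statefulOperators = 3;                   // hypothetical

            long checkpointsPerDay = 86_400 / checkpointIntervalSeconds;
            long filesPerCheckpoint = (long) parallelism * statefulOperators;
            double costPerDay = checkpointsPerDay * filesPerCheckpoint * putPricePerRequest;

            System.out.printf("~%d PUTs/day, ~$%.2f/day%n",
                    checkpointsPerDay * filesPerCheckpoint, costPerDay);
        }
    }

With those made-up numbers (parallelism 4, 3 stateful operators, 1 s interval) that is already ~1M PUTs and ~$5/day, which is why the request charges can dwarf storage.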
Is the recommendation to run HDFS on EBS?

On Wed, Nov 23, 2016 at 12:40 PM, Jonathan Share <jon.sh...@gmail.com> wrote:

> Hi Greg,
>
> Standard storage class, everything is on defaults, we've not done anything
> special with the bucket.
>
> CloudWatch only appears to give me total billing for S3 in general; I
> don't see a breakdown unless that's something I can configure somewhere.
>
> Regards,
> Jonathan
>
>
> On 23 November 2016 at 16:29, Greg Hogan <c...@greghogan.com> wrote:
>
>> Hi Jonathan,
>>
>> Which S3 storage class are you using? Do you have a breakdown of the S3
>> costs as storage / API calls / early deletes / data transfer?
>>
>> Greg
>>
>> On Wed, Nov 23, 2016 at 2:52 AM, Jonathan Share <jon.sh...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I'm interested in hearing whether anyone else has experience using
>>> Amazon S3 as a state backend in the Frankfurt region. For political
>>> reasons we've been asked to keep all European data in Amazon's Frankfurt
>>> region. This causes a problem, as the S3 endpoint in Frankfurt requires
>>> the use of AWS Signature Version 4 ("This new Region supports only
>>> Signature Version 4" [1]) and this doesn't appear to work with the Hadoop
>>> version that Flink is built against [2].
>>>
>>> After some hacking we have managed to create a Docker image with a
>>> build of Flink 1.2 master, copying over jar files from the Hadoop
>>> 3.0.0-alpha1 package, and this appears to work for the most part, but we
>>> still suffer from some classpath problems (conflicts between the AWS API
>>> used in Hadoop and the one we want to use in our streams for interacting
>>> with Kinesis) and the whole thing feels a little fragile. Has anyone else
>>> tried this? Is there a simpler solution?
>>>
>>> As a follow-up question, we saw that with checkpointing set to 1 second
>>> on three relatively simple streams, our S3 costs were higher than the EC2
>>> costs for our entire infrastructure. This seems slightly disproportionate.
>>> For now we have reduced the checkpointing interval to 10 seconds and that
>>> has greatly improved the cost projections graphed via Amazon CloudWatch,
>>> but I'm interested in hearing other people's experience with this. Is that
>>> the kind of billing level we can expect, or is it a symptom of a
>>> misconfiguration? Is this a setup others are using? As we are using
>>> Kinesis as the source for all streams I don't see a huge risk in larger
>>> checkpoint intervals, and our sinks are designed to mostly tolerate
>>> duplicates (some improvements can be made).
>>>
>>> Thanks in advance,
>>> Jonathan
>>>
>>>
>>> [1] https://aws.amazon.com/blogs/aws/aws-region-germany/
>>> [2] https://issues.apache.org/jira/browse/HADOOP-13324
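For reference, a minimal sketch of how the 10-second checkpoint interval and S3 checkpoint path described in the quoted message would be wired up in the DataStream API. The bucket path and interval are illustrative only, and this assumes the Hadoop S3 filesystem has already been configured to work with the Frankfurt (eu-central-1) Signature V4 endpoint:

    import org.apache.flink.runtime.state.filesystem.FsStateBackend;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class CheckpointIntervalExample {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment();

            // Checkpoint every 10 seconds instead of every second; each
            // checkpoint still costs one S3 PUT per written state file.
            env.enableCheckpointing(10_000);

            // Hypothetical bucket; checkpoints are written beneath this path.
            env.setStateBackend(new FsStateBackend("s3://my-bucket/flink/checkpoints"));

            // ... build sources/sinks and call env.execute() as usual ...
        }
    }

Lengthening the interval reduces the PUT rate linearly, at the cost of a longer replay window from Kinesis after a failure.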