Thomas,

When using TransferManager, as we are in CloudStack, the MD5 hashes are calculated by the Amazon AWS Java client. It also determines how best to utilize multi-part upload, if at all. I just want to ensure that folks understand the information below applies when interacting with the HTTP API, but that the Amazon AWS Java client handles most of these details for the developer.
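For anyone who does end up talking to the HTTP API directly, here is a rough sketch of the client-side bookkeeping Tom describes below: computing the Content-MD5 header value, deciding whether an ETag can safely be compared to a local MD5, and building the complete request from your own list of part ETags. It is Python only for brevity. The function names are made up for illustration, and the "32 hex characters, dash, part count" pattern is an assumption based on the AWS ETag quoted in this thread, not something other S3 stores are guaranteed to follow.

```python
import base64
import hashlib
import re
from xml.sax.saxutils import escape


def content_md5(data: bytes) -> str:
    """Content-MD5 header value: base64 of the raw 16-byte MD5 digest."""
    return base64.b64encode(hashlib.md5(data).digest()).decode("ascii")


# Assumption: the AWS-style multipart form is 32 hex chars, a dash, and a part
# count. Other S3 stores may use a different form, so a miss proves nothing.
_AWS_MULTIPART = re.compile(r"^[0-9a-f]{32}-(\d+)$")
_PLAIN_MD5 = re.compile(r"^[0-9a-f]{32}$")


def aws_multipart_part_count(etag: str):
    """Part count if the ETag looks like an AWS multipart ETag, else None."""
    match = _AWS_MULTIPART.match(etag.strip('"').lower())
    return int(match.group(1)) if match else None


def etag_matches_md5(etag: str, data: bytes):
    """True/False if the ETag looks like a plain MD5; None if it is opaque."""
    bare = etag.strip('"').lower()
    if not _PLAIN_MD5.match(bare):
        return None  # multipart or store-specific ETag: do not compare
    return bare == hashlib.md5(data).hexdigest()


def complete_upload_body(parts) -> str:
    """Build the CompleteMultipartUpload XML from locally retained
    (part_number, etag) pairs, rather than from a server-side part listing."""
    items = "".join(
        f"<Part><PartNumber>{number}</PartNumber><ETag>{escape(etag)}</ETag></Part>"
        for number, etag in sorted(parts)
    )
    return f"<CompleteMultipartUpload>{items}</CompleteMultipartUpload>"
```

With the two ETags from Edison's test further down, aws_multipart_part_count reports 3 parts for the Amazon S3 value, while etag_matches_md5 returns None for the Riak CS value, meaning "opaque, do not compare". A client that compares every ETag to a local MD5 regardless of its form would presumably report a mismatch of the kind Min saw with s3cmd.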
Thanks,
-John

On Jun 6, 2013, at 9:10 PM, Thomas O'Dowd <tpod...@cloudian.com> wrote:

> Hi guys,
>
> The ETAG is an interesting subject. AWS currently maintains 2 different
> types of ETAGS for objects that I know of.
>
> a) PUT OBJECT - assigned ETAG will be calculated from the MD5 checksum
> of the data content that you are uploading. When uploading you should
> also always set the Content-MD5 header so that AWS (or other S3 Stores)
> can verify your MD5 checksum against what it receives. The ETAG for such
> objects will be the MD5 checksum of the content for AWS but doesn't have
> to be I guess for other S3 stores. What's important is that AWS will
> reject your upload if the MD5 checksum it calculates is not the same as
> your Content-MD5 header.
>
> b) MULTIPART OBJECTS - A multipart object is an object which is
> uploaded using multiple PUT requests, each of which uploads some part. Parts
> can be uploaded out of order and in parallel so AWS cannot calculate the
> MD5 checksum for the entire object without actually waiting until all
> parts have been uploaded and finally reprocessing all the data. This
> would be very heavy for various reasons so they don't do this. The ETAG
> therefore cannot be calculated from the MD5 checksum of the content
> either. I don't know exactly how AWS calculates their ETAG for multipart
> objects but the ETAG will always take the form of XXXXXXXX-YYY where the
> X part looks like a regular MD5 checksum of sorts and the Y part is the
> number of parts that made up the upload. Therefore you can always tell
> that an object was uploaded using a multipart upload by checking its
> ETAG ends with -YYY. This however may be only true for AWS - other S3
> stores may do it differently. You should just treat the etag as opaque
> really.
>
> Some more best practices about multipart uploads.
> 1. Always calculate the MD5 checksum of each part and send the
> Content-MD5 header. This way AWS can verify the content of each part as
> you upload it.
> 2. Always retain the ETAG for each part as returned by the response of
> each part upload. You should have an etag for each part you uploaded.
> 3. Refrain from asking the server for a list of parts in order to create
> the final Multipart Upload complete request. Always use your list of
> parts and your list of ETAGS (from point 2). The exception is when you
> are doing recovery after some client crash.
>
> The main reason for this is that AWS and most other S3 stores are based
> on eventual consistency and the server may not always (but mostly does)
> give you a correct list of parts. The Multipart upload complete request
> allows you to drop parts also so if you ask the server for a list of
> parts and it misses one temporarily, you may end up with an object that
> is missing a part also.
>
> Btw, shameless plug but Cloudian has very good compatibility with AWS
> and has a community edition version that is free for up to 100TB. I'll
> test against it but you may also like to. You can run it on a single
> node with not much fuss. Feel free to ask me about it offline.
>
> Anyway hope that helps,
>
> Tom.
>
> On Thu, 2013-06-06 at 22:57 +0000, Edison Su wrote:
>> The Etag created by both RIAK CS and Amazon S3 seems a little bit different,
>> in case of multi part upload.
>>
>> Here is the result I tested on both RIAK CS and Amazon S3, with s3cmd.
>> Test environment:
>> S3cmd: version: version 1.5.0-alpha1
>> Riak cs:
>> Name : riak
>> Arch : x86_64
>> Version : 1.3.1
>> Release : 1.el6
>> Size : 40 M
>> Repo : installed
>> From repo : basho-products
>>
>> The command I used to put:
>> s3cmd put some-file s3://some-path --multipart-chunk-size-mb=100 -v -d
>>
>> The etag created for the file, when using Riak CS is WxEUkiQzTWm_2C8A92fLQg==
>>
>> DEBUG: Sending request method_string='POST',
>> uri='http://imagestore.s3.amazonaws.com/tmpl/1/1/routing-1/test?uploadId=kfDkh7Q_QCWN7r0ZTqNq4Q==',
>> headers={'content-length': '309', 'Authorization': 'AWS
>> OYAZXCAFUC1DAFOXNJWI:xlkHI9tUfUV/N+Ekqpi7Jz/pbOI=', 'x-amz-date': 'Thu, 06
>> Jun 2013 22:54:28 +0000'}, body=(309 bytes)
>> DEBUG: Response: {'status': 200, 'headers': {'date': 'Thu, 06 Jun 2013
>> 22:40:09 GMT', 'content-length': '326', 'content-type': 'application/xml',
>> 'server': 'Riak CS'}, 'reason': 'OK', 'data': '<?xml version="1.0"
>> encoding="UTF-8"?><CompleteMultipartUploadResult
>> xmlns="http://s3.amazonaws.com/doc/2006-03-01/"><Location>http://imagestore.s3.amazonaws.com/tmpl/1/1/routing-1/test</Location><Bucket>imagestore</Bucket><Key>tmpl/1/1/routing-1/test</Key><ETag>kfDkh7Q_QCWN7r0ZTqNq4Q==</ETag></CompleteMultipartUploadResult>'}
>>
>> While the etag created by Amazon S3 is:
>> "70e1860be687d43c039873adef4280f2-3"
>>
>> DEBUG: Sending request method_string='POST',
>> uri='/fixes/icecake/systdfdfdfemvm.iso1?uploadId=vdkPSAtaA7g.fdfdfdfdf..iaKRNW_8QGz.bXdfdfdfdfdfkFXwUwLzRcG5obVvJFDvnhYUFdT6fYr1rig--',
>>
>> DEBUG: Response: {'status': 200, 'headers': {, 'server': 'AmazonS3',
>> 'transfer-encoding': 'chunked', 'connection': 'Keep-Alive',
>> 'x-amz-request-id': '8DFF5D8025E58E99', 'cache-control': 'proxy-revalidate',
>> 'date': 'Thu, 06 Jun 2013 22:39:47 GMT', 'content-type': 'application/xml'},
>> 'reason': 'OK', 'data': '<?xml version="1.0"
>> encoding="UTF-8"?>\n\n<CompleteMultipartUploadResult
>> xmlns="http://s3.amazonaws.com/doc/2006-03-01/"><Location>http://fdfdfdfdfdfdf</Location>Key>fixes/icecake/systemvm.iso1</Key><ETag>"70e1860be687d43c039873adef4280f2-3"</ETag></CompleteMultipartUploadResult>'}
>>
>> So the etag created on Amazon S3 has "-"(dash) in it, but there is only "_"
>> (underscore) on Riak cs.
>>
>> Do you know the reason? What should we need to do to make it compatible with
>> Amazon S3 SDK?
>>
>>> -----Original Message-----
>>> From: John Burwell [mailto:jburw...@basho.com]
>>> Sent: Thursday, June 06, 2013 2:03 PM
>>> To: dev@cloudstack.apache.org
>>> Subject: Re: Object based Secondary storage.
>>>
>>> Min,
>>>
>>> Are you calculating the MD5 or letting the Amazon client do it?
>>>
>>> Thanks,
>>> -John
>>>
>>> On Jun 6, 2013, at 4:54 PM, Min Chen <min.c...@citrix.com> wrote:
>>>
>>>> Thanks Tom. Indeed I have a S3 question that need some advise from
>>>> some S3 experts. To support upload object > 5G, I have used
>>>> TransferManager.upload to upload object to S3, upload went fine and
>>>> object are successfully put to S3. However, later on when I am using
>>>> "s3cmd get <object key>" to retrieve this object, I always got this
>>>> exception:
>>>>
>>>> "MD5 signatures do not match: computed=Y, received="X"
>>>>
>>>> It seems that Amazon S3 kept a different Md5 sum for the multi-part
>>>> uploaded object. We have been using Riak CS for our S3 testing. If I
>>>> changed to not using multi-part upload and directly invoking S3
>>>> putObject, I will not run into this issue. Do you have such experience
>>>> before?
>>>>
>>>> -min
>>>>
>>>> On 6/6/13 1:56 AM, "Thomas O'Dowd" <tpod...@cloudian.com> wrote:
>>>>
>>>>> Thanks Min. I've printed out the material and am reading new threads.
>>>>> Can't comment much yet until I understand things a bit more.
>>>>>
>>>>> Meanwhile, feel free to hit me up with any S3 questions you have. I'm
>>>>> looking forward to playing with the object_store branch and testing
>>>>> it out.
>>>>>
>>>>> Tom.
>>>>>
>>>>> On Wed, 2013-06-05 at 16:14 +0000, Min Chen wrote:
>>>>>> Welcome Tom. You can check out this FS
>>>>>> https://cwiki.apache.org/confluence/display/CLOUDSTACK/Storage+Backup+Object+Store+Plugin+Framework
>>>>>> for secondary storage architectural work done in object_store
>>>>>> branch. You may also check out the following recent threads
>>>>>> regarding 3 major technical questions raised by community as well
>>>>>> as our answers and clarification.
>>>>>>
>>>>>> http://mail-archives.apache.org/mod_mbox/cloudstack-dev/201306.mbox/%3C77B337AF224FD84CBF8401947098DD87036A76%40SJCPEX01CL01.citrite.net%3E
>>>>>>
>>>>>> http://mail-archives.apache.org/mod_mbox/cloudstack-dev/201306.mbox/%3CCDD22955.3DDDC%25min.chen%40citrix.com%3E
>>>>>>
>>>>>> http://mail-archives.apache.org/mod_mbox/cloudstack-dev/201306.mbox/%3CCDD2300D.3DE0C%25min.chen%40citrix.com%3E
>>>>>>
>>>>>>
>>>>>> That branch is mainly worked on by Edison and me, and we are at PST
>>>>>> timezone.
>>>>>>
>>>>>> Thanks
>>>>>> -min
>>>>> --
>>>>> Cloudian KK - http://www.cloudian.com/get-started.html
>>>>> Fancy 100TB of full featured S3 Storage?
>>>>> Checkout the Cloudian(r) Community Edition!
>>>>>
>>>>
>>
>
> --
> Cloudian KK - http://www.cloudian.com/get-started.html
> Fancy 100TB of full featured S3 Storage?
> Checkout the Cloudian® Community Edition!
>