> On Jul 30, 2019, at 7:49 AM, Mainor Daly <c...@renitent-und-betrunken.de> 
> wrote:
> 
> Hello,
> 
> (everything in context of S3)
> 
> 
> I'm currently trying to better understand bucket sharding in combination with
> a multisite RGW setup and its possible limitations.
> 
> At the moment I understand that a bucket has a bucket index, which is a list 
> of objects within the bucket.
> 
> There are also indexless buckets, but those are not usable for cases like a
> multisite RGW bucket, where you need a [delayed] consistent relation/state
> between bucket n [zone a] and bucket n [zone b].
> 
> Those bucket indexes are stored in "shards", and the shards get distributed
> over the whole zone's cluster for scaling purposes.
> Red Hat recommends a maximum of 102,400 objects per shard and recommends this
> formula to determine the right shard count for a bucket:
> 
> number of objects expected in the bucket / 100,000
> 
> The maximum number of supported shards (or tested limit) is 7877 shards.

Back in 2017 this maximum number of shards changed to 65521. This change is in 
luminous, mimic, and nautilus.
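
To make the arithmetic concrete: a bucket expected to hold 10 million objects
would want roughly 10,000,000 / 100,000 = 100 index shards (rounded up; an odd
or prime count is often suggested). A rough sketch of applying that, assuming a
hypothetical bucket named "mybucket" and the radosgw-admin reshard commands
available since Luminous:

  # queue the bucket for resharding to 101 index shards
  radosgw-admin reshard add --bucket=mybucket --num-shards=101
  # work through the reshard queue (dynamic resharding may also do this on its own)
  radosgw-admin reshard process
  # afterwards, check per-bucket shard counts and how full each shard is
  radosgw-admin bucket limit check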

> That results in a total limit of 787,700,000 objects, as long as you want to
> stay in known and tested waters.
> 
> Now some of the things I did not 100% understand:
> 
> = QUESTION 1 =
> 
> Does each bucket have its own shards? E.g.
> 
> If bucket 1 reaches its shard limit at 7877 shards, can I then create other
> buckets which start with their own fresh sets of shards?
> OR is it the other way around, which would mean all buckets save their index
> in the same shards, and if I reach the shard limit I need to create a
> second cluster?

Correct, each bucket has its own bucket index. And each bucket index can be 
sharded.
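
Each bucket's index shards are separate RADOS objects named after that bucket's
ID, so a new bucket always starts with its own fresh set of shards. A small
sketch, assuming a hypothetical bucket "mybucket" and the a.rgw.buckets.index
pool from your listing further down:

  # the bucket's ID ("id" / "marker") is shown in its stats
  radosgw-admin bucket stats --bucket=mybucket | grep '"id"'
  # its index shards are the .dir.<bucket-id>.<shard#> objects in the index pool
  rados ls -p a.rgw.buckets.index | grep <bucket-id>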

> = QUESTION 2 =
> How are these shards distributed over the cluster? I expect they are just
> objects in the rgw.buckets.index pool, is that correct?
> So, these ones:
> rados ls -p a.rgw.buckets.index 
> .dir.3638e3a4-8dde-42ee-812a-f98e266548a4.274451.1
> .dir.3638e3a4-8dde-42ee-812a-f98e266548a4.87683.1
> .dir.3638e3a4-8dde-42ee-812a-f98e266548a4.64716.1
> .dir.3638e3a4-8dde-42ee-812a-f98e266548a4.78046.2

They are just objects and distributed via the CRUSH algorithm.
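
If you want to see where a given index shard lands, you can ask for its CRUSH
mapping directly, and the index entries themselves are stored as omap keys on
those .dir.* objects. A quick sketch using one of the objects from your listing:

  # which PG and OSDs this index shard object maps to
  ceph osd map a.rgw.buckets.index .dir.3638e3a4-8dde-42ee-812a-f98e266548a4.274451.1
  # the per-object index entries are omap keys on the shard object
  rados -p a.rgw.buckets.index listomapkeys .dir.3638e3a4-8dde-42ee-812a-f98e266548a4.274451.1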

> = QUESTION 3 = 
> 
> 
> Do these bucket index shards have any relation to the RGW sync shards in an
> RGW multisite setup?
> E.g. if I have a ton of bucket index shards or buckets, does it have any
> impact on the sync shards?

They’re separate.

> radosgw-admin sync status
> realm f0019e09-c830-4fe8-a992-435e6f463b7c (mumu_1)
> zonegroup 307a1bb5-4d93-4a01-af21-0d8467b9bdfe (EU_1)
> zone 5a9c4d16-27a6-4721-aeda-b1a539b3d73a (b)
> metadata sync syncing
> full sync: 0/64 shards                    <= these are the shards I mean
> incremental sync: 64/64 shards
> metadata is caught up with master
> data sync source: 3638e3a4-8dde-42ee-812a-f98e266548a4 (a)
> syncing
> full sync: 0/128 shards   <= and these ones
> incremental sync: 128/128 shards <= and these ones
> data is caught up with source
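
For reference, the 64 and 128 in that output are the metadata log and data log
shard counts, which are separate knobs from bucket index sharding. A sketch of
the relevant settings, assuming the defaults as of Luminous/Nautilus (these need
to match across zones, so treat changes with care):

  [client.rgw]
  # metadata sync log shards (the 0/64 above)
  rgw_md_log_max_shards = 64
  # data sync log shards (the 0/128 above)
  rgw_data_log_num_shards = 128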
> 
> 
> = QUESTION 4 = 
> (switching to sync shard related topics)
> 
> 
> What is the exact function and purpose of the sync shards? Do they implement
> any limit? E.g. maybe a maximum number of object entries that wait for
> synchronization to zone b.

They contain logs of items that need to be synced between zones. RGWs will look 
at them and sync objects. These logs are sharded so different RGWs can take on 
different shards and work on syncing in parallel.
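
If you're curious what those log entries look like, you can dump them with
radosgw-admin; the data log entries reference the bucket (index shard) that
changed rather than the object data itself. A rough sketch (flags may vary a
bit by release):

  # entries in the metadata changes log
  radosgw-admin mdlog list
  # entries in the data changes log (what the data sync shards work from)
  radosgw-admin datalog list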

> = QUESTION 5 = 
> Are those sync shards processed in parallel or sequentially? And where are
> those shards stored?

They’re sharded to allow parallelism. At any given moment, each shard is 
claimed by (locked by) one RGW. And each RGW may be claiming multiple shards. 
Collectively, all RGWs are claiming all shards. Each RGW is syncing multiple 
shards in parallel and all RGWs are doing this in parallel. So in some sense 
there are two levels of parallelism.
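
As for where they live: the log shards are plain objects in the zone's log
pool, alongside the per-shard sync status and marker objects the RGWs use to
track their progress and locks. A quick sketch, assuming the default pool
naming ({zone}.rgw.log) for your zone "a":

  # the 128 data log shard objects
  rados ls -p a.rgw.log | grep '^data_log'
  # sync status/marker objects kept in the same pool
  rados ls -p a.rgw.log | grep sync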

> = QUESTION 6 = 
> As far as I have experienced it, the sync process pretty much works like this:
> 
> 1.) The client sends an object or an operation to rados gateway A (RGW A)
> 2.) RGW A logs this operation into one of its sync shards and applies the
> operation to its local storage pool
> 3.) RGW B checks via GET requests at a regular interval whether any new
> entries appear in the RGW A log
> 4.) If a new entry exists, RGW B applies the operation to its local pool
> or pulls the new object from RGW A
> 
> Did I understand that correctly? (For my rough description of this
> functionality, I want to apologize to the developers, who have surely
> invested much time and effort into the design and building of this sync
> process.)

That’s about right.
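
If you want to watch that flow from the pulling side, there is a per-source
view of the sync state and a log of failed entries. A rough sketch, using the
zone names from your output (flags per recent releases):

  # data sync state of the local zone against source zone a, per-shard markers included
  radosgw-admin data sync status --source-zone=a
  # entries that failed to sync and were recorded for retry
  radosgw-admin sync error list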

> And if I understand it correctly, what would the exact strategy look like in
> a multisite setup to resync e.g. a single bucket where one zone got corrupted
> and must be brought back into a synchronous state?

Be aware that there are full syncs and incremental syncs. Full syncs just copy 
every object. Incremental syncs use logs to sync selectively. Perhaps Casey 
will weigh in and discuss the state transitions.
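
For the single-bucket resync case, the approach usually described (check the
docs for your release, as the exact commands have shifted over time) is to
reset that bucket's sync state so it goes through a full sync again, then let
the gateways catch up. A hedged sketch, with "mybucket" as a hypothetical name,
run from the zone that needs repairing:

  # drop the bucket's sync status so it restarts with a full sync from the source zone
  radosgw-admin bucket sync init --bucket=mybucket --source-zone=a
  # then either wait for the RGWs to pick it up, or drive it manually
  radosgw-admin bucket sync run --bucket=mybucket --source-zone=a
  # and watch progress
  radosgw-admin bucket sync status --bucket=mybucket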

> I hope that's the correct place to ask such questions.
> 
> Best Regards,
> Daly


--
J. Eric Ivancich
he/him/his
Red Hat Storage
Ann Arbor, Michigan, USA