On Tue, Dec 22, 2015 at 3:27 PM, Florian Haas <[email protected]> wrote:

> On Tue, Dec 22, 2015 at 3:10 AM, Haomai Wang <[email protected]> wrote:
> >> >> >> Hey everyone,
> >> >> >>
> >> >> >> I recently got my hands on a cluster that has been underperforming
> >> >> >> in terms of radosgw throughput, averaging about 60 PUTs/s with 70K
> >> >> >> objects, where a freshly-installed cluster with a near-identical
> >> >> >> configuration would do about 250 PUTs/s. (Neither of these values
> >> >> >> is what I'd consider high throughput, but this is just to give you
> >> >> >> a feel for the relative performance hit.)
> >> >> >>
> >> >> >> Some digging turned up that of the fewer than 200 buckets in the
> >> >> >> cluster, about 40 held in excess of a million objects (1-4M), with
> >> >> >> one bucket being an outlier at 45M objects. All buckets were
> >> >> >> created post-Hammer and use 64 index shards. The total number of
> >> >> >> objects in radosgw is approx. 160M.
> >> >> >>
> >> >> >> Now, this isn't a large cluster in terms of OSD count; there are
> >> >> >> only 12 OSDs (after all, we're only talking double-digit terabytes
> >> >> >> here). On almost all of these OSDs, the LevelDB omap directory has
> >> >> >> grown to a size of 10-20 GB.
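> >> >> >>
> >> >> >> (For anyone wanting to check their own OSDs: that figure is simply
> >> >> >> the size of the omap directory under the FileStore data dir. A
> >> >> >> sketch, assuming the default OSD data path:
> >> >> >>
> >> >> >>   du -sh /var/lib/ceph/osd/ceph-*/current/omap
> >> >> >>
> >> >> >> Adjust the glob if your data dirs live elsewhere.)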
> >> >> >>
> >> >> >> So I have several questions on this:
> >> >> >>
> >> >> >> - Is it correct to assume that such a large LevelDB would be quite
> >> >> >> detrimental to radosgw performance overall?
> >> >> >>
> >> >> >> - If so, would clearing that one large bucket and distributing the
> >> >> >> data over several new buckets reduce the LevelDB size at all?
> >> >> >>
> >> >> >> - Is there even something akin to "ceph mon compact" for OSDs?
> >> >> >>
> >> >> >> - Are these large LevelDB databases a simple consequence of having
> >> >> >> a combination of many radosgw objects and few OSDs, with the
> >> >> >> per-bucket distribution being comparatively irrelevant?
> >> >> >>
> >> >> >> I do understand that the 45M-object bucket itself would have been
> >> >> >> a problem pre-Hammer, with no index sharding available. But based
> >> >> >> on what others have shared here, a rule of thumb of one index
> >> >> >> shard per million objects should be a good one to follow, so 64
> >> >> >> shards for 45M objects doesn't strike me as totally off the mark.
> >> >> >> That's why I think LevelDB I/O is actually the issue here. But I
> >> >> >> might be totally wrong; all insights appreciated. :)
> >> >> >
> >> >> >
> >> >> > Did you enable bucket index sharding?
> >> >>
> >> >> As stated above, yes. 64 shards.
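> >> >>
> >> >> (For reference: in Hammer the shard count is fixed at bucket
> >> >> creation time, typically set via the
> >> >> rgw_override_bucket_index_max_shards option. A ceph.conf sketch,
> >> >> with a placeholder rgw section name:
> >> >>
> >> >>   [client.radosgw.gateway]
> >> >>   rgw override bucket index max shards = 64
> >> >>
> >> >> Note this only affects buckets created after it is set.)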
> >> >>
> >> >> > I'm not sure what the bottleneck is in your cluster, but I guess
> >> >> > you could disable LevelDB compression to test whether that reduces
> >> >> > the impact of compaction.
> >> >>
> >> >> Hmmm, you mean with "leveldb_compression = false"? Could you explain
> >> >> why exactly *disabling* compression would help with large omaps?
> >> >>
> >> >> Also, would "osd_compact_leveldb_on_mount" (undocumented) help here?
> >> >> It looks to me like that is an option with no actual implementing
> >> >> code, but I may be missing something.
> >> >>
> >> >> The similarly named leveldb_compact_on_mount seems to only compact
> >> >> LevelDB data in LevelDBStore. But I may be mistaken there too, as
> >> >> that option also seems to be undocumented. Would configuring an OSD
> >> >> with leveldb_compact_on_mount=true do omap compaction on OSD daemon
> >> >> startup, in a FileStore OSD?
> >> >
> >> >
> >> > I don't have enough information to be sure this is the problem in
> >> > your case, but I have run into it before: LevelDB has a single
> >> > compaction thread, which can spend a lot of time on
> >> > compression/decompression during compaction.
> >> >
> >> > What's your Ceph version? I guess "leveldb_compression" or
> >> > "osd_leveldb_compression" can help.
> >>
> >> This is on Hammer.
> >>
> >> Could you please clarify the semantics of leveldb_compact_on_mount and
> >> leveldb_compression for OSDs though? Like I said, it looks like
> >> neither of those options is documented anywhere.
> >
> >
> > "leveldb_compact_on_mount": when osd boot, it will try to manually call
> > compact, this produce may consume lots of time while booting
> > "leveldb_compression": it's a option pass to leveldb internal, leveldb
> will
> > compress each freeze L1+ block, so when iterate leveldb or compaction
> lots
> > of blocks need to be compressed and uncompressed
>
> Okay, thank you. So to summarize,
>
> - leveldb_compact_on_mount does compaction on boot, which may consume
> a lot of time for a 20GB omap, and is off by default.
>
> - leveldb_compression does compression on every write and
> decompression on every read (which may be slow if the omap is large
> and needs to be iterated), and is on by default.
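>
> (Side note for anyone following along: the values an OSD is actually
> running with can be checked via the admin socket, e.g.
>
>   ceph daemon osd.0 config show | grep leveldb
>
> with osd.0 being an example daemon ID.)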
>
> Now that raises a few more questions (sorry for persisting here — I
> really want to get to the bottom of this):
>
> - If an omap is already 20G in size, how much larger will it get with
> compression disabled?
>
> - How exactly would slow omap *iteration* also significantly slow down
> radosgw object *creation* (as evident from rest-bench), where there
> really shouldn't be any iteration involved? Or does radosgw have to
> enumerate *all* LevelDB entries associated with the bucket index
> object(s) for some reason, before it can update the index?
>
> - Is your suggestion, given the scenario I've described here, to
> enable leveldb_compact_on_mount and disable leveldb_compression? (I
> believe it is, just making sure; see the ceph.conf sketch below.)
>
> - If large, compressed, uncompacted omap directories cause radosgw to
> slow down significantly, wouldn't it be a better idea to reverse the
> defaults (meaning to enable compaction and disable compression)?
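>
> (In ceph.conf terms, I take the suggestion to mean something like the
> following -- my own sketch, not anything you've confirmed:
>
>   [osd]
>   leveldb_compact_on_mount = true
>   leveldb_compression = false
>
> or the osd_leveldb_* variants mentioned above, whichever a given build
> honors, followed by an OSD restart so that both options take effect.)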
>
>
Sorry, I can't answer these questions. As I mentioned before, I haven't
dug into this problem deeply. And in my case the omap directory was also
very large, with nearly 20 million objects in one bucket. Hope others can
help.


> Cheers,
> Florian
>



-- 

Best Regards,

Wheat
