On 1/18/2019 6:33 PM, KEVIN MICHAEL HRPCEK wrote:
On 1/18/19 7:26 AM, Igor Fedotov wrote:
Hi Kevin,
On 1/17/2019 10:50 PM, KEVIN MICHAEL HRPCEK wrote:
Hey,
I recall reading about this somewhere but I can't find it in the
docs or list archive and confirmation from a dev or someone who
knows for sure would be nice. What I recall is that bluestore has a
max 4GB file size limit based on the design of bluestore not the
osd_max_object_size setting. The bluestore source seems to suggest
that by setting the OBJECT_MAX_SIZE to a 32bit max, giving an error
if osd_max_object_size is > OBJECT_MAX_SIZE, and not writing the
data if offset+length >= OBJECT_MAX_SIZE. So it seems like the in-OSD
object size integer can't exceed 32 bits, which is 4GB, like FAT32. Am
I correct, or am I reading all this wrong?
You're correct, BlueStore doesn't support objects larger than
OBJECT_MAX_SIZE (i.e. 4GB).
Thanks for confirming that!
If bluestore has a hard 4GB object limit using radosstriper to break
up an object would work, but does using an EC pool that breaks up
the object to shards smaller than OBJECT_MAX_SIZE have the same
effect as radosstriper to get around a 4GB limit? We use rados
directly and would like to move to bluestore but we have some large
objects <= 13G that may need attention if this 4GB limit does exist
and an ec pool doesn't get around it.
Theoretically, splitting the object using EC might help. But I'm not sure
whether one needs to raise osd_max_object_size above 4GB to
permit 13GB objects in an EC pool. If that's needed, then the
osd_max_object_size <= OBJECT_MAX_SIZE constraint is violated and
BlueStore won't start.
In my experience I had to increase osd_max_object_size from the 128M
default it changed to a couple versions ago to ~20G to be able to
write our largest objects with some margin. Do you think there is
another way to handle osd_max_object_size > OBJECT_MAX_SIZE so that
bluestore will start and EC pools or striping can be used to write
objects that are greater than OBJECT_MAX_SIZE but each stripe/shard
ends up smaller than OBJECT_MAX_SIZE after striping or being in an ec
pool?
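
To make the shard-size question concrete, here is a rough sketch of the arithmetic. It assumes a k+m EC pool splits an object evenly across k data chunks and ignores stripe-unit padding; the names kObjectMaxSize, shard_size and shard_fits are made up for illustration and are not Ceph APIs. It only addresses the per-shard size, not the separate osd_max_object_size startup check discussed in this thread.

```cpp
#include <cstdint>

// BlueStore's per-object limit as quoted in this thread (32-bit max).
constexpr std::uint64_t kObjectMaxSize = 0xffffffffULL;

// Approximate bytes stored per OSD for one object in a k+m EC pool.
// Assumption: data is divided evenly over k data chunks (ceiling
// division); stripe-unit padding is ignored.
constexpr std::uint64_t shard_size(std::uint64_t object_size, unsigned k) {
  return (object_size + k - 1) / k;
}

// True if a whole-object write of that shard would pass a
// "offset + length >= limit" rejection with offset 0.
constexpr bool shard_fits(std::uint64_t object_size, unsigned k) {
  return shard_size(object_size, k) < kObjectMaxSize;
}

// A 13 GiB object: k=3 gives shards of about 4.33 GiB (too large),
// k=4 gives 3.25 GiB shards (under the limit).
static_assert(!shard_fits(13ULL << 30, 3), "k=3 shard exceeds the limit");
static_assert(shard_fits(13ULL << 30, 4), "k=4 shard is under the limit");
```

Under these assumptions a 13G object would need at least k=4 for each shard to stay below OBJECT_MAX_SIZE.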
I'm not very familiar with the logic osd_max_object_size controls at the OSD
level. But IMO there might be two logically valid options:
1) This is the maximum user (RADOS?) object size. In this case the verification
in BlueStore is a bit incorrect, as EC might be in the path and hence one
can still have a 4+ GB object stored. If that's the case then it's
enough to remove the corresponding assertion in BlueStore.
2) This is the maximum object size provided to the object store. Then one should
be able to upload an object larger than this threshold using EC.
I'm going to verify this behavior and come up with corresponding fixes
if any shortly.
Unfortunately, in the short term I don't see any workaround for your case
other than a custom build that has the assertion in BlueStore removed.
https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L88
#define OBJECT_MAX_SIZE 0xffffffff // 32 bits
https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L4395
  // sanity check(s)
  auto osd_max_object_size =
    cct->_conf.get_val<Option::size_t>("osd_max_object_size");
  if (osd_max_object_size >= (size_t)OBJECT_MAX_SIZE) {
    derr << __func__ << " osd_max_object_size >= 0x" << std::hex
         << OBJECT_MAX_SIZE
         << "; BlueStore has hard limit of 0x" << OBJECT_MAX_SIZE << "."
         << std::dec << dendl;
    return -EINVAL;
  }
https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L12331
  if (offset + length >= OBJECT_MAX_SIZE) {
    r = -E2BIG;
  } else {
    _assign_nid(txc, o);
    r = _do_write(txc, c, o, offset, length, bl, fadvise_flags);
    txc->write_onode(o);
  }
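
For reference, the two checks quoted above can be reduced to a small standalone sketch. This is not the BlueStore code itself: kBlueStoreObjectMax, check_config and check_write are hypothetical names, and 64-bit offset and length are assumed. It only shows why a ~20G osd_max_object_size fails at startup and why a single write reaching past the 32-bit boundary is rejected.

```cpp
#include <cerrno>
#include <cstdint>

// Same value as the OBJECT_MAX_SIZE define quoted above.
constexpr std::uint64_t kBlueStoreObjectMax = 0xffffffffULL;

// Startup-time check: a configured osd_max_object_size at or above the
// limit is refused with -EINVAL, so the store does not come up.
constexpr int check_config(std::uint64_t osd_max_object_size) {
  return osd_max_object_size >= kBlueStoreObjectMax ? -EINVAL : 0;
}

// Write-path check: a write whose end offset reaches the limit is
// refused with -E2BIG before any data is written.
constexpr int check_write(std::uint64_t offset, std::uint64_t length) {
  return offset + length >= kBlueStoreObjectMax ? -E2BIG : 0;
}

// The 128M default passes; a ~20G setting does not.
static_assert(check_config(128ULL << 20) == 0, "128M is accepted");
static_assert(check_config(20ULL << 30) == -EINVAL, "20G is rejected");
```

Note that both comparisons use >=, so under this reading the largest accepted end offset is OBJECT_MAX_SIZE - 1.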
Thanks!
Kevin
--
Kevin Hrpcek
NASA SNPP Atmosphere SIPS
Space Science & Engineering Center
University of Wisconsin-Madison
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Thanks,
Igor