Here is a quick update. I found that a CephFS client process was still accessing the big 1.3TB file and, I believe, holding a lock on it, which prevented its objects from being flushed to the underlying data pool. Once I killed that process, objects started to flush to the data pool automatically (with target_max_bytes and target_max_objects set); I can also force flushing with 'rados -p cephfs_cache cache-flush-evict-all'. So David appears to be right that "it can only hold full files and not flush partial files". This will be a problem whenever we want to transfer a file that is bigger than the cache pool!
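In case it helps anyone searching the archives later, this is the sequence that got data moving for me, using our pool name and the limits I had already set (quoted further down in the thread). The cache_target_full_ratio line is only a sketch of Christian's suggestion to leave another ~10% of headroom; I haven't settled on a final value yet:

# ceph osd pool set cephfs_cache target_max_bytes 1099511627776
# ceph osd pool set cephfs_cache target_max_objects 1000000
# ceph osd pool set cephfs_cache cache_target_full_ratio 0.7
# rados -p cephfs_cache cache-flush-evict-all

The last command only ran to completion once the client process holding the file had been killed.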
We did this whole scheme (EC data pool plus NVMe cache tier) just for experimentation, and I've learned a lot from the experiment and from you all. Thank you very much! For production, I think I'll simply use a replicated data pool on the HDDs (with the bluestore WAL and DB on the 1st NVMe) and a replicated metadata pool on the 2nd NVMe; a rough sketch of that layout is appended at the end of this message. Please let me know if you have any further advice or suggestions.

Best,
Shaw

On Fri, Oct 6, 2017 at 10:07 AM, David Turner <drakonst...@gmail.com> wrote:

> All of this data is test data, yeah? I would start by removing the
> cache-tier and pool, recreate it and attach it, configure all of the
> settings including the maximums, and start testing things again. I would
> avoid doing the 1.3TB file test until after you've confirmed that the
> smaller files are being flushed appropriately to the data pool (manually
> flushing/evicting it) and then scale up your testing to the larger files.
>
> On Fri, Oct 6, 2017 at 12:54 PM Shawfeng Dong <s...@ucsc.edu> wrote:
>
>> Curiously, it has been quite a while, but there is still no object in the
>> underlying data pool:
>> # rados -p cephfs_data ls
>>
>> Any advice?
>>
>> On Fri, Oct 6, 2017 at 9:45 AM, David Turner <drakonst...@gmail.com> wrote:
>>
>>> Notice in the URL for the documentation the use of "luminous". When you
>>> looked a few weeks ago, you might have been looking at the documentation
>>> for a different version of Ceph. You can change that to jewel, hammer,
>>> kraken, master, etc. depending on which version of Ceph you are running
>>> or reading about. Google gets confused and will pull up random versions
>>> of the Ceph documentation for a page. It's on us to make sure that the
>>> URL is pointing to the version of Ceph that we are using.
>>>
>>> While it's sitting there in the flush command, can you see if there are
>>> any objects in the underlying data pool? Hopefully the count will be
>>> growing.
>>>
>>> On Fri, Oct 6, 2017 at 12:39 PM Shawfeng Dong <s...@ucsc.edu> wrote:
>>>
>>>> Hi Christian,
>>>>
>>>> I set those via the CLI:
>>>> # ceph osd pool set cephfs_cache target_max_bytes 1099511627776
>>>> # ceph osd pool set cephfs_cache target_max_objects 1000000
>>>>
>>>> but manual flushing doesn't appear to work:
>>>> # rados -p cephfs_cache cache-flush-evict-all
>>>>         1000000046a.00000ca6
>>>>
>>>> It just gets stuck there for a long time.
>>>>
>>>> Any suggestion? Do I need to restart the daemons or reboot the nodes?
>>>>
>>>> Thanks,
>>>> Shaw
>>>>
>>>> On Fri, Oct 6, 2017 at 9:31 AM, Christian Balzer <ch...@gol.com> wrote:
>>>>
>>>>> On Fri, 6 Oct 2017 09:14:40 -0700 Shawfeng Dong wrote:
>>>>>
>>>>> > I found the command: rados -p cephfs_cache cache-flush-evict-all
>>>>> >
>>>>> That's not what you want/need.
>>>>> Though it will fix your current "full" issue.
>>>>>
>>>>> > The documentation (http://docs.ceph.com/docs/luminous/rados/operations/cache-tiering/)
>>>>> > has been improved a lot since I last checked it a few weeks ago!
>>>>> >
>>>>> The need to set max_bytes and max_objects has been documented for ages
>>>>> (since Hammer).
>>>>>
>>>>> more below...
>>>>>
>>>>> > -Shaw
>>>>> >
>>>>> > On Fri, Oct 6, 2017 at 9:10 AM, Shawfeng Dong <s...@ucsc.edu> wrote:
>>>>> >
>>>>> > > Thanks, Luis.
>>>>> > >
>>>>> > > I've just set max_bytes and max_objects:
>>>>> How?
>>>>> Editing the conf file won't help until a restart.
>>>>>
>>>>> > > target_max_objects: 1000000 (1M)
>>>>> > > target_max_bytes: 1099511627776 (1TB)
>>>>> >
>>>>> I'd lower that or the cache_target_full_ratio by another 10%.
>>>>>
>>>>> Christian
>>>>>
>>>>> > > but nothing appears to be happening. Is there a way to force flushing?
>>>>> > >
>>>>> > > Thanks,
>>>>> > > Shaw
>>>>> > >
>>>>> > > On Fri, Oct 6, 2017 at 8:55 AM, Luis Periquito <periqu...@gmail.com> wrote:
>>>>> > >
>>>>> > >> Not looking at anything else, you didn't set the max_bytes or
>>>>> > >> max_objects for it to start flushing...
>>>>> > >>
>>>>> > >> On Fri, Oct 6, 2017 at 4:49 PM, Shawfeng Dong <s...@ucsc.edu> wrote:
>>>>> > >> > Dear all,
>>>>> > >> >
>>>>> > >> > Thanks a lot for the very insightful comments/suggestions!
>>>>> > >> >
>>>>> > >> > There are 3 OSD servers in our pilot Ceph cluster, each with 2x 1TB SSDs
>>>>> > >> > (boot disks), 12x 8TB SATA HDDs and 2x 1.2TB NVMe SSDs. We use the
>>>>> > >> > bluestore backend, with the first NVMe as the WAL and DB devices for
>>>>> > >> > OSDs on the HDDs. And we try to create a cache tier out of the second
>>>>> > >> > NVMes.
>>>>> > >> >
>>>>> > >> > Here are the outputs of the commands suggested by David:
>>>>> > >> >
>>>>> > >> > 1) # ceph df
>>>>> > >> > GLOBAL:
>>>>> > >> >     SIZE     AVAIL     RAW USED     %RAW USED
>>>>> > >> >     265T     262T     2847G        1.05
>>>>> > >> > POOLS:
>>>>> > >> >     NAME                ID     USED      %USED      MAX AVAIL     OBJECTS
>>>>> > >> >     cephfs_data         1      0         0          248T          0
>>>>> > >> >     cephfs_metadata     2      8515k     0          248T          24
>>>>> > >> >     cephfs_cache        3      1381G     100.00     0             355385
>>>>> > >> >
>>>>> > >> > 2) # ceph osd df
>>>>> > >> >  0   hdd 7.27829  1.00000  7452G  2076M  7450G  0.03  0.03 174
>>>>> > >> >  1   hdd 7.27829  1.00000  7452G  2072M  7450G  0.03  0.03 169
>>>>> > >> >  2   hdd 7.27829  1.00000  7452G  2072M  7450G  0.03  0.03 173
>>>>> > >> >  3   hdd 7.27829  1.00000  7452G  2072M  7450G  0.03  0.03 159
>>>>> > >> >  4   hdd 7.27829  1.00000  7452G  2072M  7450G  0.03  0.03 173
>>>>> > >> >  5   hdd 7.27829  1.00000  7452G  2072M  7450G  0.03  0.03 162
>>>>> > >> >  6   hdd 7.27829  1.00000  7452G  2072M  7450G  0.03  0.03 149
>>>>> > >> >  7   hdd 7.27829  1.00000  7452G  2072M  7450G  0.03  0.03 179
>>>>> > >> >  8   hdd 7.27829  1.00000  7452G  2076M  7450G  0.03  0.03 163
>>>>> > >> >  9   hdd 7.27829  1.00000  7452G  2072M  7450G  0.03  0.03 194
>>>>> > >> > 10   hdd 7.27829  1.00000  7452G  2072M  7450G  0.03  0.03 185
>>>>> > >> > 11   hdd 7.27829  1.00000  7452G  2072M  7450G  0.03  0.03 168
>>>>> > >> > 36  nvme 1.09149  1.00000  1117G   855G   262G 76.53 73.01  79
>>>>> > >> > 12   hdd 7.27829  1.00000  7452G  2072M  7450G  0.03  0.03 180
>>>>> > >> > 13   hdd 7.27829  1.00000  7452G  2072M  7450G  0.03  0.03 168
>>>>> > >> > 14   hdd 7.27829  1.00000  7452G  2072M  7450G  0.03  0.03 178
>>>>> > >> > 15   hdd 7.27829  1.00000  7452G  2072M  7450G  0.03  0.03 170
>>>>> > >> > 16   hdd 7.27829  1.00000  7452G  2072M  7450G  0.03  0.03 149
>>>>> > >> > 17   hdd 7.27829  1.00000  7452G  2072M  7450G  0.03  0.03 203
>>>>> > >> > 18   hdd 7.27829  1.00000  7452G  2072M  7450G  0.03  0.03 173
>>>>> > >> > 19   hdd 7.27829  1.00000  7452G  2076M  7450G  0.03  0.03 158
>>>>> > >> > 20   hdd 7.27829  1.00000  7452G  2072M  7450G  0.03  0.03 154
>>>>> > >> > 21   hdd 7.27829  1.00000  7452G  2072M  7450G  0.03  0.03 160
>>>>> > >> > 22   hdd 7.27829  1.00000  7452G  2072M  7450G  0.03  0.03 167
>>>>> > >> > 23   hdd 7.27829  1.00000  7452G  2076M  7450G  0.03  0.03 188
>>>>> > >> > 37  nvme 1.09149  1.00000  1117G  1061G 57214M 95.00 90.63  98
>>>>> > >> > 24   hdd 7.27829  1.00000  7452G  2072M  7450G  0.03  0.03 187
>>>>> > >> > 25   hdd 7.27829  1.00000  7452G  2072M  7450G  0.03  0.03 200
>>>>> > >> > 26   hdd 7.27829  1.00000  7452G  2072M  7450G  0.03  0.03 147
>>>>> > >> > 27   hdd 7.27829  1.00000  7452G  2072M  7450G  0.03  0.03 171
>>>>> > >> > 28   hdd 7.27829  1.00000  7452G  2072M  7450G  0.03  0.03 162
>>>>> > >> > 29   hdd 7.27829  1.00000  7452G  2072M  7450G  0.03  0.03 152
>>>>> > >> > 30   hdd 7.27829  1.00000  7452G  2072M  7450G  0.03  0.03 174
>>>>> > >> > 31   hdd 7.27829  1.00000  7452G  2072M  7450G  0.03  0.03 176
>>>>> > >> > 32   hdd 7.27829  1.00000  7452G  2072M  7450G  0.03  0.03 182
>>>>> > >> > 33   hdd 7.27829  1.00000  7452G  2072M  7450G  0.03  0.03 155
>>>>> > >> > 34   hdd 7.27829  1.00000  7452G  2076M  7450G  0.03  0.03 166
>>>>> > >> > 35   hdd 7.27829  1.00000  7452G  2076M  7450G  0.03  0.03 176
>>>>> > >> > 38  nvme 1.09149  1.00000  1117G   857G   260G 76.71 73.18  79
>>>>> > >> >              TOTAL   265T  2847G   262T  1.05
>>>>> > >> > MIN/MAX VAR: 0.03/90.63  STDDEV: 22.81
>>>>> > >> >
>>>>> > >> > 3) # ceph osd tree
>>>>> > >> >  -1       265.29291 root default
>>>>> > >> >  -3        88.43097     host pulpo-osd01
>>>>> > >> >   0   hdd   7.27829         osd.0      up  1.00000 1.00000
>>>>> > >> >   1   hdd   7.27829         osd.1      up  1.00000 1.00000
>>>>> > >> >   2   hdd   7.27829         osd.2      up  1.00000 1.00000
>>>>> > >> >   3   hdd   7.27829         osd.3      up  1.00000 1.00000
>>>>> > >> >   4   hdd   7.27829         osd.4      up  1.00000 1.00000
>>>>> > >> >   5   hdd   7.27829         osd.5      up  1.00000 1.00000
>>>>> > >> >   6   hdd   7.27829         osd.6      up  1.00000 1.00000
>>>>> > >> >   7   hdd   7.27829         osd.7      up  1.00000 1.00000
>>>>> > >> >   8   hdd   7.27829         osd.8      up  1.00000 1.00000
>>>>> > >> >   9   hdd   7.27829         osd.9      up  1.00000 1.00000
>>>>> > >> >  10   hdd   7.27829         osd.10     up  1.00000 1.00000
>>>>> > >> >  11   hdd   7.27829         osd.11     up  1.00000 1.00000
>>>>> > >> >  36  nvme   1.09149         osd.36     up  1.00000 1.00000
>>>>> > >> >  -5        88.43097     host pulpo-osd02
>>>>> > >> >  12   hdd   7.27829         osd.12     up  1.00000 1.00000
>>>>> > >> >  13   hdd   7.27829         osd.13     up  1.00000 1.00000
>>>>> > >> >  14   hdd   7.27829         osd.14     up  1.00000 1.00000
>>>>> > >> >  15   hdd   7.27829         osd.15     up  1.00000 1.00000
>>>>> > >> >  16   hdd   7.27829         osd.16     up  1.00000 1.00000
>>>>> > >> >  17   hdd   7.27829         osd.17     up  1.00000 1.00000
>>>>> > >> >  18   hdd   7.27829         osd.18     up  1.00000 1.00000
>>>>> > >> >  19   hdd   7.27829         osd.19     up  1.00000 1.00000
>>>>> > >> >  20   hdd   7.27829         osd.20     up  1.00000 1.00000
>>>>> > >> >  21   hdd   7.27829         osd.21     up  1.00000 1.00000
>>>>> > >> >  22   hdd   7.27829         osd.22     up  1.00000 1.00000
>>>>> > >> >  23   hdd   7.27829         osd.23     up  1.00000 1.00000
>>>>> > >> >  37  nvme   1.09149         osd.37     up  1.00000 1.00000
>>>>> > >> >  -7        88.43097     host pulpo-osd03
>>>>> > >> >  24   hdd   7.27829         osd.24     up  1.00000 1.00000
>>>>> > >> >  25   hdd   7.27829         osd.25     up  1.00000 1.00000
>>>>> > >> >  26   hdd   7.27829         osd.26     up  1.00000 1.00000
>>>>> > >> >  27   hdd   7.27829         osd.27     up  1.00000 1.00000
>>>>> > >> >  28   hdd   7.27829         osd.28     up  1.00000 1.00000
>>>>> > >> >  29   hdd   7.27829         osd.29     up  1.00000 1.00000
>>>>> > >> >  30   hdd   7.27829         osd.30     up  1.00000 1.00000
>>>>> > >> >  31   hdd   7.27829         osd.31     up  1.00000 1.00000
>>>>> > >> >  32   hdd   7.27829         osd.32     up  1.00000 1.00000
>>>>> > >> >  33   hdd   7.27829         osd.33     up  1.00000 1.00000
>>>>> > >> >  34   hdd   7.27829         osd.34     up  1.00000 1.00000
>>>>> > >> >  35   hdd   7.27829         osd.35     up  1.00000 1.00000
>>>>> > >> >  38  nvme   1.09149         osd.38     up  1.00000 1.00000
>>>>> > >> >
>>>>> > >> > 4) # ceph osd pool get cephfs_cache all
>>>>> > >> > min_size: 2
>>>>> > >> > crash_replay_interval: 0
>>>>> > >> > pg_num: 128
>>>>> > >> > pgp_num: 128
>>>>> > >> > crush_rule: pulpo_nvme
>>>>> > >> > hashpspool: true
>>>>> > >> > nodelete: false
>>>>> > >> > nopgchange: false
>>>>> > >> > nosizechange: false
>>>>> > >> > write_fadvise_dontneed: false
>>>>> > >> > noscrub: false
>>>>> > >> > nodeep-scrub: false
>>>>> > >> > hit_set_type: bloom
>>>>> > >> > hit_set_period: 14400
>>>>> > >> > hit_set_count: 12
>>>>> > >> > hit_set_fpp: 0.05
>>>>> > >> > use_gmt_hitset: 1
>>>>> > >> > auid: 0
>>>>> > >> > target_max_objects: 0
>>>>> > >> > target_max_bytes: 0
>>>>> > >> > cache_target_dirty_ratio: 0.4
>>>>> > >> > cache_target_dirty_high_ratio: 0.6
>>>>> > >> > cache_target_full_ratio: 0.8
>>>>> > >> > cache_min_flush_age: 0
>>>>> > >> > cache_min_evict_age: 0
>>>>> > >> > min_read_recency_for_promote: 0
>>>>> > >> > min_write_recency_for_promote: 0
>>>>> > >> > fast_read: 0
>>>>> > >> > hit_set_grade_decay_rate: 0
>>>>> > >> > crash_replay_interval: 0
>>>>> > >> >
>>>>> > >> > Do you see anything wrong? We had written some small files to the CephFS
>>>>> > >> > before we tried to write the big 1TB file. What is puzzling to me is that
>>>>> > >> > no data has been written back to the data pool.
>>>>> > >> >
>>>>> > >> > Best,
>>>>> > >> > Shaw
>>>>> > >> >
>>>>> > >> > On Fri, Oct 6, 2017 at 6:46 AM, David Turner <drakonst...@gmail.com> wrote:
>>>>> > >> >>
>>>>> > >> >> On Fri, Oct 6, 2017, 1:05 AM Christian Balzer <ch...@gol.com> wrote:
>>>>> > >> >>>
>>>>> > >> >>> Hello,
>>>>> > >> >>>
>>>>> > >> >>> On Fri, 06 Oct 2017 03:30:41 +0000 David Turner wrote:
>>>>> > >> >>>
>>>>> > >> >>> > You're missing most all of the important bits. What the osds in your
>>>>> > >> >>> > cluster look like, your tree, and your cache pool settings.
>>>>> > >> >>> >
>>>>> > >> >>> > ceph df
>>>>> > >> >>> > ceph osd df
>>>>> > >> >>> > ceph osd tree
>>>>> > >> >>> > ceph osd pool get cephfs_cache all
>>>>> > >> >>> >
>>>>> > >> >>> Especially the last one.
>>>>> > >> >>>
>>>>> > >> >>> My money is on not having set target_max_objects and target_max_bytes
>>>>> > >> >>> to sensible values along with the ratios.
>>>>> > >> >>> In short, not having read the (albeit spotty) documentation.
>>>>> > >> >>>
>>>>> > >> >>> > You have your writeback cache on 3 nvme drives. It looks like you have
>>>>> > >> >>> > 1.6TB available between them for the cache. I don't know the behavior
>>>>> > >> >>> > of a writeback cache tier on cephfs for large files, but I would guess
>>>>> > >> >>> > that it can only hold full files and not flush partial files.
>>>>> > >> >>>
>>>>> > >> >>> I VERY much doubt that, if so it would be a massive flaw.
>>>>> > >> >>> One assumes that cache operations work on the RADOS object level, no
>>>>> > >> >>> matter what.
>>>>> > >> >>
>>>>> > >> >> I hope that it is on the rados level, but not a single object had been
>>>>> > >> >> flushed to the backing pool. So I hazarded a guess. Seeing his settings
>>>>> > >> >> will shed more light.
>>>>> > >> >>>
>>>>> > >> >>> > That would mean your cache needs to have enough space for any file
>>>>> > >> >>> > being written to the cluster. In this case a 1.3TB file with 3x
>>>>> > >> >>> > replication would require 3.9TB (more than double what you have
>>>>> > >> >>> > available) of available space in your writeback cache.
>>>>> > >> >>> >
>>>>> > >> >>> > There are very few use cases that benefit from a cache tier. The docs
>>>>> > >> >>> > for Luminous warn as much.
>>>>> > >> >>> You keep repeating that like a broken record.
>>>>> > >> >>>
>>>>> > >> >>> And while certainly not false, I for one wouldn't be able to use (justify
>>>>> > >> >>> using) Ceph w/o cache tiers in our main use case.
>>>>> > >> >>>
>>>>> > >> >>> In this case I assume they were following an old cheat sheet or such,
>>>>> > >> >>> suggesting the previously required cache tier with EC pools.
>>>>> > >> >>
>>>>> > >> >> http://docs.ceph.com/docs/luminous/rados/operations/cache-tiering/
>>>>> > >> >>
>>>>> > >> >> I know I keep repeating it, especially recently as there have been a lot
>>>>> > >> >> of people asking about it. The Luminous docs added a large section about
>>>>> > >> >> how it is probably not what you want. Like me, it is not saying that there
>>>>> > >> >> are no use cases for it. There was no information provided about the use
>>>>> > >> >> case and I made some suggestions/guesses. I'm also guessing that they are
>>>>> > >> >> following a guide where a writeback cache was necessary for CephFS to use
>>>>> > >> >> EC prior to Luminous. I also usually add that people should test it out
>>>>> > >> >> and find what works best for them. I will always defer to your practical
>>>>> > >> >> use of cache tiers as well, especially when using rbds.
>>>>> > >> >>
>>>>> > >> >> I manage a cluster where I intend to continue running a writeback cache in
>>>>> > >> >> front of CephFS on the same drives as the EC pool. The use case receives a
>>>>> > >> >> good enough benefit from the cache tier that it isn't even required to use
>>>>> > >> >> flash media to see it. It is used for video editing and the files are
>>>>> > >> >> usually modified and read within the first 24 hours and then left in cold
>>>>> > >> >> storage until deleted. I have the cache timed to keep everything in it for
>>>>> > >> >> 24 hours and then evict it, by using a minimum time to flush and evict of
>>>>> > >> >> 24 hours and a target max bytes of 0. All files are in there for that time
>>>>> > >> >> and then it never has to decide what to keep, as it doesn't keep anything
>>>>> > >> >> longer than that. Luckily read performance from cold storage is not a
>>>>> > >> >> requirement of this cluster, as any read operation has to first read it
>>>>> > >> >> from EC storage, write it to replica storage, and then read it from
>>>>> > >> >> replica storage... Yuck.
>>>>> > >> >>>
>>>>> > >> >>> Christian
>>>>> > >> >>>
>>>>> > >> >>> > What is your goal by implementing this cache? If the answer is to
>>>>> > >> >>> > utilize extra space on the nvmes, then just remove it and say thank
>>>>> > >> >>> > you. The better use of nvmes in that case is as a part of the bluestore
>>>>> > >> >>> > stack, to give your osds larger DB partitions. Keeping your metadata
>>>>> > >> >>> > pool on nvmes is still a good idea.
>>>>> > >> >>> >
>>>>> > >> >>> > On Thu, Oct 5, 2017, 7:45 PM Shawfeng Dong <s...@ucsc.edu> wrote:
>>>>> > >> >>> >
>>>>> > >> >>> > > Dear all,
>>>>> > >> >>> > >
>>>>> > >> >>> > > We just set up a Ceph cluster, running the latest stable release
>>>>> > >> >>> > > Ceph v12.2.0 (Luminous):
>>>>> > >> >>> > > # ceph --version
>>>>> > >> >>> > > ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc)
>>>>> > >> >>> > >
>>>>> > >> >>> > > The goal is to serve the Ceph filesystem, for which we created 3 pools:
>>>>> > >> >>> > > # ceph osd lspools
>>>>> > >> >>> > > 1 cephfs_data,2 cephfs_metadata,3 cephfs_cache,
>>>>> > >> >>> > > where
>>>>> > >> >>> > > * cephfs_data is the data pool (36 OSDs on HDDs), which is erasure-coded;
>>>>> > >> >>> > > * cephfs_metadata is the metadata pool;
>>>>> > >> >>> > > * cephfs_cache is the cache tier (3 OSDs on NVMes) for cephfs_data.
>>>>> > >> >>> > >   The cache-mode is writeback.
>>>>> > >> >>> > >
>>>>> > >> >>> > > Everything had worked fine until today, when we tried to copy a 1.3TB
>>>>> > >> >>> > > file to the CephFS. We got the "No space left on device" error!
>>>>> > >> >>> > >
>>>>> > >> >>> > > 'ceph -s' says some OSDs are full:
>>>>> > >> >>> > > # ceph -s
>>>>> > >> >>> > >   cluster:
>>>>> > >> >>> > >     id:     e18516bf-39cb-4670-9f13-88ccb7d19769
>>>>> > >> >>> > >     health: HEALTH_ERR
>>>>> > >> >>> > >             full flag(s) set
>>>>> > >> >>> > >             1 full osd(s)
>>>>> > >> >>> > >             1 pools have many more objects per pg than average
>>>>> > >> >>> > >
>>>>> > >> >>> > >   services:
>>>>> > >> >>> > >     mon: 3 daemons, quorum pulpo-admin,pulpo-mon01,pulpo-mds01
>>>>> > >> >>> > >     mgr: pulpo-mds01(active), standbys: pulpo-admin, pulpo-mon01
>>>>> > >> >>> > >     mds: pulpos-1/1/1 up {0=pulpo-mds01=up:active}
>>>>> > >> >>> > >     osd: 39 osds: 39 up, 39 in
>>>>> > >> >>> > >          flags full
>>>>> > >> >>> > >
>>>>> > >> >>> > >   data:
>>>>> > >> >>> > >     pools:   3 pools, 2176 pgs
>>>>> > >> >>> > >     objects: 347k objects, 1381 GB
>>>>> > >> >>> > >     usage:   2847 GB used, 262 TB / 265 TB avail
>>>>> > >> >>> > >     pgs:     2176 active+clean
>>>>> > >> >>> > >
>>>>> > >> >>> > >   io:
>>>>> > >> >>> > >     client: 19301 kB/s rd, 2935 op/s rd, 0 op/s wr
>>>>> > >> >>> > >
>>>>> > >> >>> > > And indeed the cache pool is full:
>>>>> > >> >>> > > # rados df
>>>>> > >> >>> > > POOL_NAME       USED   OBJECTS CLONES COPIES MISSING_ON_PRIMARY UNFOUND DEGRADED RD_OPS   RD    WR_OPS  WR
>>>>> > >> >>> > > cephfs_cache    1381G  355385  0      710770 0                  0       0        10004954 1522G 1398063 1611G
>>>>> > >> >>> > > cephfs_data     0      0       0      0      0                  0       0        0        0     0       0
>>>>> > >> >>> > > cephfs_metadata 8515k  24      0      72     0                  0       0        3        3072  3953    10541k
>>>>> > >> >>> > >
>>>>> > >> >>> > > total_objects  355409
>>>>> > >> >>> > > total_used     2847G
>>>>> > >> >>> > > total_avail    262T
>>>>> > >> >>> > > total_space    265T
>>>>> > >> >>> > >
>>>>> > >> >>> > > However, the data pool is completely empty! So it seems that data has
>>>>> > >> >>> > > only been written to the cache pool, but not written back to the data
>>>>> > >> >>> > > pool.
>>>>> > >> >>> > >
>>>>> > >> >>> > > I am really at a loss whether this is due to a setup error on my part
>>>>> > >> >>> > > or a Luminous bug. Could anyone shed some light on this? Please let me
>>>>> > >> >>> > > know if you need any further info.
>>>>> > >> >>> > >
>>>>> > >> >>> > > Best,
>>>>> > >> >>> > > Shaw
>>>>> > >> >>>
>>>>> > >> >>> --
>>>>> > >> >>> Christian Balzer        Network/Systems Engineer
>>>>> > >> >>> ch...@gol.com           Rakuten Communications
>>>>>
>>>>> --
>>>>> Christian Balzer        Network/Systems Engineer
>>>>> ch...@gol.com           Rakuten Communications
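P.S. Here is the rough sketch I mentioned above. I have not run any of this yet, so treat it purely as a starting point: the 'pulpo_hdd' rule name, the PG counts and the command arguments below are placeholders I made up for illustration (only 'pulpo_nvme' exists today; it is the rule the cache pool uses), and recreating cephfs_data / cephfs_metadata obviously assumes we tear down the experimental filesystem first.

To wind down the experimental cache tier along the lines David suggested:

# ceph osd tier cache-mode cephfs_cache forward --yes-i-really-mean-it
# rados -p cephfs_cache cache-flush-evict-all
# ceph osd tier remove-overlay cephfs_data
# ceph osd tier remove cephfs_data cephfs_cache

(I believe Luminous insists on the --yes-i-really-mean-it flag for the forward mode; older releases may not.)

Then the production layout with plain replicated pools:

# ceph osd crush rule create-replicated pulpo_hdd default host hdd
# ceph osd pool create cephfs_data 1024 1024 replicated pulpo_hdd
# ceph osd pool create cephfs_metadata 128 128 replicated pulpo_nvme
# ceph fs new pulpos cephfs_metadata cephfs_data

The bluestore WAL and DB partitions stay on the 1st NVMe exactly as they are now; only the pools and crush rules change.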
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com