be degraded until you can bring the host back, and you will not be able to
recover those chunks anywhere else (the CRUSH ruleset prevents it), so any
further failure of an OSD while a host is down will necessarily lose data.
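For illustration, a profile along these lines (k/m values and names are
placeholders) forces one chunk per host, which is exactly what prevents the
chunks from being re-created elsewhere while a host is down:
ceph osd erasure-code-profile set example-ec k=2 m=1 crush-failure-domain=host
ceph osd erasure-code-profile get example-ec
ceph osd pool create ecpool 64 64 erasure example-ec   # pool name/PG count are examples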
--
Hector Martin (hec...@marcansoft.com)
Public Key: https://marcan.st/marcan.asc
On 06/02/15 21:07, Udo Lembke wrote:
> Am 06.02.2015 09:06, schrieb Hector Martin:
>> On 02/02/15 03:38, Udo Lembke wrote:
>>> With 3 hosts only you can't survive an full node failure, because for
>>> that you need
>>> host >= k + m.
>>
>> Su
everything by 12? The cluster is
currently very overprovisioned for space, so we're probably not going to
be adding OSDs for quite a while, but we'll be adding pools.
--
Hector Martin (hec...@marcansoft.com)
Public Key: https://marcan.st/marcan.asc
ely larger
factors as you add pools.
We are following the hardware recommendations for RAM: 1GB per 1TB of
storage, so 16GB for each OSD box (4GB per OSD daemon, each OSD being
one 4TB drive).
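(On recent releases with BlueStore you can also pin the per-daemon memory
target explicitly; the 4 GiB here just mirrors the sizing above:)
ceph config set osd osd_memory_target 4294967296   # 4 GiB per OSD daemon
ceph daemon osd.0 config get osd_memory_target     # verify on a running OSD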
--
Hector Martin (hec...@marcansoft.com)
Public Key: https://marcan.st/marcan.asc
thout leaving any
evidence behind.
Any ideas what might've happened here? If this happens again / is
reproducible I'll try to see if I can do some more debugging...
--
Hector Martin (hec...@marcansoft.com)
Public Key: https://mrcn.st/pub
On 2018-06-16 13:04, Hector Martin wrote:
> I'm at a loss as to what happened here.
Okay, I just realized CephFS has a default maximum file size of 1TB... that
explains what triggered the problem. I just bumped it to 10TB. What that
doesn't explain is why rsync didn't complain about an
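For reference, the limit is a per-filesystem setting; raising it looks
something like this (filesystem name "cephfs" is an assumption):
ceph fs get cephfs | grep max_file_size
ceph fs set cephfs max_file_size 10995116277760   # 10 TiB in bytes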
ceph-volume lvm prepare --data hdd1/data1 --block.db ssd/db1
...
ceph-volume lvm activate --all
I think it might be possible to just let ceph-volume create the PV/VG/LV
for the data disks and only manually create the DB LVs, but it shouldn't
hurt to do it on your own and just give ready-made LVs to ceph-volume
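A rough sketch of the manual-LV variant (device names and VG/LV names are
placeholders):
# DB LVs on the shared SSD, created by hand
pvcreate /dev/sdX
vgcreate ssd /dev/sdX
lvcreate -L 40G -n db1 ssd
# data LV for one HDD (ceph-volume can also do this part for you)
pvcreate /dev/sdY
vgcreate hdd1 /dev/sdY
lvcreate -l 100%FREE -n data1 hdd1
ceph-volume lvm prepare --bluestore --data hdd1/data1 --block.db ssd/db1
ceph-volume lvm activate --all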
f you just try
to start the OSDs again? Maybe check the overall system log with
journalctl for hints.
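For example (OSD id is a placeholder):
systemctl start ceph-osd@60
systemctl status ceph-osd@60
journalctl -b -u ceph-osd@60 --no-pager | tail -n 50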
--
Hector Martin (hec...@marcansoft.com)
Public Key: https://marcan.st/marcan.asc
systemd is still trying to mount
the old OSDs, which used disk partitions. Look in /etc/fstab and in
/etc/systemd/system for any references to those filesystems and get rid
of them. /dev/sdh1 and company no longer exist, and nothing should
reference them.
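Something like this should hunt the stale references down (device names are
examples):
grep -n 'sdh1\|sdi1' /etc/fstab
grep -rln 'sdh1\|sdi1' /etc/systemd/system/
systemctl list-units --all --type=mount | grep -i ceph
systemctl daemon-reload   # after removing anything stale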
--
Hector Martin (hec...@marcansoft.com)
On 11/6/18 1:08 AM, Hector Martin wrote:
> On 11/6/18 12:42 AM, Hayashida, Mami wrote:
>> Additional info -- I know that /var/lib/ceph/osd/ceph-{60..69} are not
>> mounted at this point (i.e. mount | grep ceph-60, and 61-69, returns
>> nothing.). They don't show
> ├─ssd0-db60 252:0  0  40G  0 lvm
> ├─ssd0-db61 252:1  0  40G  0 lvm
> ├─ssd0-db62 252:2  0  40G  0 lvm
> ├─ssd0-db63 252:3  0  40G  0 lvm
> ├─ssd0-db64 252:4  0  40G  0 lvm
> ├─
Anything that references any of the old partitions that don't exist
(/dev/sdh1 etc.) should be removed. The disks are now full-disk LVM PVs
and should have no partitions.
--
Hector Martin (hec...@marcansoft.com)
Public Key: https://mrcn.st/pub
On 11/6/18 3:21 AM, Alfredo Deza wrote:
> On Mon, Nov 5, 2018 at 11:51 AM Hector Martin wrote:
>>
>> Those units don't get triggered out of nowhere, there has to be a
>> partition table with magic GUIDs or a fstab or something to cause them
>> to be triggered. The
;change", SUBSYSTEM=="block", ENV{DEVTYPE}=="disk", \
ENV{DM_LV_NAME}=="db*", ENV{DM_VG_NAME}=="ssd0", \
OWNER="ceph", GROUP="ceph", MODE="660"
Reboot after that and see if the OSDs come up without further action.
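(If you'd rather not wait for a reboot, reloading udev and re-triggering the
block devices should apply the rule too; the rules-file path here is just an
example:)
# e.g. saved as /etc/udev/rules.d/99-ceph-db-perms.rules
udevadm control --reload
udevadm trigger --subsystem-match=block --action=change
ls -l /dev/mapper/ssd0-db6*   # check the ceph:ceph ownership took effect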
--
Hector Martin (hec...@marcansoft.com)
s
with symlinks to block devices. I'm not sure what happened there.
--
Hector Martin (hec...@marcansoft.com)
Public Key: https://mrcn.st/pub
> sdh 8:112 0 3.7T 0 disk
> └─hdd60-data60 252:1 0 3.7T 0 lvm
>
> and "ceph osd tree" shows
> 60 hdd 3.63689 osd.60 up 1.0 1.0
That looks correct as far as the weight goes, but I'm really confused as
to why you have
" and "mount | grep osd" instead and
see if ceph-60 through ceph-69 show up.
--
Hector Martin (hec...@marcansoft.com)
Public Key: https://mrcn.st/pub
Since the metadata is in LVM, it's safe to move or delete all those OSD
directories for BlueStore OSDs and try activating them cleanly again, which
hopefully will do the right thing.
In the end this all might fix your device ownership woes too, making the
udev rule unnecessary. If it all works out, that should be taken care of by
the ceph-volume activation.
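The move-and-reactivate step would look roughly like this (OSD id is an
example; moving rather than deleting keeps a fallback):
systemctl stop ceph-osd@60
mv /var/lib/ceph/osd/ceph-60 /var/lib/ceph/osd/ceph-60.old
ceph-volume lvm activate --all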
--
Hector Martin (hec...@marcansoft.com)
Public Key: https://mrcn.st/pub
d to be safe, and might avoid trouble if some
FileStore remnant tries to mount phantom partitions.
--
Hector Martin (hec...@marcansoft.com)
Public Key: https://mrcn.st/pub
g to wipe
because there is a backup at the end of the device, but wipefs *should*
know about that as far as I know.
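If wipefs does leave something behind, explicitly zapping both GPT copies is
an option (device name is a placeholder):
wipefs --all /dev/sdX
sgdisk --zap-all /dev/sdX   # removes primary and backup GPT headers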
--
Hector Martin (hec...@marcansoft.com)
Public Key: https://mrcn.st/pub
-rw------- 1 ceph ceph  6 Oct 28 16:12 ready
-rw------- 1 ceph ceph 10 Oct 28 16:12 type
-rw------- 1 ceph ceph  3 Oct 28 16:12 whoami
(lockbox.keyring is for encryption, which you do not use)
--
Hector Martin (hec...@marcansoft.com)
Public Key: https://mrcn.st/pub
ot
always viable. Right now it seems that besides the cache, OSDs will
creep up in memory usage up to some threshold, and I'm not sure what
determines what that baseline usage is or whether it can be controlled.
--
Hector Martin (hec...@marcansoft.com)
Public Key: https://mrcn.st/pub
ata:
http://docs.ceph.com/docs/mimic/rados/troubleshooting/troubleshooting-mon/#recovery-using-osds
Would this be preferable to just restoring the mon from a backup? What
about the MDS map?
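(For reference, that procedure boils down to rebuilding the mon store from the
OSDs' copies of the maps; a heavily abbreviated sketch, paths are examples:)
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 \
    --op update-mon-db --mon-store-path /tmp/mon-store
# ...repeat for every OSD, then rebuild and install the store per the doc above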
--
Hector Martin (hec...@marcansoft.com)
Public Key: https://mrcn.st/pub
ow. I'll see if I can do some DR tests when I set this up, to
prove to myself that it all works out :-)
--
Hector Martin (hec...@marcansoft.com)
Public Key: https://marcan.st/marcan.asc
nly? Or only if several things
go down at once?)
--
Hector Martin (hec...@marcansoft.com)
Public Key: https://mrcn.st/pub
hose pages are flushed?
--
Hector Martin (hec...@marcansoft.com)
Public Key: https://mrcn.st/pub
On 26/11/2018 11.05, Yan, Zheng wrote:
> On Mon, Nov 26, 2018 at 4:30 AM Hector Martin wrote:
>>
>> On 26/11/2018 00.19, Paul Emmerich wrote:
>>> No, wait. Which system did kernel panic? Your CephFS client running rsync?
>>> In this case this would be expect
M (suspend/resume), which
has higher impact but also probably a much lower chance of messing up
(or having excess latency), since it doesn't involve the guest OS or the
qemu agent at all...
--
Hector Martin (hec...@marcansoft.com)
Public Key: https://mrcn.st/pub
I'm increasing our timeout to 15 minutes; we'll see if the problem recurs.
Given this, it makes even more sense to just avoid the freeze if at all
reasonable. There's no real way to guarantee that an fsfreeze will
complete in a "reasonable" amount of time as far as I can tell.
out/retries, then switch to unconditionally resetting the VM if thawing fails.
Ultimately this whole thing is kind of fragile, so if I can get away
without freezing at all it would probably make the whole process a lot
more robust.
--
Hector Martin (hec...@marcansoft.com)
On 21/12/2018 03.02, Gregory Farnum wrote:
> RBD snapshots are indeed crash-consistent. :)
> -Greg
Thanks for the confirmation! May I suggest putting this little nugget in
the docs somewhere? This might help clarify things for others :)
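(In other words, a plain snapshot with no guest-side freeze should already be
crash-consistent; pool and image names below are placeholders:)
rbd snap create mypool/vm-disk@nightly-$(date +%Y%m%d)
rbd snap ls mypool/vm-disk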
--
Hector Martin (hec...@marcansoft.com)
Public Key: https://mrcn.st/pub
e happy to test this
again with osd.1 if needed and see if I can get it fixed. Otherwise I'll
just re-create it and move on.
# ceph --version
ceph version 13.2.1 (5533ecdc0fda920179d7ad84e0aa65a127b20d77) mimic
(stable)
--
Hector Martin (hec...@marcansoft.com)
Public Key: https://mrcn.st/pub
ith
CEPH_ARGS="--debug-bluestore 20 --debug-bluefs 20 --log-file
bluefs-bdev-expand.log"
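(Presumably wrapped around the expand call itself, something like the
following; the OSD path is an example:)
CEPH_ARGS="--debug-bluestore 20 --debug-bluefs 20 --log-file bluefs-bdev-expand.log" \
    ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-1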
Perhaps it makes sense to open a ticket in the Ceph bug tracker to proceed...
Thanks,
Igor
On 12/27/2018 12:19 PM, Hector Martin wrote:
Hi list,
I'm slightly expanding the underlying LV for two
No problem then, good to know it isn't *supposed* to work yet :-)
--
Hector Martin (hec...@marcansoft.com)
Public Key: https://marcan.st/marcan.asc
). So the OSDs get set up with some
custom code, but then normal usage just uses ceph-disk (it certainly
doesn't care about extra partitions once everything is set up). This was
formerly FileStore and now BlueStore, but it's a legacy setup. I expect
to move this over to ceph-volume at some point.
m to work well
so far in my home cluster, but I haven't finished setting things up yet.
Those are definitely not SMR.
--
Hector Martin (hec...@marcansoft.com)
Public Key: https://marcan.st/marcan.asc
ision thing) to hopefully squash more
lurking Python 3 bugs.
(just my 2c - maybe I got unlucky and otherwise things work well enough
for everyone else in Py3; I'm certainly happy to get rid of Py2 ASAP).
--
Hector Martin (hec...@marcansoft.com)
Public Key: https://mrcn.st/pub
On 18/01/2019 22.33, Alfredo Deza wrote:
> On Fri, Jan 18, 2019 at 7:07 AM Hector Martin wrote:
>>
>> On 17/01/2019 00:45, Sage Weil wrote:
>>> Hi everyone,
>>>
>>> This has come up several times before, but we need to make a final
>>> decis
On 19/01/2019 02.24, Brian Topping wrote:
>
>
>> On Jan 18, 2019, at 4:29 AM, Hector Martin wrote:
>>
>> On 12/01/2019 15:07, Brian Topping wrote:
>>> I’m a little nervous that BlueStore assumes it owns the partition table and
>>> will not be happy tha
d-raid-on-lvm, which as you can imagine required some
tweaking of startup scripts to make it work with LVM on both ends!)
Ultimately a lot of this is dictated by whatever tools you feel
comfortable using :-)
--
Hector Martin (hec...@marcansoft.com)
Public Key: https://mrcn.st/pub
ring
raw storage performance.
* Ceph has a slight disadvantage here because its chunk of the drives is
logically after the traditional RAID, and HDDs get slower towards higher
logical addresses, but this should be on the order of a 15-20% hit at most.
--
Hector Martin (hec...@marcansoft.com)
echo 'rc_need="ceph-mon.0"' > /etc/conf.d/ceph-osd
The Gentoo initscript setup for Ceph is unfortunately not very well
documented. I've been meaning to write a blogpost about this to try to
share what I've learned :-)
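(For context, the rough shape is per-daemon symlinks to the bundled init
scripts plus rc-update, with the rc_need line above tying the OSD to the mon;
a sketch, assuming the ebuild installs ceph-mon/ceph-osd OpenRC scripts:)
cd /etc/init.d
ln -s ceph-mon ceph-mon.0   # instance names are examples
ln -s ceph-osd ceph-osd.0
rc-update add ceph-mon.0 default
rc-update add ceph-osd.0 default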
--
Hector Martin (hec...@marcansoft.com)
Public Key: https://mrcn.st/pub
dance, if I can guarantee they're atomic.
Is there any documentation on what write operations incur significant
overhead on CephFS like this, and why? This particular issue isn't
mentioned in http://docs.ceph.com/docs/master/cephfs/app-best-practices/
(which seems like it mostly deals
though. There's some discussion on this here:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-September/020510.html
--
Hector Martin (hec...@marcansoft.com)
Public Key: https://mrcn.st/pub
600 or so? You might want to go through your snapshots and check that
you aren't leaking old snapshots forever, or deleting the wrong ones.
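Listing the .snap directory at the top of the tree is a quick way to see what
is actually there (mount point is an example):
ls /mnt/cephfs/.snap
ls /mnt/cephfs/.snap | wc -l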
--
Hector Martin (hec...@marcansoft.com)
Public Key: https://mrcn.st/pub
or writing to
an existing one without truncation does not.
--
Hector Martin (hec...@marcansoft.com)
Public Key: https://mrcn.st/pub
ying pools, one apparently created on deletion (I
wasn't aware of this). So for ~700 snapshots the output you're seeing is
normal. It seems that using a "rolling snapshot" pattern in CephFS
inherently creates a "one present, one deleted" pattern in the
underlying pools.
--
and trying to connect via the external IP of that node.
Does your ceph.conf have the right network settings? Compare it with the
other nodes. Also check that your network interfaces and routes are
correctly configured on the problem node, of course.
--
Hector Martin (hec...@marcansoft.com)
Public Key: https://mrcn.st/pub
t data pool.
The FSMap seems to store pools by ID, not by name, so renaming the pools
won't work.
This past thread has an untested procedure for migrating CephFS pools:
https://www.spinics.net/lists/ceph-users/msg29536.html
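The general shape of that procedure is adding a new data pool and switching
the layout, then rewriting the files (untested here as well; pool and path
names are placeholders):
ceph osd pool create cephfs_data_new 128 128
ceph fs add_data_pool cephfs cephfs_data_new
setfattr -n ceph.dir.layout.pool -v cephfs_data_new /mnt/cephfs
# new files now land in the new pool; existing files have to be copied/rewritten to move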
--
Hector Martin (hec...@marcansoft.com)
Public Key: https://mrcn.st/pub
'), e.g. 'ceph osd purge
--yes-i-really-mean-it' and make sure there isn't a spurious entry for
it in ceph.conf, then re-deploy. Once you do that there is no possible
other place for the OSD to somehow remember its old IP.
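i.e. something along these lines (OSD id is an example):
ceph osd purge 12 --yes-i-really-mean-it
grep -n 'osd\.12' /etc/ceph/ceph.conf   # make sure no stale [osd.12] section remains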
--
Hector Martin (hec...@marcansoft.com)
Public Key: https://mrcn.st/pub
s
formula and then just do the above dance for every hardlinked file to
move the primaries off, but this seems fragile and likely to break in
certain situations (or do needless work). Any other ideas?
Thanks,
--
Hector Martin (hec...@marcansoft.com)
Public Key: https://mrcn.st/pub
h` back to the cluster seed.
>
>
> I appreciate small clusters are not the target use case of Ceph, but
> everyone has to start somewhere!
>
It's just a 128-byte flag file (formerly
variable length, now I just pad it to the full 128 bytes and rewrite it
in-place). This is good information to know for optimizing things :-)
--
Hector Martin (hec...@marcansoft.com)
Public Key: https://mrcn.st/pub
tat(), right. (I only just
realized this :-))
Are there Python bindings for what ceph-dencoder does, or at least a C
API? I could shell out to ceph-dencoder but I imagine that won't be too
great for performance.
--
Hector Martin (hec...@marcansoft.com)
Public Key: https://mrcn.st/pub
to know about all
the files in a pool.
As far as I can tell you *can* read the ceph.file.layout.pool xattr on
any files in CephFS, even those that haven't had it explicitly set.
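e.g. (paths are examples):
getfattr -n ceph.file.layout.pool /mnt/cephfs/some/file
getfattr -n ceph.file.layout /mnt/cephfs/some/file   # full layout, including the pool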
--
Hector Martin (hec...@marcansoft.com)
Public Key: https://mrcn.st/pub
odified time of 2 Sept 2028, the day and month
are also wrong.
Obvious question: are you sure the date/time on your cluster nodes and
your clients is correct? Can you track down which files (if any) have
the ctime in the future by following the rctime
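(rctime is exposed as a virtual xattr, so a manual walk could look like this;
the mount point is an assumption:)
getfattr -n ceph.dir.rctime /mnt/cephfs
getfattr -n ceph.dir.rctime /mnt/cephfs/*/   # descend into whichever subdir reports a future rctime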
4 for months now without any issues in two single-host setups. I'm
also in the process of testing and migrating a production cluster
workload from a different setup to CephFS on 13.2.4 and it's looking good.
--
Hector Martin (hec...@marcansoft.com)
Public Key: https://mrcn.st/pub
m OSDs without regard for the hosts; you will be able to use
effectively any EC widths you want, but there will be no guarantees of
data durability if you lose a whole host.
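(i.e. a profile like the following; names and k/m are purely illustrative:)
ceph osd erasure-code-profile set wide-ec k=8 m=2 crush-failure-domain=osd
ceph osd pool create ecpool2 64 64 erasure wide-ec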
--
Hector Martin (hec...@marcansoft.com)
Public Key: https://mrcn.st/pub
nt, I have been doing this on two machines (single-host Ceph
clusters) for months with no ill effects. The FUSE client performs a lot
worse than the kernel client, so I switched to the latter, and it's been
working well with no deadlocks.
--
Hector Martin (hec...@marcansoft.com)
Public Key: https://mrcn.st/pub
snapshots ( one per day), have one active metadata server,
and change several TB daily - it's much, *much* faster than with fuse.
Cluster has 10 OSD nodes, currently storing 2PB, using ec 8:2 coding.
ta ta
Jake
On 3/6/19 11:10 AM, Hector Martin wrote:
On 06/03/2019 12:07, Zhenshi Zhou
> On Tue, Mar 12, 2019 at 10:07 AM Hector Martin <hec...@marcansoft.com> wrote:
> >
> > It's worth noting that most containerized deployments can effectively
> > limit RAM for containers (cg
In particular, you turned on CRUSH_TUNABLES5, which causes a large
amount of data movement:
http://docs.ceph.com/docs/master/rados/operations/crush-map/#jewel-crush-tunables5
Going from Firefly to Hammer has a much smaller impact (see the CRUSH_V4
section).
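You can check and set the tunables profile explicitly, e.g.:
ceph osd crush show-tunables
ceph osd crush tunables hammer   # e.g. move to the hammer profile first rather than straight to optimal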
--
Hector Martin (hec...@marcansoft.com)
ith such a wide EC
encoding, but if you do lose a PG you'll lose more data because there
are fewer PGs.
Feedback on my math welcome.
--
Hector Martin (hec...@marcansoft.com)
Public Key: https://mrcn.st/pub
peek and comment.
https://www.memset.com/support/resources/raid-calculator/
I'll take a look tonight :)
--
Hector Martin (hec...@marcansoft.com)
Public Key: https://mrcn.st/pub
n, and you need to hit all 3). This is
marginally higher than the ~ 0.00891% with uniformly distributed PGs,
because you've eliminated all sets of OSDs which share a host.
--
Hector Martin (hec...@marcansoft.com)
Public Key: https://mrcn.st/pub
                    ,
                        "event": "dispatched"
                    },
                    {
                        "time": "2019-06-12 16:15:59.096318",
                        "event": "failed to rdlock, waiting"
                    },
                    {
                        "time": "2019-06-12 16:15:59.268368",
                        "event": "failed to rdlock, waiting"
                    }
                ]
            }
        }
    ],
    "num_ops": 1
}
My guess is somewhere along the line of this process there's a race
condition and the dirty client isn't properly flushing its data.
A 'sync' on host2 does not clear the stuck op. 'echo 1 >
/proc/sys/vm/drop_caches' does not either, while 'echo 2 >
/proc/sys/vm/drop_caches' does fix it. So I guess the problem is a
dentry/inode that is stuck dirty in the cache of host2?
--
Hector Martin (hec...@marcansoft.com)
Public Key: https://mrcn.st/pub
NDING_CAPSNAP))
> cap->mark_needsnapflush();
> }
>
>
>
That was quick, thanks! I can build from source but I won't have time to
do so and test it until next week, if that's okay.
--
Hector Martin (hec...@marcansoft.com)
Public Key: https://mrcn.st/pub
://mirrors.gigenet.com/ceph/
This one is *way* behind on sync, it doesn't even have Nautilus.
Perhaps there should be some monitoring for public mirror quality?
--
Hector Martin (hec...@marcansoft.com)
Public Key: https://mrcn.st/pub
ure you test that they work (not sure if they need to be
base64 decoded or what have you) if you really want to go this route.
--
Hector Martin (hec...@marcansoft.com)
Public Key: https://mrcn.st/pub
On 13/06/2019 14.31, Hector Martin wrote:
> On 12/06/2019 22.33, Yan, Zheng wrote:
>> I have tracked down the bug. Thank you for reporting this. 'echo 2 >
>> /proc/sys/vm/drop_caches' should fix the hang. If you can compile ceph
>> from source, please try follo
roperly and tested and
everything seems fine. I deployed it to production and got rid of the
drop_caches hack and I've seen no stuck ops for two days so far.
If there is a bug or PR opened for this can you point me to it so I can
track when it goes into a release?
Thanks!
--
Hector Martin (hec...@marcansoft.com)
nings and
cephfs talking about reconnections and such) and seems to be fine.
I can't find these errors anywhere, so I'm guessing they're not known bugs?
--
Hector Martin (hec...@marcansoft.com)
Public Key: https://mrcn.st/pub
neg    %edx
0xd788 <+536>:  mov    %edx,0x48(%r15)
That means req->r_reply_info.filelock_reply was NULL.
--
Hector Martin (hec...@marcansoft.com)
Public Key: https://mrcn.st/pub
(Aside: is there any good documentation about the on-RADOS data
structures used by CephFS? I would like to get more familiar with
everything to have a better chance of fixing problems should I run into
some data corruption in the future)
--
Hector Martin (hec...@marcansoft.com)
Public Key: https://mrcn.st/pub
't involve keeping two
months' worth of snapshots? That CephFS can't support this kind of use
case (and in general that CephFS uses the stray subdir persistently for
files in snapshots that could remain forever, while the stray dirs don't
scale) sounds like a bug.
--
Hector Martin (hec...@marcansoft.com)
oding and
dm-crypt (AES-NI) under the OSDs. Since you'd be running a single OSD
per host, I imagine you should be able to get reasonable aggregate
performance out of the whole thing, but I've never tried a setup like that.
I'm actually considering this kind of thing in the future.
e on the MDS much at that time, so I'm not sure what the
bottleneck is here.
Is this expected for CephFS? I know data deletions are asynchronous, but
not being able to delete metadata/directories without an undue impact on
the whole filesystem performance is somewhat problematic.
--
Hector Martin (hec...@marcansoft.com)
On 13/09/2019 16.25, Hector Martin wrote:
> Is this expected for CephFS? I know data deletions are asynchronous, but
> not being able to delete metadata/directories without an undue impact on
> the whole filesystem performance is somewhat problematic.
I think I'm getting a feeli
d some strays are never getting cleaned up. I guess
I'll see once I catch up on snapshot deletions.
--
Hector Martin (hec...@marcansoft.com)
Public Key: https://mrcn.st/pub
, you need to reduce
bluestore_min_alloc_size.
--
Hector Martin (hec...@marcansoft.com)
Public Key: https://mrcn.st/pub
> should zap these 10 osds and start over although at
> this point I am afraid even zapping may not be a simple task
>
> On Tue, Nov 6, 2018 at 3:44 PM, Hector Martin wrote:
>
>> On 11/7/18 5:27 AM, Hayashida, Mami wrote:
>> > 1. Stopped osd.60-69: