I have read your post, then read the thread you suggested; very
interesting. Then I read your post again and understood it better.
The most important point is that even with min_size=1, writes are
acknowledged only after Ceph has written all size=2 copies.
In the thread above it says:
As David already said, when all OSDs are up and in for a PG Ceph will wait for
ALL OSDs to Ack the write. Writes in RADOS are always synchronous.
Only when OSDs go down you need at least min_size OSDs up before writes or
reads are accepted.
So if min_size = 2 and size = 3 you need at least 2 OSDs online for I/O to take
place.
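For context, the size and min_size a pool is actually running with
can be read back from the cluster. A minimal sketch, assuming an
example pool named "rbd" (substitute your own pool name):
# Show size and min_size for every pool, one line each
ceph osd dump | grep 'replicated size'
# Or query a single pool ("rbd" is only an example name)
ceph osd pool get rbd size
ceph osd pool get rbd min_size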
You then showed me a sequence of events that may happen in some use cases.
Let me describe my use case, which is quite different. We use Ceph under
Proxmox. The servers have their disks on RAID 5 (I agree that it is
better to expose individual disks to Ceph, but it is too late for that now).
So, thanks to the RAID, it is unlikely that a Ceph disk fails. If a disk
does fail, it is probably because the entire server has failed (and we
need to provide business continuity in that case), so it will never
come up again; in my situation your sequence of events will never
happen.
What shocked me is the number of inconsistencies; I did not expect to see so many.
Thanks,
Mario
2017-11-01 16:45 GMT+01:00 David Turner <drakonst...@gmail.com>:
It looks like you're running with size = 2 and min_size = 1 (the
min_size is a guess; the size is based on how many OSDs belong to
your problem PGs). Here's some good reading for you:
https://www.spinics.net/lists/ceph-users/msg32895.html
Basically the gist is that when running with size = 2 you should
assume that data loss is an eventuality and decide whether that is OK
for your use case. This can be mitigated by using min_size = 2,
but then your pool will block while an OSD is down and you'll have
to manually change min_size temporarily to perform
maintenance.
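The temporary change mentioned here is just a pool setting. A minimal
sketch, assuming an example pool named "rbd":
# Before maintenance: accept I/O with only one replica up (temporary, risky)
ceph osd pool set rbd min_size 1
# ...do the maintenance and let recovery finish...
# Afterwards: restore the safer setting
ceph osd pool set rbd min_size 2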
All it takes for data loss is that an OSD on server 1 is marked
down and a write happens to an OSD on server 2. Now the OSD on
server 2 goes down before the OSD on server 1 has finished
backfilling, and the first OSD receives a request to modify data in
an object whose current state it doesn't know. Tada, you
have data loss.
How likely is this to happen? Eventually it will. PG subfolder
splitting (if you're using filestore) will occasionally take long
enough that the OSD is marked down while the task is still running,
and when it does, this tends to happen for a while all over the
cluster. Other causes include anything that triggers segfaults in the
OSDs, restarting a node before all PGs are done
backfilling/recovering, the OOM killer, power outages, etc.
Why does min_size = 2 prevent this? Because for a write to be
acknowledged by the cluster, it has to be written to every OSD
that is up, as long as at least min_size of them are available. This
means that every write is acknowledged by at least 2 OSDs every
time. If you're running with size = 2 and min_size = 2, then both
copies of the data need to be online for a write to happen, so
neither can have a write that the other does not. If you're running
with size = 3, then a majority of the OSDs is always online receiving
each write, and they can agree on the correct data to give to the
third when it comes back up.
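The safer configuration described above comes down to two pool
settings. A minimal sketch, again assuming an example pool named
"rbd"; note that raising size triggers backfill of the extra replica:
# Keep three copies of each object; require at least two up for I/O
ceph osd pool set rbd size 3
ceph osd pool set rbd min_size 2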
On Wed, Nov 1, 2017 at 3:31 AM Mario Giammarco <mgiamma...@gmail.com> wrote:
Sure, here it is. ceph -s:
  cluster:
    id:     8bc45d9a-ef50-4038-8e1b-1f25ac46c945
    health: HEALTH_ERR
            100 scrub errors
            Possible data damage: 56 pgs inconsistent

  services:
    mon: 3 daemons, quorum 0,1,pve3
    mgr: pve3(active)
    osd: 3 osds: 3 up, 3 in

  data:
    pools:   1 pools, 256 pgs
    objects: 269k objects, 1007 GB
    usage:   2050 GB used, 1386 GB / 3436 GB avail
    pgs:     200 active+clean
             56  active+clean+inconsistent
---
ceph health detail:
PG_DAMAGED Possible data damage: 56 pgs inconsistent
pg 2.6 is active+clean+inconsistent, acting [1,0]
pg 2.19 is active+clean+inconsistent, acting [1,2]
pg 2.1e is active+clean+inconsistent, acting [1,2]
pg 2.1f is active+clean+inconsistent, acting [1,2]
pg 2.24 is active+clean+inconsistent, acting [0,2]
pg 2.25 is active+clean+inconsistent, acting [2,0]
pg 2.36 is active+clean+inconsistent, acting [1,0]
pg 2.3d is active+clean+inconsistent, acting [1,2]
pg 2.4b is active+clean+inconsistent, acting [1,0]
pg 2.4c is active+clean+inconsistent, acting [0,2]
pg 2.4d is active+clean+inconsistent, acting [1,2]
pg 2.4f is active+clean+inconsistent, acting [1,2]
pg 2.50 is active+clean+inconsistent, acting [1,2]
pg 2.52 is active+clean+inconsistent, acting [1,2]
pg 2.56 is active+clean+inconsistent, acting [1,0]
pg 2.5b is active+clean+inconsistent, acting [1,2]
pg 2.5c is active+clean+inconsistent, acting [1,2]
pg 2.5d is active+clean+inconsistent, acting [1,0]
pg 2.5f is active+clean+inconsistent, acting [1,2]
pg 2.71 is active+clean+inconsistent, acting [0,2]
pg 2.75 is active+clean+inconsistent, acting [1,2]
pg 2.77 is active+clean+inconsistent, acting [1,2]
pg 2.79 is active+clean+inconsistent, acting [1,2]
pg 2.7e is active+clean+inconsistent, acting [1,2]
pg 2.83 is active+clean+inconsistent, acting [1,0]
pg 2.8a is active+clean+inconsistent, acting [1,0]
pg 2.92 is active+clean+inconsistent, acting [1,2]
pg 2.98 is active+clean+inconsistent, acting [1,0]
pg 2.9a is active+clean+inconsistent, acting [1,0]
pg 2.9e is active+clean+inconsistent, acting [1,0]
pg 2.9f is active+clean+inconsistent, acting [1,2]
pg 2.c6 is active+clean+inconsistent, acting [0,2]
pg 2.c7 is active+clean+inconsistent, acting [1,0]
pg 2.c8 is active+clean+inconsistent, acting [1,2]
pg 2.cb is active+clean+inconsistent, acting [1,2]
pg 2.cd is active+clean+inconsistent, acting [1,2]
pg 2.ce is active+clean+inconsistent, acting [1,2]
pg 2.d2 is active+clean+inconsistent, acting [2,1]
pg 2.da is active+clean+inconsistent, acting [1,0]
pg 2.de is active+clean+inconsistent, acting [1,2]
pg 2.e1 is active+clean+inconsistent, acting [1,2]
pg 2.e4 is active+clean+inconsistent, acting [1,0]
pg 2.e6 is active+clean+inconsistent, acting [0,2]
pg 2.e8 is active+clean+inconsistent, acting [1,2]
pg 2.ee is active+clean+inconsistent, acting [1,0]
pg 2.f9 is active+clean+inconsistent, acting [1,2]
pg 2.fa is active+clean+inconsistent, acting [1,0]
pg 2.fb is active+clean+inconsistent, acting [1,2]
pg 2.fc is active+clean+inconsistent, acting [1,2]
pg 2.fe is active+clean+inconsistent, acting [1,0]
pg 2.ff is active+clean+inconsistent, acting [1,0]
and ceph pg 2.6 query:
{
"state": "active+clean+inconsistent",
"snap_trimq": "[]",
"epoch": 1513,
"up": [
1,
0
],
"acting": [
1,
0
],
"actingbackfill": [
"0",
"1"
],
"info": {
"pgid": "2.6",
"last_update": "1513'89145",
"last_complete": "1513'89145",
"log_tail": "1503'87586",
"last_user_version": 330583,
"last_backfill": "MAX",
"last_backfill_bitwise": 0,
"purged_snaps": [
{
"start": "1",
"length": "178"
},
{
"start": "17a",
"length": "3d"
},
{
"start": "1b8",
"length": "1"
},
{
"start": "1ba",
"length": "1"
},
{
"start": "1bc",
"length": "1"
},
{
"start": "1be",
"length": "44"
},
{
"start": "205",
"length": "12c"
},
{
"start": "332",
"length": "1"
},
{
"start": "334",
"length": "1"
},
{
"start": "336",
"length": "1"
},
{
"start": "338",
"length": "1"
},
{
"start": "33a",
"length": "1"
}
],
"history": {
"epoch_created": 90,
"epoch_pool_created": 90,
"last_epoch_started": 1339,
"last_interval_started": 1338,
"last_epoch_clean": 1339,
"last_interval_clean": 1338,
"last_epoch_split": 0,
"last_epoch_marked_full": 0,
"same_up_since": 1338,
"same_interval_since": 1338,
"same_primary_since": 1338,
"last_scrub": "1513'89112",
"last_scrub_stamp": "2017-11-01 05:52:21.259654",
"last_deep_scrub": "1513'89112",
"last_deep_scrub_stamp": "2017-11-01 05:52:21.259654",
"last_clean_scrub_stamp": "2017-10-25 04:25:09.830840"
},
"stats": {
"version": "1513'89145",
"reported_seq": "422820",
"reported_epoch": "1513",
"state": "active+clean+inconsistent",
"last_fresh": "2017-11-01 08:11:38.411784",
"last_change": "2017-11-01 05:52:21.259789",
"last_active": "2017-11-01 08:11:38.411784",
"last_peered": "2017-11-01 08:11:38.411784",
"last_clean": "2017-11-01 08:11:38.411784",
"last_became_active": "2017-10-15 20:36:33.644567",
"last_became_peered": "2017-10-15 20:36:33.644567",
"last_unstale": "2017-11-01 08:11:38.411784",
"last_undegraded": "2017-11-01 08:11:38.411784",
"last_fullsized": "2017-11-01 08:11:38.411784",
"mapping_epoch": 1338,
"log_start": "1503'87586",
"ondisk_log_start": "1503'87586",
"created": 90,
"last_epoch_clean": 1339,
"parent": "0.0",
"parent_split_bits": 0,
"last_scrub": "1513'89112",
"last_scrub_stamp": "2017-11-01 05:52:21.259654",
"last_deep_scrub": "1513'89112",
"last_deep_scrub_stamp": "2017-11-01 05:52:21.259654",
"last_clean_scrub_stamp": "2017-10-25
04:25:09.830840",
"log_size": 1559,
"ondisk_log_size": 1559,
"stats_invalid": false,
"dirty_stats_invalid": false,
"omap_stats_invalid": false,
"hitset_stats_invalid": false,
"hitset_bytes_stats_invalid": false,
"pin_stats_invalid": false,
"stat_sum": {
"num_bytes": 3747886080 <tel:374%20788%206080>,
"num_objects": 958,
"num_object_clones": 295,
"num_object_copies": 1916,
"num_objects_missing_on_primary": 0,
"num_objects_missing": 0,
"num_objects_degraded": 0,
"num_objects_misplaced": 0,
"num_objects_unfound": 0,
"num_objects_dirty": 958,
"num_whiteouts": 0,
"num_read": 333428,
"num_read_kb": 135550185,
"num_write": 79221,
"num_write_kb": 13441239,
"num_scrub_errors": 1,
"num_shallow_scrub_errors": 0,
"num_deep_scrub_errors": 1,
"num_objects_recovered": 245,
"num_bytes_recovered": 1012833792,
"num_keys_recovered": 6,
"num_objects_omap": 0,
"num_objects_hit_set_archive": 0,
"num_bytes_hit_set_archive": 0,
"num_flush": 0,
"num_flush_kb": 0,
"num_evict": 0,
"num_evict_kb": 0,
"num_promote": 0,
"num_flush_mode_high": 0,
"num_flush_mode_low": 0,
"num_evict_mode_some": 0,
"num_evict_mode_full": 0,
"num_objects_pinned": 0,
"num_legacy_snapsets": 0
},
"up": [
1,
0
],
"acting": [
1,
0
],
"blocked_by": [],
"up_primary": 1,
"acting_primary": 1
},
"empty": 0,
"dne": 0,
"incomplete": 0,
"last_epoch_started": 1339,
"hit_set_history": {
"current_last_update": "0'0",
"history": []
}
},
"peer_info": [
{
"peer": "0",
"pgid": "2.6",
"last_update": "1513'89145",
"last_complete": "1513'89145",
"log_tail": "1274'68440",
"last_user_version": 315687,
"last_backfill": "MAX",
"last_backfill_bitwise": 0,
"purged_snaps": [
{
"start": "1",
"length": "178"
},
{
"start": "17a",
"length": "3d"
},
{
"start": "1b8",
"length": "1"
},
{
"start": "1ba",
"length": "1"
},
{
"start": "1bc",
"length": "1"
},
{
"start": "1be",
"length": "44"
},
{
"start": "205",
"length": "82"
},
{
"start": "288",
"length": "1"
},
{
"start": "28a",
"length": "1"
},
{
"start": "28c",
"length": "1"
},
{
"start": "28e",
"length": "1"
},
{
"start": "290",
"length": "1"
}
],
"history": {
"epoch_created": 90,
"epoch_pool_created": 90,
"last_epoch_started": 1339,
"last_interval_started": 1338,
"last_epoch_clean": 1339,
"last_interval_clean": 1338,
"last_epoch_split": 0,
"last_epoch_marked_full": 0,
"same_up_since": 1338,
"same_interval_since": 1338,
"same_primary_since": 1338,
"last_scrub": "1513'89112",
"last_scrub_stamp": "2017-11-01 05:52:21.259654",
"last_deep_scrub": "1513'89112",
"last_deep_scrub_stamp": "2017-11-01
05:52:21.259654",
"last_clean_scrub_stamp": "2017-10-25
04:25:09.830840"
},
"stats": {
"version": "1337'71465",
"reported_seq": "347015",
"reported_epoch": "1338",
"state": "active+undersized+degraded",
"last_fresh": "2017-10-15 20:35:36.930611",
"last_change": "2017-10-15 20:30:35.752042",
"last_active": "2017-10-15 20:35:36.930611",
"last_peered": "2017-10-15 20:35:36.930611",
"last_clean": "2017-10-15 20:30:01.443288",
"last_became_active": "2017-10-15
20:30:35.752042",
"last_became_peered": "2017-10-15
20:30:35.752042",
"last_unstale": "2017-10-15 20:35:36.930611",
"last_undegraded": "2017-10-15 20:30:35.749043",
"last_fullsized": "2017-10-15 20:30:35.749043",
"mapping_epoch": 1338,
"log_start": "1274'68440",
"ondisk_log_start": "1274'68440",
"created": 90,
"last_epoch_clean": 1331,
"parent": "0.0",
"parent_split_bits": 0,
"last_scrub": "1294'71370",
"last_scrub_stamp": "2017-10-15 09:27:31.756027",
"last_deep_scrub": "1284'70813",
"last_deep_scrub_stamp": "2017-10-14
06:35:57.556773",
"last_clean_scrub_stamp": "2017-10-15
09:27:31.756027",
"log_size": 3025,
"ondisk_log_size": 3025,
"stats_invalid": false,
"dirty_stats_invalid": false,
"omap_stats_invalid": false,
"hitset_stats_invalid": false,
"hitset_bytes_stats_invalid": false,
"pin_stats_invalid": false,
"stat_sum": {
"num_bytes": 3555027456
<tel:355%20502%207456>,
"num_objects": 917,
"num_object_clones": 255,
"num_object_copies": 1834,
"num_objects_missing_on_primary": 0,
"num_objects_missing": 0,
"num_objects_degraded": 917,
"num_objects_misplaced": 0,
"num_objects_unfound": 0,
"num_objects_dirty": 917,
"num_whiteouts": 0,
"num_read": 275095,
"num_read_kb": 111713846,
"num_write": 64324,
"num_write_kb": 11365374,
"num_scrub_errors": 0,
"num_shallow_scrub_errors": 0,
"num_deep_scrub_errors": 0,
"num_objects_recovered": 243,
"num_bytes_recovered": 1008594432,
"num_keys_recovered": 6,
"num_objects_omap": 0,
"num_objects_hit_set_archive": 0,
"num_bytes_hit_set_archive": 0,
"num_flush": 0,
"num_flush_kb": 0,
"num_evict": 0,
"num_evict_kb": 0,
"num_promote": 0,
"num_flush_mode_high": 0,
"num_flush_mode_low": 0,
"num_evict_mode_some": 0,
"num_evict_mode_full": 0,
"num_objects_pinned": 0,
"num_legacy_snapsets": 0
},
"up": [
1,
0
],
"acting": [
1,
0
],
"blocked_by": [],
"up_primary": 1,
"acting_primary": 1
},
"empty": 0,
"dne": 0,
"incomplete": 0,
"last_epoch_started": 1339,
"hit_set_history": {
"current_last_update": "0'0",
"history": []
}
}
],
"recovery_state": [
{
"name": "Started/Primary/Active",
"enter_time": "2017-10-15 20:36:33.574915",
"might_have_unfound": [
{
"osd": "0",
"status": "already probed"
}
],
"recovery_progress": {
"backfill_targets": [],
"waiting_on_backfill": [],
"last_backfill_started": "MIN",
"backfill_info": {
"begin": "MIN",
"end": "MIN",
"objects": []
},
"peer_backfill_info": [],
"backfills_in_flight": [],
"recovering": [],
"pg_backend": {
"pull_from_peer": [],
"pushing": []
}
},
"scrub": {
"scrubber.epoch_start": "1338",
"scrubber.active": false,
"scrubber.state": "INACTIVE",
"scrubber.start": "MIN",
"scrubber.end": "MIN",
"scrubber.subset_last_update": "0'0",
"scrubber.deep": false,
"scrubber.seed": 0,
"scrubber.waiting_on": 0,
"scrubber.waiting_on_whom": []
}
},
{
"name": "Started",
"enter_time": "2017-10-15 20:36:32.592892"
}
],
"agent_state": {}
}
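For reference, these are the commands under discussion for an
inconsistent PG such as 2.6 above: one lists exactly which objects and
shards the scrubber flagged, the other asks the primary OSD to repair
the PG from the other copies. A minimal sketch; list-inconsistent-obj
is available in recent Ceph releases:
# List the objects the scrub flagged as inconsistent in PG 2.6
rados list-inconsistent-obj 2.6 --format=json-pretty
# Ask the primary OSD of PG 2.6 to repair it from its peers
ceph pg repair 2.6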
2017-10-30 23:30 GMT+01:00 Gregory Farnum <gfar...@redhat.com>:
You'll need to tell us exactly what error messages you're
seeing, what the output of ceph -s is, and the output of
pg query for the relevant PGs.
There's not a lot of documentation because much of this
tooling is new, it's changing quickly, and most people
don't have the kinds of problems that turn out to be
unrepairable. We should do better about that, though.
-Greg
On Mon, Oct 30, 2017, 11:40 AM Mario Giammarco <mgiamma...@gmail.com> wrote:
>[Questions to the list]
>How is it possible that the cluster cannot repair itself with ceph pg repair?
>No good copies are remaining?
>Cannot decide which copy is valid or up to date?
>If so, why not, when there is a checksum and mtime for everything?
>In this inconsistent state, which object does the cluster serve when it doesn't know which one is valid?
I am asking the same questions too; it seems strange to me that in a
fault-tolerant clustered storage system like Ceph there is no
documentation about this.
I know I am being pedantic, but please note that saying "to be sure,
use three copies" is not enough, because I am not sure what Ceph
really does when the three copies do not match.
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com