I have 5 kernel RBD clients, all running Ubuntu 13.04 with kernel 3.11.1, that all failed with a write error at about the same time. I need help figuring out what caused the failure.

The 5 clients were all using the same pool, and each had its own image, with an 18TB XFS file system on each client.

The errors reported in syslog on all 5 clients, all at about the same time, were:

Sep 26 16:51:44 tca10 kernel: [244870.621836] rbd: rbd1: write 8000 at 89591958000 (158000)
Sep 26 16:51:44 tca10 kernel: [244870.621842] rbd: rbd1: result -28 xferred 8000

Sep 26 16:51:52 tca14 kernel: [245058.782519] rbd: rbd1: write 8000 at 89593150000 (150000)
Sep 26 16:51:52 tca14 kernel: [245058.782524] rbd: rbd1: result -28 xferred 8000

Sep 26 16:51:33 tca15 kernel: [245043.427752] rbd: rbd1: write 8000 at 89593638000 (238000)
Sep 26 16:51:33 tca15 kernel: [245043.427758] rbd: rbd1: result -28 xferred 8000

Sep 26 16:51:40 tca16 kernel: [245054.429278] rbd: rbd1: write 8000 at 89593128000 (128000)
Sep 26 16:51:40 tca16 kernel: [245054.429284] rbd: rbd1: result -28 xferred 8000

Sep 26 16:51:23 k6 kernel: [90574.093432] rbd: rbd1: write 80000 at f3e93a80000 (280000)
Sep 26 16:51:23 k6 kernel: [90574.093441] rbd: rbd1: result -28 xferred 80000
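
As an aid for reading the messages above, here is a minimal Python sketch that decodes one write/result pair. It assumes the length and offset fields are printed in hexadecimal (the offset f3e93a80000 on k6 suggests this) and that the result is a negated errno printed in decimal. The function name decode_rbd_error is only illustrative.

```python
import errno
import os
import re

# Patterns for the two rbd kernel messages quoted above.
# Assumption: length, offset and xferred are hexadecimal; result is decimal.
WRITE_RE = re.compile(
    r"rbd: (?P<dev>\w+): write (?P<length>[0-9a-f]+) at (?P<offset>[0-9a-f]+)"
)
RESULT_RE = re.compile(
    r"rbd: (?P<dev>\w+): result (?P<rc>-?\d+) xferred (?P<xferred>[0-9a-f]+)"
)


def decode_rbd_error(write_line, result_line):
    """Turn one write/result message pair into readable values."""
    w = WRITE_RE.search(write_line)
    r = RESULT_RE.search(result_line)
    if w is None or r is None:
        raise ValueError("lines do not look like rbd write/result messages")
    code = -int(r.group("rc"))  # the kernel reports errors as negative errno
    return {
        "device": w.group("dev"),
        "length_bytes": int(w.group("length"), 16),
        "offset_bytes": int(w.group("offset"), 16),
        "offset_gib": int(w.group("offset"), 16) / 2**30,
        "errno_name": errno.errorcode.get(code, "unknown"),
        "errno_text": os.strerror(code),
    }


sample = decode_rbd_error(
    "Sep 26 16:51:44 tca10 kernel: [244870.621836] rbd: rbd1: "
    "write 8000 at 89591958000 (158000)",
    "Sep 26 16:51:44 tca10 kernel: [244870.621842] rbd: rbd1: "
    "result -28 xferred 8000",
)
print(sample)
```

On Linux, errno 28 is ENOSPC ("No space left on device"), so under these assumptions each client was told there was no space left for the write.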

Each client system had been running read/write tests, some of them for more than 2 days, before the failure.

The Ceph version on the cluster is 0.67.3, running on Ubuntu 13.04 with 3.11.1 kernels. The cluster has 3 monitors and 6 OSD nodes with 15 disk drives each, for a total of 90 OSDs. All monitors and OSDs are running:

# ceph -v
ceph version 0.67.3 (408cd61584c72c0d97b774b3d8f95c6b1b06341a)

---
An ls of the rbd-pool shows
# rbd ls -l -p rbd-pool
NAME         SIZE PARENT FMT PROT LOCK
k6_tst     17578G          1
tca10_tst  17578G          1
tca14_tst  17578G          1
tca15_tst  17578G          1
tca16_tst  17578G          1

---
There is still space in the pool:
# ceph df
GLOBAL:
    SIZE     AVAIL     RAW USED     %RAW USED
    249T     118T      131T         52.60

POOLS:
    NAME         ID     USED       %USED     OBJECTS
    data         0      0          0         0
    metadata     1      0          0         0
    rbd          2      8          0         1
    rbd-pool     3      67187G     26.30     17713336

---
# ceph health detail
HEALTH_WARN 9 near full osd(s)
osd.9 is near full at 85%
osd.29 is near full at 85%
osd.43 is near full at 91%
osd.45 is near full at 88%
osd.47 is near full at 88%
osd.55 is near full at 94%
osd.59 is near full at 94%
osd.67 is near full at 94%
osd.83 is near full at 94%
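
To see how close each of these OSDs is to being marked full, here is a rough Python sketch that parses the output above and computes the remaining headroom. The 95% full threshold is an assumption (I believe it is the default for `mon osd full ratio`, but a cluster may be configured differently), and the helper name headroom is hypothetical.

```python
import re

# Output of `ceph health detail`, copied from above.
HEALTH_DETAIL = """\
HEALTH_WARN 9 near full osd(s)
osd.9 is near full at 85%
osd.29 is near full at 85%
osd.43 is near full at 91%
osd.45 is near full at 88%
osd.47 is near full at 88%
osd.55 is near full at 94%
osd.59 is near full at 94%
osd.67 is near full at 94%
osd.83 is near full at 94%
"""

NEAR_FULL_RE = re.compile(r"^(osd\.\d+) is near full at (\d+)%$", re.MULTILINE)


def headroom(text, full_percent=95):
    """Return (osd, percentage points left before full_percent), tightest first.

    full_percent=95 is an assumed threshold, not read from the cluster.
    """
    pairs = [
        (name, full_percent - int(pct)) for name, pct in NEAR_FULL_RE.findall(text)
    ]
    return sorted(pairs, key=lambda item: item[1])


for name, points_left in headroom(HEALTH_DETAIL):
    print(f"{name}: {points_left} percentage point(s) below the assumed full mark")
```

With that assumed threshold, the four OSDs at 94% are each one percentage point away from being treated as full.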

---
I did find these messages on one of my monitors, logged around the same time as the write failure:

2013-09-26 16:50:43.567007 7fc2cc197700 0 mon.tca11@0(leader).data_health(10) update_stats avail 91% total 70303160 used 2625788 avail 64083132
2013-09-26 16:51:23.519378 7fc2cc197700 1 mon.tca11@0(leader).osd e769 New setting for CEPH_OSDMAP_FULL -- doing propose
2013-09-26 16:51:23.520896 7fc2cb996700 1 mon.tca11@0(leader).osd e770 e770: 90 osds: 90 up, 90 in full
2013-09-26 16:51:23.521808 7fc2cb996700 0 log [INF] : osdmap e770: 90 osds: 90 up, 90 in full
2013-09-26 16:51:43.567118 7fc2cc197700 0 mon.tca11@0(leader).data_health(10) update_stats avail 91% total 70303160 used 2631904 avail 64077016
2013-09-26 16:52:43.567227 7fc2cc197700 0 mon.tca11@0(leader).data_health(10) update_stats avail 91% total 70303160 used 2632956 avail 64075964
2013-09-26 16:53:28.534868 7fc2cc197700 1 mon.tca11@0(leader).osd e770 New setting for CEPH_OSDMAP_FULL -- doing propose
2013-09-26 16:53:28.536477 7fc2cb996700 1 mon.tca11@0(leader).osd e771 e771: 90 osds: 90 up, 90 in
2013-09-26 16:53:28.538782 7fc2cb996700 0 log [INF] : osdmap e771: 90 osds: 90 up, 90 in
2013-09-26 16:53:43.567331 7fc2cc197700 0 mon.tca11@0(leader).data_health(10) update_stats avail 91% total 70303160 used 2623788 avail 64085132
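
To line these monitor messages up against the client errors, here is a small Python sketch that extracts the osdmap epochs from the two `log [INF]` lines above and measures how long the map carried the `full` marker. The regular expression is written only against the line format shown here, and the helper names are hypothetical.

```python
from datetime import datetime
import re

# The two cluster-log lines from the monitor output above.
MON_LOG = """\
2013-09-26 16:51:23.521808 7fc2cb996700 0 log [INF] : osdmap e770: 90 osds: 90 up, 90 in full
2013-09-26 16:53:28.538782 7fc2cb996700 0 log [INF] : osdmap e771: 90 osds: 90 up, 90 in
"""

OSDMAP_RE = re.compile(
    r"^(?P<ts>\d{4}-\d\d-\d\d \d\d:\d\d:\d\d\.\d+) .*osdmap e(?P<epoch>\d+): "
    r"\d+ osds: \d+ up, \d+ in(?P<full> full)?$",
    re.MULTILINE,
)


def osdmap_events(text):
    """Return (timestamp, epoch, is_full) for each osdmap summary line."""
    events = []
    for m in OSDMAP_RE.finditer(text):
        ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S.%f")
        events.append((ts, int(m.group("epoch")), m.group("full") is not None))
    return events


def full_windows(events):
    """Pair each 'full' epoch with the next epoch that is no longer full."""
    windows = []
    start = None
    for ts, _epoch, is_full in events:
        if is_full and start is None:
            start = ts
        elif not is_full and start is not None:
            windows.append((start, ts))
            start = None
    return windows


events = osdmap_events(MON_LOG)
for begin, end in full_windows(events):
    print(f"osdmap flagged full from {begin} to {end} "
          f"({(end - begin).total_seconds():.1f} s)")
```

The client syslog timestamps above (16:51:23 through 16:51:52) fall inside this window, assuming the clocks on the clients and the monitor are roughly in sync.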

---
All my OSDs report that they are up:
# ceph osd tree
# id    weight  type name       up/down reweight
-1      249.7   root default
-2      51.72           host tca22
0       3.63                    osd.0   up      1       
6       3.63                    osd.6   up      1       
12      3.63                    osd.12  up      1       
18      3.63                    osd.18  up      1       
24      3.63                    osd.24  up      1       
30      3.63                    osd.30  up      1       
36      2.72                    osd.36  up      1       
42      3.63                    osd.42  up      1       
48      3.63                    osd.48  up      1       
54      2.72                    osd.54  up      1       
60      3.63                    osd.60  up      1       
66      3.63                    osd.66  up      1       
72      2.72                    osd.72  up      1       
78      3.63                    osd.78  up      1       
84      3.63                    osd.84  up      1       
-3      31.5            host tca23
1       3.63                    osd.1   up      1       
7       0.26                    osd.7   up      1       
13      2.72                    osd.13  up      1       
19      2.72                    osd.19  up      1       
25      0.26                    osd.25  up      1       
31      3.63                    osd.31  up      1       
37      2.72                    osd.37  up      1       
43      0.26                    osd.43  up      1       
49      3.63                    osd.49  up      1       
55      0.26                    osd.55  up      1       
61      3.63                    osd.61  up      1       
67      0.26                    osd.67  up      1       
73      3.63                    osd.73  up      1       
79      0.26                    osd.79  up      1       
85      3.63                    osd.85  up      1       
-4      51.72           host tca24
2       3.63                    osd.2   up      1       
8       3.63                    osd.8   up      1       
14      3.63                    osd.14  up      1       
20      3.63                    osd.20  up      1       
26      3.63                    osd.26  up      1       
32      3.63                    osd.32  up      1       
38      2.72                    osd.38  up      1       
44      3.63                    osd.44  up      1       
50      3.63                    osd.50  up      1       
56      2.72                    osd.56  up      1       
62      3.63                    osd.62  up      1       
68      3.63                    osd.68  up      1       
74      2.72                    osd.74  up      1       
80      3.63                    osd.80  up      1       
86      3.63                    osd.86  up      1       
-5      31.5            host tca25
3       3.63                    osd.3   up      1       
9       0.26                    osd.9   up      1       
15      2.72                    osd.15  up      1       
21      2.72                    osd.21  up      1       
27      0.26                    osd.27  up      1       
33      3.63                    osd.33  up      1       
39      2.72                    osd.39  up      1       
45      0.26                    osd.45  up      1       
51      3.63                    osd.51  up      1       
57      0.26                    osd.57  up      1       
63      3.63                    osd.63  up      1       
69      0.26                    osd.69  up      1       
75      3.63                    osd.75  up      1       
81      0.26                    osd.81  up      1       
87      3.63                    osd.87  up      1       
-6      51.72           host tca26
4       3.63                    osd.4   up      1       
10      3.63                    osd.10  up      1       
16      3.63                    osd.16  up      1       
22      3.63                    osd.22  up      1       
28      3.63                    osd.28  up      1       
34      3.63                    osd.34  up      1       
40      2.72                    osd.40  up      1       
46      3.63                    osd.46  up      1       
52      3.63                    osd.52  up      1       
58      2.72                    osd.58  up      1       
64      3.63                    osd.64  up      1       
70      3.63                    osd.70  up      1       
76      2.72                    osd.76  up      1       
82      3.63                    osd.82  up      1       
88      3.63                    osd.88  up      1       
-7      31.5            host tca27
5       3.63                    osd.5   up      1       
11      0.26                    osd.11  up      1       
17      2.72                    osd.17  up      1       
23      2.72                    osd.23  up      1       
29      0.26                    osd.29  up      1       
35      3.63                    osd.35  up      1       
41      2.72                    osd.41  up      1       
47      0.26                    osd.47  up      1       
53      3.63                    osd.53  up      1       
59      0.26                    osd.59  up      1       
65      3.63                    osd.65  up      1       
71      0.26                    osd.71  up      1       
77      3.63                    osd.77  up      1       
83      0.26                    osd.83  up      1       
89      3.63                    osd.89  up      1       
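
The tree above and the near-full list from `ceph health detail` can be cross-referenced to check whether the near-full OSDs share a weight. The Python sketch below does this for one host (tca23) as a sample. The parsing pattern is an assumption based only on the layout shown here, and weights_by_osd is a hypothetical helper name.

```python
import re

# The tca23 section of the `ceph osd tree` output above, used as a sample.
OSD_TREE_SAMPLE = """\
-3      31.5            host tca23
1       3.63                    osd.1   up      1
7       0.26                    osd.7   up      1
13      2.72                    osd.13  up      1
19      2.72                    osd.19  up      1
25      0.26                    osd.25  up      1
31      3.63                    osd.31  up      1
37      2.72                    osd.37  up      1
43      0.26                    osd.43  up      1
49      3.63                    osd.49  up      1
55      0.26                    osd.55  up      1
61      3.63                    osd.61  up      1
67      0.26                    osd.67  up      1
73      3.63                    osd.73  up      1
79      0.26                    osd.79  up      1
85      3.63                    osd.85  up      1
"""

# Near-full OSDs as listed by `ceph health detail` above.
NEAR_FULL = {
    "osd.9", "osd.29", "osd.43", "osd.45", "osd.47",
    "osd.55", "osd.59", "osd.67", "osd.83",
}

# Matches OSD rows only; host and root rows start with a negative id.
ROW_RE = re.compile(
    r"^\d+\s+(?P<weight>[\d.]+)\s+(?P<name>osd\.\d+)\s+(?:up|down)\s",
    re.MULTILINE,
)


def weights_by_osd(tree_text):
    """Map each OSD name in the tree text to its weight column."""
    return {m.group("name"): float(m.group("weight"))
            for m in ROW_RE.finditer(tree_text)}


weights = weights_by_osd(OSD_TREE_SAMPLE)
near_full_weights = {name: w for name, w in weights.items() if name in NEAR_FULL}
print(near_full_weights)
```

In this sample, every near-full OSD on tca23 has weight 0.26. Running the same check against the full tree would show whether that holds for the other hosts.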

Kernel version on all systems
# cat /proc/version
Linux version 3.11.1-031101-generic (apw@gomeisa) (gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5) ) #201309141102 SMP Sat Sep 14 15:02:49 UTC 2013

I would really like to know why it failed, before I restart my testing.

Thanks in advance,

Eric


