Re: [ceph-users] RGW pool contents

2015-11-26 Thread Somnath Roy
Thanks Wido !
Could you please explain a bit more about the relationship between user-created 
buckets and the objects within the .rgw.buckets.index pool ?
I am not seeing one entry created within the .rgw.buckets.index pool for each bucket.

Regards
Somnath

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Wido 
den Hollander
Sent: Wednesday, November 25, 2015 10:56 PM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] RGW pool contents

On 11/24/2015 08:48 PM, Somnath Roy wrote:
> Hi Yehuda/RGW experts,
> 
> I have one cluster with RGW up and running in the customer site.
> 
> I did some heavy performance testing on it with CosBench and, as a result, 
> wrote a significant amount of data to showcase performance.
> 
> Over time, the customer also wrote a significant amount of data into the 
> cluster using the S3 API.
> 
> Now, I want to remove the buckets/objects created by CosBench and need 
> some help on that.
> 
> I ran the following command to list the buckets.
> 
>  
> 
> "radosgw-admin bucket list"
> 
>  
> 
> The output is the following snippet..
> 
>  
> 
> "rgwdef42",
> 
> "rgwdefghijklmnop79",
> 
> "rgwyzabc43",
> 
> "rgwdefgh43",
> 
> "rgwdefghijklm200",
> 
>  
> 
> ..
> 
> ..
> 
>  
> 
> My understanding is that CosBench should create containers with a 
> "mycontainers_" prefix and objects with a "myobjects_" prefix (?). 
> But they are not there in the output of the above command.
> 
>  

Well, if it did, they should show up there.

> 
> Next, I tried to list the contents of the different rgw pools..
> 
>  
> 
> *rados -p .rgw.buckets.index ls*
> 
>  
> 
> .dir.default.5407.17
> 
> .dir.default.6063.24
> 
> .dir.default.6068.23
> 
> .dir.default.6046.7
> 
> .dir.default.6065.44
> 
> .dir.default.5409.3
> 
> ...
> 
> ...
> 
>  
> 
> Nothing with an rgw prefix... Shouldn't the bucket index objects have a 
> prefix similar to the bucket names ?
> 

No, those are the internal IDs of the buckets. You can find the actual bucket 
objects in the ".rgw" pool.

>  
> 
>  
> 
> Now, tried to get the actual objects...
> 
> *rados -p .rgw.buckets ls*
> 
>  
> 
> default.6662.5_myobjects57862
> 
> default.5193.18_myobjects6615
> 
> default.5410.5_myobjects68518
> 
> default.6661.8_myobjects7407
> 
> default.5410.22_myobjects54939
> 
> default.6651.6_myobjects23790
> 
>  
> 
> 
> 
> ...
> 
>  
> 
> So, looking at these, it seems cosbench run is creating the
> .dir.default.* buckets and the default._myobjects* objects 
> (?)
> 

No, again: .dir.default.X is the internal ID of a bucket. CosBench creates the 
"myobjects" objects in those buckets.

>  
> 
> But, these buckets are not listed by the first "radosgw-admin" 
> command, *why ?*
> 
>  
> 
> Next, I listed the contents of the .rgw pool and here is the output..
> 
>  
> 
> *rados -p .rgw ls*
> 
>  
> 
> .bucket.meta.rgwdefghijklm78:default.6069.18
> 
> rgwdef42
> 
> rgwdefghijklmnop79
> 
> rgwyzabc43
> 
> .bucket.meta.rgwdefghijklmnopqr71:default.6655.3
> 
> rgwdefgh43
> 
> .bucket.meta.rgwdefghijklm119:default.6066.25
> 
> rgwdefghijklm200
> 
> .bucket.meta.rgwxghi2:default.5203.4
> 
> rgwxjk17
> 
> rgwdefghijklm196
> 
>  
> 
> ...
> 
> ...
> 
>  
> 
> It seems this pool has the buckets listed by the radosgw-admin command.
> 
>  
> 
> Can anybody explain what the .rgw pool is supposed to contain ?
> 
>  

This pool contains only the bucket metadata objects; these reference the 
internal bucket IDs.

You can fetch this with 'radosgw-admin metadata get bucket:XX'
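For example, to build a map from bucket names to their internal IDs (which is what ties the "radosgw-admin bucket list" output to the ".dir.<id>" index objects and the "<id>_<object>" data objects), something along these lines should work. This is a sketch only: the exact JSON layout of 'metadata get' differs between releases, so it simply searches the parsed output for a 'bucket_id' key:

import json
import subprocess

def find_key(node, wanted):
    # Depth-first search for a key in nested dicts/lists.
    if isinstance(node, dict):
        for key, value in node.items():
            if key == wanted:
                return value
            found = find_key(value, wanted)
            if found is not None:
                return found
    elif isinstance(node, list):
        for item in node:
            found = find_key(item, wanted)
            if found is not None:
                return found
    return None

names = json.loads(subprocess.check_output(
    ['radosgw-admin', 'bucket', 'list']).decode('utf-8'))
for name in names:
    meta = json.loads(subprocess.check_output(
        ['radosgw-admin', 'metadata', 'get', 'bucket:%s' % name]).decode('utf-8'))
    print('%s -> %s' % (name, find_key(meta, 'bucket_id')))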

> 
> Also, what is the difference between the .users.uid and .users pools ?
> 
>  

In the .users.uid pool RGW can do a quick query for user IDs, since that is 
required for matching ACLs which might be on a bucket and/or object.

Wido

> 
>  
> 
> Appreciate any help on this.
> 
>  
> 
> Thanks & Regards
> 
> Somnath
> 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 


--
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD on XFS ENOSPC at 84% data / 5% inode and inode64?

2015-11-26 Thread Laurent GUERBY
On Thu, 2015-11-26 at 07:52 +, Межов Игорь Александрович wrote:
> Hi!
> 
> >After our trouble with ext4/xattr soft lockup kernel bug we started
> >moving some of our OSD to XFS, we're using ubuntu 14.04 3.19 kernel
> >and ceph 0.94.5.
> 
> It was a rather serious bug, but there is a small patch at kernel.org
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=107301
> https://bugzilla.kernel.org/attachment.cgi?id=192351
> 
> We also have this problem on both 3.16 (Jessie) and 4.2 (Sid) Debian kernels,
> but, after applying this patch, the problem is gone. There were no 
> hangs/lockups during the last week. The patch is for the 3.18 kernel 
> (I think), but it is easy to port it manually to 4.2 - the ext4 code does 
> not change too much.
> 
> 
> Megov Igor
> CIO, Yuterra
> 

Hi,

I wrote the patch in the bugzilla for 3.19 (ubuntu trusty sources); the
patch I submitted against the current Linus kernel tree is here (only
mechanical changes):

http://marc.info/?l=linux-ext4&m=144801312605200&w=2

The kernel developers have not proposed any solution but don't
want to accept my patch. I will continue to work on this in my spare
time (not a lot currently), but ceph users of ext4 on any kernel
must install it, otherwise they'll be in lots of trouble.

Also I forgot to ask about XFS and inode size:

http://xfs.org/docs/xfsdocs-xml-dev/XFS_User_Guide/tmp/en-US/html/ch04s05.html

Should we use "mkfs.xfs -i size=512" now for ceph volumes?

Thanks in advance,

Laurent

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Upgrade to hammer, crush tuneables issue

2015-11-26 Thread Tomasz Kuzemko
This has nothing to do with the number of seconds between backfills. It is
actually the number of objects from a PG being scanned during a single op
when a PG is backfilled. From what I can tell by looking at the source code,
the impact on performance comes from the fact that during this scanning the PG
is locked for other operations.

From my benchmarks it's clearly evident that this has a big impact on client
latency during backfill. The lower the values for osd_backfill_scan_min and
osd_backfill_scan_max, the less impact on latency but the *longer* the recovery
time. Changing these values online will probably take effect only for PGs
on which backfill has not yet started, which can explain why you did not
see an immediate effect of changing these on the fly.

--
Tomasz Kuzemko
tom...@kuzemko.net

2015-11-26 0:24 GMT+01:00 Robert LeBlanc :
>
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256
>
> I don't think this does what you think it does.
>
> This will almost certainly starve the client of IO. This is the number
> of seconds between backfills, not the number of objects being scanned
> during a backfill. Setting these to higher values will make recovery
> take longer, but hardly affect the client. Setting these to low values
> will increase the rate of recovery so it takes less time, but will
> impact the performance of the clients.
>
> Also, I haven't had much luck changing these on the fly for
> recovery/backfill already in progress or queued.
> - 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>
>
> On Wed, Nov 25, 2015 at 2:42 PM, Tomasz Kuzemko  wrote:
> > To ease on clients you can change osd_backfill_scan_min and
> > osd_backfill_scan_max to 1. It's possible to change this online:
> > ceph tell osd.\* injectargs '--osd_backfill_scan_min 1'
> > ceph tell osd.\* injectargs '--osd_backfill_scan_max 1'
> >
> > 2015-11-24 16:52 GMT+01:00 Joe Ryner :
> >>
> >> Hi,
> >>
> >> Last night I upgraded my cluster from Centos 6.5 -> Centos 7.1 and in the
> >> process upgraded from Emperor -> Firefly -> Hammer
> >>
> >> When I finished I changed the crush tunables from
> >> ceph osd crush tunables legacy -> ceph osd crush tunables optimal
> >>
> >> I knew this would cause data movement.  But the IO for my clients is
> >> unacceptable.  Can any please tell what the best settings are for my
> >> configuration.  I have 2 Dell R720 Servers and 2 Dell R730 servers.  I have
> >> 36 1TB SATA SSD Drives in my cluster.  The servers have 128 GB of RAM.
> >>
> >> Below is some detail that might help.  According to my calculations the
> >> rebalance will take over a day.
> >>
> >> I would greatly appreciate some help on this.
> >>
> >> Thank you,
> >>
> >> Joe
> >>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Undersized pgs problem

2015-11-26 Thread ЦИТ РТ-Курамшин Камиль Фидаилевич
It seems that you played around with the crushmap and did something wrong.
Compare the output of 'ceph osd tree' with the crushmap. There are some 'osd' devices 
renamed to 'device' - I think that is where your problem is.

Sent from a mobile device.

-Original Message-
From: Vasiliy Angapov 
To: ceph-users 
Sent: Thu, 26 Nov 2015 7:53
Subject: [ceph-users] Undersized pgs problem

Hi, colleagues!

I have a small 4-node Ceph cluster (0.94.2); all pools have size 3, min_size 1.
During the night one host failed and the cluster was unable to rebalance, saying
there are a lot of undersized pgs.

root@slpeah002:[~]:# ceph -s
cluster 78eef61a-3e9c-447c-a3ec-ce84c617d728
 health HEALTH_WARN
1486 pgs degraded
1486 pgs stuck degraded
2257 pgs stuck unclean
1486 pgs stuck undersized
1486 pgs undersized
recovery 80429/555185 objects degraded (14.487%)
recovery 40079/555185 objects misplaced (7.219%)
4/20 in osds are down
1 mons down, quorum 1,2 slpeah002,slpeah007
 monmap e7: 3 mons at
{slpeah001=192.168.254.11:6780/0,slpeah002=192.168.254.12:6780/0,slpeah007=172.31.252.46:6789/0}
election epoch 710, quorum 1,2 slpeah002,slpeah007
 osdmap e14062: 20 osds: 16 up, 20 in; 771 remapped pgs
  pgmap v7021316: 4160 pgs, 5 pools, 1045 GB data, 180 kobjects
3366 GB used, 93471 GB / 96838 GB avail
80429/555185 objects degraded (14.487%)
40079/555185 objects misplaced (7.219%)
1903 active+clean
1486 active+undersized+degraded
 771 active+remapped
  client io 0 B/s rd, 246 kB/s wr, 67 op/s

  root@slpeah002:[~]:# ceph osd tree
ID  WEIGHT   TYPE NAME  UP/DOWN REWEIGHT PRIMARY-AFFINITY
 -1 94.63998 root default
 -9 32.75999 host slpeah007
 72  5.45999 osd.72  up  1.0  1.0
 73  5.45999 osd.73  up  1.0  1.0
 74  5.45999 osd.74  up  1.0  1.0
 75  5.45999 osd.75  up  1.0  1.0
 76  5.45999 osd.76  up  1.0  1.0
 77  5.45999 osd.77  up  1.0  1.0
-10 32.75999 host slpeah008
 78  5.45999 osd.78  up  1.0  1.0
 79  5.45999 osd.79  up  1.0  1.0
 80  5.45999 osd.80  up  1.0  1.0
 81  5.45999 osd.81  up  1.0  1.0
 82  5.45999 osd.82  up  1.0  1.0
 83  5.45999 osd.83  up  1.0  1.0
 -3 14.56000 host slpeah001
  1  3.64000  osd.1 down  1.0  1.0
 33  3.64000 osd.33down  1.0  1.0
 34  3.64000 osd.34down  1.0  1.0
 35  3.64000 osd.35down  1.0  1.0
 -2 14.56000 host slpeah002
  0  3.64000 osd.0   up  1.0  1.0
 36  3.64000 osd.36  up  1.0  1.0
 37  3.64000 osd.37  up  1.0  1.0
 38  3.64000 osd.38  up  1.0  1.0

Crushmap:

 # begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0
device 1 osd.1
device 2 device2
device 3 device3
device 4 device4
device 5 device5
device 6 device6
device 7 device7
device 8 device8
device 9 device9
device 10 device10
device 11 device11
device 12 device12
device 13 device13
device 14 device14
device 15 device15
device 16 device16
device 17 device17
device 18 device18
device 19 device19
device 20 device20
device 21 device21
device 22 device22
device 23 device23
device 24 device24
device 25 device25
device 26 device26
device 27 device27
device 28 device28
device 29 device29
device 30 device30
device 31 device31
device 32 device32
device 33 osd.33
device 34 osd.34
device 35 osd.35
device 36 osd.36
device 37 osd.37
device 38 osd.38
device 39 device39
device 40 device40
device 41 device41
device 42 device42
device 43 device43
device 44 device44
device 45 device45
device 46 device46
device 47 device47
device 48 device48
device 49 device49
device 50 device50
device 51 device51
device 52 device52
device 53 device53
device 54 device54
device 55 device55
device 56 device56
device 57 device57
device 58 device58
device 59 device59
device 60 device60
device 61 device61
device 62 device62
device 63 device63
device 64 device64
device 65 device65
device 66 device66
device 67 device67
device 68 device68
device 69 device69
device 70 device70
device 71 device71
device 72 osd.72
device 73 osd.73
device 74 osd.74
device 75 osd.75
device 76 osd.76
device 77 osd.77
device 78 osd.78
device 79

[ceph-users] Infernalis: best practices to start/stop

2015-11-26 Thread Marc Boisis

Hi,

I want to know the best practices to start or stop all OSDs of a node 
with Infernalis.
Before, with init, we used « /etc/init.d/ceph start »; now with systemd I have a 
unit per OSD: "systemctl start ceph-osd@171.service"
Where is the global one ?
Thanks in advance!
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Infernalis: best practices to start/stop

2015-11-26 Thread Daniel Swarbrick
SUSE has pretty good documentation about interacting with Ceph using
systemctl -
https://www.suse.com/documentation/ses-1/book_storage_admin/data/ceph_operating_services.html

The following should work:

systemctl start ceph-osd*

On 26/11/15 12:46, Marc Boisis wrote:
> 
> Hi,
> 
> I want to know what are the best practices to start or stop all OSDs of a 
> node with infernalis.
> Before with init, we used « /etc/init.d/ceph start »  now with systemd I have 
> a script per osd : "systemctl start ceph-osd@171.service"
> Where is the global one ?
> Thanks in advance!
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Undersized pgs problem

2015-11-26 Thread Irek Fasikhov
Hi.
Vasiliy, yes, it is a problem with the crushmap. Look at the weights:
" -3 14.56000 host slpeah001
 -2 14.56000 host slpeah002
 "

Best regards, Irek Fasikhov
Mobile: +79229045757

2015-11-26 13:16 GMT+03:00 ЦИТ РТ-Курамшин Камиль Фидаилевич <
kamil.kurams...@tatar.ru>:

> It seams that you played around with crushmap, and done something wrong.
> Compare the look of 'ceph osd tree' and crushmap. There are some 'osd'
> devices renamed to 'device' think threre is you problem.
>
> Sent from a mobile device.
>
>
> -Original Message-
> From: Vasiliy Angapov 
> To: ceph-users 
> Sent: Thu, 26 Nov 2015 7:53
> Subject: [ceph-users] Undersized pgs problem
>
> Hi, colleagues!
>
> I have small 4-node CEPH cluster (0.94.2), all pools have size 3, min_size
> 1.
> This night one host failed and cluster was unable to rebalance saying
> there are a lot of undersized pgs.
>
> root@slpeah002:[~]:# ceph -s
> cluster 78eef61a-3e9c-447c-a3ec-ce84c617d728
>  health HEALTH_WARN
> 1486 pgs degraded
> 1486 pgs stuck degraded
> 2257 pgs stuck unclean
> 1486 pgs stuck undersized
> 1486 pgs undersized
> recovery 80429/555185 <80429555185> objects degraded
> (14.487%)
> recovery 40079/555185 objects misplaced (7.219%)
> 4/20 in osds are down
> 1 mons down, quorum 1,2 slpeah002,slpeah007
>  monmap e7: 3 mons at
> {slpeah001=
> 192.168.254.11:6780/0,slpeah002=192.168.254.12:6780/0,slpeah007=172.31.252.46:6789/0}
>
> election epoch 710, quorum 1,2 slpeah002,slpeah007
>  osdmap e14062: 20 osds: 16 up, 20 in; 771 remapped pgs
>   pgmap v7021316: 4160 pgs, 5 pools, 1045 GB data, 180 kobjects
> 3366 GB used, 93471 GB / 96838 GB avail
> 80429/555185 <80429555185> objects degraded (14.487%)
> 40079/555185 objects misplaced (7.219%)
> 1903 active+clean
> 1486 active+undersized+degraded
>  771 active+remapped
>   client io 0 B/s rd, 246 kB/s wr, 67 op/s
>
>   root@slpeah002:[~]:# ceph osd tree
> ID  WEIGHT   TYPE NAME  UP/DOWN REWEIGHT PRIMARY-AFFINITY
>  -1 94.63998 root default
>  -9 32.75999 host slpeah007
>  72  5.45999 osd.72  up  1.0  1.0
>  73  5.45999 osd.73  up  1.0  1.0
>  74  5.45999 osd.74  up  1.0  1.0
>  75  5.45999 osd.75  up  1.0  1.0
>  76  5.45999 osd.76  up  1.0  1.0
>  77  5.45999 osd.77  up  1.0  1.0
> -10 32.75999 host slpeah008
>  78  5.45999 osd.78  up  1.0  1.0
>  79  5.45999 osd.79  up  1.0  1.0
>  80  5.45999 osd.80  up  1.0  1.0
>  81  5.45999 osd.81  up  1.0  1.0
>  82  5.45999 osd.82  up  1.0  1.0
>  83  5.45999 osd.83  up  1.0  1.0
>  -3 14.56000 host slpeah001
>   1  3.64000  osd.1 down  1.0  1.0
>  33  3.64000 osd.33down  1.0  1.0
>  34  3.64000 osd.34down  1.0  1.0
>  35  3.64000 osd.35down  1.0  1.0
>  -2 14.56000 host slpeah002
>   0  3.64000 osd.0   up  1.0  1.0
>  36  3.64000 osd.36  up  1.0  1.0
>  37  3.64000 osd.37  up  1.0  1.0
>  38  3.64000 osd.38  up  1.0  1.0
>
> Crushmap:
>
>  # begin crush map
> tunable choose_local_tries 0
> tunable choose_local_fallback_tries 0
> tunable choose_total_tries 50
> tunable chooseleaf_descend_once 1
> tunable chooseleaf_vary_r 1
> tunable straw_calc_version 1
> tunable allowed_bucket_algs 54
>
> # devices
> device 0 osd.0
> device 1 osd.1
> device 2 device2
> device 3 device3
> device 4 device4
> device 5 device5
> device 6 device6
> device 7 device7
> device 8 device8
> device 9 device9
> device 10 device10
> device 11 device11
> device 12 device12
> device 13 device13
> device 14 device14
> device 15 device15
> device 16 device16
> device 17 device17
> device 18 device18
> device 19 device19
> device 20 device20
> device 21 device21
> device 22 device22
> device 23 device23
> device 24 device24
> device 25 device25
> device 26 device26
> device 27 device27
> device 28 device28
> device 29 device29
> device 30 device30
> device 31 device31
> device 32 device32
> device 33 osd.33
> device 34 osd.34
> device 35 osd.35
> device 36 osd.36
> device 37 osd.37
> device 38 osd.38
> device 39 device39
> device 40 device40
> device 41 device41
> device 42 device42
> device 43 device43
> device 44 device44
> device 45 device45
> device 46 device46
> device 47 device47
> device 48 

Re: [ceph-users] Infernalis: best practices to start/stop

2015-11-26 Thread Marc Boisis
The documentation is good but it doesn’t work on my CentOS 7:
root@cephrr1n8:/root > systemctl status "ceph*"
ceph\x2a.service
   Loaded: not-found (Reason: No such file or directory)
   Active: inactive (dead)

Maybe a bug with CentOS’s systemd release; is there anybody running CentOS 7 + 
Infernalis?


> Le 26 nov. 2015 à 13:15, Daniel Swarbrick  
> a écrit :
> 
> SUSE has pretty good documentation about interacting with Ceph using
> systemctl -
> https://www.suse.com/documentation/ses-1/book_storage_admin/data/ceph_operating_services.html
> 
> The following should work:
> 
> systemctl start ceph-osd*
> 
> On 26/11/15 12:46, Marc Boisis wrote:
>> 
>> Hi,
>> 
>> I want to know what are the best practices to start or stop all OSDs of a 
>> node with infernalis.
>> Before with init, we used « /etc/init.d/ceph start »  now with systemd I 
>> have a script per osd : "systemctl start ceph-osd@171.service"
>> Where is the global one ?
>> Thanks in advance!
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Change both client/cluster network subnets

2015-11-26 Thread Nasos Pan



Hi guys. For some months I had a simple working Ceph cluster with 3 nodes and 3 
monitors. The client, monitor and cluster networks were on redundant 10Gbps 
ports in the same subnet, 10.10.10.0/24.

Here is the conf
#
[global]
 auth client required = cephx
 auth cluster required = cephx
 auth service required = cephx
 auth supported = cephx
 cluster network = 10.0.0.0/24
 filestore xattr use omap = true
 fsid = 169d99bb-9b62-459e-8e5e-2d101a8c17b2
 keyring = /etc/pve/priv/$cluster.$name.keyring
 osd journal size = 5120
 osd pool default min size = 1
 public network = 10.0.0.0/24
 osd mount options xfs = rw,noatime,inode64

[osd]
 keyring = /var/lib/ceph/osd/ceph-$id/keyring

[mon.1]
 host = test2
 mon addr = 10.0.0.2:6789

[mon.0]
 host = test1
 mon addr = 10.0.0.1:6789

[mon.2]
 host = test3
 mon addr = 10.0.0.3:6789


###

For several reasons I want to move the whole configuration to another subnet, 
172.16.0.0/24.
I can stop all IO traffic to the cluster, scrubbing, snapshot backups and 
everything else that would cause changes (anything else?).
Just to be sure, I have scrub and deep-scrub off and osd noout set.

What is the best way to make this change? I don't mind rebooting if 
necessary.
Apparently just changing 10.10.10 to 172.16.0 and restarting services one by one 
(either OSDs first or mons first) didn't work.

Any help? As long as I have osd noout set, can I just stop all OSDs, stop the mons 
and then start them one by one again?

Thanks!
Nasos Pan



  ___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Scrubbing question

2015-11-26 Thread Major Csaba

Hi,

On 11/25/2015 06:41 PM, Robert LeBlanc wrote:

Since the one that is different is not your primary for the pg, then
pg repair is safe.

Ok, that's clear thanks.
I think we managed to identify the root cause of the scrubbing errors, 
even though the files are identical.
It seems to be a hardware issue (faulty RAM module), which is really 
hard to detect, even if you have an ECC capable module.


The glitch happens here:
node2:~# while true; do sha1sum 
/var/lib/ceph/osd/ceph-12/current/1.14e_head/DIR_E/DIR_4/DIR_D/rb.0.532aa.238e1f29.00016ce4__head_DE208D4E__1; 
sleep 0.1; done
acd62deb72530e22b7ebdce3e2e47e0480af533b 
/var/lib/ceph/osd/ceph-12/current/1.14e_head/DIR_E/DIR_4/DIR_D/rb.0.532aa.238e1f29.00016ce4__head_DE208D4E__1
...
acd62deb72530e22b7ebdce3e2e47e0480af533b 
/var/lib/ceph/osd/ceph-12/current/1.14e_head/DIR_E/DIR_4/DIR_D/rb.0.532aa.238e1f29.00016ce4__head_DE208D4E__1
acd62deb72530e22b7ebdce3e2e47e0480af533b 
/var/lib/ceph/osd/ceph-12/current/1.14e_head/DIR_E/DIR_4/DIR_D/rb.0.532aa.238e1f29.00016ce4__head_DE208D4E__1
acd62deb72530e22b7ebdce3e2e47e0480af533b 
/var/lib/ceph/osd/ceph-12/current/1.14e_head/DIR_E/DIR_4/DIR_D/rb.0.532aa.238e1f29.00016ce4__head_DE208D4E__1
acd62deb72530e22b7ebdce3e2e47e0480af533b 
/var/lib/ceph/osd/ceph-12/current/1.14e_head/DIR_E/DIR_4/DIR_D/rb.0.532aa.238e1f29.00016ce4__head_DE208D4E__1
4163ca9a76ed7b0b9f0e69ab5a1793cd1cf7d1c4 
/var/lib/ceph/osd/ceph-12/current/1.14e_head/DIR_E/DIR_4/DIR_D/rb.0.532aa.238e1f29.00016ce4__head_DE208D4E__1

So, sometimes it calculates different values. We managed to copy this 
file several times to find the difference:

# diff 48.bin 49.bin
40095c40095
< 
hMxs8+iPzA5BRi/Nq4iuovTkR/Q9RXV15qgHTiGO6jtPvjT5bdQFZQH8BuCP65E4JDmn8yC7/laC
---
> 
hMxs8+iPzA5BRi/Nq4iuovTkR/Q9RXV15qgHTiGO6jtTvjT5bdQFZQH8BuCP65E4JDmn8yC7/laC

So, it has a single bit difference (0x50 vs 0x54)

I think this presentation could be very useful about the silent 
corruption of data:

https://www.nsc.liu.se/lcsc2007/presentations/LCSC_2007-kelemen.pdf

We will test all of our RAM modules now (it should have happened before, 
of course...), but it seems you have to be very careful with the cheap 
commodity hardware.


Regards,
Csaba

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Scrubbing question

2015-11-26 Thread Tomasz Kuzemko
Hi,
I have also seen inconsistent PGs despite md5 being the same on all
objects, however all my hardware uses ECC RAM, which as I understand
should prevent this type of error. To be clear - in your case you were
using ECC or non-ECC module?

--
Tomasz Kuzemko
tomasz.kuze...@ovh.net

W dniu 26.11.2015 o 15:23, Major Csaba pisze:
> Hi,
> 
> On 11/25/2015 06:41 PM, Robert LeBlanc wrote:
>> Since the one that is different is not your primary for the pg, then
>> pg repair is safe.
> Ok, that's clear thanks.
> I think we managed to identify the root cause of the scrubbing errors
> even if the files are identical.
> It seems to be a hardware issue (faulty RAM module), which is really
> hard to detect, even if you have an ECC capable module.
> 
> The glitch happens here:
> *node2:~# while true; do sha1sum
> /var/lib/ceph/osd/ceph-12/current/1.14e_head/DIR_E/DIR_4/DIR_D/rb.0.532aa.238e1f29.00016ce4__head_DE208D4E__1;
> sleep 0.1; done**
> **acd62deb72530e22b7ebdce3e2e47e0480af533b 
> /var/lib/ceph/osd/ceph-12/current/1.14e_head/DIR_E/DIR_4/DIR_D/rb.0.532aa.238e1f29.00016ce4__head_DE208D4E__1**
> **...
> **acd62deb72530e22b7ebdce3e2e47e0480af533b 
> /var/lib/ceph/osd/ceph-12/current/1.14e_head/DIR_E/DIR_4/DIR_D/rb.0.532aa.238e1f29.00016ce4__head_DE208D4E__1**
> **acd62deb72530e22b7ebdce3e2e47e0480af533b 
> /var/lib/ceph/osd/ceph-12/current/1.14e_head/DIR_E/DIR_4/DIR_D/rb.0.532aa.238e1f29.00016ce4__head_DE208D4E__1**
> **acd62deb72530e22b7ebdce3e2e47e0480af533b 
> /var/lib/ceph/osd/ceph-12/current/1.14e_head/DIR_E/DIR_4/DIR_D/rb.0.532aa.238e1f29.00016ce4__head_DE208D4E__1**
> **acd62deb72530e22b7ebdce3e2e47e0480af533b 
> /var/lib/ceph/osd/ceph-12/current/1.14e_head/DIR_E/DIR_4/DIR_D/rb.0.532aa.238e1f29.00016ce4__head_DE208D4E__1**
> **4163ca9a76ed7b0b9f0e69ab5a1793cd1cf7d1c4 
> /var/lib/ceph/osd/ceph-12/current/1.14e_head/DIR_E/DIR_4/DIR_D/rb.0.532aa.238e1f29.00016ce4__head_DE208D4E__1**
> ***
> 
> So, sometimes it calculates different values. We managed to copy this
> file several times to find the difference:
> *# diff 48.bin 49.bin **
> **40095c40095**
> **<
> hMxs8+iPzA5BRi/Nq4iuovTkR/Q9RXV15qgHTiGO6jtPvjT5bdQFZQH8BuCP65E4JDmn8yC7/laC**
> **---**
> **>
> hMxs8+iPzA5BRi/Nq4iuovTkR/Q9RXV15qgHTiGO6jtTvjT5bdQFZQH8BuCP65E4JDmn8yC7/laC*
> So, it has a single bit difference (0x50 vs 0x54)
> 
> I think this presentation could be very useful about the silent
> corruption of data:
> https://www.nsc.liu.se/lcsc2007/presentations/LCSC_2007-kelemen.pdf
> 
> We will test all of our RAM modules now (it should have happened before,
> of course...), but it seems you have to be very careful with the cheap
> commodity hardware.
> 
> Regards,
> Csaba
> 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 



signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Scrubbing question

2015-11-26 Thread Major Csaba

Hi,

We don't use ECC modules, but even ECC doesn't mean you're safe.

See the presentation I linked earlier: 
https://www.nsc.liu.se/lcsc2007/presentations/LCSC_2007-kelemen.pdf



You can see it also here, for example: 
https://pthree.org/2013/12/10/zfs-administration-appendix-c-why-you-should-use-ecc-ram/
"/However, it's important to note that ECC RAM can only correct 1 bit 
flip per byte (8 bits). If you have 2 bit flips per byte, ECC RAM will 
not be able to recover the data./"


Regards,
Csaba

On 11/26/2015 03:31 PM, Tomasz Kuzemko wrote:

Hi,
I have also seen inconsistent PGs despite md5 being the same on all
objects, however all my hardware uses ECC RAM, which as I understand
should prevent this type of error. To be clear - in your case you were
using ECC or non-ECC module?

--
Tomasz Kuzemko
tomasz.kuze...@ovh.net

W dniu 26.11.2015 o 15:23, Major Csaba pisze:

Hi,

On 11/25/2015 06:41 PM, Robert LeBlanc wrote:

Since the one that is different is not your primary for the pg, then
pg repair is safe.

Ok, that's clear thanks.
I think we managed to identify the root cause of the scrubbing errors
even if the files are identical.
It seems to be a hardware issue (faulty RAM module), which is really
hard to detect, even if you have an ECC capable module.

The glitch happens here:
*node2:~# while true; do sha1sum
/var/lib/ceph/osd/ceph-12/current/1.14e_head/DIR_E/DIR_4/DIR_D/rb.0.532aa.238e1f29.00016ce4__head_DE208D4E__1;
sleep 0.1; done**
**acd62deb72530e22b7ebdce3e2e47e0480af533b
/var/lib/ceph/osd/ceph-12/current/1.14e_head/DIR_E/DIR_4/DIR_D/rb.0.532aa.238e1f29.00016ce4__head_DE208D4E__1**
**...
**acd62deb72530e22b7ebdce3e2e47e0480af533b
/var/lib/ceph/osd/ceph-12/current/1.14e_head/DIR_E/DIR_4/DIR_D/rb.0.532aa.238e1f29.00016ce4__head_DE208D4E__1**
**acd62deb72530e22b7ebdce3e2e47e0480af533b
/var/lib/ceph/osd/ceph-12/current/1.14e_head/DIR_E/DIR_4/DIR_D/rb.0.532aa.238e1f29.00016ce4__head_DE208D4E__1**
**acd62deb72530e22b7ebdce3e2e47e0480af533b
/var/lib/ceph/osd/ceph-12/current/1.14e_head/DIR_E/DIR_4/DIR_D/rb.0.532aa.238e1f29.00016ce4__head_DE208D4E__1**
**acd62deb72530e22b7ebdce3e2e47e0480af533b
/var/lib/ceph/osd/ceph-12/current/1.14e_head/DIR_E/DIR_4/DIR_D/rb.0.532aa.238e1f29.00016ce4__head_DE208D4E__1**
**4163ca9a76ed7b0b9f0e69ab5a1793cd1cf7d1c4
/var/lib/ceph/osd/ceph-12/current/1.14e_head/DIR_E/DIR_4/DIR_D/rb.0.532aa.238e1f29.00016ce4__head_DE208D4E__1**
***

So, sometimes it calculates different values. We managed to copy this
file several times to find the difference:
*# diff 48.bin 49.bin **
**40095c40095**
**<
hMxs8+iPzA5BRi/Nq4iuovTkR/Q9RXV15qgHTiGO6jtPvjT5bdQFZQH8BuCP65E4JDmn8yC7/laC**
**---**
**>
hMxs8+iPzA5BRi/Nq4iuovTkR/Q9RXV15qgHTiGO6jtTvjT5bdQFZQH8BuCP65E4JDmn8yC7/laC*
So, it has a single bit difference (0x50 vs 0x54)

I think this presentation could be very useful about the silent
corruption of data:
https://www.nsc.liu.se/lcsc2007/presentations/LCSC_2007-kelemen.pdf

We will test all of our RAM modules now (it should have happened before,
of course...), but it seems you have to be very careful with the cheap
commodity hardware.

Regards,
Csaba



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] rbd_inst.create

2015-11-26 Thread NEVEU Stephane
Hi all,

I'm using python scripts to create rbd images as described here: 
http://docs.ceph.com/docs/giant/rbd/librbdpy/
rbd_inst.create(ioctx, 'myimage', size, old_format=False, features=1) seems to 
create a layered image
rbd_inst.create(ioctx, 'myimage', size, old_format=False, features=2) seems to 
create a striped image

Setting up "rbd default format =2" in ceph.conf and just using the following 
(without feature=x)
rbd_inst.create(ioctx, 'myimage', size) seems to create a layered + stripped 
image

If someone could point me to the documentation about those bitmasks (features), 
that would be great :) I cannot find it.

Moreover, when my images are created this way (using rbd_inst.create with 
python), there is no way to snapshot an image!
#rbd snap create rbd/myimage@snap1
 -1 librbd: failed to create snap id: (22) Invalid argument

Same thing with img.create_snap(snap) in python, snapshots are not created.
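For reference, the features argument is a bitmask of librbd's RBD_FEATURE_* flags: layering = 1 and striping v2 = 2 (so features=3 asks for both; newer releases add further bits such as exclusive-lock = 4 and object-map = 8). Below is a minimal sketch of creating a format-2 image with explicit features and snapshotting it through the Python binding; the pool, image and snapshot names are placeholders, and it does not by itself explain the EINVAL seen here:

import rados
import rbd

# Feature bits as defined by librbd (mirroring RBD_FEATURE_LAYERING / RBD_FEATURE_STRIPINGV2).
FEATURE_LAYERING = 1 << 0    # 1
FEATURE_STRIPINGV2 = 1 << 1  # 2, only meaningful together with stripe_unit/stripe_count

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    ioctx = cluster.open_ioctx('rbd')
    try:
        rbd_inst = rbd.RBD()
        # format-2 image, 4 GiB, layering only
        rbd_inst.create(ioctx, 'myimage', 4 * 1024 ** 3, old_format=False,
                        features=FEATURE_LAYERING)
        img = rbd.Image(ioctx, 'myimage')
        try:
            img.create_snap('snap1')  # equivalent to: rbd snap create rbd/myimage@snap1
        finally:
            img.close()
    finally:
        ioctx.close()
finally:
    cluster.shutdown()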




[@@ THALES GROUP INTERNAL @@]

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Scrubbing question

2015-11-26 Thread Tomasz Kuzemko
ECC will not be able to recover the data, but it will always be able to
detect that data is corrupted. AFAIK under Linux this results in
immediate halt of system, so it would not be able to report bad checksum
data during deep-scrub.

--
Tomasz Kuzemko
tomasz.kuze...@corp.ovh.com

W dniu 26.11.2015 o 15:46, Major Csaba pisze:
> Hi,
> 
> We don't use ECC modules but the ECC doesn't mean you're safe.
> 
> See the presentation I linked earlier:
> https://www.nsc.liu.se/lcsc2007/presentations/LCSC_2007-kelemen.pdf>https://www.nsc.liu.se/lcsc2007/presentations/LCSC_2007-kelemen.pdf
> 
> You can see it also here, for example:
> https://pthree.org/2013/12/10/zfs-administration-appendix-c-why-you-should-use-ecc-ram/
> "/However, it's important to note that ECC RAM can only correct 1 bit
> flip per byte (8 bits). If you have 2 bit flips per byte, ECC RAM will
> not be able to recover the data./"
> 
> Regards,
> Csaba
> 
> On 11/26/2015 03:31 PM, Tomasz Kuzemko wrote:
>> Hi,
>> I have also seen inconsistent PGs despite md5 being the same on all
>> objects, however all my hardware uses ECC RAM, which as I understand
>> should prevent this type of error. To be clear - in your case you were
>> using ECC or non-ECC module?
>>
>> --
>> Tomasz Kuzemko
>> tomasz.kuze...@ovh.net
>>
>> W dniu 26.11.2015 o 15:23, Major Csaba pisze:
>>> Hi,
>>>
>>> On 11/25/2015 06:41 PM, Robert LeBlanc wrote:
 Since the one that is different is not your primary for the pg, then
 pg repair is safe.
>>> Ok, that's clear thanks.
>>> I think we managed to identify the root cause of the scrubbing errors
>>> even if the files are identical.
>>> It seems to be a hardware issue (faulty RAM module), which is really
>>> hard to detect, even if you have an ECC capable module.
>>>
>>> The glitch happens here:
>>> *node2:~# while true; do sha1sum
>>> /var/lib/ceph/osd/ceph-12/current/1.14e_head/DIR_E/DIR_4/DIR_D/rb.0.532aa.238e1f29.00016ce4__head_DE208D4E__1;
>>> sleep 0.1; done**
>>> **acd62deb72530e22b7ebdce3e2e47e0480af533b 
>>> /var/lib/ceph/osd/ceph-12/current/1.14e_head/DIR_E/DIR_4/DIR_D/rb.0.532aa.238e1f29.00016ce4__head_DE208D4E__1**
>>> **...
>>> **acd62deb72530e22b7ebdce3e2e47e0480af533b 
>>> /var/lib/ceph/osd/ceph-12/current/1.14e_head/DIR_E/DIR_4/DIR_D/rb.0.532aa.238e1f29.00016ce4__head_DE208D4E__1**
>>> **acd62deb72530e22b7ebdce3e2e47e0480af533b 
>>> /var/lib/ceph/osd/ceph-12/current/1.14e_head/DIR_E/DIR_4/DIR_D/rb.0.532aa.238e1f29.00016ce4__head_DE208D4E__1**
>>> **acd62deb72530e22b7ebdce3e2e47e0480af533b 
>>> /var/lib/ceph/osd/ceph-12/current/1.14e_head/DIR_E/DIR_4/DIR_D/rb.0.532aa.238e1f29.00016ce4__head_DE208D4E__1**
>>> **acd62deb72530e22b7ebdce3e2e47e0480af533b 
>>> /var/lib/ceph/osd/ceph-12/current/1.14e_head/DIR_E/DIR_4/DIR_D/rb.0.532aa.238e1f29.00016ce4__head_DE208D4E__1**
>>> **4163ca9a76ed7b0b9f0e69ab5a1793cd1cf7d1c4 
>>> /var/lib/ceph/osd/ceph-12/current/1.14e_head/DIR_E/DIR_4/DIR_D/rb.0.532aa.238e1f29.00016ce4__head_DE208D4E__1**
>>> ***
>>>
>>> So, sometimes it calculates different values. We managed to copy this
>>> file several times to find the difference:
>>> *# diff 48.bin 49.bin **
>>> **40095c40095**
>>> **<
>>> hMxs8+iPzA5BRi/Nq4iuovTkR/Q9RXV15qgHTiGO6jtPvjT5bdQFZQH8BuCP65E4JDmn8yC7/laC**
>>> **---**
>>> **>
>>> hMxs8+iPzA5BRi/Nq4iuovTkR/Q9RXV15qgHTiGO6jtTvjT5bdQFZQH8BuCP65E4JDmn8yC7/laC*
>>> So, it has a single bit difference (0x50 vs 0x54)
>>>
>>> I think this presentation could be very useful about the silent
>>> corruption of data:
>>> https://www.nsc.liu.se/lcsc2007/presentations/LCSC_2007-kelemen.pdf
>>>
>>> We will test all of our RAM modules now (it should have happened before,
>>> of course...), but it seems you have to be very careful with the cheap
>>> commodity hardware.
>>>
>>> Regards,
>>> Csaba
>>>
>>>
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 



signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Scrubbing question

2015-11-26 Thread Lionel Bouton
Le 26/11/2015 15:53, Tomasz Kuzemko a écrit :
> ECC will not be able to recover the data, but it will always be able to
> detect that data is corrupted.

No. That's a theoretical impossibility, as the detection is done by some
kind of hash over the memory content, which brings the possibility of
hash collisions. For cryptographic hashes, collisions are by definition
nearly impossible to trigger, but obviously memory controllers can't use
cryptographic hashes to protect the memory content: the verification
would be prohibitive (both in hardware cost and in latency). Most ECC
implementations use Hamming codes, which correct all single-bit errors
and detect all 2-bit errors but can have false negatives for 3+ bit
errors. There's even speculation that modern hardware makes this more
likely, because individual chips now use buses that aren't 1 bit wide
anymore, so a defective chip no longer contributes only a single bit to
a byte returned by a read, but several.
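As a back-of-the-envelope illustration of the SEC-DED Hamming sizing behind this (my own arithmetic, not a statement about any particular memory controller): single-error correction over m data bits needs the smallest r with 2^r >= m + r + 1 check bits, plus one extra parity bit for double-error detection, which is where the usual 8 check bits per 64 data bits (72-bit ECC words) come from:

# Check-bit count for a SEC-DED (single-error-correct, double-error-detect) Hamming code:
# smallest r with 2**r >= m + r + 1, plus one overall parity bit.
def secded_check_bits(m):
    r = 1
    while 2 ** r < m + r + 1:
        r += 1
    return r + 1

for width in (8, 16, 32, 64, 128):
    print('%3d data bits -> %d check bits' % (width, secded_check_bits(width)))
# 64 data bits -> 8 check bits, i.e. the common 72-bit ECC word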

>  AFAIK under Linux this results in
> immediate halt of system, so it would not be able to report bad checksum
> data during deep-scrub.

It can, it's just less likely.

Lionel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Upgrade to hammer, crush tuneables issue

2015-11-26 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

Based on [1] and my experience with Hammer it is seconds. After
adjusting this back to the defaults and doing recovery in our
production cluster I saw batches of recovery start every 64 seconds.
It initially started out nice and distributed, but over time it
clumped up into the same 15 seconds. We would have 15 seconds of high
speed recovery, then nothing for almost 50 seconds.

It is quite possible that Infernalis has changed the way this works
(the master documentation shows number of objects instead of seconds).
I haven't looked at the code to know for sure. I do know that setting
these values osd_backfill_scan_min=2, osd_backfill_scan_max=16 on our
busy cluster where osd_max_backfills=1 would cause periodic slow I/O
during backfill/recovery. Setting osd_backfill_scan_min=16,
osd_backfill_scan_max=32 and osd_max_backfills=10 on the same cluster
eliminated the slow I/O.

[1] 
http://docs.ceph.com/docs/v0.80.5/rados/configuration/osd-config-ref/#backfilling
-BEGIN PGP SIGNATURE-
Version: Mailvelope v1.2.3
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJWVyyBCRDmVDuy+mK58QAAevcP/3dACiBgu/VghcMLn7mB
6d8VwqGLcq0yekkzSm+iB4INMmlC4sBoQ0OF71Yn5YRRiztIY1nr4MHRsu5+
z+mrwxTwYw4yHcHO9xKBOwESdNhJuEJikrK3LqmHcZ1mK0gaHFuK+HhTR3N3
j099qtsSWH4E2YQ1RPX6ch0lNOdGeSTcfXycKnM9eD/1TAlksipDqqa4hVNV
tYQlA2dqTPnPx3D7rdkwdQDhV+EGlcAOQjL5Vf+R1mtPCrBoFEo3oBSXItDX
qrvUT6A5xsXOtmoTfER5TYIA2jNedOismM7ectY/+qYhwjKnYdTFQ4tI9GNK
FmcuKcgG8jGdGJDoLwYa58iBs7TdEDzhDtST/OUMEBlu7NeGQmLm9LM5PAsG
MF+vHaNAYFVN2EmPnI+zGzsSv/C+LNlJRKoUwbgGZ0BT0mU+pWcf2DkbUzNV
ThgqsJHVJ0EVMZ9pMTifAroeh3VwHVi0hbCIms5qvmgGikWh8Yr4YiCprn9D
Wa0f93rl/pzUPOyWTw2951sUlGNaJUX1MpxU8YNd340QTf2eBa9XratHWcsz
JmdRUyTghpk19MU19nVuJ4vXA3iELx4EcokCVXnexCLsyg7Nbpr1tCHDbNaK
QjWQGbAbz15QrcCmeAY+79fq3Ndl5PhUfIt5tlW2+jKJgK27YaAhoTyVx7Rm
IXC8
=JwlH
-END PGP SIGNATURE-

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Thu, Nov 26, 2015 at 1:52 AM, Tomasz Kuzemko  wrote:
> This has nothing to do with the number of seconds between backfills. It is
> actually the number of objects from a PG being scanned during a single op
> when PG is backfilled. From what I can tell by looking at the source code,
> impact on performance comes from the fact that during this scanning the PG
> is locked for other operations.
>
> From my benchmarks it's clearly evident that this has big impact on client
> latency during backfill. The lower the values for osd_backfill_scan_min and
> osd_backfill_scan_max, the less impact on latency but *longer* recovery
> time. Changing these values online will probably take affect only for PGs on
> which backfill has not yet started, which can explain why you did not see
> immediate effect of changing these on the fly.
>
> --
> Tomasz Kuzemko
> tom...@kuzemko.net
>
>
> 2015-11-26 0:24 GMT+01:00 Robert LeBlanc :
>>
>> -BEGIN PGP SIGNED MESSAGE-
>> Hash: SHA256
>>
>> I don't think this does what you think it does.
>>
>> This will almost certainly starve the client of IO. This is the number
>> of seconds between backfills, not the number of objects being scanned
>> during a backfill. Setting these to higher values will make recovery
>> take longer, but hardly affect the client. Setting these to low values
>> will increase the rate of recovery so it takes less time, but will
>> impact the performance of the clients.
>>
>> Also, I haven't had much luck changing these on the fly for
>> recovery/backfill already in progress or queued.
>> - 
>> Robert LeBlanc
>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>
>>
>> On Wed, Nov 25, 2015 at 2:42 PM, Tomasz Kuzemko  wrote:
>> > To ease on clients you can change osd_backfill_scan_min and
>> > osd_backfill_scan_max to 1. It's possible to change this online:
>> > ceph tell osd.\* injectargs '--osd_backfill_scan_min 1'
>> > ceph tell osd.\* injectargs '--osd_backfill_scan_max 1'
>> >
>> > 2015-11-24 16:52 GMT+01:00 Joe Ryner :
>> >>
>> >> Hi,
>> >>
>> >> Last night I upgraded my cluster from Centos 6.5 -> Centos 7.1 and in
>> >> the
>> >> process upgraded from Emperor -> Firefly -> Hammer
>> >>
>> >> When I finished I changed the crush tunables from
>> >> ceph osd crush tunables legacy -> ceph osd crush tunables optimal
>> >>
>> >> I knew this would cause data movement.  But the IO for my clients is
>> >> unacceptable.  Can any please tell what the best settings are for my
>> >> configuration.  I have 2 Dell R720 Servers and 2 Dell R730 servers.  I
>> >> have
>> >> 36 1TB SATA SSD Drives in my cluster.  The servers have 128 GB of RAM.
>> >>
> >> >> Below is some detail that might help.  According to my calculations the
>> >> rebalance will take over a day.
>> >>
>> >> I would greatly appreciate some help on this.
>> >>
>> >> Thank you,
>> >>
>> >> Joe
>> >>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.c

[ceph-users] Modification Time of RBD Images

2015-11-26 Thread Christoph Adomeit
Hi there,

I am using Ceph-Hammer and I am wondering about the following:

What is the recommended way to find out when an rbd-Image was last modified ?

Thanks
  Christoph

-- 
Christoph Adomeit
GATWORKS GmbH
Reststrauch 191
41199 Moenchengladbach
Sitz: Moenchengladbach
Amtsgericht Moenchengladbach, HRB 6303
Geschaeftsfuehrer:
Christoph Adomeit, Hans Wilhelm Terstappen

christoph.adom...@gatworks.de Internetloesungen vom Feinsten
Fon. +49 2166 9149-32  Fax. +49 2166 9149-10
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Modification Time of RBD Images

2015-11-26 Thread Gregory Farnum
I don't think anything tracks this explicitly for RBD, but each RADOS
object does maintain an mtime you can check via the rados tool. You could
write a script to iterate through all the objects in the image and find the
most recent mtime (although a custom librados binary will be faster if you
want to do this frequently).
-Greg
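
A minimal sketch of such a script with the librados/librbd Python bindings (pool and image names are placeholders, and the type of the timestamp returned by Ioctx.stat() differs between releases):

import rados
import rbd

def newest_mtime(pool, image_name, conffile='/etc/ceph/ceph.conf'):
    # Return the most recent mtime among the RADOS objects backing an RBD image.
    # Note: list_objects() walks the whole pool, so this is O(objects in the pool).
    cluster = rados.Rados(conffile=conffile)
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx(pool)
        try:
            img = rbd.Image(ioctx, image_name)
            try:
                prefix = img.stat()['block_name_prefix']  # e.g. "rb.0.532aa.238e1f29"
            finally:
                img.close()
            latest = None
            for obj in ioctx.list_objects():
                if obj.key.startswith(prefix):
                    _size, mtime = ioctx.stat(obj.key)
                    if latest is None or mtime > latest:
                        latest = mtime
            return latest
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()

print(newest_mtime('rbd', 'myimage'))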

On Thursday, November 26, 2015, Christoph Adomeit <
christoph.adom...@gatworks.de> wrote:

> Hi there,
>
> I am using Ceph-Hammer and I am wondering about the following:
>
> What is the recommended way to find out when an rbd-Image was last
> modified ?
>
> Thanks
>   Christoph
>
> --
> Christoph Adomeit
> GATWORKS GmbH
> Reststrauch 191
> 41199 Moenchengladbach
> Sitz: Moenchengladbach
> Amtsgericht Moenchengladbach, HRB 6303
> Geschaeftsfuehrer:
> Christoph Adomeit, Hans Wilhelm Terstappen
>
> christoph.adom...@gatworks.de  Internetloesungen vom
> Feinsten
> Fon. +49 2166 9149-32  Fax. +49 2166 9149-10
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Modification Time of RBD Images

2015-11-26 Thread Jan Schermer
Find out in which block the filesystem on your RBD image stores its journal, find the 
object hosting that block in RADOS and use its mtime :-)

Jan


> On 26 Nov 2015, at 18:49, Gregory Farnum  wrote:
> 
> I don't think anything tracks this explicitly for RBD, but each RADOS object 
> does maintain an mtime you can check via the rados tool. You could write a 
> script to iterate through all the objects in the image and find the most 
> recent mtime (although a custom librados binary will be faster if you want to 
> do this frequently).
> -Greg
> 
> On Thursday, November 26, 2015, Christoph Adomeit 
> <christoph.adom...@gatworks.de> wrote:
> Hi there,
> 
> I am using Ceph-Hammer and I am wondering about the following:
> 
> What is the recommended way to find out when an rbd-Image was last modified ?
> 
> Thanks
>   Christoph
> 
> --
> Christoph Adomeit
> GATWORKS GmbH
> Reststrauch 191
> 41199 Moenchengladbach
> Sitz: Moenchengladbach
> Amtsgericht Moenchengladbach, HRB 6303
> Geschaeftsfuehrer:
> Christoph Adomeit, Hans Wilhelm Terstappen
> 
> christoph.adom...@gatworks.de  Internetloesungen vom 
> Feinsten
> Fon. +49 2166 9149-32  Fax. +49 2166 9149-10
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD on XFS ENOSPC at 84% data / 5% inode and inode64?

2015-11-26 Thread Andrey Korolyov
On Thu, Nov 26, 2015 at 1:29 AM, Laurent GUERBY  wrote:
> Hi,
>
> After our trouble with ext4/xattr soft lockup kernel bug we started
> moving some of our OSD to XFS, we're using ubuntu 14.04 3.19 kernel
> and ceph 0.94.5.
>
> We have two out of 28 rotational OSD running XFS and
> they both get restarted regularly because they're terminating with
> "ENOSPC":
>
> 2015-11-25 16:51:08.015820 7f6135153700  0 
> filestore(/var/lib/ceph/osd/ceph-11)  error (28) No space left on device not 
> handled on operation 0xa0f4d520 (12849173.0.4, or op 4, counting from 0)
> 2015-11-25 16:51:08.015837 7f6135153700  0 
> filestore(/var/lib/ceph/osd/ceph-11) ENOSPC handling not implemented
> 2015-11-25 16:51:08.015838 7f6135153700  0 
> filestore(/var/lib/ceph/osd/ceph-11)  transaction dump:
> ...
> {
> "op_num": 4,
> "op_name": "write",
> "collection": "58.2d5_head",
> "oid": 
> "53e4fed5\/rbd_data.11f20f75aac8266.000a79eb\/head\/\/58",
> "length": 73728,
> "offset": 4120576,
> "bufferlist length": 73728
> },
>
> (Writing the last 73728 bytes = 72 kbytes of 4 Mbytes if I'm reading
> this correctly)
>
> Mount options:
>
> /dev/sdb1 /var/lib/ceph/osd/ceph-11 xfs rw,noatime,attr2,inode64,noquota
>
> Space and Inodes:
>
> Filesystem Type  1K-blocks   Used Available Use% Mounted on
> /dev/sdb1  xfs  1947319356 1624460408 322858948  84% 
> /var/lib/ceph/osd/ceph-11
>
> Filesystem TypeInodes   IUsed IFree IUse% Mounted on
> /dev/sdb1  xfs   48706752 1985587  467211655% 
> /var/lib/ceph/osd/ceph-11
>
> We're only using rbd devices, so max 4 MB/object write, how
> can we get ENOSPC for a 4MB operation with 322 GB free space?
>
> The most surprising thing is that after the automatic restart
> disk usage keep increasing and we no longer get ENOSPC for a while.
>
> Did we miss a needed XFS mount option? Did other ceph users
> encounter this issue with XFS?
>
> We have no such issue with ~96% full ext4 OSD (after setting the right
> value for the various ceph "fill" options).
>
> Thanks in advance,
>
> Laurent
>

Hi, from the given numbers one can conclude that you are facing some kind
of XFS preallocation bug, because ((raw space divided by the number of
files)) is four times lower than ((raw space divided by the number of 4MB
blocks)). At a glance it could be avoided by specifying a relatively
small allocsize= mount option, of course at the cost of overall
performance; appropriate benchmarks can be found in the
ceph-users/ceph-devel archives. Also, do you plan to keep the overcommit
ratio that high forever?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD on XFS ENOSPC at 84% data / 5% inode and inode64?

2015-11-26 Thread Laurent GUERBY
On Thu, 2015-11-26 at 22:13 +0300, Andrey Korolyov wrote:
> On Thu, Nov 26, 2015 at 1:29 AM, Laurent GUERBY  wrote:
> > Hi,
> >
> > After our trouble with ext4/xattr soft lockup kernel bug we started
> > moving some of our OSD to XFS, we're using ubuntu 14.04 3.19 kernel
> > and ceph 0.94.5.
> >
> > We have two out of 28 rotational OSD running XFS and
> > they both get restarted regularly because they're terminating with
> > "ENOSPC":
> >
> > 2015-11-25 16:51:08.015820 7f6135153700  0 
> > filestore(/var/lib/ceph/osd/ceph-11)  error (28) No space left on device 
> > not handled on operation 0xa0f4d520 (12849173.0.4, or op 4, counting from 0)
> > 2015-11-25 16:51:08.015837 7f6135153700  0 
> > filestore(/var/lib/ceph/osd/ceph-11) ENOSPC handling not implemented
> > 2015-11-25 16:51:08.015838 7f6135153700  0 
> > filestore(/var/lib/ceph/osd/ceph-11)  transaction dump:
> > ...
> > {
> > "op_num": 4,
> > "op_name": "write",
> > "collection": "58.2d5_head",
> > "oid": 
> > "53e4fed5\/rbd_data.11f20f75aac8266.000a79eb\/head\/\/58",
> > "length": 73728,
> > "offset": 4120576,
> > "bufferlist length": 73728
> > },
> >
> > (Writing the last 73728 bytes = 72 kbytes of 4 Mbytes if I'm reading
> > this correctly)
> >
> > Mount options:
> >
> > /dev/sdb1 /var/lib/ceph/osd/ceph-11 xfs rw,noatime,attr2,inode64,noquota
> >
> > Space and Inodes:
> >
> > Filesystem Type  1K-blocks   Used Available Use% Mounted on
> > /dev/sdb1  xfs  1947319356 1624460408 322858948  84% 
> > /var/lib/ceph/osd/ceph-11
> >
> > Filesystem TypeInodes   IUsed IFree IUse% Mounted on
> > /dev/sdb1  xfs   48706752 1985587  467211655% 
> > /var/lib/ceph/osd/ceph-11
> >
> > We're only using rbd devices, so max 4 MB/object write, how
> > can we get ENOSPC for a 4MB operation with 322 GB free space?
> >
> > The most surprising thing is that after the automatic restart
> > disk usage keep increasing and we no longer get ENOSPC for a while.
> >
> > Did we miss a needed XFS mount option? Did other ceph users
> > encounter this issue with XFS?
> >
> > We have no such issue with ~96% full ext4 OSD (after setting the right
> > value for the various ceph "fill" options).
> >
> > Thanks in advance,
> >
> > Laurent
> >
> 
> Hi, from given numbers one can conclude that you are facing some kind
> of XFS preallocation bug, because ((raw space divided by number of
> files)) is four times lower than the ((raw space divided by 4MB
> blocks)). At a glance it could be avoided by specifying relatively
> small allocsize= mount option, of course by impacting overall
> performance, appropriate benchmarks could be found through
> ceph-users/ceph-devel. Also do you plan to preserve overcommit ratio
> to be that high forever?

Hi,

Thanks for your answer.

On these disks we have 3 active pools, all holding rbd images: a regular
3-replica one (4 MB files), an EC 4+1 (1 MB files) and an EC 8+2 (512 kB
files). We're currently copying rbd images from EC 4+1 to EC 8+2, so we
have temporarily high disk usage on some disks until we remove the EC 4+1
pool. 

We're still using straw and not straw2, so we have 59% to 96% usage
depending on the disk. We have new disks and nodes ready that we plan to
add while migrating to straw2, but we first need to choose whether to use
ext4 or XFS on these new nodes, hence my mail.

Having reread http://xfs.org/index.php/XFS_FAQ with your advice
in mind, I see why speculative preallocation could cause issues
with ceph, which in our case will have mostly fixed-size files.
And these issues are temporary, because the XFS
scanner "to perform background trimming of files with lingering post-EOF
preallocations" runs after 5 minutes:

$ cat /proc/sys/fs/xfs/speculative_prealloc_lifetime
300

A message with a problem similar to ours:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-April/038817.html

We'll probably go with XFS but adding allocsize=128k.

Sincerely

Laurent

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] filestore journal writeahead

2015-11-26 Thread Lindsay Mathieson
I'm a bit confused re this setting with regard to xfs. Docs state:

"Enables writeahead journaling, default for xfs.", which implies to me that
is on by default for xfs, but then after that is it states:

"Default:false"


So is it on or off by default for xfs? and is there a way to tell?

Also - what's its actual impact ;) will it improve or hinder performance?

thanks,
-- 
Lindsay
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com