[ceph-users] cephfs-journal-tool lead to data missing and show up

2016-07-14 Thread txm
I am a user of cephfs.

Recently I ran into a problem while using cephfs-journal-tool.

Some strange things happened, described below.

1. After using cephfs-journal-tool and cephfs-table-tool (I had run into the
"negative object nums" issue, so I tried these tools to repair the cephfs), I
remounted the cephfs.
2. Then I found that the old data (a directory and a file under it) was missing.
3. But after I created a new file at the root of the cephfs, the missing
directory showed up. When I deleted the newly created file, the "missing
directory" disappeared again.
4. So this is the problem: when I create something under the root of the cephfs,
the missing directory shows up; when I delete it, the "missing directory"
disappears.

Here are my questions:
1. Was this damage caused by cephfs-journal-tool?
2. If so, how did the damage come about from using cephfs-journal-tool, and
what should I do next to fix it?

Apparently I didn't lose my data, but this is strange all the same.
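
(For reference, the disaster-recovery sequence these tools are normally used
in looks roughly like the following; this is only a sketch based on the
upstream disaster-recovery documentation, exact subcommands vary by release,
and a journal export should always be taken first:

    # back up the MDS journal before changing anything
    cephfs-journal-tool journal export backup.bin

    # salvage what can be salvaged from the journal, then reset it
    cephfs-journal-tool event recover_dentries summary
    cephfs-journal-tool journal reset

    # reset the session table
    cephfs-table-tool all reset session

Skipping the export step makes it hard to tell afterwards what the tools
actually changed.)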




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] fail to add mon in a way of ceph-deploy or manually

2016-07-14 Thread 朱 彤
Using ceph-deploy:
I have ceph-node1 as admin and mon, and I would like to add another mon 
ceph-node2.
On ceph-node1:
ceph-deploy mon create ceph-node2
ceph-deploy mon add ceph-node2

The first command warns:

[ceph-node2][WARNIN] ceph-node2 is not defined in `mon initial members`
[ceph-node2][WARNIN] monitor ceph-node2 does not exist in monmap

The second command warns and throws errors:

[ceph-node2][WARNIN] IO error: lock 
/var/lib/ceph/mon/ceph-ceph-node2/store.db/LOCK: Resource temporarily 
unavailable
[ceph-node2][WARNIN] 2016-07-14 16:25:14.838255 7f6177f724c0 -1 
asok(0x7f6183ef4000) AdminSocketConfigObs::init: failed: 
AdminSocket::bind_and_listen: failed to bind the UNIX domain socket to 
'/var/run/ceph/ceph-mon.ceph-node2.asok': (17) File exists
[ceph-node2][WARNIN] 2016-07-14 16:25:14.844003 7f6177f724c0 -1 error opening 
mon data directory at '/var/lib/ceph/mon/ceph-ceph-node2': (22) Invalid argument
[ceph-node2][ERROR ] RuntimeError: command returned non-zero exit status: 1
[ceph_deploy.mon][ERROR ] Failed to execute command: ceph-mon -i ceph-node2 
--pid-file /var/run/ceph/mon.ceph-node2.pid --public-addr 192.168.57.103
[ceph_deploy][ERROR ] GenericError: Failed to add monitor to host:  ceph-node2

Now the status is:
[root@ceph-node1 ceph]# ceph status
cluster eee6caf2-a7c6-411c-8711-a87aa4a66bf2
 health HEALTH_ERR
64 pgs are stuck inactive for more than 300 seconds
64 pgs degraded
64 pgs stuck inactive
64 pgs undersized
too few PGs per OSD (21 < min 30)
 monmap e1: 1 mons at {ceph-node1=192.168.57.101:6789/0}
election epoch 8, quorum 0 ceph-node1
 osdmap e24: 3 osds: 3 up, 3 in
flags sortbitwise
  pgmap v45: 64 pgs, 1 pools, 0 bytes data, 0 objects
101836 kB used, 15227 MB / 15326 MB avail
  64 undersized+degraded+peered

Doing it manually:
(http://docs.ceph.com/docs/master/rados/operations/add-or-rm-mons/)
Running on ceph-node2 fails at step 3:
ceph auth get mon. -o {tmp}/{key-filename}

error:
2016-07-14 16:26:52.469722 7f706bff7700  0 -- :/1183426694 >> 
192.168.57.101:6789/0 pipe(0x7f707005c7d0 sd=3 :0 s=1 pgs=0 cs=0 l=1 
c=0x7f707005da90).fault
2016-07-14 16:26:55.470789 7f706bef6700  0 -- :/1183426694 >> 
192.168.57.101:6789/0 pipe(0x7f706c80 sd=3 :0 s=1 pgs=0 cs=0 l=1 
c=0x7f7060001f90).fault

So I ran the above command on ceph-node1 and then scp'd the key and map to
ceph-node2. I completed the remaining steps "successfully", but running ceph
status on ceph-node2 gives:
[root@ceph-node2 ~]# ceph status
2016-07-14 17:01:30.134496 7f43f8164700  0 -- :/2056484158 >> 
192.168.57.101:6789/0 pipe(0x7f43f405c7d0 sd=3 :0 s=1 pgs=0 cs=0 l=1 
c=0x7f43f405da90).fault
2016-07-14 17:01:33.136259 7f43efd77700  0 -- :/2056484158 >> 
192.168.57.101:6789/0 pipe(0x7f43e4000c80 sd=3 :0 s=1 pgs=0 cs=0 l=1 
c=0x7f43e4001f90).fault

And on ceph-node1, ceph status shows there is only one mon.
monmap e1: 1 mons at {ceph-node1=192.168.57.101:6789/0}
election epoch 8, quorum 0 ceph-node1

This is ceph.conf:
[global]
fsid = eee6caf2-a7c6-411c-8711-a87aa4a66bf2
mon_initial_members = ceph-node1
mon_host = 192.168.57.101
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
public_network = 192.168.57.0/24


I used the root account during the whole cluster build-up procedure.
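
(For reference, the usual ceph-deploy flow for adding a second monitor is
roughly the following; a sketch only, assuming the half-created mon directory
from the failed attempt is cleaned up first and that 192.168.57.103 really is
ceph-node2's public address:

    # on ceph-node2: remove the leftovers of the failed attempt
    rm -rf /var/lib/ceph/mon/ceph-ceph-node2
    rm -f /var/run/ceph/ceph-mon.ceph-node2.asok

    # on the admin node: list both monitors in ceph.conf, e.g.
    #   mon_initial_members = ceph-node1, ceph-node2
    #   mon_host = 192.168.57.101, 192.168.57.103
    ceph-deploy --overwrite-conf config push ceph-node1 ceph-node2
    ceph-deploy mon add ceph-node2

The "File exists" error on the admin socket and the store.db LOCK error above
are typical of a second attempt colliding with state left behind by the first
one.)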





Re: [ceph-users] Terrible RBD performance with Jewel

2016-07-14 Thread Nick Fisk
I've seen something similar when using RBD caching: I found that if you can
fill the RBD cache faster than it can flush, you get these stalls. I increased
the size of the cache and also the flush threshold, and this solved the
problem. I didn't spend much time looking into it, but it seemed that with a
smaller cache there wasn't enough working space to accept new writes while the
older ones were being flushed.
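
For what it's worth, the knobs involved are the librbd client cache settings;
a hedged example of what bumping them might look like on the client side
(values purely illustrative, the defaults being roughly a 32MB cache and 24MB
max dirty):

    [client]
    rbd cache = true
    rbd cache size = 134217728         # e.g. 128MB instead of the 32MB default
    rbd cache max dirty = 100663296    # how much dirty data may accumulate
    rbd cache target dirty = 67108864  # start flushing well before max dirty

A larger cache with earlier flushing gives more headroom to accept new writes
while older ones are still being flushed, which is exactly the stall pattern
described above.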

> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Mark 
> Nelson
> Sent: 14 July 2016 03:34
> To: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Terrible RBD performance with Jewel
> 
> As Somnath mentioned, you've got a lot of tunables set there.  Are you sure 
> those are all doing what you think they are doing?
> 
> FWIW, the xfs -n size=64k option is probably not a good idea.
> Unfortunately it can't be changed without making a new filesystem.
> 
> See:
> 
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-February/007645.html
> 
> Typically that seems to manifest as suicide timeouts on the OSDs though.
>   You'd also see kernel log messages that look like:
> 
> kernel: XFS: possible memory allocation deadlock in kmem_alloc (mode:0x8250)
> 
> Mark
> 
> On 07/13/2016 08:39 PM, Garg, Pankaj wrote:
> > I agree, but I'm dealing with something else out here with this setup.
> >
> > I just ran a test, and within 3 seconds my IOPS went to 0, and stayed
> > there for 90 seconds..then started and within seconds again went to 0.
> >
> > This doesn't seem normal at all. Here is my ceph.conf:
> >
> >
> >
> > [global]
> >
> > fsid = xx
> >
> > public_network = 
> >
> > cluster_network = 
> >
> > mon_initial_members = ceph1
> >
> > mon_host = 
> >
> > auth_cluster_required = cephx
> >
> > auth_service_required = cephx
> >
> > auth_client_required = cephx
> >
> > filestore_xattr_use_omap = true
> >
> > osd_mkfs_options = -f -i size=2048 -n size=64k
> >
> > osd_mount_options_xfs = inode64,noatime,logbsize=256k
> >
> > filestore_merge_threshold = 40
> >
> > filestore_split_multiple = 8
> >
> > osd_op_threads = 12
> >
> > osd_pool_default_size = 2
> >
> > mon_pg_warn_max_object_skew = 10
> >
> > mon_pg_warn_min_per_osd = 0
> >
> > mon_pg_warn_max_per_osd = 32768
> >
> > filestore_op_threads = 6
> >
> >
> >
> > [osd]
> >
> > osd_enable_op_tracker = false
> >
> > osd_op_num_shards = 2
> >
> > filestore_wbthrottle_enable = false
> >
> > filestore_max_sync_interval = 1
> >
> > filestore_odsync_write = true
> >
> > filestore_max_inline_xattr_size = 254
> >
> > filestore_max_inline_xattrs = 6
> >
> > filestore_queue_committing_max_bytes = 1048576000
> >
> > filestore_queue_committing_max_ops = 5000
> >
> > filestore_queue_max_bytes = 1048576000
> >
> > filestore_queue_max_ops = 500
> >
> > journal_max_write_bytes = 1048576000
> >
> > journal_max_write_entries = 1000
> >
> > journal_queue_max_bytes = 1048576000
> >
> > journal_queue_max_ops = 3000
> >
> > filestore_fd_cache_shards = 32
> >
> > filestore_fd_cache_size = 64
> >
> >
> >
> >
> >
> > *From:*Somnath Roy [mailto:somnath@sandisk.com]
> > *Sent:* Wednesday, July 13, 2016 6:06 PM
> > *To:* Garg, Pankaj; ceph-users@lists.ceph.com
> > *Subject:* RE: Terrible RBD performance with Jewel
> >
> >
> >
> > You should do that first to get a stable performance out with filestore.
> >
> > 1M seq write for the entire image should be sufficient to precondition it.
> >
> >
> >
> > *From:*Garg, Pankaj [mailto:pankaj.g...@cavium.com]
> > *Sent:* Wednesday, July 13, 2016 6:04 PM
> > *To:* Somnath Roy; ceph-users@lists.ceph.com
> > 
> > *Subject:* RE: Terrible RBD performance with Jewel
> >
> >
> >
> > No I have not.
> >
> >
> >
> > *From:*Somnath Roy [mailto:somnath@sandisk.com]
> > *Sent:* Wednesday, July 13, 2016 6:00 PM
> > *To:* Garg, Pankaj; ceph-users@lists.ceph.com
> > 
> > *Subject:* RE: Terrible RBD performance with Jewel
> >
> >
> >
> > In fact, I was wrong , I missed you are running with 12 OSDs
> > (considering one OSD per SSD). In that case, it will take ~250 second
> > to fill up the journal.
> >
> > Have you preconditioned the entire image with bigger block say 1M
> > before doing any real test ?
> >
> >
> >
> > *From:*Garg, Pankaj [mailto:pankaj.g...@cavium.com]
> > *Sent:* Wednesday, July 13, 2016 5:55 PM
> > *To:* Somnath Roy; ceph-users@lists.ceph.com
> > 
> > *Subject:* RE: Terrible RBD performance with Jewel
> >
> >
> >
> > Thanks Somnath. I will try all these, but I think there is something
> > else going on too.
> >
> > Firstly my test reaches 0 IOPS within 10 seconds sometimes.
> >
> > Secondly, when I'm at 0 IOPS, I see NO disk activity on IOSTAT and no
> > CPU activity either. This part is strange.
> >
> >
> >
> > Thanks
> >
> > Pankaj
> >
> >

Re: [ceph-users] osd failing to start

2016-07-14 Thread Martin Wilderoth
Hello

I can't really find any hardware problems. I have done disk checks and
looked at the log files.

Should the OSD fail with a core dump if there are hardware problems?

All my data seems intact; I only have:
HEALTH_ERR 915 pgs are stuck inactive for more than 300 seconds; 915 pgs
down; 915 pgs peering; 915 pgs stuck inactive;
I guess it's due to the failing OSD.

I guess I could remove the OSD and add it back as a new one, but it's always
interesting to know what's actually wrong.

 /Regards Martin

Best Regards / Vänliga Hälsningar
*Martin Wilderoth*
*VD*
Enhagslingan 1B, 187 40 Täby

Direkt: +46 8 473 60 63
Mobil: +46 70 969 09 19
martin.wilder...@linserv.se
www.linserv.se

On 14 July 2016 at 06:14, Brad Hubbard  wrote:

> On Thu, Jul 14, 2016 at 06:06:58AM +0200, Martin Wilderoth wrote:
> >  Hello,
> >
> > I have a ceph cluster where one OSD is failing to start. I have been
> > upgrading ceph to see if the error disappeared. Now I'm running jewel but I
> > still get the error message.
> >
> > -1> 2016-07-13 17:04:22.061384 7fda4d24e700  1 heartbeat_map
> is_healthy
> > 'OSD::osd_tp thread 0x7fda25dd8700' had suicide timed out after 150
>
> This appears to indicate that an OSD thread pool thread (work queue thread)
> has failed to complete an operation within the 150 second grace period.
>
> The most likely and common cause for this is hardware failure and I would
> therefore suggest you thoroughly check this device and look for indicators
> in
> syslog, dmesg, diagnostics, etc. that this device may have failed.
>
> --
> HTH,
> Brad
>
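
(For reference, the sort of checks suggested above might look like this; the
device name is a placeholder:

    dmesg | grep -i -E 'error|fail|sd[a-z]'
    smartctl -a /dev/sdX     # reallocated/pending sector counts, SMART error log
    tail -n 100 /var/log/syslog | grep -i sd

A disk that is dying slowly often passes a quick check but still shows up as
I/O retries in dmesg or as entries in the SMART error log.)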


Re: [ceph-users] 40Gb fileserver/NIC suggestions

2016-07-14 Thread Götz Reinicke - IT Koordinator
On 13.07.16 at 17:44, David wrote:
> Aside from the 10GbE vs 40GbE question, if you're planning to export
> an RBD image over smb/nfs I think you are going to struggle to reach
> anywhere near 1GB/s in a single threaded read. This is because even
> with readahead cranked right up you're still only going be hitting a
> handful of disks at a time. There's a few threads on this list about
> sequential reads with the kernel rbd client. I think CephFS would be
> more appropriate in your use case.
Thanks for that hint; as soon as our nodes are online, we will do some
testing!

Regards . Götz






Re: [ceph-users] 40Gb fileserver/NIC suggestions

2016-07-14 Thread Götz Reinicke - IT Koordinator
On 13.07.16 at 17:08, c...@jack.fr.eu.org wrote:
> I am using these for other stuff:
> http://www.supermicro.com/products/accessories/addon/AOC-STG-b4S.cfm
>
> If you want NIC, also think of the "network side" : SFP+ switch are very
> common, 40G is less common, 25G is really new (= really few products)
The network side will be 10GbE for the OSD nodes and 40GbE for the core layer.

Regards . Götz





[ceph-users] Antw: Re: SSD Journal

2016-07-14 Thread Steffen Weißgerber


>>> Christian Balzer wrote on Thursday, 14 July 2016 at 05:05:

Hello,

> Hello,
> 
> On Wed, 13 Jul 2016 09:34:35 + Ashley Merrick wrote:
> 
>> Hello,
>> 
>> Looking at using 2 x 960GB SSD's (SM863)
>>
> Massive overkill.
>  
>> Reason for larger is I was thinking would be better off with them in Raid 1 
> so enough space for OS and all Journals.
>>
> As I pointed out several times in this ML, Ceph journal usage rarely
> exceeds hundreds of MB, let alone several GB with default parameters.
> So 10GB per journal is plenty, unless you're doing something very special
> (and you aren't with normal HDDs as OSDs).
>  
>> Instead am I better off using 2 x 200GB S3700's instead, with 5 disks per a 
> SSD?
>>
> S3700s are unfortunately EOL'ed, the 200GB ones were great at 375MB/s.
> 200GB S3710s are about on par for 5 HDDs at 300MB/s, but if you can afford
> it and have a 10Gb/s network, the 400GB ones at 470MB/s would be optimal.
> 
> As for sharing the SSDs with OS, I do that all the time, the minute
> logging of a storage node really has next to no impact.
> 
> I prefer this over using DoMs for reasons of:
> 1. Redundancy
> 2. hot-swapability  
> 
> If you go the DoM route, make sure its size AND endurance are a match for
> what you need. 
> This is especially important if you were to run a MON on those machines as
> well.
> 

Because we had to replace some DoMs due to heavy MON logging: how do you
configure MON logging? On those redundant SSDs, or remotely?
 
Steffen

> Christian
> 
>> Thanks,
>> Ashley
>> 
>> -Original Message-
>> From: Christian Balzer [mailto:ch...@gol.com] 
>> Sent: 13 July 2016 01:12
>> To: ceph-users@lists.ceph.com 
>> Cc: Wido den Hollander ; Ashley Merrick 
>> 
>> Subject: Re: [ceph-users] SSD Journal
>> 
>> 
>> Hello,
>> 
>> On Tue, 12 Jul 2016 19:14:14 +0200 (CEST) Wido den Hollander wrote:
>> 
>> > 
> > > On 12 July 2016 at 15:31, Ashley Merrick wrote:
>> > > 
>> > > 
>> > > Hello,
>> > > 
>> > > Looking at final stages of planning / setup for a CEPH Cluster.
>> > > 
>> > > Per a Storage node looking @
>> > > 
>> > > 2 x SSD OS / Journal
>> > > 10 x SATA Disk
>> > > 
>> > > Will have a small Raid 1 Partition for the OS, however not sure if best 
>> > > to 
> do:
>> > > 
>> > > 5 x Journal Per a SSD
>> > 
>> > Best solution. Will give you the most performance for the OSDs. RAID-1 
>> > will 
> just burn through cycles on the SSDs.
>> > 
>> > SSDs don't fail that often.
>> >
>> What Wido wrote, but let us know what SSDs you're planning to use.
>> 
>> Because the detailed version of that sentence should read: 
>> "Well known and tested DC level SSDs whose size/endurance levels are matched 
> to the workload rarely fail, especially unexpected."
>>  
>> > Wido
>> > 
>> > > 10 x Journal on Raid 1 of two SSD's
>> > > 
>> > > Is the "Performance" increase from splitting 5 Journal's on each SSD 
>> > > worth 
> the "issue" caused when one SSD goes down?
>> > > 
>> As always, assume at least a node being the failure domain you need to be 
> able to handle.
>> 
>> Christian
>> 
>> > > Thanks,
>> > > Ashley
>> > > ___
>> > > ceph-users mailing list
>> > > ceph-users@lists.ceph.com 
>> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com 
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
>> > 
>> 
>> 
> 
> 
> -- 
> Christian BalzerNetwork/Systems Engineer
> ch...@gol.com Global OnLine Japan/Rakuten Communications
> http://www.gol.com/ 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


-- 
Klinik-Service Neubrandenburg GmbH
Allendestr. 30, 17036 Neubrandenburg
Amtsgericht Neubrandenburg, HRB 2457
Geschaeftsfuehrerin: Gudrun Kappich


[ceph-users] Slow requests on cluster.

2016-07-14 Thread Jaroslaw Owsiewski
Hi,

we have a problem with performance slowing down drastically on a cluster. We
use radosgw with the S3 protocol. Our configuration:

153 OSDs on 1.2TB SAS disks, with journals on SSDs (ratio 4:1)
- no problems with networking, no hardware issues, etc.

Output from "ceph df":

GLOBAL:
    SIZE   AVAIL   RAW USED   %RAW USED
    166T   129T    38347G     22.44
POOLS:
    NAME                 ID   USED     %USED   MAX AVAIL   OBJECTS
    .rgw                 9    70330k   0       39879G      393178
    .rgw.root            10   848      0       39879G      3
    .rgw.control         11   0        0       39879G      8
    .rgw.gc              12   0        0       39879G      32
    .rgw.buckets         13   10007G   5.86    39879G      331079052
    .rgw.buckets.index   14   0        0       39879G      2994652
    .rgw.buckets.extra   15   0        0       39879G      2
    .log                 16   475M     0       39879G      408
    .intent-log          17   0        0       39879G      0
    .users               19   729      0       39879G      49
    .users.email         20   414      0       39879G      26
    .users.swift         21   0        0       39879G      0
    .users.uid           22   17170    0       39879G      89

The problems began last Saturday.
Throughput was 400k requests per hour, mostly PUTs and HEADs of ~100kb objects.

The Ceph version is hammer.


We have two clusters with similar configurations, and both experienced the
same problems at once.

Any hints?


Latest output from "ceph -w":

2016-07-14 14:43:16.197131 osd.26 [WRN] 17 slow requests, 16 included
below; oldest blocked for > 34.766976 secs
2016-07-14 14:43:16.197138 osd.26 [WRN] slow request 32.99 seconds old,
received at 2016-07-14 14:42:43.641440: osd_op(client.75866283.0:20130084
.dir.default.75866283.65796.3 [delete] 14.122252f4
ondisk+write+known_if_redirected e18788) currently commit_sent
2016-07-14 14:43:16.197145 osd.26 [WRN] slow request 32.536551 seconds old,
received at 2016-07-14 14:42:43.660487: osd_op(client.75866283.0:20130121
.dir.default.75866283.65799.6 [delete] 14.d2dc1672
ondisk+write+known_if_redirected e18788) currently commit_sent
2016-07-14 14:43:16.197153 osd.26 [WRN] slow request 30.971549 seconds old,
received at 2016-07-14 14:42:45.225490: osd_op(client.75866283.0:20132345
gc.12 [call rgw.gc_set_entry] 12.a45046b8
ack+ondisk+write+known_if_redirected e18788) currently waiting for rw locks
2016-07-14 14:43:16.197158 osd.26 [WRN] slow request 30.967568 seconds old,
received at 2016-07-14 14:42:45.229471: osd_op(client.76495939.0:20147494
gc.12 [call rgw.gc_set_entry] 12.a45046b8
ack+ondisk+write+known_if_redirected e18788) currently waiting for rw locks
2016-07-14 14:43:16.197162 osd.26 [WRN] slow request 32.253169 seconds old,
received at 2016-07-14 14:42:43.943870: osd_op(client.75866283.0:20130663
.dir.default.75866283.65805.7 [delete] 14.2b5a1672
ondisk+write+known_if_redirected e18788) currently commit_sent
2016-07-14 14:43:17.197429 osd.26 [WRN] 3 slow requests, 2 included below;
oldest blocked for > 31.967882 secs
2016-07-14 14:43:17.197434 osd.26 [WRN] slow request 31.579897 seconds old,
received at 2016-07-14 14:42:45.617456: osd_op(client.76495939.0:20147877
gc.12 [call rgw.gc_set_entry] 12.a45046b8
ack+ondisk+write+known_if_redirected e18788) currently waiting for rw locks
2016-07-14 14:43:17.197439 osd.26 [WRN] slow request 30.897873 seconds old,
received at 2016-07-14 14:42:46.299480: osd_op(client.76495939.0:20148668
gc.12 [call rgw.gc_set_entry] 12.a45046b8
ack+ondisk+write+known_if_redirected e18788) currently waiting for rw locks


Regards
-- 
Jarosław Owsiewski


Re: [ceph-users] Slow requests on cluster.

2016-07-14 Thread Luis Periquito
Hi Jaroslaw,

several things spring to mind. I'm assuming the cluster is
healthy (other than the slow requests), right?

From the (little) information you sent, it seems the pools are
replicated with size 3; is that correct?

Are there any long-running delete processes? They usually have a
negative impact on performance, especially as they don't really show up
in the IOPS statistics.
I've also seen something like this happen when there's a slow disk/OSD. You
can try to check with "ceph osd perf" and look for higher numbers.
Usually restarting that OSD brings the cluster back to life, if that's
the issue.
If nothing shows, try a "ceph tell osd.* version"; if there's a
misbehaving OSD it usually doesn't respond to the command (slow or
even timing out).

You also don't say how many scrub/deep-scrub processes are
running. If not properly handled they are also a performance killer.

Last, but by far not least, have you ever thought of creating an SSD
pool (even a small one) and moving all pools except .rgw.buckets there? The
other ones are small enough, but would enjoy having their own "reserved" OSDs...
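
(Concretely, the checks mentioned above amount to something like this, run
from any admin node; these are just the commands referenced above collected
in one place:

    ceph osd perf            # look for OSDs with unusually high commit/apply latency
    ceph tell osd.* version  # a misbehaving OSD responds slowly or times out
    ceph -s                  # confirm nothing else, e.g. recovery, is running
)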



On Thu, Jul 14, 2016 at 1:59 PM, Jaroslaw Owsiewski
 wrote:
> Hi,
>
> we have problem with drastic performance slowing down on a cluster. We used
> radosgw with S3 protocol. Our configuration:
>
> 153 OSD SAS 1.2TB with journal on SSD disks (ratio 4:1)
> - no problems with networking, no hardware issues, etc.
>
> Output from "ceph df":
>
> GLOBAL:
> SIZE AVAIL RAW USED %RAW USED
> 166T  129T   38347G 22.44
> POOLS:
> NAME   ID USED   %USED MAX AVAIL
> OBJECTS
> .rgw   9  70330k 039879G
> 393178
> .rgw.root  10848 039879G
> 3
> .rgw.control   11  0 039879G
> 8
> .rgw.gc12  0 039879G
> 32
> .rgw.buckets   13 10007G  5.8639879G
> 331079052
> .rgw.buckets.index 14  0 039879G
> 2994652
> .rgw.buckets.extra 15  0 039879G
> 2
> .log   16   475M 039879G
> 408
> .intent-log17  0 039879G
> 0
> .users 19729 039879G
> 49
> .users.email   20414 039879G
> 26
> .users.swift   21  0 039879G
> 0
> .users.uid 22  17170 039879G
> 89
>
> Problems began on last saturday,
> Troughput was 400k req per hour - mostly PUTs and HEADs ~100kb.
>
> Ceph version is hammer.
>
>
> We have two clusters with similar configuration and both experienced same
> problems at once.
>
> Any hints
>
>
> Latest output from "ceph -w":
>
> 2016-07-14 14:43:16.197131 osd.26 [WRN] 17 slow requests, 16 included below;
> oldest blocked for > 34.766976 secs
> 2016-07-14 14:43:16.197138 osd.26 [WRN] slow request 32.99 seconds old,
> received at 2016-07-14 14:42:43.641440: osd_op(client.75866283.0:20130084
> .dir.default.75866283.65796.3 [delete] 14.122252f4
> ondisk+write+known_if_redirected e18788) currently commit_sent
> 2016-07-14 14:43:16.197145 osd.26 [WRN] slow request 32.536551 seconds old,
> received at 2016-07-14 14:42:43.660487: osd_op(client.75866283.0:20130121
> .dir.default.75866283.65799.6 [delete] 14.d2dc1672
> ondisk+write+known_if_redirected e18788) currently commit_sent
> 2016-07-14 14:43:16.197153 osd.26 [WRN] slow request 30.971549 seconds old,
> received at 2016-07-14 14:42:45.225490: osd_op(client.75866283.0:20132345
> gc.12 [call rgw.gc_set_entry] 12.a45046b8
> ack+ondisk+write+known_if_redirected e18788) currently waiting for rw locks
> 2016-07-14 14:43:16.197158 osd.26 [WRN] slow request 30.967568 seconds old,
> received at 2016-07-14 14:42:45.229471: osd_op(client.76495939.0:20147494
> gc.12 [call rgw.gc_set_entry] 12.a45046b8
> ack+ondisk+write+known_if_redirected e18788) currently waiting for rw locks
> 2016-07-14 14:43:16.197162 osd.26 [WRN] slow request 32.253169 seconds old,
> received at 2016-07-14 14:42:43.943870: osd_op(client.75866283.0:20130663
> .dir.default.75866283.65805.7 [delete] 14.2b5a1672
> ondisk+write+known_if_redirected e18788) currently commit_sent
> 2016-07-14 14:43:17.197429 osd.26 [WRN] 3 slow requests, 2 included below;
> oldest blocked for > 31.967882 secs
> 2016-07-14 14:43:17.197434 osd.26 [WRN] slow request 31.579897 seconds old,
> received at 2016-07-14 14:42:45.617456: osd_op(client.76495939.0:20147877
> gc.12 [call rgw.gc_set_entry] 12.a45046b8
> ack+ondisk+write+known_if_redirected e18788) currently waiting for rw locks
> 2016-07-14 14:43:17.197439 osd.26 [WRN] slow request 30.897873 seconds old,
> received at 2016-07-14 14:42:46.299480: osd_op(client.76495939.0:20148668
> gc.12 [call rgw.gc_set_entry] 12.a4504

Re: [ceph-users] Slow requests on cluster.

2016-07-14 Thread Jaroslaw Owsiewski
2016-07-14 15:26 GMT+02:00 Luis Periquito :

> Hi Jaroslaw,
>
> several things are springing up to mind. I'm assuming the cluster is
> healthy (other than the slow requests), right?
>
>
Yes.



> From the (little) information you send it seems the pools are
> replicated with size 3, is that correct?
>
>
True.


> Are there any long running delete processes? They usually have a
> negative impact on performance, specially as they don't really show up
> in the IOPS statistics.
>

During normal throughput we have a small amount of deletes.


> I've also something like this happen when there's a slow disk/osd. You
> can try to check with "ceph osd perf" and look for higher numbers.
> Usually restarting that OSD brings back the cluster to life, if that's
> the issue.
>

I will check this.



> If nothing shows, try a "ceph tell osd.* version"; if there's a
> misbehaving OSD they usually don't respond to the command (slow or
> even timing out).
>
> Also you also don't say how many scrub/deep-scrub processes are
> running. If not properly handled they are also a performance killer.
>
>
Scrub/deep-scrub processes are disabled


Last, but by far not least, have you ever thought of creating a SSD
> pool (even small) and move all pools but .rgw.buckets there? The other
> ones are small enough, but enjoy having their own "reserved" osds...
>
>
>

This is one of the ideas we had some time ago; we will try that.

One important thing:

sysop@s41617:~/bin$ ceph osd pool get .rgw.buckets pg_num
pg_num: 4470
sysop@s41617:~/bin$ ceph osd pool get .rgw.buckets.index pg_num
pg_num: 2048

Could this be the main problem?


Regards
-- 
Jarek


Re: [ceph-users] Slow requests on cluster.

2016-07-14 Thread Jaroslaw Owsiewski
I think the first symptoms of our problems occurred when we posted this
issue:

http://tracker.ceph.com/issues/15727

Regards
-- 
Jarek

-- 
Jarosław Owsiewski

2016-07-14 15:43 GMT+02:00 Jaroslaw Owsiewski <
jaroslaw.owsiew...@allegrogroup.com>:

> 2016-07-14 15:26 GMT+02:00 Luis Periquito :
>
>> Hi Jaroslaw,
>>
>> several things are springing up to mind. I'm assuming the cluster is
>> healthy (other than the slow requests), right?
>>
>>
> Yes.
>
>
>
>> From the (little) information you send it seems the pools are
>> replicated with size 3, is that correct?
>>
>>
> True.
>
>
>> Are there any long running delete processes? They usually have a
>> negative impact on performance, specially as they don't really show up
>> in the IOPS statistics.
>>
>
> During normal troughput we have small amount of deletes.
>
>
>> I've also something like this happen when there's a slow disk/osd. You
>> can try to check with "ceph osd perf" and look for higher numbers.
>> Usually restarting that OSD brings back the cluster to life, if that's
>> the issue.
>>
>
> I will check this.
>
>
>
>> If nothing shows, try a "ceph tell osd.* version"; if there's a
>> misbehaving OSD they usually don't respond to the command (slow or
>> even timing out).
>>
>> Also you also don't say how many scrub/deep-scrub processes are
>> running. If not properly handled they are also a performance killer.
>>
>>
> Scrub/deep-scrub processes are disabled
>
>
> Last, but by far not least, have you ever thought of creating a SSD
>> pool (even small) and move all pools but .rgw.buckets there? The other
>> ones are small enough, but enjoy having their own "reserved" osds...
>>
>>
>>
>
> This is one idea we had some time ago, we will try that.
>
> One important thing:
>
> sysop@s41617:~/bin$ ceph osd pool get .rgw.buckets pg_num
> pg_num: 4470
> sysop@s41617:~/bin$ ceph osd pool get .rgw.buckets.index pg_num
> pg_num: 2048
>
> Could be this a main problem?
>
>
> Regards
> --
> Jarek
>


Re: [ceph-users] SSD Journal

2016-07-14 Thread Christian Balzer

Hello,

On Thu, 14 Jul 2016 13:37:54 +0200 Steffen Weißgerber wrote:

> 
> 
> >>> Christian Balzer wrote on Thursday, 14 July 2016 at 05:05:
> 
> Hello,
> 
> > Hello,
> > 
> > On Wed, 13 Jul 2016 09:34:35 + Ashley Merrick wrote:
> > 
> >> Hello,
> >> 
> >> Looking at using 2 x 960GB SSD's (SM863)
> >>
> > Massive overkill.
> >  
> >> Reason for larger is I was thinking would be better off with them in Raid 
> >> 1 
> > so enough space for OS and all Journals.
> >>
> > As I pointed out several times in this ML, Ceph journal usage rarely
> > exceeds hundreds of MB, let alone several GB with default parameters.
> > So 10GB per journal is plenty, unless you're doing something very special
> > (and you aren't with normal HDDs as OSDs).
> >  
> >> Instead am I better off using 2 x 200GB S3700's instead, with 5 disks per 
> >> a 
> > SSD?
> >>
> > S3700s are unfortunately EOL'ed, the 200GB ones were great at 375MB/s.
> > 200GB S3710s are about on par for 5 HDDs at 300MB/s, but if you can afford
> > it and have a 10Gb/s network, the 400GB ones at 470MB/s would be optimal.
> > 
> > As for sharing the SSDs with OS, I do that all the time, the minute
> > logging of a storage node really has next to no impact.
> > 
> > I prefer this over using DoMs for reasons of:
> > 1. Redundancy
> > 2. hot-swapability  
> > 
> > If you go the DoM route, make sure its size AND endurance are a match for
> > what you need. 
> > This is especially important if you were to run a MON on those machines as
> > well.
> > 
> 
> Cause we had to change some DoM's due to heavy MON logging, how do you
> configure MON logging? On that redundant SSD's or remote?
>  

What model/maker DoMs were those?

Anyway, everything that runs a MON in my clusters has SSDs with sufficient
endurance for the OS.
Heck, even the 180GB Intel 530s (aka consumer SSDs) holding the OS for a
dedicated MON in a busy (but standard-level logging) cluster are only at
98% on the wear-out indicator (i.e. 2% used up) after a year.
Though that's a HW RAID1 and the controller has 512MB cache, so writes do
get nicely coalesced. 
All my other MONs (sharing OSD storage nodes) are on S37x0 SSDs.

OTOH, the Supermicro DoMs look nice enough on paper with 1 DWPD:
https://www.supermicro.com/products/nfo/SATADOM.cfm

The 64GB model should do the trick in most scenarios.
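
(A minimal sketch of pushing MON logging to syslog, and from there to a
remote log host, instead of the local DoM, in answer to the question above;
option names as in the standard config reference, whether a local copy is
kept then depends entirely on the syslog configuration:

    [mon]
    log to syslog = true
    mon cluster log to syslog = true
    # optionally stop writing the local files altogether
    log file = /dev/null
    mon cluster log file = /dev/null
)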

Christian

> Steffen
> 
> > Christian
> > 
> >> Thanks,
> >> Ashley
> >> 
> >> -Original Message-
> >> From: Christian Balzer [mailto:ch...@gol.com] 
> >> Sent: 13 July 2016 01:12
> >> To: ceph-users@lists.ceph.com 
> >> Cc: Wido den Hollander ; Ashley Merrick 
> >> 
> >> Subject: Re: [ceph-users] SSD Journal
> >> 
> >> 
> >> Hello,
> >> 
> >> On Tue, 12 Jul 2016 19:14:14 +0200 (CEST) Wido den Hollander wrote:
> >> 
> >> > 
> >> > > On 12 July 2016 at 15:31, Ashley Merrick wrote:
> >> > > 
> >> > > 
> >> > > Hello,
> >> > > 
> >> > > Looking at final stages of planning / setup for a CEPH Cluster.
> >> > > 
> >> > > Per a Storage node looking @
> >> > > 
> >> > > 2 x SSD OS / Journal
> >> > > 10 x SATA Disk
> >> > > 
> >> > > Will have a small Raid 1 Partition for the OS, however not sure if 
> >> > > best to 
> > do:
> >> > > 
> >> > > 5 x Journal Per a SSD
> >> > 
> >> > Best solution. Will give you the most performance for the OSDs. RAID-1 
> >> > will 
> > just burn through cycles on the SSDs.
> >> > 
> >> > SSDs don't fail that often.
> >> >
> >> What Wido wrote, but let us know what SSDs you're planning to use.
> >> 
> >> Because the detailed version of that sentence should read: 
> >> "Well known and tested DC level SSDs whose size/endurance levels are 
> >> matched 
> > to the workload rarely fail, especially unexpected."
> >>  
> >> > Wido
> >> > 
> >> > > 10 x Journal on Raid 1 of two SSD's
> >> > > 
> >> > > Is the "Performance" increase from splitting 5 Journal's on each SSD 
> >> > > worth 
> > the "issue" caused when one SSD goes down?
> >> > > 
> >> As always, assume at least a node being the failure domain you need to be 
> > able to handle.
> >> 
> >> Christian
> >> 
> >> > > Thanks,
> >> > > Ashley
> >> > > ___
> >> > > ceph-users mailing list
> >> > > ceph-users@lists.ceph.com 
> >> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> >> > ___
> >> > ceph-users mailing list
> >> > ceph-users@lists.ceph.com 
> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> >> > 
> >> 
> >> 
> > 
> > 
> > -- 
> > Christian BalzerNetwork/Systems Engineer
> > ch...@gol.com   Global OnLine Japan/Rakuten Communications
> > http://www.gol.com/ 
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com 
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 


-- 
Christian BalzerNetwork/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/

Re: [ceph-users] osd inside LXC

2016-07-14 Thread Daniel Gryniewicz
This is fairly standard for container deployment: one app per container 
instance.  This is how we're deploying docker in our upstream 
ceph-docker / ceph-ansible as well.


Daniel

On 07/13/2016 08:41 PM, Łukasz Jagiełło wrote:

Hi,

Just wondering why you want each OSD inside a separate LXC container? Just to
pin them to specific CPUs?

On Tue, Jul 12, 2016 at 6:33 AM, Guillaume Comte
mailto:guillaume.co...@blade-group.com>> wrote:

Hi,

I am currently defining a storage architecture based on Ceph, and I
wish to know whether I have misunderstood some things.

So, I plan to deploy on each server as many OSDs as there are free
hard drives; each OSD will run inside an LXC container.

Then I wish to turn the server itself into an RBD client for objects
created in the pools. I also wish to have an SSD for caching
(and to store the OSD logs as well).

The idea behind this is to create CRUSH rules which will keep a set
of objects within a couple of servers connected to the same pair of
switches, in order to have the best proximity between where I store
the objects and where I use them (I don't mind not having a very
strong guarantee against data loss if my whole rack powers down).

Am I already on the wrong track? Is there a way to guarantee
proximity of data with Ceph without the kind of twisted configuration
I am ready to make?

Thanks in advance,

Regards
--
*Guillaume Comte*
06 25 85 02 02  | guillaume.co...@blade-group.com

90 avenue des Ternes, 75 017 Paris






--
Łukasz Jagiełło
lukaszjagielloorg







Re: [ceph-users] Terrible RBD performance with Jewel

2016-07-14 Thread Garg, Pankaj
Something in this section is causing the 0-IOPS issue. I have not been able to
nail it down yet (I did comment out the filestore_max_inline_xattr_size
entries, and the problem still exists).
If I take out the whole [osd] section, I am able to get rid of IOPS staying at
0 for long periods of time. Performance is still not where I would expect it.
[osd]
osd_enable_op_tracker = false
osd_op_num_shards = 2
filestore_wbthrottle_enable = false
filestore_max_sync_interval = 1
filestore_odsync_write = true
#filestore_max_inline_xattr_size = 254
#filestore_max_inline_xattrs = 6
filestore_queue_committing_max_bytes = 1048576000
filestore_queue_committing_max_ops = 5000
filestore_queue_max_bytes = 1048576000
filestore_queue_max_ops = 500
journal_max_write_bytes = 1048576000
journal_max_write_entries = 1000
journal_queue_max_bytes = 1048576000
journal_queue_max_ops = 3000
filestore_fd_cache_shards = 32
filestore_fd_cache_size = 64

From: Somnath Roy [mailto:somnath@sandisk.com]
Sent: Wednesday, July 13, 2016 7:05 PM
To: Garg, Pankaj; ceph-users@lists.ceph.com
Subject: RE: Terrible RBD performance with Jewel

I am not sure whether you need to set the following. What's the point of 
reducing inline xattr stuff ? I forgot the calculation but lower values could 
redirect your xattrs to omap. Better comment those out.

filestore_max_inline_xattr_size = 254
filestore_max_inline_xattrs = 6

We could do some improvement on some of the params but nothing it seems 
responsible for the behavior you are seeing.
Could you run iotop and see if any process (like xfsaild) is doing io on the 
drives during that time ?

Thanks & Regards
Somnath

From: Garg, Pankaj [mailto:pankaj.g...@cavium.com]
Sent: Wednesday, July 13, 2016 6:40 PM
To: Somnath Roy; ceph-users@lists.ceph.com
Subject: RE: Terrible RBD performance with Jewel

I agree, but I'm dealing with something else out here with this setup.
I just ran a test, and within 3 seconds my IOPS went to 0, and stayed there for
90 seconds... then started and within seconds again went to 0.
This doesn't seem normal at all. Here is my ceph.conf:

[global]
fsid = xx
public_network = 
cluster_network = 
mon_initial_members = ceph1
mon_host = 
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
osd_mkfs_options = -f -i size=2048 -n size=64k
osd_mount_options_xfs = inode64,noatime,logbsize=256k
filestore_merge_threshold = 40
filestore_split_multiple = 8
osd_op_threads = 12
osd_pool_default_size = 2
mon_pg_warn_max_object_skew = 10
mon_pg_warn_min_per_osd = 0
mon_pg_warn_max_per_osd = 32768
filestore_op_threads = 6

[osd]
osd_enable_op_tracker = false
osd_op_num_shards = 2
filestore_wbthrottle_enable = false
filestore_max_sync_interval = 1
filestore_odsync_write = true
filestore_max_inline_xattr_size = 254
filestore_max_inline_xattrs = 6
filestore_queue_committing_max_bytes = 1048576000
filestore_queue_committing_max_ops = 5000
filestore_queue_max_bytes = 1048576000
filestore_queue_max_ops = 500
journal_max_write_bytes = 1048576000
journal_max_write_entries = 1000
journal_queue_max_bytes = 1048576000
journal_queue_max_ops = 3000
filestore_fd_cache_shards = 32
filestore_fd_cache_size = 64


From: Somnath Roy [mailto:somnath@sandisk.com]
Sent: Wednesday, July 13, 2016 6:06 PM
To: Garg, Pankaj; ceph-users@lists.ceph.com
Subject: RE: Terrible RBD performance with Jewel

You should do that first to get a stable performance out with filestore.
1M seq write for the entire image should be sufficient to precondition it.

From: Garg, Pankaj [mailto:pankaj.g...@cavium.com]
Sent: Wednesday, July 13, 2016 6:04 PM
To: Somnath Roy; ceph-users@lists.ceph.com
Subject: RE: Terrible RBD performance with Jewel

No I have not.

From: Somnath Roy [mailto:somnath@sandisk.com]
Sent: Wednesday, July 13, 2016 6:00 PM
To: Garg, Pankaj; ceph-users@lists.ceph.com
Subject: RE: Terrible RBD performance with Jewel

In fact, I was wrong , I missed you are running with 12 OSDs (considering one 
OSD per SSD). In that case, it will take ~250 second to fill up the journal.
Have you preconditioned the entire image with bigger block say 1M before doing 
any real test ?

From: Garg, Pankaj [mailto:pankaj.g...@cavium.com]
Sent: Wednesday, July 13, 2016 5:55 PM
To: Somnath Roy; ceph-users@lists.ceph.com
Subject: RE: Terrible RBD performance with Jewel

Thanks Somnath. I will try all these, but I think there is something else going 
on too.
Firstly my test reaches 0 IOPS within 10 seconds sometimes.
Secondly, when I'm at 0 IOPS, I see NO disk activity on IOSTAT and no CPU 
activity either. This part is strange.

Thanks
Pankaj

From: Somnath Roy [mailto:somnath@sa

Re: [ceph-users] Terrible RBD performance with Jewel

2016-07-14 Thread Somnath Roy
Try increasing the following, say to 10 and 128:

osd_op_num_shards = 10
filestore_fd_cache_size = 128

I hope the following was introduced after I told you, so it shouldn't be the
cause, it seems (?):

filestore_odsync_write = true

Also, comment out the following:

filestore_wbthrottle_enable = false
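
(Put together, the changes suggested above against the earlier [osd] section
would be roughly the following; these are just the suggested values collected
in one place, not a verified tuning:

    [osd]
    osd_op_num_shards = 10                  # was 2
    filestore_fd_cache_size = 128           # was 64
    #filestore_wbthrottle_enable = false    # commented out, i.e. back to default
)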



From: Garg, Pankaj [mailto:pankaj.g...@cavium.com]
Sent: Thursday, July 14, 2016 10:05 AM
To: Somnath Roy; ceph-users@lists.ceph.com
Subject: RE: Terrible RBD performance with Jewel

Something in this section is causing all the 0 IOPS issue. Have not been able 
to nail down it yet. (I did comment out the filestore_max_inline_xattr_size 
entries, and problem still exists).
If I take out the whole [osd] section, I was able to get rid of IOPS staying at 
0 for long periods of time. Performance is still not where I would expect.
[osd]
osd_enable_op_tracker = false
osd_op_num_shards = 2
filestore_wbthrottle_enable = false
filestore_max_sync_interval = 1
filestore_odsync_write = true
#filestore_max_inline_xattr_size = 254
#filestore_max_inline_xattrs = 6
filestore_queue_committing_max_bytes = 1048576000
filestore_queue_committing_max_ops = 5000
filestore_queue_max_bytes = 1048576000
filestore_queue_max_ops = 500
journal_max_write_bytes = 1048576000
journal_max_write_entries = 1000
journal_queue_max_bytes = 1048576000
journal_queue_max_ops = 3000
filestore_fd_cache_shards = 32
filestore_fd_cache_size = 64

From: Somnath Roy [mailto:somnath@sandisk.com]
Sent: Wednesday, July 13, 2016 7:05 PM
To: Garg, Pankaj; ceph-users@lists.ceph.com
Subject: RE: Terrible RBD performance with Jewel

I am not sure whether you need to set the following. What's the point of 
reducing inline xattr stuff ? I forgot the calculation but lower values could 
redirect your xattrs to omap. Better comment those out.

filestore_max_inline_xattr_size = 254
filestore_max_inline_xattrs = 6

We could do some improvement on some of the params but nothing it seems 
responsible for the behavior you are seeing.
Could you run iotop and see if any process (like xfsaild) is doing io on the 
drives during that time ?

Thanks & Regards
Somnath

From: Garg, Pankaj [mailto:pankaj.g...@cavium.com]
Sent: Wednesday, July 13, 2016 6:40 PM
To: Somnath Roy; ceph-users@lists.ceph.com
Subject: RE: Terrible RBD performance with Jewel

I agree, but I'm dealing with something else out here with this setup.
I just ran a test, and within 3 seconds my IOPS went to 0, and stayed there for
90 seconds... then started and within seconds again went to 0.
This doesn't seem normal at all. Here is my ceph.conf:

[global]
fsid = xx
public_network = 
cluster_network = 
mon_initial_members = ceph1
mon_host = 
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
osd_mkfs_options = -f -i size=2048 -n size=64k
osd_mount_options_xfs = inode64,noatime,logbsize=256k
filestore_merge_threshold = 40
filestore_split_multiple = 8
osd_op_threads = 12
osd_pool_default_size = 2
mon_pg_warn_max_object_skew = 10
mon_pg_warn_min_per_osd = 0
mon_pg_warn_max_per_osd = 32768
filestore_op_threads = 6

[osd]
osd_enable_op_tracker = false
osd_op_num_shards = 2
filestore_wbthrottle_enable = false
filestore_max_sync_interval = 1
filestore_odsync_write = true
filestore_max_inline_xattr_size = 254
filestore_max_inline_xattrs = 6
filestore_queue_committing_max_bytes = 1048576000
filestore_queue_committing_max_ops = 5000
filestore_queue_max_bytes = 1048576000
filestore_queue_max_ops = 500
journal_max_write_bytes = 1048576000
journal_max_write_entries = 1000
journal_queue_max_bytes = 1048576000
journal_queue_max_ops = 3000
filestore_fd_cache_shards = 32
filestore_fd_cache_size = 64


From: Somnath Roy [mailto:somnath@sandisk.com]
Sent: Wednesday, July 13, 2016 6:06 PM
To: Garg, Pankaj; ceph-users@lists.ceph.com
Subject: RE: Terrible RBD performance with Jewel

You should do that first to get a stable performance out with filestore.
1M seq write for the entire image should be sufficient to precondition it.

From: Garg, Pankaj [mailto:pankaj.g...@cavium.com]
Sent: Wednesday, July 13, 2016 6:04 PM
To: Somnath Roy; ceph-users@lists.ceph.com
Subject: RE: Terrible RBD performance with Jewel

No I have not.

From: Somnath Roy [mailto:somnath@sandisk.com]
Sent: Wednesday, July 13, 2016 6:00 PM
To: Garg, Pankaj; ceph-users@lists.ceph.com
Subject: RE: Terrible RBD performance with Jewel

In fact, I was wrong , I missed you are running with 12 OSDs (considering one 
OSD per SSD). In that case, it will take ~250 second to fill up the journal.
Have you preconditioned the entire image with bigger block say 1M before doing 
any real test ?

From: Garg, Pankaj [ma

Re: [ceph-users] Terrible RBD performance with Jewel

2016-07-14 Thread Garg, Pankaj
Disregard the last msg. Still getting long 0 IOPS periods.


From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Garg, 
Pankaj
Sent: Thursday, July 14, 2016 10:05 AM
To: Somnath Roy; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Terrible RBD performance with Jewel

Something in this section is causing all the 0 IOPS issue. Have not been able 
to nail down it yet. (I did comment out the filestore_max_inline_xattr_size 
entries, and problem still exists).
If I take out the whole [osd] section, I was able to get rid of IOPS staying at 
0 for long periods of time. Performance is still not where I would expect.
[osd]
osd_enable_op_tracker = false
osd_op_num_shards = 2
filestore_wbthrottle_enable = false
filestore_max_sync_interval = 1
filestore_odsync_write = true
#filestore_max_inline_xattr_size = 254
#filestore_max_inline_xattrs = 6
filestore_queue_committing_max_bytes = 1048576000
filestore_queue_committing_max_ops = 5000
filestore_queue_max_bytes = 1048576000
filestore_queue_max_ops = 500
journal_max_write_bytes = 1048576000
journal_max_write_entries = 1000
journal_queue_max_bytes = 1048576000
journal_queue_max_ops = 3000
filestore_fd_cache_shards = 32
filestore_fd_cache_size = 64

From: Somnath Roy [mailto:somnath@sandisk.com]
Sent: Wednesday, July 13, 2016 7:05 PM
To: Garg, Pankaj; ceph-users@lists.ceph.com
Subject: RE: Terrible RBD performance with Jewel

I am not sure whether you need to set the following. What's the point of 
reducing inline xattr stuff ? I forgot the calculation but lower values could 
redirect your xattrs to omap. Better comment those out.

filestore_max_inline_xattr_size = 254
filestore_max_inline_xattrs = 6

We could do some improvement on some of the params but nothing it seems 
responsible for the behavior you are seeing.
Could you run iotop and see if any process (like xfsaild) is doing io on the 
drives during that time ?

Thanks & Regards
Somnath

From: Garg, Pankaj [mailto:pankaj.g...@cavium.com]
Sent: Wednesday, July 13, 2016 6:40 PM
To: Somnath Roy; ceph-users@lists.ceph.com
Subject: RE: Terrible RBD performance with Jewel

I agree, but I'm dealing with something else out here with this setup.
I just ran a test, and within 3 seconds my IOPS went to 0, and stayed there for
90 seconds... then started and within seconds again went to 0.
This doesn't seem normal at all. Here is my ceph.conf:

[global]
fsid = xx
public_network = 
cluster_network = 
mon_initial_members = ceph1
mon_host = 
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
osd_mkfs_options = -f -i size=2048 -n size=64k
osd_mount_options_xfs = inode64,noatime,logbsize=256k
filestore_merge_threshold = 40
filestore_split_multiple = 8
osd_op_threads = 12
osd_pool_default_size = 2
mon_pg_warn_max_object_skew = 10
mon_pg_warn_min_per_osd = 0
mon_pg_warn_max_per_osd = 32768
filestore_op_threads = 6

[osd]
osd_enable_op_tracker = false
osd_op_num_shards = 2
filestore_wbthrottle_enable = false
filestore_max_sync_interval = 1
filestore_odsync_write = true
filestore_max_inline_xattr_size = 254
filestore_max_inline_xattrs = 6
filestore_queue_committing_max_bytes = 1048576000
filestore_queue_committing_max_ops = 5000
filestore_queue_max_bytes = 1048576000
filestore_queue_max_ops = 500
journal_max_write_bytes = 1048576000
journal_max_write_entries = 1000
journal_queue_max_bytes = 1048576000
journal_queue_max_ops = 3000
filestore_fd_cache_shards = 32
filestore_fd_cache_size = 64


From: Somnath Roy [mailto:somnath@sandisk.com]
Sent: Wednesday, July 13, 2016 6:06 PM
To: Garg, Pankaj; ceph-users@lists.ceph.com
Subject: RE: Terrible RBD performance with Jewel

You should do that first to get a stable performance out with filestore.
1M seq write for the entire image should be sufficient to precondition it.

From: Garg, Pankaj [mailto:pankaj.g...@cavium.com]
Sent: Wednesday, July 13, 2016 6:04 PM
To: Somnath Roy; ceph-users@lists.ceph.com
Subject: RE: Terrible RBD performance with Jewel

No I have not.

From: Somnath Roy [mailto:somnath@sandisk.com]
Sent: Wednesday, July 13, 2016 6:00 PM
To: Garg, Pankaj; ceph-users@lists.ceph.com
Subject: RE: Terrible RBD performance with Jewel

In fact, I was wrong , I missed you are running with 12 OSDs (considering one 
OSD per SSD). In that case, it will take ~250 second to fill up the journal.
Have you preconditioned the entire image with bigger block say 1M before doing 
any real test ?

From: Garg, Pankaj [mailto:pankaj.g...@cavium.com]
Sent: Wednesday, July 13, 2016 5:55 PM
To: Somnath Roy; ceph-users@lists.ceph.com
Subject: RE: Terrible RBD performance with

Re: [ceph-users] Ceph RBD object-map and discard in VM

2016-07-14 Thread Vaibhav Bhembre
We have been observing similar behavior. Usually it is the case where
we create a new rbd image, expose it to the guest, and perform an
operation that issues discards to the device.

A typical command that is first run on a given device is mkfs, usually with
discard on.

# time mkfs.xfs -s size=4096 -f /dev/sda
meta-data=/dev/sda   isize=256agcount=4, agsize=6553600 blks
 =   sectsz=4096  attr=2, projid32bit=0
data =   bsize=4096   blocks=26214400, imaxpct=25
 =   sunit=0  swidth=0 blks
naming   =version 2  bsize=4096   ascii-ci=0
log  =internal log   bsize=4096   blocks=12800, version=2
 =   sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none   extsz=4096   blocks=0, rtextents=0

real 9m10.882s
user 0m0.000s
sys 0m0.012s

When we issue this same command with object-map feature disabled on the
image it completes much faster.

# time mkfs.xfs -s size=4096 -f /dev/sda
meta-data=/dev/sda   isize=256agcount=4, agsize=6553600 blks
 =   sectsz=4096  attr=2, projid32bit=0
data =   bsize=4096   blocks=26214400, imaxpct=25
 =   sunit=0  swidth=0 blks
naming   =version 2  bsize=4096   ascii-ci=0
log  =internal log   bsize=4096   blocks=12800, version=2
 =   sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none   extsz=4096   blocks=0, rtextents=0

real 0m1.780s
user 0m0.000s
sys 0m0.012s
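
(For reference, toggling the feature for such a comparison is normally done
with something along the lines of

    rbd feature disable <pool>/<image> fast-diff object-map
    rbd feature enable  <pool>/<image> object-map fast-diff

fast-diff depends on object-map, so the two are disabled/enabled together;
the pool and image names are placeholders.)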

Also, from what I am seeing, the slowness seems to be proportional to the
size of the image rather than the amount of data written to it. Issuing
mkfs without discard doesn't reproduce the issue. The above values were
for a 100G rbd image; the 250G one takes slightly more than twice the time
taken for the 100G one.

# time mkfs.xfs -s size=4096 -f /dev/sda
meta-data=/dev/sda   isize=256agcount=4, agsize=16384000
blks
 =   sectsz=4096  attr=2, projid32bit=0
data =   bsize=4096   blocks=65536000, imaxpct=25
 =   sunit=0  swidth=0 blks
naming   =version 2  bsize=4096   ascii-ci=0
log  =internal log   bsize=4096   blocks=32000, version=2
 =   sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none   extsz=4096   blocks=0, rtextents=0

real 22m58.076s
user 0m0.000s
sys 0m0.024s

Let me know if you need any more information regarding this. We would like
to enable object-map (and fast-diff) on our images once this gets resolved.


On Wed, Jun 22, 2016 at 5:39 PM, Jason Dillaman  wrote:

> I'm not sure why I never received the original list email, so I
> apologize for the delay. Is /dev/sda1, from your example, fresh with
> no data to actually discard or does it actually have lots of data to
> discard?
>
> Thanks,
>
> On Wed, Jun 22, 2016 at 1:56 PM, Brian Andrus  wrote:
> > I've created a downstream bug for this same issue.
> >
> > https://bugzilla.redhat.com/show_bug.cgi?id=1349116
> >
> > On Wed, Jun 15, 2016 at 6:23 AM,  wrote:
> >>
> >> Hello guys,
> >>
> >> We are currently testing Ceph Jewel with object-map feature enabled:
> >>
> >> rbd image 'disk-22920':
> >> size 102400 MB in 25600 objects
> >> order 22 (4096 kB objects)
> >> block_name_prefix: rbd_data.7cfa2238e1f29
> >> format: 2
> >> features: layering, exclusive-lock, object-map, fast-diff,
> >> deep-flatten
> >> flags:
> >>
> >> We use this RBD as a disk for a KVM virtual machine with virtio-scsi and
> >> discard=unmap. We noticed the following parameters in /sys/block:
> >>
> >> # cat /sys/block/sda/queue/discard_*
> >> 4096
> >> 1073741824
> >> 0 <- discard_zeroes_data
> >>
> >> While trying to do a mkfs.ext4 on the disk in the VM we noticed low
> >> performance when using discard.
> >>
> >> mkfs.ext4 -E nodiscard /dev/sda1 - took 5 seconds to complete
> >> mkfs.ext4 -E discard /dev/sda1 - took around 3 minutes
> >>
> >> When disabling the object-map, the mkfs with discard took just 5
> >> seconds.
> >>
> >> Do you have any idea what might cause this issue?
> >>
> >> Kernel: 4.2.0-35-generic #40~14.04.1-Ubuntu
> >> Ceph: 10.2.0
> >> Libvirt: 1.3.1
> >> QEMU: 2.5.0
> >>
> >> Thanks!
> >>
> >> Best regards,
> >> Jonas
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> >
> >
> >
> > --
> > Brian Andrus
> > Red Hat, Inc.
> > Storage Consultant, Global Storage Practice
> > Mobile +1 (530) 903-8487
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
>
>
> --
> Jason
> 

Re: [ceph-users] Ceph RBD object-map and discard in VM

2016-07-14 Thread Jason Dillaman
I would probably be able to resolve the issue fairly quickly if you could
provide an RBD replay trace from a slow and a fast mkfs.xfs test run and
attach it to the tracker ticket I just opened for this issue [1]. You can
follow the instructions here [2], but you would only need to perform steps 1
and 2 (attaching the output from step 2 to the ticket).

Thanks,

[1] http://tracker.ceph.com/issues/16689
[2] http://docs.ceph.com/docs/master/rbd/rbd-replay/
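
For anyone following along: steps 1 and 2 of [2] boil down to capturing an
LTTng trace of librbd while the slow (and fast) mkfs runs, roughly as below;
this is a sketch of the documented procedure and the paths are illustrative:

    mkdir -p traces
    lttng create -o traces librbd
    lttng enable-event -u 'librbd:*'
    lttng add-context -u -t pthread_id
    lttng start
    #   ... run the slow/fast mkfs.xfs against the RBD-backed disk ...
    lttng stop

The resulting traces directory is what gets attached to the ticket.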

On Thu, Jul 14, 2016 at 2:55 PM, Vaibhav Bhembre
 wrote:
> We have been observing this similar behavior. Usually it is the case where
> we create a new rbd image, expose it into the guest and perform any
> operation that issues discard to the device.
>
> A typical command that's first run on a given device is mkfs, usually with
> discard on.
>
> # time mkfs.xfs -s size=4096 -f /dev/sda
> meta-data=/dev/sda   isize=256agcount=4, agsize=6553600 blks
>  =   sectsz=4096  attr=2, projid32bit=0
> data =   bsize=4096   blocks=26214400, imaxpct=25
>  =   sunit=0  swidth=0 blks
> naming   =version 2  bsize=4096   ascii-ci=0
> log  =internal log   bsize=4096   blocks=12800, version=2
>  =   sectsz=4096  sunit=1 blks, lazy-count=1
> realtime =none   extsz=4096   blocks=0, rtextents=0
>
> real 9m10.882s
> user 0m0.000s
> sys 0m0.012s
>
> When we issue this same command with object-map feature disabled on the
> image it completes much faster.
>
> # time mkfs.xfs -s size=4096 -f /dev/sda
> meta-data=/dev/sda   isize=256agcount=4, agsize=6553600 blks
>  =   sectsz=4096  attr=2, projid32bit=0
> data =   bsize=4096   blocks=26214400, imaxpct=25
>  =   sunit=0  swidth=0 blks
> naming   =version 2  bsize=4096   ascii-ci=0
> log  =internal log   bsize=4096   blocks=12800, version=2
>  =   sectsz=4096  sunit=1 blks, lazy-count=1
> realtime =none   extsz=4096   blocks=0, rtextents=0
>
> real 0m1.780s
> user 0m0.000s
> sys 0m0.012s
>
> Also from what I am seeing the slowness seems to be proportional to the size
> of the image rather than the amount of data written into it. Issuing mkfs
> without discard doesn't reproduce this issue. The above values were for 100G
> rbd image. The 250G takes slightly more than twice the time taken for 100G
> one.
>
> # time mkfs.xfs -s size=4096 -f /dev/sda
> meta-data=/dev/sda   isize=256agcount=4, agsize=16384000
> blks
>  =   sectsz=4096  attr=2, projid32bit=0
> data =   bsize=4096   blocks=65536000, imaxpct=25
>  =   sunit=0  swidth=0 blks
> naming   =version 2  bsize=4096   ascii-ci=0
> log  =internal log   bsize=4096   blocks=32000, version=2
>  =   sectsz=4096  sunit=1 blks, lazy-count=1
> realtime =none   extsz=4096   blocks=0, rtextents=0
>
> real 22m58.076s
> user 0m0.000s
> sys 0m0.024s
>
> Let me know if you need any more information regarding this. We would like
> to enable object-map (and fast-diff) on our images once this gets resolved.
>
>
> On Wed, Jun 22, 2016 at 5:39 PM, Jason Dillaman  wrote:
>>
>> I'm not sure why I never received the original list email, so I
>> apologize for the delay. Is /dev/sda1, from your example, fresh with
>> no data to actually discard or does it actually have lots of data to
>> discard?
>>
>> Thanks,
>>
>> On Wed, Jun 22, 2016 at 1:56 PM, Brian Andrus  wrote:
>> > I've created a downstream bug for this same issue.
>> >
>> > https://bugzilla.redhat.com/show_bug.cgi?id=1349116
>> >
>> > On Wed, Jun 15, 2016 at 6:23 AM,  wrote:
>> >>
>> >> Hello guys,
>> >>
>> >> We are currently testing Ceph Jewel with object-map feature enabled:
>> >>
>> >> rbd image 'disk-22920':
>> >> size 102400 MB in 25600 objects
>> >> order 22 (4096 kB objects)
>> >> block_name_prefix: rbd_data.7cfa2238e1f29
>> >> format: 2
>> >> features: layering, exclusive-lock, object-map, fast-diff,
>> >> deep-flatten
>> >> flags:
>> >>
>> >> We use this RBD as disk for a kvm virtual machine with virtio-scsi and
>> >> discard=unmap. We noticed the following paremeters in /sys/block:
>> >>
>> >> # cat /sys/block/sda/queue/discard_*
>> >> 4096
>> >> 1073741824
>> >> 0 <- discard_zeroes_data
>> >>
>> >> While trying to do a mkfs.ext4 on the disk in VM we noticed a low
>> >> performance with using discard.
>> >>
>> >> mkfs.ext4 -E nodiscard /dev/sda1 - tooks 5 seconds to complete
>> >> mkfs.ext4 -E discard /dev/sda1 - tooks around 3 monutes
>> >>
>> >> When disabling the object-map the mkfs with discard tooks just 5
>> >> seconds.
>> >>
>> >> Do you have any idea what might cause this issu

Re: [ceph-users] setting crushmap while creating pool fails

2016-07-14 Thread Oliver Dzombic
Hi,

thanks for the suggestion. I tried it out.

No effect.

My ceph.conf looks like:

[osd]
osd_pool_default_crush_replicated_ruleset = 2
osd_pool_default_size = 2
osd_pool_default_min_size = 1

The complete: http://pastebin.com/sG4cPYCY

But the config is completely ignored.

If I run

# ceph osd pool create vmware1 64 64 replicated cold-storage-rule

I will get:

pool 12 'vmware1' replicated size 3 min_size 2 crush_ruleset 1
object_hash rjenkins pg_num 64 pgp_num 64 last_change 2100 flags
hashpspool stripe_width 0

While the interesting part of my crushmap looks like:

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable straw_calc_version 1

root ssd-cache {
id -5   # do not change unnecessarily
# weight 1.704
alg straw
hash 0  # rjenkins1
item cephosd1-ssd-cache weight 0.852
item cephosd2-ssd-cache weight 0.852
}
root cold-storage {
id -6   # do not change unnecessarily
# weight 51.432
alg straw
hash 0  # rjenkins1
item cephosd1-cold-storage weight 25.716
item cephosd2-cold-storage weight 25.716
}

# rules
rule ssd-cache-rule {
ruleset 1
type replicated
min_size 2
max_size 10
step take ssd-cache
step chooseleaf firstn 0 type host
step emit
}
rule cold-storage-rule {
ruleset 2
type replicated
min_size 2
max_size 10
step take cold-storage
step chooseleaf firstn 0 type host
step emit
}

-

I have no idea what's going wrong here.
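As a stop-gap, the ruleset can at least be checked and corrected on the
already-created pool - a sketch using the pool and rule names above (on Jewel
the pool setting is still called crush_ruleset):

  ceph osd crush rule dump cold-storage-rule   # note the "ruleset" id it reports
  ceph osd pool set vmware1 crush_ruleset 2
  ceph osd pool get vmware1 crush_ruleset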

I already opened a bug tracker:

http://tracker.ceph.com/issues/16653

But unfortunately without too much luck.

I really have no idea what to do now. I can't create pools and assign the
correct rulesets. Basically that means I have to set everything up again. But there
is no guarantee that this will not happen again.

So my only remaining option would be to set up an additional Ceph storage for the other
pools, which is not really an option.

I deeply appreciate any kind of idea...

Thank you !


-- 
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive

mailto:i...@ip-interactive.de

Anschrift:

IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402 beim Amtsgericht Hanau
Geschäftsführung: Oliver Dzombic

Steuer Nr.: 35 236 3622 1
UST ID: DE274086107


On 13.07.2016 at 08:18, Wido den Hollander wrote:
> 
>> Op 12 juli 2016 om 22:30 schreef Oliver Dzombic :
>>
>>
>> Hi,
>>
>> i have a crushmap which looks like:
>>
>> http://pastebin.com/YC9FdTUd
>>
>> I issue:
>>
>> # ceph osd pool create vmware1 64 cold-storage-rule
>> pool 'vmware1' created
>>
>> I would expect the pool to have ruleset 2.
>>
>> #ceph osd pool ls detail
>>
>> pool 10 'vmware1' replicated size 3 min_size 2 crush_ruleset 1
>> object_hash rjenkins pg_num 64 pgp_num 64 last_change 483 flags
>> hashpspool stripe_width 0
>>
>> but it has crush_ruleset 1.
>>
>>
>> Why ?
> 
> What happens if you set 'osd_pool_default_crush_replicated_ruleset' to 2 and 
> try again?
> 
> Should be set in the [global] or [mon] section.
> 
> Wido
> 
>>
>> Thank you !
>>
>>
>> -- 
>> Mit freundlichen Gruessen / Best regards
>>
>> Oliver Dzombic
>> IP-Interactive
>>
>> mailto:i...@ip-interactive.de
>>
>> Anschrift:
>>
>> IP Interactive UG ( haftungsbeschraenkt )
>> Zum Sonnenberg 1-3
>> 63571 Gelnhausen
>>
>> HRB 93402 beim Amtsgericht Hanau
>> Geschäftsführung: Oliver Dzombic
>>
>> Steuer Nr.: 35 236 3622 1
>> UST ID: DE274086107
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osd inside LXC

2016-07-14 Thread Guillaume Comte
Thanks for all your answers,

Today people dedicate servers to act as ceph osd nodes which serve data
stored inside to other dedicated servers which run applications or VMs, can
we think about squashing the 2 inside 1?

On 14 Jul 2016 at 18:15, "Daniel Gryniewicz"  wrote:

> This is fairly standard for container deployment: one app per container
> instance.  This is how we're deploying docker in our upstream ceph-docker /
> ceph-ansible as well.
>
> Daniel
>
> On 07/13/2016 08:41 PM, Łukasz Jagiełło wrote:
>
>> Hi,
>>
>> Just wonder why you want each OSD inside separate LXC container? Just to
>> pin them to specific cpus?
>>
>> On Tue, Jul 12, 2016 at 6:33 AM, Guillaume Comte
>> > > wrote:
>>
>> Hi,
>>
>> I am currently defining a storage architecture based on ceph, and i
>> wish to know if i don't misunderstood some stuffs.
>>
>> So, i plan to deploy for each HDD of each servers as much as OSD as
>> free harddrive, each OSD will be inside a LXC container.
>>
>> Then, i wish to turn the server itself as a rbd client for objects
>> created in the pools, i wish also to have a SSD to activate caching
>> (and also store osd logs as well)
>>
>> The idea behind is to create CRUSH rules which will maintain a set
>> of object within a couple of servers connected to the same pair of
>> switch in order to have the best proximity between where i store the
>> object and where i use them (i don't bother having a very high
>> insurance to not loose data if my whole rack powerdown)
>>
>> Am i already on the wrong track ? Is there a way to guaranty
>> proximity of data with ceph whitout making twisted configuration as
>> i am ready to do ?
>>
>> Thks in advance,
>>
>> Regards
>> --
>> *Guillaume Comte*
>> 06 25 85 02 02  | guillaume.co...@blade-group.com
>> 
>> 90 avenue des Ternes, 75 017 Paris
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com 
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>>
>>
>> --
>> Łukasz Jagiełło
>> lukaszjagielloorg
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Slow requet on node reboot

2016-07-14 Thread Luis Ramirez

Hi,

I've a cluster with 3 MON nodes and 5 OSD nodes. If I reboot one of the
OSD nodes I get slow requests waiting for active.


2016-07-14 19:39:07.996942 osd.33 10.255.128.32:6824/7404 888 : cluster 
[WRN] slow request 60.627789 seconds old, received at 2016-07-14 
19:38:07.369009: osd_op(client.593241.0:3283308 3.d8215fdb (undecoded) 
ondisk+write+known_if_redirected e11409) currently waiting for active
2016-07-14 19:39:07.996950 osd.33 10.255.128.32:6824/7404 889 : cluster 
[WRN] slow request 60.623972 seconds old, received at 2016-07-14 
19:38:07.372826: osd_op(client.593241.0:3283309 3.d8215fdb (undecoded) 
ondisk+write+known_if_redirected e11411) currently waiting for active
2016-07-14 19:39:07.996958 osd.33 10.255.128.32:6824/7404 890 : cluster 
[WRN] slow request 240.631544 seconds old, received at 2016-07-14 
19:35:07.365255: osd_op(client.593241.0:3283269 3.d8215fdb (undecoded) 
ondisk+write+known_if_redirected e11384) currently waiting for active
2016-07-14 19:39:07.996965 osd.33 10.255.128.32:6824/7404 891 : cluster 
[WRN] slow request 30.625102 seconds old, received at 2016-07-14 
19:38:37.371697: osd_op(client.593241.0:3283315 3.d8215fdb (undecoded) 
ondisk+write+known_if_redirected e11433) currently waiting for active
2016-07-14 19:39:12.997985 osd.33 10.255.128.32:6824/7404 893 : cluster 
[WRN] 83 slow requests, 4 included below; oldest blocked for > 
395.971587 secs


And the service will not recover until the node restarts successfully.
Could anyone shed some light on what I'm doing wrong?
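For what it is worth, the usual practice for a planned reboot is to tell the
cluster not to react to the node going away and to clear the flag afterwards -
a sketch (this limits rebalancing during the reboot; whether it avoids the
"waiting for active" stalls here depends on why peering takes so long):

  ceph osd set noout
  # reboot the OSD node
  ceph osd unset noout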


Regards
Luis

--
Luis Ramírez Viejo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] setting crushmap while creating pool fails

2016-07-14 Thread Oliver Dzombic
Hi,

wow, figured it out.

If you don't have a ruleset with id 0, you are in trouble.

So the solution is that you >MUST< have a ruleset id 0.
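In case it helps anyone else, the ruleset ids can be renumbered by
round-tripping the crushmap - a sketch, with illustrative file names:

  ceph osd getcrushmap -o crushmap.bin
  crushtool -d crushmap.bin -o crushmap.txt
  # edit crushmap.txt so that one rule carries "ruleset 0"
  crushtool -c crushmap.txt -o crushmap.new
  ceph osd setcrushmap -i crushmap.new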

-- 
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive

mailto:i...@ip-interactive.de

Anschrift:

IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402 beim Amtsgericht Hanau
Geschäftsführung: Oliver Dzombic

Steuer Nr.: 35 236 3622 1
UST ID: DE274086107


On 15.07.2016 at 00:10, Oliver Dzombic wrote:
> Hi,
> 
> thanks for the suggestion. I tried it out.
> 
> No effect.
> 
> My ceph.conf looks like:
> 
> [osd]
> osd_pool_default_crush_replicated_ruleset = 2
> osd_pool_default_size = 2
> osd_pool_default_min_size = 1
> 
> The complete: http://pastebin.com/sG4cPYCY
> 
> But the config is completely ignored.
> 
> If i run
> 
> # ceph osd pool create vmware1 64 64 replicated cold-storage-rule
> 
> i will get:
> 
> pool 12 'vmware1' replicated size 3 min_size 2 crush_ruleset 1
> object_hash rjenkins pg_num 64 pgp_num 64 last_change 2100 flags
> hashpspool stripe_width 0
> 
> While the intresting part of my crushmap looks like:
> 
> # begin crush map
> tunable choose_local_tries 0
> tunable choose_local_fallback_tries 0
> tunable choose_total_tries 50
> tunable chooseleaf_descend_once 1
> tunable chooseleaf_vary_r 1
> tunable straw_calc_version 1
> 
> root ssd-cache {
> id -5   # do not change unnecessarily
> # weight 1.704
> alg straw
> hash 0  # rjenkins1
> item cephosd1-ssd-cache weight 0.852
> item cephosd2-ssd-cache weight 0.852
> }
> root cold-storage {
> id -6   # do not change unnecessarily
> # weight 51.432
> alg straw
> hash 0  # rjenkins1
> item cephosd1-cold-storage weight 25.716
> item cephosd2-cold-storage weight 25.716
> }
> 
> # rules
> rule ssd-cache-rule {
> ruleset 1
> type replicated
> min_size 2
> max_size 10
> step take ssd-cache
> step chooseleaf firstn 0 type host
> step emit
> }
> rule cold-storage-rule {
> ruleset 2
> type replicated
> min_size 2
> max_size 10
> step take cold-storage
> step chooseleaf firstn 0 type host
> step emit
> }
> 
> -
> 
> I have no idea whats going wrong here.
> 
> I already opend a bug tracker:
> 
> http://tracker.ceph.com/issues/16653
> 
> But unfortunatelly without too much luck.
> 
> I really have no idea what to do now. I cant create pools and assign the
> correct rulesets. Basically that means i have to resetup all. But there
> is no gurantee that this will not happen again.
> 
> So my only option would be to make an additional ceph storage for other
> pools, which is not really an option.
> 
> I deeply appriciate any kind of idea...
> 
> Thank you !
> 
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Qemu with customized librbd/librados

2016-07-14 Thread ZHOU Yuan
Hi list,

I ran into an issue here customizing librbd (linked with jemalloc) for use with
the stock qemu in Ubuntu Trusty.
Stock qemu depends on librbd1 and librados2 (0.80.x). These two libraries
are installed at /usr/lib/x86_64-linux-gnu/lib{rbd,rados}.so, and that path
is included in /etc/ld.so.conf.d/x86_64-linux-gnu.conf. This seems to be
there for Ubuntu's multi-arch support.
I find that when I'm building the local Ceph, the build seems to link
against the existing /usr/lib/x86_64-linux-gnu/librbd.so instead of the newly
built local librbd. So I just removed those
/usr/lib/x86_64-linux-gnu/lib{rbd,rados}.so and installed my customized libs
into /usr/local/lib/.

Is this the right way to build a customized Ceph, or should I build with
--prefix=/usr/lib/x86_64-linux-gnu?
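One way to see which copy the stock qemu actually loads, and to try the custom
build without overwriting the distro libraries, is roughly the following sketch
(paths are illustrative and assume the custom build landed in /usr/local/lib):

  ldconfig -p | grep -E 'librbd|librados'
  ldd /usr/bin/qemu-system-x86_64 | grep -E 'librbd|librados'
  # force the custom copy for a single run:
  LD_LIBRARY_PATH=/usr/local/lib qemu-system-x86_64 ...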

Sincerely, Yuan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Terrible RBD performance with Jewel

2016-07-14 Thread Adrian Saul

I would suggest caution with "filestore_odsync_write" - it's fine on good SSDs,
but on poor SSDs or spinning disks it will kill performance.


From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Somnath Roy
Sent: Friday, 15 July 2016 3:12 AM
To: Garg, Pankaj; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Terrible RBD performance with Jewel

Try increasing the following to say 10

osd_op_num_shards = 10
filestore_fd_cache_size = 128

I hope you introduced the following after I told you, so it seems it shouldn't
be the cause (?)

filestore_odsync_write = true

Also, comment out the following.

filestore_wbthrottle_enable = false



From: Garg, Pankaj [mailto:pankaj.g...@cavium.com]
Sent: Thursday, July 14, 2016 10:05 AM
To: Somnath Roy; ceph-users@lists.ceph.com
Subject: RE: Terrible RBD performance with Jewel

Something in this section is causing the 0-IOPS issue. I have not been able
to nail it down yet. (I did comment out the filestore_max_inline_xattr_size
entries, and the problem still exists.)
If I take out the whole [osd] section, I was able to get rid of IOPS staying at
0 for long periods of time. Performance is still not where I would expect.
[osd]
osd_enable_op_tracker = false
osd_op_num_shards = 2
filestore_wbthrottle_enable = false
filestore_max_sync_interval = 1
filestore_odsync_write = true
#filestore_max_inline_xattr_size = 254
#filestore_max_inline_xattrs = 6
filestore_queue_committing_max_bytes = 1048576000
filestore_queue_committing_max_ops = 5000
filestore_queue_max_bytes = 1048576000
filestore_queue_max_ops = 500
journal_max_write_bytes = 1048576000
journal_max_write_entries = 1000
journal_queue_max_bytes = 1048576000
journal_queue_max_ops = 3000
filestore_fd_cache_shards = 32
filestore_fd_cache_size = 64

From: Somnath Roy [mailto:somnath@sandisk.com]
Sent: Wednesday, July 13, 2016 7:05 PM
To: Garg, Pankaj; ceph-users@lists.ceph.com
Subject: RE: Terrible RBD performance with Jewel

I am not sure whether you need to set the following. What's the point of
reducing the inline xattr settings? I forget the calculation, but lower values could
redirect your xattrs to omap. Better to comment those out.

filestore_max_inline_xattr_size = 254
filestore_max_inline_xattrs = 6

We could do some improvement on some of the params but nothing it seems 
responsible for the behavior you are seeing.
Could you run iotop and see if any process (like xfsaild) is doing io on the 
drives during that time ?
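For reference, something like the following only lists the threads that are
actually doing I/O during the stall, which makes a process like xfsaild easy to
spot - a sketch:

  iotop -obt -d 5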

Thanks & Regards
Somnath

From: Garg, Pankaj [mailto:pankaj.g...@cavium.com]
Sent: Wednesday, July 13, 2016 6:40 PM
To: Somnath Roy; ceph-users@lists.ceph.com
Subject: RE: Terrible RBD performance with Jewel

I agree, but I'm dealing with something else out here with this setup.
I just ran a test, and within 3 seconds my IOPS went to 0, and stayed there for
90 seconds, then started and within seconds again went to 0.
This doesn't seem normal at all. Here is my ceph.conf:

[global]
fsid = xx
public_network = 
cluster_network = 
mon_initial_members = ceph1
mon_host = 
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
osd_mkfs_options = -f -i size=2048 -n size=64k
osd_mount_options_xfs = inode64,noatime,logbsize=256k
filestore_merge_threshold = 40
filestore_split_multiple = 8
osd_op_threads = 12
osd_pool_default_size = 2
mon_pg_warn_max_object_skew = 10
mon_pg_warn_min_per_osd = 0
mon_pg_warn_max_per_osd = 32768
filestore_op_threads = 6

[osd]
osd_enable_op_tracker = false
osd_op_num_shards = 2
filestore_wbthrottle_enable = false
filestore_max_sync_interval = 1
filestore_odsync_write = true
filestore_max_inline_xattr_size = 254
filestore_max_inline_xattrs = 6
filestore_queue_committing_max_bytes = 1048576000
filestore_queue_committing_max_ops = 5000
filestore_queue_max_bytes = 1048576000
filestore_queue_max_ops = 500
journal_max_write_bytes = 1048576000
journal_max_write_entries = 1000
journal_queue_max_bytes = 1048576000
journal_queue_max_ops = 3000
filestore_fd_cache_shards = 32
filestore_fd_cache_size = 64


From: Somnath Roy [mailto:somnath@sandisk.com]
Sent: Wednesday, July 13, 2016 6:06 PM
To: Garg, Pankaj; ceph-users@lists.ceph.com
Subject: RE: Terrible RBD performance with Jewel

You should do that first to get a stable performance out with filestore.
1M seq write for the entire image should be sufficient to precondition it.
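A sketch of such preconditioning, either from a client with fio's rbd engine
(assuming fio was built with rbd support) or with a plain dd inside the guest -
pool, image and device names are illustrative:

  fio --name=precond --ioengine=rbd --clientname=admin --pool=rbd --rbdname=testimg --rw=write --bs=1M --iodepth=16
  # or, from inside the guest:
  dd if=/dev/zero of=/dev/sda bs=1M oflag=direct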

From: Garg, Pankaj [mailto:pankaj.g...@cavium.com]
Sent: Wednesday, July 13, 2016 6:04 PM
To: Somnath Roy; ceph-users@lists.ceph.com
Subject: RE: Terrible RBD performance with Jewel

No I have not.

From: Somnath Roy [mailto:somnath@sandisk.com]
Sent: Wednesday, July 13, 2016 6:00 PM
To: Garg, Pankaj; ceph-users@lists.ceph.com
Subject: RE: Terrible RBD performance with Jewel

In fact, I was wrong , I missed you are running with 12 OSDs (considerin

Re: [ceph-users] Lessons learned upgrading Hammer -> Jewel

2016-07-14 Thread 席智勇
good job, thank you for sharing, Wido~
it's very useful~

2016-07-14 14:33 GMT+08:00 Wido den Hollander :

> To add, the RGWs upgraded just fine as well.
>
> No regions in use here (yet!), so that upgraded as it should.
>
> Wido
>
> > Op 13 juli 2016 om 16:56 schreef Wido den Hollander :
> >
> >
> > Hello,
> >
> > The last 3 days I worked at a customer with a 1800 OSD cluster which had
> to be upgraded from Hammer 0.94.5 to Jewel 10.2.2
> >
> > The cluster in this case is 99% RGW, but also some RBD.
> >
> > I wanted to share some of the things we encountered during this upgrade.
> >
> > All 180 nodes are running CentOS 7.1 on a IPv6-only network.
> >
> > ** Hammer Upgrade **
> > At first we upgraded from 0.94.5 to 0.94.7, this went well except for
> the fact that the monitors got spammed with these kind of messages:
> >
> >   "Failed to encode map eXXX with expected crc"
> >
> > Some searching on the list brought me to:
> >
> >   ceph tell osd.* injectargs -- --clog_to_monitors=false
> >
> >  This reduced the load on the 5 monitors and made recovery succeed
> smoothly.
> >
> >  ** Monitors to Jewel **
> >  The next step was to upgrade the monitors from Hammer to Jewel.
> >
> >  Using Salt we upgraded the packages and afterwards it was simple:
> >
> >killall ceph-mon
> >chown -R ceph:ceph /var/lib/ceph
> >chown -R ceph:ceph /var/log/ceph
> >
> > Now, a systemd quirck. 'systemctl start ceph.target' does not work, I
> had to manually enabled the monitor and start it:
> >
> >   systemctl enable ceph-mon@srv-zmb04-05.service
> >   systemctl start ceph-mon@srv-zmb04-05.service
> >
> > Afterwards the monitors were running just fine.
> >
> > ** OSDs to Jewel **
> > To upgrade the OSDs to Jewel we initially used Salt to update the
> packages on all systems to 10.2.2, we then used a Shell script which we ran
> on one node at a time.
> >
> > The failure domain here is 'rack', so we executed this in one rack, then
> the next one, etc, etc.
> >
> > Script can be found on Github:
> https://gist.github.com/wido/06eac901bd42f01ca2f4f1a1d76c49a6
> >
> > Be aware that the chown can take a long, long, very long time!
> >
> > We ran into the issue that some OSDs crashed after start. But after
> trying again they would start.
> >
> >   "void FileStore::init_temp_collections()"
> >
> > I reported this in the tracker as I'm not sure what is happening here:
> http://tracker.ceph.com/issues/16672
> >
> > ** New OSDs with Jewel **
> > We also had some new nodes which we wanted to add to the Jewel cluster.
> >
> > Using Salt and ceph-disk we ran into a partprobe issue in combination
> with ceph-disk. There was already a Pull Request for the fix, but that was
> not included in Jewel 10.2.2.
> >
> > We manually applied the PR and it fixed our issues:
> https://github.com/ceph/ceph/pull/9330
> >
> > Hope this helps other people with their upgrades to Jewel!
> >
> > Wido
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-fuse segfaults ( jewel 10.2.2)

2016-07-14 Thread Goncalo Borges

Hi All...

I've seen that Zheng, Brad, Pat and Greg already updated or made some 
comments on the bug issue. Zheng also proposes a simple patch. However, 
I do have a bit more information. We do think we have identified the 
source of the problem and that we can correct it. Therefore, I would 
propose that you hold any work on the issue until we test our 
hypothesis. I'll try to summarize it:


1./ After being convinced that the ceph-fuse segfault we saw in specific 
VMs was not memory related, I decided to run the user application in 
multiple zones of the openstack cloud we use. We scale up our resources 
by using a public funded openstack cloud which spawns machines (using 
always the same image) in multiple availability zones. In the majority 
of the cases we limit our VMs to (normally) the same availability zone 
because it seats in the same data center as our infrastructure. This 
experiment showed that ceph-fuse does not segfaults in other 
availability zones with multiple VMS of different sizes and types. So 
the problem was restricted to the availability zone we normally use as 
our default one.


2./ I've then created new VMs of multiple sizes and types in our
'default' availability zone and reran the user application. This new
experiment, running in newly created VMs, showed ceph-fuse segfaults 
independent of the VM types but not in all VMs. For example, in this new 
test, ceph-fuse was segfaulting in some 4 and 8 core VMs but not in all.


3./ I've then decided to inspect the CPU types, and the breakthrough was 
that I got a 100% correlation of ceph-fuse segfaults with AMD 62xx 
processor VMs. This availability zone has only 2 types of hypervisors: 
an old one with AMD 62xx processors, and a new one with Intel 
processors. If my jobs run in a VM with Intel, everything is ok. If my 
jobs run in AMD 62xx, ceph-fuse segfaults. Actually, the segfault is 
almost immediate in 4 core AMD 62xx VMs but takes much more time in 
8-core AMD62xx VMs.


4./ I've then crosschecked what processors were used in the successful 
jobs executed in the other availability zones: several types of Intel,
AMD 63xx, but not AMD 62xx processors.


5./ Talking with my awesome colleague Sean, he remembered some
discussions about applications segfaulting on AMD processors when
compiled on an Intel processor with the AVX2 extension. Actually, I compiled
ceph 10.2.2 on an Intel processor with AVX2, but ceph 9.2.0 was compiled
several months ago on an Intel processor without AVX2. The reason for
the change is simply that we upgraded our infrastructure.


6./ Then, we compared the CPU flags between AMD 63xx and AMD 62xx. If you
look carefully, 63xx has 'fma f16c tbm bmi1' and 62xx has 'svm'.
According to my colleague, fma and f16c are both AMD extensions which 
make AMD more compatible with the AVX extension by Intel.


   *63xx*
   flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
   pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt
   pdpe1gb lm rep_good extd_apicid unfair_spinlock pni pclmulqdq ssse3
   fma cx16 sse4_1 sse4_2 x2apic popcnt aes xsave avx f16c hypervisor
   lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch
   osvw xop fma4 tbm bmi1

   *62xx*
   flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
   pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt
   pdpe1gb lm rep_good extd_apicid unfair_spinlock pni pclmulqdq ssse3
   cx16 sse4_1 sse4_2 x2apic popcnt aes xsave avx hypervisor lahf_lm
   cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw
   xop fma4


All of the previous arguments may explain why we can use 9.2.0 in AMD 
62xx, and why 10.2.2 works in AMD 63xx but not in AMD 62xx.


So, we are hoping that compiling 10.2.2 on an Intel processor without
the AVX extensions will solve our problem.
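A quick check of what the build host advertises, plus a conservative rebuild
that pins the baseline ISA - a sketch; the configure invocation assumes the
10.2.2 autotools build and the flags are illustrative:

  grep -o -w -E 'avx|avx2' /proc/cpuinfo | sort -u
  ./configure CFLAGS="-O2 -g -march=x86-64" CXXFLAGS="-O2 -g -march=x86-64"
  make -j"$(nproc)"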


Does this make sense?

The compilation takes a while but I will update the issue once I have 
finished this last experiment (in the next few days)


Cheers
Goncalo



On 07/12/2016 09:45 PM, Goncalo Borges wrote:

Hi All...

Thank you for continuing to follow this already very long thread.

Pat and Greg are correct in their assumption regarding the 10gb virtual memory 
footprint I see for ceph-fuse process in our cluster with 12 core (24 because of 
hyperthreading) machines and 96 gb of RAM. The source is glibc > 1.10. I can 
reduce / tune virtual memory threads usage by setting MALLOC_ARENA_MAX = 4 (the 
default is 8 on 64 bits machines) before mounting the filesystem with ceph-fuse. 
So, there is no memory leak on ceph-fuse :-)
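In practice that just means exporting the variable before mounting - a sketch,
with an illustrative id and mountpoint:

  export MALLOC_ARENA_MAX=4
  ceph-fuse --id mount_user /cephfs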

The bad news is that, while reading the arena malloc glibc explanation, it 
became obvious that the virtual memory footprint scales with the number of
cores. Therefore the 10gb virtual memory I was seeing in the resources with
cores (24 because of hyperthreading) could not / would not be the same in the 
VMs where I get the segfault since they have only 4 cores.

So, at this point, I kno

Re: [ceph-users] ceph-fuse segfaults ( jewel 10.2.2)

2016-07-14 Thread Brad Hubbard
On Fri, Jul 15, 2016 at 11:35:10AM +1000, Goncalo Borges wrote:
> Hi All...
> 
> I've seen that Zheng, Brad, Pat and Greg already updated or made some
> comments on the bug issue. Zheng also proposes a simple patch. However, I do
> have a bit more information. We do think we have identified the source of
> the problem and that we can correct it. Therefore, I would propose that you
> hold any work on the issue until we test our hypothesis. I'll try to
> summarize it:
> 
> 1./ After being convinced that the ceph-fuse segfault we saw in specific VMs
> was not memory related, I decided to run the user application in multiple
> zones of the openstack cloud we use. We scale up our resources by using a
> public funded openstack cloud which spawns machines (using always the same
> image) in multiple availability zones. In the majority of the cases we limit
> our VMs to (normally) the same availability zone because it seats in the
> same data center as our infrastructure. This experiment showed that
> ceph-fuse does not segfaults in other availability zones with multiple VMS
> of different sizes and types. So the problem was restricted to the
> availability zone we normally use as our default one.
> 
> 2./ I've them created new VMs of multiple sizes and types  in our 'default'
> availability zone and rerun the user application. This new experiment,
> running in newly created VMs, showed ceph-fuse segfaults independent of the
> VM types but not in all VMs. For example, in this new test, ceph-fuse was
> segfaulting in some 4 and 8 core VMs but not in all.
> 
> 3./ I've then decided to inspect the CPU types, and the breakthrough was
> that I got a 100% correlation of ceph-fuse segfaults with AMD 62xx processor
> VMs. This availability zone has only 2 types of hypervisors: an old one with
> AMD 62xx processors, and a new one with Intel processors. If my jobs run in
> a VM with Intel, everything is ok. If my jobs run in AMD 62xx, ceph-fuse
> segfaults. Actually, the segfault is almost immediate in 4 core AMD 62xx VMs
> but takes much more time in 8-core AMD62xx VMs.
> 
> 4./ I've then crosschecked what processors were used in the successful jobs
> executed in the other availability zones: Several types of intel, AMD 63xx
> but not AMD 62xx processors.
> 
> 5./ Talking with my awesome colleague Sean, he remembered some discussions
> about applications segfaulting in AMD processors when compiled in an Intel
> processor with AVX2 extension. Actually, I compiled ceph 10.2.2 in an intel
> processor with AVX2 but ceph 9.2.0 was compiled several months ago on an
> intel processor without AVX2. The reason for the change is simply because we
> upgraded our infrastructure.
> 
> 6./ Then, we compared the cpuflags between AMD 63xx and AMD62xx. if you look
> carefully, 63xx has 'fma f16c tbm bmi1' and 62xx has 'svm'. According to my
> colleague, fma and f16c are both AMD extensions which make AMD more
> compatible with the AVX extension by Intel.
> 
>*63xx*
>flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
>pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt
>pdpe1gb lm rep_good extd_apicid unfair_spinlock pni pclmulqdq ssse3
>fma cx16 sse4_1 sse4_2 x2apic popcnt aes xsave avx f16c hypervisor
>lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch
>osvw xop fma4 tbm bmi1
> 
>*62xx*
>flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
>pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt
>pdpe1gb lm rep_good extd_apicid unfair_spinlock pni pclmulqdq ssse3
>cx16 sse4_1 sse4_2 x2apic popcnt aes xsave avx hypervisor lahf_lm
>cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw
>xop fma4
> 
> 
> All of the previous arguments may explain why we can use 9.2.0 in AMD 62xx,
> and why 10.2.2 works in AMD 63xx but not in AMD 62xx.
> 
> So, we are hopping that compiling 10.2.2 in an intel processor without the
> AVX extensions will solve our problem.
> 
> Does this make sense?
> 
> The compilation takes a while but I will update the issue once I have
> finished this last experiment (in the next few days)

Wow, great analysis, well done.

So is this a CPU bug then? Is it documented anywhere you know of?

I can't see where we use the AVX2 extensions directly so I assume this has to
be at a lower level than the ceph code?

-- 
Cheers,
Brad

> 
> Cheers
> Goncalo
> 
> 
> 
> On 07/12/2016 09:45 PM, Goncalo Borges wrote:
> > Hi All...
> > 
> > Thank you for continuing to follow this already very long thread.
> > 
> > Pat and Greg are correct in their assumption regarding the 10gb virtual 
> > memory footprint I see for ceph-fuse process in our cluster with 12 core 
> > (24 because of hyperthreading) machines and 96 gb of RAM. The source is 
> > glibc > 1.10. I can reduce / tune virtual memory threads usage by setting 
> > MALLOC_ARENA_MAX = 4 (the default is 8 on 64 bits machines) before mounting 
> > the filesyst

Re: [ceph-users] ceph-fuse segfaults ( jewel 10.2.2)

2016-07-14 Thread Yan, Zheng
On Fri, Jul 15, 2016 at 9:35 AM, Goncalo Borges
 wrote:
> Hi All...
>
> I've seen that Zheng, Brad, Pat and Greg already updated or made some
> comments on the bug issue. Zheng also proposes a simple patch. However, I do
> have a bit more information. We do think we have identified the source of
> the problem and that we can correct it. Therefore, I would propose that you
> hold any work on the issue until we test our hypothesis. I'll try to
> summarize it:
>
> 1./ After being convinced that the ceph-fuse segfault we saw in specific VMs
> was not memory related, I decided to run the user application in multiple
> zones of the openstack cloud we use. We scale up our resources by using a
> public funded openstack cloud which spawns machines (using always the same
> image) in multiple availability zones. In the majority of the cases we limit
> our VMs to (normally) the same availability zone because it seats in the
> same data center as our infrastructure. This experiment showed that
> ceph-fuse does not segfaults in other availability zones with multiple VMS
> of different sizes and types. So the problem was restricted to the
> availability zone we normally use as our default one.
>
> 2./ I've them created new VMs of multiple sizes and types  in our 'default'
> availability zone and rerun the user application. This new experiment,
> running in newly created VMs, showed ceph-fuse segfaults independent of the
> VM types but not in all VMs. For example, in this new test, ceph-fuse was
> segfaulting in some 4 and 8 core VMs but not in all.
>
> 3./ I've then decided to inspect the CPU types, and the breakthrough was
> that I got a 100% correlation of ceph-fuse segfaults with AMD 62xx processor
> VMs. This availability zone has only 2 types of hypervisors: an old one with
> AMD 62xx processors, and a new one with Intel processors. If my jobs run in
> a VM with Intel, everything is ok. If my jobs run in AMD 62xx, ceph-fuse
> segfaults. Actually, the segfault is almost immediate in 4 core AMD 62xx VMs
> but takes much more time in 8-core AMD62xx VMs.
>
> 4./ I've then crosschecked what processors were used in the successful jobs
> executed in the other availability zones: Several types of intel, AMD 63xx
> but not AMD 62xx processors.
>
> 5./ Talking with my awesome colleague Sean, he remembered some discussions
> about applications segfaulting in AMD processors when compiled in an Intel
> processor with AVX2 extension. Actually, I compiled ceph 10.2.2 in an intel
> processor with AVX2 but ceph 9.2.0 was compiled several months ago on an
> intel processor without AVX2. The reason for the change is simply because we
> upgraded our infrastructure.
>
> 6./ Then, we compared the cpuflags between AMD 63xx and AMD62xx. if you look
> carefully, 63xx has 'fma f16c tbm bmi1' and 62xx has 'svm'. According to my
> colleague, fma and f16c are both AMD extensions which make AMD more
> compatible with the AVX extension by Intel.
>
> 63xx
> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat
> pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt pdpe1gb lm
> rep_good extd_apicid unfair_spinlock pni pclmulqdq ssse3 fma cx16 sse4_1
> sse4_2 x2apic popcnt aes xsave avx f16c hypervisor lahf_lm cmp_legacy
> cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw xop fma4 tbm bmi1
>
> 62xx
> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat
> pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt pdpe1gb lm
> rep_good extd_apicid unfair_spinlock pni pclmulqdq ssse3 cx16 sse4_1 sse4_2
> x2apic popcnt aes xsave avx hypervisor lahf_lm cmp_legacy svm cr8_legacy abm
> sse4a misalignsse 3dnowprefetch osvw xop fma4
>
>
> All of the previous arguments may explain why we can use 9.2.0 in AMD 62xx,
> and why 10.2.2 works in AMD 63xx but not in AMD 62xx.
>
> So, we are hopping that compiling 10.2.2 in an intel processor without the
> AVX extensions will solve our problem.
>
> Does this make sense?

I have a different theory. ObjectCacher::flush() checks
"bh->last_write <= cutoff" to decide if it should write buffer head.
But ObjectCacher::bh_write_adjacencies() checks "bh->last_write <
cutoff". (cutoff is the time clock when ObjectCacher::flush() starts
executing). If there is only one dirty buffer head and its last_write
is equal to cutoff, the segfault happens. Due to hardware
limitations, the AMD 62xx CPU may be unable to provide a high-precision time
clock. This explains why the segfault only happens on AMD 62xx. The code
that causes the segfault was introduced in the jewel release. So ceph-fuse
9.2.0 does not have this problem.


Regards
Yan, Zheng




>
> The compilation takes a while but I will update the issue once I have
> finished this last experiment (in the next few days)
>
> Cheers
> Goncalo
>
>
>
> On 07/12/2016 09:45 PM, Goncalo Borges wrote:
>
> Hi All...
>
> Thank you for continuing to follow this already very long thread.
>
> Pat and Greg are correct in their assumption regarding the 10gb virt

Re: [ceph-users] ceph-fuse segfaults ( jewel 10.2.2)

2016-07-14 Thread Brad Hubbard
On Fri, Jul 15, 2016 at 11:19:12AM +0800, Yan, Zheng wrote:
> On Fri, Jul 15, 2016 at 9:35 AM, Goncalo Borges
>  wrote:
> > So, we are hopping that compiling 10.2.2 in an intel processor without the
> > AVX extensions will solve our problem.
> >
> > Does this make sense?
> 
> I have a different theory. ObjectCacher::flush() checks
> "bh->last_write <= cutoff" to decide if it should write buffer head.
> But ObjectCacher::bh_write_adjacencies() checks "bh->last_write <
> cutoff". (cutoff is the time clock when ObjectCacher::flush() starts
> executing). If there is only one dirty buffer head and its last_write
> is equal to cutoff, the segfault happens. For some hardware
> limitations, AMD 62xx CPU may unable to provide high precision time
> clock. This explains the segfault only happens in AMD 62xx. The code
> that causes the segfault was introduced in jewel release. So ceph-fuse
> 9.2.0 does not have this problem.

Hmmm... this also makes a lot of sense.

I guess trying with your patch on all the CPUs mentioned should prove it one
way or the other.

-- 
Cheers,
Brad

> 
> 
> Regards
> Yan, Zheng
> 
> 
> 
> 
> >
> > The compilation takes a while but I will update the issue once I have
> > finished this last experiment (in the next few days)
> >
> > Cheers
> > Goncalo
> >
> >
> >
> > On 07/12/2016 09:45 PM, Goncalo Borges wrote:
> >
> > Hi All...
> >
> > Thank you for continuing to follow this already very long thread.
> >
> > Pat and Greg are correct in their assumption regarding the 10gb virtual
> > memory footprint I see for ceph-fuse process in our cluster with 12 core (24
> > because of hyperthreading) machines and 96 gb of RAM. The source is glibc >
> > 1.10. I can reduce / tune virtual memory threads usage by setting
> > MALLOC_ARENA_MAX = 4 (the default is 8 on 64 bits machines) before mounting
> > the filesystem with ceph-fuse. So, there is no memory leak on ceph-fuse :-)
> >
> > The bad news is that, while reading the arena malloc glibc explanation, it
> > became obvious that the virtual memory footprint scales with tje numer of
> > cores. Therefore the 10gb virtual memory i was seeing in the resources with
> > 12 cores (24 because of hyperthreading) could not / would not be the same in
> > the VMs where I get the segfault since they have only 4 cores.
> >
> > So, at this point, I know that:
> > a./ The segfault is always appearing in a set of VMs with 16 GB of RAM and 4
> > cores.
> > b./ The segfault is not appearing in a set of VMs (in principle identical to
> > the 16 GB ones) but with 16 cores and 64 GB of RAM.
> > c./ the segfault is not appearing in a physicall cluster with machines with
> > 96 GB of RAM and 12 cores (24 because of hyperthreading)
> > and I am not so sure anymore that this is memory related.
> >
> > For further debugging, I've updated
> >http://tracker.ceph.com/issues/16610
> > with a summary of my finding plus some log files:
> >   - The gdb.txt I get after running
> >   $ gdb /path/to/ceph-fuse core.
> >   (gdb) set pag off
> >   (gdb) set log on
> >   (gdb) thread apply all bt
> >   (gdb) thread apply all bt full
> >   as advised by Brad
> > - The debug.out (gzipped) I get after running ceph-fuse in debug mode with
> > 'debug client 20' and 'debug objectcacher = 20'
> >
> > Cheers
> > Goncalo
> > 
> > From: Gregory Farnum [gfar...@redhat.com]
> > Sent: 12 July 2016 03:07
> > To: Goncalo Borges
> > Cc: John Spray; ceph-users
> > Subject: Re: [ceph-users] ceph-fuse segfaults ( jewel 10.2.2)
> >
> > Oh, is this one of your custom-built packages? Are they using
> > tcmalloc? That difference between VSZ and RSS looks like a glibc
> > malloc problem.
> > -Greg
> >
> > On Mon, Jul 11, 2016 at 12:04 AM, Goncalo Borges
> >  wrote:
> >
> > Hi John...
> >
> > Thank you for replying.
> >
> > Here is the result of the tests you asked but I do not see nothing abnormal.
> > Actually, your suggestions made me see that:
> >
> > 1) ceph-fuse 9.2.0 is presenting the same behaviour but with less memory
> > consumption, probably, less enought so that it doesn't brake ceph-fuse in
> > our machines with less memory.
> >
> > 2) I see a tremendous number of  ceph-fuse threads launched (around 160).
> >
> > # ps -T -p 3230 -o command,ppid,pid,spid,vsize,rss,%mem,%cpu | wc -l
> > 157
> >
> > # ps -T -p 3230 -o command,ppid,pid,spid,vsize,rss,%mem,%cpu | head -n 10
> > COMMAND  PPID   PID  SPIDVSZ   RSS %MEM %CPU
> > ceph-fuse --id mount_user - 1  3230  3230 9935240 339780  0.6 0.0
> > ceph-fuse --id mount_user - 1  3230  3231 9935240 339780  0.6 0.1
> > ceph-fuse --id mount_user - 1  3230  3232 9935240 339780  0.6 0.0
> > ceph-fuse --id mount_user - 1  3230  3233 9935240 339780  0.6 0.0
> > ceph-fuse --id mount_user - 1  3230  3234 9935240 339780  0.6 0.0
> > ceph-fuse --id mount_user - 1  3230  3235 9935240 339780  0.6 0.0
> > ceph-fuse --id mount_user - 1  3230  3236 9935240 339780  0.6 0.0
> > ceph-

Re: [ceph-users] ceph-fuse segfaults ( jewel 10.2.2)

2016-07-14 Thread Goncalo Borges

Thanks Zheng...

Now that we have identified the exact context in which the segfault appears
(only on AMD 62xx), I think it should be easier to understand in which
situations the crash appears.


My current compilation is ongoing and I will then test it.

If it fails, I will recompile including your patch.

Will report here afterwards.

Thanks for the feedback.

Cheers

Goncalo


On 07/15/2016 01:19 PM, Yan, Zheng wrote:

On Fri, Jul 15, 2016 at 9:35 AM, Goncalo Borges
 wrote:

Hi All...

I've seen that Zheng, Brad, Pat and Greg already updated or made some
comments on the bug issue. Zheng also proposes a simple patch. However, I do
have a bit more information. We do think we have identified the source of
the problem and that we can correct it. Therefore, I would propose that you
hold any work on the issue until we test our hypothesis. I'll try to
summarize it:

1./ After being convinced that the ceph-fuse segfault we saw in specific VMs
was not memory related, I decided to run the user application in multiple
zones of the openstack cloud we use. We scale up our resources by using a
public funded openstack cloud which spawns machines (using always the same
image) in multiple availability zones. In the majority of the cases we limit
our VMs to (normally) the same availability zone because it seats in the
same data center as our infrastructure. This experiment showed that
ceph-fuse does not segfaults in other availability zones with multiple VMS
of different sizes and types. So the problem was restricted to the
availability zone we normally use as our default one.

2./ I've them created new VMs of multiple sizes and types  in our 'default'
availability zone and rerun the user application. This new experiment,
running in newly created VMs, showed ceph-fuse segfaults independent of the
VM types but not in all VMs. For example, in this new test, ceph-fuse was
segfaulting in some 4 and 8 core VMs but not in all.

3./ I've then decided to inspect the CPU types, and the breakthrough was
that I got a 100% correlation of ceph-fuse segfaults with AMD 62xx processor
VMs. This availability zone has only 2 types of hypervisors: an old one with
AMD 62xx processors, and a new one with Intel processors. If my jobs run in
a VM with Intel, everything is ok. If my jobs run in AMD 62xx, ceph-fuse
segfaults. Actually, the segfault is almost immediate in 4 core AMD 62xx VMs
but takes much more time in 8-core AMD62xx VMs.

4./ I've then crosschecked what processors were used in the successful jobs
executed in the other availability zones: Several types of intel, AMD 63xx
but not AMD 62xx processors.

5./ Talking with my awesome colleague Sean, he remembered some discussions
about applications segfaulting in AMD processors when compiled in an Intel
processor with AVX2 extension. Actually, I compiled ceph 10.2.2 in an intel
processor with AVX2 but ceph 9.2.0 was compiled several months ago on an
intel processor without AVX2. The reason for the change is simply because we
upgraded our infrastructure.

6./ Then, we compared the cpuflags between AMD 63xx and AMD62xx. if you look
carefully, 63xx has 'fma f16c tbm bmi1' and 62xx has 'svm'. According to my
colleague, fma and f16c are both AMD extensions which make AMD more
compatible with the AVX extension by Intel.

63xx
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat
pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt pdpe1gb lm
rep_good extd_apicid unfair_spinlock pni pclmulqdq ssse3 fma cx16 sse4_1
sse4_2 x2apic popcnt aes xsave avx f16c hypervisor lahf_lm cmp_legacy
cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw xop fma4 tbm bmi1

62xx
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat
pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt pdpe1gb lm
rep_good extd_apicid unfair_spinlock pni pclmulqdq ssse3 cx16 sse4_1 sse4_2
x2apic popcnt aes xsave avx hypervisor lahf_lm cmp_legacy svm cr8_legacy abm
sse4a misalignsse 3dnowprefetch osvw xop fma4


All of the previous arguments may explain why we can use 9.2.0 in AMD 62xx,
and why 10.2.2 works in AMD 63xx but not in AMD 62xx.

So, we are hopping that compiling 10.2.2 in an intel processor without the
AVX extensions will solve our problem.

Does this make sense?

I have a different theory. ObjectCacher::flush() checks
"bh->last_write <= cutoff" to decide if it should write buffer head.
But ObjectCacher::bh_write_adjacencies() checks "bh->last_write <
cutoff". (cutoff is the time clock when ObjectCacher::flush() starts
executing). If there is only one dirty buffer head and its last_write
is equal to cutoff, the segfault happens. For some hardware
limitations, AMD 62xx CPU may unable to provide high precision time
clock. This explains the segfault only happens in AMD 62xx. The code
that causes the segfault was introduced in jewel release. So ceph-fuse
9.2.0 does not have this problem.


Regards
Yan, Zheng





The compilation takes a while but I will update the issue once I

Re: [ceph-users] setting crushmap while creating pool fails

2016-07-14 Thread Shinobu Kinjo
You may want to change the value of "osd_pool_default_crush_replicated_ruleset".
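For example in ceph.conf, in the [global] or [mon] section as Wido noted
earlier in this thread (injecting it into the running monitors is only a
sketch and may not take effect without a restart):

  [global]
  osd_pool_default_crush_replicated_ruleset = 2

  ceph tell mon.* injectargs '--osd_pool_default_crush_replicated_ruleset=2'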

 shinobu

On Fri, Jul 15, 2016 at 7:38 AM, Oliver Dzombic 
wrote:

> Hi,
>
> wow, figured it out.
>
> If you dont have a ruleset 0 id, you are in trouble.
>
> So the solution is, that you >MUST< have a ruleset id 0.
>
> --
> Mit freundlichen Gruessen / Best regards
>
> Oliver Dzombic
> IP-Interactive
>
> mailto:i...@ip-interactive.de
>
> Anschrift:
>
> IP Interactive UG ( haftungsbeschraenkt )
> Zum Sonnenberg 1-3
> 63571 Gelnhausen
>
> HRB 93402 beim Amtsgericht Hanau
> Geschäftsführung: Oliver Dzombic
>
> Steuer Nr.: 35 236 3622 1
> UST ID: DE274086107
>
>
> Am 15.07.2016 um 00:10 schrieb Oliver Dzombic:
> > Hi,
> >
> > thanks for the suggestion. I tried it out.
> >
> > No effect.
> >
> > My ceph.conf looks like:
> >
> > [osd]
> > osd_pool_default_crush_replicated_ruleset = 2
> > osd_pool_default_size = 2
> > osd_pool_default_min_size = 1
> >
> > The complete: http://pastebin.com/sG4cPYCY
> >
> > But the config is completely ignored.
> >
> > If i run
> >
> > # ceph osd pool create vmware1 64 64 replicated cold-storage-rule
> >
> > i will get:
> >
> > pool 12 'vmware1' replicated size 3 min_size 2 crush_ruleset 1
> > object_hash rjenkins pg_num 64 pgp_num 64 last_change 2100 flags
> > hashpspool stripe_width 0
> >
> > While the intresting part of my crushmap looks like:
> >
> > # begin crush map
> > tunable choose_local_tries 0
> > tunable choose_local_fallback_tries 0
> > tunable choose_total_tries 50
> > tunable chooseleaf_descend_once 1
> > tunable chooseleaf_vary_r 1
> > tunable straw_calc_version 1
> >
> > root ssd-cache {
> > id -5   # do not change unnecessarily
> > # weight 1.704
> > alg straw
> > hash 0  # rjenkins1
> > item cephosd1-ssd-cache weight 0.852
> > item cephosd2-ssd-cache weight 0.852
> > }
> > root cold-storage {
> > id -6   # do not change unnecessarily
> > # weight 51.432
> > alg straw
> > hash 0  # rjenkins1
> > item cephosd1-cold-storage weight 25.716
> > item cephosd2-cold-storage weight 25.716
> > }
> >
> > # rules
> > rule ssd-cache-rule {
> > ruleset 1
> > type replicated
> > min_size 2
> > max_size 10
> > step take ssd-cache
> > step chooseleaf firstn 0 type host
> > step emit
> > }
> > rule cold-storage-rule {
> > ruleset 2
> > type replicated
> > min_size 2
> > max_size 10
> > step take cold-storage
> > step chooseleaf firstn 0 type host
> > step emit
> > }
> >
> > -
> >
> > I have no idea whats going wrong here.
> >
> > I already opend a bug tracker:
> >
> > http://tracker.ceph.com/issues/16653
> >
> > But unfortunatelly without too much luck.
> >
> > I really have no idea what to do now. I cant create pools and assign the
> > correct rulesets. Basically that means i have to resetup all. But there
> > is no gurantee that this will not happen again.
> >
> > So my only option would be to make an additional ceph storage for other
> > pools, which is not really an option.
> >
> > I deeply appriciate any kind of idea...
> >
> > Thank you !
> >
> >
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Email:
shin...@linux.com
shin...@redhat.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] New to Ceph - osd autostart problem

2016-07-14 Thread Dirk Laurenz

Hello George,


I did what you suggested, but it didn't help... no autostart - I have to
start them manually.



root@cephosd01:~#  sgdisk -i 1 /dev/sdb
Partition GUID code: 4FBD7E29-9D25-41B8-AFD0-062C0CEFF05D (Unknown)
Partition unique GUID: 48B7EC4E-A582-4B84-B823-8C3A36D9BB0A
First sector: 10487808 (at 5.0 GiB)
Last sector: 104857566 (at 50.0 GiB)
Partition size: 94369759 sectors (45.0 GiB)
Attribute flags: 
Partition name: 'ceph data'
root@cephosd01:~#  sgdisk -i 2 /dev/sdb
Partition GUID code: 45B0969E-9B03-4F30-B4C6-B4B80CEFF106 (Unknown)
Partition unique GUID: 2B7CC697-EFA9-4041-A62C-A044DB2BB03B
First sector: 2048 (at 1024.0 KiB)
Last sector: 10487807 (at 5.0 GiB)
Partition size: 10485760 sectors (5.0 GiB)
Attribute flags: 
Partition name: 'ceph journal'


What makes me wonder is that the partition type shows as unknown.
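For what it is worth, the GUID codes printed above match the Ceph data
(4FBD7E29-...) and journal (45B0969E-...) typecodes quoted below - sgdisk
just shows "(Unknown)" because it has no name for them. A sketch for
re-running the udev activation path without a reboot, to see whether the
rules fire at all:

  udevadm trigger --action=add --sysname-match='sdb*'
  # or, per partition, as mentioned further down:
  /usr/sbin/ceph-disk trigger /dev/sdb1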


On 13.07.2016 at 17:16, George Shuklin wrote:
As you can see you have 'unknown' partition type. It should be 'ceph 
journal' and 'ceph data'.


Stop ceph-osd, unmount partitions and change typecodes for partition 
properly:
/sbin/sgdisk --typecode=PART:4fbd7e29-9d25-41b8-afd0-062c0ceff05d -- 
/dev/DISK


PART - number of partition with data (1 in your case), so:

/sbin/sgdisk --typecode=1:4fbd7e29-9d25-41b8-afd0-062c0ceff05d -- 
/dev/sdb (sdc, etc).


You can change typecode for journal partition too:

/sbin/sgdisk --typecode=2:45b0969e-9b03-4f30-b4c6-b4b80ceff106 -- /dev/sdb


On 07/12/2016 01:05 AM, Dirk Laurenz wrote:


root@cephosd01:~# fdisk -l /dev/sdb

Disk /dev/sdb: 50 GiB, 53687091200 bytes, 104857600 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 87B152E0-EB5D-4EB0-8FFB-C27096CBB1ED

DeviceStart   End  Sectors Size Type
/dev/sdb1  10487808 104857566 94369759  45G unknown
/dev/sdb2  2048  10487807 10485760   5G unknown

Partition table entries are not in disk order.
root@cephosd01:~# fdisk -l /dev/sdc

Disk /dev/sdc: 50 GiB, 53687091200 bytes, 104857600 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 31B81FCA-9163-4723-B195-97AEC9568AD0

DeviceStart   End  Sectors Size Type
/dev/sdc1  10487808 104857566 94369759  45G unknown
/dev/sdc2  2048  10487807 10485760   5G unknown

Partition table entries are not in disk order.


On 11.07.2016 at 18:01, George Shuklin wrote:

Check out partition type for data partition for ceph.

fdisk -l /dev/sdc

On 07/11/2016 04:03 PM, Dirk Laurenz wrote:


hmm, helps partially ... running


/usr/sbin/ceph-disk trigger /dev/sdc1 or sdb1 works and brings osd up..


systemctl enable does not help


On 11.07.2016 at 14:49, George Shuklin wrote:

Short story how OSDs are started in systemd environments:

Ceph OSD parittions has specific typecode (partition type 
4FBD7E29-9D25-41B8-AFD0-062C0CEFF05D). It handled by udev rules 
shipped by ceph package:

/lib/udev/rules.d/95-ceph-osd.rules

It set up proper owner/group for this disk ('ceph' instead 'root') 
and calls /usr/sbin/ceph-disk trigger.


ceph-disk triggers creation of instance of ceph-disk@ systemd unit 
(to mount disk to /var/lib/ceph/osd/...), and ceph-osd@ (i'm not 
sure about all sequence of events).


Basically, to make OSD autostart they NEED to have proper typecode 
in their partition. If you using something different (like 
'directory based OSD') you should enable OSD autostart:


systemctl enable ceph-osd@42


On 07/11/2016 03:32 PM, Dirk Laurenz wrote:

Hello,


i'm new to ceph an try to do some first steps with ceph to 
understand concepts.


my setup is at first completly in vm


i deployed (with ceph-deploy) three monitors and three osd hosts. 
(3+3 vms)


my frist test was to find out, if everything comes back online 
after a system restart. this works fine for the monitors, but 
fails for the osds. i have to start them manually.



OS is debian jessie, ceph is the current release


Where can find out, what's going wrong





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com