Re: [ceph-users] RGW 10.2.5->10.2.7 authentication fail?

2017-05-23 Thread Ingo Reimann
Hi Ben!



Thanks for your advice. I had included the names of our gateways but omitted 
the external name of the service itself. Now everything is working again.
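
For anyone else hitting this, the change boils down to roughly the following (a sketch only; the zonegroup name, hostnames and exact flags are placeholders for our setup, so check them against the docs for your release):

radosgw-admin zonegroup get --rgw-zonegroup=default > zonegroup.json
# edit zonegroup.json so that "hostnames" lists every name the gateways are reached by,
# including the load-balanced service name, e.g.:
#   "hostnames": ["rgw1.internal.example", "rgw2.internal.example", "s3.example.com"],
radosgw-admin zonegroup set --rgw-zonegroup=default < zonegroup.json
radosgw-admin period update --commit   # only if your setup uses realms/periods
# then restart the radosgw instances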



And yes, this change is worth a note :-)



Best regards,



Ingo



Von: Ben Hines [mailto:bhi...@gmail.com]
Gesendet: Dienstag, 23. Mai 2017 02:16
An: Ingo Reimann
Cc: Radoslaw Zarzynski; ceph-users
Betreff: Re: [ceph-users] RGW 10.2.5->10.2.7 authentication fail?



We used this workaround when upgrading to Kraken (which had a similar issue)



>modify the zonegroup and populate the 'hostnames' array with all backend 
>server hostnames as well as the hostname terminated by haproxy



Which I'm fine with. It's definitely a change that should be called out in a more 
prominent release note. Without the hostname in there, Ceph interpreted the 
hostname as a bucket name whenever the hostname that RGW was being hit with 
differed from the hostname of the actual server. Pre-Kraken, I didn't need that 
setting at all and it just worked.



-Ben



On Mon, May 22, 2017 at 1:11 AM, Ingo Reimann  wrote:

Hi Radek,

is there any news about this issue? We are also stuck on 10.2.5 and can't
update to 10.2.7.
We run a couple of radosgws that are load-balanced behind Keepalived/LVS.
Removing rgw_dns_name only helps if I address a gateway directly, not in
general.

Best regards,

Ingo

-Ursprüngliche Nachricht-
Von: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] Im Auftrag von
Radoslaw Zarzynski
Gesendet: Mittwoch, 3. Mai 2017 11:59
An: Łukasz Jagiełło
Cc: ceph-users@lists.ceph.com
Betreff: Re: [ceph-users] RGW 10.2.5->10.2.7 authentication fail?


Hello Łukasz,

Thanks for your testing and sorry for my mistake. It looks like two commits
need to be reverted to get the previous behaviour:

The already mentioned one:
  https://github.com/ceph/ceph/commit/c9445faf7fac2ccb8a05b53152c0ca16d7f4c6d0
Its dependency:
  https://github.com/ceph/ceph/commit/b72fc1b820ede3cd186d887d9d30f7f91fe3764b

They have been merged in the same pull request:
  https://github.com/ceph/ceph/pull/11760
and they form the difference visible between v10.2.5 and v10.2.6 in the handling
of "in_hosted_domain":
  https://github.com/ceph/ceph/blame/v10.2.5/src/rgw/rgw_rest.cc#L1773
  https://github.com/ceph/ceph/blame/v10.2.6/src/rgw/rgw_rest.cc#L1781-L1782

I'm really not sure we want to revert them. Still, it may be that they merely
unhide a misconfiguration issue while fixing the problems we had with the
handling of virtual-hosted buckets.

Regards,
Radek

On Wed, May 3, 2017 at 3:12 AM, Łukasz Jagiełło 
wrote:
> Hi,
>
> I tried today to revert [1] from 10.2.7, but the problem is still there
> even without the change. Reverting to 10.2.5 fixes the issue instantly.
>
> https://github.com/ceph/ceph/commit/c9445faf7fac2ccb8a05b53152c0ca16d7
> f4c6d0
>
> On Thu, Apr 27, 2017 at 4:53 AM, Radoslaw Zarzynski
>  wrote:
>>




Ingo Reimann

Teamleiter Technik
Dunkel GmbH
Philipp-Reis-Straße 2
65795 Hattersheim
Fon: +49 6190 889-100 
Fax: +49 6190 889-399 
eMail: supp...@dunkel.de
http://www.Dunkel.de/
Amtsgericht Frankfurt/Main
HRB: 37971
Geschäftsführer: Axel Dunkel
Ust-ID: DE 811622001




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Some monitors have still not reached quorum

2017-05-23 Thread Shambhu Rajak
Hi Alfredo,
This is solved: all the listening ports were blocked in my setup; after 
allowing the monitor/OSD ports it now works.
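For anyone hitting the same thing, the rules I needed boil down to roughly this (a sketch, assuming the default ports and plain iptables; adapt to whatever firewall tooling you use):

iptables -A INPUT -p tcp --dport 6789 -j ACCEPT        # ceph-mon
iptables -A INPUT -p tcp --dport 6800:7300 -j ACCEPT   # default ceph-osd/ceph-mds port range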
Thanks,
Shambhu

-Original Message-
From: Shambhu Rajak 
Sent: Tuesday, May 23, 2017 10:33 AM
To: 'Alfredo Deza'
Cc: ceph-users@lists.ceph.com
Subject: RE: [ceph-users] Some monitors have still not reached quorum

Hi Alfredo,
Here is the full log:

sandvine@shambhucephnode:~/my-cluster$ ceph-deploy mon create-initial
[ceph_deploy.conf][DEBUG ] found configuration file at: /home/sandvine/.cephdeploy.conf
[ceph_deploy.cli][INFO  ] Invoked (1.5.37): /usr/bin/ceph-deploy mon create-initial
[ceph_deploy.cli][INFO  ] ceph-deploy options:
[ceph_deploy.cli][INFO  ]  username  : None
[ceph_deploy.cli][INFO  ]  verbose   : False
[ceph_deploy.cli][INFO  ]  overwrite_conf: False
[ceph_deploy.cli][INFO  ]  subcommand: create-initial
[ceph_deploy.cli][INFO  ]  quiet : False
[ceph_deploy.cli][INFO  ]  cd_conf   : 

[ceph_deploy.cli][INFO  ]  cluster   : ceph
[ceph_deploy.cli][INFO  ]  func  : 
[ceph_deploy.cli][INFO  ]  ceph_conf : None
[ceph_deploy.cli][INFO  ]  default_release   : False
[ceph_deploy.cli][INFO  ]  keyrings  : None
[ceph_deploy.mon][DEBUG ] Deploying mon, cluster ceph hosts shambhucephnode0 shambhucephnode1 shambhucephnode2
[ceph_deploy.mon][DEBUG ] detecting platform for host shambhucephnode0 ...
[shambhucephnode0][DEBUG ] connection detected need for sudo
[shambhucephnode0][DEBUG ] connected to host: shambhucephnode0
[shambhucephnode0][DEBUG ] detect platform information from remote host
[shambhucephnode0][DEBUG ] detect machine type
[shambhucephnode0][DEBUG ] find the location of an executable
[shambhucephnode0][INFO  ] Running command: sudo /sbin/initctl version
[shambhucephnode0][DEBUG ] find the location of an executable
[ceph_deploy.mon][INFO  ] distro info: Ubuntu 14.04 trusty
[shambhucephnode0][DEBUG ] determining if provided host has same hostname in remote
[shambhucephnode0][DEBUG ] get remote short hostname
[shambhucephnode0][DEBUG ] deploying mon to shambhucephnode0
[shambhucephnode0][DEBUG ] get remote short hostname
[shambhucephnode0][DEBUG ] remote hostname: shambhucephnode0
[shambhucephnode0][DEBUG ] write cluster configuration to /etc/ceph/{cluster}.conf
[shambhucephnode0][DEBUG ] create the mon path if it does not exist
[shambhucephnode0][DEBUG ] checking for done path: /var/lib/ceph/mon/ceph-shambhucephnode0/done
[shambhucephnode0][DEBUG ] create a done file to avoid re-doing the mon deployment
[shambhucephnode0][DEBUG ] create the init path if it does not exist
[shambhucephnode0][INFO  ] Running command: sudo initctl emit ceph-mon cluster=ceph id=shambhucephnode0
[shambhucephnode0][INFO  ] Running command: sudo ceph --cluster=ceph --admin-daemon /var/run/ceph/ceph-mon.shambhucephnode0.asok mon_status
[shambhucephnode0][DEBUG ]
[shambhucephnode0][DEBUG ] 

[shambhucephnode0][DEBUG ] status for monitor: mon.shambhucephnode0 
[shambhucephnode0][DEBUG ] {
[shambhucephnode0][DEBUG ]   "election_epoch": 0,
[shambhucephnode0][DEBUG ]   "extra_probe_peers": [
[shambhucephnode0][DEBUG ] "10.187.52.93:6789/0",
[shambhucephnode0][DEBUG ] "10.187.52.95:6789/0",
[shambhucephnode0][DEBUG ] "10.187.52.96:6789/0"
[shambhucephnode0][DEBUG ]   ],
[shambhucephnode0][DEBUG ]   "monmap": {
[shambhucephnode0][DEBUG ] "created": "2017-05-22 10:29:33.419165",
[shambhucephnode0][DEBUG ] "epoch": 0,
[shambhucephnode0][DEBUG ] "fsid": "95858ab8-73d6-42e7-9393-c67de2e12840",
[shambhucephnode0][DEBUG ] "modified": "2017-05-22 10:29:33.419165",
[shambhucephnode0][DEBUG ] "mons": [
[shambhucephnode0][DEBUG ]   {
[shambhucephnode0][DEBUG ] "addr": "172.16.1.13:6789/0",
[shambhucephnode0][DEBUG ] "name": "shambhucephnode0",
[shambhucephnode0][DEBUG ] "rank": 0
[shambhucephnode0][DEBUG ]   },
[shambhucephnode0][DEBUG ]   {
[shambhucephnode0][DEBUG ] "addr": "0.0.0.0:0/1",
[shambhucephnode0][DEBUG ] "name": "shambhucephnode1",
[shambhucephnode0][DEBUG ] "rank": 1
[shambhucephnode0][DEBUG ]   },
[shambhucephnode0][DEBUG ]   {
[shambhucephnode0][DEBUG ] "addr": "0.0.0.0:0/2",
[shambhucephnode0][DEBUG ] "name": "shambhucephnode2",
[shambhucephnode0][DEBUG ] "rank": 2
[shambhucephnode0][DEBUG ]   }
[shambhucephnode0][DEBUG ] ]
[shambhucephnode0][DEBUG ]   },
[shambhucephnode0][DEBUG ]   "name": "shambhucephnode0",
[shambhucephnode0][DEBUG ]   "outside_quorum": [
[shambhucephnode0][DEBUG ] "shambhucephnode0"
[shambhucephnode0][DEBUG ]   ],
[shambhucephnode0][DEBUG ]   "quorum": [],
[shambhucephnode0][DEBUG ]   "rank": 0,
[shambhucephnode0][DEBUG ]   "state": "probing",
[shambh

Re: [ceph-users] Large OSD omap directories (LevelDBs)

2017-05-23 Thread george.vasilakakos
> Your RGW buckets, how many objects in them, and do they have the index
> sharded?

> I know we have some very large & old buckets (10M+ RGW objects in a
> single bucket), with correspondingly large OMAPs wherever that bucket
> index is living (sufficently large that trying to list the entire thing
> online is fruitless). ceph's pgmap status says we have 2G RADOS objects
> however, and you're only at 61M RADOS objects.


According to radosgw-admin bucket stats, the most populous bucket contains 
568101 objects. There is no index sharding. The default.rgw.buckets.data pool 
contains 4162566 objects; I think striping is done by default with 4 MB stripes.

Bear in mind RGW is a small use case for us currently.
Most of the data lives in a pool that is accessed by specialized servers that 
have plugins based on libradosstriper. That pool stores around 1.8 PB in 
32920055 objects.

One thing of note is that we have this:
filestore_xattr_use_omap=1
in our ceph.conf, and libradosstriper makes use of xattrs for its striping 
metadata and locking mechanisms.

This option seems to have been removed some time ago, but the question is: could 
it have any effect? This cluster was built in January and ran Jewel initially.

I do see the xattrs in XFS, but a sampling of an omap dir from an OSD suggested 
that there might be some xattrs in there too.

I'm going to try restarting an OSD with a big omap and also extracting a copy 
of one for further inspection.
It seems to me like they might not be cleaning up old data; I'm fairly certain 
an active cluster would have compacted enough for 3-month-old SSTs to go away.
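
If it's useful to anyone, this is roughly what I plan to look at (a sketch; the paths are the FileStore defaults, and leveldb_compact_on_mount is an assumption on my part that it is honoured by the OSD's omap store on this release):

# size of the omap LevelDBs on one host:
du -sh /var/lib/ceph/osd/ceph-*/current/omap

# ceph.conf, [osd] section: ask LevelDB to compact when the store is opened,
# then restart the OSD and watch the directory size:
#   leveldb_compact_on_mount = true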

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] MDS Question

2017-05-23 Thread James Wilkins
Quick question on CephFS/MDS but I can't find this documented (apologies if it 
is)

What does the q: in a ceph daemon  perf dump mds output represent?

[root@hp3-ceph-mds2 ~]# ceph daemon 
/var/run/ceph/ceph-mds.hp3-ceph-mds2.ceph.hostingp3.local.asok  perf dump mds
{
"mds": {
"request": 10843133,
"reply": 10842472,
"reply_latency": {
"avgcount": 10842472,
"sum": 2678925.337447889
},
"forward": 0,
"dir_fetch": 412972,
"dir_commit": 683903,
"dir_split": 0,
"dir_merge": 0,
"inode_max": 700,
"inodes": 7000209,
"inodes_top": 808282,
"inodes_bottom": 6191218,
"inodes_pin_tail": 709,
"inodes_pinned": 2055258,
"inodes_expired": 2276343,
"inodes_with_caps": 1905570,
"caps": 2392113,
"subtrees": 2,
"traverse": 12551065,
"traverse_hit": 10346763,
"traverse_forward": 0,
"traverse_discover": 0,
"traverse_dir_fetch": 312666,
"traverse_remote_ino": 0,
"traverse_lock": 41125,
"load_cent": 1090788840,
"q": 4371,
"exported": 0,
"exported_inodes": 0,
"imported": 0,
"imported_inodes": 0
}
}


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS Question

2017-05-23 Thread John Spray
On Tue, May 23, 2017 at 1:42 PM, James Wilkins
 wrote:
> Quick question on CephFS/MDS but I can’t find this documented (apologies if
> it is)
>
>
>
> What does the q: represent in a ceph daemon  perf dump mds
> represent?

mds]$ git grep "\"q\""
MDSRank.cc:mds_plb.add_u64(l_mds_dispatch_queue_len, "q",
"Dispatch queue length");

That's a quirky bit of naming for sure!

John

>
>
>
> [root@hp3-ceph-mds2 ~]# ceph daemon
> /var/run/ceph/ceph-mds.hp3-ceph-mds2.ceph.hostingp3.local.asok  perf dump
> mds
>
> {
>
> "mds": {
>
> "request": 10843133,
>
> "reply": 10842472,
>
> "reply_latency": {
>
> "avgcount": 10842472,
>
> "sum": 2678925.337447889
>
> },
>
> "forward": 0,
>
> "dir_fetch": 412972,
>
> "dir_commit": 683903,
>
> "dir_split": 0,
>
> "dir_merge": 0,
>
> "inode_max": 700,
>
> "inodes": 7000209,
>
> "inodes_top": 808282,
>
> "inodes_bottom": 6191218,
>
> "inodes_pin_tail": 709,
>
> "inodes_pinned": 2055258,
>
> "inodes_expired": 2276343,
>
> "inodes_with_caps": 1905570,
>
> "caps": 2392113,
>
> "subtrees": 2,
>
> "traverse": 12551065,
>
> "traverse_hit": 10346763,
>
> "traverse_forward": 0,
>
> "traverse_discover": 0,
>
> "traverse_dir_fetch": 312666,
>
> "traverse_remote_ino": 0,
>
> "traverse_lock": 41125,
>
> "load_cent": 1090788840,
>
> "q": 4371,
>
> "exported": 0,
>
> "exported_inodes": 0,
>
> "imported": 0,
>
> "imported_inodes": 0
>
> }
>
> }
>
>
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Large OSD omap directories (LevelDBs)

2017-05-23 Thread Wido den Hollander

> Op 23 mei 2017 om 13:01 schreef george.vasilaka...@stfc.ac.uk:
> 
> 
> > Your RGW buckets, how many objects in them, and do they have the index
> > sharded?
> 
> > I know we have some very large & old buckets (10M+ RGW objects in a
> > single bucket), with correspondingly large OMAPs wherever that bucket
> > index is living (sufficently large that trying to list the entire thing
> > online is fruitless). ceph's pgmap status says we have 2G RADOS objects
> > however, and you're only at 61M RADOS objects.
> 
> 
> According to radosgw-admin bucket stats the most populous bucket contains 
> 568101 objects. There is no index sharding. The default.rgw.buckets.data pool 
> contains 4162566 objects, I think striping is done by default for 4MB sizes 
> stripes.
> 

Without index sharding 500k objects in a bucket can already cause larger OMAP 
directories. I'd recommend that you at least start to shard them.
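
Something along these lines, for example (a sketch; verify the option name and the reshard command against the docs for your exact release, and 16 is only an illustrative shard count):

# radosgw side, ceph.conf: applies to newly created buckets
rgw_override_bucket_index_max_shards = 16

# offline reshard of an existing bucket's index, if your radosgw-admin has the command:
radosgw-admin bucket reshard --bucket=<bucket> --num-shards=16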

Wido

> Bear in mind RGW is a small use case for us currently.
> Most of the data lives in a pool that is accessed by specialized servers that 
> have plugins based on libradosstriper. That pool stores around 1.8 PB in 
> 32920055 objects.
> 
> One thing of note is that we have this:
> filestore_xattr_use_omap=1
> in our ceph.conf and libradosstriper makes use of xattrs for striping 
> metadata and locking mechanisms.
> 
> This seems to have been removed some time ago but the question is could have 
> any effect? This cluster was built in January and ran Jewel initially.
> 
> I do see the xattrs in XFS but a sampling of an omap dir from an OSD showed 
> like there might be some xattrs in there too.
> 
> I'm going to try restarting an OSD with a big omap and also extracting a copy 
> of one for further inspection.
> It seems to me like they might not be cleaning up old data. I'm fairly 
> certain an active cluster would've compacted enough for 3 month old SSTs to 
> go away.
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Scuttlemonkey signing off...

2017-05-23 Thread Wido den Hollander
Hey Patrick,

Thanks for all your work in the last 5 years! Sad to see you leave, but again, 
your effort is very much appreciated!

Wido

> Op 22 mei 2017 om 16:36 schreef Patrick McGarry :
> 
> 
> Hey cephers,
> 
> I'm writing to you today to share that my time in the Ceph community
> is coming to an end this year. The last five years (!!) of working
> with the Ceph community have yielded some of the most rewarding
> adventures of my professional career, but a new opportunity has come
> along that I just couldn't pass up.
> 
> I will continue to work through the end of July in order to transition
> my responsibilities to a replacement.  In the spirit of Ceph openness,
> I am currently assisting Stormy Peters (Red Hat's senior community
> manager - sto...@redhat.com) in seeking candidates, so if you know
> anyone who might be interested in managing the Ceph community, please
> let me know.
> 
> While this is definitely bittersweet for me, the Ceph community has
> done a good job of self-managing, self-healing, and replicating just
> like the underlying technology, so I know you are all in good hands
> (each others!).  If you would like to keep in touch, or have questions
> beyond the time I am able to answer my @redhat.com email address, feel
> free to reach out to me at pmcga...@gmail.com and I'll be happy to
> catch up.
> 
> If you have any questions or concerns in the meantime feel free to
> reach out to me directly, but I'll do my best to ensure there is
> minimal distruption during this transition. Thank you to all of you in
> the Ceph community who have made this journey so rewarding. I look
> forward to seeing even more amazing things in Ceph's future!
> 
> 
> -- 
> 
> Best Regards,
> 
> Patrick McGarry
> Director Ceph Community || Red Hat
> http://ceph.com  ||  http://community.redhat.com
> @scuttlemonkey || @ceph
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Large OSD omap directories (LevelDBs)

2017-05-23 Thread george.vasilakakos
Hi Wido,

I see your point. I would expect omaps to grow with the number of objects, but 
multiple OSDs getting to multiple tens of GBs for their omaps seems excessive. 
I find it difficult to believe that not sharding the index for a bucket of 500k 
objects in RGW causes the 10 largest OSD omaps to grow to a total of 512 GB, which 
is about 2000 times the size of 10 average omaps. Given the relative usage of our 
pools and the much greater prominence of our non-RGW pools on the OSDs with huge 
omaps, I'm not inclined to think this is caused by some RGW configuration (or lack 
thereof).

It's also worth pointing out that we've seen problems with files being slow to 
retrieve (I'm talking about rados get doing 120 MB/s on one file and 2 MB/s on 
another), and subsequently the omap of the OSD hosting the first stripe of those 
files growing from 30 MB to 5 GB in the span of an hour, during which the logs 
are flooded with LevelDB compaction activity.

Best regards,

George

From: Wido den Hollander [w...@42on.com]
Sent: 23 May 2017 14:00
To: Vasilakakos, George (STFC,RAL,SC); ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Large OSD omap directories (LevelDBs)

> Op 23 mei 2017 om 13:01 schreef george.vasilaka...@stfc.ac.uk:
>
>
> > Your RGW buckets, how many objects in them, and do they have the index
> > sharded?
>
> > I know we have some very large & old buckets (10M+ RGW objects in a
> > single bucket), with correspondingly large OMAPs wherever that bucket
> > index is living (sufficently large that trying to list the entire thing
> > online is fruitless). ceph's pgmap status says we have 2G RADOS objects
> > however, and you're only at 61M RADOS objects.
>
>
> According to radosgw-admin bucket stats the most populous bucket contains 
> 568101 objects. There is no index sharding. The default.rgw.buckets.data pool 
> contains 4162566 objects, I think striping is done by default for 4MB sizes 
> stripes.
>

Without index sharding 500k objects in a bucket can already cause larger OMAP 
directories. I'd recommend that you at least start to shard them.

Wido

> Bear in mind RGW is a small use case for us currently.
> Most of the data lives in a pool that is accessed by specialized servers that 
> have plugins based on libradosstriper. That pool stores around 1.8 PB in 
> 32920055 objects.
>
> One thing of note is that we have this:
> filestore_xattr_use_omap=1
> in our ceph.conf and libradosstriper makes use of xattrs for striping 
> metadata and locking mechanisms.
>
> This seems to have been removed some time ago but the question is could have 
> any effect? This cluster was built in January and ran Jewel initially.
>
> I do see the xattrs in XFS but a sampling of an omap dir from an OSD showed 
> like there might be some xattrs in there too.
>
> I'm going to try restarting an OSD with a big omap and also extracting a copy 
> of one for further inspection.
> It seems to me like they might not be cleaning up old data. I'm fairly 
> certain an active cluster would've compacted enough for 3 month old SSTs to 
> go away.
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Scuttlemonkey signing off...

2017-05-23 Thread John Wilkins
Sorry to see you go Patrick. You've been at this as long as I have. Best of
luck to you!

On Tue, May 23, 2017 at 6:01 AM, Wido den Hollander  wrote:

> Hey Patrick,
>
> Thanks for all your work in the last 5 years! Sad to see you leave, but
> again, your effort is very much appreciated!
>
> Wido
>
> > Op 22 mei 2017 om 16:36 schreef Patrick McGarry :
> >
> >
> > Hey cephers,
> >
> > I'm writing to you today to share that my time in the Ceph community
> > is coming to an end this year. The last five years (!!) of working
> > with the Ceph community have yielded some of the most rewarding
> > adventures of my professional career, but a new opportunity has come
> > along that I just couldn't pass up.
> >
> > I will continue to work through the end of July in order to transition
> > my responsibilities to a replacement.  In the spirit of Ceph openness,
> > I am currently assisting Stormy Peters (Red Hat's senior community
> > manager - sto...@redhat.com) in seeking candidates, so if you know
> > anyone who might be interested in managing the Ceph community, please
> > let me know.
> >
> > While this is definitely bittersweet for me, the Ceph community has
> > done a good job of self-managing, self-healing, and replicating just
> > like the underlying technology, so I know you are all in good hands
> > (each others!).  If you would like to keep in touch, or have questions
> > beyond the time I am able to answer my @redhat.com email address, feel
> > free to reach out to me at pmcga...@gmail.com and I'll be happy to
> > catch up.
> >
> > If you have any questions or concerns in the meantime feel free to
> > reach out to me directly, but I'll do my best to ensure there is
> > minimal distruption during this transition. Thank you to all of you in
> > the Ceph community who have made this journey so rewarding. I look
> > forward to seeing even more amazing things in Ceph's future!
> >
> >
> > --
> >
> > Best Regards,
> >
> > Patrick McGarry
> > Director Ceph Community || Red Hat
> > http://ceph.com  ||  http://community.redhat.com
> > @scuttlemonkey || @ceph
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
John Wilkins
Red Hat
jowil...@redhat.com
(415) 425-9599
http://redhat.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph-mon and existing zookeeper servers

2017-05-23 Thread Sean Purdy
Hi,


This is my first ceph installation.  It seems to tick our boxes.  Will be
using it as an object store with radosgw.

I notice that ceph-mon uses zookeeper behind the scenes.  Is there a way to
point ceph-mon at an existing zookeeper cluster, using a zookeeper chroot?

Alternatively, might ceph-mon coexist peacefully with a different zookeeper
already on the same machine?


Thanks,

Sean Purdy
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-mon and existing zookeeper servers

2017-05-23 Thread John Spray
On Tue, May 23, 2017 at 4:04 PM, Sean Purdy  wrote:
> Hi,
>
>
> This is my first ceph installation.  It seems to tick our boxes.  Will be
> using it as an object store with radosgw.
>
> I notice that ceph-mon uses zookeeper behind the scenes.  Is there a way to
> point ceph-mon at an existing zookeeper cluster, using a zookeeper chroot?

ceph-mon uses a home-grown implementation of a consensus algorithm
(paxos) -- it is not based on zookeeper.

> Alternatively, might ceph-mon coexist peacefully with a different zookeeper
> already on the same machine?

Absolutely, there's nothing stopping you from running zookeeper and
the mons on the same nodes.  Just make sure they don't clobber each
other when very busy (perhaps consider giving them separate drives).

John

>
>
> Thanks,
>
> Sean Purdy
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS Question

2017-05-23 Thread James Wilkins
Thanks :-)

If we are seeing this rise unnaturally high (e.g. >140K, which corresponds with 
slow access to CephFS), do you have any recommendations for where we should be 
looking? Is this related to the messenger service and its dispatch/throttle 
bytes?


-Original Message-
From: John Spray [mailto:jsp...@redhat.com] 
Sent: 23 May 2017 13:51
To: James Wilkins 
Cc: Users, Ceph 
Subject: Re: [ceph-users] MDS Question

On Tue, May 23, 2017 at 1:42 PM, James Wilkins  
wrote:
> Quick question on CephFS/MDS but I can’t find this documented 
> (apologies if it is)
>
>
>
> What does the q: represent in a ceph daemon  perf dump mds 
> represent?

mds]$ git grep "\"q\""
MDSRank.cc:mds_plb.add_u64(l_mds_dispatch_queue_len, "q",
"Dispatch queue length");

That's a quirky bit of naming for sure!

John

>
>
>
> [root@hp3-ceph-mds2 ~]# ceph daemon
> /var/run/ceph/ceph-mds.hp3-ceph-mds2.ceph.hostingp3.local.asok  perf 
> dump mds
>
> {
>
> "mds": {
>
> "request": 10843133,
>
> "reply": 10842472,
>
> "reply_latency": {
>
> "avgcount": 10842472,
>
> "sum": 2678925.337447889
>
> },
>
> "forward": 0,
>
> "dir_fetch": 412972,
>
> "dir_commit": 683903,
>
> "dir_split": 0,
>
> "dir_merge": 0,
>
> "inode_max": 700,
>
> "inodes": 7000209,
>
> "inodes_top": 808282,
>
> "inodes_bottom": 6191218,
>
> "inodes_pin_tail": 709,
>
> "inodes_pinned": 2055258,
>
> "inodes_expired": 2276343,
>
> "inodes_with_caps": 1905570,
>
> "caps": 2392113,
>
> "subtrees": 2,
>
> "traverse": 12551065,
>
> "traverse_hit": 10346763,
>
> "traverse_forward": 0,
>
> "traverse_discover": 0,
>
> "traverse_dir_fetch": 312666,
>
> "traverse_remote_ino": 0,
>
> "traverse_lock": 41125,
>
> "load_cent": 1090788840,
>
> "q": 4371,
>
> "exported": 0,
>
> "exported_inodes": 0,
>
> "imported": 0,
>
> "imported_inodes": 0
>
> }
>
> }
>
>
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-mon and existing zookeeper servers

2017-05-23 Thread Joao Eduardo Luis

On 05/23/2017 04:04 PM, Sean Purdy wrote:

Hi,


This is my first ceph installation.  It seems to tick our boxes.  Will be
using it as an object store with radosgw.

I notice that ceph-mon uses zookeeper behind the scenes.  Is there a way to
point ceph-mon at an existing zookeeper cluster, using a zookeeper chroot?

Alternatively, might ceph-mon coexist peacefully with a different zookeeper
already on the same machine?


The monitors use their own implementation of a zookeeper-like protocol.

You should not have a problem running something else on the same machine, 
provided the monitors are not starved of resources.


  -Joao
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS Question

2017-05-23 Thread John Spray
On Tue, May 23, 2017 at 4:27 PM, James Wilkins
 wrote:
> Thanks : -)
>
> If we are seeing this rise unnaturally high (e.g >140K - which corresponds 
> with slow access to CephFS) do you have any recommendations of where we 
> should be looking - is this related to the messenger service and its 
> dispatch/throttle bytes?

That is a super long queue.  I would be looking at the other counters
to see what's ticking upwards quickly (i.e. what is being maxed out
and thereby presumably causing a backlog).
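
Something like this gives a quick view of which counters are moving (a sketch; mds.<id> is your MDS name, and ceph daemonperf should be present on recent releases):

ceph daemonperf mds.<id>                          # top-like live view of the perf counters
watch -n 5 "ceph daemon mds.<id> perf dump mds"   # or poll the raw dump and compare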

John

>
>
> -Original Message-
> From: John Spray [mailto:jsp...@redhat.com]
> Sent: 23 May 2017 13:51
> To: James Wilkins 
> Cc: Users, Ceph 
> Subject: Re: [ceph-users] MDS Question
>
> On Tue, May 23, 2017 at 1:42 PM, James Wilkins  
> wrote:
>> Quick question on CephFS/MDS but I can’t find this documented
>> (apologies if it is)
>>
>>
>>
>> What does the q: represent in a ceph daemon  perf dump mds
>> represent?
>
> mds]$ git grep "\"q\""
> MDSRank.cc:mds_plb.add_u64(l_mds_dispatch_queue_len, "q",
> "Dispatch queue length");
>
> That's a quirky bit of naming for sure!
>
> John
>
>>
>>
>>
>> [root@hp3-ceph-mds2 ~]# ceph daemon
>> /var/run/ceph/ceph-mds.hp3-ceph-mds2.ceph.hostingp3.local.asok  perf
>> dump mds
>>
>> {
>>
>> "mds": {
>>
>> "request": 10843133,
>>
>> "reply": 10842472,
>>
>> "reply_latency": {
>>
>> "avgcount": 10842472,
>>
>> "sum": 2678925.337447889
>>
>> },
>>
>> "forward": 0,
>>
>> "dir_fetch": 412972,
>>
>> "dir_commit": 683903,
>>
>> "dir_split": 0,
>>
>> "dir_merge": 0,
>>
>> "inode_max": 700,
>>
>> "inodes": 7000209,
>>
>> "inodes_top": 808282,
>>
>> "inodes_bottom": 6191218,
>>
>> "inodes_pin_tail": 709,
>>
>> "inodes_pinned": 2055258,
>>
>> "inodes_expired": 2276343,
>>
>> "inodes_with_caps": 1905570,
>>
>> "caps": 2392113,
>>
>> "subtrees": 2,
>>
>> "traverse": 12551065,
>>
>> "traverse_hit": 10346763,
>>
>> "traverse_forward": 0,
>>
>> "traverse_discover": 0,
>>
>> "traverse_dir_fetch": 312666,
>>
>> "traverse_remote_ino": 0,
>>
>> "traverse_lock": 41125,
>>
>> "load_cent": 1090788840,
>>
>> "q": 4371,
>>
>> "exported": 0,
>>
>> "exported_inodes": 0,
>>
>> "imported": 0,
>>
>> "imported_inodes": 0
>>
>> }
>>
>> }
>>
>>
>>
>>
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Object store backups

2017-05-23 Thread Sean Purdy
Hi,

Another newbie question.  Do people using radosgw mirror their buckets
to AWS S3 or compatible services as a backup?  We're setting up a
small cluster and are thinking of ways to mitigate total disaster.
What do people recommend?


Thanks,

Sean Purdy
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] mds slow request, getattr currently failed to rdlock. Kraken with Bluestore

2017-05-23 Thread Daniel K
I have a 20 OSD cluster - "my first ceph cluster" - with another 400 OSDs
en route.

I was "beating up" on the cluster and had been writing to a 6TB file in
CephFS for several hours, during which I changed the crushmap to better
match my environment, generating a bunch of recovery IO. After about 5.8TB
written, one of the OSD hosts (which is also a MON... soon to be rectified)
that had 5 OSDs on it crashed, and after rebooting I have this in ceph -s
(the degraded/misplaced warnings are likely because the cluster hasn't
completed rebalancing after I changed the crushmap, I assume):


2017-05-23 18:33:13.775924 7ff9d3230700 -1 WARNING: the following dangerous
and experimental features are enabled: bluestore
2017-05-23 18:33:13.781732 7ff9d3230700 -1 WARNING: the following dangerous
and experimental features are enabled: bluestore
cluster e92e20ca-0fe6-4012-86cc-aa51e041
 health HEALTH_WARN
440 pgs backfill_wait
7 pgs backfilling
85 pgs degraded
5 pgs recovery_wait
85 pgs stuck degraded
452 pgs stuck unclean
77 pgs stuck undersized
77 pgs undersized
recovery 196526/3554278 objects degraded (5.529%)
recovery 1690392/3554278 objects misplaced (47.559%)
mds0: 1 slow requests are blocked > 30 sec
 monmap e4: 3 mons at {stor-vm1=10.0.15.51:6789/0,stor-vm2=10.0.15.52:6789/0,stor-vm3=10.0.15.53:6789/0}
        election epoch 136, quorum 0,1,2 stor-vm1,stor-vm2,stor-vm3
  fsmap e21: 1/1/1 up {0=stor-vm4=up:active}
mgr active: stor-vm1 standbys: stor-vm2
 osdmap e4655: 20 osds: 20 up, 20 in; 450 remapped pgs
flags sortbitwise,require_jewel_osds,require_kraken_osds
  pgmap v192589: 1428 pgs, 5 pools, 5379 GB data, 1345 kobjects
11041 GB used, 16901 GB / 27943 GB avail
196526/3554278 objects degraded (5.529%)
1690392/3554278 objects misplaced (47.559%)
 975 active+clean
 364 active+remapped+backfill_wait
  76 active+undersized+degraded+remapped+backfill_wait
   3 active+recovery_wait+degraded+remapped
   3 active+remapped+backfilling
   3 active+degraded+remapped+backfilling
   2 active+recovery_wait+degraded
   1 active+clean+scrubbing+deep
   1 active+undersized+degraded+remapped+backfilling
recovery io 112 MB/s, 28 objects/s


Seems related to the "corrupted rbd filesystems since jewel" thread.

log entries on the MDS server:

2017-05-23 18:27:12.966218 7f95ed6c0700  0 log_channel(cluster) log [WRN] :
slow request 243.113407 seconds old, received at 2017-05-23
18:23:09.852729: client_request(client.204100:5 getattr pAsLsXsFs
#10003ec 2017-05-23 17:48:23.770852 RETRY=2 caller_uid=0,
caller_gid=0{}) currently failed to rdlock, waiting


Output of ceph daemon mds.stor-vm4 objecter_requests (changes each time I run it):
root@stor-vm4:/var/log/ceph# ceph daemon mds.stor-vm4 objecter_requests
{
"ops": [
{
"tid": 66700,
"pg": "1.60e95c32",
"osd": 4,
"object_id": "10003ec.003efb9f",
"object_locator": "@1",
"target_object_id": "10003ec.003efb9f",
"target_object_locator": "@1",
"paused": 0,
"used_replica": 0,
"precalc_pgid": 0,
"last_sent": "1.47461e+06s",
"attempts": 1,
"snapid": "head",
"snap_context": "0=[]",
"mtime": "1969-12-31 19:00:00.00s",
"osd_ops": [
"stat"
]
}
],
"linger_ops": [],
"pool_ops": [],
"pool_stat_ops": [],
"statfs_ops": [],
"command_ops": []
}
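
(Since the op above is sitting on osd.4, I'm also planning to look at it from that side with something like the following, assuming these admin-socket commands are available in this release; they need to be run on the host carrying osd.4:)

ceph daemon osd.4 dump_ops_in_flight     # ops currently in flight on that OSD
ceph daemon osd.4 dump_historic_ops      # recently completed slow ops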

I've tried restarting the mds daemon ( systemctl stop ceph-mds\*.service
ceph-mds.target &&  systemctl start ceph-mds\*.service ceph-mds.target )



IO to the file that was being accessed when the host crashed is blocked.


Suggestions?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Scuttlemonkey signing off...

2017-05-23 Thread Dan Mick
On 05/22/2017 07:36 AM, Patrick McGarry wrote:

> I'm writing to you today to share that my time in the Ceph community
> is coming to an end this year. 

You'll leave a big hole, Patrick.  It's been great having you along for
the ride.

-- 
Dan Mick
Red Hat, Inc.
Ceph docs: http://ceph.com/docs
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] How does rbd preserve the consistency of WRITE requests that span across multiple objects?

2017-05-23 Thread 许雪寒
Hi, thanks for the explanation:-)

On the other hand, I wonder if the following scenario could happen:

A program in a virtual machine that uses "libaio" to access a file continuously 
submits "write" requests to the underlying file system, which translates them 
into rbd requests. Say an rbd "aio_write" X wants to write to an area that spans 
objects A and B. According to my understanding of the rbd source code, librbd 
would split this write request into two rados ops, each corresponding to a single 
object. After these two rados ops have been sent to the OSDs and before they are 
finished, another rbd "aio_write" request Y, which wants to write to the same area 
as the previous one, arrives and is sent to the OSDs in the same way as X. Due to 
the possible reordering, it's possible that Y.B is done before X.B while Y.A is 
done after X.A, which could lead to an unexpected result.

Is this possible?


Date: Fri, 10 Mar 2017 19:27:00
From: Gregory Farnum
To: Wei Jin, ceph-users@lists.ceph.com, 许雪寒
Subject: Re: [ceph-users] How does ceph preserve read/write consistency?

On Thu, Mar 9, 2017 at 7:20 PM 许雪寒 wrote:

> Thanks for your reply.
>
> As the log shows, in our test, a READ that come after a WRITE did finished
> before that WRITE.


This is where you've gone astray. Any storage system is perfectly free to
reorder simultaneous requests -- defined as those whose submit-reply time
overlaps. So you submitted write W, then submitted read R, then got a
response to R before W. That's allowed, and preventing it is actually
impossible in general. In the specific case you've outlined, we *could* try
to prevent it, but doing so is pretty ludicrously expensive and, since the
"reorder" can happen anyway, doesn't provide any benefit.
So we don't try. :)

That said, obviously we *do* provide strict ordering across write
boundaries: a read submitted after a write completed will always see the
results of that write.
-Greg

And I read the source code, it seems that, for writes, in
> ReplicatedPG::do_op method, the thread in OSD_op_tp calls
> ReplicatedPG::get_rw_lock method which tries to get RWState::RWWRITE. If it
> fails, the op will be put into obc->rwstate.waiters queue and be requeued
> when repop finishes, however, the OSD_op_tp's thread doesn't wait for repop
> and tries to get the next OP. Can this be the cause?
>
> --
> From: Wei Jin [mailto:wjin...@gmail.com]
> Sent: 2017-03-09 21:52
> To: 许雪寒
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] How does ceph preserve read/write consistency?
>
> On Thu, Mar 9, 2017 at 1:45 PM, 许雪寒 wrote:
> > Hi, everyone.
>
> > As shown above, WRITE req with tid 1312595 arrived at 18:58:27.439107
> and READ req with tid 6476 arrived at 18:59:55.030936, however, the latter
> finished at 19:00:20:89 while the former finished commit at
> 19:00:20.335061 and filestore write at 19:00:25.202321. And in these logs,
> we found that between the start and finish of each req, there was a lot of
> "dequeue_op" of that req. We read the source code, it seems that this is
> due to "RWState", is that correct?
> >
> > And also, it seems that OSD won't distinguish reqs from different
> clients, so is it possible that io reqs from the same client also finish in
> a different order than that they were created in? Could this affect the
> read/write consistency? For instance, that a read can't acquire the data
> that were written by the same client just before it.
> >
>
> IMO, that doesn't make sense for rados to distinguish reqs from different
> clients.
> Clients or Users should do it by themselves.
>
> However, as for one specific client, ceph can and must guarantee the
> request order.
>
> 1) ceph messenger (network layer) has in_seq and out_seq when receiving
> and sending message
>
> 2) message will be dispatched or fast dispatched and then be queued in
> ShardedOpWq in order.
>
> If requests belong to different pgs, they may be processed concurrently,
> that's ok.
>
> If requests belong to the same pg, they will be queued in the same shard
> and will be processed in order due to pg lock (both read and write).
> For continuous write, op will be queued in ObjectStore in order due to pg
> lock and ObjectStore has OpSequence to guarantee the order when applying op
> to page cache, that's ok.
>
> With regard to  'read after write' to the same object, ceph must guarantee
> read can get the correct write content. That's done by
> ondisk_read/write_lock in ObjectContext.
>
>
> > We are testing hammer version, 0.94.5.  Please help us, thank you:-)
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

Re: [ceph-users] Large OSD omap directories (LevelDBs)

2017-05-23 Thread Gregory Farnum
On Tue, May 23, 2017 at 6:28 AM  wrote:

> Hi Wido,
>
> I see your point. I would expect OMAPs to grow with the number of objects
> but multiple OSDs getting to multiple tens of GBs for their omaps seems
> excessive. I find it difficult to believe that not sharding the index for a
> bucket of 500k objects in RGW causes the 10 largest OSD omaps to grow to a
> total 512GB which is about 2000 greater that than size of 10 average omaps.
> Given the relative usage of our pools and the much greater prominence of
> our non-RGW pools on the OSDs with huge omaps I'm not inclined to think
> this is caused by some RGW configuration (or lack thereof).
>
> It's also worth pointing out that we've seen problems with files being
> slow to retrieve (I'm talking about rados get doing 120MB/sec on one file
> and 2MB/sec on another) and subsequently the omap of the OSD hosting the
> first stripe of those growing from 30MB to 5GB in the span of an hour
> during which the logs are flooded with LevelDB compaction activity.
>

This does sound weird, but I also notice that in your earlier email you
seemed to have only ~5k PGs across  ~1400 OSDs, which is a pretty low
number. You may just have a truly horrible PG balance; can you share more
details (eg ceph osd df)?
-Greg


>
> Best regards,
>
> George
> 
> From: Wido den Hollander [w...@42on.com]
> Sent: 23 May 2017 14:00
> To: Vasilakakos, George (STFC,RAL,SC); ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Large OSD omap directories (LevelDBs)
>
> > Op 23 mei 2017 om 13:01 schreef george.vasilaka...@stfc.ac.uk:
> >
> >
> > > Your RGW buckets, how many objects in them, and do they have the index
> > > sharded?
> >
> > > I know we have some very large & old buckets (10M+ RGW objects in a
> > > single bucket), with correspondingly large OMAPs wherever that bucket
> > > index is living (sufficently large that trying to list the entire thing
> > > online is fruitless). ceph's pgmap status says we have 2G RADOS objects
> > > however, and you're only at 61M RADOS objects.
> >
> >
> > According to radosgw-admin bucket stats the most populous bucket
> contains 568101 objects. There is no index sharding. The
> default.rgw.buckets.data pool contains 4162566 objects, I think striping is
> done by default for 4MB sizes stripes.
> >
>
> Without index sharding 500k objects in a bucket can already cause larger
> OMAP directories. I'd recommend that you at least start to shard them.
>
> Wido
>
> > Bear in mind RGW is a small use case for us currently.
> > Most of the data lives in a pool that is accessed by specialized servers
> that have plugins based on libradosstriper. That pool stores around 1.8 PB
> in 32920055 objects.
> >
> > One thing of note is that we have this:
> > filestore_xattr_use_omap=1
> > in our ceph.conf and libradosstriper makes use of xattrs for striping
> metadata and locking mechanisms.
> >
> > This seems to have been removed some time ago but the question is could
> have any effect? This cluster was built in January and ran Jewel initially.
> >
> > I do see the xattrs in XFS but a sampling of an omap dir from an OSD
> showed like there might be some xattrs in there too.
> >
> > I'm going to try restarting an OSD with a big omap and also extracting a
> copy of one for further inspection.
> > It seems to me like they might not be cleaning up old data. I'm fairly
> certain an active cluster would've compacted enough for 3 month old SSTs to
> go away.
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Internals of RGW data store

2017-05-23 Thread Anton Dmitriev

Hi

Correct me if I am wrong: when a file is uploaded to RGW, it is split into 
stripe units, and these stripe units are mapped to RADOS objects. These RADOS 
objects are files on the OSD filestore.


What goes on under the hood when I delete an RGW object? If a RADOS object 
consists of multiple stripe units belonging to multiple RGW objects, then when 
I delete an RGW object some empty spaces must appear in the RADOS objects, as 
if they become "fragmented". Do I need to care about these empty spaces? Can 
this "fragmentation" lead to cluster performance degradation? Is there anything 
like a compact operation for RADOS objects?
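
For context, this is how I have been looking at the mapping so far (a sketch; the bucket, object and pool names are placeholders for my setup):

# show the RGW object's manifest, i.e. which RADOS tail objects it is striped over:
radosgw-admin object stat --bucket=mybucket --object=myfile

# list the backing RADOS objects in the data pool:
rados -p default.rgw.buckets.data ls | head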



--
Dmitriev Anton

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com