[ceph-users] rgw expiration problem, a bug ?

2019-01-17 Thread Will Zhao
Hi all:
I found that when I set a bucket expiration rule, any new object I
upload after the expiration date gets deleted. The related code looks
like the following:
if (prefix_iter->second.expiration_date != boost::none) {
  // we have checked it before
  is_expired = true;
} else {
  is_expired = obj_has_expired(obj_iter->meta.mtime,
                               prefix_iter->second.expiration);
}

Why should this first branch always be true? Should a newly uploaded
object really be deleted?
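
For context, the branch above applies to rules that carry an absolute
expiration date rather than a number of days. One possible reading (consistent
with S3 semantics, though not confirmed in this thread) is that once that date
has passed, every matching object is considered expired regardless of its
mtime, which would explain newly uploaded objects being deleted. A minimal
sketch of such a date-based rule, set with the AWS CLI against RGW (bucket
name, endpoint, and date are placeholders):

$ cat lc.json
{
  "Rules": [
    { "ID": "expire-everything",
      "Filter": { "Prefix": "" },
      "Status": "Enabled",
      "Expiration": { "Date": "2019-01-01T00:00:00Z" } }
  ]
}
$ aws --endpoint-url http://my-rgw.example.com s3api put-bucket-lifecycle-configuration \
    --bucket mybucket --lifecycle-configuration file://lc.json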
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Google Summer of Code / Outreachy Call for Projects

2019-01-17 Thread Mike Perez
Hello everyone,

We're getting ready for the next round of Google Summer of Code and Outreachy.

Ali Maredia and I will help organize the Ceph project for each program.

We are looking to have project ideas for Google Summer of Code by
February 4th, as our project application deadline is February 6th.

https://ceph.com/contribute/gsoc2019/
https://summerofcode.withgoogle.com/how-it-works/

From mid-February through the end of March, Outreachy applicants will
begin selecting open source projects to work on and making small
contributions. By April 12th we hope to have some projects set for
Outreachy so we can start selecting applicants. The internships will
begin at the end of May.

https://www.outreachy.org/mentor/

You can submit project ideas for both programs on this etherpad.
https://pad.ceph.com/p/project-ideas

Stay tuned for more updates.

--
Mike Perez (thingee)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cephalocon Barcelona 2019 CFP now open!

2019-01-17 Thread Mike Perez
Hello everyone,

Just a reminder that the deadline for the Cephalocon Barcelona 2019 CFP is
February 1 at 11:59 pm PST. Please get your proposed sessions in as soon
as possible for our selection committee to review. Thanks!

https://ceph.com/cephalocon/barcelona-2019/
https://linuxfoundation.smapply.io/prog/cephalocon_2019/
https://pad.ceph.com/p/cfp-coordination


--
Mike Perez (thingee)



On Mon, Dec 10, 2018 at 8:00 AM Mike Perez  wrote:
>
> Hello everyone!
>
> It gives me great pleasure to announce the CFP for Cephalocon Barcelona 2019 
> is now open [1]!
>
> Cephalocon Barcelona aims to bring together more than 800 technologists and 
> adopters from across the globe to showcase Ceph’s history and its future, 
> demonstrate real-world applications, and highlight vendor solutions. Join us 
> in Barcelona, Spain on 19-20 May 2019 for our second international conference 
> event.
>
> CFP closes Friday, February 1 at 11:59 pm PST.
>
> We will again have a selection committee that will look over the 
> presentations for quality content for the community. If you have any 
> questions, please let me know.
>
> [1] - https://ceph.com/cephalocon/barcelona-2019/
>
> --
>
> Mike Perez (thingee)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Radosgw cannot create pool

2019-01-17 Thread Jan Kasprzak
Hello, Ceph users,

TL;DR: radosgw fails on me with the following message:

2019-01-17 09:34:45.247721 7f52722b3dc0  0 rgw_init_ioctx ERROR: 
librados::Rados::pool_create returned (34) Numerical result out of range (this 
can be due to a pool or placement group misconfiguration, e.g. pg_num < pgp_num 
or mon_max_pg_per_osd exceeded)

Detailed description:

I have a Ceph cluster on CentOS 7 that was installed a long time ago as
firefly and is now running luminous. So far I have used it for RBD pools,
but now I want to try using radosgw as well.

I tried to deploy radosgw using

# ceph-deploy rgw create myhost

Which went well until it tried to start it up:

[myhost][INFO  ] Running command: service ceph-radosgw start
[myhost][WARNIN] Redirecting to /bin/systemctl start ceph-radosgw.service
[myhost][WARNIN] Failed to start ceph-radosgw.service: Unit not found.
[myhost][ERROR ] RuntimeError: command returned non-zero exit status: 5
[ceph_deploy.rgw][ERROR ] Failed to execute command: service ceph-radosgw start
[ceph_deploy][ERROR ] GenericError: Failed to create 1 RGWs

Comparing it to my test deployment of mimic, where radosgw works, the
problem was the unit name; the correct way to start it apparently was

# systemctl start ceph-radosgw@rgw.myhost.service

Now it is apparently running:

/usr/bin/radosgw -f --cluster ceph --name client.rgw.myhost --setuser ceph 
--setgroup ceph

However, when I want to add the first user, radosgw-admin fails and
radosgw itself exits with a similar message:

# radosgw-admin user create --uid=kas --display-name="Jan Kasprzak"
2019-01-17 09:52:29.805828 7fea6cfd2dc0  0 rgw_init_ioctx ERROR: 
librados::Rados::pool_create returned (34) Numerical result out of range (this 
can be due to a pool or placement group misconfiguration, e.g. pg_num < pgp_num 
or mon_max_pg_per_osd exceeded)
2019-01-17 09:52:29.805957 7fea6cfd2dc0 -1 ERROR: failed to initialize watch: 
(34) Numerical result out of range
couldn't init storage provider

So I guess it is trying to create a pool for data, but it fails somehow.
Can I determine which pool it is and what parameters it tries to use?

I have looked at my testing mimic cluster, and radosgw there created the
following pools:

.rgw.root
default.rgw.control
default.rgw.meta
default.rgw.log
default.rgw.buckets.index
default.rgw.buckets.data

So I created these pools manually on my luminous cluster as well:

# ceph osd pool create .rgw.root 128
(repeat for all the above pool names)

Which helped, and I am able to create the user with radosgw-admin.
Now where should I look for the exact parameters radosgw is trying
to use when creating its pools?
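
For reference, a hedged sketch of where one might look (not verified against
what rgw actually passes when it creates its pools): the defaults and the
limit that the error message refers to can be inspected on a monitor host,
e.g.:

# ceph daemon mon.$(hostname -s) config get osd_pool_default_pg_num
# ceph daemon mon.$(hostname -s) config get osd_pool_default_pgp_num
# ceph daemon mon.$(hostname -s) config get mon_max_pg_per_osd
# ceph osd df                  (PGS column: current PGs per OSD)
# ceph osd pool ls detail      (parameters of the pools that do exist)

(The mon id here is assumed to be the short hostname; adjust as needed.)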

Thanks,

-Yenya

-- 
| Jan "Yenya" Kasprzak  |
| http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 |
 This is the world we live in: the way to deal with computers is to google
 the symptoms, and hope that you don't have to watch a video. --P. Zaitcev
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Multi-filesystem wthin a cluster

2019-01-17 Thread Dan van der Ster
On Wed, Jan 16, 2019 at 11:17 PM Patrick Donnelly  wrote:
>
> On Wed, Jan 16, 2019 at 1:21 AM Marvin Zhang  wrote:
> > Hi CephFS experts,
> > From the documentation, I know multi-fs within a cluster is still an experimental feature.
> > 1. Is there any estimation about stability and performance for this feature?
>
> Remaining blockers [1] need to be completed. No developer has yet taken on
> this task. Perhaps by the O release.
>
> > 2. It seems that each FS will consume at least 1 active MDS and
> > different FSes can't share an MDS. Suppose I want to create 10 FSes; I
> > need at least 10 MDS daemons. Is that right? Is there any limit on the
> > number of MDS daemons within a cluster?
>
> No limit on number of MDS but there is a limit on the number of
> actives (multimds).

TIL...
What is the max number of actives in a single FS?

Cheers, Dan

> In the not-too-distant future, container
> orchestration platforms (e.g. Rook) underneath Ceph would provide a
> way to dynamically spin up new MDSs in response to the creation of a
> file system.
>
> [1] http://tracker.ceph.com/issues/22477
>
> --
> Patrick Donnelly
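
For reference, a minimal sketch of the commands involved in running multiple
filesystems and multiple active MDS daemons (pool names, fs names, PG counts,
and the max_mds value are placeholders; this does not answer what the
supported maximum of actives is):

# ceph fs flag set enable_multiple true --yes-i-really-mean-it
# ceph osd pool create cephfs2_metadata 64
# ceph osd pool create cephfs2_data 64
# ceph fs new cephfs2 cephfs2_metadata cephfs2_data
# ceph fs set cephfs max_mds 2     (active ranks for an existing fs; on
                                    luminous you may first need
                                    "ceph fs set cephfs allow_multimds true")
# ceph fs status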
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] pgs stuck in creating+peering state

2019-01-17 Thread Johan Thomsen
Hi,

I have a sad ceph cluster.
All my OSDs complain about failed heartbeat replies, like so:

osd.10 635 heartbeat_check: no reply from 192.168.160.237:6810 osd.42
ever on either front or back, first ping sent 2019-01-16
22:26:07.724336 (cutoff 2019-01-16 22:26:08.225353)

.. I've checked the network sanity all I can, and all ceph ports are
open between nodes both on the public network and the cluster network,
and I have no problems sending traffic back and forth between nodes.
I've tried tcpdump'ing and traffic is passing in both directions
between the nodes, but unfortunately I don't natively speak the ceph
protocol, so I can't figure out what's going wrong in the heartbeat
conversation.

Still:

# ceph health detail

HEALTH_WARN nodown,noout flag(s) set; Reduced data availability: 1072
pgs inactive, 1072 pgs peering
OSDMAP_FLAGS nodown,noout flag(s) set
PG_AVAILABILITY Reduced data availability: 1072 pgs inactive, 1072 pgs peering
pg 7.3cd is stuck inactive for 245901.560813, current state
creating+peering, last acting [13,41,1]
pg 7.3ce is stuck peering for 245901.560813, current state
creating+peering, last acting [1,40,7]
pg 7.3cf is stuck peering for 245901.560813, current state
creating+peering, last acting [0,42,9]
pg 7.3d0 is stuck peering for 245901.560813, current state
creating+peering, last acting [20,8,38]
pg 7.3d1 is stuck peering for 245901.560813, current state
creating+peering, last acting [10,20,42]
   (...)


I've set "noout" and "nodown" to prevent all osd's from being removed
from the cluster. They are all running and marked as "up".

# ceph osd tree

ID  CLASS WEIGHTTYPE NAME  STATUS REWEIGHT PRI-AFF
 -1   249.73434 root default
-25   166.48956 datacenter m1
-2483.24478 pod kube1
-3541.62239 rack 10
-3441.62239 host ceph-sto-p102
 40   hdd   7.27689 osd.40 up  1.0 1.0
 41   hdd   7.27689 osd.41 up  1.0 1.0
 42   hdd   7.27689 osd.42 up  1.0 1.0
   (...)

I'm at a point where I don't know which options or what logs to check anymore.

Any debug hint would be very much appreciated.

btw. I have no important data in the cluster (yet), so if the solution
is to drop all osd and recreate them, it's ok for now. But I'd really
like to know how the cluster ended in this state.

/Johan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] pgs stuck in creating+peering state

2019-01-17 Thread Kevin Olbrich
Are you sure no service like firewalld is running?
Did you check that all machines have the same MTU and jumbo frames are
enabled if needed?

I had this problem when I first started with ceph and forgot to
disable firewalld.
Replication worked perfectly fine but the OSD was kicked out every few seconds.

Kevin

On Thu, 17 Jan 2019 at 11:57, Johan Thomsen wrote:
>
> Hi,
>
> I have a sad ceph cluster.
> All my osds complain about failed reply on heartbeat, like so:
>
> osd.10 635 heartbeat_check: no reply from 192.168.160.237:6810 osd.42
> ever on either front or back, first ping sent 2019-01-16
> 22:26:07.724336 (cutoff 2019-01-16 22:26:08.225353)
>
> .. I've checked the network sanity all I can, and all ceph ports are
> open between nodes both on the public network and the cluster network,
> and I have no problems sending traffic back and forth between nodes.
> I've tried tcpdump'ing and traffic is passing in both directions
> between the nodes, but unfortunately I don't natively speak the ceph
> protocol, so I can't figure out what's going wrong in the heartbeat
> conversation.
>
> Still:
>
> # ceph health detail
>
> HEALTH_WARN nodown,noout flag(s) set; Reduced data availability: 1072
> pgs inactive, 1072 pgs peering
> OSDMAP_FLAGS nodown,noout flag(s) set
> PG_AVAILABILITY Reduced data availability: 1072 pgs inactive, 1072 pgs peering
> pg 7.3cd is stuck inactive for 245901.560813, current state
> creating+peering, last acting [13,41,1]
> pg 7.3ce is stuck peering for 245901.560813, current state
> creating+peering, last acting [1,40,7]
> pg 7.3cf is stuck peering for 245901.560813, current state
> creating+peering, last acting [0,42,9]
> pg 7.3d0 is stuck peering for 245901.560813, current state
> creating+peering, last acting [20,8,38]
> pg 7.3d1 is stuck peering for 245901.560813, current state
> creating+peering, last acting [10,20,42]
>()
>
>
> I've set "noout" and "nodown" to prevent all osd's from being removed
> from the cluster. They are all running and marked as "up".
>
> # ceph osd tree
>
> ID  CLASS WEIGHTTYPE NAME  STATUS REWEIGHT PRI-AFF
>  -1   249.73434 root default
> -25   166.48956 datacenter m1
> -2483.24478 pod kube1
> -3541.62239 rack 10
> -3441.62239 host ceph-sto-p102
>  40   hdd   7.27689 osd.40 up  1.0 1.0
>  41   hdd   7.27689 osd.41 up  1.0 1.0
>  42   hdd   7.27689 osd.42 up  1.0 1.0
>()
>
> I'm at a point where I don't know which options and what logs to check 
> anymore?
>
> Any debug hint would be very much appreciated.
>
> btw. I have no important data in the cluster (yet), so if the solution
> is to drop all osd and recreate them, it's ok for now. But I'd really
> like to know how the cluster ended in this state.
>
> /Johan
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] monitor cephfs mount io's

2019-01-17 Thread Marc Roos



How / where can I monitor the I/Os on a cephfs mount / client?

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] How to do multiple cephfs mounts.

2019-01-17 Thread Marc Roos



Should I not be able to increase the I/Os by splitting the data writes 
over e.g. 2 cephfs mounts? I am still getting similar overall 
performance. Is it even possible to increase performance by using 
multiple mounts?

Using 2 kernel mounts on CentOS 7.6




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] pgs stuck in creating+peering state

2019-01-17 Thread Johan Thomsen
Thanks you for responding!

First thing: I disabled the firewall on all the nodes.
More specifically not firewalld, but the NixOS firewall, since I run NixOS.
I can netcat both udp and tcp traffic on all ports between all nodes
without problems.

Next, I tried raising the mtu to 9000 on the nics where the cluster
network is connected - although I don't see why the mtu should affect
the heartbeat?
I have two bonded nics connected to the cluster network (mtu 9000) and
two separate bonded nics hooked on the public network (mtu 1500).
I've tested traffic and routing on both pairs of nics and traffic gets
through without issues, apparently.


None of the above solved the problem :-(


On Thu, 17 Jan 2019 at 12:01, Kevin Olbrich wrote:
>
> Are you sure, no service like firewalld is running?
> Did you check that all machines have the same MTU and jumbo frames are
> enabled if needed?
>
> I had this problem when I first started with ceph and forgot to
> disable firewalld.
> Replication worked perfectly fine but the OSD was kicked out every few 
> seconds.
>
> Kevin
>
> On Thu, 17 Jan 2019 at 11:57, Johan Thomsen wrote:
> >
> > Hi,
> >
> > I have a sad ceph cluster.
> > All my osds complain about failed reply on heartbeat, like so:
> >
> > osd.10 635 heartbeat_check: no reply from 192.168.160.237:6810 osd.42
> > ever on either front or back, first ping sent 2019-01-16
> > 22:26:07.724336 (cutoff 2019-01-16 22:26:08.225353)
> >
> > .. I've checked the network sanity all I can, and all ceph ports are
> > open between nodes both on the public network and the cluster network,
> > and I have no problems sending traffic back and forth between nodes.
> > I've tried tcpdump'ing and traffic is passing in both directions
> > between the nodes, but unfortunately I don't natively speak the ceph
> > protocol, so I can't figure out what's going wrong in the heartbeat
> > conversation.
> >
> > Still:
> >
> > # ceph health detail
> >
> > HEALTH_WARN nodown,noout flag(s) set; Reduced data availability: 1072
> > pgs inactive, 1072 pgs peering
> > OSDMAP_FLAGS nodown,noout flag(s) set
> > PG_AVAILABILITY Reduced data availability: 1072 pgs inactive, 1072 pgs 
> > peering
> > pg 7.3cd is stuck inactive for 245901.560813, current state
> > creating+peering, last acting [13,41,1]
> > pg 7.3ce is stuck peering for 245901.560813, current state
> > creating+peering, last acting [1,40,7]
> > pg 7.3cf is stuck peering for 245901.560813, current state
> > creating+peering, last acting [0,42,9]
> > pg 7.3d0 is stuck peering for 245901.560813, current state
> > creating+peering, last acting [20,8,38]
> > pg 7.3d1 is stuck peering for 245901.560813, current state
> > creating+peering, last acting [10,20,42]
> >()
> >
> >
> > I've set "noout" and "nodown" to prevent all osd's from being removed
> > from the cluster. They are all running and marked as "up".
> >
> > # ceph osd tree
> >
> > ID  CLASS WEIGHTTYPE NAME  STATUS REWEIGHT 
> > PRI-AFF
> >  -1   249.73434 root default
> > -25   166.48956 datacenter m1
> > -2483.24478 pod kube1
> > -3541.62239 rack 10
> > -3441.62239 host ceph-sto-p102
> >  40   hdd   7.27689 osd.40 up  1.0 
> > 1.0
> >  41   hdd   7.27689 osd.41 up  1.0 
> > 1.0
> >  42   hdd   7.27689 osd.42 up  1.0 
> > 1.0
> >()
> >
> > I'm at a point where I don't know which options and what logs to check 
> > anymore?
> >
> > Any debug hint would be very much appreciated.
> >
> > btw. I have no important data in the cluster (yet), so if the solution
> > is to drop all osd and recreate them, it's ok for now. But I'd really
> > like to know how the cluster ended in this state.
> >
> > /Johan
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Turning RGW data pool into an EC pool

2019-01-17 Thread Hayashida, Mami
I would like to know the simplest and surest way to set up an RGW instance
with an EC pool for storing a large quantity of data.

1. I am currently trying to do this on a cluster that is not yet open to
users. (i.e. I can mess around with it and, in the worst case, start all
over.)

2. I deployed RGW instances using `ceph-deploy rgw create `, which
automatically created the following pools (the small amount of data in
these pools is from a few tests I have run):
POOLS:
NAME                       ID USED    %USED MAX AVAIL OBJECTS
.rgw.root                  1  1.1 KiB 0     412 TiB   4
default.rgw.control        2  0 B     0     412 TiB   8
default.rgw.meta           3  1.2 KiB 0     412 TiB   7
default.rgw.log            4  0 B     0     412 TiB   207
default.rgw.buckets.index  5  0 B     0     412 TiB   1
default.rgw.buckets.non-ec 6  0 B     0     412 TiB   6
default.rgw.buckets.data   7  33 GiB  0     412 TiB   13564

3. These pools, by default, are all replicated (factor of 3).

4. My understanding is that the only good candidate RGW pool for EC is the
.rgw.buckets.data pool, is that correct?

5. Now, given that these pools have already been created, is my only option
to create a new EC-ed pool, migrate the data from the old one, and do some
re-naming of the pools so that at the end, the new pool will be
default.rgw.buckets.data  (as outlined in
https://ceph.com/geen-categorie/ceph-pool-migration/)?  If so, do I have to
worry about the last line of that document?: "But it does not work in all
cases."  (whatever that means)

6. Do I need to manually edit the CRUSH map in addition to step 5?  And/or
anything else I need to change or be aware of?

7. Is there an easier way to do this (i.e. create an EC-ed pool for RGW at
the point of RGW initialization rather than migrating data from an old pool
to a new one and renaming the new one to the old one AFTER the initial RGW
pools are created) if I had not yet installed RGW? I am asking this for my
future reference.
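
For reference (question 7), a minimal sketch of pre-creating the data pool as
an EC pool before RGW creates it, so no migration is needed. The profile
parameters and PG counts are placeholders, and zone placement settings may
also need review and are not covered here:

# ceph osd erasure-code-profile set rgw-ec-profile k=4 m=2 crush-failure-domain=host
# ceph osd pool create default.rgw.buckets.data 512 512 erasure rgw-ec-profile
# ceph osd pool application enable default.rgw.buckets.data rgw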


*Mami Hayashida*
*Research Computing Associate*
Research Computing Infrastructure
University of Kentucky Information Technology Services
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] monitor cephfs mount io's

2019-01-17 Thread Mohamad Gebai
You can do that either straight from your client, or by querying the
perf dump if you're using ceph-fuse.
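
For reference, a hedged sketch of both approaches. The admin socket path is an
assumption (check /var/run/ceph/ on the client), and the pool name is a
placeholder:

# ceph daemon /var/run/ceph/ceph-client.*.asok perf dump
# ceph daemonperf /var/run/ceph/ceph-client.*.asok   (live, top-like view)
# ceph osd pool stats cephfs_data    (cluster-side per-pool client I/O rates,
                                      works for kernel mounts too)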

Mohamad

On 1/17/19 6:19 AM, Marc Roos wrote:
>
> How / where can I monitor the ios on cephfs mount / client?
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Rebuilding RGW bucket indices from objects

2019-01-17 Thread Bryan Stillwell
This is sort of related to my email yesterday, but has anyone ever rebuilt a 
bucket index using the objects themselves?

It seems to me that it would be possible, since the bucket_id is contained 
within the rados object name:

# rados -p .rgw.buckets.index listomapkeys .dir.default.56630221.139618
error getting omap key set .rgw.buckets.index/.dir.default.56630221.139618: (2) 
No such file or directory
# rados -p .rgw.buckets ls | grep default.56630221.139618
default.56630221.139618__shadow_.IxIe8byqV61eu6g7gSVXBpHfrB3BlC4_1
default.56630221.139618_backup.20181214
default.56630221.139618_backup.20181220
default.56630221.139618__shadow_.GQcmQKfbBkb9WEF1X-6qGBEVfppGKEJ_1
...[ many more snipped ]...

Thanks,
Bryan
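
For reference, a hedged sketch of the radosgw-admin commands usually suggested
for checking or repairing an index (the bucket name is a placeholder, the
instance id is taken from the listing above, and whether these help when the
index object itself is gone is not verified here):

# radosgw-admin bucket stats --bucket=mybucket
# radosgw-admin metadata get bucket.instance:mybucket:default.56630221.139618
# radosgw-admin bucket check --bucket=mybucket
# radosgw-admin bucket check --fix --check-objects --bucket=mybucket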



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore device’s device selector for Samsung NVMe

2019-01-17 Thread kefu chai
On Mon, Jan 14, 2019 at 8:52 PM Yanko Davila  wrote:
>
>
> Hello
>
> My name is Yanko Davila. I'm new to ceph, so please pardon my ignorance. I 
> have a question about Bluestore and SPDK.
>
> I'm currently running ceph version:
>
> ceph version 12.2.10 (177915764b752804194937482a39e95e0ca3de94) luminous 
> (stable)
>
> on Debian:
>
> Linux  4.9.0-8-amd64 #1 SMP Debian 4.9.110-3+deb9u4 (2018-08-21) 
> x86_64 GNU/Linux
>
> Distributor ID:Debian
> Description:Debian GNU/Linux 9.5 (stretch)
> Release:9.5
> Codename:stretch
>
> I'm trying to add an NVMe osd using bluestore, but I'm struggling to find the 
> device selector for that NVMe. So far I've been able to compile spdk and 
> successfully run the setup.sh script. I can also successfully run the identify 
> example, which leads me to think that spdk is working as expected.
>
> When I read the online manual ( 
> http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#spdk-usage
>  ) it gives an example for an Intel PCIe SSD:

Yanko, I think the doc only applies to the master branch. If you are
testing luminous, you might need to use the serial number to identify
an NVMe device instead.
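
For reference, a hedged sketch of what that might look like on luminous (the
serial value is a placeholder, and the exact serial format expected after
"spdk:" is not verified here):

# nvme list                                     (SN column; requires nvme-cli;
                                                 run before spdk's setup.sh
                                                 rebinds the device)
# lspci -vvv | grep -i "Device Serial Number"   (only shown if the PCIe DSN
                                                 capability is exposed)

Then, in the OSD section of ceph.conf:

bluestore_block_path = spdk:<serial-number>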

>
> 
>
> For example, users can find the device selector of an Intel PCIe SSD with:
>
> $ lspci -mm -n -D -d 8086:0953
>
> 
>
> When I try the same command, adjusting for my Samsung SSD, it returns nothing, 
> or the return is just blank. Here is what I tried:
>
> $ lspci -mm -n -D -d 144d:a801
>
>
> Assuming that I gave you enough information, can anyone spot what I'm doing 
> wrong? Does spdk only work on Intel SSDs? Any comment is highly 
> appreciated. Thank you for your time.
>
> Yanko.
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Regards
Kefu Chai
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] pgs stuck in creating+peering state

2019-01-17 Thread Bryan Stillwell
Since you're using jumbo frames, make sure everything between the nodes 
properly supports them (nics & switches).  I've tested this in the past by 
using the size option in ping (you need to use  a payload size of 8972 instead 
of 9000 to account for the 28 byte header):

ping -s 8972 192.168.160.237

If that works, then you'll need to pull out tcpdump/wireshark to determine why 
the packets aren't able to return.
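
For example, a capture filter along these lines (the interface name is a
placeholder; the address and port are taken from the heartbeat log line):

# tcpdump -ni eth0 host 192.168.160.237 and port 6810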

Bryan

From: ceph-users  on behalf of Johan Thomsen 

Date: Thursday, January 17, 2019 at 5:42 AM
To: Kevin Olbrich 
Cc: ceph-users 
Subject: Re: [ceph-users] pgs stuck in creating+peering state

Thanks you for responding!

First thing: I disabled the firewall on all the nodes.
More specifically not firewalld, but the NixOS firewall, since I run NixOS.
I can netcat both udp and tcp traffic on all ports between all nodes
without problems.

Next, I tried raising the mtu to 9000 on the nics where the cluster
network is connected - although I don't see why the mtu should affect
the heartbeat?
I have two bonded nics connected to the cluster network (mtu 9000) and
two separate bonded nics hooked on the public network (mtu 1500).
I've tested traffic and routing on both pairs of nics and traffic gets
through without issues, apparently.


None of the above solved the problem :-(


On Thu, 17 Jan 2019 at 12:01, Kevin Olbrich <k...@sv01.de> wrote:

Are you sure, no service like firewalld is running?
Did you check that all machines have the same MTU and jumbo frames are
enabled if needed?

I had this problem when I first started with ceph and forgot to
disable firewalld.
Replication worked perfectly fine but the OSD was kicked out every few seconds.

Kevin

On Thu, 17 Jan 2019 at 11:57, Johan Thomsen <wr...@ownrisk.dk> wrote:
>
> Hi,
>
> I have a sad ceph cluster.
> All my osds complain about failed reply on heartbeat, like so:
>
> osd.10 635 heartbeat_check: no reply from 192.168.160.237:6810 osd.42
> ever on either front or back, first ping sent 2019-01-16
> 22:26:07.724336 (cutoff 2019-01-16 22:26:08.225353)
>
> .. I've checked the network sanity all I can, and all ceph ports are
> open between nodes both on the public network and the cluster network,
> and I have no problems sending traffic back and forth between nodes.
> I've tried tcpdump'ing and traffic is passing in both directions
> between the nodes, but unfortunately I don't natively speak the ceph
> protocol, so I can't figure out what's going wrong in the heartbeat
> conversation.
>
> Still:
>
> # ceph health detail
>
> HEALTH_WARN nodown,noout flag(s) set; Reduced data availability: 1072
> pgs inactive, 1072 pgs peering
> OSDMAP_FLAGS nodown,noout flag(s) set
> PG_AVAILABILITY Reduced data availability: 1072 pgs inactive, 1072 pgs peering
> pg 7.3cd is stuck inactive for 245901.560813, current state
> creating+peering, last acting [13,41,1]
> pg 7.3ce is stuck peering for 245901.560813, current state
> creating+peering, last acting [1,40,7]
> pg 7.3cf is stuck peering for 245901.560813, current state
> creating+peering, last acting [0,42,9]
> pg 7.3d0 is stuck peering for 245901.560813, current state
> creating+peering, last acting [20,8,38]
> pg 7.3d1 is stuck peering for 245901.560813, current state
> creating+peering, last acting [10,20,42]
>()
>
>
> I've set "noout" and "nodown" to prevent all osd's from being removed
> from the cluster. They are all running and marked as "up".
>
> # ceph osd tree
>
> ID  CLASS WEIGHTTYPE NAME  STATUS REWEIGHT PRI-AFF
>  -1   249.73434 root default
> -25   166.48956 datacenter m1
> -2483.24478 pod kube1
> -3541.62239 rack 10
> -3441.62239 host ceph-sto-p102
>  40   hdd   7.27689 osd.40 up  1.0 1.0
>  41   hdd   7.27689 osd.41 up  1.0 1.0
>  42   hdd   7.27689 osd.42 up  1.0 1.0
>()
>
> I'm at a point where I don't know which options and what logs to check 
> anymore?
>
> Any debug hint would be very much appreciated.
>
> btw. I have no important data in the cluster (yet), so if the solution
> is to drop all osd and recreate them, it's ok for now. But I'd really
> like to know how the cluster ended in this state.
>
> /Johan
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] How many rgw buckets is too many?

2019-01-17 Thread Matthew Vernon

Hi,

The default limit for buckets per user in ceph is 1000, but it is 
adjustable via radosgw-admin user modify --max-buckets


One of our users is asking for a significant increase (they're mooting 
100,000), and I worry about the impact on RGW performance since, I 
think, there's only one object that stores the bucket identifiers.


Has anyone here got experience of rgw with very large numbers of 
buckets? FWIW we're running Jewel with a Luminous upgrade planned for 
Quite Soon...


Regards,

Matthew
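
For reference, the commands involved would presumably be along these lines
(the uid is a placeholder):

# radosgw-admin user modify --uid=someuser --max-buckets=100000
# radosgw-admin user info --uid=someuser     (shows the current max_buckets)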


--
The Wellcome Sanger Institute is operated by Genome Research 
Limited, a charity registered in England with number 1021457 and a 
company registered in England with number 2742969, whose registered 
office is 215 Euston Road, London, NW1 2BE. 
___

ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore SPDK OSD

2019-01-17 Thread kefu chai
On Tue, Jan 15, 2019 at 4:42 AM Yanko Davila  wrote:
>
> Hello
>
>  I was able to find the device selector. Now I have an issue understanding 
> the steps to activate the osd. Once I set up spdk, the device disappears from 
> lsblk as expected, so the ceph manual is not very helpful after spdk is 
> enabled. Is there any manual that walks you through the steps to add an spdk 
> nvme to ceph? Thanks again for your time.

Yanko, please note that we use different ways to designate NVMe devices in
master (nautilus) and pre-nautilus. If you are using nautilus,
you could probably take a look at
https://github.com/tone-zhang/ceph/blob/71639aee4d40f16d5cdb9abb6f241db206d2/src/vstart.sh
(search "spdk" for the related settings and steps) and
https://github.com/tone-zhang/ceph/blob/78012eed1bf125ec37afa7605fa64cc798036e18/doc/rados/configuration/bluestore-config-ref.rst
for the steps.

As these changes were merged after we branched luminous and mimic, they
are not in either of those branches. And since we are using the
PCI selector in master, the vstart.sh script and document were updated
accordingly; that's why you need to consult an older version of them.

Please note that the SPDK-backed bluestore is not tested in our release
cycles like the filestore and aio-backed bluestore are, so it's not
that well supported.

>
> Yanko.___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Regards
Kefu Chai
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore SPDK OSD

2019-01-17 Thread kefu chai
On Fri, Jan 18, 2019 at 12:53 AM kefu chai  wrote:
>
> On Tue, Jan 15, 2019 at 4:42 AM Yanko Davila  wrote:
> >
> > Hello
> >
> >  I was able to find the device selector. Now I have an issue understanding 
> > the steps to activate the osd. Once I setup spdk the device disappears from 
> > lsblk as expected. So the ceph manual is not very helpful after spdk is 
> > enabled. Is there any manual that walks you through the steps to add an 
> > spdk nvme to ceph ?? Thanks again for your time.
>
> Yanko, please note, we use different ways to designate NVMe devices in
> master (nautilus) and pre-nautilus . if you are using nautilus,
sorry, should be "if you are using luminous or mimic"

> probably you could take a look at
> https://github.com/tone-zhang/ceph/blob/71639aee4d40f16d5cdb9abb6f241db206d2/src/vstart.sh
> , search "spdk" for related setting and steps, and
> https://github.com/tone-zhang/ceph/blob/78012eed1bf125ec37afa7605fa64cc798036e18/doc/rados/configuration/bluestore-config-ref.rst
> for steps.
>
> as these changes were merged after we branch luminous and mimic, so
> they are not in either of these branches. and since we are using the
> PCI selector in master, the vstart.sh script and document were updated
> accordingly. that's why you need to consult an older version of them.
>
> please note, the SPDK backed bluestore is not tested in our release
> cycles like the filestore and aio backed bluestore are, so it's not
> that well supported.
>
> >
> > Yanko.___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> --
> Regards
> Kefu Chai



-- 
Regards
Kefu Chai
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] pgs stuck in creating+peering state

2019-01-17 Thread Vasu Kulkarni
On Thu, Jan 17, 2019 at 4:42 AM Johan Thomsen  wrote:

> Thanks you for responding!
>
> First thing: I disabled the firewall on all the nodes.
> More specifically not firewalld, but the NixOS firewall, since I run NixOS.
> I can netcat both udp and tcp traffic on all ports between all nodes
> without problems.
>
> Next, I tried raising the mtu to 9000 on the nics where the cluster
> network is connected - although I don't see why the mtu should affect
> the heartbeat?
> I have two bonded nics connected to the cluster network (mtu 9000) and
> two separate bonded nics hooked on the public network (mtu 1500).
> I've tested traffic and routing on both pairs of nics and traffic gets
> through without issues, apparently.
>
Try 'osd heartbeat min size = 100' in ceph.conf on all OSD nodes and
restart; we have seen this in some network
configurations with an MTU size mismatch between ports.
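
For reference, a hedged sketch of applying and verifying that setting (osd.40
is taken from the osd tree above; whether a runtime injection is sufficient
for heartbeat packet sizing is not verified, so restarting as suggested is
safer). In ceph.conf on the OSD nodes:

[osd]
osd heartbeat min size = 100

Or applied at runtime (not persistent across restarts), then verified:

# ceph tell osd.* injectargs '--osd_heartbeat_min_size=100'
# ceph daemon osd.40 config get osd_heartbeat_min_size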

>
>
> None of the above solved the problem :-(
>
>
> On Thu, 17 Jan 2019 at 12:01, Kevin Olbrich wrote:
> >
> > Are you sure, no service like firewalld is running?
> > Did you check that all machines have the same MTU and jumbo frames are
> > enabled if needed?
> >
> > I had this problem when I first started with ceph and forgot to
> > disable firewalld.
> > Replication worked perfectly fine but the OSD was kicked out every few
> seconds.
> >
> > Kevin
> >
> > On Thu, 17 Jan 2019 at 11:57, Johan Thomsen <wr...@ownrisk.dk> wrote:
> > >
> > > Hi,
> > >
> > > I have a sad ceph cluster.
> > > All my osds complain about failed reply on heartbeat, like so:
> > >
> > > osd.10 635 heartbeat_check: no reply from 192.168.160.237:6810 osd.42
> > > ever on either front or back, first ping sent 2019-01-16
> > > 22:26:07.724336 (cutoff 2019-01-16 22:26:08.225353)
> > >
> > > .. I've checked the network sanity all I can, and all ceph ports are
> > > open between nodes both on the public network and the cluster network,
> > > and I have no problems sending traffic back and forth between nodes.
> > > I've tried tcpdump'ing and traffic is passing in both directions
> > > between the nodes, but unfortunately I don't natively speak the ceph
> > > protocol, so I can't figure out what's going wrong in the heartbeat
> > > conversation.
> > >
> > > Still:
> > >
> > > # ceph health detail
> > >
> > > HEALTH_WARN nodown,noout flag(s) set; Reduced data availability: 1072
> > > pgs inactive, 1072 pgs peering
> > > OSDMAP_FLAGS nodown,noout flag(s) set
> > > PG_AVAILABILITY Reduced data availability: 1072 pgs inactive, 1072 pgs
> peering
> > > pg 7.3cd is stuck inactive for 245901.560813, current state
> > > creating+peering, last acting [13,41,1]
> > > pg 7.3ce is stuck peering for 245901.560813, current state
> > > creating+peering, last acting [1,40,7]
> > > pg 7.3cf is stuck peering for 245901.560813, current state
> > > creating+peering, last acting [0,42,9]
> > > pg 7.3d0 is stuck peering for 245901.560813, current state
> > > creating+peering, last acting [20,8,38]
> > > pg 7.3d1 is stuck peering for 245901.560813, current state
> > > creating+peering, last acting [10,20,42]
> > >()
> > >
> > >
> > > I've set "noout" and "nodown" to prevent all osd's from being removed
> > > from the cluster. They are all running and marked as "up".
> > >
> > > # ceph osd tree
> > >
> > > ID  CLASS WEIGHTTYPE NAME  STATUS REWEIGHT
> PRI-AFF
> > >  -1   249.73434 root default
> > > -25   166.48956 datacenter m1
> > > -2483.24478 pod kube1
> > > -3541.62239 rack 10
> > > -3441.62239 host ceph-sto-p102
> > >  40   hdd   7.27689 osd.40 up  1.0
> 1.0
> > >  41   hdd   7.27689 osd.41 up  1.0
> 1.0
> > >  42   hdd   7.27689 osd.42 up  1.0
> 1.0
> > >()
> > >
> > > I'm at a point where I don't know which options and what logs to check
> anymore?
> > >
> > > Any debug hint would be very much appreciated.
> > >
> > > btw. I have no important data in the cluster (yet), so if the solution
> > > is to drop all osd and recreate them, it's ok for now. But I'd really
> > > like to know how the cluster ended in this state.
> > >
> > > /Johan
> > > ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] read-only mounts of RBD images on multiple nodes for parallel reads

2019-01-17 Thread Void Star Nill
Hi,

We are trying to use Ceph in our products to address some of our use cases.
We think the Ceph block device is a good fit for us. One of the use cases is
that we have a number of jobs running in containers that need to have
read-only access to shared data. The data is written once and is consumed
multiple times. I have read through some of the similar discussions and the
recommendations on using CephFS for these situations, but in our case the
block device makes more sense as it fits well with other use cases and
restrictions we have around this use case.

The following scenario seems to work as expected when we tried on a test
cluster, but we wanted to get an expert opinion to see if there would be
any issues in production. The usage scenario is as follows:

- A block device is created with "--image-shared" options:

rbd create mypool/foo --size 4G --image-shared


- The image is mapped to a host, formatted with ext4 (or another file
system), mounted to a directory in read/write mode, and data is written to
it. Please note that the image will be mapped in exclusive write mode -- no
other read/write mounts are allowed at this time.

- The volume is unmapped from the host and then mapped onto N other hosts,
where it will be mounted in read-only mode and the data is read
simultaneously by N readers.
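
For reference, a minimal sketch of this scenario (pool and image names follow
the example above; the mount point is a placeholder, and the /dev/rbd0 device
name is an assumption, check "rbd showmapped"). On the writer host:

# rbd map mypool/foo
# mkfs.ext4 /dev/rbd0
# mount /dev/rbd0 /mnt/foo      (write the data, then unmount and unmap)
# umount /mnt/foo
# rbd unmap mypool/foo

On each of the N reader hosts:

# rbd map mypool/foo --read-only
# mount -o ro,noload /dev/rbd0 /mnt/foo   ("noload" skips ext4 journal replay
                                           on the read-only device)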

As mentioned above, this seems to work as expected, but we wanted to
confirm that we won't run into any unexpected issues.

Appreciate any inputs on this.

Thanks,
Shridhar
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] read-only mounts of RBD images on multiple nodes for parallel reads

2019-01-17 Thread Oliver Freyermuth
Hi,

First off: I'm probably not the expert you are waiting for, but we are using 
CephFS for HPC / HTC (storing data files), and make use of containers for all 
jobs (up to ~2000 running in parallel). 
We also use RBD, but for our virtualization infrastructure. 

While I'm always one of the first to recommend CephFS / RBD, I personally think 
that another (open source) file system - CVMFS - may suit your 
container use case significantly better. 
We use that to store our container images (and software in several versions). 
The containers are rebuilt daily. 
CVMFS is read-only for the clients by design. An administrator commits changes 
on the "Stratum 0" server,
and the clients see the new changes shortly after the commit has happened. 
Things are revisioned, and you can roll back in case something goes wrong. 
Why did we choose CVMFS here? 
- No need to have an explicit write-lock when changing things. 
- Deduplication built-in. We build several new containers daily, and keep them 
for 30 days (for long-running jobs). 
  Deduplication spares us from the need to have many factors more of storage. 
  I still hope Ceph learns deduplication some day ;-). 
- Extreme caching. The file system works via HTTP, i.e. you can use standard 
caching proxies (squids), and all clients have their own local disk cache. The 
deduplication
  also applies to that, so only unique chunks need to be fetched. 
High availability is rather easy to get (not as easy as with Ceph, but you can 
have it by running one "Stratum 0" machine which does the writing,
at least two "Stratum 1" machines syncing everything, and, if you want more 
performance, also at least two squid servers in front). 
It's a FUSE filesystem, but unexpectedly well performing especially for small 
files as you have them for software and containers. 
The caching and deduplication heavily reduce traffic when you run many 
containers, especially when they all start concurrently. 

That's just my 2 cents, and your mileage may vary (for example, this does not 
work well if the machines running the containers do not have any local storage 
to cache things). 
And maybe you do not run thousands of containers in parallel, and you do not 
gain as much as we do from the deduplication. 

If it does not fit your case, I think RBD is a good way to go, but sadly I can 
not share experience how well / stable it works with many clients mounting the 
volume read-only in parallel. 
In our virtualization, there's always only one exclusive lock on a volume. 

Cheers,
Oliver

On 17.01.19 at 19:27, Void Star Nill wrote:
> Hi,
> 
> We am trying to use Ceph in our products to address some of the use cases. We 
> think Ceph block device for us. One of the use cases is that we have a number 
> of jobs running in containers that need to have Read-Only access to shared 
> data. The data is written once and is consumed multiple times. I have read 
> through some of the similar discussions and the recommendations on using 
> CephFS for these situations, but in our case Block device makes more sense as 
> it fits well with other use cases and restrictions we have around this use 
> case.
> 
> The following scenario seems to work as expected when we tried on a test 
> cluster, but we wanted to get an expert opinion to see if there would be any 
> issues in production. The usage scenario is as follows:
> 
> - A block device is created with "--image-shared" options:
> 
> rbd create mypool/foo --size 4G --image-shared
> 
> 
> - The image is mapped to a host, formatted in ext4 format (or other file 
> formats), mounted to a directory in read/write mode and data is written to 
> it. Please note that the image will be mapped in exclusive write mode -- no 
> other read/write mounts are allowed a this time.
> 
> - The volume is unmapped from the host and then mapped on to N number of 
> other hosts where it will be mounted in read-only mode and the data is read 
> simultaneously from N readers
> 
> As mentioned above, this seems to work as expected, but we wanted to confirm 
> that we won't run into any unexpected issues.
> 
> Appreciate any inputs on this.
> 
> Thanks,
> Shridhar
> 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] slow requests and high i/o / read rate on bluestore osds after upgrade 12.2.8 -> 12.2.10

2019-01-17 Thread Mark Nelson

Hi Stefan,


I'm taking a stab at reproducing this in-house.  Any details you can 
give me that might help would be much appreciated.  I'll let you know 
what I find.



Thanks,

Mark


On 1/16/19 1:56 PM, Stefan Priebe - Profihost AG wrote:

I reverted the whole cluster back to 12.2.8 - recovery speed had also
dropped from 300-400MB/s to 20MB/s on 12.2.10. So something is really
broken.

Greets,
Stefan
On 16.01.19 at 16:00, Stefan Priebe - Profihost AG wrote:

This is not the case with 12.2.8 - it happens with 12.2.9 as well. After
boot all pgs are instantly active - no inactive pgs, at least not
noticeable in ceph -s.

With 12.2.9 or 12.2.10 or even current upstream/luminous it takes
minutes until all pgs are active again.

Greets,
Stefan
On 16.01.19 at 15:22, Stefan Priebe - Profihost AG wrote:

Hello,

while digging into this further I saw that it takes ages until all pgs
are active. After starting the OSD, 3% of all pgs are inactive and it
takes minutes before they're active.

The log of the OSD is full of:


2019-01-16 15:19:13.568527 7fecbf7da700  0 osd.33 pg_epoch: 1318479
pg[5.563( v 1318474'61584855 lc 1318356'61576253 (1318287'615747
21,1318474'61584855] local-lis/les=1318472/1318473 n=1912
ec=133405/133405 lis/c 1318472/1278145 les/c/f 1318473/1278148/1211861 131
8472/1318472/1318472) [33,3,22] r=0 lpr=1318472 pi=[1278145,1318472)/1
rops=4 crt=1318474'61584855 mlcod 1318356'61576253 active+rec
overing+degraded m=184 snaptrimq=[ec1a0~1,ec808~1]
mbc={255={(2+0)=185,(3+0)=2}}] _update_calc_stats ml 185 upset size 3 up 2
2019-01-16 15:19:13.568637 7fecbf7da700  0 osd.33 pg_epoch: 1318479
pg[5.563( v 1318474'61584855 lc 1318356'61576253 (1318287'615747
21,1318474'61584855] local-lis/les=1318472/1318473 n=1912
ec=133405/133405 lis/c 1318472/1278145 les/c/f 1318473/1278148/1211861 131
8472/1318472/1318472) [33,3,22] r=0 lpr=1318472 pi=[1278145,1318472)/1
rops=4 crt=1318474'61584855 mlcod 1318356'61576253 active+rec
overing+degraded m=184 snaptrimq=[ec1a0~1,ec808~1]
mbc={255={(2+0)=185,(3+0)=2}}] _update_calc_stats ml 2 upset size 3 up 3
2019-01-16 15:19:15.909327 7fecbf7da700  0 osd.33 pg_epoch: 1318479
pg[5.563( v 1318474'61584855 lc 1318356'61576253 (1318287'615747
21,1318474'61584855] local-lis/les=1318472/1318473 n=1912
ec=133405/133405 lis/c 1318472/1278145 les/c/f 1318473/1278148/1211861 131
8472/1318472/1318472) [33,3,22] r=0 lpr=1318472 pi=[1278145,1318472)/1
rops=4 crt=1318474'61584855 mlcod 1318356'61576253 active+rec
overing+degraded m=183 snaptrimq=[ec1a0~1,ec808~1]
mbc={255={(2+0)=184,(3+0)=3}}] _update_calc_stats ml 184 upset size 3 up 2
2019-01-16 15:19:15.909446 7fecbf7da700  0 osd.33 pg_epoch: 1318479
pg[5.563( v 1318474'61584855 lc 1318356'61576253 (1318287'615747
21,1318474'61584855] local-lis/les=1318472/1318473 n=1912
ec=133405/133405 lis/c 1318472/1278145 les/c/f 1318473/1278148/1211861 131
8472/1318472/1318472) [33,3,22] r=0 lpr=1318472 pi=[1278145,1318472)/1
rops=4 crt=1318474'61584855 mlcod 1318356'61576253 active+rec
overing+degraded m=183 snaptrimq=[ec1a0~1,ec808~1]
mbc={255={(2+0)=184,(3+0)=3}}] _update_calc_stats ml 3 upset size 3 up 3
2019-01-16 15:19:23.503231 7fecb97ff700  0 osd.33 pg_epoch: 1318479
pg[5.563( v 1318474'61584855 lc 1318356'61576253 (1318287'615747
21,1318474'61584855] local-lis/les=1318472/1318473 n=1912
ec=133405/133405 lis/c 1318472/1278145 les/c/f 1318473/1278148/1211861 131
8472/1318472/1318472) [33,3,22] r=0 lpr=1318472 pi=[1278145,1318472)/1
rops=4 crt=1318474'61584855 mlcod 1318356'61576253 active+rec
overing+degraded m=183 snaptrimq=[ec1a0~1,ec808~1]
mbc={255={(2+0)=183,(3+0)=3}}] _update_calc_stats ml 183 upset size 3 up 2

Greets,
Stefan
On 16.01.19 at 09:12, Stefan Priebe - Profihost AG wrote:

Hi,

No, ok it was not. The bug is still present. It was only working because the
osdmap was so far away that it started backfill instead of recovery.

So it happens only in the recovery case.

Greets,
Stefan

On 15.01.19 at 16:02, Stefan Priebe - Profihost AG wrote:

On 15.01.19 at 12:45, Marc Roos wrote:
I upgraded this weekend from 12.2.8 to 12.2.10 without such issues

(osd's are idle)


It turns out this was a kernel bug. Updating to a newer kernel has
solved this issue.

Greets,
Stefan



-Original Message-
From: Stefan Priebe - Profihost AG [mailto:s.pri...@profihost.ag]
Sent: 15 January 2019 10:26
To: ceph-users@lists.ceph.com
Cc: n.fahldi...@profihost.ag
Subject: Re: [ceph-users] slow requests and high i/o / read rate on
bluestore osds after upgrade 12.2.8 -> 12.2.10

Hello list,

I also tested the current upstream/luminous branch and it happens as well. A
clean install works fine. It only happens on upgraded bluestore osds.

Greets,
Stefan

On 14.01.19 at 20:35, Stefan Priebe - Profihost AG wrote:

while trying to upgrade a cluster from 12.2.8 to 12.2.10 I'm experiencing
issues with bluestore osds - so I canceled the upgrade and all bluestore
osds are stopped now.

After starting a bluestore osd I'm seeing a lot of slow requests caused

[ceph-users] Bluestore 32bit max_object_size limit

2019-01-17 Thread KEVIN MICHAEL HRPCEK
Hey,

I recall reading about this somewhere but I can't find it in the docs or list 
archive, and confirmation from a dev or someone who knows for sure would be 
nice. What I recall is that bluestore has a max 4GB object size limit based on 
the design of bluestore, not the osd_max_object_size setting. The bluestore 
source seems to confirm this: OBJECT_MAX_SIZE is set to the 32-bit max, an 
error is given if osd_max_object_size is > OBJECT_MAX_SIZE, and the data is 
not written if offset+length >= OBJECT_MAX_SIZE. So it seems like the 
per-object size in the OSD can't exceed 32 bits, which is 4GB, like FAT32. Am 
I correct, or maybe I'm reading all this wrong?

If bluestore has a hard 4GB object limit, using radosstriper to break up an 
object would work, but does using an EC pool that breaks the object into 
shards smaller than OBJECT_MAX_SIZE have the same effect as radosstriper for 
getting around the 4GB limit? We use rados directly and would like to move to 
bluestore, but we have some large objects <= 13G that may need attention if 
this 4GB limit does exist and an EC pool doesn't get around it.


https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L88
#define OBJECT_MAX_SIZE 0xffffffff // 32 bits

https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L4395

 // sanity check(s)
  auto osd_max_object_size =
cct->_conf.get_val("osd_max_object_size");
  if (osd_max_object_size >= (size_t)OBJECT_MAX_SIZE) {
derr << __func__ << " osd_max_object_size >= 0x" << std::hex << 
OBJECT_MAX_SIZE
  << "; BlueStore has hard limit of 0x" << OBJECT_MAX_SIZE << "." <<  
std::dec << dendl;
return -EINVAL;
  }


https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L12331
  if (offset + length >= OBJECT_MAX_SIZE) {
r = -E2BIG;
  } else {
_assign_nid(txc, o);
r = _do_write(txc, c, o, offset, length, bl, fadvise_flags);
txc->write_onode(o);
  }

Thanks!
Kevin
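
For reference, a hedged sketch of checking the configured limit and of
exercising libradosstriper via the rados CLI, which splits one logical object
into many smaller rados objects (pool and object names are placeholders):

# ceph daemon osd.0 config get osd_max_object_size
# rados -p mypool --striper put bigobject ./13G.file
# rados -p mypool --striper stat bigobject
# rados -p mypool ls | grep bigobject      (shows the individual striped pieces)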


--
Kevin Hrpcek
NASA SNPP Atmosphere SIPS
Space Science & Engineering Center
University of Wisconsin-Madison
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] How to reduce min_size of an EC pool?

2019-01-17 Thread Félix Barbeira
I want to bring my cluster back to a HEALTHY state because right now I do
not have access to the data.

I have a 3+2 EC pool on a 5-node cluster. 3 nodes were lost, all data
wiped. They were reinstalled and added to the cluster again.

The "ceph health detail" command says to reduce min_size number to a value
lower than 3, but:

root@ceph-monitor02:~# ceph osd pool set default.rgw.buckets.data min_size 2
Error EINVAL: pool min_size must be between 3 and 5
root@ceph-monitor02:~#


This is the situation:


root@ceph-monitor01:~# ceph -s
  cluster:
id: ce78b02d-03df-4f9e-a35a-31b5f05c4c63
health: HEALTH_WARN
Reduced data availability: 515 pgs inactive, 512 pgs incomplete

  services:
mon: 3 daemons, quorum ceph-monitor01,ceph-monitor03,ceph-monitor02
mgr: ceph-monitor02(active), standbys: ceph-monitor01, ceph-monitor03
osd: 57 osds: 57 up, 57 in

  data:
pools:   8 pools, 568 pgs
objects: 4.48 M objects, 10 TiB
usage:   24 TiB used, 395 TiB / 419 TiB avail
pgs: 0.528% pgs unknown
 90.141% pgs not active
 512 incomplete
 53  active+clean
 3   unknown

root@ceph-monitor01:~#


And this is the output of health detail:

root@ceph-monitor01:~# ceph health detail
HEALTH_WARN Reduced data availability: 515 pgs inactive, 512 pgs incomplete
PG_AVAILABILITY Reduced data availability: 515 pgs inactive, 512 pgs
incomplete
pg 10.1cd is stuck inactive since forever, current state incomplete,
last acting [9,48,41,58,17] (reducing pool default.rgw.buckets.data
min_size from 3 may help; search ceph.com/docs for 'incomplete')
pg 10.1ce is incomplete, acting [3,13,14,42,21] (reducing pool
default.rgw.buckets.data min_size from 3 may help; search ceph.com/docs for
'incomplete')
pg 10.1cf is incomplete, acting [36,27,3,39,51] (reducing pool
default.rgw.buckets.data min_size from 3 may help; search ceph.com/docs for
'incomplete')
pg 10.1d0 is incomplete, acting [29,9,38,4,56] (reducing pool
default.rgw.buckets.data min_size from 3 may help; search ceph.com/docs for
'incomplete')
pg 10.1d1 is incomplete, acting [2,34,17,7,30] (reducing pool
default.rgw.buckets.data min_size from 3 may help; search ceph.com/docs for
'incomplete')
pg 10.1d2 is incomplete, acting [41,45,53,13,32] (reducing pool
default.rgw.buckets.data min_size from 3 may help; search ceph.com/docs for
'incomplete')
pg 10.1d3 is incomplete, acting [7,28,15,20,3] (reducing pool
default.rgw.buckets.data min_size from 3 may help; search ceph.com/docs for
'incomplete')
pg 10.1d4 is incomplete, acting [11,40,25,23,0] (reducing pool
default.rgw.buckets.data min_size from 3 may help; search ceph.com/docs for
'incomplete')
pg 10.1d5 is incomplete, acting [32,51,20,57,28] (reducing pool
default.rgw.buckets.data min_size from 3 may help; search ceph.com/docs for
'incomplete')
pg 10.1d6 is incomplete, acting [2,53,8,16,15] (reducing pool
default.rgw.buckets.data min_size from 3 may help; search ceph.com/docs for
'incomplete')
pg 10.1d7 is incomplete, acting [1,2,33,43,42] (reducing pool
default.rgw.buckets.data min_size from 3 may help; search ceph.com/docs for
'incomplete')
pg 10.1d8 is incomplete, acting [27,49,9,48,20] (reducing pool
default.rgw.buckets.data min_size from 3 may help; search ceph.com/docs for
'incomplete')
pg 10.1d9 is incomplete, acting [37,8,7,11,20] (reducing pool
default.rgw.buckets.data min_size from 3 may help; search ceph.com/docs for
'incomplete')
pg 10.1da is incomplete, acting [27,14,33,15,53] (reducing pool
default.rgw.buckets.data min_size from 3 may help; search ceph.com/docs for
'incomplete')
pg 10.1db is incomplete, acting [58,53,6,26,4] (reducing pool
default.rgw.buckets.data min_size from 3 may help; search ceph.com/docs for
'incomplete')
pg 10.1dc is incomplete, acting [21,12,47,35,19] (reducing pool
default.rgw.buckets.data min_size from 3 may help; search ceph.com/docs for
'incomplete')
pg 10.1dd is incomplete, acting [51,4,52,24,7] (reducing pool
default.rgw.buckets.data min_size from 3 may help; search ceph.com/docs for
'incomplete')
pg 10.1de is incomplete, acting [38,29,21,41,44] (reducing pool
default.rgw.buckets.data min_size from 3 may help; search ceph.com/docs for
'incomplete')
pg 10.1df is incomplete, acting [25,4,30,61,11] (reducing pool
default.rgw.buckets.data min_size from 3 may help; search ceph.com/docs for
'incomplete')
pg 10.1e0 is incomplete, acting [27,57,21,6,13] (reducing pool
default.rgw.buckets.data min_size from 3 may help; search ceph.com/docs for
'incomplete')
pg 10.1e1 is incomplete, acting [8,7,25,15,29] (reducing pool
default.rgw.buckets.data min_size from 3 may help; search ceph.com/docs for
'incomplete')
pg 10.1e2 is incomplete, acting [49,37,62,11,31] (reducing pool
default.rgw.buckets.data min_size from 3 may help; search ceph.com/docs for
'incomplete')
pg 10.1e3 is incomplete, acting [1,49,32,56,48] (reducing pool
default

Re: [ceph-users] How to reduce min_size of an EC pool?

2019-01-17 Thread Bryan Stillwell
When you use 3+2 EC that means you have 3 data chunks and 2 erasure chunks for 
your data.  So you can handle two failures, but not three.  The min_size 
setting is preventing you from going below 3 because that's the number of data 
chunks you specified for the pool.  I'm sorry to say this, but since the data 
was wiped off the other 3 nodes there isn't anything that can be done to 
recover it.

Bryan
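
For reference, a hedged sketch of how to inspect the pool's EC profile and
min_size (the profile name returned by the first command feeds the second):

# ceph osd pool get default.rgw.buckets.data erasure_code_profile
# ceph osd erasure-code-profile get <profile-name-from-above>
# ceph osd pool get default.rgw.buckets.data min_size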


From: ceph-users  on behalf of Félix 
Barbeira 
Date: Thursday, January 17, 2019 at 1:27 PM
To: Ceph Users 
Subject: [ceph-users] How to reduce min_size of an EC pool?

I want to bring back my cluster to HEALTHY state because right now I have not 
access to the data.

I have an 3+2 EC pool on a 5 node cluster. 3 nodes were lost, all data wiped. 
They were reinstalled and added to cluster again.

The "ceph health detail" command says to reduce min_size number to a value 
lower than 3, but:

root@ceph-monitor02:~# ceph osd pool set default.rgw.buckets.data min_size 2
Error EINVAL: pool min_size must be between 3 and 5
root@ceph-monitor02:~#

This is the situation:

root@ceph-monitor01:~# ceph -s
  cluster:
id: ce78b02d-03df-4f9e-a35a-31b5f05c4c63
health: HEALTH_WARN
Reduced data availability: 515 pgs inactive, 512 pgs incomplete

  services:
mon: 3 daemons, quorum ceph-monitor01,ceph-monitor03,ceph-monitor02
mgr: ceph-monitor02(active), standbys: ceph-monitor01, ceph-monitor03
osd: 57 osds: 57 up, 57 in

  data:
pools:   8 pools, 568 pgs
objects: 4.48 M objects, 10 TiB
usage:   24 TiB used, 395 TiB / 419 TiB avail
pgs: 0.528% pgs unknown
 90.141% pgs not active
 512 incomplete
 53  active+clean
 3   unknown

root@ceph-monitor01:~#

And this is the output of health detail:

root@ceph-monitor01:~# ceph health detail
HEALTH_WARN Reduced data availability: 515 pgs inactive, 512 pgs incomplete
PG_AVAILABILITY Reduced data availability: 515 pgs inactive, 512 pgs incomplete
pg 10.1cd is stuck inactive since forever, current state incomplete, last 
acting [9,48,41,58,17] (reducing pool default.rgw.buckets.data min_size from 3 
may help; search ceph.com/docs for 'incomplete')
pg 10.1ce is incomplete, acting [3,13,14,42,21] (reducing pool 
default.rgw.buckets.data min_size from 3 may help; search 
ceph.com/docs for 'incomplete')
pg 10.1cf is incomplete, acting [36,27,3,39,51] (reducing pool 
default.rgw.buckets.data min_size from 3 may help; search 
ceph.com/docs for 'incomplete')
pg 10.1d0 is incomplete, acting [29,9,38,4,56] (reducing pool 
default.rgw.buckets.data min_size from 3 may help; search 
ceph.com/docs for 'incomplete')
pg 10.1d1 is incomplete, acting [2,34,17,7,30] (reducing pool 
default.rgw.buckets.data min_size from 3 may help; search 
ceph.com/docs for 'incomplete')
pg 10.1d2 is incomplete, acting [41,45,53,13,32] (reducing pool 
default.rgw.buckets.data min_size from 3 may help; search 
ceph.com/docs for 'incomplete')
pg 10.1d3 is incomplete, acting [7,28,15,20,3] (reducing pool 
default.rgw.buckets.data min_size from 3 may help; search 
ceph.com/docs for 'incomplete')
pg 10.1d4 is incomplete, acting [11,40,25,23,0] (reducing pool 
default.rgw.buckets.data min_size from 3 may help; search 
ceph.com/docs for 'incomplete')
pg 10.1d5 is incomplete, acting [32,51,20,57,28] (reducing pool 
default.rgw.buckets.data min_size from 3 may help; search 
ceph.com/docs for 'incomplete')
pg 10.1d6 is incomplete, acting [2,53,8,16,15] (reducing pool 
default.rgw.buckets.data min_size from 3 may help; search 
ceph.com/docs for 'incomplete')
pg 10.1d7 is incomplete, acting [1,2,33,43,42] (reducing pool 
default.rgw.buckets.data min_size from 3 may help; search 
ceph.com/docs for 'incomplete')
pg 10.1d8 is incomplete, acting [27,49,9,48,20] (reducing pool 
default.rgw.buckets.data min_size from 3 may help; search 
ceph.com/docs for 'incomplete')
pg 10.1d9 is incomplete, acting [37,8,7,11,20] (reducing pool 
default.rgw.buckets.data min_size from 3 may help; search 
ceph.com/docs for 'incomplete')
pg 10.1da is incomplete, acting [27,14,33,15,53] (reducing pool 
default.rgw.buckets.data min_size from 3 may help; search 
ceph.com/docs for 'incomplete')
pg 10.1db is incomplete, acting [58,53,6,26,4] (reducing pool 
default.rgw.buckets.data min_size from 3 may help; search 
ceph.com/docs for 'incomplete')
pg 10.1dc is incomplete, acting [21,12,47,35,19] (reducing pool 
default.rgw.buckets.data min_size from 3 may help; search 
ceph.com/docs for 'incomplete')
pg 10.1dd is incomplete, acting [51,4,52,24,7] (red

Re: [ceph-users] Suggestions/experiences with mixed disk sizes and models from 4TB - 14TB

2019-01-17 Thread Bryan Stillwell
I've run my home cluster with drives ranging in size from 500GB to 8TB before,
and the biggest issue you run into is that the bigger drives will get a
proportionally larger number of PGs, which will increase the memory
requirements on them.  Typically you want around 100 PGs/OSD, but if you mix
4TB and 14TB drives in a cluster the 14TB drives will have 3.5 times the number
of PGs.  So if the 4TB drives have 100 PGs, the 14TB drives will have 350.  Or
if the 14TB drives have 100 PGs, the 4TB drives will have just 28 PGs on them.
Using the balancer plugin in the mgr will pretty much be required.
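
For what it's worth, a minimal sketch of turning the balancer on under
Luminous (upmap mode assumes every client is Luminous-capable, so check that
first):

ceph osd df tree                                  # current PG count per OSD
ceph osd set-require-min-compat-client luminous   # needed before upmap mode
ceph mgr module enable balancer
ceph balancer mode upmap
ceph balancer on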

Also since you're using EC you'll need to make sure the math works with these 
nodes receiving 2-3.5 times the data.

Bryan

From: ceph-users  on behalf of Götz Reinicke 

Date: Wednesday, January 16, 2019 at 2:33 AM
To: ceph-users 
Subject: [ceph-users] Suggestions/experiences with mixed disk sizes and models 
from 4TB - 14TB

Dear Ceph users,

I’d like to get some feedback for the following thought:

Currently I run some 24*4TB bluestore OSD nodes. The main focus is on storage 
space over IOPS.

We use erasure code and cephfs, and things look good right now.

The „but“ is: I do need more disk space and don’t have much more rack space
available, so I was thinking of adding some 8TB or even 12TB OSDs and/or
exchanging the 4TB OSDs for bigger disks over time.

My question is: What are your experiences with the current >=8TB SATA disks?
Are there some very bad models out there which I should avoid?

The current OSD nodes are connected by 4*10Gb bonds, so for
replication/recovery speed, is a 24-bay chassis with bigger disks useful, or
should I go with smaller chassis? Or does the chassis size not matter that much
at all in my setup?

I know EC is quite compute-intensive, so maybe bigger disks also have an impact
there?

Lots of questions, maybe you can help answer some of them.

Best regards and thanks a lot for the feedback. Götz



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] slow requests and high i/o / read rate on bluestore osds after upgrade 12.2.8 -> 12.2.10

2019-01-17 Thread Stefan Priebe - Profihost AG
Hello Mark,

for whatever reason I didn't get your mails - most probably you kicked
me out of CC/TO and only sent to the ML? I'm only subscribed to a daily
digest. (changed that for now)

So i'm very sorry to answer so late.

My messages might sound a bit confused, as the issue isn't easily reproduced
and we have tried a lot to find out what's going on.

As 12.2.10 does not contain the pg hard limit, I don't suspect it is
related to that.

What i can tell right now is:

1.) Under 12.2.8 we've set bluestore_cache_size = 1073741824

2.) While upgrading to 12.2.10 we replaced it with osd_memory_target =
1073741824

3.) I also tried 12.2.10 without setting osd_memory_target or
bluestore_cache_size

4.) it's not kernel related - for some unknown reason it worked for some
hours with a newer kernel but gave problems again later

5.) a backfill with 12.2.10 of 6x 2TB SSDs took about 14 hours using
12.2.10 while it took 2 hours with 12.2.8

6.) with 12.2.10 I have a constant rate of 100% read I/O (400-500MB/s)
on most of my bluestore OSDs - while on 12.2.8 I see 100kB - 2MB/s max
read.

7.) upgrades on small clusters or fresh installs seem to work fine. (no
idea why, or whether it is related to cluster size)

That's currently all i know.

Thanks a lot!

Greets,
Stefan
Am 16.01.19 um 20:56 schrieb Stefan Priebe - Profihost AG:
> i reverted the whole cluster back to 12.2.8 - recovery speed also
> dropped from 300-400MB/s to 20MB/s on 12.2.10. So something is really
> broken.
> 
> Greets,
> Stefan
> Am 16.01.19 um 16:00 schrieb Stefan Priebe - Profihost AG:
>> This is not the case with 12.2.8 - it happens with 12.2.9 as well. After
>> boot all pgs are instantly active - not inactive pgs at least not
>> noticable in ceph -s.
>>
>> With 12.2.9 or 12.2.10 or eben current upstream/luminous it takes
>> minutes until all pgs are active again.
>>
>> Greets,
>> Stefan
>> Am 16.01.19 um 15:22 schrieb Stefan Priebe - Profihost AG:
>>> Hello,
>>>
>>> while digging into this further i saw that it takes ages until all pgs
>>> are active. After starting the OSD 3% of all pgs are inactive and it
>>> takes minutes after they're active.
>>>
>>> The log of the OSD is full of:
>>>
>>>
>>> 2019-01-16 15:19:13.568527 7fecbf7da700  0 osd.33 pg_epoch: 1318479
>>> pg[5.563( v 1318474'61584855 lc 1318356'61576253 (1318287'615747
>>> 21,1318474'61584855] local-lis/les=1318472/1318473 n=1912
>>> ec=133405/133405 lis/c 1318472/1278145 les/c/f 1318473/1278148/1211861 131
>>> 8472/1318472/1318472) [33,3,22] r=0 lpr=1318472 pi=[1278145,1318472)/1
>>> rops=4 crt=1318474'61584855 mlcod 1318356'61576253 active+rec
>>> overing+degraded m=184 snaptrimq=[ec1a0~1,ec808~1]
>>> mbc={255={(2+0)=185,(3+0)=2}}] _update_calc_stats ml 185 upset size 3 up 2
>>> 2019-01-16 15:19:13.568637 7fecbf7da700  0 osd.33 pg_epoch: 1318479
>>> pg[5.563( v 1318474'61584855 lc 1318356'61576253 (1318287'615747
>>> 21,1318474'61584855] local-lis/les=1318472/1318473 n=1912
>>> ec=133405/133405 lis/c 1318472/1278145 les/c/f 1318473/1278148/1211861 131
>>> 8472/1318472/1318472) [33,3,22] r=0 lpr=1318472 pi=[1278145,1318472)/1
>>> rops=4 crt=1318474'61584855 mlcod 1318356'61576253 active+rec
>>> overing+degraded m=184 snaptrimq=[ec1a0~1,ec808~1]
>>> mbc={255={(2+0)=185,(3+0)=2}}] _update_calc_stats ml 2 upset size 3 up 3
>>> 2019-01-16 15:19:15.909327 7fecbf7da700  0 osd.33 pg_epoch: 1318479
>>> pg[5.563( v 1318474'61584855 lc 1318356'61576253 (1318287'615747
>>> 21,1318474'61584855] local-lis/les=1318472/1318473 n=1912
>>> ec=133405/133405 lis/c 1318472/1278145 les/c/f 1318473/1278148/1211861 131
>>> 8472/1318472/1318472) [33,3,22] r=0 lpr=1318472 pi=[1278145,1318472)/1
>>> rops=4 crt=1318474'61584855 mlcod 1318356'61576253 active+rec
>>> overing+degraded m=183 snaptrimq=[ec1a0~1,ec808~1]
>>> mbc={255={(2+0)=184,(3+0)=3}}] _update_calc_stats ml 184 upset size 3 up 2
>>> 2019-01-16 15:19:15.909446 7fecbf7da700  0 osd.33 pg_epoch: 1318479
>>> pg[5.563( v 1318474'61584855 lc 1318356'61576253 (1318287'615747
>>> 21,1318474'61584855] local-lis/les=1318472/1318473 n=1912
>>> ec=133405/133405 lis/c 1318472/1278145 les/c/f 1318473/1278148/1211861 131
>>> 8472/1318472/1318472) [33,3,22] r=0 lpr=1318472 pi=[1278145,1318472)/1
>>> rops=4 crt=1318474'61584855 mlcod 1318356'61576253 active+rec
>>> overing+degraded m=183 snaptrimq=[ec1a0~1,ec808~1]
>>> mbc={255={(2+0)=184,(3+0)=3}}] _update_calc_stats ml 3 upset size 3 up 3
>>> 2019-01-16 15:19:23.503231 7fecb97ff700  0 osd.33 pg_epoch: 1318479
>>> pg[5.563( v 1318474'61584855 lc 1318356'61576253 (1318287'615747
>>> 21,1318474'61584855] local-lis/les=1318472/1318473 n=1912
>>> ec=133405/133405 lis/c 1318472/1278145 les/c/f 1318473/1278148/1211861 131
>>> 8472/1318472/1318472) [33,3,22] r=0 lpr=1318472 pi=[1278145,1318472)/1
>>> rops=4 crt=1318474'61584855 mlcod 1318356'61576253 active+rec
>>> overing+degraded m=183 snaptrimq=[ec1a0~1,ec808~1]
>>> mbc={255={(2+0)=183,(3+0)=3}}] _update_calc_stats ml 183 upset size 3 up 2
>>>
>>> Greets,
>>> Stefan
>>> Am 16.01.19 u

Re: [ceph-users] slow requests and high i/o / read rate on bluestore osds after upgrade 12.2.8 -> 12.2.10

2019-01-17 Thread Stefan Priebe - Profihost AG
Hello Mark,

after reading
http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/
again, I'm really confused about how exactly the memory behaviour differs
between 12.2.8 and 12.2.10.

Also I stumbled upon "When tcmalloc and cache autotuning is enabled," -
we're compiling against and using jemalloc. What happens in this case?

Also I see now that 12.2.10 uses 1GB of memory max while 12.2.8 uses 6-7GB
(with bluestore_cache_size = 1073741824).

Greets,
Stefan

Am 17.01.19 um 22:59 schrieb Stefan Priebe - Profihost AG:
> Hello Mark,
> 
> for whatever reason i didn't get your mails - most probably you kicked
> me out of CC/TO and only sent to the ML? I've only subscribed to a daily
> digest. (changed that for now)
> 
> So i'm very sorry to answer so late.
> 
> My messages might sound a bit confuse as it isn't easy reproduced and we
> tried a lot to find out what's going on.
> 
> As 12.2.10 does not contain the pg hard limit i don't suspect it is
> related to it.
> 
> What i can tell right now is:
> 
> 1.) Under 12.2.8 we've set bluestore_cache_size = 1073741824
> 
> 2.) While upgrading to 12.2.10 we replaced it with osd_memory_target =
> 1073741824
> 
> 3.) i also tried 12.2.10 without setting osd_memory_target or
> bluestore_cache_size
> 
> 4.) it's not kernel related - for some unknown reason it worked for some
> hours with a newer kernel but gave problems again later
> 
> 5.) a backfill with 12.2.10 of 6x 2TB SSDs took about 14 hours using
> 12.2.10 while it took 2 hours with 12.2.8
> 
> 6.) with 12.2.10 i have a constant rate of 100% read i/o (400-500MB/s)
> on most of my bluestore OSDs - while on 12.2.8 i've 100kb - 2MB/s max
> read on 12.2.8.
> 
> 7.) upgrades on small clusters or fresh installs seem to work fine. (no
> idea why or it is related to cluste size)
> 
> That's currently all i know.
> 
> Thanks a lot!
> 
> Greets,
> Stefan
> Am 16.01.19 um 20:56 schrieb Stefan Priebe - Profihost AG:
>> i reverted the whole cluster back to 12.2.8 - recovery speed also
>> dropped from 300-400MB/s to 20MB/s on 12.2.10. So something is really
>> broken.
>>
>> Greets,
>> Stefan
>> Am 16.01.19 um 16:00 schrieb Stefan Priebe - Profihost AG:
>>> This is not the case with 12.2.8 - it happens with 12.2.9 as well. After
>>> boot all pgs are instantly active - not inactive pgs at least not
>>> noticable in ceph -s.
>>>
>>> With 12.2.9 or 12.2.10 or eben current upstream/luminous it takes
>>> minutes until all pgs are active again.
>>>
>>> Greets,
>>> Stefan
>>> Am 16.01.19 um 15:22 schrieb Stefan Priebe - Profihost AG:
 Hello,

 while digging into this further i saw that it takes ages until all pgs
 are active. After starting the OSD 3% of all pgs are inactive and it
 takes minutes after they're active.

 The log of the OSD is full of:


 2019-01-16 15:19:13.568527 7fecbf7da700  0 osd.33 pg_epoch: 1318479
 pg[5.563( v 1318474'61584855 lc 1318356'61576253 (1318287'615747
 21,1318474'61584855] local-lis/les=1318472/1318473 n=1912
 ec=133405/133405 lis/c 1318472/1278145 les/c/f 1318473/1278148/1211861 131
 8472/1318472/1318472) [33,3,22] r=0 lpr=1318472 pi=[1278145,1318472)/1
 rops=4 crt=1318474'61584855 mlcod 1318356'61576253 active+rec
 overing+degraded m=184 snaptrimq=[ec1a0~1,ec808~1]
 mbc={255={(2+0)=185,(3+0)=2}}] _update_calc_stats ml 185 upset size 3 up 2
 2019-01-16 15:19:13.568637 7fecbf7da700  0 osd.33 pg_epoch: 1318479
 pg[5.563( v 1318474'61584855 lc 1318356'61576253 (1318287'615747
 21,1318474'61584855] local-lis/les=1318472/1318473 n=1912
 ec=133405/133405 lis/c 1318472/1278145 les/c/f 1318473/1278148/1211861 131
 8472/1318472/1318472) [33,3,22] r=0 lpr=1318472 pi=[1278145,1318472)/1
 rops=4 crt=1318474'61584855 mlcod 1318356'61576253 active+rec
 overing+degraded m=184 snaptrimq=[ec1a0~1,ec808~1]
 mbc={255={(2+0)=185,(3+0)=2}}] _update_calc_stats ml 2 upset size 3 up 3
 2019-01-16 15:19:15.909327 7fecbf7da700  0 osd.33 pg_epoch: 1318479
 pg[5.563( v 1318474'61584855 lc 1318356'61576253 (1318287'615747
 21,1318474'61584855] local-lis/les=1318472/1318473 n=1912
 ec=133405/133405 lis/c 1318472/1278145 les/c/f 1318473/1278148/1211861 131
 8472/1318472/1318472) [33,3,22] r=0 lpr=1318472 pi=[1278145,1318472)/1
 rops=4 crt=1318474'61584855 mlcod 1318356'61576253 active+rec
 overing+degraded m=183 snaptrimq=[ec1a0~1,ec808~1]
 mbc={255={(2+0)=184,(3+0)=3}}] _update_calc_stats ml 184 upset size 3 up 2
 2019-01-16 15:19:15.909446 7fecbf7da700  0 osd.33 pg_epoch: 1318479
 pg[5.563( v 1318474'61584855 lc 1318356'61576253 (1318287'615747
 21,1318474'61584855] local-lis/les=1318472/1318473 n=1912
 ec=133405/133405 lis/c 1318472/1278145 les/c/f 1318473/1278148/1211861 131
 8472/1318472/1318472) [33,3,22] r=0 lpr=1318472 pi=[1278145,1318472)/1
 rops=4 crt=1318474'61584855 mlcod 1318356'61576253 active+rec
 overing+degraded m=183 snaptrimq=[ec1a0~1,ec808

Re: [ceph-users] Ceph Nautilus Release T-shirt Design

2019-01-17 Thread Tim Serong
On 01/17/2019 04:46 AM, Mike Perez wrote:
> Hey everyone,
> 
> We're getting close to the release of Ceph Nautilus, and I wanted to
> start the discussion of our next shirt!
> 
> It looks like in the past we've used common works from Wikipedia pages.
> 
> https://en.wikipedia.org/wiki/Nautilus
> 
> I thought it would be fun to see who in our community would like to
> contribute to the next design. If we don't have any designers we can
> also gather images and I can get something worked on that the
> community can provide feedback on.
> 
> Lenz has provided this image that is currently being used for the 404
> page of the dashboard:
> 
> https://github.com/ceph/ceph/blob/master/src/pybind/mgr/dashboard/frontend/src/assets/1280px-Nautilus_Octopus.jpg

Nautilus *shells* are somewhat iconic/well known/distinctive.  Maybe a
variant of https://en.wikipedia.org/wiki/File:Nautilus_Section_cut.jpg
would be interesting on a t-shirt?

Regards,

Tim
-- 
Tim Serong
Senior Clustering Engineer
SUSE
tser...@suse.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] slow requests and high i/o / read rate on bluestore osds after upgrade 12.2.8 -> 12.2.10

2019-01-17 Thread Mark Nelson


On 1/17/19 4:06 PM, Stefan Priebe - Profihost AG wrote:

Hello Mark,

after reading
http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/
again i'm really confused how the behaviour is exactly under 12.2.8
regarding memory and 12.2.10.

Also i stumpled upon "When tcmalloc and cache autotuning is enabled," -
we're compiling against and using jemalloc. What happens in this case?



Hi Stefan,


The autotuner uses the existing in-tree perfglue code that grabs the 
tcmalloc heap and unmapped memory statistics to determine how to tune 
the caches.  Theoretically we might be able to do the same thing for 
jemalloc and maybe even glibc malloc, but there's no perfglue code for 
those yet.  If the autotuner can't get heap statistics it won't try to 
tune the caches and should instead revert to using the 
bluestore_cache_size and whatever the ratios are (the same as if you set 
bluestore_cache_autotune to false).
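
In ceph.conf terms that fallback behaves roughly like the following (example
values only, a sketch rather than a recommendation):

[osd]
# pin a fixed 1 GiB BlueStore cache instead of autotuning
bluestore_cache_autotune = false
bluestore_cache_size = 1073741824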




Also i saw now - that 12.2.10 uses 1GB mem max while 12.2.8 uses 6-7GB
Mem (with bluestore_cache_size = 1073741824).



If you are using the autotuner (but it sounds like maybe you are not if 
jemalloc is being used?) you'll want to set the osd_memory_target at 
least 1GB higher than what you previously had the bluestore_cache_size 
set to.  It's likely that trying to set the OSD to stay within 1GB of 
memory will cause the cache to sit at osd_memory_cache_min because the 
tuner simply can't shrink the cache enough to meet the target (too much 
other memory consumed by pglog, rocksdb WAL buffers, random other stuff).
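
As a rough sketch (the exact number is an assumption based on the 1GB cache
you had configured before):

[osd]
# ~2 GiB: the previous 1 GiB bluestore_cache_size plus roughly 1 GiB of overhead
osd_memory_target = 2147483648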


The fact that you see 6-7GB of mem usage with 12.2.8 vs 1GB with 12.2.10 
sounds like a clue.  A bluestore OSD using 1GB of memory is going to 
have very little space for cache and it's quite likely that it would be 
performing reads from disk for a variety of reasons.  Getting to the 
root of that might explain what's going on.  If you happen to still have 
a 12.2.8 OSD up that's consuming 6-7GB of memory (with 
bluestore_cache_size = 1073741824), can you dump the mempool stats and 
running configuration for it?



ceph daemon osd.NNN dump_mempools


And


ceph daemon osd.NNN config show
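
If the full dump is too noisy, something like this narrows it down to the
relevant bits (just a convenience, the plain dumps are fine too):

ceph daemon osd.NNN config show | grep -E 'bluestore_cache|osd_memory'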


Thanks,

Mark




Greets,
Stefan

Am 17.01.19 um 22:59 schrieb Stefan Priebe - Profihost AG:

Hello Mark,

for whatever reason i didn't get your mails - most probably you kicked
me out of CC/TO and only sent to the ML? I've only subscribed to a daily
digest. (changed that for now)

So i'm very sorry to answer so late.

My messages might sound a bit confuse as it isn't easy reproduced and we
tried a lot to find out what's going on.

As 12.2.10 does not contain the pg hard limit i don't suspect it is
related to it.

What i can tell right now is:

1.) Under 12.2.8 we've set bluestore_cache_size = 1073741824

2.) While upgrading to 12.2.10 we replaced it with osd_memory_target =
1073741824

3.) i also tried 12.2.10 without setting osd_memory_target or
bluestore_cache_size

4.) it's not kernel related - for some unknown reason it worked for some
hours with a newer kernel but gave problems again later

5.) a backfill with 12.2.10 of 6x 2TB SSDs took about 14 hours using
12.2.10 while it took 2 hours with 12.2.8

6.) with 12.2.10 i have a constant rate of 100% read i/o (400-500MB/s)
on most of my bluestore OSDs - while on 12.2.8 i've 100kb - 2MB/s max
read on 12.2.8.

7.) upgrades on small clusters or fresh installs seem to work fine. (no
idea why or it is related to cluste size)

That's currently all i know.

Thanks a lot!

Greets,
Stefan
Am 16.01.19 um 20:56 schrieb Stefan Priebe - Profihost AG:

i reverted the whole cluster back to 12.2.8 - recovery speed also
dropped from 300-400MB/s to 20MB/s on 12.2.10. So something is really
broken.

Greets,
Stefan
Am 16.01.19 um 16:00 schrieb Stefan Priebe - Profihost AG:

This is not the case with 12.2.8 - it happens with 12.2.9 as well. After
boot all pgs are instantly active - not inactive pgs at least not
noticable in ceph -s.

With 12.2.9 or 12.2.10 or eben current upstream/luminous it takes
minutes until all pgs are active again.

Greets,
Stefan
Am 16.01.19 um 15:22 schrieb Stefan Priebe - Profihost AG:

Hello,

while digging into this further i saw that it takes ages until all pgs
are active. After starting the OSD 3% of all pgs are inactive and it
takes minutes after they're active.

The log of the OSD is full of:


2019-01-16 15:19:13.568527 7fecbf7da700  0 osd.33 pg_epoch: 1318479
pg[5.563( v 1318474'61584855 lc 1318356'61576253 (1318287'615747
21,1318474'61584855] local-lis/les=1318472/1318473 n=1912
ec=133405/133405 lis/c 1318472/1278145 les/c/f 1318473/1278148/1211861 131
8472/1318472/1318472) [33,3,22] r=0 lpr=1318472 pi=[1278145,1318472)/1
rops=4 crt=1318474'61584855 mlcod 1318356'61576253 active+rec
overing+degraded m=184 snaptrimq=[ec1a0~1,ec808~1]
mbc={255={(2+0)=185,(3+0)=2}}] _update_calc_stats ml 185 upset size 3 up 2
2019-01-16 15:19:13.568637 7fecbf7da700  0 osd.33 pg_epoch: 1318479
pg[5.563( v 1318474'61584855 lc 1318356'

Re: [ceph-users] Ceph Nautilus Release T-shirt Design

2019-01-17 Thread Anthony D'Atri
>> Lenz has provided this image that is currently being used for the 404
>> page of the dashboard:
>> 
>> https://github.com/ceph/ceph/blob/master/src/pybind/mgr/dashboard/frontend/src/assets/1280px-Nautilus_Octopus.jpg
> 
> Nautilus *shells* are somewhat iconic/well known/distinctive.  Maybe a
> variant of https://en.wikipedia.org/wiki/File:Nautilus_Section_cut.jpg
> would be interesting on a t-shirt?

I agree with Tim.  T-shirts with photos can be tricky; it’s easy for them to
look cheesy and they don’t age well.

In the same vein, something with a lower bit depth and not a cross-section
might be slightly more recognizable:

https://www.vectorstock.com/royalty-free-vector/nautilus-vector-2806848

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to do multiple cephfs mounts.

2019-01-17 Thread Patrick Donnelly
On Thu, Jan 17, 2019 at 3:23 AM Marc Roos  wrote:
> Should I not be able to increase the io's by splitting the data writes
> over eg. 2 cephfs mounts? I am still getting similar overall
> performance. Is it even possible to increase performance by using
> multiple mounts?
>
> Using 2 kernel mounts on CentOS 7.6

It's unlikely this changes anything unless you also split the workload
into two. That may allow the kernel to do parallel requests?
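
For example, something along these lines (hypothetical monitor address and
directory names) gives each half of the workload its own kernel mount:

mount -t ceph 10.0.0.1:6789:/projectA /mnt/cephfs-a -o name=admin,secretfile=/etc/ceph/admin.secret
mount -t ceph 10.0.0.1:6789:/projectB /mnt/cephfs-b -o name=admin,secretfile=/etc/ceph/admin.secret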

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Multi-filesystem wthin a cluster

2019-01-17 Thread Patrick Donnelly
On Thu, Jan 17, 2019 at 2:44 AM Dan van der Ster  wrote:
>
> On Wed, Jan 16, 2019 at 11:17 PM Patrick Donnelly  wrote:
> >
> > On Wed, Jan 16, 2019 at 1:21 AM Marvin Zhang  wrote:
> > > Hi CephFS experts,
> > > From document, I know multi-fs within a cluster is still experiment 
> > > feature.
> > > 1. Is there any estimation about stability and performance for this 
> > > feature?
> >
> > Remaining blockers [1] need completed. No developer has yet taken on
> > this task. Perhaps by O release.
> >
> > > 2. It seems that each FS will consume at least 1 active MDS and
> > > different FS can't share MDS. Suppose I want to create 10 FS , I need
> > > at least 10 MDS. Is it right? Is ther any limit number for MDS within
> > > a cluster?
> >
> > No limit on number of MDS but there is a limit on the number of
> > actives (multimds).
>
> TIL...
> What is the max number of actives in a single FS?

https://github.com/ceph/ceph/blob/39f9e8db4dc7f8bfcb01a9ad20b8961c36138f4f/src/mds/mdstypes.h#L40

I don't think there's a particular reason for this limit. There may be
some parts of the code that expect fewer than 256 active MDS but that
could probably be easily changed.
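
(For completeness, the number of actives for a given filesystem is controlled
by max_mds, e.g.

ceph fs set cephfs max_mds 4

where "cephfs" is just a placeholder for the filesystem name.)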

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] export a rbd over rdma

2019-01-17 Thread Will Zhao
Hi:
   Recently I was trying to find a way to map an RBD device that can
talk to the back end over RDMA. There are three ways to export an RBD
device: krbd, nbd, and iSCSI. It seems that only iSCSI may offer a
chance. Has anyone tried to configure this and can give some advice?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] [ceph-ansible]Failure at TASK [ceph-osd : activate osd(s) when device is a disk]

2019-01-17 Thread Cody
Hello,

I ran into an error [1] while using OpenStack-Ansible to deploy Ceph
(using ceph-ansible 3.1).

My configuration was to use a non-collocated scenario with one SSD
(/dev/sdb) and two HDDs (/dev/sdc, /dev/sdd) on every host. The Ceph OSD
configuration can be found here [2].

[1] https://pasted.tech/pastes/af4e0b3b76c08e2f5790c89123a9fcb7ac7f726e
[2] https://pasted.tech/pastes/48551abd7d07cd647c7d6c585bb496af80669290
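
For context, a non-collocated layout of that shape is normally expressed in
ceph-ansible 3.1 roughly like this (only a sketch - the actual values are in
the paste above):

osd_scenario: non-collocated
osd_objectstore: bluestore
devices:
  - /dev/sdc
  - /dev/sdd
dedicated_devices:
  - /dev/sdb
  - /dev/sdb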

Any suggestions would be much appreciated.

Thank you very much!

Regards,
Cody
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to reduce min_size of an EC pool?

2019-01-17 Thread Félix Barbeira
Ok, lesson learned the hard way. Thank goodness it was a test cluster.
Thanks a lot Bryan!

El jue., 17 ene. 2019 a las 21:46, Bryan Stillwell ()
escribió:

> When you use 3+2 EC that means you have 3 data chunks and 2 erasure chunks
> for your data.  So you can handle two failures, but not three.  The
> min_size setting is preventing you from going below 3 because that's the
> number of data chunks you specified for the pool.  I'm sorry to say this,
> but since the data was wiped off the other 3 nodes there isn't anything
> that can be done to recover it.
>
>
>
> Bryan
>
>
>
>
>
> *From: *ceph-users  on behalf of Félix
> Barbeira 
> *Date: *Thursday, January 17, 2019 at 1:27 PM
> *To: *Ceph Users 
> *Subject: *[ceph-users] How to reduce min_size of an EC pool?
>
>
>
> I want to bring back my cluster to HEALTHY state because right now I have
> not access to the data.
>
>
>
> I have an 3+2 EC pool on a 5 node cluster. 3 nodes were lost, all data
> wiped. They were reinstalled and added to cluster again.
>
>
>
> The "ceph health detail" command says to reduce min_size number to a value
> lower than 3, but:
>
>
>
> root@ceph-monitor02:~# ceph osd pool set default.rgw.buckets.data
> min_size 2
>
> Error EINVAL: pool min_size must be between 3 and 5
>
> root@ceph-monitor02:~#
>
>
>
> This is the situation:
>
>
>
> root@ceph-monitor01:~# ceph -s
>
>   cluster:
>
> id: ce78b02d-03df-4f9e-a35a-31b5f05c4c63
>
> health: HEALTH_WARN
>
> Reduced data availability: 515 pgs inactive, 512 pgs incomplete
>
>
>
>   services:
>
> mon: 3 daemons, quorum ceph-monitor01,ceph-monitor03,ceph-monitor02
>
> mgr: ceph-monitor02(active), standbys: ceph-monitor01, ceph-monitor03
>
> osd: 57 osds: 57 up, 57 in
>
>
>
>   data:
>
> pools:   8 pools, 568 pgs
>
> objects: 4.48 M objects, 10 TiB
>
> usage:   24 TiB used, 395 TiB / 419 TiB avail
>
> pgs: 0.528% pgs unknown
>
>  90.141% pgs not active
>
>  512 incomplete
>
>  53  active+clean
>
>  3   unknown
>
>
>
> root@ceph-monitor01:~#
>
>
>
> And this is the output of health detail:
>
>
>
> root@ceph-monitor01:~# ceph health detail
>
> HEALTH_WARN Reduced data availability: 515 pgs inactive, 512 pgs incomplete
>
> PG_AVAILABILITY Reduced data availability: 515 pgs inactive, 512 pgs
> incomplete
>
> pg 10.1cd is stuck inactive since forever, current state incomplete,
> last acting [9,48,41,58,17] (reducing pool default.rgw.buckets.data
> min_size from 3 may help; search ceph.com/docs for 'incomplete')
>
> pg 10.1ce is incomplete, acting [3,13,14,42,21] (reducing pool
> default.rgw.buckets.data min_size from 3 may help; search ceph.com/docs
> for 'incomplete')
>
> pg 10.1cf is incomplete, acting [36,27,3,39,51] (reducing pool
> default.rgw.buckets.data min_size from 3 may help; search ceph.com/docs
> for 'incomplete')
>
> pg 10.1d0 is incomplete, acting [29,9,38,4,56] (reducing pool
> default.rgw.buckets.data min_size from 3 may help; search ceph.com/docs
> for 'incomplete')
>
> pg 10.1d1 is incomplete, acting [2,34,17,7,30] (reducing pool
> default.rgw.buckets.data min_size from 3 may help; search ceph.com/docs
> for 'incomplete')
>
> pg 10.1d2 is incomplete, acting [41,45,53,13,32] (reducing pool
> default.rgw.buckets.data min_size from 3 may help; search ceph.com/docs
> for 'incomplete')
>
> pg 10.1d3 is incomplete, acting [7,28,15,20,3] (reducing pool
> default.rgw.buckets.data min_size from 3 may help; search ceph.com/docs
> for 'incomplete')
>
> pg 10.1d4 is incomplete, acting [11,40,25,23,0] (reducing pool
> default.rgw.buckets.data min_size from 3 may help; search ceph.com/docs
> for 'incomplete')
>
> pg 10.1d5 is incomplete, acting [32,51,20,57,28] (reducing pool
> default.rgw.buckets.data min_size from 3 may help; search ceph.com/docs
> for 'incomplete')
>
> pg 10.1d6 is incomplete, acting [2,53,8,16,15] (reducing pool
> default.rgw.buckets.data min_size from 3 may help; search ceph.com/docs
> for 'incomplete')
>
> pg 10.1d7 is incomplete, acting [1,2,33,43,42] (reducing pool
> default.rgw.buckets.data min_size from 3 may help; search ceph.com/docs
> for 'incomplete')
>
> pg 10.1d8 is incomplete, acting [27,49,9,48,20] (reducing pool
> default.rgw.buckets.data min_size from 3 may help; search ceph.com/docs
> for 'incomplete')
>
> pg 10.1d9 is incomplete, acting [37,8,7,11,20] (reducing pool
> default.rgw.buckets.data min_size from 3 may help; search ceph.com/docs
> for 'incomplete')
>
> pg 10.1da is incomplete, acting [27,14,33,15,53] (reducing pool
> default.rgw.buckets.data min_size from 3 may help; search ceph.com/docs
> for 'incomplete')
>
> pg 10.1db is incomplete, acting [58,53,6,26,4] (reducing pool
> default.rgw.buckets.data min_size from 3 may help; search ceph.com/docs
> for 'incomplete')
>
> pg 10.1dc is incomplete, acting [21,12,47,35,19] (reducing pool
> default.rgw.buckets.data min_size from 3 may he