Re: [ceph-users] Seeking your feedback on the Ceph monitoring and management functionality in openATTIC

2016-09-15 Thread John Spray
Congratulations on the release!

John

On Wed, Sep 14, 2016 at 4:08 PM, Lenz Grimmer  wrote:
> Hi,
>
> if you're running a Ceph cluster and would be interested in trying out a
> new tool for managing/monitoring it, we've just released version 2.0.14
> of openATTIC that now provides a first implementation of a cluster
> monitoring dashboard.
>
> This is work in progress, but we'd like to solicit your input and
> feedback early on, to make sure we're on the right track. See this blog
> post for more details:
>
> https://blog.openattic.org/posts/seeking-your-feedback-on-the-ceph-monitoring-and-management-functionality-in-openattic/
>
> Any comments and suggestions are welcome! Thanks in advance.
>
> Lenz
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] jewel blocked requests

2016-09-15 Thread Wido den Hollander

> On 13 September 2016 at 18:54, "WRIGHT, JON R (JON R)" wrote:
> 
> 
> VM Client OS: ubuntu 14.04
> 
> Openstack: kilo
> 
> libvirt: 1.2.12
> 
> nova-compute-kvm: 1:2015.1.4-0ubuntu2
> 

What librados/librbd version are you running on the client?

Wido
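
(For anyone following along, a hedged way to answer this on a
Debian/Ubuntu client host is something like

  $ dpkg -l librbd1 librados2 | grep ^ii
  $ rbd --version

and the OSDs holding blocked requests can usually be found with
"ceph health detail" followed by "ceph daemon osd.<N> dump_ops_in_flight"
on the affected host; treat this as a sketch, not an exact recipe.)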

> Jon
> 
> On 9/13/2016 11:17 AM, Wido den Hollander wrote:
> 
> >> On 13 September 2016 at 15:58, "WRIGHT, JON R (JON R)" wrote:
> >>
> >>
> >> Yes, I do have old clients running.  The clients are all vms.  Is it
> >> typical that vm clients have to be rebuilt after a ceph upgrade?
> >>
> > No, not always, but it is just that I saw this happening recently after a 
> > Jewel upgrade.
> >
> > What version are the client(s) still running?
> >
> > Wido
> >
> >> Thanks,
> >>
> >> Jon
> >>
> >>
> >> On 9/12/2016 4:05 PM, Wido den Hollander wrote:
>  On 12 September 2016 at 18:47, "WRIGHT, JON R (JON R)" wrote:
> 
> 
>  Since upgrading to Jewel from Hammer, we're started to see HEALTH_WARN
>  because of 'blocked requests > 32 sec'.   Seems to be related to writes.
> 
>  Has anyone else seen this?  Or can anyone suggest what the problem might 
>  be?
> 
> >>> Do you by any chance have old clients connecting? I saw this after a 
> >>> Jewel upgrade as well and it was because of very old clients still 
> >>> connecting to the cluster.
> >>>
> >>> Wido
> >>>
>  Thanks!
>  ___
>  ceph-users mailing list
>  ceph-users@lists.ceph.com
>  http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Jewel ceph-mon : high memory usage after few days

2016-09-15 Thread Wido den Hollander

> On 15 September 2016 at 10:34, Florent B wrote:
> 
> 
> Hi everyone,
> 
> I have a Ceph cluster on Jewel.
> 
> Monitors are on 32GB ram hosts.
> 
> After a few days, ceph-mon process uses 25 to 35% of 32GB (8 to 11 GB) :
> 
>  1150 ceph  20   0 15.454g 7.983g   7852 S   0.3 25.5 490:29.11
> ceph-mon   
> 
> Is it expected ?
> 

No, that's rather high.

Is the cluster in HEALTH_OK? And did you change any mon related configuration 
settings?

Wido
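
(A hedged aside for anyone chasing this: with the default admin socket
locations, something along these lines shows where the monitor's memory
sits and, on tcmalloc builds, lets you ask for it back; verify the exact
command names on your version first.

  # ceph daemon mon.$(hostname -s) perf dump
  # ceph tell mon.<id> heap stats
  # ceph tell mon.<id> heap release
)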

> My cluster is small : 17 OSD, 13TB total space.
> 
> When I restart monitor process, it uses less than 1 GB.
> 
> Could it be a memory leak ? What information can I send to you for
> debugging purpose ?
> 
> Thank you.
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Replacing a failed OSD

2016-09-15 Thread Nick Fisk


> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Jim 
> Kilborn
> Sent: 14 September 2016 20:30
> To: Reed Dier 
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Replacing a failed OSD
> 
> Reed,
> 
> 
> 
> Thanks for the response.
> 
> 
> 
> Your process is the one that I ran. However, I have a crushmap with ssd and
> sata drives in different buckets (host made up of host types, with an ssd
> and spinning hosttype for each host) because I am using ssd drives for a
> replicated cache in front of an erasure coded data pool for cephfs.
> 
> 
> 
> I have "osd crush update on start = false" so that osds don't randomly get 
> added to the crush map, because it wouldn't know where
> to put that osd.
> 
> 
> 
> I am using puppet to provision the drives when it sees one in a slot and it
> doesn't see the ceph signature (I guess). I am using the ceph puppet module.
> 
> 
> 
> The real confusion is why I have to remove it from the crush map. Once I
> remove it from the crush map, it does bring it up as the same osd number,
> but it's not in the crush map, so I have to put it back where it belongs.
> Just seems strange that it must be removed from the crush map.
> 
> 
> 
> Basically, I export the crush map, remove the osd from the crush map, then 
> redeploy the drive. Then when it gets up and running as
> the same osd number, I import the exported crush map to get it back in the 
> cluster.
> 
> 
> 
> I guess that is just how it has to be done.

You can pass a script in via the 'osd crush location hook' variable so that the
OSDs automatically get placed in the right location when they start up. Thanks
to Wido there is already a script that you can probably use with very few
modifications:

https://gist.github.com/wido/5d26d88366e28e25e23d
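
(As a rough illustration of the idea, not Wido's script itself: a minimal
hook might look like the sketch below, assuming it is referenced from
ceph.conf with "osd crush location hook = /usr/local/bin/crush-location.sh"
and that the SSD/spinner detection (the ssd_marker file here) is replaced
by whatever your deployment actually uses.

  #!/bin/sh
  # Invoked by ceph-osd roughly as:
  #   crush-location.sh --cluster <cluster> --id <osd-id> --type osd
  # It must print the OSD's CRUSH location on stdout.
  OSD_ID=""
  while [ $# -gt 0 ]; do
      case "$1" in
          --id) shift; OSD_ID="$1" ;;
      esac
      shift
  done
  # Hypothetical per-OSD marker file to tell SSDs from spinners.
  if [ -e "/var/lib/ceph/osd/ceph-${OSD_ID}/ssd_marker" ]; then
      echo "host=$(hostname -s)-ssd root=ssd"
  else
      echo "host=$(hostname -s)-sata root=sata"
  fi

The location the script prints is where the OSD gets placed on startup, so
it should match the bucket names already present in your crushmap.)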


> 
> 
> 
> Thanks again
> 
> 
> 
> Sent from Mail for Windows 10
> 
> 
> 
> From: Reed Dier
> Sent: Wednesday, September 14, 2016 1:39 PM
> To: Jim Kilborn
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Replacing a failed OSD
> 
> 
> 
> Hi Jim,
> 
> This is pretty fresh in my mind so hopefully I can help you out here.
> 
> Firstly, the crush map will back fill any holes in the enumeration that are
> existing. So assuming only one drive has been removed from the crush map,
> it will repopulate the same OSD number.
> 
> My steps for removing an OSD are run from the host node:
> 
> > ceph osd down osd.i
> > ceph osd out osd.i
> > stop ceph-osd id=i
> > umount /var/lib/ceph/osd/ceph-i
> > ceph osd crush remove osd.i
> > ceph auth del osd.i
> > ceph osd rm osd.i
> 
> 
> From here, the disk is removed from the ceph cluster, crush map, and is ready 
> for removal and replacement.
> 
> From there I deploy the new osd with ceph-deploy from my admin node using:
> 
> > ceph-deploy disk list nodei
> > ceph-deploy disk zap nodei:sdX
> > ceph-deploy --overwrite-conf osd prepare nodei:sdX
> 
> 
> This will prepare the disk and insert it back into the crush map, bringing it
> back up and in. The OSD number should remain the same, as it will fill the
> gap left from the previous OSD removal.
> 
> Hopefully this helps,
> 
> Reed
> 
> > On Sep 14, 2016, at 11:00 AM, Jim Kilborn  wrote:
> >
> > I am finishing testing our new cephfs cluster and wanted to document a
> > failed osd procedure.
> > I noticed that when I pulled a drive, to simulate a failure, and ran
> > through the replacement steps, the osd has to be removed from the crushmap
> > in order to initialize the new drive as the same osd number.
> >
> > Is this correct that I have to remove it from the crushmap, then after the
> > osd is initialized, and mounted, add it back to the crush map? Is there no
> > way to have it reuse the same osd # without removing it from the crush map?
> >
> > Thanks for taking the time..
> >
> >
> > -  Jim
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs/ceph-fuse: mds0: Client XXX:XXX failing to respond to capability release

2016-09-15 Thread Wido den Hollander

> On 14 September 2016 at 14:56, "Dennis Kramer (DT)" wrote:
> 
> 
> Hi Burkhard,
> 
> Thank you for your reply, see inline:
> 
> On Wed, 14 Sep 2016, Burkhard Linke wrote:
> 
> > Hi,
> >
> >
> > On 09/14/2016 12:43 PM, Dennis Kramer (DT) wrote:
> >> Hi Goncalo,
> >> 
> >> Thank you. Yes, I have seen that thread, but I have no near-full osds and 
> >> my mds cache size is pretty high.
> >
> > You can use the daemon socket on the mds server to get an overview of the 
> > current cache state:
> >
> > ceph daemon mds.XXX perf dump
> >
> > The message itself indicates that the mds is in fact trying to convince 
> > clients to release capabilities, probably because it is running out of 
> > cache.
> 
> My cache is set to mds_cache_size = 15000000, but you are right, it seems 
> the complete cache is used, but that shouldn't be a real problem if the 
> clients can release the caps in time. Correct me if I'm wrong, but the 
> cache_size is pretty high compared to the default (100k). I will raise the 
> mds_cache_size a bit and see if it helps.
> 

The 100k is very, very conservative. Each cached inode will consume roughly 4k 
of memory.

15,000,000 * 4k =~ 58GB of memory

Can you verify that the MDS is indeed using about that amount of memory?

If you have enough memory available you can always increase the cache size on 
the MDS node(s). More caching in the MDS doesn't hurt in most situations.
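
(A hedged cross-check on the MDS host, assuming the default admin socket
path and an MDS id matching the hostname:

  # ceph daemon mds.$(hostname -s) perf dump mds | grep -E 'inode|cap'
  # ps -o rss=,comm= -C ceph-mds

The first shows the number of cached inodes/caps, the second the resident
memory of the ceph-mds process; exact perf counter names vary a little
between releases.)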

Wido

> > The 'session ls' command on the daemon socket lists all current ceph 
> > clients 
> > and the number of capabilities for each client. Depending on your workload / 
> > applications you might be surprised how many capabilities are assigned to 
> > individual nodes...
> >
> > From the client's point of view, the error means that there's either a bug in 
> > the 
> > client, or an application is keeping a large number of files open (e.g. do 
> > you run mlocate on the clients?)
> I didn't have this issue when I was on hammer, and the number of clients 
> hasn't changed. I have "ceph fuse.ceph fuse.ceph-fuse" in my PRUNEFS for 
> updatedb, so it probably isn't mlocate which would cause this issue.
> The only real difference is my upgrade to Jewel.
> 
> 
> > If you use the kernel based client re-mounting won't help, since the 
> > internal 
> > state is kept the same (afaik). In case of the ceph-fuse client the ugly 
> > way 
> > to get rid of the mount point is a lazy / forced umount and killing the 
> > ceph-fuse process if necessary. Processes with open file handles will 
> > complain afterwards.
> >
> >
> > Before using rude ways to terminate the client session I would propose to 
> > look for rogue applications on the involved host. We had a number of 
> > problems 
> > with multithreaded applications and concurrent file access in the past 
> > (both 
> > with ceph-fuse from hammer and kernel based clients). lsof or other tools 
> > might help locating the application.
> 
> My cluster is back to HEALTH_OK; the involved host has been restarted by 
> the user. But I will debug some more on the host when I see this issue 
> again next time.
> 
> PS: For completeness, I stated that this issue was often seen in my 
> current Jewel environment; I meant to say that this issue comes up only 
> sometimes (so not that often). But the times when I *do* have this issue, it 
> blocks some I/O for clients as a consequence.
> 
> > Regards,
> > Burkhard
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] CephFS: Upper limit for number of files in a directory?

2016-09-15 Thread Burkhard Linke

Hi,

does CephFS impose an upper limit on the number of files in a directory?


We currently have one directory with a large number of subdirectories:

$ ls | wc -l
158141

Creating a new subdirectory fails:

$ touch foo
touch: cannot touch 'foo': No space left on device

Creating files in a different directory does not show any problems. The 
last message in the MDS log relates to the large directory:


2016-09-15 07:51:54.539216 7f24ef2a6700  0 mds.0.bal replicating dir 
[dir 18bf9a6 /volumes/biodb/ncbi_genomes/all/ [2,head] auth 
v=57751905 cv=0/0 ap=0+2+2 state=1073741826|complete f(v0 m2016-08-22 
08:51:34.714570 158141=3+158138) n(v182285 rc2016-08-22 11:59:34.976373 
b3360569421156 2989670=2478235+511435) hs=158141+842,ss=0+0 | child=1 
waiter=0 authpin=0 0x7f252ca05f00] pop 12842 .. rdp 7353.96 adj 0


Any hints what might go wrong in this case? MDS is taken from the 
current jewel git branch due to some pending backports:


# ceph-mds --version
ceph version 10.2.2-508-g9bfc0cf (9bfc0cf178dc21b0fe33e0ce3b90a18858abaf1b)

CephFS is mounted via kernel implementation:

# uname -a
Linux waas 4.6.6-040606-generic #201608100733 SMP Wed Aug 10 11:35:29 
UTC 2016 x86_64 x86_64 x86_64 GNU/Linux


ceph-fuse from jewel is also affected.


Regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Replacing a failed OSD

2016-09-15 Thread Dennis Kramer (DBS)
Hi Jim,

I'm using a location script for OSDs, so when I add an OSD this script
will determine its place in the cluster and in which bucket it belongs.

In your ceph.conf there is a setting you can use:
osd_crush_location_hook = 
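
(a purely hypothetical example:

  osd_crush_location_hook = /usr/local/bin/crush-location.sh

where the script prints a location line such as "host=node1-ssd root=ssd"
for SSD-backed OSDs)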

With regards,

On 09/14/2016 09:30 PM, Jim Kilborn wrote:
> Reed,
> 
> 
> 
> Thanks for the response.
> 
> 
> 
> Your process is the one that I ran. However, I have a crushmap with ssd and 
> sata drives in different buckets (host made up of host types, with an ssd 
> and spinning hosttype for each host) because I am using ssd drives for a 
> replicated cache in front of an erasure coded data pool for cephfs.
> 
> 
> 
> I have “osd crush update on start = false” so that osds don’t randomly get 
> added to the crush map, because it wouldn’t know where to put that osd.
> 
> 
> 
> I am using puppet to provision the drives when it sees one in a slot and it 
> doesn’t see the ceph signature (I guess). I am using the ceph puppet module.
> 
> 
> 
> The real confusion is why I have to remove it from the crush map. Once I 
> remove it from the crush map, it does bring it up as the same osd number, but 
> it's not in the crush map, so I have to put it back where it belongs. Just 
> seems strange that it must be removed from the crush map.
> 
> 
> 
> Basically, I export the crush map, remove the osd from the crush map, then 
> redeploy the drive. Then when it gets up and running as the same osd number, 
> I import the exported crush map to get it back in the cluster.
> 
> 
> 
> I guess that is just how it has to be done.
> 
> 
> 
> Thanks again
> 
> 
> 
> Sent from Mail for Windows 10
> 
> 
> 
> From: Reed Dier
> Sent: Wednesday, September 14, 2016 1:39 PM
> To: Jim Kilborn
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Replacing a failed OSD
> 
> 
> 
> Hi Jim,
> 
> This is pretty fresh in my mind so hopefully I can help you out here.
> 
> Firstly, the crush map will back fill any holes in the enumeration that are 
> existing. So assuming only one drive has been removed from the crush map, it 
> will repopulate the same OSD number.
> 
> My steps for removing an OSD are run from the host node:
> 
>> ceph osd down osd.i
>> ceph osd out osd.i
>> stop ceph-osd id=i
>> umount /var/lib/ceph/osd/ceph-i
>> ceph osd crush remove osd.i
>> ceph auth del osd.i
>> ceph osd rm osd.i
> 
> 
> From here, the disk is removed from the ceph cluster, crush map, and is ready 
> for removal and replacement.
> 
> From there I deploy the new osd with ceph-deploy from my admin node using:
> 
>> ceph-deploy disk list nodei
>> ceph-deploy disk zap nodei:sdX
>> ceph-deploy --overwrite-conf osd prepare nodei:sdX
> 
> 
> This will prepare the disk and insert it back into the crush map, bringing it 
> back up and in. The OSD number should remain the same, as it will fill the 
> gap left from the previous OSD removal.
> 
> Hopefully this helps,
> 
> Reed
> 
>> On Sep 14, 2016, at 11:00 AM, Jim Kilborn  wrote:
>>
>> I am finishing testing our new cephfs cluster and wanted to document a 
>> failed osd procedure.
>> I noticed that when I pulled a drive, to simulate a failure, and ran through 
>> the replacement steps, the osd has to be removed from the crushmap in order 
>> to initialize the new drive as the same osd number.
>>
>> Is this correct that I have to remove it from the crushmap, then after the 
>> osd is initialized, and mounted, add it back to the crush map? Is there no 
>> way to have it reuse the same osd # without removing it from the crush map?
>>
>> Thanks for taking the time….
>>
>>
>> -  Jim
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 

-- 
Kramer M.D.
Infrastructure Engineer


Nederlands Forensisch Instituut
Digitale Technologie & Biometrie
Laan van Ypenburg 6 | 2497 GB | Den Haag
Postbus 24044 | 2490 AA | Den Haag

T 070 888 64 30
M 06 29 62 12 02
d.kra...@nfi.minvenj.nl / den...@holmes.nl
PGP publickey: http://www.holmes.nl/dennis.asc
www.forensischinstituut.nl

Nederlands Forensisch Instituut. In feiten het beste.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Suiciding and corrupted OSDs zero out Ceph cluster IO

2016-09-15 Thread Kostis Fardelas
Hello cephers,
last week we survived a 3-day outage on our ceph cluster (Hammer
0.94.7, 162 OSDs, 27 "fat" nodes, 1000s of clients) due to 6 out of
162 OSDs crashing on the SAME node. The outage unfolded along the
following timeline:
time 0:  OSDs living in the same node (rd0-19) start heavily flapping
(in the logs: failed, wrongly marked me down, RESETSESSION etc). Some
more OSDs on other nodes are also flapping but the OSDs of this single
node seem to have played the major part in this problem

time +6h: rd0-19 OSDs assert. Two of them suicide on OSD::osd_op_tp
thread timeout and the other ones assert with EPERM and corrupted
leveldb related errors. Something like this:

2016-09-10 02:40:47.155718 7f699b724700  0 filestore(/rados/rd0-19-01)
 error (1) Operation not permitted not handled on operation 0x46db2d00
(1731767079.0.0, or op 0, counting from 0)
2016-09-10 02:40:47.155731 7f699b724700  0 filestore(/rados/rd0-19-01)
unexpected error code
2016-09-10 02:40:47.155732 7f699b724700  0 filestore(/rados/rd0-19-01)
 transaction dump:
{
"ops": [
{
"op_num": 0,
"op_name": "omap_setkeys",
"collection": "3.b30_head",
"oid": "3\/b30\/\/head",
"attr_lens": {
"_epoch": 4,
"_info": 734
}
}
]
}


2016-09-10 02:40:47.155778 7f699671a700 -1 os/FileStore.cc: In
function 'unsigned int
FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int,
ThreadPool::TPH
andle*)' thread 7f699671a700 time 2016-09-10 02:40:47.153544
os/FileStore.cc: 2761: FAILED assert(0 == "unexpected error")

This leaves the cluster in a state like below:
2016-09-10 03:04:31.927635 mon.0 62.217.119.14:6789/0 948003 : cluster
[INF] osdmap e281474: 162 osds: 156 up, 156 in
2016-09-10 03:04:32.145074 mon.0 62.217.119.14:6789/0 948004 : cluster
[INF] pgmap v105867219: 28672 pgs: 1
active+recovering+undersized+degraded, 26684 active+clean, 1889
active+undersized+degraded, 98 down+peering; 95983 GB data, 179 TB
used, 101379 GB / 278 TB avail; 12106 B/s rd, 11 op/s;
2408539/69641962 objects degraded (3.458%); 1/34820981 unfound
(0.000%)

Almost no IO, probably due to 98 down+peering PGs, 1 unfound object and
1000s of librados clients stuck.
As of now, we have not managed to pinpoint what caused the crashes (no
disk errors, no network errors, no general hardware errors, nothing so
far) but things are still under investigation. Finally we managed to
bring up enough crashed OSDs for IO to continue (using gdb, leveldb
repairs, ceph-objectstore-tool), but our main questions exists:

A. the 6 OSDs were on the same node. What is so special about
suiciding + EPERMs that leave the cluster with down+peering and zero
IO? Is this a normal behaviour after a crash like this? Notice that
the cluster has marked the crashed OSDs down+out, so it seems that the
cluster somehow "fenced" these OSDs but in a manner that leaves the
cluster unusable
B. would replication=3 help? Would we need replication=3 and min=2 to
avoid such a problem in the future? Right now we are on size=2 &
min_size=1
C. would an increase in suicide timeouts help for future incidents like this?

Regards,
Kostis
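
(For reference on question C: the knobs involved are the per-OSD suicide
timeouts, for example in ceph.conf; option names and defaults are as I
recall them for Hammer, so confirm with "ceph daemon osd.<N> config show |
grep suicide" before touching them, and raise them only with care since
they also delay detection of genuinely stuck OSDs.

  [osd]
  osd op thread suicide timeout = 150
  filestore op thread suicide timeout = 180
)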
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Jewel ceph-mon : high memory usage after few days

2016-09-15 Thread Wido den Hollander

> On 15 September 2016 at 10:40, Florent B wrote:
> 
> 
> On 09/15/2016 10:37 AM, Wido den Hollander wrote:
> >> On 15 September 2016 at 10:34, Florent B wrote:
> >>
> >>
> >> Hi everyone,
> >>
> >> I have a Ceph cluster on Jewel.
> >>
> >> Monitors are on 32GB ram hosts.
> >>
> >> After a few days, ceph-mon process uses 25 to 35% of 32GB (8 to 11 GB) :
> >>
> >>  1150 ceph  20   0 15.454g 7.983g   7852 S   0.3 25.5 490:29.11
> >> ceph-mon   
> >>
> >> Is it expected ?
> >>
> > No, that's rather high.
> >
> > Is the cluster in HEALTH_OK? And did you change any mon related 
> > configuration settings?
> >
> > Wido
> >
> 
> Hi, yes cluster status is HEALTH_OK, no problem. No configuration change
> for weeks.

Just wondering if you have changed anything at all for the monitor 
configuration values.

Wido
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS: Upper limit for number of files in a directory?

2016-09-15 Thread John Spray
On Thu, Sep 15, 2016 at 2:20 PM, Burkhard Linke
 wrote:
> Hi,
>
> does CephFS impose an upper limit on the number of files in a directory?
>
>
> We currently have one directory with a large number of subdirectories:
>
> $ ls | wc -l
> 158141
>
> Creating a new subdirectory fails:
>
> $ touch foo
> touch: cannot touch 'foo': No space left on device

This limit was added recently: it's a limit on the size of a directory fragment.

Previously folks were hitting nasty OSD issues with very large
directory fragments, so we added this limit to give a clean failure
instead.

Directory fragmentation (mds_bal_frag setting) is turned off by
default in Jewel: I was planning to get this activated by default in
Kraken, but haven't quite got there yet.  Once fragmentation is
enabled you should find that the threshold for splitting dirfrags is
hit well before you hit the safety limit that gives you ENOSPC.

Note that if you set mds_bal_frag then you also need to use the "ceph
fs set  allow_dirfrags true" (that command from memory so check
the help if it's wrong), or the MDSs will ignore the setting.

John
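
(Putting the above together, a hedged sketch of enabling fragmentation on
a Jewel cluster; the exact spelling is from memory, and Jewel may
additionally ask for a confirmation flag since dirfrags were still marked
experimental there, so check "ceph fs set -h" first:

  ceph fs set <fsname> allow_dirfrags true
  ceph tell mds.<id> injectargs '--mds_bal_frag true'

and, to make it persistent, in ceph.conf on the MDS hosts:

  [mds]
  mds bal frag = true

The safety limit that produces the ENOSPC is a separate option,
mds_bal_fragment_size_max, which defaults to 100000 entries per fragment.)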

>
> Creating files in a different directory does not show any problems. The last
> message in the MDS log relates to the large directory:
>
> 2016-09-15 07:51:54.539216 7f24ef2a6700  0 mds.0.bal replicating dir [dir
> 18bf9a6 /volumes/biodb/ncbi_genomes/all/ [2,head] auth v=57751905 cv=0/0
> ap=0+2+2 state=1073741826|complete f(v0 m2016-08-22 08:51:34.714570
> 158141=3+158138) n(v182285 rc2016-08-22 11:59:34.976373 b3360569421156
> 2989670=2478235+511435) hs=158141+842,ss=0+0 | child=1 waiter=0 authpin=0
> 0x7f252ca05f00] pop 12842 .. rdp 7353.96 adj 0
>
> Any hints what might go wrong in this case? MDS is taken from the current
> jewel git branch due to some pending backports:
>
> # ceph-mds --version
> ceph version 10.2.2-508-g9bfc0cf (9bfc0cf178dc21b0fe33e0ce3b90a18858abaf1b)
>
> CephFS is mounted via kernel implementation:
>
> # uname -a
> Linux waas 4.6.6-040606-generic #201608100733 SMP Wed Aug 10 11:35:29 UTC
> 2016 x86_64 x86_64 x86_64 GNU/Linux
>
> ceph-fuse from jewel is also affected.
>
>
> Regards,
> Burkhard
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Consistency problems when taking RBD snapshot

2016-09-15 Thread Ilya Dryomov
On Thu, Sep 15, 2016 at 10:22 AM, Nikolay Borisov
 wrote:
>
>
> On 09/15/2016 09:22 AM, Nikolay Borisov wrote:
>>
>>
>> On 09/14/2016 05:53 PM, Ilya Dryomov wrote:
>>> On Wed, Sep 14, 2016 at 3:30 PM, Nikolay Borisov  wrote:


 On 09/14/2016 02:55 PM, Ilya Dryomov wrote:
> On Wed, Sep 14, 2016 at 9:01 AM, Nikolay Borisov  wrote:
>>
>>
>> On 09/14/2016 09:55 AM, Adrian Saul wrote:
>>>
>>> I found I could ignore the XFS issues and just mount it with the 
>>> appropriate options (below from my backup scripts):
>>>
>>> #
>>> # Mount with nouuid (conflicting XFS) and norecovery (ro 
>>> snapshot)
>>> #
>>> if ! mount -o ro,nouuid,norecovery  $SNAPDEV /backup${FS}; then
>>> echo "FAILED: Unable to mount snapshot $DATESTAMP of 
>>> $FS - cleaning up"
>>> rbd unmap $SNAPDEV
>>> rbd snap rm ${RBDPATH}@${DATESTAMP}
>>> exit 3;
>>> fi
>>> echo "Backup snapshot of $RBDPATH mounted at: /backup${FS}"
>>>
>>> It's impossible without clones to do it without norecovery.
>>
>> But shouldn't freezing the fs and doing a snapshot constitute a "clean
>> unmount" hence no need to recover on the next mount (of the snapshot) -
>> Ilya?
>
> I *thought* it should (well, except for orphan inodes), but now I'm not
> sure.  Have you tried reproducing with loop devices yet?

 Here is what the checksum tests showed:

 fsfreeze -f  /mountpoit
 md5sum /dev/rbd0
 f33c926373ad604da674bcbfbe6460c5  /dev/rbd0
 rbd snap create xx@xxx && rbd snap protect xx@xxx
 rbd map xx@xxx
 md5sum /dev/rbd1
 6f702740281874632c73aeb2c0fcf34a  /dev/rbd1

 where rbd1 is a snapshot of the rbd0 device. So the checksum is indeed
 different, worrying.
>>>
>>> Sorry, for the filesystem device you should do
>>>
>>> md5sum <(dd if=/dev/rbd0 iflag=direct bs=8M)
>>>
>>> to get what's actually on disk, so that it's apples to apples.
>>
>> root@alxc13:~# rbd showmapped  |egrep "device|c11579"
>> id  pool image  snap  device
>> 47  rbd  c11579 - /dev/rbd47
>> root@alxc13:~# fsfreeze -f /var/lxc/c11579
>> root@alxc13:~# md5sum <(dd if=/dev/rbd47 iflag=direct bs=8M)
>> 12800+0 records in
>> 12800+0 records out
>> 107374182400 bytes (107 GB) copied, 617.815 s, 174 MB/s
>> 2ddc99ce1b3ef51da1945d9da25ac296  /dev/fd/63  <--- Check sum after freeze
>> root@alxc13:~# rbd snap create rbd/c11579@snap_test
>> root@alxc13:~# rbd map c11579@snap_test
>> /dev/rbd1
>> root@alxc13:~# md5sum <(dd if=/dev/rbd1 iflag=direct bs=8M)
>> 12800+0 records in
>> 12800+0 records out
>> 107374182400 bytes (107 GB) copied, 610.043 s, 176 MB/s
>> 2ddc99ce1b3ef51da1945d9da25ac296  /dev/fd/63 <--- Check sum of snapshot
>> root@alxc13:~# md5sum <(dd if=/dev/rbd47 iflag=direct bs=8M)
>> 12800+0 records in
>> 12800+0 records out
>> 107374182400 bytes (107 GB) copied, 592.164 s, 181 MB/s
>> 2ddc99ce1b3ef51da1945d9da25ac296  /dev/fd/63<--- Check sum of original 
>> device, not changed - GOOD
>> root@alxc13:~# file -s /dev/rbd1
>> /dev/rbd1: Linux rev 1.0 ext4 filesystem data (extents) (large files) (huge 
>> files)
>> root@alxc13:~# fsfreeze -u /var/lxc/c11579
>> root@alxc13:~# md5sum <(dd if=/dev/rbd47 iflag=direct bs=8M)
>> 12800+0 records in
>> 12800+0 records out
>> 107374182400 bytes (107 GB) copied, 647.01 s, 166 MB/s
>> 92b7182591d7d7380435cfdea79a8897  /dev/fd/63   <--- After unfreeze checksum 
>> is different - OK
>> root@alxc13:~# md5sum <(dd if=/dev/rbd1 iflag=direct bs=8M)
>> 12800+0 records in
>> 12800+0 records out
>> 107374182400 bytes (107 GB) copied, 590.556 s, 182 MB/s
>> bc3b68f0276c608d9435223f89589962  /dev/fd/63 <--- Why the heck the checksum 
>> of the snapshot is different after unfreeze? BAD?
>> root@alxc13:~# file -s /dev/rbd1
>> /dev/rbd1: Linux rev 1.0 ext4 filesystem data (needs journal recovery) 
>> (extents) (large files) (huge files)
>> root@alxc13:~#
>>
>
> And something even more peculiar - taking an md5sum some hours after the
> above test produced this:
>
> root@alxc13:~# md5sum <(dd if=/dev/rbd1 iflag=direct bs=8M)
> 12800+0 records in
> 12800+0 records out
> 107374182400 bytes (107 GB) copied, 636.836 s, 169 MB/s
> e68e41616489d41544cd873c73defb08  /dev/fd/63
>
> Meaning the read-only snapshot somehow has "mutated". E.g. it wasn't
> recreated, just the same old snapshot. Is this normal?

Hrm, I wonder if it missed a snapshot context update.  Please pastebin
entire dmesg for that boot.

Have those devices been remapped or alxc13 rebooted since then?  If
not, what's the output of

$ rados -p rbd listwatchers $(rbd info c11579 | grep block_name_prefix
| awk '{ print $2 }' | sed 's/rbd_data/rbd_header/')

and can you check whether that snapshot is continuing to mutate as the
image is mutated - freeze /var/lxc/c11579 again and check rbd47 and
rbd1?

Thanks,

Ilya


[ceph-users] rgw: Swift API X-Storage-Url

2016-09-15 Thread Василий Ангапов
Hello,

I have a Ceph Jewel 10.2.1 cluster and RadosGW. The issue is that when
authenticating against the Swift API I receive different values for the
X-Storage-Url header:

# curl -i -H "X-Auth-User: internal-it:swift" -H "X-Auth-Key: ***"
https://ed-1-vip.cloud/auth/v1.0 | grep X-Storage-Url
X-Storage-Url: https://ed-1-vip.cloud/swift/v1/AUTH_internal-it

# curl -i -H "X-Auth-User: internal-it:swift" -H "X-Auth-Key: ***"
https://ed-1-vip.cloud/auth/v1.0 | grep X-Storage-Url
X-Storage-Url: http://ed-1-vip.cloud:443/swift/v1

See, the first time it gives the correct HTTPS link, while the second time
it gives an erroneous HTTP link to port 443.

In config I have the following:
[client.rgw.ed-1]
keyring = /etc/ceph/ceph.client.rgw.ed-1.keyring
admin socket = /var/run/ceph/client.rgw.ed-1.asok
rgw frontends = fastcgi socket_port=9000 socket_host=10.144.66.180
rgw zone = ed-1
rgw dns name = ed-1-vip.cloud
rgw ops log rados = false
rgw enable usage log = true
rgw enable ops log = true
rgw print continue = false
rgw override bucket index max shards = 64
rgw swift url = https://ed-1-vip.cloud
rgw swift account in url = true

Thanks!
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS: Upper limit for number of files in a directory?

2016-09-15 Thread Burkhard Linke

Hi,


On 09/15/2016 12:00 PM, John Spray wrote:

> On Thu, Sep 15, 2016 at 2:20 PM, Burkhard Linke
>  wrote:
>> Hi,
>>
>> does CephFS impose an upper limit on the number of files in a directory?
>>
>>
>> We currently have one directory with a large number of subdirectories:
>>
>> $ ls | wc -l
>> 158141
>>
>> Creating a new subdirectory fails:
>>
>> $ touch foo
>> touch: cannot touch 'foo': No space left on device

thanks for the fast reply.

> This limit was added recently: it's a limit on the size of a directory fragment.
>
> Previously folks were hitting nasty OSD issues with very large
> directory fragments, so we added this limit to give a clean failure
> instead.

I remember seeing a thread on the devel mailing list about this issue.


> Directory fragmentation (mds_bal_frag setting) is turned off by
> default in Jewel: I was planning to get this activated by default in
> Kraken, but haven't quite got there yet.  Once fragmentation is
> enabled you should find that the threshold for splitting dirfrags is
> hit well before you hit the safety limit that gives you ENOSPC.
Does enabling directory fragmentation require an MDS restart? And are 
directories processed at restart or on demand during the first access? 
Are there known problems with fragmentation?




> Note that if you set mds_bal_frag then you also need to use the "ceph
> fs set  allow_dirfrags true" (that command from memory so check
> the help if it's wrong), or the MDSs will ignore the setting.
So it's allowing fragmentation first and changing the MDS configuration 
afterwards.


Regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Consistency problems when taking RBD snapshot

2016-09-15 Thread Nikolay Borisov


On 09/15/2016 01:24 PM, Ilya Dryomov wrote:
> On Thu, Sep 15, 2016 at 10:22 AM, Nikolay Borisov
>  wrote:
>>
>>
>> On 09/15/2016 09:22 AM, Nikolay Borisov wrote:
>>>
>>>
>>> On 09/14/2016 05:53 PM, Ilya Dryomov wrote:
 On Wed, Sep 14, 2016 at 3:30 PM, Nikolay Borisov  wrote:
>
>
> On 09/14/2016 02:55 PM, Ilya Dryomov wrote:
>> On Wed, Sep 14, 2016 at 9:01 AM, Nikolay Borisov  wrote:
>>>
>>>
>>> On 09/14/2016 09:55 AM, Adrian Saul wrote:

 I found I could ignore the XFS issues and just mount it with the 
 appropriate options (below from my backup scripts):

 #
 # Mount with nouuid (conflicting XFS) and norecovery (ro 
 snapshot)
 #
 if ! mount -o ro,nouuid,norecovery  $SNAPDEV /backup${FS}; then
 echo "FAILED: Unable to mount snapshot $DATESTAMP of 
 $FS - cleaning up"
 rbd unmap $SNAPDEV
 rbd snap rm ${RBDPATH}@${DATESTAMP}
 exit 3;
 fi
 echo "Backup snapshot of $RBDPATH mounted at: /backup${FS}"

 It's impossible without clones to do it without norecovery.
>>>
>>> But shouldn't freezing the fs and doing a snapshot constitute a "clean
>>> unmount" hence no need to recover on the next mount (of the snapshot) -
>>> Ilya?
>>
>> I *thought* it should (well, except for orphan inodes), but now I'm not
>> sure.  Have you tried reproducing with loop devices yet?
>
> Here is what the checksum tests showed:
>
> fsfreeze -f  /mountpoit
> md5sum /dev/rbd0
> f33c926373ad604da674bcbfbe6460c5  /dev/rbd0
> rbd snap create xx@xxx && rbd snap protect xx@xxx
> rbd map xx@xxx
> md5sum /dev/rbd1
> 6f702740281874632c73aeb2c0fcf34a  /dev/rbd1
>
> where rbd1 is a snapshot of the rbd0 device. So the checksum is indeed
> different, worrying.

 Sorry, for the filesystem device you should do

 md5sum <(dd if=/dev/rbd0 iflag=direct bs=8M)

 to get what's actually on disk, so that it's apples to apples.
>>>
>>> root@alxc13:~# rbd showmapped  |egrep "device|c11579"
>>> id  pool image  snap  device
>>> 47  rbd  c11579 - /dev/rbd47
>>> root@alxc13:~# fsfreeze -f /var/lxc/c11579
>>> root@alxc13:~# md5sum <(dd if=/dev/rbd47 iflag=direct bs=8M)
>>> 12800+0 records in
>>> 12800+0 records out
>>> 107374182400 bytes (107 GB) copied, 617.815 s, 174 MB/s
>>> 2ddc99ce1b3ef51da1945d9da25ac296  /dev/fd/63  <--- Check sum after 
>>> freeze
>>> root@alxc13:~# rbd snap create rbd/c11579@snap_test
>>> root@alxc13:~# rbd map c11579@snap_test
>>> /dev/rbd1
>>> root@alxc13:~# md5sum <(dd if=/dev/rbd1 iflag=direct bs=8M)
>>> 12800+0 records in
>>> 12800+0 records out
>>> 107374182400 bytes (107 GB) copied, 610.043 s, 176 MB/s
>>> 2ddc99ce1b3ef51da1945d9da25ac296  /dev/fd/63 <--- Check sum of snapshot
>>> root@alxc13:~# md5sum <(dd if=/dev/rbd47 iflag=direct bs=8M)
>>> 12800+0 records in
>>> 12800+0 records out
>>> 107374182400 bytes (107 GB) copied, 592.164 s, 181 MB/s
>>> 2ddc99ce1b3ef51da1945d9da25ac296  /dev/fd/63<--- Check sum of original 
>>> device, not changed - GOOD
>>> root@alxc13:~# file -s /dev/rbd1
>>> /dev/rbd1: Linux rev 1.0 ext4 filesystem data (extents) (large files) (huge 
>>> files)
>>> root@alxc13:~# fsfreeze -u /var/lxc/c11579
>>> root@alxc13:~# md5sum <(dd if=/dev/rbd47 iflag=direct bs=8M)
>>> 12800+0 records in
>>> 12800+0 records out
>>> 107374182400 bytes (107 GB) copied, 647.01 s, 166 MB/s
>>> 92b7182591d7d7380435cfdea79a8897  /dev/fd/63   <--- After unfreeze checksum 
>>> is different - OK
>>> root@alxc13:~# md5sum <(dd if=/dev/rbd1 iflag=direct bs=8M)
>>> 12800+0 records in
>>> 12800+0 records out
>>> 107374182400 bytes (107 GB) copied, 590.556 s, 182 MB/s
>>> bc3b68f0276c608d9435223f89589962  /dev/fd/63 <--- Why the heck the checksum 
>>> of the snapshot is different after unfreeze? BAD?
>>> root@alxc13:~# file -s /dev/rbd1
>>> /dev/rbd1: Linux rev 1.0 ext4 filesystem data (needs journal recovery) 
>>> (extents) (large files) (huge files)
>>> root@alxc13:~#
>>>
>>
>> And something even more peculiar - taking an md5sum some hours after the
>> above test produced this:
>>
>> root@alxc13:~# md5sum <(dd if=/dev/rbd1 iflag=direct bs=8M)
>> 12800+0 records in
>> 12800+0 records out
>> 107374182400 bytes (107 GB) copied, 636.836 s, 169 MB/s
>> e68e41616489d41544cd873c73defb08  /dev/fd/63
>>
>> Meaning the read-only snapshot somehow has "mutated". E.g. it wasn't
>> recreated, just the same old snapshot. Is this normal?
> 
> Hrm, I wonder if it missed a snapshot context update.  Please pastebin
> entire dmesg for that boot.

The machine has been up more than 2 and the dmesg has been rewritten
several times for that time. Also the node is rather busy so there's
plenty of irrelevant stuff in the dmesg. Grepped for rbd1/0 and found

Re: [ceph-users] rgw: Swift API X-Storage-Url

2016-09-15 Thread Василий Ангапов
Sorry, I revoke my question. On one node there was a duplicate RGW
daemon with an old config. That's why I was sometimes receiving wrong
URLs.
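
(For anyone hitting the same symptom, a quick hedged check for a stray
daemon running with stale config:

  # ps aux | grep [r]adosgw
  # ceph daemon /var/run/ceph/client.rgw.ed-1.asok config show | grep 'rgw swift'

The socket path is the one from the config above; adjust it to your own
admin socket naming.)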

2016-09-15 13:23 GMT+03:00 Василий Ангапов :
> Hello,
>
> I have Ceph Jewel 10.2.1 cluster and RadosGW. Issue is that when
> authenticating against Swift API I receive different values for
> X-Storage-Url header:
>
> # curl -i -H "X-Auth-User: internal-it:swift" -H "X-Auth-Key: ***"
> https://ed-1-vip.cloud/auth/v1.0 | grep X-Storage-Url
> X-Storage-Url: https://ed-1-vip.cloud/swift/v1/AUTH_internal-it
>
> # curl -i -H "X-Auth-User: internal-it:swift" -H "X-Auth-Key: ***"
> https://ed-1-vip.cloud/auth/v1.0 | grep X-Storage-Url
> X-Storage-Url: http://ed-1-vip.cloud:443/swift/v1
>
> See, first time it gives correct HTTPS link, while the second time it
> gives erroneous HTTP to 443 port link.
>
> In config I have the following:
> [client.rgw.ed-1]
> keyring = /etc/ceph/ceph.client.rgw.ed-1.keyring
> admin socket = /var/run/ceph/client.rgw.ed-1.asok
> rgw frontends = fastcgi socket_port=9000 socket_host=10.144.66.180
> rgw zone = ed-1
> rgw dns name = ed-1-vip.cloud
> rgw ops log rados = false
> rgw enable usage log = true
> rgw enable ops log = true
> rgw print continue = false
> rgw override bucket index max shards = 64
> rgw swift url = https://ed-1-vip.cloud
> rgw swift account in url = true
>
> Thanks!
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] OSDs thread leak during degraded cluster state

2016-09-15 Thread Kostis Fardelas
Hello cephers,
being in a degraded cluster state with 6/162 OSDs down (Hammer
0.94.7, 162 OSDs, 27 "fat" nodes, 1000s of clients), as the ceph
cluster log below indicates:

2016-09-12 06:26:08.443152 mon.0 62.217.119.14:6789/0 217309 : cluster
[INF] pgmap v106027148: 28672 pgs: 2 down+remapped+peering, 25904
active+clean, 23 stale+down+peering, 1 active+recovery_wait+degraded,
1 active+recovery_wait+undersized+degraded, 170 down+peering, 1
active+clean+scrubbing, 8
active+undersized+degraded+remapped+wait_backfill, 27
stale+active+undersized+degraded, 3 active+remapped+wait_backfill,
2531 active+undersized+degraded, 1
active+recovering+undersized+degraded+remapped; 95835 GB data, 186 TB
used, 94341 GB / 278 TB avail; 11230 B/s rd, 164 kB/s wr, 42 op/s;
3148226/69530815 objects degraded (4.528%); 59272/69530815 objects
misplaced (0.085%); 1/34756893 unfound (0.000%)

we experienced extensive thread leaks on the remaining up+in OSDs,
which led to even more random crashes with Thread::create asserts:

2016-09-10 09:08:40.211713 7f8576bd6700 -1 common/Thread.cc: In
function 'void Thread::create(size_t)' thread 7f8576bd6700 time
2016-09-10 09:08:40.199211
common/Thread.cc: 131: FAILED assert(ret == 0)

Thread count under normal operations is ~6500 on all nodes, but in
this degraded state we reached even ~35000.
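
(A total like that can be measured with something along the lines of

  # ps -eo nlwp= | awk '{ s += $1 } END { print s }'

i.e. summing the per-process thread counts; just one hedged way to get
the number.)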

Is this expected behaviour when you have down+peering OSDs?
Is it possible to mitigate this problem using ceph configuration, or
is our only resort a kernel pid_max bump?

Regards,
Kostis
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSDs thread leak during degraded cluster state

2016-09-15 Thread Wido den Hollander

> On 15 September 2016 at 13:27, Kostis Fardelas wrote:
> 
> 
> Hello cephers,
> being in a degraded cluster state with 6/162 OSDs down ((Hammer
> 0.94.7, 162 OSDs, 27 "fat" nodes, 1000s of clients) ) like the below
> ceph cluster log indicates:
> 
> 2016-09-12 06:26:08.443152 mon.0 62.217.119.14:6789/0 217309 : cluster
> [INF] pgmap v106027148: 28672 pgs: 2 down+remapped+peering, 25904
> active+clean, 23 stale+down+peering, 1 active+recovery_wait+degraded,
> 1 active+recovery_wait+undersized+degraded, 170 down+peering, 1
> active+clean+scrubbing, 8
> active+undersized+degraded+remapped+wait_backfill, 27
> stale+active+undersized+degraded, 3 active+remapped+wait_backfill,
> 2531 active+undersized+degraded, 1
> active+recovering+undersized+degraded+remapped; 95835 GB data, 186 TB
> used, 94341 GB / 278 TB avail; 11230 B/s rd, 164 kB/s wr, 42 op/s;
> 3148226/69530815 objects degraded (4.528%); 59272/69530815 objects
> misplaced (0.085%); 1/34756893 unfound (0.000%)
> 
> we experienced extensive thread leaks on the remaining up+in OSDs,
> which lead to even more random crashes with Thread::create asserts:
> 
> 2016-09-10 09:08:40.211713 7f8576bd6700 -1 common/Thread.cc: In
> function 'void Thread::create(size_t)' thread 7f8576bd6700 time
> 2016-09-10 09:08:40.199211
> common/Thread.cc: 131: FAILED assert(ret == 0)
> 
> Thread count under normal operations are ~6500 on all nodes, but in
> this degraded state we reached even ~35000.
> 
> Is this expected behaviour when you have down+peering OSDs?
> Is it possible to mitigate this problem using ceph configuration or
> our only resort is kernel pid_max bump?
> 

You should bump that setting. The default 32k is way too low during recovery.

Set it to at least 512k or so.

Wido
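
(A hedged example of applying that persistently on the OSD hosts, file
name chosen arbitrarily:

  # echo 'kernel.pid_max = 524288' > /etc/sysctl.d/90-ceph-pid-max.conf
  # sysctl -p /etc/sysctl.d/90-ceph-pid-max.conf

kernel.threads-max may need a similar bump depending on the distribution
defaults.)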

> Regards,
> Kostis
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Consistency problems when taking RBD snapshot

2016-09-15 Thread Ilya Dryomov
On Thu, Sep 15, 2016 at 12:54 PM, Nikolay Borisov  wrote:
>
>
> On 09/15/2016 01:24 PM, Ilya Dryomov wrote:
>> On Thu, Sep 15, 2016 at 10:22 AM, Nikolay Borisov
>>  wrote:
>>>
>>>
>>> On 09/15/2016 09:22 AM, Nikolay Borisov wrote:


 On 09/14/2016 05:53 PM, Ilya Dryomov wrote:
> On Wed, Sep 14, 2016 at 3:30 PM, Nikolay Borisov  wrote:
>>
>>
>> On 09/14/2016 02:55 PM, Ilya Dryomov wrote:
>>> On Wed, Sep 14, 2016 at 9:01 AM, Nikolay Borisov  
>>> wrote:


 On 09/14/2016 09:55 AM, Adrian Saul wrote:
>
> I found I could ignore the XFS issues and just mount it with the 
> appropriate options (below from my backup scripts):
>
> #
> # Mount with nouuid (conflicting XFS) and norecovery (ro 
> snapshot)
> #
> if ! mount -o ro,nouuid,norecovery  $SNAPDEV /backup${FS}; 
> then
> echo "FAILED: Unable to mount snapshot $DATESTAMP of 
> $FS - cleaning up"
> rbd unmap $SNAPDEV
> rbd snap rm ${RBDPATH}@${DATESTAMP}
> exit 3;
> fi
> echo "Backup snapshot of $RBDPATH mounted at: /backup${FS}"
>
> It's impossible without clones to do it without norecovery.

 But shouldn't freezing the fs and doing a snapshot constitute a "clean
 unmount" hence no need to recover on the next mount (of the snapshot) -
 Ilya?
>>>
>>> I *thought* it should (well, except for orphan inodes), but now I'm not
>>> sure.  Have you tried reproducing with loop devices yet?
>>
>> Here is what the checksum tests showed:
>>
>> fsfreeze -f  /mountpoit
>> md5sum /dev/rbd0
>> f33c926373ad604da674bcbfbe6460c5  /dev/rbd0
>> rbd snap create xx@xxx && rbd snap protect xx@xxx
>> rbd map xx@xxx
>> md5sum /dev/rbd1
>> 6f702740281874632c73aeb2c0fcf34a  /dev/rbd1
>>
>> where rbd1 is a snapshot of the rbd0 device. So the checksum is indeed
>> different, worrying.
>
> Sorry, for the filesystem device you should do
>
> md5sum <(dd if=/dev/rbd0 iflag=direct bs=8M)
>
> to get what's actually on disk, so that it's apples to apples.

 root@alxc13:~# rbd showmapped  |egrep "device|c11579"
 id  pool image  snap  device
 47  rbd  c11579 - /dev/rbd47
 root@alxc13:~# fsfreeze -f /var/lxc/c11579
 root@alxc13:~# md5sum <(dd if=/dev/rbd47 iflag=direct bs=8M)
 12800+0 records in
 12800+0 records out
 107374182400 bytes (107 GB) copied, 617.815 s, 174 MB/s
 2ddc99ce1b3ef51da1945d9da25ac296  /dev/fd/63  <--- Check sum after 
 freeze
 root@alxc13:~# rbd snap create rbd/c11579@snap_test
 root@alxc13:~# rbd map c11579@snap_test
 /dev/rbd1
 root@alxc13:~# md5sum <(dd if=/dev/rbd1 iflag=direct bs=8M)
 12800+0 records in
 12800+0 records out
 107374182400 bytes (107 GB) copied, 610.043 s, 176 MB/s
 2ddc99ce1b3ef51da1945d9da25ac296  /dev/fd/63 <--- Check sum of snapshot
 root@alxc13:~# md5sum <(dd if=/dev/rbd47 iflag=direct bs=8M)
 12800+0 records in
 12800+0 records out
 107374182400 bytes (107 GB) copied, 592.164 s, 181 MB/s
 2ddc99ce1b3ef51da1945d9da25ac296  /dev/fd/63<--- Check sum of original 
 device, not changed - GOOD
 root@alxc13:~# file -s /dev/rbd1
 /dev/rbd1: Linux rev 1.0 ext4 filesystem data (extents) (large files) 
 (huge files)
 root@alxc13:~# fsfreeze -u /var/lxc/c11579
 root@alxc13:~# md5sum <(dd if=/dev/rbd47 iflag=direct bs=8M)
 12800+0 records in
 12800+0 records out
 107374182400 bytes (107 GB) copied, 647.01 s, 166 MB/s
 92b7182591d7d7380435cfdea79a8897  /dev/fd/63   <--- After unfreeze 
 checksum is different - OK
 root@alxc13:~# md5sum <(dd if=/dev/rbd1 iflag=direct bs=8M)
 12800+0 records in
 12800+0 records out
 107374182400 bytes (107 GB) copied, 590.556 s, 182 MB/s
 bc3b68f0276c608d9435223f89589962  /dev/fd/63 <--- Why the heck the 
 checksum of the snapshot is different after unfreeze? BAD?
 root@alxc13:~# file -s /dev/rbd1
 /dev/rbd1: Linux rev 1.0 ext4 filesystem data (needs journal recovery) 
 (extents) (large files) (huge files)
 root@alxc13:~#

>>>
>>> And something even more peculiar - taking an md5sum some hours after the
>>> above test produced this:
>>>
>>> root@alxc13:~# md5sum <(dd if=/dev/rbd1 iflag=direct bs=8M)
>>> 12800+0 records in
>>> 12800+0 records out
>>> 107374182400 bytes (107 GB) copied, 636.836 s, 169 MB/s
>>> e68e41616489d41544cd873c73defb08  /dev/fd/63
>>>
>>> Meaning the read-only snapshot somehow has "mutated". E.g. it wasn't
>>> recreated, just the same old snapshot. Is this normal?
>>
>> Hrm, I wonder if it missed a snapshot context update.  Please pastebin
>> entire dmesg for that boot.
>
> Th

Re: [ceph-users] Consistency problems when taking RBD snapshot

2016-09-15 Thread Nikolay Borisov


On 09/15/2016 03:15 PM, Ilya Dryomov wrote:
> On Thu, Sep 15, 2016 at 12:54 PM, Nikolay Borisov  wrote:
>>
>>
>> On 09/15/2016 01:24 PM, Ilya Dryomov wrote:
>>> On Thu, Sep 15, 2016 at 10:22 AM, Nikolay Borisov
>>>  wrote:


 On 09/15/2016 09:22 AM, Nikolay Borisov wrote:
>
>
> On 09/14/2016 05:53 PM, Ilya Dryomov wrote:
>> On Wed, Sep 14, 2016 at 3:30 PM, Nikolay Borisov  wrote:
>>>
>>>
>>> On 09/14/2016 02:55 PM, Ilya Dryomov wrote:
 On Wed, Sep 14, 2016 at 9:01 AM, Nikolay Borisov  
 wrote:
>
>
> On 09/14/2016 09:55 AM, Adrian Saul wrote:
>>
>> I found I could ignore the XFS issues and just mount it with the 
>> appropriate options (below from my backup scripts):
>>
>> #
>> # Mount with nouuid (conflicting XFS) and norecovery (ro 
>> snapshot)
>> #
>> if ! mount -o ro,nouuid,norecovery  $SNAPDEV /backup${FS}; 
>> then
>> echo "FAILED: Unable to mount snapshot $DATESTAMP of 
>> $FS - cleaning up"
>> rbd unmap $SNAPDEV
>> rbd snap rm ${RBDPATH}@${DATESTAMP}
>> exit 3;
>> fi
>> echo "Backup snapshot of $RBDPATH mounted at: /backup${FS}"
>>
>> It's impossible without clones to do it without norecovery.
>
> But shouldn't freezing the fs and doing a snapshot constitute a "clean
> unmount" hence no need to recover on the next mount (of the snapshot) 
> -
> Ilya?

 I *thought* it should (well, except for orphan inodes), but now I'm not
 sure.  Have you tried reproducing with loop devices yet?
>>>
>>> Here is what the checksum tests showed:
>>>
>>> fsfreeze -f  /mountpoit
>>> md5sum /dev/rbd0
>>> f33c926373ad604da674bcbfbe6460c5  /dev/rbd0
>>> rbd snap create xx@xxx && rbd snap protect xx@xxx
>>> rbd map xx@xxx
>>> md5sum /dev/rbd1
>>> 6f702740281874632c73aeb2c0fcf34a  /dev/rbd1
>>>
>>> where rbd1 is a snapshot of the rbd0 device. So the checksum is indeed
>>> different, worrying.
>>
>> Sorry, for the filesystem device you should do
>>
>> md5sum <(dd if=/dev/rbd0 iflag=direct bs=8M)
>>
>> to get what's actually on disk, so that it's apples to apples.
>
> root@alxc13:~# rbd showmapped  |egrep "device|c11579"
> id  pool image  snap  device
> 47  rbd  c11579 - /dev/rbd47
> root@alxc13:~# fsfreeze -f /var/lxc/c11579
> root@alxc13:~# md5sum <(dd if=/dev/rbd47 iflag=direct bs=8M)
> 12800+0 records in
> 12800+0 records out
> 107374182400 bytes (107 GB) copied, 617.815 s, 174 MB/s
> 2ddc99ce1b3ef51da1945d9da25ac296  /dev/fd/63  <--- Check sum after 
> freeze
> root@alxc13:~# rbd snap create rbd/c11579@snap_test
> root@alxc13:~# rbd map c11579@snap_test
> /dev/rbd1
> root@alxc13:~# md5sum <(dd if=/dev/rbd1 iflag=direct bs=8M)
> 12800+0 records in
> 12800+0 records out
> 107374182400 bytes (107 GB) copied, 610.043 s, 176 MB/s
> 2ddc99ce1b3ef51da1945d9da25ac296  /dev/fd/63 <--- Check sum of 
> snapshot
> root@alxc13:~# md5sum <(dd if=/dev/rbd47 iflag=direct bs=8M)
> 12800+0 records in
> 12800+0 records out
> 107374182400 bytes (107 GB) copied, 592.164 s, 181 MB/s
> 2ddc99ce1b3ef51da1945d9da25ac296  /dev/fd/63<--- Check sum of 
> original device, not changed - GOOD
> root@alxc13:~# file -s /dev/rbd1
> /dev/rbd1: Linux rev 1.0 ext4 filesystem data (extents) (large files) 
> (huge files)
> root@alxc13:~# fsfreeze -u /var/lxc/c11579
> root@alxc13:~# md5sum <(dd if=/dev/rbd47 iflag=direct bs=8M)
> 12800+0 records in
> 12800+0 records out
> 107374182400 bytes (107 GB) copied, 647.01 s, 166 MB/s
> 92b7182591d7d7380435cfdea79a8897  /dev/fd/63   <--- After unfreeze 
> checksum is different - OK
> root@alxc13:~# md5sum <(dd if=/dev/rbd1 iflag=direct bs=8M)
> 12800+0 records in
> 12800+0 records out
> 107374182400 bytes (107 GB) copied, 590.556 s, 182 MB/s
> bc3b68f0276c608d9435223f89589962  /dev/fd/63 <--- Why the heck the 
> checksum of the snapshot is different after unfreeze? BAD?
> root@alxc13:~# file -s /dev/rbd1
> /dev/rbd1: Linux rev 1.0 ext4 filesystem data (needs journal recovery) 
> (extents) (large files) (huge files)
> root@alxc13:~#
>

 And something even more peculiar - taking an md5sum some hours after the
 above test produced this:

 root@alxc13:~# md5sum <(dd if=/dev/rbd1 iflag=direct bs=8M)
 12800+0 records in
 12800+0 records out
 107374182400 bytes (107 GB) copied, 636.836 s, 169 MB/s
 e68e41616489d41544cd873c73defb08  /dev/fd/63

 Meaning the read-only snapshot somehow has "mutated". 

Re: [ceph-users] Replacing a failed OSD

2016-09-15 Thread Jim Kilborn
Nick/Dennis,



Thanks for the info. I did fiddle with a location script that would determine 
whether the drive is a spinning or ssd drive, and put it in the appropriate 
bucket. I might move back to that now that I understand ceph better.



Thanks for the link to the sample script as well.



Sent from Mail for Windows 10



From: Nick Fisk
Sent: Thursday, September 15, 2016 3:40 AM
To: Jim Kilborn; 'Reed 
Dier'
Cc: ceph-users@lists.ceph.com
Subject: RE: [ceph-users] Replacing a failed OSD




> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Jim 
> Kilborn
> Sent: 14 September 2016 20:30
> To: Reed Dier 
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Replacing a failed OSD
>
> Reed,
>
>
>
> Thanks for the response.
>
>
>
> Your process is the one that I ran. However, I have a crushmap with ssd and 
> sata drives in different buckets (host made up of host types, with an ssd 
> and spinning hosttype for each host) because I am using ssd drives for a 
> replicated cache in front of an erasure coded data pool for cephfs.
>
>
>
> I have "osd crush update on start = false" so that osds don't randomly get 
> added to the crush map, because it wouldn't know where
> to put that osd.
>
>
>
> I am using puppet to provision the drives when it sees one in a slot and it 
> doesn't see the ceph signature (I guess). I am using the ceph puppet module.
>
>
>
> The real confusion is why I have to remove it from the crush map. Once I 
> remove it from the crush map, it does bring it up as the same osd number, 
> but it's not in the crush map, so I have to put it back where it belongs. 
> Just seems strange that it must be removed from the crush map.
>
>
>
> Basically, I export the crush map, remove the osd from the crush map, then 
> redeploy the drive. Then when it gets up and running as
> the same osd number, I import the exported crush map to get it back in the 
> cluster.
>
>
>
> I guess that is just how it has to be done.

You can pass a script in via the 'osd crush location hook' variable so that the
OSDs automatically get placed in the right location when they start up. Thanks
to Wido there is already a script that you can probably use with very few
modifications:

https://gist.github.com/wido/5d26d88366e28e25e23d


>
>
>
> Thanks again
>
>
>
> Sent from Mail for Windows 10
>
>
>
> From: Reed Dier
> Sent: Wednesday, September 14, 2016 1:39 PM
> To: Jim Kilborn
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Replacing a failed OSD
>
>
>
> Hi Jim,
>
> This is pretty fresh in my mind so hopefully I can help you out here.
>
> Firstly, the crush map will back fill any holes in the enumeration that are 
> existing. So assuming only one drive has been removed from the crush map, 
> it will repopulate the same OSD number.
>
> My steps for removing an OSD are run from the host node:
>
> > ceph osd down osd.i
> > ceph osd out osd.i
> > stop ceph-osd id=i
> > umount /var/lib/ceph/osd/ceph-i
> > ceph osd crush remove osd.i
> > ceph auth del osd.i
> > ceph osd rm osd.i
>
>
> From here, the disk is removed from the ceph cluster, crush map, and is ready 
> for removal and replacement.
>
> From there I deploy the new osd with ceph-deploy from my admin node using:
>
> > ceph-deploy disk list nodei
> > ceph-deploy disk zap nodei:sdX
> > ceph-deploy --overwrite-conf osd prepare nodei:sdX
>
>
> This will prepare the disk and insert it back into the crush map, bringing it 
> back up and in. The OSD number should remain the same, as it will fill the 
> gap left from the previous OSD removal.
>
> Hopefully this helps,
>
> Reed
>
> > On Sep 14, 2016, at 11:00 AM, Jim Kilborn  wrote:
> >
> > I am finishing testing our new cephfs cluster and wanted to document a 
> > failed osd procedure.
> > I noticed that when I pulled a drive, to simulate a failure, and ran 
> > through the replacement steps, the osd has to be removed from the crushmap 
> > in order to initialize the new drive as the same osd number.
> >
> > Is this correct that I have to remove it from the crushmap, then after the 
> > osd is initialized, and mounted, add it back to the crush map? Is there no 
> > way to have it reuse the same osd # without removing it from the crush map?
> >
> > Thanks for taking the time..
> >
> >
> > -  Jim
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Consistency problems when taking RBD snapshot

2016-09-15 Thread Ilya Dryomov
On Thu, Sep 15, 2016 at 2:43 PM, Nikolay Borisov  wrote:
>
> [snipped]
>
> cat /sys/bus/rbd/devices/47/client_id
> client157729
> cat /sys/bus/rbd/devices/1/client_id
> client157729
>
> Client client157729 is alxc13, based on correlation by the ip address
> shown by the rados -p ... command. So it's the only client where the rbd
> images are mapped.

Well, the watches are there, but cookie numbers indicate that they may
have been re-established, so that's inconclusive.

My suggestion would be to repeat the test and do repeated freezes to
see if the snapshot continues to follow HEAD.

Further, to rule out a missed snap context update, repeat the test, but
stick

# echo 1 >/sys/bus/rbd/devices/<ID_OF_THE_ORIG_DEVICE>/refresh

after "rbd snap create" (for today's test, ID_OF_THE_ORIG_DEVICE
would be 47).

Thanks,

Ilya
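
(Putting the suggested steps together, a repeatable check might look like
the sketch below; device numbers are the ones from the earlier test (rbd47
is the image, 47 its sysfs id), snap_test2 is just a fresh snapshot name,
and the whole thing is a sketch rather than a polished script:

  fsfreeze -f /var/lxc/c11579
  rbd snap create rbd/c11579@snap_test2
  echo 1 > /sys/bus/rbd/devices/47/refresh   # force a snap context refresh
  rbd map c11579@snap_test2                  # prints the new /dev/rbdX
  md5sum <(dd if=/dev/rbd47 iflag=direct bs=8M)
  md5sum <(dd if=/dev/rbdX iflag=direct bs=8M)
  fsfreeze -u /var/lxc/c11579
  # ...write to the filesystem, freeze again, and re-check both sums...
)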
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSDs thread leak during degraded cluster state

2016-09-15 Thread Kostis Fardelas
Our ceph cluster (from emperor through hammer) has gone through many
recoveries during host outages/network failures and the thread count never
exceeded 10K. The thread leaks we experienced with down+peering PGs
(lasting for several hours) were something we saw for the first time.
I don't see the reason to bump this: it looks like a leak (and of course
I could give the leak more headroom by bumping pid_max), but that isn't
really a fix, is it?
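
(To be clear, the bump itself is trivial - something like the following, with
the value Wido suggests below; the file name is arbitrary - so that is not my
concern, the leak is:

    sysctl -w kernel.pid_max=524288
    echo 'kernel.pid_max = 524288' > /etc/sysctl.d/90-ceph-pid-max.conf
)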

Kostis

On 15 September 2016 at 14:40, Wido den Hollander  wrote:
>
>> Op 15 september 2016 om 13:27 schreef Kostis Fardelas :
>>
>>
>> Hello cephers,
>> being in a degraded cluster state with 6/162 OSDs down ((Hammer
>> 0.94.7, 162 OSDs, 27 "fat" nodes, 1000s of clients) ) like the below
>> ceph cluster log indicates:
>>
>> 2016-09-12 06:26:08.443152 mon.0 62.217.119.14:6789/0 217309 : cluster
>> [INF] pgmap v106027148: 28672 pgs: 2 down+remapped+peering, 25904
>> active+clean, 23 stale+down+peering, 1 active+recovery_wait+degraded,
>> 1 active+recovery_wait+undersized+degraded, 170 down+peering, 1
>> active+clean+scrubbing, 8
>> active+undersized+degraded+remapped+wait_backfill, 27
>> stale+active+undersized+degraded, 3 active+remapped+wait_backfill,
>> 2531 active+undersized+degraded, 1
>> active+recovering+undersized+degraded+remapped; 95835 GB data, 186 TB
>> used, 94341 GB / 278 TB avail; 11230 B/s rd, 164 kB/s wr, 42 op/s;
>> 3148226/69530815 objects degraded (4.528%); 59272/69530815 objects
>> misplaced (0.085%); 1/34756893 unfound (0.000%)
>>
>> we experienced extensive thread leaks on the remaining up+in OSDs,
>> which in turn led to random crashes with Thread::create asserts:
>>
>> 2016-09-10 09:08:40.211713 7f8576bd6700 -1 common/Thread.cc: In
>> function 'void Thread::create(size_t)' thread 7f8576bd6700 time
>> 2016-09-10 09:08:40.199211
>> common/Thread.cc: 131: FAILED assert(ret == 0)
>>
>> The thread count under normal operations is ~6500 on all nodes, but in
>> this degraded state we reached as much as ~35000.
>>
>> Is this expected behaviour when you have down+peering OSDs?
>> Is it possible to mitigate this problem using ceph configuration, or
>> is our only resort a kernel pid_max bump?
>>
>
> You should bump that setting. The default 32k is way too low during recovery.
>
> Set it to at least 512k or so.
>
> Wido
>
>> Regards,
>> Kostis
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Designing ceph cluster

2016-09-15 Thread Gaurav Goyal
Dear Ceph users,

Any suggestions on this, please?
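
In case it helps while I wait: my next step is to check whether the iSCSI
target is actually being exported on the controller. I assume the usual checks
are something like the following (depending on whether cinder-volume uses tgt
or LIO; the portal address is the one from the log below):

    tgtadm --lld iscsi --mode target --op show                # tgt
    targetcli ls                                              # LIO
    iscsiadm -m discovery -t sendtargets -p 10.24.0.4:3260    # from the compute node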


Regards
Gaurav Goyal

On Wed, Sep 14, 2016 at 2:50 PM, Gaurav Goyal 
wrote:

> Dear Ceph Users,
>
> I need your help to sort out the following issue with my cinder volume.
>
> I had set up Ceph as the backend for Cinder. Since I was using SAN storage
> for Ceph and wanted to get rid of it, I completely uninstalled Ceph from
> my OpenStack environment.
>
> We have now ordered local disks to build the Ceph storage on, but prior to
> configuring Ceph we want to create a Cinder volume using LVM on one of the
> local disks.
>
> I could create the Cinder volume, but I am unable to attach it to an
> instance.
>
> *Volume Overview*
> Information
> --
> Name
> test123
> ID
> e13d0ffc-3ed4-4a22-b270-987e81b1ca8f
> Status
> Available
> Specs
> --
> Size
> 1 GB
> Created
> Sept. 13, 2016, 7:12 p.m.
> Attachments
> --
> Attached To   *Not attached*
>
> [root@OSKVM1 ~]# fdisk -l
>
> Disk /dev/sda: 599.6 GB, 599550590976 bytes, 1170997248 sectors
>
> Units = sectors of 1 * 512 = 512 bytes
>
> Sector size (logical/physical): 512 bytes / 512 bytes
>
> I/O size (minimum/optimal): 512 bytes / 512 bytes
>
> Disk label type: dos
>
> Disk identifier: 0x0002a631
>
>Device Boot  Start End  Blocks   Id  System
>
> /dev/sda1   *2048 1026047  512000   83  Linux
>
> /dev/sda2 1026048  1170997247   584985600   8e  Linux LVM
>
> Disk /dev/mapper/centos-root: 53.7 GB, 53687091200 bytes, 104857600 sectors
>
> Units = sectors of 1 * 512 = 512 bytes
>
> Sector size (logical/physical): 512 bytes / 512 bytes
>
> I/O size (minimum/optimal): 512 bytes / 512 bytes
>
> Disk /dev/mapper/centos-swap: 4294 MB, 4294967296 bytes, 8388608 sectors
>
> Units = sectors of 1 * 512 = 512 bytes
>
> Sector size (logical/physical): 512 bytes / 512 bytes
>
> I/O size (minimum/optimal): 512 bytes / 512 byte
>
> Disk /dev/mapper/centos-home: 541.0 GB, 540977135616 bytes, 1056595968
> sectors
>
> Units = sectors of 1 * 512 = 512 bytes
>
> Sector size (logical/physical): 512 bytes / 512 bytes
>
> I/O size (minimum/optimal): 512 bytes / 512 bytes
>
> Disk /dev/sdb: 1099.5 GB, 1099526307840 bytes, 2147512320 sectors
>
> Units = sectors of 1 * 512 = 512 bytes
>
> Sector size (logical/physical): 512 bytes / 512 bytes
>
> I/O size (minimum/optimal): 512 bytes / 512 byte
>
> *Disk
> /dev/mapper/cinder--volumes-volume--e13d0ffc--3ed4--4a22--b270--987e81b1ca8f:
> 1073 MB, 1073741824 bytes, 2097152 sectors*
>
> *Units = sectors of 1 * 512 = 512 bytes*
>
> *Sector size (logical/physical): 512 bytes / 512 bytes*
>
> *I/O size (minimum/optimal): 512 bytes / 512 bytes*
>
>
> I am getting the following error while attaching a new volume to my new
> instance. Please suggest a way forward.
>
> 2016-09-13 16:48:18.335 55367 INFO nova.compute.manager
> [req-d19d0eb4-7ecc-4baa-8733-9c0f07f8890b dff16cdb3bea43a199ec4b29d2ba3309
> 9ef033cefb684be68105e30ef2b3b651 - - -] [instance:
> 8115ad54-dd36-47ba-bbd1-5c1df9989bf7] Attaching volume
> d90e4835-58f5-45a8-869e-fc3f30f0eaf3 to /dev/vdb
>
> 2016-09-13 16:48:20.548 55367 WARNING os_brick.initiator.connector
> [req-d19d0eb4-7ecc-4baa-8733-9c0f07f8890b dff16cdb3bea43a199ec4b29d2ba3309
> 9ef033cefb684be68105e30ef2b3b651 - - -] ISCSI volume not yet found at:
> [u'/dev/disk/by-path/ip-10.24.0.4:3260-iscsi-iqn.2010-10.
> org.openstack:volume-d90e4835-58f5-45a8-869e-fc3f30f0eaf3-lun-0']. Will
> rescan & retry.  Try number: 0.
>
> 2016-09-13 16:48:21.656 55367 WARNING os_brick.initiator.connector
> [req-d19d0eb4-7ecc-4baa-8733-9c0f07f8890b dff16cdb3bea43a199ec4b29d2ba3309
> 9ef033cefb684be68105e30ef2b3b651 - - -] ISCSI volume not yet found at:
> [u'/dev/disk/by-path/ip-10.24.0.4:3260-iscsi-iqn.2010-10.
> org.openstack:volume-d90e4835-58f5-45a8-869e-fc3f30f0eaf3-lun-0']. Will
> rescan & retry.  Try number: 1.
>
> 2016-09-13 16:48:25.772 55367 WARNING os_brick.initiator.connector
> [req-d19d0eb4-7ecc-4baa-8733-9c0f07f8890b dff16cdb3bea43a199ec4b29d2ba3309
> 9ef033cefb684be68105e30ef2b3b651 - - -] ISCSI volume not yet found at:
> [u'/dev/disk/by-path/ip-10.24.0.4:3260-iscsi-iqn.2010-10.
> org.openstack:volume-d90e4835-58f5-45a8-869e-fc3f30f0eaf3-lun-0']. Will
> rescan & retry.  Try number: 2.
>
> 2016-09-13 16:48:34.875 55367 WARNING os_brick.initiator.connector
> [req-d19d0eb4-7ecc-4baa-8733-9c0f07f8890b dff16cdb3bea43a199ec4b29d2ba3309
> 9ef033cefb684be68105e30ef2b3b651 - - -] ISCSI volume not yet found at:
> [u'/dev/disk/by-path/ip-10.24.0.4:3260-iscsi-iqn.2010-10.
> org.openstack:volume-d90e4835-58f5-45a8-869e-fc3f30f0eaf3-lun-0']. Will
> rescan & retry.  Try number: 3.
>
> 2016-09-13 16:48:42.418 55367 INFO nova.compute.resource_tracker
> [req-58348829-5b26-4835-ba5b-4e8796800b63 - - - - -] Auditing locally
> available compute resources for node controller
>
> 2016-09-13 16:48:43.841 55367 INFO nova.compute.resource_tracker
> [req-58348829-5b

[ceph-users] RADOSGW and LDAP

2016-09-15 Thread Andrus, Brian Contractor
All,
I have been making some progress on troubleshooting this.
I am seeing that when rgw is configured for LDAP, I am getting an error in my 
slapd log:

Sep 14 06:56:21 mgmt1 slapd[23696]: conn=1762 op=0 RESULT tag=97 err=2 
text=historical protocol version requested, use LDAPv3 instead

Am I correct in interpreting this to mean that rgw does not use LDAPv3?
Is there a way to enable this, or must I allow older versions in my OpenLDAP 
configuration?
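
If it does come down to allowing v2 binds on the OpenLDAP side, I assume the
change would be something like the following (untested, cn=config syntax):

    # allow-bind-v2.ldif
    dn: cn=config
    changetype: modify
    add: olcAllows
    olcAllows: bind_v2

    ldapmodify -Y EXTERNAL -H ldapi:/// -f allow-bind-v2.ldif

though I would rather understand why rgw is not speaking LDAPv3 in the first
place.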

Brian Andrus
ITACS/Research Computing
Naval Postgraduate School
Monterey, California
voice: 831-656-6238

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] mds damage detected - Jewel

2016-09-15 Thread Jim Kilborn
I have a replicated cache pool and a metadata pool which reside on ssd drives,
each with a size of 2, backed by an erasure-coded data pool.
The cephfs filesystem was in a healthy state. I pulled an SSD drive to perform
an exercise in osd failure.

The cluster recognized the ssd failure and recovered back to a healthy state,
but I got a message saying "mds0: Metadata damage detected".


   cluster 62ed97d6-adf4-12e4-8fd5-3d9701b22b86
 health HEALTH_ERR
mds0: Metadata damage detected
mds0: Client master01.div18.swri.org failing to respond to cache 
pressure
 monmap e2: 3 mons at 
{ceph01=192.168.19.241:6789/0,ceph02=192.168.19.242:6789/0,ceph03=192.168.19.243:6789/0}
election epoch 24, quorum 0,1,2 
ceph01,darkjedi-ceph02,darkjedi-ceph03
  fsmap e25: 1/1/1 up {0=-ceph04=up:active}, 1 up:standby
 osdmap e1327: 20 osds: 20 up, 20 in
flags sortbitwise
  pgmap v11630: 1536 pgs, 3 pools, 100896 MB data, 442 kobjects
201 GB used, 62915 GB / 63116 GB avail
1536 active+clean

In the mds logs of the active mds, I see the following:

7fad0c4b2700  0 -- 192.168.19.244:6821/1 >> 192.168.19.243:6805/5090 
pipe(0x7fad25885400 sd=56 :33513 s=1 pgs=0 cs=0 l=1 c=0x7fad2585f980).fault
7fad14add700  0 mds.beacon.darkjedi-ceph04 handle_mds_beacon no longer laggy
7fad101d3700  0 mds.0.cache.dir(1016c08) _fetched missing object for [dir 
1016c08 /usr/ [2,head] auth v=0 cv=0/0 ap=1+0+0 state=1073741952 f() n() 
hs=0+0,ss=0+0 | waiter=1 authpin=1 0x7fad25ced500]
7fad101d3700 -1 log_channel(cluster) log [ERR] : dir 1016c08 object missing 
on disk; some files may be lost
7fad0f9d2700  0 -- 192.168.19.244:6821/1 >> 192.168.19.242:6800/3746 
pipe(0x7fad25a4e800 sd=42 :0 s=1 pgs=0 cs=0 l=1 c=0x7fad25bd5180).fault
7fad14add700 -1 log_channel(cluster) log [ERR] : unmatched fragstat size on 
single dirfrag 1016c08, inode has f(v0 m2016-09-14 14:00:36.654244 
13=1+12), dirfrag has f(v0 m2016-09-14 14:00:36.654244 1=0+1)
7fad14add700 -1 log_channel(cluster) log [ERR] : unmatched rstat rbytes on 
single dirfrag 1016c08, inode has n(v77 rc2016-09-14 14:00:36.654244 
b1533163206 48173=43133+5040), dirfrag has n(v77 rc2016-09-14 14:00:36.654244 
1=0+1)
7fad101d3700 -1 log_channel(cluster) log [ERR] : unmatched rstat on 
1016c08, inode has n(v78 rc2016-09-14 14:00:36.656244 2=0+2), dirfrags have 
n(v0 rc2016-09-14 14:00:36.656244 3=0+3)

I'm not sure why the metadata got damaged, since it's being replicated, but I
want to fix the issue and test again. However, I can't figure out the steps to
repair the metadata.
I saw something about running a "damage ls", but I can't seem to find a more
detailed repair document. Any pointers to get the metadata fixed? Both my mds
daemons seem to be running correctly, but that error bothers me; it shouldn't
happen, I think.

I tried the following command, but it isn't understood:
ceph --admin-daemon /var/run/ceph/ceph-mds. ceph03.asok damage ls
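
From my reading of the docs, the form I was after is the "damage ls" admin
socket command, run on the node hosting the *active* MDS (with the socket or
mds id matching that daemon exactly), i.e. something like:

    ceph daemon mds.<active-mds-id> damage ls
    ceph daemon mds.<active-mds-id> damage rm <damage-id>   # clear an entry once handled

but corrections are welcome - this is just my interpretation of the
documentation.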


I then rebooted all 4 ceph servers simultaneously (another stress test); the
ceph cluster came back up healthy, and the mds damage status had been cleared!
I then replaced the ssd, put it back into service, and let the backfill
complete. The cluster was fully healthy. I pulled another ssd and repeated the
process, yet I never got the damaged mds messages. Was this just random
metadata damage due to yanking a drive out? Are there any lingering effects on
the metadata that I need to address?


-  Jim

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cleanup old osdmaps after #13990 fix applied

2016-09-15 Thread Dan Van Der Ster

> On 14 Sep 2016, at 23:07, Gregory Farnum  wrote:
> 
> On Wed, Sep 14, 2016 at 7:19 AM, Dan Van Der Ster
>  wrote:
>> Indeed, seems to be trimmed by osd_target_transaction_size (default 30) per 
>> new osdmap.
>> Thanks a lot for your help!
> 
> IIRC we had an entire separate issue before adding that field, where
> cleaning up from bad situations like that would result in the OSD
> killing itself as removing 2k maps exceeded the heartbeat timeouts. ;)
> Thus the limit.

Thanks Greg. FTR, I did some experimenting and found that setting 
osd_target_transaction_size = 1000 is a very bad idea (tried on one osd... 
FileStore merging of the meta subdirs led to a slow/down osd). But setting it
to ~60 was OK.

I cleaned up 90TB of old osdmaps today, generating new maps in a loop by doing:

   watch -n10 ceph osd pool set data min_size 2

Anything more aggressive than that was disruptive on our cluster.

Cheers, Dan


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS: Writes are faster than reads?

2016-09-15 Thread Andreas Gerstmayr
Thanks a lot for your explanation!
I just increased the 'rasize' option of the kernel client and got
significantly better throughput for sequential reads.
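
For anyone searching later: rasize is just a mount option of the CephFS kernel
client, given in bytes. Roughly what I used (the monitor address, credentials
and the 128 MB value are illustrative):

    mount -t ceph 10.0.0.1:6789:/ /mnt/cephfs \
        -o name=admin,secretfile=/etc/ceph/admin.secret,rasize=134217728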


Thanks,
Andreas


2016-09-15 0:29 GMT+02:00 Gregory Farnum :
> Oh hrm, I missed the stripe count settings. I'm not sure if that's
> helping you or not; I don't have a good intuitive grasp of what
> readahead will do in that case. I think you may need to adjust the
> readahead config knob in order to make it read all those objects
> together instead of one or two at a time.
> -Greg
>
> On Wed, Sep 14, 2016 at 3:24 PM, Andreas Gerstmayr
>  wrote:
>> 2016-09-14 23:19 GMT+02:00 Gregory Farnum :
>>> This is pretty standard behavior within Ceph as a whole — the journals
>>> really help on writes;
>> How does the journal help with large blocks? I thought the journal
>> speed-up comes from coalescing lots of small writes into bigger
>> blocks - but in my benchmark the block size is already 1MB.
>>
>>> and especially with big block sizes you'll
>>> exceed the size of readahead, but writes will happily flush out in
>>> parallel.
>> The client buffers lots of pages in the page cache and sends them in
>> bulk to the storage nodes where multiple OSDs can write the data in
>> parallel (because each OSD has its own disk), whereas the size of
>> readahead is way smaller than the buffer cache and therefore it can't
>> be parallelized that much (too little data is requested from the
>> cluster)?
>> Did I get that right?
>>
>> I just started a rados benchmark with similar settings (same
>> blocksize, 10 threads (same as the stripe count)):
>> $ rados bench -p repl1 180 -b 1M -t 10 write --no-cleanup
>> Total time run: 180.148379
>> Total writes made:  47285
>> Write size: 1048576
>> Object size:1048576
>> Bandwidth (MB/sec): 262.478
>>
>> Reading:
>> $ rados bench -p repl1 60 -t 10 seq
>> Total time run:   49.936949
>> Total reads made: 47285
>> Read size:1048576
>> Object size:  1048576
>> Bandwidth (MB/sec):   946.894
>>
>> Here the write is slower than the read benchmark. Is it because rados
>> does sync() each object after write? And there is no readahead, so all
>> the 10 threads are busy all the time during the benchmark, where in
>> the CephFS scenario it depends on the client readahead setting if 10
>> stripes are requested in parallel all the time?
>>
>>
>>>
>>> On Wed, Sep 14, 2016 at 12:51 PM, Henrik Korkuc  wrote:
 On 16-09-14 18:21, Andreas Gerstmayr wrote:
>
> Hello,
>
> I'm currently performing some benchmark tests with our Ceph storage
> cluster and trying to find the bottleneck in our system.
>
> I'm writing a random 30GB file with the following command:
> $ time fio --name=job1 --rw=write --blocksize=1MB --size=30GB
> --randrepeat=0 --end_fsync=1
> [...]
>   WRITE: io=30720MB, aggrb=893368KB/s, minb=893368KB/s,
> maxb=893368KB/s, mint=35212msec, maxt=35212msec
>
> real0m35.539s
>
> This makes use of the page cache, but fsync()s at the end (network
> traffic from the client stops here, so the OSDs should have the data).
>
> When I read the same file back:
> $ time fio --name=job1 --rw=read --blocksize=1MB --size=30G
> [...]
> READ: io=30720MB, aggrb=693854KB/s, minb=693854KB/s,
> maxb=693854KB/s, mint=45337msec, maxt=45337msec
>
> real0m45.627s
>
> It takes 10s longer. Why? When writing data to a Ceph storage cluster,
> the data is written twice (unbuffered to the journal and buffered to
> the backing filesystem [1]). On the other hand, reading should be much
> faster because it needs only a single operation, the data should be
> already in the page cache of the OSDs (I'm reading the same file I've
> written before, and the OSDs have plenty of RAM) and reading from
> disks is generally faster than writing. Any idea what is going on in
> the background, which makes reads more expensive than writes?

 I am not an expert here, but I think it basically boils down to that you
 read it linearly and write (flush cache) in parallel.

 If you could read multiple parts of the same file in parallel you could
 achieve better speeds
>>
>> I thought the striping feature of CephFS does exactly that? Write and
>> read stripe_count stripes in parallel?
>>


>
> I've run these tests multiple times with fairly consistent results.
>
> Cluster Config:
> Ceph jewel, 3 nodes with 256GB RAM and 25 disks each (only HDDs,
> journal on same disk)
> Pool with size=1 and 2048 PGs, CephFS stripe unit: 1MB, stripe count:
> 10, object size: 10MB
> 10 GbE, separate frontend+backend network
>
> [1] https://www.sebastien-han.fr/blog/2014/02/17/ceph-io-patterns-the-bad/
>
>
> Thanks,
> Andreas
> ___
> ceph-users mailing list

[ceph-users] High CPU load with radosgw instances

2016-09-15 Thread lewis.geo...@innoscale.net
Hi,
 So, maybe someone has an idea of where to go on this.
  
 I have just set up 2 rgw instances in a multisite configuration. They are
working nicely. I have added a couple of test buckets and some files just to
make sure it works. The status shows both are caught up. Nobody else is
accessing or using them.
  
 However, the CPU load on both hosts is sitting at around 3.00, with the
radosgw process constantly taking up 99% CPU. I do not see anything happening
in the logs at all.
  
 Any thoughts or direction?
  
 Have a good day,
  
 Lewis George
  

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Error while searching on the mailing list archives

2016-09-15 Thread Erick Perez - Quadrian Enterprises
Hi, just to let the admins know that when searching for terms (I searched
for "erasure coding") in the mailing list archive at
http://lists.ceph.com/pipermail/ceph-users-ceph.com/

This error is returned in the browser at
http://lists.ceph.com/mmsearch.cgi/ceph-users-ceph.com

ht://Dig error

htsearch detected an error. Please report this to the webmaster of this
site by sending an e-mail to: mail...@listserver-dap.dreamhost.com The
error message is:

Unable to read word database file '/dh/mailman/dap/archives/private/
ceph-users-ceph.com/htdig/db.words.db'
Did you run htdig?

-- 
Erick.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Erasure coding general information Openstack+kvm virtual machine block storage

2016-09-15 Thread Erick Perez - Quadrian Enterprises
Can someone point me to a thread or site that uses ceph+erasure coding to
serve block storage for Virtual Machines running with Openstack+KVM?
All the references I have found use erasure coding for cold data, or are *not*
about VM block access.

thanks,

-- 

-
Erick
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] swiftclient call radosgw, it always response 401 Unauthorized

2016-09-15 Thread Brian Chang-Chien
Does anyone know about this problem? Please help me take a look at it.

On 13 September 2016 at 17:58, "Brian Chang-Chien" wrote:

> Hi, naga.b
>
> I am using Ceph Jewel 10.2.2.
> My ceph.conf is as follows:
> [global]
> fsid = d056c174-2e3a-4c36-a067-cb774d176ce2
> mon_initial_members = brianceph
> mon_host = 10.62.9.140
> auth_cluster_required = cephx
> auth_service_required = cephx
> auth_client_required = cephx
> osd_crush_chooseleaf_type = 0
> osd_pool_default_size = 1
> osd_journal_size = 100
> [client.radosgw.gateway]
> host = brianceph
> keyring = /etc/ceph/ceph.client.radosgw.keyring
> log_file = /var/log/ceph/radosgw.log
> rgw_dns_name = brianceph
> rgw_keystone_url = http://10.62.13.253:35357
> rgw_keystone_admin_token = 7bb8e26cbc714c47a26ffec3d96f246f
> rgw_keystone_accepted_roles = admin, swiftuser
> rgw_ketstone_token_cache_size = 200
> rgw_keystone_revocation_interval = 30
> rgw_s3_auth_use_keystone = true
> nss_db_path = /var/ceph/nss
>
> and my radosgw.log
>
> 2016-09-13 17:42:38.638462 7efd964619c0  0 starting handler: fastcgi
> 2016-09-13 17:42:38.638523 7efcadf9b700  0 ERROR: no socket server point
> defined, cannot start fcgi frontend
> 2016-09-13 17:47:33.597070 7efcdeffd700  1 == starting new request
> req=0x7efcdeff7710 =
> 2016-09-13 17:47:33.597329 7efcdeffd700  1 == req done
> req=0x7efcdeff7710 op status=0 http_status=401 ==
> 2016-09-13 17:47:33.597379 7efcdeffd700  1 civetweb: 0x7efd2bb0:
> 10.62.9.34 - - [13/Sep/2016:17:47:33 +0800] "HEAD /swift/v1 HTTP/1.1" 401 0
> - python-swiftclient-2.6.0
> 2016-09-13 17:47:34.755291 7efcd700  1 == starting new request
> req=0x7efcdfff9710 =
> 2016-09-13 17:47:34.755443 7efcd700  1 == req done
> req=0x7efcdfff9710 op status=0 http_status=401 ==
> 2016-09-13 17:47:34.755481 7efcd700  1 civetweb: 0x7efd48004020:
> 10.62.9.34 - - [13/Sep/2016:17:47:34 +0800] "HEAD /swift/v1 HTTP/1.1" 401 0
> - python-swiftclient-2.6.0
> 2016-09-13 17:49:04.718249 7efcdf7fe700  1 == starting new request
> req=0x7efcdf7f8710 =
> 2016-09-13 17:49:04.718438 7efcdf7fe700  1 == req done
> req=0x7efcdf7f8710 op status=0 http_status=401 ==
> 2016-09-13 17:49:04.718483 7efcdf7fe700  1 civetweb: 0x7efd68001f60:
> 10.62.9.34 - - [13/Sep/2016:17:49:04 +0800] "HEAD /swift/v1 HTTP/1.1" 401 0
> - python-swiftclient-2.6.0
> 2016-09-13 17:49:05.870115 7efcde7fc700  1 == starting new request
> req=0x7efcde7f6710 =
> 2016-09-13 17:49:05.870280 7efcde7fc700  1 == req done
> req=0x7efcde7f6710 op status=0 http_status=401 ==
> 2016-09-13 17:49:05.870324 7efcde7fc700  1 civetweb: 0x7efd28000bb0:
> 10.62.9.34 - - [13/Sep/2016:17:49:05 +0800] "HEAD /swift/v1 HTTP/1.1" 401 0
> - python-swiftclient-2.6.0
> 2016-09-13 17:51:32.036065 7efd157fa700  1 handle_sigterm
> 2016-09-13 17:51:32.036099 7efd157fa700  1 handle_sigterm set alarm for 120
> 2016-09-13 17:51:32.036153 7efd964619c0 -1 shutting down
> 2016-09-13 17:51:32.037977 7efd78df9700  0 monclient: hunting for new mon
> 2016-09-13 17:51:32.038172 7efd783f6700  0 -- 10.62.9.140:0/1002906388 >>
> 10.62.9.140:6789/0 pipe(0x7efd60016670 sd=7 :0 s=1 pgs=0 cs=0 l=1
> c=0x7efd60014d70).fault
> 2016-09-13 17:51:32.906553 7efd964619c0  1 final shutdown
> 2016-09-13 17:51:39.294948 7ff5175f29c0  0 deferred set uid:gid to 167:167
> (ceph:ceph)
> 2016-09-13 17:51:39.295097 7ff5175f29c0  0 ceph version 10.2.2 (
> 45107e21c568dd033c2f0a3107dec8f0b0e58374), process radosgw, pid 13251
> 2016-09-13 17:51:39.318311 7ff5175e8700  0 -- :/175783115 >>
> 10.62.9.140:6789/0 pipe(0x7ff51987b9b0 sd=7 :0 s=1 pgs=0 cs=0 l=1
> c=0x7ff519842430).fault
> 2016-09-13 17:51:39.596568 7ff4fc10d700  0 -- 10.62.9.140:0/175783115 >>
> 10.62.9.140:6800/11336 pipe(0x7ff519880080 sd=8 :0 s=1 pgs=0 cs=0 l=1
> c=0x7ff519881390).fault
> 2016-09-13 17:51:40.197109 7ff4fc10d700  0 -- 10.62.9.140:0/175783115 >>
> 10.62.9.140:6800/11336 pipe(0x7ff519880080 sd=8 :42233 s=1 pgs=0 cs=0 l=1
> c=0x7ff519881390).connect claims to be 10.62.9.140:6800/13358 not
> 10.62.9.140:6800/11336 - wrong node!
> 2016-09-13 17:51:40.997618 7ff4fc10d700  0 -- 10.62.9.140:0/175783115 >>
> 10.62.9.140:6800/11336 pipe(0x7ff519880080 sd=8 :42234 s=1 pgs=0 cs=0 l=1
> c=0x7ff519881390).connect claims to be 10.62.9.140:6800/13358 not
> 10.62.9.140:6800/11336 - wrong node!
> 2016-09-13 17:51:42.598080 7ff4fc10d700  0 -- 10.62.9.140:0/175783115 >>
> 10.62.9.140:6800/11336 pipe(0x7ff519880080 sd=8 :42235 s=1 pgs=0 cs=0 l=1
> c=0x7ff519881390).connect claims to be 10.62.9.140:6800/13358 not
> 10.62.9.140:6800/11336 - wrong node!
> 2016-09-13 17:51:45.798587 7ff4fc10d700  0 -- 10.62.9.140:0/175783115 >>
> 10.62.9.140:6800/11336 pipe(0x7ff519880080 sd=8 :42236 s=1 pgs=0 cs=0 l=1
> c=0x7ff519881390).connect claims to be 10.62.9.140:6800/13358 not
> 10.62.9.140:6800/11336 - wrong node!
> 2016-09-13 17:51:52.199050 7ff4fc10d700  0 -- 10.62.9.140:0/175783115 >>
> 10.62.9.140:6800/11336 pipe(0x7ff519880080 sd=8 :42237 s=1 pgs=0 cs=0 l=1
> c=0x7ff519881390).connect claims to

Re: [ceph-users] Erasure coding general information Openstack+kvm virtual machine block storage

2016-09-15 Thread Josh Durgin

On 09/16/2016 09:46 AM, Erick Perez - Quadrian Enterprises wrote:

Can someone point me to a thread or site that uses ceph+erasure coding
to serve block storage for Virtual Machines running with Openstack+KVM?
All references that I found are using erasure coding for cold data or
*not* VM block access.


Erasure coding is not supported by RBD currently, since EC pools only
support append operations. There's work in progress to make it possible,
by allowing overwrites for EC pools, but it won't be usable until at
earliest Luminous [0].
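
In the meantime, the usual way to put RBD data on an EC pool is to front it
with a replicated cache tier. A rough sketch of the pool setup (PG counts,
sizes and names are illustrative, and the usual cache-tiering caveats apply):

    ceph osd pool create ec-data 1024 1024 erasure
    ceph osd pool create hot-cache 512
    ceph osd tier add ec-data hot-cache
    ceph osd tier cache-mode hot-cache writeback
    ceph osd tier set-overlay ec-data hot-cache
    ceph osd pool set hot-cache hit_set_type bloom
    ceph osd pool set hot-cache target_max_bytes 1099511627776   # ~1 TB, illustrative
    rbd create ec-data/test-image --size 102400   # I/O flows through the cache tier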

Josh

[0] http://tracker.ceph.com/issues/14031
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] High CPU load with radosgw instances

2016-09-15 Thread Yehuda Sadeh-Weinraub
On Thu, Sep 15, 2016 at 4:53 PM, lewis.geo...@innoscale.net
 wrote:
> Hi,
> So, maybe someone has an idea of where to go on this.
>
> I have just setup 2 rgw instances in a multisite setup. They are working
> nicely. I have add a couple of test buckets and some files to make sure it
> works is all. The status shows both are caught up. Nobody else is accessing
> or using them.
>
> However, the CPU load on both hosts is sitting at like 3.00, with the
> radosgw process taking up 99% CPU constantly. I do not see anything in the
> logs happening at all.
>
> Any thoughts or direction?
>

We've seen that happening when running on a system with an older version
of libcurl (e.g., 7.29). If that's the case, upgrading to a newer
version should fix it for you.
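
A quick way to check is to look at what radosgw is actually linking against,
e.g. (RPM-based example; adjust for your distro):

    rpm -q libcurl
    ldd $(which radosgw) | grep curl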

Yehuda


> Have a good day,
>
> Lewis George
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com