Re: [ceph-users] How to get Active set of OSD Map in serial order of osd index

2016-07-27 Thread Syed Hussain
Fundamentally, I wanted to know what chunks are allocated in which OSDs.
This way I can preserve the array structure required for my
Erasure Code. If all the chunks are placed in randomly ordered OSDs (like
in Jerasure or ISA) then I lose the array structure required by the
Encoding/Decoding algorithm of my plugin.
I'm trying to develop an Erasure Code plugin for RDP (or RAID-DP) kind of
code.
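
For reference, a minimal shell sketch of how the chunk-to-OSD mapping can be
read off the acting set that "ceph osd map" prints (the acting set below is the
one from the example further down in this thread; chunk i lives on acting[i]):

  acting=(4 23 22 10 9 11 15 6 19 1 7 8 17 21 16 14 18 12 13 20 3 5 0 2)
  for i in "${!acting[@]}"; do
      echo "chunk $i -> osd.${acting[$i]}"
  done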

Thanks,
Syed

On Wed, Jul 27, 2016 at 4:12 AM, Samuel Just  wrote:

> Why do you want them in serial increasing order?
> -Sam
>
> On Tue, Jul 26, 2016 at 2:43 PM, Samuel Just  wrote:
>
>> How would such a code work if there were more than 24 osds?
>> -Sam
>>
>> On Tue, Jul 26, 2016 at 2:37 PM, Syed Hussain  wrote:
>>
>>> Hi,
>>>
>>> I'm working to develop an Erasure Code plugin (variation of ISA) that
>>> have typical requirement that the active set of the Erasure Coded pool in
>>> serial order.
>>> For example,
>>>
>>> 
>>> >ceph osd erasure-code-profile set reed_k16m8_isa k=16 m=8 plugin=isa
>>> technique=reed_sol_van ruleset-failure-domain=osd
>>> >ceph osd pool create reed_k16m8_isa_pool 128 128 erasure reed_k16m8_isa
>>> >echo "ABCDEFGHIABCDEFGHIABCDEFGHIABCDEFGHIABCDEFGHIABCDEFGHI" | rados
>>> --pool reed_k16m8_isa_pool put myobj16_8 -
>>> >ceph osd map reed_k16m8_isa_pool myobj16_8
>>> osdmap e86 pool 'reed_k16m8_isa_pool' (1) object 'myobj16_8' -> pg
>>> 1.cf6ec86f (1.6f) -> up
>>> ([4,23,22,10,9,11,15,6,19,1,7,8,17,21,16,14,18,12,13,20,3,5,0,2], p4)
>>> acting ([4,23,22,10,9,11,15,6,19,1,7,8,17,21,16,14,18,12,13,20,3,5,0,2], p4)
>>>
>>> 
>>>
>>> That means chunks 0, 1, 2, ... 23 of the erasure coding are saved in
>>> OSDs 4, 23, 22, 10, ... 2 respectively, as per the order given in the active
>>> set.
>>>
>>> Now my question is how I can get a PG mapping for object
>>> myobj16_8 with the active set [0, 1, 2, ... 23], so that the i-th chunk of
>>> the erasure-coded object is saved on the i-th OSD.
>>>
>>> Is there any option available in "ceph osd pool create" to do it?
>>> Or is there another way to accomplish this?
>>>
>>> Appreciate your suggestions.
>>>
>>> Thanks,
>>> Syed Hussain
>>> NetWorld
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>>
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Unknown error (95->500) when creating buckets or putting files to RGW after upgrade from Infernalis to Jewel

2016-07-27 Thread Naruszewicz, Maciej
Sure Nick, here they are:

# ceph osd lspools
72 .rgw.control,73 .rgw,74 .rgw.gc,75 .log,76 .users.uid,77 .users,78 
.users.swift,79 .rgw.buckets.index,80 .rgw.buckets.extra,81 .rgw.buckets,82 
.rgw.root.backup,83 .rgw.root,84 logs,85 default.rgw.meta,

Thanks for your help nonetheless!

-Original Message-
From: nick [mailto:n...@nine.ch] 
Sent: Wednesday, July 27, 2016 6:31 AM
To: Naruszewicz, Maciej 
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Unknown error (95->500) when creating buckets or 
putting files to RGW after upgrade from Infernalis to Jewel

Hi Maciej,
I am slowly running out of ideas :-) Could you send the output of 'ceph osd
lspools' so that I can compare your pools with ours?

Maybe someone else has had similar problems and can help?

Cheers
Nick

On Tuesday, July 26, 2016 03:56:39 PM Naruszewicz, Maciej wrote:
> Unfortunately none of our pools are erasure-code pools - I just 
> double-checked that.
> 
> I found another issue with deleting (I only can't create buckets or 
> upload files, get/delete work fine) which looks almost identically 
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-July/003100.html
> but it was unanswered.
> 
> 
> -Original Message-
> From: nick [mailto:n...@nine.ch]
> Sent: Tuesday, July 26, 2016 8:27 AM
> To: Naruszewicz, Maciej 
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Unknown error (95->500) when creating 
> buckets or putting files to RGW after upgrade from Infernalis to Jewel
> 
> Hey Maciej,
> I compared the output of your commands with the output on our cluster 
> and they are the same. So I do not see any problems on that side.
> After that I googled for the warning you get in the debug log: """
> WARNING: set_req_state_err err_no=95 resorting to 500 """
> 
> I found some reports about problems with EC coded pools and rados gw. 
> Do you use that?
> 
> 
> Cheers
> Nick
> 
> On Monday, July 25, 2016 04:50:56 PM Naruszewicz, Maciej wrote:
> > WARNING: set_req_state_err err_no=95 resorting to 500
 
--
Sebastian Nickel
Nine Internet Solutions AG, Albisriederstr. 243a, CH-8047 Zuerich Tel +41 44 
637 40 00 | Support +41 44 637 40 40 | www.nine.ch
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RGW container deletion problem

2016-07-27 Thread Daniel Schneller


Bump

On 2016-07-25 14:05:38 +, Daniel Schneller said:


Hi!

I created a bunch of test containers with some objects in them via
RGW/Swift (Ubuntu, RGW via Apache, Ceph Hammer 0.94.1)

Now I try to get rid of the test data.

I manually started with one container:

~/rgwtest ➜  swift -v -V 1.0 -A http://localhost:8405/auth -U <...> -K
<...> --insecure delete test_a6b3e80c-e880-bef9-b1b5-892073e3b153
test_10
test_5
test_100
test_20
test_30

So far so good. Note that localhost:8405 is bound by haproxy, which
distributes requests to 4 RGWs on different servers, in case that is
relevant.

To make sure my script gets error handling right, I tried to delete the
same container again, leading to an error:

~/rgwtest ➜  swift -v --retries=0 -V 1.0 -A http://localhost:8405/auth
-U <...> -K <...> --insecure delete
test_a6b3e80c-e880-bef9-b1b5-892073e3b153
Container DELETE failed:
http://localhost:8405:8405/swift/v1/test_a6b3e80c-e880-bef9-b1b5-892073e3b153
500 Internal Server Error   UnknownError

Stat'ing it still works:

~/rgwtest ➜  swift -v -V 1.0 -A http://localhost:8405/auth -U <...> -K
<...> --insecure stat test_a6b3e80c-e880-bef9-b1b5-892073e3b153
   URL:
http://localhost:8405/swift/v1/test_a6b3e80c-e880-bef9-b1b5-892073e3b153
Auth Token: AUTH_rgwtk...
   Account: v1
 Container: test_a6b3e80c-e880-bef9-b1b5-892073e3b153
   Objects: 0
 Bytes: 0
  Read ACL:
 Write ACL:
   Sync To:
  Sync Key:
Server: Apache/2.4.7 (Ubuntu)
X-Container-Bytes-Used-Actual: 0
X-Storage-Policy: default-placement
  Content-Type: text/plain; charset=utf-8


Checking the RGW Logs I found this:

2016-07-25 15:21:29.751055 7fbcd67f4700  1 == starting new request
req=0x7fbce40a1100 =
2016-07-25 15:21:29.768688 7fbcd67f4700  0 WARNING: set_req_state_err
err_no=125 resorting to 500
2016-07-25 15:21:29.768743 7fbcd67f4700  1 == req done
req=0x7fbce40a1100 http_status=500 ==

Googling a little, I found this:

http://tracker.ceph.com/issues/14208

which mentions similar issues and an out-of-sync metadata cache between
different RGWs. I vaguely remember having seen something like this
in the Firefly timeframe before, but I am not sure if it is the same.

Where does this metadata cache live? Can it be flushed somehow without
disturbing other operations?

I found this PDF

https://archive.fosdem.org/2016/schedule/event/virt_iaas_ceph_rados_gateway_overview/attachments/audio/1077/export/events/attachments/virt_iaas_ceph_rados_gateway_overview/audio/1077/Fosdem_RGW.pdf 




but without the "audio track" it doesn't really help me.

Thanks!
Daniel



--
--
Daniel Schneller
Principal Cloud Engineer

CenterDevice GmbH
https://www.centerdevice.de


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] OSD host swap usage

2016-07-27 Thread Kenneth Waegeman

Hi all,

When our OSD hosts have been running for some time, we start to see increased
swap usage on a number of them. Some OSDs don't use swap for weeks,
while others have a full (4G) swap and start filling swap again after we
do a swapoff/swapon.
We have 8 8TB OSDs and 2 cache SSDs on each host, and 80GB of memory.
There is still about 15-20GB of memory available when this happens. Running
CentOS 7; we had swappiness set to 0. There is no client IO right now,
only scrubbing. Some OSDs are using 20-80% of CPU.


Has somebody seen this behaviour? It doesn't have to be bad, but what
could explain why some hosts keep on swapping and others don't?

Could this point to some underlying issue?

Thanks !!

Kenneth
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Unknown error (95->500) when creating buckets or putting files to RGW after upgrade from Infernalis to Jewel

2016-07-27 Thread nick
I compared the pools with ours and I can see no difference, to be honest. The
issue sounds like you cannot write into a specific pool (as get and delete
work).

Are all the filesystem permissions correct? Maybe another 'chown -R ceph:ceph'
for all the OSD data dirs would help? Did you check the user's permissions in
RGW as well (the op_mask in 'radosgw-admin user info --uid=""')?
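
A quick way to check the latter (just a sketch; replace TESTUSER with the
affected uid):

  radosgw-admin user info --uid=TESTUSER | grep op_mask
  # a normal S3 user shows "op_mask": "read, write, delete"; a missing "write"
  # there would explain failing PUTs while GET/DELETE keep working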

Cheers
Nick

On Wednesday, July 27, 2016 07:55:14 AM Naruszewicz, Maciej wrote:
> Sure Nick, here they are:
> 
> # ceph osd lspools
> 72 .rgw.control,73 .rgw,74 .rgw.gc,75 .log,76 .users.uid,77 .users,78
> .users.swift,79 .rgw.buckets.index,80 .rgw.buckets.extra,81 .rgw.buckets,82
> .rgw.root.backup,83 .rgw.root,84 logs,85 default.rgw.meta,
> 
> Thanks for your help nonetheless!
> 
> -Original Message-
> From: nick [mailto:n...@nine.ch]
> Sent: Wednesday, July 27, 2016 6:31 AM
> To: Naruszewicz, Maciej 
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Unknown error (95->500) when creating buckets or
> putting files to RGW after upgrade from Infernalis to Jewel
> 
> Hi Maciej,
> slowly I am running out of ideas :-) Could you send the output of 'ceph osd
> lspools' so that I can compare your pools with ours?
> 
> Maybe someone else got similiar problems and can help?
> 
> Cheers
> Nick
> 
> On Tuesday, July 26, 2016 03:56:39 PM Naruszewicz, Maciej wrote:
> > Unfortunately none of our pools are erasure-code pools - I just
> > double-checked that.
> > 
> > I found another issue with deleting (I only can't create buckets or
> > upload files, get/delete work fine) which looks almost identically
> > http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-July/003100.html
> > but it was unanswered.
> > 
> > 
> > -Original Message-
> > From: nick [mailto:n...@nine.ch]
> > Sent: Tuesday, July 26, 2016 8:27 AM
> > To: Naruszewicz, Maciej 
> > Cc: ceph-users@lists.ceph.com
> > Subject: Re: [ceph-users] Unknown error (95->500) when creating
> > buckets or putting files to RGW after upgrade from Infernalis to Jewel
> > 
> > Hey Maciej,
> > I compared the output of your commands with the output on our cluster
> > and they are the same. So I do not see any problems on that site.
> > After that I googled for the warning you get in the debug log: """
> > WARNING: set_req_state_err err_no=95 resorting to 500 """
> > 
> > I found some reports about problems with EC coded pools and rados gw.
> > Do you use that?
> > 
> > 
> > Cheers
> > Nick
> > 
> > On Monday, July 25, 2016 04:50:56 PM Naruszewicz, Maciej wrote:
> > > WARNING: set_req_state_err err_no=95 resorting to 500
> 
> --
> Sebastian Nickel
> Nine Internet Solutions AG, Albisriederstr. 243a, CH-8047 Zuerich Tel +41 44
> 637 40 00 | Support +41 44 637 40 40 | www.nine.ch
 
-- 
Sebastian Nickel
Nine Internet Solutions AG, Albisriederstr. 243a, CH-8047 Zuerich
Tel +41 44 637 40 00 | Support +41 44 637 40 40 | www.nine.ch

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs - mds hardware recommendation for 40 million files and 500 users

2016-07-27 Thread John Spray
On Tue, Jul 26, 2016 at 9:53 PM, Mike Miller  wrote:
> Hi,
>
> we have started to migrate user homes to cephfs with the mds server 32GB
> RAM. With multiple rsync threads copying this seems to be undersized; the
> mds process consumes all memory 32GB fitting about 4 million caps.
>
> Any hardware recommendation for about 40 million files and about 500 users?

As Greg says, your working set is the important thing rather than the
overall number of files in the system.

If, for example, you are using fuse clients with the default client
cache size (16384) then your working set for 500 clients will be
around 8 million, assuming the clients are accessing unique files
(likely for home directories).  Look at the memory usage of your
existing MDS vs. the value of the mds.inodes performance counter to
work out how much RAM is being used per inode.
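
A rough sketch of that check on the MDS host (assuming the default admin socket
location and an MDS id of 'a'; adjust both to your setup):

  # cached inodes according to the MDS
  ceph daemon mds.a perf dump | grep '"inodes"'
  # resident memory of the ceph-mds process, in KiB
  ps -C ceph-mds -o rss=
  # bytes per inode is roughly (rss * 1024) / inodes; with 500 fuse clients at
  # the default cache size the working set is about 500 * 16384 ~= 8 million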

> Currently, we are on hammer 0.94.5 and linux ubuntu, kernel 3.13.

You should definitely update to Jewel for your cephfs rollout, and if
you're using the kernel client anywhere make sure you've got a 4.x
kernel.

John
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD host swap usage

2016-07-27 Thread Christian Balzer

Hello,

On Wed, 27 Jul 2016 10:21:34 +0200 Kenneth Waegeman wrote:

> Hi all,
> 
> When our OSD hosts are running for some time, we start see increased 
> usage of swap on a number of them. Some OSDs don't use swap for weeks, 
> while others has a full (4G) swap, and start filling swap again after we 
> did a swapoff/swapon.

Obvious first question would be, are all these hosts really the same, HW,
SW and configuration wise?

> We have 8 8TB OSDS and 2 cache SSDs on each hosts, and 80GB of Memory. 

How full are these OSDs? 
I'm interested in # of files, not space, so a "df -i" should give us some idea.

80GB is an odd number, how are the DIMMs distributed among the CPU(s)?

> There is still about 15-20GB memory available when this happens. Running 
> Centos7; 

How do you define free memory? 
Not used at all? 
I'd expect any Ceph storage server to use all "free" RAM for SLAB and
pagecache very quickly, at the latest after the first deep scrub.

If it is really unused AND your system is swapping, something odd is going
on indeed, maybe something NUMA related that prevents part of your memory
from being used.

Of course this could also be an issue with your CentOS kernel, I'm
definitely not seeing anything like this on any of my machines.

> We had swapiness set to 0. 
I wouldn't set it lower than 1.
Also any other tuning settings, like vm/vfs_cache_pressure and
vm/min_free_kbytes?
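
For comparison between a swapping and a non-swapping host, a quick sketch like
this is usually enough to spot a difference:

  sysctl vm.swappiness vm.vfs_cache_pressure vm.min_free_kbytes vm.zone_reclaim_mode
  grep -E 'SwapTotal|SwapFree|Dirty|Writeback' /proc/meminfo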

>There is no client io right now, 
> only scrubbing. some OSDs are using 20-80% of cpu.
> 
Sounds high for pure CPU usage, unless that includes IOWAIT.

Christian

> Has somebody seen this behaviour? It doesn't have to be bad, but what 
> could explain some hosts keep on swapping, and others don't?
> Could this be some issue?
> 
> Thanks !!
> 
> Kenneth
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] how to list the objects stored in the specified placement group?

2016-07-27 Thread jerry
Hello everyone,


I want to list the objects stored in a specified placement group through the
rados API. Do you know how to do this?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] how to list the objects stored in the specified placement group?

2016-07-27 Thread Wido den Hollander

> On 27 July 2016 at 12:48, jerry wrote:
> 
> 
> Hello everyone,
> 
> 
> I want to list the objects stored in a specified placement group through the
> rados API. Do you know how to do this?
> ___

As far as I know that's not possible. Placement Groups are something which 
happens inside the OSDs, but they are not exposed to the end-user by RADOS.

RADOS is there to store an object for you and get it back when you need it.

Why would you want to do this anyway?

Wido

> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Monitors not reaching quorum

2016-07-27 Thread Sergio A. de Carvalho Jr.
In my case, everything else running on the host seems to be okay. I'm
wondering if the other problems you see aren't a side-effect of Ceph
services running slow?

What do you do to get around the problem when it happens? Disable syslog in
Ceph?

What version of Ceph and OS are you using?

On Wed, Jul 27, 2016 at 12:47 AM, Sean Crosby 
wrote:

> Agreed. When I first had these problems, random stuff would just not work.
> SSH would take a while to log in, DNS server would process requests slow,
> our Batch system would freeze and not run jobs. It's now one of my first
> things to check when services are running weirdly.
>
> My failsafe check is to do
>
> # logger "sean test"
>
> and see if it appears in syslog. If it doesn't do it immediately, I have a
> problem
>
> Cheers,
> Sean
>
> On 27 July 2016 at 04:01, Sergio A. de Carvalho Jr.  > wrote:
>
>> The funny thing is that I just restarted the rsyslog daemon on the Ceph
>> hosts and I can now re-enable syslog for Ceph without any issues. It just
>> looks like the rsyslog service had a hiccup, possibly related to a problem on
>> one of the central syslog servers, and this in turn prevented the monitors
>> from operating normally.
>>
>> It's just scary to think that your logging daemon can cause so much
>> damage!
>>
>> On Tue, Jul 26, 2016 at 6:48 PM, Joao Eduardo Luis  wrote:
>>
>>> On 07/26/2016 06:27 PM, Sergio A. de Carvalho Jr. wrote:
>>>
 (Just realised I originally replied to Sean directly, so reposting here
 for posterity).

 Bingo!

>>>
>>> wow. This didn't even cross my mind. D:
>>>
>>> Thanks for sharing.
>>>
>>>
 I turned off syslog and the monitors quickly reached quorum and
 everything seems back to normal. Thanks so much, Sean.

 Luckily this is a test cluster. I wonder how I could catch this in a
 production cluster before our support engineers spend a day trying to
 track the problem down.

>>>
>>> Only way I can see to deal with this sort of thing would be to log to
>>> syslog on a separate thread and have said thread monitoring the latency
>>> when writing to syslog.
>>>
>>> I don't think currently there's any support for that. I'll try to get
>>> something concocted this week, mostly for the fun of it.
>>>
>>>   -Joao
>>>
>>>
 Any ideas?

 On Tue, Jul 26, 2016 at 12:28 PM, Sean Crosby
 mailto:richardnixonsh...@gmail.com>>
 wrote:

 Hi Sergio,

 You don't happen to have rsyslog forwarding to a central log server
 by any chance? I've seen this behaviour before when my central log
 server is not keeping up with messages.

 Cheers,
 Sean

 On 26 July 2016 at 21:13, Sergio A. de Carvalho Jr.
 mailto:scarvalh...@gmail.com>> wrote:

 I left the 4 nodes running overnight and they just crawled to
 their knees... to the point that nothing has been written to the
 logs in the last 11 hours. So I stopped all monitors this
 morning and started them one by one again, but they're are still
 being extremely slow. Here are their logs:

 https://gist.github.com/anonymous/85213467f701c5a69c7fdb4e54bc7406

 https://gist.github.com/anonymous/f30a8903e701423825fd4d5aaa651e6a

 https://gist.github.com/anonymous/42a1856cc819de5b110d9f887e9859d2

 https://gist.github.com/anonymous/652bc41197e83a9d76cf5b2e6a211aa2

 I'm still puzzled to see logs being written with a timestamp
 that is several minutes behind the system clock. As time passes,
 the gap widens and quickly the logs are over 10 minutes behind
 the actual time, which explains why the logs  above don't seem
 to overlap.



 On Mon, Jul 25, 2016 at 9:37 PM, Sergio A. de Carvalho Jr.
 mailto:scarvalh...@gmail.com>> wrote:

 Awesome, thanks so much, Joao.

 Here's the mon_status:

 https://gist.github.com/anonymous/2b80a9a75d134d9e539dfbc81615c055

 I'm still trying to collect the logs, but while doing that I
 noticed that the log records are severely delayed compared
 to the system clock. For example, watching the logs with
 tail -f, I see records with a timestamp that is up to 28
 minutes behind the system clock!

 Also, while trying to set debug level, the monitors
 sometimes hung for several minutes, so there's obviously
 something wrong with them.


 On Mon, Jul 25, 2016 at 6:16 PM, Joao Eduardo Luis
 mailto:j...@suse.de>>

Re: [ceph-users] Monitors not reaching quorum

2016-07-27 Thread Sean Crosby
Oh, my problems weren't on Ceph nodes. I've seen this problem on non-Ceph
nodes. The symptoms you had of unexplained weirdness with services (in your
case, Ceph), and syslog lagging 10mins behind just reminded me of symptoms
I've seen before where the sending of syslog messages to a central syslog
server got stuck, and caused unusual problems on the host.

Cheers,
Sean

On 27 July 2016 at 20:57, Sergio A. de Carvalho Jr. 
wrote:

> In my case, everything else running on the host seems to be okay. I'm
> wondering if the other problems you see aren't a side-effect of Ceph
> services running slow?
>
> What do you do to get around the problem when it happens? Disable syslog
> in Ceph?
>
> What version of Ceph and OS are you using?
>
> On Wed, Jul 27, 2016 at 12:47 AM, Sean Crosby 
> wrote:
>
>> Agreed. When I first had these problems, random stuff would just not
>> work. SSH would take a while to log in, DNS server would process requests
>> slow, our Batch system would freeze and not run jobs. It's now one of my
>> first things to check when services are running weirdly.
>>
>> My failsafe check is to do
>>
>> # logger "sean test"
>>
>> and see if it appears in syslog. If it doesn't do it immediately, I have
>> a problem
>>
>> Cheers,
>> Sean
>>
>> On 27 July 2016 at 04:01, Sergio A. de Carvalho Jr. <
>> scarvalh...@gmail.com> wrote:
>>
>>> The funny thing is that I just restarted the rsyslog daemon on the Ceph
>>> hosts and I can now re-enable syslog for Ceph without any issues. It just
>>> looks like the rsyslog service had a hiccup, possibly related to problem on
>>> one of the central syslog servers, and this in turn prevent the monitors to
>>> operate normally.
>>>
>>> It's just scary to think that your logging daemon can cause so much
>>> damage!
>>>
>>> On Tue, Jul 26, 2016 at 6:48 PM, Joao Eduardo Luis  wrote:
>>>
 On 07/26/2016 06:27 PM, Sergio A. de Carvalho Jr. wrote:

> (Just realised I originally replied to Sean directly, so reposting here
> for posterity).
>
> Bingo!
>

 wow. This didn't even cross my mind. D:

 Thanks for sharing.


> I turned off syslog and the monitors quickly reached quorum and
> everything seems back to normal. Thanks so much, Sean.
>
> Luckily this is a test cluster. I wonder how I could catch this in a
> production cluster before our support engineers spend a day trying to
> track the problem down.
>

 Only way I can see to deal with this sort of thing would be to log to
 syslog on a separate thread and have said thread monitoring the latency
 when writing to syslog.

 I don't think currently there's any support for that. I'll try to get
 something concocted this week, mostly for the fun of it.

   -Joao


> Any ideas?
>
> On Tue, Jul 26, 2016 at 12:28 PM, Sean Crosby
> mailto:richardnixonsh...@gmail.com>>
> wrote:
>
> Hi Sergio,
>
> You don't happen to have rsyslog forwarding to a central log server
> by any chance? I've seen this behaviour before when my central log
> server is not keeping up with messages.
>
> Cheers,
> Sean
>
> On 26 July 2016 at 21:13, Sergio A. de Carvalho Jr.
> mailto:scarvalh...@gmail.com>> wrote:
>
> I left the 4 nodes running overnight and they just crawled to
> their knees... to the point that nothing has been written to
> the
> logs in the last 11 hours. So I stopped all monitors this
> morning and started them one by one again, but they're are
> still
> being extremely slow. Here are their logs:
>
> https://gist.github.com/anonymous/85213467f701c5a69c7fdb4e54bc7406
>
> https://gist.github.com/anonymous/f30a8903e701423825fd4d5aaa651e6a
>
> https://gist.github.com/anonymous/42a1856cc819de5b110d9f887e9859d2
>
> https://gist.github.com/anonymous/652bc41197e83a9d76cf5b2e6a211aa2
>
> I'm still puzzled to see logs being written with a timestamp
> that is several minutes behind the system clock. As time
> passes,
> the gap widens and quickly the logs are over 10 minutes behind
> the actual time, which explains why the logs  above don't seem
> to overlap.
>
>
>
> On Mon, Jul 25, 2016 at 9:37 PM, Sergio A. de Carvalho Jr.
> mailto:scarvalh...@gmail.com>> wrote:
>
> Awesome, thanks so much, Joao.
>
> Here's the mon_status:
>
> https://gist.github.com/anonymous/2b80a9a75d134d9e539dfbc81615c055
>
> I'm still trying to collect the l

Re: [ceph-users] syslog broke my cluster

2016-07-27 Thread Sergio A. de Carvalho Jr.
I guess the point I was trying to make is that, ideally, Ceph would isolate
its logging system in a way that a problem with writing the logs wouldn't
affect the operation of the core Ceph services.

In my case, all other services running on the machine (ssh, ntp, cron,
etc.) are operating normally, even though the logs might not be getting
pushed out to the central syslog servers.

On Wed, Jul 27, 2016 at 4:49 AM, Brad Hubbard  wrote:

> On Tue, Jul 26, 2016 at 03:48:33PM +0100, Sergio A. de Carvalho Jr. wrote:
> > As per my previous messages on the list, I was having a strange problem
> in
> > my test cluster (Hammer 0.94.6, CentOS 6.5) where my monitors were
> > literally crawling to a halt, preventing them to ever reach quorum and
> > causing all sort of problems. As it turned out, to my surprise everything
> > went back to normal as soon as I turned off syslog -- special thanks to
> > Sean!
> >
> > The slowdown with syslog on was so severe that logs were being written
> with
> > a timestamp that was several minutes (and eventually up to hours) behind
> > the system clock. The logs from my 4 monitors can be seen in the links
> > below:
> >
> > https://gist.github.com/anonymous/85213467f701c5a69c7fdb4e54bc7406
> > https://gist.github.com/anonymous/f30a8903e701423825fd4d5aaa651e6a
> > https://gist.github.com/anonymous/42a1856cc819de5b110d9f887e9859d2
> > https://gist.github.com/anonymous/652bc41197e83a9d76cf5b2e6a211aa2
> >
> > I'm still trying to understand what is going on with my syslog servers
> but
> > I was wondering... is this a known/documented issue?
>
> If it is it would be known/documented by the syslog community right?
>
> >
> > Luckily this was a test cluster but I'm worried I could hit this on a
> > production cluster any time soon, and I'm wondering how I could detect it
> > before my support engineers lose their minds.
>
> This does not appear to be a ceph-specific issue and would likely affect
> any
> daemon that logs to syslog right?
>
> One thing you could try is running strace against the MON to see what
> system
> calls are taking a long time and extrapolate from there. The procedure
> would
> be the same if things were being held up by a slow disk (for whatever
> reason)
> or filesystem, etc. This is just a standard performance problem and not a
> ceph-specific issue.
>
> >
> > Thanks,
> >
> > Sergio
>
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
> --
> Cheers,
> Brad
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] syslog broke my cluster

2016-07-27 Thread Karsten Heymann
Hi,

The syslog socket will block if it can't deliver its logs. This happens,
for example, if logs are forwarded to a remote loghost via TCP and the
remote server becomes unavailable.
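
A crude probe to detect that situation on a host (a sketch; it assumes rsyslog
also writes to /var/log/messages locally):

  logger "syslog probe $(date +%s)"
  sleep 2
  grep -q "syslog probe" /var/log/messages && echo "syslog ok" || echo "syslog looks stuck"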

Best
Karsten
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Unsubscribe

2016-07-27 Thread Jimmy Stemple
[ceph-users] Unsubscribe

Sent from my iPhone

> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD host swap usage

2016-07-27 Thread George Shuklin
Check the NUMA status in the BIOS. Sometimes Linux swaps instead of
transferring tasks between NUMA nodes (inside one host).


Set it to "interleave" or "disable" to see the difference.


On 07/27/2016 11:21 AM, Kenneth Waegeman wrote:

Hi all,

When our OSD hosts are running for some time, we start see increased 
usage of swap on a number of them. Some OSDs don't use swap for weeks, 
while others has a full (4G) swap, and start filling swap again after 
we did a swapoff/swapon.
We have 8 8TB OSDS and 2 cache SSDs on each hosts, and 80GB of Memory. 
There is still about 15-20GB memory available when this happens. 
Running Centos7; We had swapiness set to 0. There is no client io 
right now, only scrubbing. some OSDs are using 20-80% of cpu.


Has somebody seen this behaviour? It doesn't have to be bad, but what 
could explain some hosts keep on swapping, and others don't?

Could this be some issue?

Thanks !!

Kenneth
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph performance pattern

2016-07-27 Thread RDS
I had a similar issue when migrating from SSD to NVMe using Ubuntu. Read
performance tanked using NVMe. Iostat showed each NVMe performing 30x more
physical reads compared to SSD, but the MB/s was 1/6 the speed of the SSD. I
set "blockdev --setra 128 /dev/nvmeX" and now performance is much better with
NVMe than with SSD. With our SSD and PCIe flash cards, we used --setra 0 since
those devices handle readahead internally. Our NVMe devices benefit from
setting --setra.
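
For reference, a sketch of checking and setting it (values are in 512-byte
sectors, so 128 = 64KiB; device names will differ on your systems):

  for dev in /dev/nvme?n1; do
      echo -n "$dev readahead: "; blockdev --getra "$dev"
  done
  blockdev --setra 128 /dev/nvme0n1   # the value that helped in our case
  # the same knob expressed in KiB is /sys/block/<dev>/queue/read_ahead_kb
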
Rick
> On Jul 26, 2016, at 8:09 PM, Somnath Roy  wrote:
> 
> << Ceph performance in general (without read_ahead_kb) will be lower 
> specially in all flash as the requests will be serialized within a PG
>  
> I meant to say Ceph sequential performance..Sorry for the spam..
>  
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Somnath Roy
> Sent: Tuesday, July 26, 2016 5:08 PM
> To: EP Komarla; ceph-users@lists.ceph.com 
> Subject: Re: [ceph-users] Ceph performance pattern
>  
> Not exactly, but we are seeing some drop with 256K compared to 64K. This is
> with random reads though, in Ubuntu. We had to bump up read_ahead_kb from the
> default 128KB to 512KB to work around that.
> But in RHEL we saw all sorts of issues with read_ahead_kb for small-block
> random reads, and I think it already defaults to 4MB or so. If so, try to
> reduce it to 512KB and see.
> Generally, for sequential reads, you need to play with read_ahead_kb to 
> achieve better performance. Ceph performance in general (without 
> read_ahead_kb) will be lower specially in all flash as the requests will be 
> serialized within a PG.
> Our test is with all flash though and take my comments with a grain of salt 
> in case of ceph + HDD..
>  
> Thanks & Regards
> Somnath
>  
>  
> From: EP Komarla [mailto:ep.koma...@flextronics.com]
> Sent: Tuesday, July 26, 2016 4:50 PM
> To: Somnath Roy; ceph-users@lists.ceph.com 
> Subject: RE: Ceph performance pattern
>  
> Thanks Somnath.  
>  
> I am running with CentOS7.2.  Have you seen this pattern before?
>  
> - epk
> From: Somnath Roy [mailto:somnath@sandisk.com]
> Sent: Tuesday, July 26, 2016 4:44 PM
> To: EP Komarla; ceph-users@lists.ceph.com
> Subject: RE: Ceph performance pattern
>  
> Which OS/kernel you are running with ?
> Try setting bigger read_ahead_kb for sequential runs.
>  
> Thanks & Regards
> Somnath
>  
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of EP Komarla
> Sent: Tuesday, July 26, 2016 4:38 PM
> To: ceph-users@lists.ceph.com 
> Subject: [ceph-users] Ceph performance pattern
>  
> Hi,
>  
> I am showing below fio results for Sequential Read on my Ceph cluster.  I am 
> trying to understand this pattern:
>  
> - why there is a dip in the performance for block sizes 32k-256k?
> - is this an expected performance graph?
> - have you seen this kind of pattern before
>  
> 
>  
> My cluster details:
> Ceph: Hammer release
> Cluster: 6 nodes (dual Intel sockets) each with 20 OSDs and 4 SSDs (5 OSD 
> journals on one SSD)
> Client network: 10Gbps
> Cluster network: 10Gbps
> FIO test:
> - 2 Client servers
> - Sequential Read
> - Run time of 600 seconds
> - Filesize = 1TB
> - 10 rbd images per client
> - Queue depth=16
>  
> Any ideas on tuning this cluster?  Where should I look first?
>  
> Thanks,
>  
> - epk
>  
> 
> Legal Disclaimer:
> The information contained in this message may be privileged and confidential. 
> It is intended to be read only by the individual or entity to whom it is 
> addressed or by their designee. If the reader of this message is not the 
> intended recipient, you are on notice that any distribution of this message, 
> in any form, is strictly prohibited. If you have received this message in 
> error, please immediately notify the sender and delete or destroy any copy of 
> this message!
> PLEASE NOTE: The information contained in this electronic mail message is 
> intended only for the use of the designated recipient(s) named above. If the 
> reader of this message is not the intended recipient, you are hereby notified 
> that you have received this message in error and that any review, 
> dissemination, distribution, or copying of this message is strictly 
> prohibited. If you have received this communication in error, please notify 
> the sender by telephone or e-mail (as shown above) immediately and destroy 
> any and all copies of this message in your possession (whether hard copies or 
> electronically stored copies).
> 
> Legal Disclaimer:
> The information contained in this message may be privileged and confidential. 
> It is intended to be read only by the individual or enti

Re: [ceph-users] OSD host swap usage

2016-07-27 Thread Kenneth Waegeman



On 27/07/16 10:59, Christian Balzer wrote:

Hello,

On Wed, 27 Jul 2016 10:21:34 +0200 Kenneth Waegeman wrote:


Hi all,

When our OSD hosts are running for some time, we start see increased
usage of swap on a number of them. Some OSDs don't use swap for weeks,
while others has a full (4G) swap, and start filling swap again after we
did a swapoff/swapon.

Obvious first question would be, are all these hosts really the same, HW,
SW and configuration wise?
They have the same hardware, are configured the same through config mgt 
with ceph 10.2.2 and kernel 3.10.0-327.18.2.el7.ug.x86_64



We have 8 8TB OSDS and 2 cache SSDs on each hosts, and 80GB of Memory.

How full are these OSDs?
I'm interested in # of files, not space, so a "df -i" should give us some idea.


Filesystem      Inodes    IUsed     IFree     IUse% Mounted on
/dev/sdm7     19832320    50068  19782252       1% /var/lib/ceph/osd/cache/sdm
/dev/md124   194557760 19620569 174937191      11% /var/lib/ceph/osd/sdk0sdl
/dev/md117   194557760 20377826 174179934      11% /var/lib/ceph/osd/sdc0sdd
/dev/md127   194557760 21453957 173103803      12% /var/lib/ceph/osd/sda0sdb
/dev/md121   194557760 20270844 174286916      11% /var/lib/ceph/osd/sdq0sdr
/dev/md118   194557760 20476860 174080900      11% /var/lib/ceph/osd/sde0sdf
/dev/md120   194557760 19939165 174618595      11% /var/lib/ceph/osd/sdo0sdp
/dev/md113   194557760 22098382 172459378      12% /var/lib/ceph/osd/sdg0sdh
/dev/md112   194557760 18209988 176347772      10% /var/lib/ceph/osd/sdi0sdj
/dev/sdn7     19930624    47087  19883537       1% /var/lib/ceph/osd/cache/sdn





80GB is an odd number, how are the DIMMs distributed among the CPU(s)?

Only 1 socket:

Machine (79GB)
  Socket L#0 + L3 L#0 (20MB)
L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
  PU L#0 (P#0)
  PU L#1 (P#8)
L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
  PU L#2 (P#1)
  PU L#3 (P#9)
L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
  PU L#4 (P#2)
  PU L#5 (P#10)
L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
  PU L#6 (P#3)
  PU L#7 (P#11)
L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
  PU L#8 (P#4)
  PU L#9 (P#12)
L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
  PU L#10 (P#5)
  PU L#11 (P#13)
L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
  PU L#12 (P#6)
  PU L#13 (P#14)
L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
  PU L#14 (P#7)
  PU L#15 (P#15)

3 DIMMs of 16GB + 1 DIMM of 8GB in the first set of slots, 3 DIMMs of 8GB in the
second set (as per our vendor's manual)





There is still about 15-20GB memory available when this happens. Running
Centos7;

How do you define free memory?
Not used at all?
I'd expect any Ceph storage server to use all "free" RAM for SLAB and
pagecache very quickly, at the latest after the first deep scrub.
%Cpu(s):  5.3 us,  0.1 sy,  0.0 ni, 94.1 id,  0.5 wa,  0.0 hi,  0.0 si,  
0.0 st

KiB Mem : 82375104 total,  7037032 free, 41117768 used, 34220308 buff/cache
KiB Swap:  4194300 total,  3666416 free,   527884 used. 15115612 avail Mem

    PID USER  PR  NI    VIRT    RES   SHR S  %CPU %MEM    TIME+ COMMAND
3979408 ceph  20   0 4115960 1.079g  5912 S  85.1  1.4  7174:16 ceph-osd
3979417 ceph  20   0 3843488 967424  6076 S   1.7  1.2  7114:34 ceph-osd
3979410 ceph  20   0 4089372 1.085g  5964 S   1.3  1.4  9072:56 ceph-osd
3979419 ceph  20   0 4345000 1.116g  6168 S   1.3  1.4  9151:36 ceph-osd


If it is really unused AND your system is swapping, something odd is going
on indeed, maybe something NUMA related that prevents part of your memory
from being used.

Of course this could also be an issue with your CentOS kernel, I'm
definitely not seeing anything like this on any of my machines.


We had swapiness set to 0.

I wouldn't set it lower than 1.
Also any other tuning settings, like vm/vfs_cache_pressure and
vm/min_free_kbytes?


vfs_cache_pressure is on the default 100,
vm.min_free_kbytes=3145728

other tuned settings:

fs.file-max=262144
kernel.msgmax=65536
kernel.msgmnb=65536
kernel.msgmni=1024
kernel.pid_max=4194303
kernel.sem=250 32000 100 1024
kernel.shmall=20971520
kernel.shmmax=34359738368
kernel.shmmni=16384
net.core.netdev_max_backlog=25
net.core.rmem_default=262144
net.core.rmem_max=4194304
net.core.somaxconn=1024
net.core.wmem_default=262144
net.core.wmem_max=4194304
net.ipv4.conf.all.arp_filter=1
net.ipv4.ip_local_port_range=32768 61000
net.ipv4.neigh.default.base_reachable_time=14400
net.ipv4.neigh.default.gc_interval=14400
net.ipv4.neigh.default.gc_stale_time=14400
net.ipv4.neigh.default.gc_thresh1=2048
net.ipv4.neigh.default.gc_thresh2

[ceph-users] Cleaning Up Failed Multipart Uploads

2016-07-27 Thread Brian Felton
Greetings,

Background: If an object storage client re-uploads parts to a multipart
object, RadosGW does not clean up all of the parts properly when the
multipart upload is aborted or completed.  You can read all of the gory
details (including reproduction steps) in this bug report:
http://tracker.ceph.com/issues/16767.

My setup: Hammer 0.94.6 cluster only used for S3-compatible object
storage.  RGW stripe size is 4MiB.

My problem: I have buckets that are reporting TB more utilization (and, in
one case, 200k more objects) than they should report.  I am trying to
remove the detritus from the multipart uploads, but removing the leftover
parts directly from the .rgw.buckets pool is having no effect on bucket
utilization (i.e. neither the object count nor the space used are
declining).

To give an example, I have a client that uploaded a very large multipart
object (8000 15MiB parts).  Due to a bug in the client, it uploaded each of
the 8000 parts 6 times.  After the sixth attempt, it gave up and aborted
the upload, at which point RGW removed the 8000 parts from the sixth
attempt.  When I list the bucket's contents with radosgw-admin
(radosgw-admin bucket list --bucket= --max-entries=), I see all of the object's 8000 parts five separate times, each
under a namespace of 'multipart'.

Since the multipart upload was aborted, I can't remove the object by name
via the S3 interface.  Since my RGW stripe size is 4MiB, I know that each
part of the object will be stored across 4 entries in the .rgw.buckets pool
-- 4 MiB in a 'multipart' file, and 4, 4, and 3 MiB in three successive
'shadow' files.  I've created a script to remove these parts (rados -p
.rgw.buckets rm __multipart_. and rados -p
.rgw.buckets rm __shadow_..[1-3]).  The
removes are completing successfully (in that additional attempts to remove
the object result in a failure), but I'm not seeing any decrease in the
bucket's space used, nor am I seeing a decrease in the bucket's object
count.  In fact, if I do another 'bucket list', all of the removed parts
are still included.
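
For completeness, a condensed sketch of the removal loop described above (the
grep pattern is illustrative; the real part names also carry the bucket id,
upload id and part number):

  rados -p .rgw.buckets ls > all_objects.txt
  grep -E '__multipart_|__shadow_' all_objects.txt > leftover_parts.txt
  while read -r obj; do
      rados -p .rgw.buckets rm "$obj"
  done < leftover_parts.txt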

I've looked at the output of 'gc list --include-all', and the removed parts
are never showing up for garbage collection.  Garbage collection is
otherwise functioning normally and will successfully remove data for any
object properly removed via the S3 interface.

I've also gone so far as to write a script to list the contents of bucket
shards in the .rgw.buckets.index pool, check for the existence of the entry
in .rgw.buckets, and remove entries that cannot be found, but that is also
failing to decrement the size/object count counters.

What am I missing here?  Where, aside from .rgw.buckets and
.rgw.buckets.index is RGW looking to determine object count and space used
for a bucket?

Many thanks to any and all who can assist.

Brian Felton
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Monitors not reaching quorum

2016-07-27 Thread Sergio A. de Carvalho Jr.
Got it.

Are you sending logs to the central syslog servers via TCP (@@) or
UDP (@)?

I just realised that my test cluster sends logs via UDP to our usual
central syslog server (as our production hosts normally do), but it is
also configured to send logs via TCP to a testing Logstash VM. My suspicion
at the moment is that this VM isn't handling the volume of logs and could
be blocking rsyslog.
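
For the record, a quick way to confirm how rsyslog forwards (a sketch; @@host
entries are TCP forwarding, which can block, a single @host is UDP):

  grep -rn "^[^#]*@" /etc/rsyslog.conf /etc/rsyslog.d/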


On Wed, Jul 27, 2016 at 12:04 PM, Sean Crosby 
wrote:

> Oh, my problems weren't on Ceph nodes. I've seen this problem on non-Ceph
> nodes. The symptoms you had of unexplained weirdness with services (in your
> case, Ceph), and syslog lagging 10mins behind just reminded me of symptoms
> I've seen before where the sending of syslog messages to a central syslog
> server got stuck, and caused unusual problems on the host.
>
> Cheers,
> Sean
>
>
> On 27 July 2016 at 20:57, Sergio A. de Carvalho Jr.  > wrote:
>
>> In my case, everything else running on the host seems to be okay. I'm
>> wondering if the other problems you see aren't a side-effect of Ceph
>> services running slow?
>>
>> What do you do to get around the problem when it happens? Disable syslog
>> in Ceph?
>>
>> What version of Ceph and OS are you using?
>>
>> On Wed, Jul 27, 2016 at 12:47 AM, Sean Crosby 
>> wrote:
>>
>>> Agreed. When I first had these problems, random stuff would just not
>>> work. SSH would take a while to log in, DNS server would process requests
>>> slow, our Batch system would freeze and not run jobs. It's now one of my
>>> first things to check when services are running weirdly.
>>>
>>> My failsafe check is to do
>>>
>>> # logger "sean test"
>>>
>>> and see if it appears in syslog. If it doesn't do it immediately, I have
>>> a problem
>>>
>>> Cheers,
>>> Sean
>>>
>>> On 27 July 2016 at 04:01, Sergio A. de Carvalho Jr. <
>>> scarvalh...@gmail.com> wrote:
>>>
 The funny thing is that I just restarted the rsyslog daemon on the Ceph
 hosts and I can now re-enable syslog for Ceph without any issues. It just
 looks like the rsyslog service had a hiccup, possibly related to problem on
 one of the central syslog servers, and this in turn prevent the monitors to
 operate normally.

 It's just scary to think that your logging daemon can cause so much
 damage!

 On Tue, Jul 26, 2016 at 6:48 PM, Joao Eduardo Luis 
 wrote:

> On 07/26/2016 06:27 PM, Sergio A. de Carvalho Jr. wrote:
>
>> (Just realised I originally replied to Sean directly, so reposting
>> here
>> for posterity).
>>
>> Bingo!
>>
>
> wow. This didn't even cross my mind. D:
>
> Thanks for sharing.
>
>
>> I turned off syslog and the monitors quickly reached quorum and
>> everything seems back to normal. Thanks so much, Sean.
>>
>> Luckily this is a test cluster. I wonder how I could catch this in a
>> production cluster before our support engineers spend a day trying to
>> track the problem down.
>>
>
> Only way I can see to deal with this sort of thing would be to log to
> syslog on a separate thread and have said thread monitoring the latency
> when writing to syslog.
>
> I don't think currently there's any support for that. I'll try to get
> something concocted this week, mostly for the fun of it.
>
>   -Joao
>
>
>> Any ideas?
>>
>> On Tue, Jul 26, 2016 at 12:28 PM, Sean Crosby
>> mailto:richardnixonsh...@gmail.com>>
>> wrote:
>>
>> Hi Sergio,
>>
>> You don't happen to have rsyslog forwarding to a central log
>> server
>> by any chance? I've seen this behaviour before when my central log
>> server is not keeping up with messages.
>>
>> Cheers,
>> Sean
>>
>> On 26 July 2016 at 21:13, Sergio A. de Carvalho Jr.
>> mailto:scarvalh...@gmail.com>> wrote:
>>
>> I left the 4 nodes running overnight and they just crawled to
>> their knees... to the point that nothing has been written to
>> the
>> logs in the last 11 hours. So I stopped all monitors this
>> morning and started them one by one again, but they're are
>> still
>> being extremely slow. Here are their logs:
>>
>> https://gist.github.com/anonymous/85213467f701c5a69c7fdb4e54bc7406
>>
>> https://gist.github.com/anonymous/f30a8903e701423825fd4d5aaa651e6a
>>
>> https://gist.github.com/anonymous/42a1856cc819de5b110d9f887e9859d2
>>
>> https://gist.github.com/anonymous/652bc41197e83a9d76cf5b2e6a211aa2
>>
>> I'm still puzzled to see logs being written with a timestamp
>> that is several minutes

[ceph-users] Listing objects in a specified placement group / OSD

2016-07-27 Thread David Blundell
Hi,

I wasn't sure if this is a ceph-users or ceph-devel question as it's about the 
API (users) but the answer may involve me writing a RADOS method (devel).

At the moment in Ceph Jewel I can find which objects are held in an OSD or 
placement group by looking on the filesystem under 
/var/lib/ceph/osd/ceph-*/current

This requires access to the OSD host and may well break when using Bluestore if 
there is no filesystem to look through.  I would like to be able to list 
objects in a specified PG/OSD from outside of the OSD host using Ceph commands.

I can list all PGs hosted on OSD 1 using "ceph pg ls-by-osd osd.1" and could 
loop through this output if there was a way to list the objects in a PG.

I have checked the API and librados docs (I would be happy to hack something 
together using librados) and can't see any obvious way to list the objects in a 
PG.
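
The only approach I can see from outside the OSD host is a brute-force sketch
like the one below (pool and PG id are placeholders), which is obviously far
too slow on large pools:

  POOL=rbd
  PG=1.6f
  rados -p "$POOL" ls | while read -r obj; do
      ceph osd map "$POOL" "$obj" | grep -q "($PG)" && echo "$obj"
  done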

I have seen a post on this mailing list from Ilya last September saying:
"Internally there is a way to list objects within a specific PG (actually more 
than one way IIRC), but I don't think anything like that is exposed in a CLI 
(it might be exposed in librados though)."

but could not find any follow up posts with details.

Does anyone have any more details on these internal methods and how to call 
them?

Cheers,

David
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph libaio queue depth understanding

2016-07-27 Thread nick
Hi,
we would like to write a testplan to benchmark our ceph cluster. We want to 
use fio for it.

According to an article from Sebastian Han [1] ceph is using libaio with 
O_DIRECT for writing data to the journal. In a different blog article [2] I 
read that ceph is using D_SYNC as well for this. This basically means it is 
using a queue depth of 1 (issue one IO request and wait for it to be done), 
right? Testing this with fio can be done by using the params direct=1 and 
iodepth=1 with engine=libaio.
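
As a concrete example, this is the kind of fio job I have in mind for the
journal-style writes (a sketch; the device path is a placeholder, and sync=1
opens with O_SYNC, which only approximates the O_DSYNC behaviour described in
[2]):

  fio --name=journal-sim --filename=/dev/sdX --ioengine=libaio \
      --direct=1 --sync=1 --iodepth=1 --rw=write --bs=4k \
      --runtime=60 --time_based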

After this the journal gets flushed to the OSD disk. This time buffered IO is 
used (in fio terms: direct=0). My question is:
Which iodepth is used for this (so which value to use in fio)?

In the source code of ceph I can see that the io_setup() function gets called 
with '128' concurrent events available. So should I use iodepth=128 in fio for 
this?

Maybe I do have a wrong understanding of async IO as well :-)

Thanks for any clarification of this topic

Cheers
Nick
 
[1] 
https://www.sebastien-han.fr/blog/2013/10/03/quick-analysis-of-the-ceph-io-layer

[2] http://bryanapperson.com/blog/ceph-raw-disk-performance-testing/

-- 
Sebastian Nickel
Nine Internet Solutions AG, Albisriederstr. 243a, CH-8047 Zuerich
Tel +41 44 637 40 00 | Support +41 44 637 40 40 | www.nine.ch

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph performance pattern

2016-07-27 Thread EP Komarla
I am using aio engine in fio.

Fio is working on rbd images

- epk

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Mark 
Nelson
Sent: Tuesday, July 26, 2016 6:27 PM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph performance pattern

Hi epk,

Which ioengine are you using?  if it's librbd, you might try playing with 
librbd readahead as well:

# don't disable readahead after a certain number of bytes
rbd readahead disable after bytes = 0

# Set the librbd readahead to whatever:
rbd readahead max bytes = 4194304

If it's with kvm+guests, you may be better off playing with the guest readahead 
but you can try the librbd readahead if you want.

Another thing to watch out for is fragmentation.  btrfs OSDs for example will 
fragment terribly after small random writes to RBD images due to how 
copy-on-write works.  That can cause havoc with RBD sequential reads in general.

Mark


On 07/26/2016 06:38 PM, EP Komarla wrote:
> Hi,
>
>
>
> I am showing below fio results for Sequential Read on my Ceph cluster.
> I am trying to understand this pattern:
>
>
>
> - why there is a dip in the performance for block sizes 32k-256k?
>
> - is this an expected performance graph?
>
> - have you seen this kind of pattern before
>
>
>
>
>
> My cluster details:
>
> Ceph: Hammer release
>
> Cluster: 6 nodes (dual Intel sockets) each with 20 OSDs and 4 SSDs (5 
> OSD journals on one SSD)
>
> Client network: 10Gbps
>
> Cluster network: 10Gbps
>
> FIO test:
>
> - 2 Client servers
>
> - Sequential Read
>
> - Run time of 600 seconds
>
> - Filesize = 1TB
>
> - 10 rbd images per client
>
> - Queue depth=16
>
>
>
> Any ideas on tuning this cluster?  Where should I look first?
>
>
>
> Thanks,
>
>
>
> - epk
>
>
>
>
> Legal Disclaimer:
> The information contained in this message may be privileged and 
> confidential. It is intended to be read only by the individual or 
> entity to whom it is addressed or by their designee. If the reader of 
> this message is not the intended recipient, you are on notice that any 
> distribution of this message, in any form, is strictly prohibited. If 
> you have received this message in error, please immediately notify the 
> sender and delete or destroy any copy of this message!
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Legal Disclaimer:
The information contained in this message may be privileged and confidential. 
It is intended to be read only by the individual or entity to whom it is 
addressed or by their designee. If the reader of this message is not the 
intended recipient, you are on notice that any distribution of this message, in 
any form, is strictly prohibited. If you have received this message in error, 
please immediately notify the sender and delete or destroy any copy of this 
message!
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to get Active set of OSD Map in serial order of osd index

2016-07-27 Thread Samuel Just
Think of the osd numbers as names.  The plugin interface doesn't even
tell you which shard maps to which osd.  Why would it make a
difference?
-Sam

On Wed, Jul 27, 2016 at 12:45 AM, Syed Hussain  wrote:
> Fundamentally, I wanted to know what chunks are allocated in which OSDs.
> This way I can preserve the array structure required for my
> Erasure Code. If all the chunks are placed in randomly ordered OSDs (like in
> Jerasure or ISA) then I loss that array structure required in the
> Encoding/Decoding algorithm of my Plugin.
> I'm trying to develop an Erasure Code plugin for RDP (or RAID-DP) kind of
> code.
>
> Thanks,
> Syed
>
> On Wed, Jul 27, 2016 at 4:12 AM, Samuel Just  wrote:
>>
>> Why do you want them in serial increasing order?
>> -Sam
>>
>> On Tue, Jul 26, 2016 at 2:43 PM, Samuel Just  wrote:
>>>
>>> How would such a code work if there were more than 24 osds?
>>> -Sam
>>>
>>> On Tue, Jul 26, 2016 at 2:37 PM, Syed Hussain  wrote:

 Hi,

 I'm working to develop an Erasure Code plugin (variation of ISA) that
 have typical requirement that the active set of the Erasure Coded pool in
 serial order.
 For example,

 
 >ceph osd erasure-code-profile set reed_k16m8_isa k=16 m=8 plugin=isa
 > technique=reed_sol_van ruleset-failure-domain=osd
 >ceph osd pool create reed_k16m8_isa_pool 128 128 erasure reed_k16m8_isa
 >echo "ABCDEFGHIABCDEFGHIABCDEFGHIABCDEFGHIABCDEFGHIABCDEFGHI" | rados
 > --pool reed_k16m8_isa_pool put myobj16_8 -
 >ceph osd map reed_k16m8_isa_pool myobj16_8
 osdmap e86 pool 'reed_k16m8_isa_pool' (1) object 'myobj16_8' -> pg
 1.cf6ec86f (1.6f) -> up
 ([4,23,22,10,9,11,15,6,19,1,7,8,17,21,16,14,18,12,13,20,3,5,0,2], p4) 
 acting
 ([4,23,22,10,9,11,15,6,19,1,7,8,17,21,16,14,18,12,13,20,3,5,0,2], p4)

 

 That means the chunks 0, 1, 2, ...23 of the erasure coding are saved int
 osd 4, 23, 22, 10, ...2 respectively as per the order given in the active
 set.

 Now my question is how I'll be able to get the PG map for object
 myobj16_8 having active set as: [0, 1, 2, ...23] so that the i-th chunk of
 the Erasure Coded object saves into
 i-th osd.

 Is there any option available in "ceph osd pool create" to do it?
 Or there may be other way available to accomplish this case.

 Appreciate your suggestions..

 Thanks,
 Syed Hussain
 NetWorld

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

>>>
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Error with instance snapshot in ceph storage : Image Pending Upload state.

2016-07-27 Thread Gaurav Goyal
Dear Ceph Team,

I am trying to take snapshot of my instance.

The image was stuck in the Queued state and the instance was stuck in the Image
Pending Upload state.

I had to manually cancel the job as it had not been working for the last hour;
my instance is still in the Image Pending Upload state.

Is there something wrong with my Ceph configuration?
Can I take snapshots with Ceph storage?

Regards
Gaurav Goyal
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph performance pattern

2016-07-27 Thread Mark Nelson
Ok.  Are you using O_DIRECT?  That will disable readahead on the client, 
but if you don't use O_DIRECT you won't get the benefit of iodepth=16. 
See fio's man page:


"Number of I/O units to keep in flight against the file. Note that 
increasing iodepth beyond 1 will not affect synchronous ioengines 
(except for small degress when verify_async is in use). Even async 
engines my impose OS restrictions causing the desired depth not to be 
achieved. This may happen on Linux when using libaio and not setting 
direct=1, since buffered IO is not async on that OS. Keep an eye on the 
IO depth distribution in the fio output to verify that the achieved 
depth is as expected. Default: 1."


IE, how you are testing could really affect the ability to do 
client-side readahead and may affect how much client-side concurrency 
you are getting.
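
For reference, the kind of job I mean looks roughly like this (libaio plus
direct=1 so the iodepth actually applies; the device path and sizes are just
examples, not a recommendation):

  [seq-read]
  ioengine=libaio
  direct=1
  rw=read
  bs=64k
  iodepth=16
  runtime=600
  time_based
  filename=/dev/rbd0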


Mark

On 07/27/2016 10:14 AM, EP Komarla wrote:

I am using aio engine in fio.

Fio is working on rbd images

- epk

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Mark 
Nelson
Sent: Tuesday, July 26, 2016 6:27 PM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph performance pattern

Hi epk,

Which ioengine are you using?  if it's librbd, you might try playing with 
librbd readahead as well:

# don't disable readahead after a certain number of bytes
rbd readahead disable after bytes = 0

# Set the librbd readahead to whatever:
rbd readahead max bytes = 4194304

If it's with kvm+guests, you may be better off playing with the guest readahead 
but you can try the librbd readahead if you want.

Another thing to watch out for is fragmentation.  btrfs OSDs for example will 
fragment terribly after small random writes to RBD images due to how 
copy-on-write works.  That can cause havoc with RBD sequential reads in general.

Mark


On 07/26/2016 06:38 PM, EP Komarla wrote:

Hi,



I am showing below fio results for Sequential Read on my Ceph cluster.
I am trying to understand this pattern:



- why there is a dip in the performance for block sizes 32k-256k?

- is this an expected performance graph?

- have you seen this kind of pattern before





My cluster details:

Ceph: Hammer release

Cluster: 6 nodes (dual Intel sockets) each with 20 OSDs and 4 SSDs (5
OSD journals on one SSD)

Client network: 10Gbps

Cluster network: 10Gbps

FIO test:

- 2 Client servers

- Sequential Read

- Run time of 600 seconds

- Filesize = 1TB

- 10 rbd images per client

- Queue depth=16



Any ideas on tuning this cluster?  Where should I look first?



Thanks,



- epk




Legal Disclaimer:
The information contained in this message may be privileged and
confidential. It is intended to be read only by the individual or
entity to whom it is addressed or by their designee. If the reader of
this message is not the intended recipient, you are on notice that any
distribution of this message, in any form, is strictly prohibited. If
you have received this message in error, please immediately notify the
sender and delete or destroy any copy of this message!


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Legal Disclaimer:
The information contained in this message may be privileged and confidential. 
It is intended to be read only by the individual or entity to whom it is 
addressed or by their designee. If the reader of this message is not the 
intended recipient, you are on notice that any distribution of this message, in 
any form, is strictly prohibited. If you have received this message in error, 
please immediately notify the sender and delete or destroy any copy of this 
message!


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph performance pattern

2016-07-27 Thread EP Komarla
I am using O_DIRECT=1

-Original Message-
From: Mark Nelson [mailto:mnel...@redhat.com] 
Sent: Wednesday, July 27, 2016 8:33 AM
To: EP Komarla ; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph performance pattern

Ok.  Are you using O_DIRECT?  That will disable readahead on the client, but if 
you don't use O_DIRECT you won't get the benefit of iodepth=16. 
See fio's man page:

"Number of I/O units to keep in flight against the file. Note that increasing 
iodepth beyond 1 will not affect synchronous ioengines (except for small 
degress when verify_async is in use). Even async engines my impose OS 
restrictions causing the desired depth not to be achieved. This may happen on 
Linux when using libaio and not setting direct=1, since buffered IO is not 
async on that OS. Keep an eye on the IO depth distribution in the fio output to 
verify that the achieved depth is as expected. Default: 1."

IE, how you are testing could really affect the ability to do client-side 
readahead and may affect how much client-side concurrency you are getting.

Mark

On 07/27/2016 10:14 AM, EP Komarla wrote:
> I am using aio engine in fio.
>
> Fio is working on rbd images
>
> - epk
>
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf 
> Of Mark Nelson
> Sent: Tuesday, July 26, 2016 6:27 PM
> To: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Ceph performance pattern
>
> Hi epk,
>
> Which ioengine are you using?  if it's librbd, you might try playing with 
> librbd readahead as well:
>
> # don't disable readahead after a certain number of bytes rbd 
> readahead disable after bytes = 0
>
> # Set the librbd readahead to whatever:
> rbd readahead max bytes = 4194304
>
> If it's with kvm+guests, you may be better off playing with the guest 
> readahead but you can try the librbd readahead if you want.
>
> Another thing to watch out for is fragmentation.  btrfs OSDs for example will 
> fragment terribly after small random writes to RBD images due to how 
> copy-on-write works.  That can cause havoc with RBD sequential reads in 
> general.
>
> Mark
>
>
> On 07/26/2016 06:38 PM, EP Komarla wrote:
>> Hi,
>>
>>
>>
>> I am showing below fio results for Sequential Read on my Ceph cluster.
>> I am trying to understand this pattern:
>>
>>
>>
>> - why there is a dip in the performance for block sizes 32k-256k?
>>
>> - is this an expected performance graph?
>>
>> - have you seen this kind of pattern before
>>
>>
>>
>>
>>
>> My cluster details:
>>
>> Ceph: Hammer release
>>
>> Cluster: 6 nodes (dual Intel sockets) each with 20 OSDs and 4 SSDs (5 
>> OSD journals on one SSD)
>>
>> Client network: 10Gbps
>>
>> Cluster network: 10Gbps
>>
>> FIO test:
>>
>> - 2 Client servers
>>
>> - Sequential Read
>>
>> - Run time of 600 seconds
>>
>> - Filesize = 1TB
>>
>> - 10 rbd images per client
>>
>> - Queue depth=16
>>
>>
>>
>> Any ideas on tuning this cluster?  Where should I look first?
>>
>>
>>
>> Thanks,
>>
>>
>>
>> - epk
>>
>>
>>
>>
>> Legal Disclaimer:
>> The information contained in this message may be privileged and 
>> confidential. It is intended to be read only by the individual or 
>> entity to whom it is addressed or by their designee. If the reader of 
>> this message is not the intended recipient, you are on notice that 
>> any distribution of this message, in any form, is strictly 
>> prohibited. If you have received this message in error, please 
>> immediately notify the sender and delete or destroy any copy of this message!
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> Legal Disclaimer:
> The information contained in this message may be privileged and confidential. 
> It is intended to be read only by the individual or entity to whom it is 
> addressed or by their designee. If the reader of this message is not the 
> intended recipient, you are on notice that any distribution of this message, 
> in any form, is strictly prohibited. If you have received this message in 
> error, please immediately notify the sender and delete or destroy any copy of 
> this message!
>

Legal Disclaimer:
The information contained in this message may be privileged and confidential. 
It is intended to be read only by the individual or entity to whom it is 
addressed or by their designee. If the reader of this message is not the 
intended recipient, you are on notice that any distribution of this message, in 
any form, is strictly prohibited. If you have received this message in error, 
please immediately notify the sender and delete or destroy any copy of this 
message!
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/c

Re: [ceph-users] [Scst-devel] Thin Provisioning and Ceph RBD's

2016-07-27 Thread Alex Gorbachev
Hi Vlad,

On Mon, Jul 25, 2016 at 10:44 PM, Vladislav Bolkhovitin  wrote:
> Hi,
>
> I would suggest to rebuild SCST in the debug mode (after "make 2debug"), then 
> before
> calling the unmap command enable "scsi" and "debug" logging for scst and 
> scst_vdisk
> modules by 'echo add scsi >/sys/kernel/scst_tgt/trace_level; echo "add scsi"
>>/sys/kernel/scst_tgt/handlers/vdisk_fileio/trace_level; echo "add debug"
>>/sys/kernel/scst_tgt/handlers/vdisk_fileio/trace_level', then check, if for 
>>the unmap
> command vdisk_unmap_range() is reporting running blkdev_issue_discard() in 
> the kernel
> logs.
>
> To double check, you might also add trace statement just before 
> blkdev_issue_discard()
> in vdisk_unmap_range().

With the debug settings on, I am seeing the below output - this means
that discard is being sent to the backing (RBD) device, correct?

Including the ceph-users list to see if there is a reason RBD is not
processing this discard/unmap.

Thank you,
--
Alex Gorbachev
Storcium

Jul 26 08:23:38 e1 kernel: [  858.324715] [20426]: scst:
scst_cmd_done_local:2272:cmd 88201b552940, status 0, msg_status 0,
host_status 0, driver_status 0, resp_data_len 0
Jul 26 08:23:38 e1 kernel: [  858.324740] [20426]:
vdisk_parse_offset:2930:cmd 88201b552c00, lba_start 0, loff 0,
data_len 24
Jul 26 08:23:38 e1 kernel: [  858.324743] [20426]:
vdisk_unmap_range:3810:Unmapping lba 61779968 (blocks 8192)
Jul 26 08:23:38 e1 kernel: [  858.336218] [20426]: scst:
scst_cmd_done_local:2272:cmd 88201b552c00, status 0, msg_status 0,
host_status 0, driver_status 0, resp_data_len 0
Jul 26 08:23:38 e1 kernel: [  858.336232] [20426]:
vdisk_parse_offset:2930:cmd 88201b552ec0, lba_start 0, loff 0,
data_len 24
Jul 26 08:23:38 e1 kernel: [  858.336234] [20426]:
vdisk_unmap_range:3810:Unmapping lba 61788160 (blocks 8192)
Jul 26 08:23:38 e1 kernel: [  858.351446] [20426]: scst:
scst_cmd_done_local:2272:cmd 88201b552ec0, status 0, msg_status 0,
host_status 0, driver_status 0, resp_data_len 0
Jul 26 08:23:38 e1 kernel: [  858.351468] [20426]:
vdisk_parse_offset:2930:cmd 88201b553180, lba_start 0, loff 0,
data_len 24
Jul 26 08:23:38 e1 kernel: [  858.351471] [20426]:
vdisk_unmap_range:3810:Unmapping lba 61796352 (blocks 8192)
Jul 26 08:23:38 e1 kernel: [  858.373407] [20426]: scst:
scst_cmd_done_local:2272:cmd 88201b553180, status 0, msg_status 0,
host_status 0, driver_status 0, resp_data_len 0
Jul 26 08:23:38 e1 kernel: [  858.373422] [20426]:
vdisk_parse_offset:2930:cmd 88201b553440, lba_start 0, loff 0,
data_len 24
Jul 26 08:23:38 e1 kernel: [  858.373424] [20426]:
vdisk_unmap_range:3810:Unmapping lba 61804544 (blocks 8192)

Jul 26 08:24:04 e1 kernel: [  884.170201] [6290]: scst_cmd_init_done:829:CDB:
Jul 26 08:24:04 e1 kernel: [  884.170202]
(h)___0__1__2__3__4__5__6__7__8__9__A__B__C__D__E__F
Jul 26 08:24:04 e1 kernel: [  884.170205]0: 42 00 00 00 00 00 00
00 18 00 00 00 00 00 00 00   B...
Jul 26 08:24:04 e1 kernel: [  884.170268] [6290]: scst:
scst_parse_cmd:1312:op_name  (cmd 88201b556300),
direction=1 (expected 1, set yes), lba=0, bufflen=24, data len 24,
out_bufflen=0, (expected len data 24, expected len DIF 0, out expected
len 0), flags=0x80260, internal 0, naca 0
Jul 26 08:24:04 e1 kernel: [  884.173983] [20426]: scst:
scst_cmd_done_local:2272:cmd 88201b556b40, status 0, msg_status 0,
host_status 0, driver_status 0, resp_data_len 0
Jul 26 08:24:04 e1 kernel: [  884.173998] [20426]:
vdisk_parse_offset:2930:cmd 88201b556e00, lba_start 0, loff 0,
data_len 24
Jul 26 08:24:04 e1 kernel: [  884.174001] [20426]:
vdisk_unmap_range:3810:Unmapping lba 74231808 (blocks 8192)
Jul 26 08:24:04 e1 kernel: [  884.174224] [6290]: scst:
scst_cmd_init_done:828:NEW CDB: len 16, lun 16, initiator
iqn.1995-05.com.vihl2.ibft, target iqn.2008-10.net.storcium:scst.1,
queue_type 1, tag 4005936 (cmd 88201b5565c0, sess
880ffa2c)
Jul 26 08:24:04 e1 kernel: [  884.174227] [6290]: scst_cmd_init_done:829:CDB:
Jul 26 08:24:04 e1 kernel: [  884.174228]
(h)___0__1__2__3__4__5__6__7__8__9__A__B__C__D__E__F
Jul 26 08:24:04 e1 kernel: [  884.174231]0: 42 00 00 00 00 00 00
00 18 00 00 00 00 00 00 00   B...
Jul 26 08:24:04 e1 kernel: [  884.174256] [6290]: scst:
scst_parse_cmd:1312:op_name  (cmd 88201b5565c0),
direction=1 (expected 1, set yes), lba=0, bufflen=24, data len 24,
out_bufflen=0, (expected len data 24, expected len DIF 0, out expected
len 0), flags=0x80260, internal 0, naca 0




>
> Alex Gorbachev wrote on 07/23/2016 08:48 PM:
>> Hi Nick, Vlad, SCST Team,
>>
> I have been looking at using the rbd-nbd tool, so that the caching is
 provided by librbd and then use BLOCKIO with SCST. This will however need
 some work on the SCST resource agents to ensure the librbd cache is
 invalidated on ALUA state change.
>
> The other thing I have seen is this
>
> https://lwn.net/Articles/691871/
>
> Which may mean FILEIO will support thin provis

[ceph-users] Searchable metadata and objects in Ceph

2016-07-27 Thread Andrey Ptashnik
Hello team,

We are looking for ways to store metadata with objects and make this metadata 
searchable. 
For example, if we store an image of a car in Ceph, we would like to be able to
attach metadata like model, make, year, damaged parts list, and owner information,
so that later on we can run a report against specific metadata and retrieve car
images that are more or less in line with the search query.
Is there a way to implement something like this in Ceph?


Regards,

Andrey Ptashnik

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Listing objects in a specified placement group / OSD

2016-07-27 Thread Samuel Just
Well, it's kind of deliberately obfuscated because PGs aren't a
librados-level abstraction.  Why do you want to list the objects in a
PG?
-Sam

On Wed, Jul 27, 2016 at 8:10 AM, David Blundell
 wrote:
> Hi,
>
>
>
> I wasn’t sure if this is a ceph-users or ceph-devel question as it’s about
> the API (users) but the answer may involve me writing a RADOS method
> (devel).
>
>
>
> At the moment in Ceph Jewel I can find which objects are held in an OSD or
> placement group by looking on the filesystem under
> /var/lib/ceph/osd/ceph-*/current
>
>
>
> This requires access to the OSD host and may well break when using Bluestore
> if there is no filesystem to look through.  I would like to be able to list
> objects in a specified PG/OSD from outside of the OSD host using Ceph
> commands.
>
>
>
> I can list all PGs hosted on OSD 1 using “ceph pg ls-by-osd osd.1” and could
> loop through this output if there was a way to list the objects in a PG.
>
>
>
> I have checked the API and librados docs (I would be happy to hack something
> together using librados) and can’t see any obvious way to list the objects
> in a PG.
>
>
>
> I have seen a post on this mailing list from Ilya last September saying:
>
> “Internally there is a way to list objects within a specific PG (actually
> more than one way IIRC), but I don't think anything like that is exposed in
> a CLI (it might be exposed in librados though).”
>
>
>
> but could not find any follow up posts with details.
>
>
>
> Does anyone have any more details on these internal methods and how to call
> them?
>
>
>
> Cheers,
>
>
>
> David
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Ceph-community] Noobie question about OSD fail

2016-07-27 Thread Patrick McGarry
Moving this to ceph-user.


On Wed, Jul 27, 2016 at 8:36 AM, Kostya Velychkovsky
 wrote:
> Hello. I have test CEPH cluster with 5 nodes:  3 MON and 2 OSD
>
> This is my ceph.conf
>
> [global]
> fsid = 714da611-2c40-4930-b5b9-d57e70d5cf7e
> mon_initial_members = node1
> mon_host = node1,node3,node4
>
> auth_cluster_required = cephx
> auth_service_required = cephx
> auth_client_required = cephx
> osd_pool_default_size = 2
> public_network = X.X.X.X/24
>
> [mon]
> osd report timeout = 15
> osd min down reports = 2
>
> [osd]
> mon report interval max = 30
> mon heartbeat interval = 15
>
>
> So, while I run some fail tests and hard reset one OSD node, I have long
> timeout while ceph mark this OSD down, ~15 minutes
>
> and ceph -s display that cluster OK.
> ---
> cluster 714da611-2c40-4930-b5b9-d57e70d5cf7e
>  health HEALTH_OK
>  monmap e5: 3 mons at 
> election epoch 272, quorum 0,1,2 node1,node3,node4
>  osdmap e90: 2 osds: 2 up, 2 in
> ---
> Only after ~15 minutes mon nodes Mark this OSD down, and change state of
> cluster
> 
>  osdmap e86: 2 osds: 1 up, 2 in; 64 remapped pgs
> flags sortbitwise
>   pgmap v3927: 64 pgs, 1 pools, 10961 MB data, 2752 objects
> 22039 MB used, 168 GB / 189 GB avail
> 2752/5504 objects degraded (50.000%)
>   64 active+undersized+degraded
> ---
>
> I tried to ajust 'osd report timeout'  but have the same result.
>
> Can you pls help me tune my cluster to decrease this reaction time ?
>
> --
> Best Regards
>
> Kostiantyn Velychkovsky
>
> ___
> Ceph-community mailing list
> ceph-commun...@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-community-ceph.com
>



-- 

Best Regards,

Patrick McGarry
Director Ceph Community || Red Hat
http://ceph.com  ||  http://community.redhat.com
@scuttlemonkey || @ceph
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Scst-devel] Thin Provisioning and Ceph RBD's

2016-07-27 Thread Alex Gorbachev
One other experiment: just running blkdiscard against the RBD block
device completely clears it, to the point where the rbd-diff method
reports 0 blocks utilized.  So to summarize:

- ESXi sending UNMAP via SCST does not seem to release storage from
RBD (BLOCKIO handler that is supposed to work with UNMAP)

- blkdiscard does release the space
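
(By "the rbd-diff method" above I mean the usual one-liner for estimating
allocated space; pool and image names here are made up:

  rbd diff rbd/myimage | awk '{ SUM += $2 } END { print SUM/1024/1024 " MB" }'

which simply sums the extents reported by 'rbd diff'.)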

--
Alex Gorbachev
Storcium


On Wed, Jul 27, 2016 at 11:55 AM, Alex Gorbachev  
wrote:
> Hi Vlad,
>
> On Mon, Jul 25, 2016 at 10:44 PM, Vladislav Bolkhovitin  wrote:
>> Hi,
>>
>> I would suggest to rebuild SCST in the debug mode (after "make 2debug"), 
>> then before
>> calling the unmap command enable "scsi" and "debug" logging for scst and 
>> scst_vdisk
>> modules by 'echo add scsi >/sys/kernel/scst_tgt/trace_level; echo "add scsi"
>>>/sys/kernel/scst_tgt/handlers/vdisk_fileio/trace_level; echo "add debug"
>>>/sys/kernel/scst_tgt/handlers/vdisk_fileio/trace_level', then check, if for 
>>>the unmap
>> command vdisk_unmap_range() is reporting running blkdev_issue_discard() in 
>> the kernel
>> logs.
>>
>> To double check, you might also add trace statement just before 
>> blkdev_issue_discard()
>> in vdisk_unmap_range().
>
> With the debug settings on, I am seeing the below output - this means
> that discard is being sent to the backing (RBD) device, correct?
>
> Including the ceph-users list to see if there is a reason RBD is not
> processing this discard/unmap.
>
> Thank you,
> --
> Alex Gorbachev
> Storcium
>
> Jul 26 08:23:38 e1 kernel: [  858.324715] [20426]: scst:
> scst_cmd_done_local:2272:cmd 88201b552940, status 0, msg_status 0,
> host_status 0, driver_status 0, resp_data_len 0
> Jul 26 08:23:38 e1 kernel: [  858.324740] [20426]:
> vdisk_parse_offset:2930:cmd 88201b552c00, lba_start 0, loff 0,
> data_len 24
> Jul 26 08:23:38 e1 kernel: [  858.324743] [20426]:
> vdisk_unmap_range:3810:Unmapping lba 61779968 (blocks 8192)
> Jul 26 08:23:38 e1 kernel: [  858.336218] [20426]: scst:
> scst_cmd_done_local:2272:cmd 88201b552c00, status 0, msg_status 0,
> host_status 0, driver_status 0, resp_data_len 0
> Jul 26 08:23:38 e1 kernel: [  858.336232] [20426]:
> vdisk_parse_offset:2930:cmd 88201b552ec0, lba_start 0, loff 0,
> data_len 24
> Jul 26 08:23:38 e1 kernel: [  858.336234] [20426]:
> vdisk_unmap_range:3810:Unmapping lba 61788160 (blocks 8192)
> Jul 26 08:23:38 e1 kernel: [  858.351446] [20426]: scst:
> scst_cmd_done_local:2272:cmd 88201b552ec0, status 0, msg_status 0,
> host_status 0, driver_status 0, resp_data_len 0
> Jul 26 08:23:38 e1 kernel: [  858.351468] [20426]:
> vdisk_parse_offset:2930:cmd 88201b553180, lba_start 0, loff 0,
> data_len 24
> Jul 26 08:23:38 e1 kernel: [  858.351471] [20426]:
> vdisk_unmap_range:3810:Unmapping lba 61796352 (blocks 8192)
> Jul 26 08:23:38 e1 kernel: [  858.373407] [20426]: scst:
> scst_cmd_done_local:2272:cmd 88201b553180, status 0, msg_status 0,
> host_status 0, driver_status 0, resp_data_len 0
> Jul 26 08:23:38 e1 kernel: [  858.373422] [20426]:
> vdisk_parse_offset:2930:cmd 88201b553440, lba_start 0, loff 0,
> data_len 24
> Jul 26 08:23:38 e1 kernel: [  858.373424] [20426]:
> vdisk_unmap_range:3810:Unmapping lba 61804544 (blocks 8192)
>
> Jul 26 08:24:04 e1 kernel: [  884.170201] [6290]: scst_cmd_init_done:829:CDB:
> Jul 26 08:24:04 e1 kernel: [  884.170202]
> (h)___0__1__2__3__4__5__6__7__8__9__A__B__C__D__E__F
> Jul 26 08:24:04 e1 kernel: [  884.170205]0: 42 00 00 00 00 00 00
> 00 18 00 00 00 00 00 00 00   B...
> Jul 26 08:24:04 e1 kernel: [  884.170268] [6290]: scst:
> scst_parse_cmd:1312:op_name  (cmd 88201b556300),
> direction=1 (expected 1, set yes), lba=0, bufflen=24, data len 24,
> out_bufflen=0, (expected len data 24, expected len DIF 0, out expected
> len 0), flags=0x80260, internal 0, naca 0
> Jul 26 08:24:04 e1 kernel: [  884.173983] [20426]: scst:
> scst_cmd_done_local:2272:cmd 88201b556b40, status 0, msg_status 0,
> host_status 0, driver_status 0, resp_data_len 0
> Jul 26 08:24:04 e1 kernel: [  884.173998] [20426]:
> vdisk_parse_offset:2930:cmd 88201b556e00, lba_start 0, loff 0,
> data_len 24
> Jul 26 08:24:04 e1 kernel: [  884.174001] [20426]:
> vdisk_unmap_range:3810:Unmapping lba 74231808 (blocks 8192)
> Jul 26 08:24:04 e1 kernel: [  884.174224] [6290]: scst:
> scst_cmd_init_done:828:NEW CDB: len 16, lun 16, initiator
> iqn.1995-05.com.vihl2.ibft, target iqn.2008-10.net.storcium:scst.1,
> queue_type 1, tag 4005936 (cmd 88201b5565c0, sess
> 880ffa2c)
> Jul 26 08:24:04 e1 kernel: [  884.174227] [6290]: scst_cmd_init_done:829:CDB:
> Jul 26 08:24:04 e1 kernel: [  884.174228]
> (h)___0__1__2__3__4__5__6__7__8__9__A__B__C__D__E__F
> Jul 26 08:24:04 e1 kernel: [  884.174231]0: 42 00 00 00 00 00 00
> 00 18 00 00 00 00 00 00 00   B...
> Jul 26 08:24:04 e1 kernel: [  884.174256] [6290]: scst:
> scst_parse_cmd:1312:op_name  (cmd 88201b5565c0),
> direction=1 (expected 1, set yes), lba=0, bufflen=24, data len 24,
> out_bufflen=

Re: [ceph-users] Listing objects in a specified placement group / OSD

2016-07-27 Thread David Blundell
Hi Sam,

We're running a program on each OSD host that reads the contents of the objects
on that host's OSDs (using LIBRADOS_OPERATION_LOCALIZE_READS when reading, as
eventual consistency is OK).

At the moment the simplest way of finding out which objects are local is to
look in the local filesystem, but I want to ensure that the system will keep
working with BlueStore and any future changes, so I would love to be able to
query this, e.g. via librados.
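
In the meantime, the crude fallback I can think of is deriving each object's PG
client-side with 'ceph osd map' and filtering a pool listing on it; slow, and
purely a sketch (the pool name and PG id below are made up):

  for obj in $(rados -p mypool ls); do
      pg=$(ceph osd map mypool "$obj" -f json |
           python -c 'import sys, json; print(json.load(sys.stdin)["pgid"])')
      [ "$pg" = "1.6f" ] && echo "$obj"
  done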

Cheers,

David  

> -Original Message-
> From: Samuel Just [mailto:sj...@redhat.com]
> Sent: 27 July 2016 17:45
> To: David Blundell 
> Cc: ceph-us...@ceph.com; ceph-de...@vger.kernel.org
> Subject: Re: [ceph-users] Listing objects in a specified placement group / OSD
> 
> Well, it's kind of deliberately obfuscated because PGs aren't a
> librados-level abstraction.  Why do you want to list the objects in a
> PG?
> -Sam
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Ceph-community] Noobie question about OSD fail

2016-07-27 Thread Samuel Just
osd min down reports = 2

Set that to 1?
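
Roughly, the knobs involved are these (a hedged sketch; the values are
illustrative only, not recommendations):

  [global]
  # how long peers wait before reporting a silent OSD as failed
  osd heartbeat grace = 20
  # with only two OSDs there is only ever one possible reporter
  mon osd min down reporters = 1
  mon osd min down reports = 1
  # note the full option name here; its default of 900 s (15 minutes)
  # matches the delay you are seeing
  mon osd report timeout = 60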
-Sam

On Wed, Jul 27, 2016 at 10:24 AM, Patrick McGarry  wrote:
> Moving this to ceph-user.
>
>
> On Wed, Jul 27, 2016 at 8:36 AM, Kostya Velychkovsky
>  wrote:
>> Hello. I have test CEPH cluster with 5 nodes:  3 MON and 2 OSD
>>
>> This is my ceph.conf
>>
>> [global]
>> fsid = 714da611-2c40-4930-b5b9-d57e70d5cf7e
>> mon_initial_members = node1
>> mon_host = node1,node3,node4
>>
>> auth_cluster_required = cephx
>> auth_service_required = cephx
>> auth_client_required = cephx
>> osd_pool_default_size = 2
>> public_network = X.X.X.X/24
>>
>> [mon]
>> osd report timeout = 15
>> osd min down reports = 2
>>
>> [osd]
>> mon report interval max = 30
>> mon heartbeat interval = 15
>>
>>
>> So, while I run some fail tests and hard reset one OSD node, I have long
>> timeout while ceph mark this OSD down, ~15 minutes
>>
>> and ceph -s display that cluster OK.
>> ---
>> cluster 714da611-2c40-4930-b5b9-d57e70d5cf7e
>>  health HEALTH_OK
>>  monmap e5: 3 mons at 
>> election epoch 272, quorum 0,1,2 node1,node3,node4
>>  osdmap e90: 2 osds: 2 up, 2 in
>> ---
>> Only after ~15 minutes mon nodes Mark this OSD down, and change state of
>> cluster
>> 
>>  osdmap e86: 2 osds: 1 up, 2 in; 64 remapped pgs
>> flags sortbitwise
>>   pgmap v3927: 64 pgs, 1 pools, 10961 MB data, 2752 objects
>> 22039 MB used, 168 GB / 189 GB avail
>> 2752/5504 objects degraded (50.000%)
>>   64 active+undersized+degraded
>> ---
>>
>> I tried to ajust 'osd report timeout'  but have the same result.
>>
>> Can you pls help me tune my cluster to decrease this reaction time ?
>>
>> --
>> Best Regards
>>
>> Kostiantyn Velychkovsky
>>
>> ___
>> Ceph-community mailing list
>> ceph-commun...@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-community-ceph.com
>>
>
>
>
> --
>
> Best Regards,
>
> Patrick McGarry
> Director Ceph Community || Red Hat
> http://ceph.com  ||  http://community.redhat.com
> @scuttlemonkey || @ceph
> ___
> Ceph-community mailing list
> ceph-commun...@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-community-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] How to configure OSD heart beat to happen on public network

2016-07-27 Thread Venkata Manojawa Paritala
Hi,

I have configured the below 2 networks in Ceph.conf.

1. public network
2. cluster_network

Now, the heartbeat for the OSDs is happening through the cluster_network. How can
I configure the heartbeat to happen through the public network?

I actually configured the property "osd heartbeat address" in the global
section and provided the public network's subnet, but it is not working.

Am I doing something wrong? I would appreciate your quick responses, as I need to
resolve this urgently.


Thanks & Regards,
Manoj
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Searchable metadata and objects in Ceph

2016-07-27 Thread Gregory Farnum
On Wed, Jul 27, 2016 at 9:17 AM, Andrey Ptashnik  wrote:
> Hello team,
>
> We are looking for ways to store metadata with objects and make this metadata 
> searchable.
> For example if we store an image of the car in Ceph we would like to be able 
> to attach metadata like model, make, year, damaged parts list, owner 
> information. So later on we can run a report against specific metadata and 
> retrieve car images that are more or less inline with search query.
> Is there a way to implement something like this in Ceph?

Nothing like this exists right now and there isn't really
infrastructure to let you do so as a user — you'd need to build up and
maintain your own searchable index.
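
(You can attach the metadata itself today, e.g. as object xattrs via rados, or
as user-defined metadata headers through RGW; it is the indexing/search side
that is missing. A quick illustration, with made-up pool/object names:

  rados -p images setxattr car-1234.jpg model "ModelS"
  rados -p images setxattr car-1234.jpg year "2014"
  rados -p images listxattr car-1234.jpg

but nothing will query across those attributes for you; that is the external
index you would have to maintain.)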

We've discussed (in a blue-sky sense) this and similar problems and
have ideas for adding it in via PG classes and metadata that's
organized by PG instead of object. A sufficiently-motivated developer
could work on design and implementation, but it's a serious project
that would take some time and intimate knowledge of the Ceph
internals.
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] performance decrease after continuous run

2016-07-27 Thread RDS
I have seen this, and some of our big customers have also seen it. I was using
8TB HDDs, and when running small tests against a fresh HDD setup the results were
very good. I then loaded the ceph cluster so that each 8TB HDD held 4TB of data
and reran the same tests: performance was cut in half. This is with the default
settings for how ceph creates the directories and sub-directories on each OSD.
You can flatten out this directory structure so that it is more wide than deep,
and performance improves. Check out the filestore_merge_threshold and
filestore_split_multiple settings.
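
For example, something along these lines in ceph.conf (the values are
illustrative, not a recommendation; test against your own workload):

  [osd]
  # let each leaf directory hold more objects before FileStore splits it,
  # keeping the on-disk tree shallower
  filestore merge threshold = 40
  filestore split multiple = 8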
Rick
> On Jul 20, 2016, at 3:19 PM, Kane Kim  wrote:
> 
> Hello,
> 
> I was running cosbench for some time and noticed sharp consistent performance 
> decrease at some point.
> 
> Image is here: http://take.ms/rorPw 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS snapshot preferred behaviors

2016-07-27 Thread Patrick Donnelly
On Mon, Jul 25, 2016 at 5:41 PM, Gregory Farnum  wrote:
> Some specific questions:
> * Right now, we allow users to rename snapshots. (This is newish, so
> you may not be aware of it if you've been using snapshots for a
> while.) Is that an important ability to preserve?

IMO, renaming snapshots is very useful when doing regular time-based
snapshots (e.g. a "today" snapshot is renamed "yesterday"). This is a
very popular feature in ZFS.
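
The rotation I have in mind is the usual .snap dance; a sketch, assuming the
rename really is just a rename inside .snap (the paths are made up):

  cd /mnt/cephfs/projects/data
  rmdir .snap/yesterday 2>/dev/null || true   # drop the oldest snapshot if present
  mv .snap/today .snap/yesterday              # relies on the rename support above
  mkdir .snap/today                           # take the new snapshot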

> * If you create a hard link at "/1/2/foo/bar" pointing at "/1/3/bar"
> and then take a snapshot at "/1/2/foo", it *will not* capture the file
> data in bar. Is that okay? Doing otherwise is *exceedingly* difficult.

This is only the case if /1/2/foo/ does not have the embedded inode
for "bar", right? (That's normally the case but an intervening unlink
of "1/3/bar" may eventually cause "/1/2/foo/bar" to become the new
primary inode?)

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS snapshot preferred behaviors

2016-07-27 Thread Gregory Farnum
On Wed, Jul 27, 2016 at 2:51 PM, Patrick Donnelly  wrote:
> On Mon, Jul 25, 2016 at 5:41 PM, Gregory Farnum  wrote:
>> Some specific questions:
>> * Right now, we allow users to rename snapshots. (This is newish, so
>> you may not be aware of it if you've been using snapshots for a
>> while.) Is that an important ability to preserve?
>
> IMO, renaming snapshots is very useful when doing regular time-based
> snapshots (e.g. a "today" snapshot is renamed "yesterday"). This is a
> very popular feature in ZFS.
>
>> * If you create a hard link at "/1/2/foo/bar" pointing at "/1/3/bar"
>> and then take a snapshot at "/1/2/foo", it *will not* capture the file
>> data in bar. Is that okay? Doing otherwise is *exceedingly* difficult.
>
> This is only the case if /1/2/foo/ does not have the embedded inode
> for "bar", right? (That's normally the case but an intervening unlink
> of "1/3/bar" may eventually cause "/1/2/foo/bar" to become the new
> primary inode?)

Yeah, that total lack of user-visible status around hard links is the
very best part. :/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph Days - APAC Roadshow Schedules Posted

2016-07-27 Thread Patrick McGarry
Hey cephers,

Just wanted to let you know that the schedules for all Ceph Days in
the APAC roadshow have now been published. If you are going to be in
the region 20-29 Aug, check out the schedule and come join us!

http://ceph.com/cephdays/



-- 

Best Regards,

Patrick McGarry
Director Ceph Community || Red Hat
http://ceph.com  ||  http://community.redhat.com
@scuttlemonkey || @ceph
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph-fuse (jewel 10.2.2): No such file or directory issues

2016-07-27 Thread Goncalo Borges

Dear cephfsers :-)

We saw some weirdness in cephfs that we do not understand.

We were helping a user who complained that her batch system job 
outputs were not being produced in cephfs.


Please note that we are using ceph-fuse (jewel 10.2.2) as client

We logged in to the machine where her jobs run and saw the following 
behavior:


   # ls /coepp/cephfs/mel/user/foo/bar/stuff
   ls: cannot access '/coepp/cephfs/mel/user/foo/bar/stuff': No such
   file or directory


If we went back one directory, still no such file

   # ls /coepp/cephfs/mel/user/foo/bar
   ls: cannot access '/coepp/cephfs/mel/user/foo/bar': No such file or
   directory


But if I did an ls in the user directory it was fine

   # ls /coepp/cephfs/mel/user
   

And then trying to ls the directories which had failed previously worked fine

It seems like a cache issue and I wonder if there is a way to mitigate it.

It is also worth mentioning that this seems to happen while we are 
adding a new storage server to the underlying ceph infrastructure, so 
there is some data movement happening in the background.


Any suggestion on how to mitigate it?

Cheers
Goncalo and Sean





--
Goncalo Borges
Research Computing
ARC Centre of Excellence for Particle Physics at the Terascale
School of Physics A28 | University of Sydney, NSW  2006
T: +61 2 93511937

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-fuse (jewel 10.2.2): No such file or directory issues

2016-07-27 Thread Gregory Farnum
On Wed, Jul 27, 2016 at 6:13 PM, Goncalo Borges
 wrote:
> Dear cephfsers :-)
>
> We saw some weirdness in cephfs that we do not understand.
>
> We were helping some user which complained that her batch system job outputs
> were not produced in cephfs.
>
> Please note that we are using ceph-fuse (jewel 10.2.2) as client
>
> We log in into the machine where her jobs run, and saw the following
> behavior:
>
> # ls /coepp/cephfs/mel/user/foo/bar/stuff
> ls: cannot access '/coepp/cephfs/mel/user/foo/bar/stuff': No such file or
> directory
>
>
> If we went back 1 directory, still No such file
>
> # ls /coepp/cephfs/mel/user/foo/bar
> ls: cannot access '/coepp/cephfs/mel/user/foo/bar': No such file or
> directory
>
>
> But if I did an ls in the user directory it was fine
>
> # ls /coepp/cephfs/mel/user
> 
>
> And then trying to ls to the directories which failed previous worked fine
>
> It seems like a cache issue and I wonder if there is a way to mitigate it.
>
> It is also worthwhile to mention that this seems to happen while we are
> adding a new storage server to the underlying ceph infrastructure, so there
> was some data movement happening in the background.
>
> Any suggestion on how to mitigate it?

If you're really using 10.2.2 and not something earlier, I don't think
this is a bug we've heard about. It sounds like you could work around
it by dropping caches or listing down from the root gratuitously, but
otherwise we'll need to do some debugging. Can you narrow in on what
makes this user's workload different from the others? Did you try
doing any tracing to see where the ENOENT was coming from?
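
(By dropping caches I mean the usual client-side step, e.g.:

  sync
  echo 2 > /proc/sys/vm/drop_caches   # drops dentries/inodes on the client

as a stop-gap while the real cause is tracked down.)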
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-fuse (jewel 10.2.2): No such file or directory issues

2016-07-27 Thread Goncalo Borges

Hi Greg

Thanks for replying. Answer inline.



Dear cephfsers :-)

We saw some weirdness in cephfs that we do not understand.

We were helping some user which complained that her batch system job outputs
were not produced in cephfs.

Please note that we are using ceph-fuse (jewel 10.2.2) as client

We log in into the machine where her jobs run, and saw the following
behavior:

# ls /coepp/cephfs/mel/user/foo/bar/stuff
ls: cannot access '/coepp/cephfs/mel/user/foo/bar/stuff': No such file or
directory


If we went back 1 directory, still No such file

# ls /coepp/cephfs/mel/user/foo/bar
ls: cannot access '/coepp/cephfs/mel/user/foo/bar': No such file or
directory


But if I did an ls in the user directory it was fine

# ls /coepp/cephfs/mel/user


And then trying to ls to the directories which failed previous worked fine

It seems like a cache issue and I wonder if there is a way to mitigate it.

It is also worthwhile to mention that this seems to happen while we are
adding a new storage server to the underlying ceph infrastructure, so there
was some data movement happening in the background.

Any suggestion on how to mitigate it?

If you're really using 10.2.2 and not something earlier, I don't think
this is a bug we've heard about. It sounds like you could work around
it by dropping caches or listing down from the root gratuitously, but
otherwise we'll need to do some debugging. Can you narrow in on what
makes this user's workload different from the others? Did you try
doing any tracing to see where the ENOENT was coming from?


Really using 10.2.2 everywhere.

To debug it a bit further we have to wait for the next time it happens. 
Then we can attach strace to the ceph-fuse process and get the 
information you are asking for.


Regarding the user workload, there is nothing special happening in 
those directories; they are just used to store logs (stderr and 
stdout) from the batch system jobs.


We were wondering whether setting

fuse_disable_pagecache = true

would actually solve the problem. In this way you force ceph-fuse to 
read directly from osds, right?!


We understand the performance issues it might imply, but we are more 
concerned with having data coherence in the client.


Thoughts?

Cheers



--
Goncalo Borges
Research Computing
ARC Centre of Excellence for Particle Physics at the Terascale
School of Physics A28 | University of Sydney, NSW  2006
T: +61 2 93511937

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mon_osd_nearfull_ratio (unchangeable) ?

2016-07-27 Thread Goncalo Borges

Hi David

Thanks for replying. Unfortunately, in the end, I did not test this. We 
solved our near-full problems by adding a new host, and now it doesn't 
make sense to test it anymore.


Thanks for the suggestion. I will keep it in mind next time.

Cheers

Goncalo




On 07/26/2016 06:09 PM, David wrote:


Try:

ceph pg set_nearfull_ratio 0.9


On 26 Jul 2016 08:16, "Goncalo Borges" > wrote:


Hello...

I do not think that these settings are working properly in jewel.
Maybe someone else can confirm.

So, to summarize:

1./ I've restarted mon and osd services (systemctl restart
ceph.target) after setting

# grep nearfull /etc/ceph/ceph.conf
mon osd nearfull ratio = 0.90

2./ Thos configs seems active in the daemons configurations

# ceph --admin-daemon /var/run/ceph/ceph-mon.rccephmon1.asok
config show |grep mon_osd_nearfull_ratio
"mon_osd_nearfull_ratio": "0.9",
[
# ceph daemon mon.rccephmon1 config show | grep
mon_osd_nearfull_ratio
"mon_osd_nearfull_ratio": "0.9",

3./ However, I still receive a warning of near full osds if they
are above 85%


4./ A ceph pg dump does show:

# ceph pg dump
dumped all in format plain
version 12415999
stamp 2016-07-26 07:15:29.018848
last_osdmap_epoch 2546
last_pg_scan 2546
full_ratio 0.95
*nearfull_ratio 0.85*


Cheers
G.


On 07/26/2016 12:39 PM, Brad Hubbard wrote:

On Tue, Jul 26, 2016 at 12:16:35PM +1000, Goncalo Borges wrote:

Hi Brad

Thanks for replying.

Answers inline.



I am a bit confused about the 'unchachable' message we get in Jewel 10.2.2
when I try to change some cluster configs.

For example:

1./ if I try to change mon_osd_nearfull_ratio from 0.85 to 0.90, I get

 # ceph tell mon.* injectargs "--mon_osd_nearfull_ratio 0.90"
 mon.rccephmon1: injectargs:mon_osd_nearfull_ratio = '0.9'
 (unchangeable)
 mon.rccephmon3: injectargs:mon_osd_nearfull_ratio = '0.9'
 (unchangeable)
 mon.rccephmon2: injectargs:mon_osd_nearfull_ratio = '0.9'
 (unchangeable)

This is telling you that this variable has no observers (i.e. nothing 
monitors
it dynamically) so changing it at runtime has no effect. IOW it is read at
start-up and not referred to again after that IIUC.


but the 0.85 default values continues to be showed in

  ceph --show-config --conf /dev/null | grep mon_osd_nearfull_ratio
  mon_osd_nearfull_ratio = 0.85

Try something like the following.

$ ceph daemon mon.a config show|grep mon_osd_nearfull_ratio


and I continue to have health warnings regarding near full osds.

So the actual config value has been changed but has no affect and will not
persist. IOW, this value needs to be modified in the conf file and the 
daemon
restarted.


2./ If I change in the ceph.conf and restart services, I get the same
behaviour as in 1./ However, if I check the daemon configuration, I see:

Please clarify what you mean by "the same behaviour"?

So, in my ceph.conf I've set 'mon osd nearfull ratio = 0.90' and restarted
mon and osd (not sure if those were needed) daemons everywhere.

After restarting, I am still getting the health warnings regarding near full
osds above 85%. If the new value was active, I should not get such warnings.


  # ceph daemon mon.rccephmon2 config show | grep mon_osd_nearfull_ratio
  "mon_osd_nearfull_ratio": "0.9",

Use the daemon command I showed above.

Isn't it the same as you suggested? That was run after restarting services

Yes, it is. I assumed wrongly that you were using the "--show-config" 
command
again here.


so it is still unclear to me why the new value is not picked up and why
running 'ceph --show-config --conf /dev/null | grep mon_osd_nearfull_ratio'

That command shows the default ceph config, try something like this.

$ ceph -n mon.rccephmon2 --show-config|grep mon_osd_nearfull_ratio


still shows 0.85

Maybe a restart if services is not what has to be done but a stop/start
instead?

You can certainly try it but I would have thought a restart would involve
stop/start of the MON daemon. This thread includes additional information 
that
may be relevant to you atm.

http://permalink.gmane.org/gmane.comp.file-systems.ceph.devel/23391


Cheers
Goncalo


-- 
Goncalo Borges

Research Computing
ARC Centre of Excellence for Particle Physics at the Terascale
School of Physics A28 | University of Sydney, NSW  2006
T:+61 2 93511937 


___
ceph-users mailing list
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



--
Gonca

[ceph-users] rbd-nbd, failed to bind the UNIX domain socket

2016-07-27 Thread joecyw
I would like to ask whether anyone has run into a similar problem. I recently
deployed ceph (10.2.2) with ceph-deploy (1.5.34), created a pool (name: dp) and
an image (img001), and mapped it with rbd-nbd to /dev/nbd0. However, whenever I
run rbd-nbd list-mapped to check the mapping status, I get the following error
message:

[root@ceph01 ~]# rbd-nbd list-mapped
/dev/nbd0
2016-07-27 18:31:35.052554 7f0ce10c1e00 -1 asok(0x7f0cea7fa540) 
AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to 
bind the UNIX domain socket to '/var/run/ceph/ceph-client.admin.asok': (17) 
File exists

It does not affect use of the block device, but I still cannot figure out which
part of the configuration is wrong, so I would like to ask whether anyone has
seen the same problem. Thanks.

P.S. I have confirmed that /var/run/ceph/ceph-client.admin.asok is the socket
file created when the ceph-mon service starts.
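
One workaround I am considering (not yet tested) is giving the client its own
admin socket path so it cannot collide with an asok that already exists, e.g.:

  [client]
  admin socket = /var/run/ceph/$cluster-$name.$pid.asok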

The ceph.conf settings are as follows:

[root@ceph01 ~]# cat /etc/ceph/ceph.conf
[global]
fsid = 19c473ab-d154-41a6-8642-8a26c00f4db0
mon_initial_members = ceph01, ceph02, ceph03
mon_host = x.x.x.x, x.x.x.x, x.x.x.x
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx

[clients]
rbd_cache_writethrough_until_flush = True
rbd_cache = True

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com