Re: [ceph-users] clock skew

2017-04-04 Thread lists

Hi John, list,

On 1-4-2017 16:18, John Petrini wrote:

Just ntp.


Just to follow up on this: we have not experienced a clock skew since we 
started using chrony. It has only been three days, I know, but still...


Perhaps you should try it too, and report if it (seems to) work better 
for you as well.


But again, it has only been three days, so it could be that I'm cheering too early.
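
For anyone who wants to double-check on their own cluster, these are the
checks I'd run (assuming Jewel or newer for the mon command and a stock
chrony install; adjust for your setup):

# clock skew between the monitors, as the monitors themselves see it
ceph time-sync-status

# on each mon host: is chrony locked to a source, and to which ones?
chronyc tracking
chronyc sources -v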

MJ
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Why is cls_log_add logging so much?

2017-04-04 Thread Jens Rosenboom
On a busy cluster, I'm seeing a couple of OSDs logging millions of
lines like this:

2017-04-04 06:35:18.240136 7f40ff873700  0 cls/log/cls_log.cc:129: storing entry at 1_1491287718.237118_57657708.1
2017-04-04 06:35:18.244453 7f4102078700  0 cls/log/cls_log.cc:129: storing entry at 1_1491287718.241622_57657709.1
2017-04-04 06:35:18.296092 7f40ff873700  0 cls/log/cls_log.cc:129: storing entry at 1_1491287718.296308_57657710.1

1. Can someone explain what these messages mean? It seems strange to
me that only a few OSDs generate these.

2. Why are they being generated at debug level 0, meaning that they
cannot be filtered out? That should not be the case for a non-error
message that can occur at least 50 times per second.
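
For context, this is the knob I would normally use to adjust how chatty a
subsystem is at runtime; it does not help here precisely because level-0
messages are always emitted, and the subsystem name is my assumption (I
believe the cls framework logs under "objclass"):

# at runtime, per OSD (osd.12 just as an example)
ceph tell osd.12 injectargs '--debug_objclass 0/0'

# or persistently in ceph.conf under [osd]:
#   debug objclass = 0/0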
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Troubleshooting incomplete PG's

2017-04-04 Thread nokia ceph
Hello Sage and Brad,

Many thanks for the information

>incomplete PGs can be extracted from the drive if the bad sector(s) don't
>happen to affect those pgs.  The ceph-objectstore-tool --op export command
>can be used for this (extract it from the affected drive and add it to
>some other osd).


==
#ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --pgid 1.fs1
--op export --file /tmp/test
Exporting 1.fs1
Export successful

#ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1  --op import
--file /tmp/test
Importing pgid 1.fs1
Import successful
==
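
One caveat worth noting (from my own reading, so please double-check):
ceph-objectstore-tool needs exclusive access to the OSD's data store, so
both the source and destination OSDs have to be stopped while the export
and import run. A minimal sketch of the full sequence:

==
systemctl stop ceph-osd@0
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --pgid 1.fs1 \
    --op export --file /tmp/1.fs1.export
# (osd.0 stays down here, since it is the failing drive being salvaged)

systemctl stop ceph-osd@1
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1 \
    --op import --file /tmp/1.fs1.export
systemctl start ceph-osd@1
==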

I will try this the next time the issue recurs.

I need your suggestions on fixing the unfound errors that happened in another
environment: v11.2.0, bluestore, EC 4+1.

===
1 active+degraded, 8191 active+clean; 29494 GB data, 39323 GB used, 1180 TB
/ 1218 TB avail; 2/66917305 objects degraded (0.000%); 1/13383461 unfound
(0.000%)
===

===
pg 1.93f is active+degraded, acting [206,99,11,290,169], 1 unfound
===

What we tried:

# Restarted all the OSDs associated with that PG: 206,99,11,290,169
# All of those OSDs are up and running
#ceph pg repair 1.93f
#ceph pg deep-scrub 1.93f

And at last:
#ceph pg 1.93f mark_unfound_lost delete    (data loss)


I need your views on how to clear unfound objects like this without data
loss.
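
For reference, this is what I'd gather before deciding on anything
destructive; none of these commands modify data, and they at least show
which OSDs might still be able to supply the missing object:

ceph health detail          # which PGs have unfound objects
ceph pg 1.93f list_missing  # names/versions of the unfound objects
ceph pg 1.93f query         # check "might_have_unfound" in recovery_state

As far as I know, for an erasure-coded pool mark_unfound_lost only supports
"delete" (not "revert"), so it really is a last resort.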


Thanks
Jayaram



On Mon, Apr 3, 2017 at 6:50 PM, Sage Weil  wrote:

> On Fri, 31 Mar 2017, nokia ceph wrote:
> > Hello Brad,
> > Many thanks of the info :)
> >
> > ENV:-- Kracken - bluestore - EC 4+1 - 5 node cluster : RHEL7
> >
> > What is the status of the down+out osd? Only one osd osd.6 down and out
> from
> > cluster.
> > What role did/does it play? Most importantly, is it osd.6? Yes, due to
> > underlying I/O error issue we removed this device from the cluster.
>
> Is the device completely destroyed or is it only returning errors
> when reading certain data?  It is likely that some (or all) of the
> incomplete PGs can be extracted from the drive if the bad sector(s) don't
> happen to affect those pgs.  The ceph-objectstore-tool --op export command
> can be used for this (extract it from the affected drive and add it to
> some other osd).
>
> > I put this parameter " osd_find_best_info_ignore_history_les = true" in
> > ceph.conf, and find those 22 PG's were changed to "down+remapped" . Now
> all
> > are reverted to "remapped+incomplete" state.
>
> This is usually not a great idea unless you're out of options, by the way!
>
> > #ceph pg stat 2> /dev/null
> > v2731828: 4096 pgs: 1 incomplete, 21 remapped+incomplete, 4074
> active+clean;
> > 268 TB data, 371 TB used, 267 TB / 638 TB avail
> >
> > ## ceph -s
> > 2017-03-30 19:02:14.350242 7f8b0415f700 -1 WARNING: the following
> dangerous
> > and experimental features are enabled: bluestore,rocksdb
> > 2017-03-30 19:02:14.366545 7f8b0415f700 -1 WARNING: the following
> dangerous
> > and experimental features are enabled: bluestore,rocksdb
> > cluster bd8adcd0-c36d-4367-9efe-f48f5ab5f108
> >  health HEALTH_ERR
> > 22 pgs are stuck inactive for more than 300 seconds
> > 22 pgs incomplete
> > 22 pgs stuck inactive
> > 22 pgs stuck unclean
> >  monmap e2: 5 mons at {au-adelaide=10.50.21.24:6789/0,au-brisbane=10.50.21.22:6789/0,au-canberra=10.50.21.23:6789/0,au-melbourne=10.50.21.21:6789/0,au-sydney=10.50.21.20:6789/0}
> > election epoch 180, quorum 0,1,2,3,4
> > au-sydney,au-melbourne,au-brisbane,au-canberra,au-adelaide
> > mgr active: au-adelaide
> >  osdmap e6506: 117 osds: 117 up, 117 in; 21 remapped pgs
> > flags sortbitwise,require_jewel_osds,require_kraken_osds
> >   pgmap v2731828: 4096 pgs, 1 pools, 268 TB data, 197 Mobjects
> > 371 TB used, 267 TB / 638 TB avail
> > 4074 active+clean
> >   21 remapped+incomplete
> >1 incomplete
> >
> >
> > ## ceph osd dump 2>/dev/null | grep cdvr
> > pool 1 'cdvr_ec' erasure size 5 min_size 4 crush_ruleset 1 object_hash
> > rjenkins pg_num 4096 pgp_num 4096 last_change 456 flags
> > hashpspool,nodeep-scrub stripe_width 65536
> >
> > Inspecting affected PG 1.e4b
> >
> > # ceph pg dump 2> /dev/null | grep 1.e4b
> > 1.e4b 50832  00 0   0 73013340821
> > 1000610006 remapped+incomplete 2017-03-30 14:14:26.297098 3844'161662
> >  6506:325748 [113,66,15,73,103]113  [NONE,NONE,NONE,73,NONE]
>
> > 73 1643'139486 2017-03-21 04:56:16.683953 0'0 2017-02-21
> > 10:33:50.012922
> >
> > When I trigger below command.
> >
> > #ceph pg force_create_pg 1.e4b
> > pg 1.e4b now creating, ok
> >
> > As it went to creating state, no change after that. Can you explain why
> this
> > PG showing null values after triggering "force_create_pg",?
> >
> > ]# ceph pg dump 2> /dev/null | grep 1.e4b
> > 1.e4b 0  00 0   0
> 0
> >   00creating 2017-03-30 19:07:00.982178 

[ceph-users] Librbd logging

2017-04-04 Thread Laszlo Budai

Hello cephers,

I have a situation where, from time to time, write operations to the ceph
storage hang for 3-5 seconds. For testing we have a simple loop like:
while sleep 1; do date >> logfile; done &

With this we can see that, occasionally, there is a gap of 3 seconds or more
between consecutive outputs of date.
Initially we suspected deep scrub and tuned its parameters, so right now I'm
confident that the cause is something other than deep scrubbing.
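
In case it is useful to anyone, a slightly finer-grained version of the
probe (a minimal sketch; it only records iterations where the gap exceeds
two seconds, so the log stays small):

prev=$(date +%s)
while sleep 1; do
    now=$(date +%s)
    gap=$((now - prev))
    if [ "$gap" -gt 2 ]; then
        # this write may itself block during a stall, which is fine:
        # the entry still lands with the measured gap once IO resumes
        echo "$(date -Is) gap=${gap}s" >> logfile
    fi
    prev=$now
done &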

I would like to know if any of you has encountered a similar situation, and
what the solution was.
I suspect the network between the compute nodes and the storage, but I need
to prove this. I am thinking of enabling client-side logging for librbd, but
I see there are many subsystems for which logging can be enabled. Can anyone
tell me which subsystem I should log, and at which level, to be able to see
whether the network is causing the write issues?
We're using ceph 0.94.10.

Thank you,
Laszlo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Get/set/list rbd image using python librbd

2017-04-04 Thread Sayid Munawar
OK, thanks Jason for submitting the feature request ticket. But I'm afraid
I can't contribute right now, as I lack the C/C++ skills :D
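
For anyone who needs this today, the CLI already covers the full set of
operations, so shelling out to it is a workable stopgap until the Python
bindings land (pool, image and key names below are just examples):

rbd image-meta set    mypool/myimage conf_key some_value
rbd image-meta get    mypool/myimage conf_key
rbd image-meta list   mypool/myimage
rbd image-meta remove mypool/myimage conf_key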



On Mon, Apr 3, 2017 at 9:09 PM, Jason Dillaman  wrote:

> We try to keep the C/C++ and Python APIs in-sync, but it looks like
> these functions were missed and are not currently available via the
> Python API. I created a tracker ticket [1] for the issue. If you are
> interested, feel free to contribute a pull request for the missing
> APIs.
>
> [1] http://tracker.ceph.com/issues/19451
>
> On Sun, Apr 2, 2017 at 8:17 PM, Sayid Munawar 
> wrote:
> > Hi,
> >
> > Using rbd command line, we can set / get / list image-meta of an rbd
> image
> > as described in the man page.
> >
> > # rbd image-meta list mypool/myimage
> >
> >
> > How can we do the same using python librbdpy ? i can't find it in the
> > documentation.
> >
> > with rados.Rados(conffile='my_ceph.conf') as cluster:
> > with cluster.open_ioctx('mypool') as ioctx:
> > with rbd.Image(ioctx, 'myimage') as image:
> > image._some_method_to_set_metadata() ???
> >
> >
> > Thank you
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
>
>
> --
> Jason
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Librbd logging

2017-04-04 Thread Jason Dillaman
Couple options:

1) you can enable LTTng-UST tracing [1][2] against your VM for an
extremely light-weight way to track IO latencies.
2) you can enable "debug rbd = 20" and grep through the logs for
matching "AioCompletion.*(set_request_count|finalize)" log entries
3) use the asok file during one of these events to dump the objecter requests

[1] http://docs.ceph.com/docs/jewel/rbd/rbd-replay/
[2] http://tracker.ceph.com/issues/14629
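
To make options 2 and 3 concrete, a minimal sketch; the log file location
and admin socket path are assumptions you would set yourself in the
[client] section on the compute node:

# ceph.conf on the compute node:
#   [client]
#   debug rbd = 20
#   log file = /var/log/ceph/qemu-client.$pid.log
#   admin socket = /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok

# while a stall is happening, dump the in-flight requests:
ceph --admin-daemon /var/run/ceph/<client-asok-file> objecter_requests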

On Tue, Apr 4, 2017 at 7:36 AM, Laszlo Budai  wrote:
> Hello cephers,
>
> I have a situation where from time to time the write operation to the seph
> storage hangs for 3-5 seconds. For testing we have a simple line like:
> while sleep 1; date >> logfile; done &
>
> with this we can see that rarely there are 3 seconds or more differences
> between the consecutive outputs of date.
> Initially we have suspected the deep scrub and we have tuned its parameters,
> so right now I'm confident that the reason is something different than the
> deep scrubbing.
>
> I would like to know if any of you has encountered a similar situation, and
> what was the solution for it.
> I am suspecting the network between the compute nodes and the storage, but I
> need to prove this. I am thinking on enabling client side logging for
> librbd, but I see there are many subsystems where the logging can be
> enabled. Can anyone tell me which subsystem should I log, and at which level
> to be able to see whether the network is causing write issues?
> We're using ceph 0.94.10.
>
> Thank you,
> Laszlo
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph pg inconsistencies - omap data lost

2017-04-04 Thread Ben Morrice

Hi all,

We have a weird issue with a few inconsistent PGs. We are running ceph 
11.2 on RHEL7.


As an example inconsistent PG we have:

# rados -p volumes list-inconsistent-obj 4.19
{"epoch":83986,"inconsistents":[{"object":{"name":"rbd_header.08f7fa43a49c7f","nspace":"","locator":"","snap":"head","version":28785242},"errors":[],"union_shard_errors":["omap_digest_mismatch_oi"],"selected_object_info":"4:9843f136:::rbd_header.08f7fa43a49c7f:head(82935'28785242 
client.118028302.0:3057684 dirty|data_digest|omap_digest s 0 uv 28785242 
dd  od  alloc_hint [0 0 
0])","shards":[{"osd":10,"errors":["omap_digest_mismatch_oi"],"size":0,"omap_digest":"0x62b5dcb6","data_digest":"0x"},{"osd":20,"errors":["omap_digest_mismatch_oi"],"size":0,"omap_digest":"0x62b5dcb6","data_digest":"0x"},{"osd":29,"errors":["omap_digest_mismatch_oi"],"size":0,"omap_digest":"0x62b5dcb6","data_digest":"0x"}]}]}


If I try to repair this PG, I get the following in the OSD logs:

2017-04-04 14:31:37.825833 7f2d7f802700 -1 log_channel(cluster) log 
[ERR] : 4.19 shard 10: soid 4:9843f136:::rbd_header.08f7fa43a49c7f:head 
omap_digest 0x62b5dcb6 != omap_digest 0x from auth oi 
4:9843f136:::rbd_header.08f7fa43a49c7f:head(82935'28785242 
client.118028302.0:3057684 dirty|data_digest|omap_digest s 0 uv 28785242 
dd  od  alloc_hint [0 0 0])
2017-04-04 14:31:37.825863 7f2d7f802700 -1 log_channel(cluster) log 
[ERR] : 4.19 shard 20: soid 4:9843f136:::rbd_header.08f7fa43a49c7f:head 
omap_digest 0x62b5dcb6 != omap_digest 0x from auth oi 
4:9843f136:::rbd_header.08f7fa43a49c7f:head(82935'28785242 
client.118028302.0:3057684 dirty|data_digest|omap_digest s 0 uv 28785242 
dd  od  alloc_hint [0 0 0])
2017-04-04 14:31:37.825870 7f2d7f802700 -1 log_channel(cluster) log 
[ERR] : 4.19 shard 29: soid 4:9843f136:::rbd_header.08f7fa43a49c7f:head 
omap_digest 0x62b5dcb6 != omap_digest 0x from auth oi 
4:9843f136:::rbd_header.08f7fa43a49c7f:head(82935'28785242 
client.118028302.0:3057684 dirty|data_digest|omap_digest s 0 uv 28785242 
dd  od  alloc_hint [0 0 0])
2017-04-04 14:31:37.825877 7f2d7f802700 -1 log_channel(cluster) log 
[ERR] : 4.19 soid 4:9843f136:::rbd_header.08f7fa43a49c7f:head: failed to 
pick suitable auth object
2017-04-04 14:32:37.926980 7f2d7cffd700 -1 log_channel(cluster) log 
[ERR] : 4.19 deep-scrub 3 errors


If I list the omap values, there are none:

# rados -p volumes listomapvals rbd_header.08f7fa43a49c7f |wc -l
0


If I list the extended attributes on the filesystem of each OSD that 
hosts this file, they are indeed empty (all 3 OSDs are the same, but 
just listing one for brevity)


getfattr 
/var/lib/ceph/osd/ceph-29/current/4.19_head/DIR_9/DIR_1/DIR_2/rbd\\uheader.08f7fa43a49c7f__head_6C8FC219__4

getfattr: Removing leading '/' from absolute path names
# file: 
var/lib/ceph/osd/ceph-29/current/4.19_head/DIR_9/DIR_1/DIR_2/rbd\134uheader.08f7fa43a49c7f__head_6C8FC219__4

user.ceph._
user.ceph._@1
user.ceph._lock.rbd_lock
user.ceph.snapset
user.cephos.spill_out


Is there anything I can do to recover from this situation?
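
In case it helps the discussion, this is what I'd collect next; nothing
here modifies data, and the per-replica check needs the OSD stopped while
ceph-objectstore-tool runs (exact tool syntax from memory, so please
double-check it):

# client-side view of what is left on the object
rados -p volumes stat         rbd_header.08f7fa43a49c7f
rados -p volumes listxattr    rbd_header.08f7fa43a49c7f
rados -p volumes listomapkeys rbd_header.08f7fa43a49c7f

# per-replica view, e.g. on osd.29
systemctl stop ceph-osd@29
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-29 \
    --pgid 4.19 --op list | grep rbd_header.08f7fa43a49c7f
systemctl start ceph-osd@29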


--
Kind regards,

Ben Morrice

__
Ben Morrice | e: ben.morr...@epfl.ch | t: +41-21-693-9670
EPFL / BBP
Biotech Campus
Chemin des Mines 9
1202 Geneva
Switzerland

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Client's read affinity

2017-04-04 Thread Brian Andrus
Jason, I haven't heard much about this feature.

Will the localization take effect if the crush location configuration is
set in the [osd] section, or does it need to be applied globally, for
clients as well?

On Fri, Mar 31, 2017 at 6:38 AM, Jason Dillaman  wrote:

> Assuming you are asking about RBD-backed VMs, it is not possible to
> localize all reads to the VM image. You can, however, enable
> localization of the parent image since that is a read-only data set.
> To enable that feature, set "rbd localize parent reads = true" and
> populate the "crush location = host=X rack=Y etc=Z" in your ceph.conf.
>
> On Fri, Mar 31, 2017 at 9:00 AM, Alejandro Comisario
>  wrote:
> > any experiences ?
> >
> > On Wed, Mar 29, 2017 at 2:02 PM, Alejandro Comisario
> >  wrote:
> >> Guys hi.
> >> I have a Jewel Cluster divided into two racks which is configured on
> >> the crush map.
> >> I have clients (openstack compute nodes) that are closer from one rack
> >> than to another.
> >>
> >> I would love to (if is possible) to specify in some way the clients to
> >> read first from the nodes on a specific rack then try the other one if
> >> is not possible.
> >>
> >> Is that doable ? can somebody explain me how to do it ?
> >> best.
> >>
> >> --
> >> Alejandrito
> >
> >
> >
> > --
> > Alejandro Comisario
> > CTO | NUBELIU
> > E-mail: alejandro@nubeliu.com  Cell: +54 9 11 3770 1857
> > _
> > www.nubeliu.com
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> --
> Jason
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Brian Andrus | Cloud Systems Engineer | DreamHost
brian.and...@dreamhost.com | www.dreamhost.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] FreeBSD port net/ceph-devel released

2017-04-04 Thread Patrick Donnelly
On Sat, Apr 1, 2017 at 4:58 PM, Willem Jan Withagen  wrote:
> On 1-4-2017 21:59, Wido den Hollander wrote:
>>
>>> Op 31 maart 2017 om 19:15 schreef Willem Jan Withagen :
>>>
>>>
>>> On 31-3-2017 17:32, Wido den Hollander wrote:
 Hi Willem Jan,

> Op 30 maart 2017 om 13:56 schreef Willem Jan Withagen
> :
>
>
> Hi,
>
> I'm pleased to announce that my efforts to port to FreeBSD have
> resulted in a ceph-devel port commit in the ports tree.
>
> https://www.freshports.org/net/ceph-devel/
>

 Awesome work! I don't touch FreeBSD that much, but I can imagine that
 people want this.

 Out of curiosity, does this run on ZFS under FreeBSD? Or what
 Filesystem would you use behind FileStore with this? Or does
 BlueStore work?
>>>
>>> Since I'm a huge ZFS fan, that is what I run it on.
>>
>> Cool! The ZIL, ARC and L2ARC can actually make that very fast. Interesting!
>
> Right, ZIL is magic, and more or equal to the journal now used with OSDs
> for exactly the same reason. Sad thing is that a write is now 3*
> journaled: 1* by Ceph, and 2* by ZFS. Which means that the used
> bandwidth to the SSDs is double of what it could be.
>
> Had some discussion about this, but disabling the Ceph journal is not
> just setting an option. Although I would like to test performance of an
> OSD with just the ZFS journal. But I expect that the OSD journal is
> rather firmly integrated.

Disabling the OSD journal will never be viable. The journal is also
necessary for transactions and batch updates which cannot be done
atomically in FileStore.

This is great work Willem. I'm especially looking forward to seeing
BlueStore performance on a ZVol.

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] FreeBSD port net/ceph-devel released

2017-04-04 Thread Gregory Farnum
On Tue, Apr 4, 2017 at 12:28 PM, Patrick Donnelly  wrote:
> On Sat, Apr 1, 2017 at 4:58 PM, Willem Jan Withagen  wrote:
>> On 1-4-2017 21:59, Wido den Hollander wrote:
>>>
 Op 31 maart 2017 om 19:15 schreef Willem Jan Withagen :


 On 31-3-2017 17:32, Wido den Hollander wrote:
> Hi Willem Jan,
>
>> Op 30 maart 2017 om 13:56 schreef Willem Jan Withagen
>> :
>>
>>
>> Hi,
>>
>> I'm pleased to announce that my efforts to port to FreeBSD have
>> resulted in a ceph-devel port commit in the ports tree.
>>
>> https://www.freshports.org/net/ceph-devel/
>>
>
> Awesome work! I don't touch FreeBSD that much, but I can imagine that
> people want this.
>
> Out of curiosity, does this run on ZFS under FreeBSD? Or what
> Filesystem would you use behind FileStore with this? Or does
> BlueStore work?

 Since I'm a huge ZFS fan, that is what I run it on.
>>>
>>> Cool! The ZIL, ARC and L2ARC can actually make that very fast. Interesting!
>>
>> Right, ZIL is magic, and more or equal to the journal now used with OSDs
>> for exactly the same reason. Sad thing is that a write is now 3*
>> journaled: 1* by Ceph, and 2* by ZFS. Which means that the used
>> bandwidth to the SSDs is double of what it could be.
>>
>> Had some discussion about this, but disabling the Ceph journal is not
>> just setting an option. Although I would like to test performance of an
>> OSD with just the ZFS journal. But I expect that the OSD journal is
>> rather firmly integrated.
>
> Disabling the OSD journal will never be viable. The journal is also
> necessary for transactions and batch updates which cannot be done
> atomically in FileStore.
>
> This is great work Willem. I'm especially looking forward to seeing
> BlueStore performance on a ZVol.
>
> --
> Patrick Donnelly
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] FreeBSD port net/ceph-devel released

2017-04-04 Thread Gregory Farnum
[ Sorry for the empty email there. :o ]

On Tue, Apr 4, 2017 at 12:28 PM, Patrick Donnelly  wrote:
> On Sat, Apr 1, 2017 at 4:58 PM, Willem Jan Withagen  wrote:
>> On 1-4-2017 21:59, Wido den Hollander wrote:
>>>
 Op 31 maart 2017 om 19:15 schreef Willem Jan Withagen :


 On 31-3-2017 17:32, Wido den Hollander wrote:
> Hi Willem Jan,
>
>> Op 30 maart 2017 om 13:56 schreef Willem Jan Withagen
>> :
>>
>>
>> Hi,
>>
>> I'm pleased to announce that my efforts to port to FreeBSD have
>> resulted in a ceph-devel port commit in the ports tree.
>>
>> https://www.freshports.org/net/ceph-devel/
>>
>
> Awesome work! I don't touch FreeBSD that much, but I can imagine that
> people want this.
>
> Out of curiosity, does this run on ZFS under FreeBSD? Or what
> Filesystem would you use behind FileStore with this? Or does
> BlueStore work?

 Since I'm a huge ZFS fan, that is what I run it on.
>>>
>>> Cool! The ZIL, ARC and L2ARC can actually make that very fast. Interesting!
>>
>> Right, ZIL is magic, and more or equal to the journal now used with OSDs
>> for exactly the same reason. Sad thing is that a write is now 3*
>> journaled: 1* by Ceph, and 2* by ZFS. Which means that the used
>> bandwidth to the SSDs is double of what it could be.
>>
>> Had some discussion about this, but disabling the Ceph journal is not
>> just setting an option. Although I would like to test performance of an
>> OSD with just the ZFS journal. But I expect that the OSD journal is
>> rather firmly integrated.
>
> Disabling the OSD journal will never be viable. The journal is also
> necessary for transactions and batch updates which cannot be done
> atomically in FileStore.

To expand on Patrick's statement: you shouldn't be confused by the
presence of options to disable journaling. They exist, but they only
work on btrfs-backed FileStores and are *not* performant. You could do
the same on ZFS, but in order to provide the guarantees of the RADOS
protocol in that mode, the OSD simply holds the replies to all
operations until it knows they have been persisted to disk and
snapshotted, and only then sends back a commit. You can probably
imagine the horrible IO patterns and bursty application throughput
that result. o_0
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] FreeBSD port net/ceph-devel released

2017-04-04 Thread Willem Jan Withagen
On 4-4-2017 21:05, Gregory Farnum wrote:
> [ Sorry for the empty email there. :o ]
> 
> On Tue, Apr 4, 2017 at 12:28 PM, Patrick Donnelly  wrote:
>> On Sat, Apr 1, 2017 at 4:58 PM, Willem Jan Withagen  wrote:
>>> On 1-4-2017 21:59, Wido den Hollander wrote:

> Op 31 maart 2017 om 19:15 schreef Willem Jan Withagen :
>
>
> On 31-3-2017 17:32, Wido den Hollander wrote:
>> Hi Willem Jan,
>>
>>> Op 30 maart 2017 om 13:56 schreef Willem Jan Withagen
>>> :
>>>
>>>
>>> Hi,
>>>
>>> I'm pleased to announce that my efforts to port to FreeBSD have
>>> resulted in a ceph-devel port commit in the ports tree.
>>>
>>> https://www.freshports.org/net/ceph-devel/
>>>
>>
>> Awesome work! I don't touch FreeBSD that much, but I can imagine that
>> people want this.
>>
>> Out of curiosity, does this run on ZFS under FreeBSD? Or what
>> Filesystem would you use behind FileStore with this? Or does
>> BlueStore work?
>
> Since I'm a huge ZFS fan, that is what I run it on.

 Cool! The ZIL, ARC and L2ARC can actually make that very fast. Interesting!
>>>
>>> Right, ZIL is magic, and more or equal to the journal now used with OSDs
>>> for exactly the same reason. Sad thing is that a write is now 3*
>>> journaled: 1* by Ceph, and 2* by ZFS. Which means that the used
>>> bandwidth to the SSDs is double of what it could be.
>>>
>>> Had some discussion about this, but disabling the Ceph journal is not
>>> just setting an option. Although I would like to test performance of an
>>> OSD with just the ZFS journal. But I expect that the OSD journal is
>>> rather firmly integrated.
>>
>> Disabling the OSD journal will never be viable. The journal is also
>> necessary for transactions and batch updates which cannot be done
>> atomically in FileStore.
> 
> To expand on Patrick's statement: You shouldn't get confused by the
> presence of options to disable journaling. They exist but only work on
> btrfs-backed FileStores and are *not* performant. You could do the
> same on zfs, but in order to provide the guarantees of the RADOS
> protocol, when in that mode the OSD just holds replies on all
> operations until it knows they've been persisted to disk and
> snapshotted, then sends back a commit. You can probably imagine the
> horrible IO patterns and bursty application throughput that result.

When I talked about this with Sage at CERN, I got the same answer, so
this is at least consistent. ;-)

I have to admit that I do not understand the intricate details of this
part of Ceph, so at the moment I'm looking at it from a more global
view.

What needs to be done, I guess, is to get rid of at least one of the
SSD writes. That would be possible by putting the journal on a separate
vdev (2 SSDs in a mirror) and getting the maximum speed out of that.
The problem with all of this is that the number of SSDs sort of blows
up, and very likely there is a lot of waste, because the journals need
not be very large.
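
For the record, the layout I have in mind looks roughly like this (device
names are placeholders, and whether the OSD journal then lives on a zvol
or a plain file on that pool is still an open question for me):

# mirrored SSD log vdev for the data pool (absorbs the ZFS intent log)
zpool add tank log mirror /dev/ada4 /dev/ada5

# separate small mirrored SSD pool for the OSD journals
zpool create osdjournal mirror /dev/ada6 /dev/ada7
zfs create -V 10G osdjournal/osd0-journal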

And yes, the other way would be to do BlueStore on a zvol, with
carefully crafted underlying vdevs. But first we need to get AIO
working, and I have not (yet) looked at that at all...

The first objective was to get a port of any sort out, which I did last
week. The second is to take Luminous and make a "stable" port that is
less of a moving target. Only then is AIO on the radar.

--WjW

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Client's read affinity

2017-04-04 Thread Jason Dillaman
AFAIK, the OSDs should discover their location in the CRUSH map
automatically -- therefore, this "crush location" config override
would be used for librbd client configuration (i.e. the [client]
section) to describe the client's location in the CRUSH map relative
to racks, hosts, etc.
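
A minimal sketch of what that would look like on a compute node; the
host and rack names are placeholders, so use whatever "ceph osd tree"
shows for your cluster:

# /etc/ceph/ceph.conf on each compute node
[client]
rbd localize parent reads = true
crush location = host=compute01 rack=rack1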

On Tue, Apr 4, 2017 at 3:12 PM, Brian Andrus  wrote:
> Jason, I haven't heard much about this feature.
>
> Will the localization have effect if the crush location configuration is set
> in the [osd] section, or does it need to apply globally for clients as well?
>
> On Fri, Mar 31, 2017 at 6:38 AM, Jason Dillaman  wrote:
>>
>> Assuming you are asking about RBD-backed VMs, it is not possible to
>> localize all reads to the VM image. You can, however, enable
>> localization of the parent image since that is a read-only data set.
>> To enable that feature, set "rbd localize parent reads = true" and
>> populate the "crush location = host=X rack=Y etc=Z" in your ceph.conf.
>>
>> On Fri, Mar 31, 2017 at 9:00 AM, Alejandro Comisario
>>  wrote:
>> > any experiences ?
>> >
>> > On Wed, Mar 29, 2017 at 2:02 PM, Alejandro Comisario
>> >  wrote:
>> >> Guys hi.
>> >> I have a Jewel Cluster divided into two racks which is configured on
>> >> the crush map.
>> >> I have clients (openstack compute nodes) that are closer from one rack
>> >> than to another.
>> >>
>> >> I would love to (if is possible) to specify in some way the clients to
>> >> read first from the nodes on a specific rack then try the other one if
>> >> is not possible.
>> >>
>> >> Is that doable ? can somebody explain me how to do it ?
>> >> best.
>> >>
>> >> --
>> >> Alejandrito
>> >
>> >
>> >
>> > --
>> > Alejandro Comisario
>> > CTO | NUBELIU
>> > E-mail: alejandro@nubeliu.com  Cell: +54 9 11 3770 1857
>> > _
>> > www.nubeliu.com
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>>
>> --
>> Jason
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
>
> --
> Brian Andrus | Cloud Systems Engineer | DreamHost
> brian.and...@dreamhost.com | www.dreamhost.com



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Apply for an official mirror at CN

2017-04-04 Thread SJ Zhu
Wido, ping?

On Sat, Apr 1, 2017 at 8:40 PM, SJ Zhu  wrote:
> On Sat, Apr 1, 2017 at 8:10 PM, Wido den Hollander  wrote:
>> Great! Very good to hear. We can CNAME cn.ceph.com to that location?
>
>
> Yes, please CNAME to mirrors.ustc.edu.cn, and I will set vhost in our
> nginx for the
> ceph directory.
>
> Thanks
>
> --
> Regards,
> Shengjing Zhu



-- 
Regards,
Shengjing Zhu
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com