Re: [ceph-users] bluestore - wal,db on faster devices?

2017-11-09 Thread jorpilo

I get confused there because of the documentation:
http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/
"If there is more, provisioning a DB device makes more sense. The BlueStore 
journal will always be placed on the fastest device available, so using a DB 
device will provide the same benefit that the WAL device would while also 
allowing additional metadata to be stored there"
So I guess it doesn't make any sense to explicitly put both WAL and DB on an SSD; a DB
partition alone, the biggest you can make, would be enough, unless you have 2 different
kinds of SSD (for example a tiny NVMe and a SATA SSD).
Am I right? Or would I get any benefit from explicitly setting a WAL partition on
the same SSD?

 Original message  From: Nick Fisk  Date: 
8/11/17 10:16 p.m. (GMT+01:00) To: 'Mark Nelson' , 
'Wolfgang Lendl'  Cc: 
ceph-users@lists.ceph.com Subject: Re: [ceph-users] bluestore - wal,db on faster 
devices? 
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Mark Nelson
> Sent: 08 November 2017 19:46
> To: Wolfgang Lendl 
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] bluestore - wal,db on faster devices?
> 
> Hi Wolfgang,
> 
> You've got the right idea.  RBD is probably going to benefit less since
you
> have a small number of large objects and little extra OMAP data.
> Having the allocation and object metadata on flash certainly shouldn't
hurt,
> and you should still have less overhead for small (<64k) writes.
> With RGW however you also have to worry about bucket index updates
> during writes and that's a big potential bottleneck that you don't need to
> worry about with RBD.

If you are running anything which is sensitive to sync write latency, like
databases, you will see a big performance improvement from putting the WAL on SSD.
As Mark says, small writes will get ack'd once written to the SSD, roughly
10-200us instead of the 1-2ms you would see on spinning disk. It will also batch
lots of these small writes together and write them to disk in bigger chunks much
more effectively. If you want to run active workloads on RBD and want them to
match the performance of an enterprise storage array with BBWC, I would say DB
and WAL on SSD is a requirement.
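
As a rough illustration, that sync write latency can be measured from inside a guest
with a single-threaded fio job like the sketch below (the directory is just a
placeholder for a mount point on the RBD volume); the per-write completion latency it
reports is what drops once the WAL sits on flash:

fio --name=synctest --directory=/mnt/rbdtest --size=1G --rw=write \
    --bs=4k --iodepth=1 --numjobs=1 --direct=1 --fsync=1 \
    --runtime=60 --time_based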


> 
> Mark
> 
> On 11/08/2017 01:01 PM, Wolfgang Lendl wrote:
> > Hi Mark,
> >
> > thanks for your reply!
> > I'm a big fan of keeping things simple - this means that there has to
> > be a very good reason to put the WAL and DB on a separate device
> > otherwise I'll keep it collocated (and simpler).
> >
> > as far as I understood - putting the WAL,DB on a faster (than hdd)
> > device makes more sense in cephfs and rgw environments (more
> metadata)
> > - and less sense in rbd environments - correct?
> >
> > br
> > wolfgang
> >
> > On 11/08/2017 02:21 PM, Mark Nelson wrote:
> >> Hi Wolfgang,
> >>
> >> In bluestore the WAL serves sort of a similar purpose to filestore's
> >> journal, but bluestore isn't dependent on it for guaranteeing
> >> durability of large writes.  With bluestore you can often get higher
> >> large-write throughput than with filestore when using HDD-only or
> >> flash-only OSDs.
> >>
> >> Bluestore also stores allocation, object, and cluster metadata in the
> >> DB.  That, in combination with the way bluestore stores objects,
> >> dramatically improves behavior during certain workloads.  A big one
> >> is creating millions of small objects as quickly as possible.  In
> >> filestore, PG splitting has a huge impact on performance and tail
> >> latency.  Bluestore is much better just on HDD, and putting the DB
> >> and WAL on flash makes it better still since metadata no longer is a
> >> bottleneck.
> >>
> >> Bluestore does have a couple of shortcomings vs filestore currently.
> >> The allocator is not as good as XFS's and can fragment more over time.
> >> There is no server-side readahead so small sequential read
> >> performance is very dependent on client-side readahead.  There's
> >> still a number of optimizations to various things ranging from
> >> threading and locking in the shardedopwq to pglog and dup_ops that
> >> potentially could improve performance.
> >>
> >> I have a blog post that we've been working on that explores some of
> >> these things but I'm still waiting on review before I publish it.
> >>
> >> Mark
> >>
> >> On 11/08/2017 05:53 AM, Wolfgang Lendl wrote:
> >>> Hello,
> >>>
> >>> it's clear to me getting a performance gain from putting the journal
> >>> on a fast device (ssd,nvme) when using filestore backend.
> >>> it's not when it comes to bluestore - are there any resources,
> >>> performance test, etc. out there how a fast wal,db device impacts
> >>> performance?
> >>>
> >>>
> >>> br
> >>> wolfgang
> >>>
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> 

Re: [ceph-users] High osd cpu usage

2017-11-09 Thread Alon Avrahami
Hi,

Yes, I'm using bluestore.
There is no I/O on the ceph cluster; it's totally idle.
All the CPU usage is by OSDs that don't have any workload on them.

Thanks!

On Thu, Nov 9, 2017 at 9:37 AM, Vy Nguyen Tan 
wrote:

> Hello,
>
> I think it is not normal behavior in Luminous. I'm testing with 3 nodes; each node
> has 3 x 1TB HDD, 1 SSD for WAL + DB, an E5-2620 v3, 32GB of RAM, and a 10Gbps NIC.
>
> I use fio for  I/O performance measurements. When I ran "fio
> --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test
> --filename=test --bs=4k --iodepth=64 --size=4G --readwrite=randrw
> --rwmixread=75" I get % CPU each ceph-osd as shown bellow:
>
>2452 ceph  20   0 2667088 1.813g  15724 S  22.8  5.8  34:41.02
> /usr/bin/ceph-osd -f --cluster ceph --id 1 --setuser ceph --setgroup ceph
>2178 ceph  20   0 2872152 2.005g  15916 S  22.2  6.4  43:22.80
> /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
>1820 ceph  20   0 2713428 1.865g  15064 S  13.2  5.9  34:19.56
> /usr/bin/ceph-osd -f --cluster ceph --id 2 --setuser ceph --setgroup ceph
>
> Are you using bluestore? How many IOPS / disk throughput did you get with
> your cluster ?
>
>
> Regards,
>
> On Wed, Nov 8, 2017 at 8:13 PM, Alon Avrahami 
> wrote:
>
>> Hello Guys
>>
>> We have a fresh 'luminous' cluster, 12.2.0
>> (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc),
>> installed using ceph-ansible.
>>
>> The cluster contains 6 Intel server boards (S2600WTTR), with 96 osds and
>> 3 mons in total.
>>
>> Each of the 6 nodes (Intel server board S2600WTTR) has 64G of RAM and an
>> Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz (32 cores).
>> Each server has 16 * 1.6TB Dell SSD drives (SSDSC2BB016T7R), for a total
>> of 96 osds and 3 mons.
>>
>> The main usage  is rbd's for our  OpenStack environment ( Okata )
>>
>> We're at the beginning of our production tests and it looks like the
>> osds are too busy, although we don't generate many iops at this stage
>> (almost nothing).
>> All ceph-osds are using about 50% CPU and I can't figure out why they are
>> so busy:
>>
>> top - 07:41:55 up 49 days,  2:54,  2 users,  load average: 6.85, 6.40,
>> 6.37
>>
>> Tasks: 518 total,   1 running, 517 sleeping,   0 stopped,   0 zombie
>> %Cpu(s): 14.8 us,  4.3 sy,  0.0 ni, 80.3 id,  0.0 wa,  0.0 hi,  0.6 si,
>> 0.0 st
>> KiB Mem : 65853584 total, 23953788 free, 40342680 used,  1557116
>> buff/cache
>> KiB Swap:  3997692 total,  3997692 free,0 used. 18020584 avail Mem
>>
>> PID USER  PR  NIVIRTRESSHR S  %CPU %MEM TIME+
>> COMMAND
>>   36713 ceph  20   0 3869588 2.826g  28896 S  47.2  4.5   6079:20
>> ceph-osd
>>   53981 ceph  20   0 3998732 2.666g  28628 S  45.8  4.2   5939:28
>> ceph-osd
>>   55879 ceph  20   0 3707004 2.286g  28844 S  44.2  3.6   5854:29
>> ceph-osd
>>   46026 ceph  20   0 3631136 1.930g  29100 S  43.2  3.1   6008:50
>> ceph-osd
>>   39021 ceph  20   0 4091452 2.698g  28936 S  42.9  4.3   5687:39
>> ceph-osd
>>   47210 ceph  20   0 3598572 1.871g  29092 S  42.9  3.0   5759:19
>> ceph-osd
>>   52763 ceph  20   0 3843216 2.410g  28896 S  42.2  3.8   5540:11
>> ceph-osd
>>   49317 ceph  20   0 3794760 2.142g  28932 S  41.5  3.4   5872:24
>> ceph-osd
>>   42653 ceph  20   0 3915476 2.489g  28840 S  41.2  4.0   5605:13
>> ceph-osd
>>   41560 ceph  20   0 3460900 1.801g  28660 S  38.5  2.9   5128:01
>> ceph-osd
>>   50675 ceph  20   0 3590288 1.827g  28840 S  37.9  2.9   5196:58
>> ceph-osd
>>   37897 ceph  20   0 4034180 2.814g  29000 S  34.9  4.5   4789:10
>> ceph-osd
>>   50237 ceph  20   0 3379780 1.930g  28892 S  34.6  3.1   4846:36
>> ceph-osd
>>   48608 ceph  20   0 3893684 2.721g  28880 S  33.9  4.3   4752:43
>> ceph-osd
>>   40323 ceph  20   0 4227864 2.959g  28800 S  33.6  4.7   4712:36
>> ceph-osd
>>   44638 ceph  20   0 3656780 2.437g  28896 S  33.2  3.9   4793:58
>> ceph-osd
>>   61639 ceph  20   0  527512 114300  20988 S   2.7  0.2   2722:03
>> ceph-mgr
>>   31586 ceph  20   0  765672 304140  21816 S   0.7  0.5 409:06.09
>> ceph-mon
>>  68 root  20   0   0  0  0 S   0.3  0.0   3:09.69
>> ksoftirqd/12
>>
>> strace  doesn't show anything suspicious
>>
>> root@ecprdbcph10-opens:~# strace -p 36713
>> strace: Process 36713 attached
>> futex(0x563343c56764, FUTEX_WAIT_PRIVATE, 1, NUL
>>
>> Ceph logs don't reveal anything.
>> Is this "normal" behavior in Luminous?
>> Looking through older threads I can only find one about time gaps,
>> which is not our case.
>>
>> Thanks,
>> Alon
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Erasure pool

2017-11-09 Thread Caspar Smit
2017-11-08 22:05 GMT+01:00 Marc Roos :

>
> Can anyone advice on a erasure pool config to store
>
> - files between 500MB and 8GB, total 8TB
> - just for archiving, not much reading (few files a week)
> - hdd pool
> - now 3 node cluster (4th coming)
> - would like to save on storage space
>
> I was thinking of a profile with jerasure  k=3 m=2, but maybe this lrc
> is better? Or wait for 4th node and choose k=4 m=2?
>
>
Just to keep in mind:

In a three-node setup with k=3 and m=2 you will have to set the failure
domain to 'osd' (the default failure domain of 'host' would require 5 nodes).
Furthermore, when using 'osd' as the failure domain you would probably have
(some) inaccessible data when a node reboots and/or fails, since there is a
chance that 3 (or more) out of the 5 chunks end up on the same node.
The same goes for 4 nodes and k=4 m=2 (failure domain 'host' would require 6
nodes).
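
For reference, on a luminous cluster such a profile could be created roughly like
this (profile and pool names are made up; note crush-failure-domain=osd for the
three-node case discussed above):

ceph osd erasure-code-profile set archive-k3m2 \
    k=3 m=2 plugin=jerasure crush-failure-domain=osd
ceph osd pool create ecarchive 64 64 erasure archive-k3m2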

Caspar


> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph inconsistent pg missing ec object

2017-11-09 Thread Kenneth Waegeman

Hi Greg,

Thanks! This seems to have worked for at least 1 of 2 inconsistent pgs: 
The inconsistency disappeared after a new scrub. Still waiting for the 
result of the second pg. I tried to force deep-scrub with `ceph pg 
deep-scrub ` yesterday, but today the last deep scrub is still from 
a week ago. Is there a way to actually deep-scrub immediately?
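
For what it's worth, `ceph pg deep-scrub <pgid>` only queues the PG; the scrub
starts once the OSDs involved have a free scrub slot (osd_max_scrubs, default 1).
A rough way to check whether it actually ran, with <pgid> as a placeholder:

ceph pg deep-scrub <pgid>
ceph pg <pgid> query | grep last_deep_scrub_stamp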


Thanks again!

Kenneth


On 02/11/17 19:27, Gregory Farnum wrote:
Okay, after consulting with a colleague this appears to be an instance 
of http://tracker.ceph.com/issues/21382. Assuming the object is one 
that doesn't have snapshots, your easiest resolution is to use rados 
get to retrieve the object (which, unlike recovery, should work) and 
then "rados put" it back in to place.


This fix might be backported to Jewel for a later release, but it's 
tricky so wasn't done proactively.

-Greg

On Fri, Oct 20, 2017 at 12:27 AM Stijn De Weirdt 
mailto:stijn.dewei...@ugent.be>> wrote:


hi gregory,

we more or less followed the instructions on the site (famous last
words, i know ;)

grepping for the error in the osd logs of the osds of the pg, the
primary logs had "5.5e3s0 shard 59(5) missing
5:c7ae919b:::10014d3184b.:head"

we looked for the object using the find command, we got

> [root@osd003 ~]# find
/var/lib/ceph/osd/ceph-35/current/5.5e3s0_head/ -name
"*10014d3184b.*"
>
>

/var/lib/ceph/osd/ceph-35/current/5.5e3s0_head/DIR_3/DIR_E/DIR_5/DIR_7/DIR_9/10014d3184b.__head_D98975E3__5__0

then we ran this find on all 11 osds from the pg, and 10 out of 11
osds
gave similar path (the suffix _[0-9a] matched the index of the osd in
the list of osds reported by the pg, so i assumed that was the ec
splitting up the data in 11 pieces)

on one osd in the list of osds, there was no such object (the 6th one,
index 5, so more assuming form our side that this was the 5 in 5:...
from the logfile). so we assumed this was the missing object that the
error reported. we have absolutely no clue why it was missing or what
happened, nothing in any logs.

what we did then was stop the osd that had the missing object,
flush the
journal and start the osd and ran repair. (the guide mentioned to
delete
an object, we did not delete anything, because we assumed the
issue was
the already missing object from the 6th osd)

flushing the journal segfaulted, but the osd started fine again.

the scrub errors did not disappear, so we did the same again on the
primary (no deleting of anything; and again, the flush segfaulted).

wrt the segfault, i attached the output of a segfaulting flush with
debug on another osd.


stijn


On 10/20/2017 02:56 AM, Gregory Farnum wrote:
> Okay, you're going to need to explain in very clear terms
exactly what
> happened to your cluster, and *exactly* what operations you
performed
> manually.
>
> The PG shards seem to have different views of the PG in
question. The
> primary has a different log_tail, last_user_version, and
last_epoch_clean
> from the others. Plus different log sizes? It's not making a ton
of sense
> at first glance.
> -Greg
>
> On Thu, Oct 19, 2017 at 1:08 AM Stijn De Weirdt
mailto:stijn.dewei...@ugent.be>>
> wrote:
>
>> hi greg,
>>
>> i attached the gzip output of the query and some more info
below. if you
>> need more, let me know.
>>
>> stijn
>>
>>> [root@mds01 ~]# ceph -s
>>>     cluster 92beef0a-1239-4000-bacf-4453ab630e47
>>>      health HEALTH_ERR
>>>             1 pgs inconsistent
>>>             40 requests are blocked > 512 sec
>>>             1 scrub errors
>>>             mds0: Behind on trimming (2793/30)
>>>      monmap e1: 3 mons at {mds01=
>> 1.2.3.4:6789/0,mds02=1.2.3.5:6789/0,mds03=1.2.3.6:6789/0
}
>>>             election epoch 326, quorum 0,1,2 mds01,mds02,mds03
>>>       fsmap e238677: 1/1/1 up {0=mds02=up:active}, 2 up:standby
>>>      osdmap e79554: 156 osds: 156 up, 156 in
>>>             flags sortbitwise,require_jewel_osds
>>>       pgmap v51003893: 4096 pgs, 3 pools, 387 TB data, 243
Mobjects
>>>             545 TB used, 329 TB / 874 TB avail
>>>                 4091 active+clean
>>>                    4 active+clean+scrubbing+deep
>>>                    1 active+clean+inconsistent
>>>   client io 284 kB/s rd, 146 MB/s wr, 145 op/s rd, 177 op/s wr
>>>   cache io 115 MB/s flush, 153 MB/s evict, 14 op/s promote, 3
PG(s)
>> flushing
>>
>>> [root@mds01 ~]# ceph health detail
>>> HEALTH_ERR 1 pgs inconsistent; 52 requests are blocked > 512
sec; 5 osds
>> have slow requests; 1 scrub errors; mds0: Behind on trimming
(2782/30)
>>> pg 5.5e3 is active+

Re: [ceph-users] Luminous ceph pool %USED calculation

2017-11-09 Thread Alwin Antreich
On Fri, Nov 03, 2017 at 12:09:03PM +0100, Alwin Antreich wrote:
> Hi,
>
> I am confused by the %USED calculation in the output of 'ceph df' in luminous.
> In the example below the pools show 2.92% "%USED", but my calculation,
> taken from the source code, gives me 8.28%. On a hammer cluster my
> calculation gives the same result as the 'ceph df' output.
>
>  Am I taking the right values? Or do I miss something on the calculation?
>
> This tracker introduced the calculation: http://tracker.ceph.com/issues/16933
> # https://github.com/ceph/ceph/blob/master/src/mon/PGMap.cc
> curr_object_copies_rate = (float)(sum.num_object_copies - 
> sum.num_objects_degraded) / sum.num_object_copies;
> used = sum.num_bytes * curr_object_copies_rate;
> used /= used + avail;
>
> curr_object_copies_rate  = (num_object_copies: 2118 - num_objects_degraded: 
> 0) / num_object_copies: 2118;
> used = num_bytes: 4437573656 * curr_object_copies_rate
> used /= used + max_avail: 73689653248
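>
> Working the substitution through: curr_object_copies_rate = (2118 - 0) / 2118 = 1,
> so used = 4437573656, and
> 4437573656 / (4437573656 + 73689653248) ~ 0.0568 -> 5.68% (70275M MAX AVAIL, size-2 pool)
> 4437573656 / (4437573656 + 49125785600) ~ 0.0828 -> 8.28% (46850M MAX AVAIL, size-3 pools)
> which matches the 8.28% / 5.68% in the table below, while 'ceph df' itself reports 2.92%.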
>
> # my own calculation
> Name      size   min_size   pg_num   %-used   used
> default   3      2          64       8.28     4437573656
> test1     3      2          64       8.28     4437573656
> test2     2      1          64       5.68     4437573656
>
> # ceph df detail
> GLOBAL:
>     SIZE   AVAIL   RAW USED   %RAW USED   OBJECTS
>     191G   151G    40551M     20.69       3177
> POOLS:
>     NAME      ID   QUOTA OBJECTS   QUOTA BYTES   USED    %USED   MAX AVAIL   OBJECTS   DIRTY   READ   WRITE   RAW USED
>     default   1    N/A             N/A           4232M   2.92    46850M      1059      1059    0      1059    12696M
>     test1     4    N/A             N/A           4232M   2.92    46850M      1059      1059    0      1059    12696M
>     test2     5    N/A             N/A           4232M   2.92    70275M      1059      1059    0      1059    8464M
>
> # ceph pg dump pools
> dumped pools
> POOLID   OBJECTS   MISSING_ON_PRIMARY   DEGRADED   MISPLACED   UNFOUND   BYTES        LOG    DISK_LOG
> 5        1059      0                    0          0           0         4437573656   1059   1059
> 4        1059      0                    0          0           0         4437573656   1059   1059
> 1        1059      0                    0          0           0         4437573656   1059   1059
>
> # ceph versions
> {
> "mon": {
> "ceph version 12.2.1 (1a629971a9bcaaae99e5539a3a43f800a297f267) 
> luminous (stable)": 3
> },
> "mgr": {
> "ceph version 12.2.1 (1a629971a9bcaaae99e5539a3a43f800a297f267) 
> luminous (stable)": 3
> },
> "osd": {
> "ceph version 12.2.1 (1a629971a9bcaaae99e5539a3a43f800a297f267) 
> luminous (stable)": 6
> },
> "mds": {},
> "overall": {
> "ceph version 12.2.1 (1a629971a9bcaaae99e5539a3a43f800a297f267) 
> luminous (stable)": 12
> }
> }
>
> Some more data in the attachment.
>
> Thanks in adavance.
> --
> Cheers,
> Alwin

> # ceph df detail
> GLOBAL:
>     SIZE   AVAIL   RAW USED   %RAW USED   OBJECTS
>     191G   151G    40551M     20.69       3177
> POOLS:
>     NAME      ID   QUOTA OBJECTS   QUOTA BYTES   USED    %USED   MAX AVAIL   OBJECTS   DIRTY   READ   WRITE   RAW USED
>     default   1    N/A             N/A           4232M   2.92    46850M      1059      1059    0      1059    12696M
>     test1     4    N/A             N/A           4232M   2.92    46850M      1059      1059    0      1059    12696M
>     test2     5    N/A             N/A           4232M   2.92    70275M      1059      1059    0      1059    8464M
>
> # ceph pg dump pools
> dumped pools
> POOLID   OBJECTS   MISSING_ON_PRIMARY   DEGRADED   MISPLACED   UNFOUND   BYTES        LOG    DISK_LOG
> 5        1059      0                    0          0           0         4437573656   1059   1059
> 4        1059      0                    0          0           0         4437573656   1059   1059
> 1        1059      0                    0          0           0         4437573656   1059   1059
>
> # ceph osd dump
> epoch 97
> fsid 1c6a05cf-f93c-49a3-939d-877bb61107c3
> created 2017-10-27 13:15:55.049914
> modified 2017-11-03 10:14:58.231071
> flags sortbitwise,recovery_deletes,purged_snapdirs
> crush_version 13
> full_ratio 0.95
> backfillfull_ratio 0.9
> nearfull_ratio 0.85
> require_min_compat_client jewel
> min_compat_client jewel
> require_osd_release luminous
> pool 1 'default' replicated size 3 min_size 2 crush_rule 0 object_hash 
> rjenkins pg_num 64 pgp_num 64 last_change 6 flags hashpspool stripe_width 0 
> application rbd
> pool 4 'test1' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins 
> pg_num 64 pgp_num 64 last_change 49 flags hashpspool stripe_width 0 
> 

Re: [ceph-users] Fwd: Luminous RadosGW issue

2017-11-09 Thread Hans van den Bogert
> On Nov 9, 2017, at 5:25 AM, Sam Huracan  wrote:
> 
> root@radosgw system]# ceph --admin-daemon 
> /var/run/ceph/ceph-client.rgw.radosgw.asok config show | grep log_file
> "log_file": "/var/log/ceph/ceph-client.rgw.radosgw.log”,

The .asok filename reflects the name that should be used in your config. If I'm right,
you should use 'client.rgw.radosgw' in your ceph.conf.
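
As an illustration only, the section would then look roughly like this (settings
carried over from the original post; the keyring still contains
client.radosgw.gateway, so that key would need to be renamed or regenerated to match):

[client.rgw.radosgw]
host = radosgw
keyring = /etc/ceph/ceph.client.radosgw.keyring
log file = /var/log/radosgw/client.rgw.radosgw.log
rgw dns name = radosgw.demo.com
rgw print continue = false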



> On Nov 9, 2017, at 5:25 AM, Sam Huracan  wrote:
> 
> @Hans: Yes, I tried to redeploy RGW, and ensure client.radosgw.gateway is the 
> same in ceph.conf.
> Everything go well, service radosgw running, port 7480 is opened, but all my 
> config of radosgw in ceph.conf can't be set, rgw_dns_name is still empty, and 
> log file keeps default value.
> 
> [root@radosgw system]# ceph --admin-daemon 
> /var/run/ceph/ceph-client.rgw.radosgw.asok config show | grep log_file
> "log_file": "/var/log/ceph/ceph-client.rgw.radosgw.log",
> 
> 
> [root@radosgw system]# cat /etc/ceph/ceph.client.radosgw.keyring 
> [client.radosgw.gateway]
> key = AQCsywNaqQdDHxAAC24O8CJ0A9Gn6qeiPalEYg==
> caps mon = "allow rwx"
> caps osd = "allow rwx"
> 
> 
> 2017-11-09 6:11 GMT+07:00 Hans van den Bogert  >:
> Are you sure you deployed it with the client.radosgw.gateway name as
> well? Try to redeploy the RGW and make sure the name you give it
> corresponds to the name you give in the ceph.conf. Also, do not forget
> to push the ceph.conf to the RGW machine.
> 
> On Wed, Nov 8, 2017 at 11:44 PM, Sam Huracan  > wrote:
> >
> >
> > Hi Cephers,
> >
> > I'm testing RadosGW in Luminous version.  I've already installed done in 
> > separate host, service is running but RadosGW did not accept any my 
> > configuration in ceph.conf.
> >
> > My Config:
> > [client.radosgw.gateway]
> > host = radosgw
> > keyring = /etc/ceph/ceph.client.radosgw.keyring
> > rgw socket path = /var/run/ceph/ceph.radosgw.gateway.fastcgi.sock
> > log file = /var/log/radosgw/client.radosgw.gateway.log
> > rgw dns name = radosgw.demo.com 
> > rgw print continue = false
> >
> >
> > When I show config of radosgw socket:
> > [root@radosgw ~]# ceph --admin-daemon 
> > /var/run/ceph/ceph-client.rgw.radosgw.asok config show | grep dns
> > "mon_dns_srv_name": "",
> > "rgw_dns_name": "",
> > "rgw_dns_s3website_name": "",
> >
> > rgw_dns_name is empty, hence S3 API is unable to access Ceph Object Storage.
> >
> >
> > Do anyone meet this issue?
> >
> > My ceph version I'm  using is ceph-radosgw-12.2.1-0.el7.x86_64
> >
> > Thanks in advance
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com 
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> > 
> >
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] who is using nfs-ganesha and cephfs?

2017-11-09 Thread Lars Marowsky-Bree
On 2017-11-08T21:41:41, Sage Weil  wrote:

> Who is running nfs-ganesha's FSAL to export CephFS?  What has your 
> experience been?
> 
> (We are working on building proper testing and support for this into 
> Mimic, but the ganesha FSAL has been around for years.)

We use it currently, and it works, but let's not discuss the performance
;-)

How else do you want to build this into Mimic?

Regards,
Lars

-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 
(AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph auth profile definitions

2017-11-09 Thread Marc Roos
 
How/where can I see how eg. 'profile rbd' is defined?

As in 
[client.rbd.client1]
key = xxx==
caps mon = "profile rbd"
caps osd = "profile rbd pool=rbd"





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] bluestore - wal,db on faster devices?

2017-11-09 Thread Richard Hesketh
You're correct: if you are going to put the WAL and DB on the same device you
should just make one partition and allocate the DB to it; the WAL will
automatically be stored with the DB. It only makes sense to specify them
separately if they are going to go on different devices, and that itself only
makes sense if the WAL device will be much faster than the DB device, otherwise
you're just making your setup more complex for no gain.
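
For example (device names are illustrative), a DB-only layout for a bluestore OSD
would be prepared with something like:

ceph-disk prepare --bluestore --block.db /dev/nvme0n1 /dev/sdb

i.e. no --block.wal at all; the WAL then lives in the DB partition on the faster device.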

On 09/11/17 08:05, jorpilo wrote:
> 
> I get confused there because on the documentation:
> http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/
> 
> "If there is more, provisioning a DB device makes more sense. The BlueStore 
> journal will always be placed on the fastest device available, so using a DB 
> device will provide the same benefit that the WAL device would while also 
> allowing additional metadata to be stored there"
> 
> So I guess it doesn't make any sense to implicit put WAL and DB on a SSD, 
> only with DB, the biggest you can, would be enough, unless you have 2 
> different kinds of SSD (for example a tiny Nvme and a SSD)
> 
> Am I right? Or would I get any benefit from setting implicit WAL partition on 
> the same SSD?
> 
> 
>  Original message 
> From: Nick Fisk 
> Date: 8/11/17 10:16 p.m. (GMT+01:00)
> To: 'Mark Nelson' , 'Wolfgang Lendl' 
> 
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] bluestore - wal,db on faster devices?
> 
>> -Original Message-
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
>> Mark Nelson
>> Sent: 08 November 2017 19:46
>> To: Wolfgang Lendl 
>> Cc: ceph-users@lists.ceph.com
>> Subject: Re: [ceph-users] bluestore - wal,db on faster devices?
>>
>> Hi Wolfgang,
>>
>> You've got the right idea.  RBD is probably going to benefit less since
> you
>> have a small number of large objects and little extra OMAP data.
>> Having the allocation and object metadata on flash certainly shouldn't
> hurt,
>> and you should still have less overhead for small (<64k) writes.
>> With RGW however you also have to worry about bucket index updates
>> during writes and that's a big potential bottleneck that you don't need to
>> worry about with RBD.
> 
> If you are running anything which is sensitive to sync write latency, like
> databases. You will see a big performance improvement in using WAL on SSD.
> As Mark says, small writes will get ack'd once written to SSD. ~10-200us vs
> 1-2us difference. It will also batch lots of these small writes
> together and write them to disk in bigger chunks much more effectively. If
> you want to run active workloads on RBD and want them to match enterprise
> storage array with BBWC type performance, I would say DB and WAL on SSD is a
> requirement.
> 
> 
>>
>> Mark
>>
>> On 11/08/2017 01:01 PM, Wolfgang Lendl wrote:
>> > Hi Mark,
>> >
>> > thanks for your reply!
>> > I'm a big fan of keeping things simple - this means that there has to
>> > be a very good reason to put the WAL and DB on a separate device
>> > otherwise I'll keep it collocated (and simpler).
>> >
>> > as far as I understood - putting the WAL,DB on a faster (than hdd)
>> > device makes more sense in cephfs and rgw environments (more
>> metadata)
>> > - and less sense in rbd environments - correct?
>> >
>> > br
>> > wolfgang
>> >
>> > On 11/08/2017 02:21 PM, Mark Nelson wrote:
>> >> Hi Wolfgang,
>> >>
>> >> In bluestore the WAL serves sort of a similar purpose to filestore's
>> >> journal, but bluestore isn't dependent on it for guaranteeing
>> >> durability of large writes.  With bluestore you can often get higher
>> >> large-write throughput than with filestore when using HDD-only or
>> >> flash-only OSDs.
>> >>
>> >> Bluestore also stores allocation, object, and cluster metadata in the
>> >> DB.  That, in combination with the way bluestore stores objects,
>> >> dramatically improves behavior during certain workloads.  A big one
>> >> is creating millions of small objects as quickly as possible.  In
>> >> filestore, PG splitting has a huge impact on performance and tail
>> >> latency.  Bluestore is much better just on HDD, and putting the DB
>> >> and WAL on flash makes it better still since metadata no longer is a
>> >> bottleneck.
>> >>
>> >> Bluestore does have a couple of shortcomings vs filestore currently.
>> >> The allocator is not as good as XFS's and can fragment more over time.
>> >> There is no server-side readahead so small sequential read
>> >> performance is very dependent on client-side readahead.  There's
>> >> still a number of optimizations to various things ranging from
>> >> threading and locking in the shardedopwq to pglog and dup_ops that
>> >> potentially could improve performance.
>> >>
>> >> I have a blog post that we've been working on that explores some of
>> >> these things but I'm still waiting on review before I publish it.
>> >>
>> >> Mark
>> >>
>> >> On 11/08/2017 05:53 AM, Wolfgang Lendl wrote:
>> >>> Hello,
>> >>>
>> >>

Re: [ceph-users] Ceph auth profile definitions

2017-11-09 Thread John Spray
On Thu, Nov 9, 2017 at 10:12 AM, Marc Roos  wrote:
>
> How/where can I see how eg. 'profile rbd' is defined?
>
> As in
> [client.rbd.client1]
> key = xxx==
> caps mon = "profile rbd"
> caps osd = "profile rbd pool=rbd"

The profiles are defined internally and are subject to change, but you
can peek at them in the code:
https://github.com/ceph/ceph/blob/master/src/mon/MonCap.cc#L285
https://github.com/ceph/ceph/blob/master/src/osd/OSDCap.cc#L250

John

>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ganesha NFS

2017-11-09 Thread jorpilo
Hi! I would like to export my cephfs using Ganesha NFS, as NFSv3 or NFSv4.
I am a little lost while doing it. I managed to make it work with NFSv4, but
I can't make it work with NFSv3, as the server refuses the connection.
Has anyone managed to do it?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] who is using nfs-ganesha and cephfs?

2017-11-09 Thread Supriti Singh
Hi Sage,

As Lars mentioned, at SUSE, we use ganesha 2.5.2/luminous. We did a preliminary 
performance comparison of cephfs client
and nfs-ganesha client. I have attached the results. The results are aggregate 
bandwidth over 10 clients.

1. Test Setup:
We use fio to read/write to a single 5GB file per thread for 300 seconds. A 
single job (represented in x-axis) is of
type {number_of_worker_thread}rw_{block_size}_{op}, where, 
number_of_worker_threads: 1, 4, 8, 16
Block size: 4K,64K,1M,4M,8M
op: rw 

 
2. NFS-Ganesha configuration:
Parameters set (other than default):
1. Graceless = True
2. MaxRPCSendBufferSize/MaxRPCRecvBufferSize is set to max value.

3. Observations:
-  For single thread (on each client) and 4k block size, the b/w is around 45% 
of cephfs 
- As number of threads increases, the performance drops. It could be related to 
nfs-ganesha parameter
"Dispatch_Max_Reqs_Xprt", which defaults to 512. Note, this parameter is 
important only for v2.5. 
- We did run with both nfs-ganesha mdcache enabled/disabled. But there were no 
significant improvements with caching.
Not sure but it could be related to this issue: 
https://github.com/nfs-ganesha/nfs-ganesha/issues/223
  
The results are still preliminary, and I guess with proper tuning of 
nfs-ganesha parameters, it could be better.
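
For context, the corresponding ganesha.conf looks roughly like the sketch below
(export id, paths and buffer sizes are illustrative, not the exact values used in
the tests):

NFS_CORE_PARAM {
    MaxRPCSendBufferSize = 1048576;
    MaxRPCRecvBufferSize = 1048576;
}
NFSV4 {
    Graceless = true;
}
EXPORT {
    Export_Id = 1;
    Path = "/";
    Pseudo = "/cephfs";
    Access_Type = RW;
    FSAL {
        Name = CEPH;
    }
}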

Thanks,
Supriti 

--
Supriti Singh SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham 
Norton,
HRB 21284 (AG Nürnberg)
 



>>> Lars Marowsky-Bree  11/09/17 11:07 AM >>>
On 2017-11-08T21:41:41, Sage Weil  wrote:

> Who is running nfs-ganesha's FSAL to export CephFS?  What has your 
> experience been?
> 
> (We are working on building proper testing and support for this into 
> Mimic, but the ganesha FSAL has been around for years.)

We use it currently, and it works, but let's not discuss the performance
;-)

How else do you want to build this into Mimic?

Regards,
Lars

-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 
(AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html





NFS_Ganesha_vs_CephFS.ods
Description: Binary data
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Recovery operations and ioprio options

2017-11-09 Thread Захаров Алексей
Hi, Nick
Thank you for the answer!

It's still unclear to me: do those options have no effect at all?
Or is the disk thread used for some other operations?

09.11.2017, 04:18, "Nick Fisk" :
>>  -Original Message-
>>  From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
>>  Aleksei Zakharov
>>  Sent: 08 November 2017 16:21
>>  To: ceph-users@lists.ceph.com
>>  Subject: [ceph-users] Recovery operations and ioprio options
>>
>>  Hello,
>>  Today we use ceph jewel with:
>>    osd disk thread ioprio class=idle
>>    osd disk thread ioprio priority=7
>>  and "nodeep-scrub" flag is set.
>>
>>  We want to change scheduler from CFQ to deadline, so these options will
>>  lose effect.
>>  I've tried to find out what operations are performed in "disk thread". What I
>>  found is that only scrubbing and snap-trimming operations are performed in
>>  "disk thread".
>
> In jewel those operations are now in the main OSD thread and setting the
> ioprio's will have no effect. Use the scrub and snap trim sleep options to
> throttle them.
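
For reference, those sleep options are typically set along these lines (the values
are illustrative, in seconds between scrub/trim chunks):

# ceph.conf, [osd] section
osd_scrub_sleep = 0.1
osd_snap_trim_sleep = 0.1

# or injected at runtime:
ceph tell osd.* injectargs '--osd_scrub_sleep 0.1 --osd_snap_trim_sleep 0.1'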
>
>>  Do these options affect recovery operations?
>>  Are there any other operations in "disk thread", except scrubbing and
>>  snap-trimming?
>>
>>  --
>>  Regards,
>>  Aleksei Zakharov
>>  ___
>>  ceph-users mailing list
>>  ceph-users@lists.ceph.com
>>  http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

-- 
Regards,
Aleksei Zakharov
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Librbd, qemu, libvirt xml

2017-11-09 Thread Marc Roos
 
What would be the correct way to convert the libvirt xml file for rbd-mapped
images to librbd?

I had this (the disk definition XML was stripped by the list archive):

And for librbd this (again, the XML was stripped by the archive):

But this will give me a qemu format drive option:
-drive 
file=rbd:rbd/vps-test2:id=rbd.vps:key=XWHYISTHISEVENHERE==:auth_
supported=cephx\;none:mon_host=192.168.10.111\:6789\;192.168.10.112\:678
9\;192.168.10.113\:6789,format=raw,if=none,id=drive-scsi0-0-0-0,cache=wr
iteback

And not format rbd:
-drive format=rbd,file=rbd:data/squeeze,cache=writeback
As specified here, http://docs.ceph.com/docs/luminous/rbd/qemu-rbd/

If I change type='raw' to type='rbd', I get 
error: unsupported configuration: unknown driver format value 'rbd'
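
For reference, a librbd disk in libvirt is normally declared as a network disk
while the driver type stays 'raw'; a sketch using the pool/image, client id and
monitors from the -drive line above (the secret UUID is a placeholder):

<disk type='network' device='disk'>
  <driver name='qemu' type='raw' cache='writeback'/>
  <auth username='rbd.vps'>
    <secret type='ceph' uuid='PLACEHOLDER-UUID'/>
  </auth>
  <source protocol='rbd' name='rbd/vps-test2'>
    <host name='192.168.10.111' port='6789'/>
    <host name='192.168.10.112' port='6789'/>
    <host name='192.168.10.113' port='6789'/>
  </source>
  <target dev='sda' bus='scsi'/>
</disk>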



Linux c01 3.10.0-693.5.2.el7.x86_64 #1 SMP Fri Oct 20 20:32:50 UTC 2017 
x86_64 x86_64 x86_64 GNU/Linu
ceph-mgr-12.2.1-0.el7.x86_64
ceph-12.2.1-0.el7.x86_64
libcephfs2-12.2.1-0.el7.x86_64
python-cephfs-12.2.1-0.el7.x86_64
ceph-common-12.2.1-0.el7.x86_64
ceph-selinux-12.2.1-0.el7.x86_64
ceph-mon-12.2.1-0.el7.x86_64
ceph-mds-12.2.1-0.el7.x86_64
collectd-ceph-5.7.1-2.el7.x86_64
ceph-base-12.2.1-0.el7.x86_64
ceph-osd-12.2.1-0.el7.x86_64
ceph-deploy-1.5.39-0.noarch
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] bluestore - wal,db on faster devices?

2017-11-09 Thread Denes Dolhay

-sorry, wrong address


Hi Richard,

I have seen a few lectures about bluestore, and they made it abundantly
clear that bluestore is superior to filestore in the sense that it
writes data to the disk only once (this way they could achieve a 2x-3x
speed increase).


So is this true if there is no separate WAL (and DB) device? How does
this work?



Thanks!

Denke.
On 11/09/2017 11:16 AM, Richard Hesketh wrote:

You're correct, if you were going to put the WAL and DB on the same device you 
should just make one partition and allocate the DB to it, the WAL will 
automatically be stored with the DB. It only makes sense to specify them 
separately if they are going to go on different devices, and that itself only 
makes sense if the WAL device will be much faster than the DB device, otherwise 
you're just making your setup more complex for no gain.

On 09/11/17 08:05, jorpilo wrote:

I get confused there because on the documentation:
http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/

"If there is more, provisioning a DB device makes more sense. The BlueStore journal 
will always be placed on the fastest device available, so using a DB device will provide 
the same benefit that the WAL device would while also allowing additional metadata to be 
stored there"

So I guess it doesn't make any sense to implicit put WAL and DB on a SSD, only 
with DB, the biggest you can, would be enough, unless you have 2 different 
kinds of SSD (for example a tiny Nvme and a SSD)

Am I right? Or would I get any benefit from setting implicit WAL partition on 
the same SSD?


 Original message 
From: Nick Fisk 
Date: 8/11/17 10:16 p.m. (GMT+01:00)
To: 'Mark Nelson' , 'Wolfgang Lendl' 

Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] bluestore - wal,db on faster devices?


-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
Mark Nelson
Sent: 08 November 2017 19:46
To: Wolfgang Lendl 
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] bluestore - wal,db on faster devices?

Hi Wolfgang,

You've got the right idea.  RBD is probably going to benefit less since

you

have a small number of large objects and little extra OMAP data.
Having the allocation and object metadata on flash certainly shouldn't

hurt,

and you should still have less overhead for small (<64k) writes.
With RGW however you also have to worry about bucket index updates
during writes and that's a big potential bottleneck that you don't need to
worry about with RBD.

If you are running anything which is sensitive to sync write latency, like
databases. You will see a big performance improvement in using WAL on SSD.
As Mark says, small writes will get ack'd once written to SSD. ~10-200us vs
1-2us difference. It will also batch lots of these small writes
together and write them to disk in bigger chunks much more effectively. If
you want to run active workloads on RBD and want them to match enterprise
storage array with BBWC type performance, I would say DB and WAL on SSD is a
requirement.



Mark

On 11/08/2017 01:01 PM, Wolfgang Lendl wrote:

Hi Mark,

thanks for your reply!
I'm a big fan of keeping things simple - this means that there has to
be a very good reason to put the WAL and DB on a separate device
otherwise I'll keep it collocated (and simpler).

as far as I understood - putting the WAL,DB on a faster (than hdd)
device makes more sense in cephfs and rgw environments (more

metadata)

- and less sense in rbd environments - correct?

br
wolfgang

On 11/08/2017 02:21 PM, Mark Nelson wrote:

Hi Wolfgang,

In bluestore the WAL serves sort of a similar purpose to filestore's
journal, but bluestore isn't dependent on it for guaranteeing
durability of large writes.  With bluestore you can often get higher
large-write throughput than with filestore when using HDD-only or
flash-only OSDs.

Bluestore also stores allocation, object, and cluster metadata in the
DB.  That, in combination with the way bluestore stores objects,
dramatically improves behavior during certain workloads.  A big one
is creating millions of small objects as quickly as possible.  In
filestore, PG splitting has a huge impact on performance and tail
latency.  Bluestore is much better just on HDD, and putting the DB
and WAL on flash makes it better still since metadata no longer is a
bottleneck.

Bluestore does have a couple of shortcomings vs filestore currently.
The allocator is not as good as XFS's and can fragment more over time.
There is no server-side readahead so small sequential read
performance is very dependent on client-side readahead.  There's
still a number of optimizations to various things ranging from
threading and locking in the shardedopwq to pglog and dup_ops that
potentially could improve performance.

I have a blog post that we've been working on that explores some of
these things but I'm still waiting on review before I publish it.

Mark

On 11/08/2017 05:53 AM, Wolfg

Re: [ceph-users] bluestore - wal,db on faster devices?

2017-11-09 Thread Mark Nelson
One small point:  It's a bit easier to observe distinct WAL and DB 
behavior when they are on separate partitions.  I often do this for 
benchmarking and testing though I don't know that it would be enough of 
a benefit to do it in production.


Mark

On 11/09/2017 04:16 AM, Richard Hesketh wrote:

You're correct, if you were going to put the WAL and DB on the same device you 
should just make one partition and allocate the DB to it, the WAL will 
automatically be stored with the DB. It only makes sense to specify them 
separately if they are going to go on different devices, and that itself only 
makes sense if the WAL device will be much faster than the DB device, otherwise 
you're just making your setup more complex for no gain.

On 09/11/17 08:05, jorpilo wrote:


I get confused there because on the documentation:
http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/

"If there is more, provisioning a DB device makes more sense. The BlueStore journal 
will always be placed on the fastest device available, so using a DB device will provide 
the same benefit that the WAL device would while also allowing additional metadata to be 
stored there"

So I guess it doesn't make any sense to implicit put WAL and DB on a SSD, only 
with DB, the biggest you can, would be enough, unless you have 2 different 
kinds of SSD (for example a tiny Nvme and a SSD)

Am I right? Or would I get any benefit from setting implicit WAL partition on 
the same SSD?


 Original message 
From: Nick Fisk 
Date: 8/11/17 10:16 p.m. (GMT+01:00)
To: 'Mark Nelson' , 'Wolfgang Lendl' 

Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] bluestore - wal,db on faster devices?


-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
Mark Nelson
Sent: 08 November 2017 19:46
To: Wolfgang Lendl 
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] bluestore - wal,db on faster devices?

Hi Wolfgang,

You've got the right idea.  RBD is probably going to benefit less since

you

have a small number of large objects and little extra OMAP data.
Having the allocation and object metadata on flash certainly shouldn't

hurt,

and you should still have less overhead for small (<64k) writes.
With RGW however you also have to worry about bucket index updates
during writes and that's a big potential bottleneck that you don't need to
worry about with RBD.


If you are running anything which is sensitive to sync write latency, like
databases. You will see a big performance improvement in using WAL on SSD.
As Mark says, small writes will get ack'd once written to SSD. ~10-200us vs
1-2us difference. It will also batch lots of these small writes
together and write them to disk in bigger chunks much more effectively. If
you want to run active workloads on RBD and want them to match enterprise
storage array with BBWC type performance, I would say DB and WAL on SSD is a
requirement.




Mark

On 11/08/2017 01:01 PM, Wolfgang Lendl wrote:

Hi Mark,

thanks for your reply!
I'm a big fan of keeping things simple - this means that there has to
be a very good reason to put the WAL and DB on a separate device
otherwise I'll keep it collocated (and simpler).

as far as I understood - putting the WAL,DB on a faster (than hdd)
device makes more sense in cephfs and rgw environments (more

metadata)

- and less sense in rbd environments - correct?

br
wolfgang

On 11/08/2017 02:21 PM, Mark Nelson wrote:

Hi Wolfgang,

In bluestore the WAL serves sort of a similar purpose to filestore's
journal, but bluestore isn't dependent on it for guaranteeing
durability of large writes.  With bluestore you can often get higher
large-write throughput than with filestore when using HDD-only or
flash-only OSDs.

Bluestore also stores allocation, object, and cluster metadata in the
DB.  That, in combination with the way bluestore stores objects,
dramatically improves behavior during certain workloads.  A big one
is creating millions of small objects as quickly as possible.  In
filestore, PG splitting has a huge impact on performance and tail
latency.  Bluestore is much better just on HDD, and putting the DB
and WAL on flash makes it better still since metadata no longer is a
bottleneck.

Bluestore does have a couple of shortcomings vs filestore currently.
The allocator is not as good as XFS's and can fragment more over time.
There is no server-side readahead so small sequential read
performance is very dependent on client-side readahead.  There's
still a number of optimizations to various things ranging from
threading and locking in the shardedopwq to pglog and dup_ops that
potentially could improve performance.

I have a blog post that we've been working on that explores some of
these things but I'm still waiting on review before I publish it.

Mark

On 11/08/2017 05:53 AM, Wolfgang Lendl wrote:

Hello,

it's clear to me getting a performance gain from putting the journal
on a fast device (ssd,nvm

Re: [ceph-users] who is using nfs-ganesha and cephfs?

2017-11-09 Thread Supriti Singh
The email was not delivered to ceph-de...@vger.kernel.org. So, re-sending it. 

Few more things regarding the hardware and clients used in our benchmarking 
setup:
- The cephfs benchmark were done using kernel cephfs client. 
- NFS-Ganesha was mounted using nfs version 4. 
- Single nfs-ganesha server was used. 

Ceph and Client setup:
- Each client node has 16 cores and 16 GB RAM.
- MDS and Ganesha server is running on the same node. 
- Network interconnect between client and ceph nodes is 40Gbit/s. 
- Ceph on 8 nodes: (each node has 24 cores/128 GB RAM). 
  - 5 OSD nodes
  - 3 MON/MDS nodes
  - 6 OSD daemons per node - Blustore - SSD/NVME journal 


--
Supriti Singh SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham 
Norton,
HRB 21284 (AG Nürnberg)
 



>>> Supriti Singh 11/09/17 12:15 PM >>>
Hi Sage,

As Lars mentioned, at SUSE, we use ganesha 2.5.2/luminous. We did a preliminary 
performance comparison of cephfs client
and nfs-ganesha client. I have attached the results. The results are aggregate 
bandwidth over 10 clients.

1. Test Setup:
We use fio to read/write to a single 5GB file per thread for 300 seconds. A 
single job (represented in x-axis) is of
type {number_of_worker_thread}rw_{block_size}_{op}, where, 
number_of_worker_threads: 1, 4, 8, 16
Block size: 4K,64K,1M,4M,8M
op: rw 

 
2. NFS-Ganesha configuration:
Parameters set (other than default):
1. Graceless = True
2. MaxRPCSendBufferSize/MaxRPCRecvBufferSize is set to max value.

3. Observations:
-  For single thread (on each client) and 4k block size, the b/w is around 45% 
of cephfs 
- As number of threads increases, the performance drops. It could be related to 
nfs-ganesha parameter
"Dispatch_Max_Reqs_Xprt", which defaults to 512. Note, this parameter is 
important only for v2.5. 
- We did run with both nfs-ganesha mdcache enabled/disabled. But there were no 
significant improvements with caching.
Not sure but it could be related to this issue: 
https://github.com/nfs-ganesha/nfs-ganesha/issues/223
  
The results are still preliminary, and I guess with proper tuning of 
nfs-ganesha parameters, it could be better.

Thanks,
Supriti 

--
Supriti Singh SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham 
Norton,
HRB 21284 (AG Nürnberg)
 



>>> Lars Marowsky-Bree  11/09/17 11:07 AM >>>
On 2017-11-08T21:41:41, Sage Weil  wrote:

> Who is running nfs-ganesha's FSAL to export CephFS?  What has your 
> experience been?
> 
> (We are working on building proper testing and support for this into 
> Mimic, but the ganesha FSAL has been around for years.)

We use it currently, and it works, but let's not discuss the performance
;-)

How else do you want to build this into Mimic?

Regards,
Lars

-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 
(AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html







NFS_Ganesha_vs_CephFS.ods
Description: Binary data
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph auth profile definitions

2017-11-09 Thread Jason Dillaman
They are currently defined to the following (translated to cap syntax):

mon: 'allow service mon r, allow service osd r, allow service pg r,
allow command "osd blacklist" with blacklistop=add addr regex
"^[^/]+/[0-9]+$"'
osd: 'allow class-read object_prefix rbd_children, allow class-read
object_prefix rbd_mirroring, allow [pool ] rwx'
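
For example, a keyring like the one quoted below can be created directly with the
profiles (client name and pool are illustrative):

ceph auth get-or-create client.rbd.client1 \
    mon 'profile rbd' osd 'profile rbd pool=rbd'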


On Thu, Nov 9, 2017 at 5:24 AM, John Spray  wrote:
>
> On Thu, Nov 9, 2017 at 10:12 AM, Marc Roos  wrote:
> >
> > How/where can I see how eg. 'profile rbd' is defined?
> >
> > As in
> > [client.rbd.client1]
> > key = xxx==
> > caps mon = "profile rbd"
> > caps osd = "profile rbd pool=rbd"
>
> The profiles are defined internally and are subject to change, but you
> can peek at them in the code:
> https://github.com/ceph/ceph/blob/master/src/mon/MonCap.cc#L285
> https://github.com/ceph/ceph/blob/master/src/osd/OSDCap.cc#L250
>
> John
>
> >
> >
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] how to upgrade CEPH journal?

2017-11-09 Thread Rudi Ahlers
Hi,

Can someone please tell me what the correct procedure is to upgrade a CEPH
journal?

I'm running ceph: 12.2.1 on Proxmox 5.1, which runs on Debian 9.1

For a journal I have a 400GB Intel SSD drive and it seems CEPH created a
1GB journal:

Disk /dev/sdf: 372.6 GiB, 400088457216 bytes, 781422768 sectors
/dev/sdf1 2048 2099199 2097152   1G unknown
/dev/sdf2  2099200 4196351 2097152   1G unknown

root@virt2:~# fdisk -l | grep sde
Disk /dev/sde: 372.6 GiB, 400088457216 bytes, 781422768 sectors
/dev/sde1   2048 2099199 2097152   1G unknown


/dev/sda :
 /dev/sda1 ceph data, active, cluster ceph, osd.3, block /dev/sda2,
block.db /dev/sde1
 /dev/sda2 ceph block, for /dev/sda1
/dev/sdb :
 /dev/sdb1 ceph data, active, cluster ceph, osd.4, block /dev/sdb2,
block.db /dev/sdf1
 /dev/sdb2 ceph block, for /dev/sdb1
/dev/sdc :
 /dev/sdc1 ceph data, active, cluster ceph, osd.5, block /dev/sdc2,
block.db /dev/sdf2
 /dev/sdc2 ceph block, for /dev/sdc1
/dev/sdd :
 /dev/sdd1 other, xfs, mounted on /data/brick1
 /dev/sdd2 other, xfs, mounted on /data/brick2
/dev/sde :
 /dev/sde1 ceph block.db, for /dev/sda1
/dev/sdf :
 /dev/sdf1 ceph block.db, for /dev/sdb1
 /dev/sdf2 ceph block.db, for /dev/sdc1
/dev/sdg :


resizing the partition through fdisk didn't work. What is the correct
procedure, please?

Kind Regards
Rudi Ahlers
Website: http://www.rudiahlers.co.za
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Cluster hang

2017-11-09 Thread Matteo Dacrema
Hi all,

I’ve experienced a strange issue with my cluster.
The cluster is composed of 10 HDD nodes with 20 HDDs + 4 journal devices each, plus 4
SSD nodes with 5 SSDs each.
All the nodes sit behind 3 monitors and 2 different crush maps.
The whole cluster is on 10.2.7.

About 20 days ago I started to notice that long backups hang with "task
jbd2/vdc1-8:555 blocked for more than 120 seconds" on the HDD crush map.
A few days ago another VM started to show high iowait without doing any iops, also
on the HDD crush map.

Today about a hundred VMs weren't able to read/write from many volumes, all of
them on the HDD crush map. Ceph health was ok and no significant log entries were
found.
Not all the VMs experienced this problem, and in the meantime the iops on the
journals and HDDs were very low even though I was able to do significant iops on the
working VMs.

After two hours of debugging I decided to reboot one of the OSD nodes and the
cluster started to respond again. Now the OSD node is back in the cluster and the
problem has disappeared.

Can someone help me to understand what happened?
I see strange entries in the log files like:

accept replacing existing (lossy) channel (new one lossy=1)
fault with nothing to send, going to standby
leveldb manual compact 

I can share all the logs that can help to identify the issue.

Thank you.
Regards,

Matteo


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] how to upgrade CEPH journal?

2017-11-09 Thread Alwin Antreich
Hi Rudi,
On Thu, Nov 09, 2017 at 04:09:04PM +0200, Rudi Ahlers wrote:
> Hi,
>
> Can someone please tell me what the correct procedure is to upgrade a CEPH
> journal?
>
> I'm running ceph: 12.2.1 on Proxmox 5.1, which runs on Debian 9.1
>
> For a journal I have a 400GB Intel SSD drive and it seems CEPH created a
> 1GB journal:
>
> Disk /dev/sdf: 372.6 GiB, 400088457216 bytes, 781422768 sectors
> /dev/sdf1 2048 2099199 2097152   1G unknown
> /dev/sdf2  2099200 4196351 2097152   1G unknown
>
> root@virt2:~# fdisk -l | grep sde
> Disk /dev/sde: 372.6 GiB, 400088457216 bytes, 781422768 sectors
> /dev/sde1   2048 2099199 2097152   1G unknown
>
>
> /dev/sda :
>  /dev/sda1 ceph data, active, cluster ceph, osd.3, block /dev/sda2,
> block.db /dev/sde1
>  /dev/sda2 ceph block, for /dev/sda1
> /dev/sdb :
>  /dev/sdb1 ceph data, active, cluster ceph, osd.4, block /dev/sdb2,
> block.db /dev/sdf1
>  /dev/sdb2 ceph block, for /dev/sdb1
> /dev/sdc :
>  /dev/sdc1 ceph data, active, cluster ceph, osd.5, block /dev/sdc2,
> block.db /dev/sdf2
>  /dev/sdc2 ceph block, for /dev/sdc1
> /dev/sdd :
>  /dev/sdd1 other, xfs, mounted on /data/brick1
>  /dev/sdd2 other, xfs, mounted on /data/brick2
> /dev/sde :
>  /dev/sde1 ceph block.db, for /dev/sda1
> /dev/sdf :
>  /dev/sdf1 ceph block.db, for /dev/sdb1
>  /dev/sdf2 ceph block.db, for /dev/sdc1
> /dev/sdg :
>
>
> resizing the partition through fdisk didn't work. What is the correct
> procedure, please?
>
> Kind Regards
> Rudi Ahlers
> Website: http://www.rudiahlers.co.za

> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
For Bluestore OSDs you need to set bluestore_block_db_size to get a bigger
partition for the DB and bluestore_block_wal_size for the WAL.

ceph-disk prepare --bluestore \
--block.db /dev/sde --block.wal /dev/sde /dev/sdX

This gives you in total four partitions on two different disks.
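
For example, a sketch of the ceph.conf entries that control those partition sizes,
set before running ceph-disk (the sizes are illustrative, in bytes):

[osd]
bluestore_block_db_size  = 64424509440   # 60 GiB DB partition
bluestore_block_wal_size = 2147483648    # 2 GiB WAL partition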

I think it will be less hassle to remove the OSD and prepare it again.

--
Cheers,
Alwin

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] how to upgrade CEPH journal?

2017-11-09 Thread Rudi Ahlers
Hi Alwin,

Thanx for the help.

I see now that I used the wrong wording in my email. I want to resize the
journal, not upgrade.

So, following your commands, I still sit with a 1GB journal:



root@virt1:~# ceph-disk prepare --bluestore \
> --block.db /dev/sde --block.wal /dev/sde1 /dev/sda
Setting name!
partNum is 0
REALLY setting name!
The operation has completed successfully.
prepare_device: OSD will not be hot-swappable if block.db is not the same
device as the osd data
Setting name!
partNum is 1
REALLY setting name!
The operation has completed successfully.
The operation has completed successfully.
prepare_device: OSD will not be hot-swappable if block.wal is not the same
device as the osd data
prepare_device: Block.wal /dev/sde1 was not prepared with ceph-disk.
Symlinking directly.
Setting name!
partNum is 1
REALLY setting name!
The operation has completed successfully.
The operation has completed successfully.
meta-data=/dev/sda1  isize=2048   agcount=4, agsize=6400 blks
 =   sectsz=4096  attr=2, projid32bit=1
 =   crc=1finobt=1, sparse=0, rmapbt=0,
reflink=0
data =   bsize=4096   blocks=25600, imaxpct=25
 =   sunit=0  swidth=0 blks
naming   =version 2  bsize=4096   ascii-ci=0 ftype=1
log  =internal log   bsize=4096   blocks=1608, version=2
 =   sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none   extsz=4096   blocks=0, rtextents=0
Warning: The kernel is still using the old partition table.
The new table will be used at the next reboot or after you
run partprobe(8) or kpartx(8)
The operation has completed successfully.

root@virt1:~# partprobe


root@virt1:~# fdisk -l | grep sde
Disk /dev/sde: 372.6 GiB, 400088457216 bytes, 781422768 sectors
/dev/sde1   2048 195311615 195309568 93.1G Linux filesystem
/dev/sde2  195311616 197408767   2097152    1G unknown



On Thu, Nov 9, 2017 at 6:02 PM, Alwin Antreich 
wrote:

> Hi Rudi,
> On Thu, Nov 09, 2017 at 04:09:04PM +0200, Rudi Ahlers wrote:
> > Hi,
> >
> > Can someone please tell me what the correct procedure is to upgrade a
> CEPH
> > journal?
> >
> > I'm running ceph: 12.2.1 on Proxmox 5.1, which runs on Debian 9.1
> >
> > For a journal I have a 400GB Intel SSD drive and it seems CEPH created a
> > 1GB journal:
> >
> > Disk /dev/sdf: 372.6 GiB, 400088457216 bytes, 781422768 sectors
> > /dev/sdf1 2048 2099199 2097152   1G unknown
> > /dev/sdf2  2099200 4196351 2097152   1G unknown
> >
> > root@virt2:~# fdisk -l | grep sde
> > Disk /dev/sde: 372.6 GiB, 400088457216 bytes, 781422768 sectors
> > /dev/sde1   2048 2099199 2097152   1G unknown
> >
> >
> > /dev/sda :
> >  /dev/sda1 ceph data, active, cluster ceph, osd.3, block /dev/sda2,
> > block.db /dev/sde1
> >  /dev/sda2 ceph block, for /dev/sda1
> > /dev/sdb :
> >  /dev/sdb1 ceph data, active, cluster ceph, osd.4, block /dev/sdb2,
> > block.db /dev/sdf1
> >  /dev/sdb2 ceph block, for /dev/sdb1
> > /dev/sdc :
> >  /dev/sdc1 ceph data, active, cluster ceph, osd.5, block /dev/sdc2,
> > block.db /dev/sdf2
> >  /dev/sdc2 ceph block, for /dev/sdc1
> > /dev/sdd :
> >  /dev/sdd1 other, xfs, mounted on /data/brick1
> >  /dev/sdd2 other, xfs, mounted on /data/brick2
> > /dev/sde :
> >  /dev/sde1 ceph block.db, for /dev/sda1
> > /dev/sdf :
> >  /dev/sdf1 ceph block.db, for /dev/sdb1
> >  /dev/sdf2 ceph block.db, for /dev/sdc1
> > /dev/sdg :
> >
> >
> > resizing the partition through fdisk didn't work. What is the correct
> > procedure, please?
> >
> > Kind Regards
> > Rudi Ahlers
> > Website: http://www.rudiahlers.co.za
>
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> For Bluestore OSDs you need to set bluestore_block_size to get a bigger
> partition for the DB and bluestore_block_wal_size for the WAL.
>
> ceph-disk prepare --bluestore \
> --block.db /dev/sde --block.wal /dev/sde /dev/sdX
>
> This gives you in total four partitions on two different disks.
>
> I think it will be less hassle to remove the OSD and prepare it again.
>
> --
> Cheers,
> Alwin
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Kind Regards
Rudi Ahlers
Website: http://www.rudiahlers.co.za
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cluster hang

2017-11-09 Thread Matteo Dacrema
Update: I noticed that there was a pg that remained scrubbing from the first 
day I found the issue until I rebooted the node and the problem disappeared.
Can this cause the behaviour I described before?


> Il giorno 09 nov 2017, alle ore 15:55, Matteo Dacrema  ha 
> scritto:
> 
> Hi all,
> 
> I’ve experienced a strange issue with my cluster.
> The cluster is composed of 10 HDD nodes with 20 HDDs + 4 journal drives each, 
> plus 4 SSD nodes with 5 SSDs each.
> All the nodes are behind 3 monitors and 2 different crush maps.
> The whole cluster is on 10.2.7.
> 
> About 20 days ago I started to notice that long backups hang with "task 
> jbd2/vdc1-8:555 blocked for more than 120 seconds" on the HDD crush map.
> A few days ago another VM started to have high iowait without doing any iops, 
> also on the HDD crush map.
> 
> Today about a hundred VMs weren't able to read/write from many volumes, all of 
> them on the HDD crush map. Ceph health was OK and no significant log entries 
> were found.
> Not all the VMs experienced this problem, and in the meanwhile the iops on the 
> journals and HDDs were very low even though I was able to do significant iops 
> on the working VMs.
> 
> After two hours of debugging I decided to reboot one of the OSD nodes and the 
> cluster started to respond again. Now the OSD node is back in the cluster and 
> the problem has disappeared.
> 
> Can someone help me to understand what happened?
> I see strange entries in the log files like:
> 
> accept replacing existing (lossy) channel (new one lossy=1)
> fault with nothing to send, going to standby
> leveldb manual compact 
> 
> I can share all the logs that can help to identify the issue.
> 
> Thank you.
> Regards,
> 
> Matteo
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> --
> Questo messaggio e' stato analizzato con Libra ESVA ed e' risultato non 
> infetto.
> Seguire il link qui sotto per segnalarlo come spam: 
> http://mx01.enter.it/cgi-bin/learn-msg.cgi?id=12EAC4481A.A6F60
> 
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] how to upgrade CEPH journal?

2017-11-09 Thread Caspar Smit
Rudi,

You can set the size of block.db and block.wal partitions in the ceph.conf
configuration file using:

bluestore_block_db_size = 16106127360 (which is 15GB, just calculate the
correct number for your needs)
bluestore_block_wal_size = 16106127360
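
As a quick check of the arithmetic (the 30 GiB DB and 1 GiB WAL below are only
illustrations, pick whatever fits your device and workload):

15 GiB = 15 * 1024 * 1024 * 1024 bytes = 16106127360
30 GiB = 30 * 1024 * 1024 * 1024 bytes = 32212254720

# e.g. in the [global] or [osd] section of ceph.conf:
bluestore_block_db_size = 32212254720
bluestore_block_wal_size = 1073741824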

Kind regards,
Caspar


2017-11-09 17:19 GMT+01:00 Rudi Ahlers :

> Hi Alwin,
>
> Thanx for the help.
>
> I see now that I used the wrong wording in my email. I want to resize the
> journal, not upgrade.
>
> So, following your commands, I still sit with a 1GB journal:
>
>
>
> root@virt1:~# ceph-disk prepare --bluestore \
> > --block.db /dev/sde --block.wal /dev/sde1 /dev/sda
> Setting name!
> partNum is 0
> REALLY setting name!
> The operation has completed successfully.
> prepare_device: OSD will not be hot-swappable if block.db is not the same
> device as the osd data
> Setting name!
> partNum is 1
> REALLY setting name!
> The operation has completed successfully.
> The operation has completed successfully.
> prepare_device: OSD will not be hot-swappable if block.wal is not the same
> device as the osd data
> prepare_device: Block.wal /dev/sde1 was not prepared with ceph-disk.
> Symlinking directly.
> Setting name!
> partNum is 1
> REALLY setting name!
> The operation has completed successfully.
> The operation has completed successfully.
> meta-data=/dev/sda1  isize=2048   agcount=4, agsize=6400 blks
>  =   sectsz=4096  attr=2, projid32bit=1
>  =   crc=1finobt=1, sparse=0,
> rmapbt=0, reflink=0
> data =   bsize=4096   blocks=25600, imaxpct=25
>  =   sunit=0  swidth=0 blks
> naming   =version 2  bsize=4096   ascii-ci=0 ftype=1
> log  =internal log   bsize=4096   blocks=1608, version=2
>  =   sectsz=4096  sunit=1 blks, lazy-count=1
> realtime =none   extsz=4096   blocks=0, rtextents=0
> Warning: The kernel is still using the old partition table.
> The new table will be used at the next reboot or after you
> run partprobe(8) or kpartx(8)
> The operation has completed successfully.
>
> root@virt1:~# partprobe
>
>
> root@virt1:~# fdisk -l | grep sde
> Disk /dev/sde: 372.6 GiB, 400088457216 bytes, 781422768 sectors
> /dev/sde1   2048 195311615 195309568 93.1G Linux filesystem
> /dev/sde2  195311616 197408767   2097152    1G unknown
>
>
>
> On Thu, Nov 9, 2017 at 6:02 PM, Alwin Antreich 
> wrote:
>
>> Hi Rudi,
>> On Thu, Nov 09, 2017 at 04:09:04PM +0200, Rudi Ahlers wrote:
>> > Hi,
>> >
>> > Can someone please tell me what the correct procedure is to upgrade a
>> CEPH
>> > journal?
>> >
>> > I'm running ceph: 12.2.1 on Proxmox 5.1, which runs on Debian 9.1
>> >
>> > For a journal I have a 400GB Intel SSD drive and it seems CEPH created a
>> > 1GB journal:
>> >
>> > Disk /dev/sdf: 372.6 GiB, 400088457216 bytes, 781422768 sectors
>> > /dev/sdf1 2048 2099199 2097152   1G unknown
>> > /dev/sdf2  2099200 4196351 2097152   1G unknown
>> >
>> > root@virt2:~# fdisk -l | grep sde
>> > Disk /dev/sde: 372.6 GiB, 400088457216 bytes, 781422768 sectors
>> > /dev/sde1   2048 2099199 2097152   1G unknown
>> >
>> >
>> > /dev/sda :
>> >  /dev/sda1 ceph data, active, cluster ceph, osd.3, block /dev/sda2,
>> > block.db /dev/sde1
>> >  /dev/sda2 ceph block, for /dev/sda1
>> > /dev/sdb :
>> >  /dev/sdb1 ceph data, active, cluster ceph, osd.4, block /dev/sdb2,
>> > block.db /dev/sdf1
>> >  /dev/sdb2 ceph block, for /dev/sdb1
>> > /dev/sdc :
>> >  /dev/sdc1 ceph data, active, cluster ceph, osd.5, block /dev/sdc2,
>> > block.db /dev/sdf2
>> >  /dev/sdc2 ceph block, for /dev/sdc1
>> > /dev/sdd :
>> >  /dev/sdd1 other, xfs, mounted on /data/brick1
>> >  /dev/sdd2 other, xfs, mounted on /data/brick2
>> > /dev/sde :
>> >  /dev/sde1 ceph block.db, for /dev/sda1
>> > /dev/sdf :
>> >  /dev/sdf1 ceph block.db, for /dev/sdb1
>> >  /dev/sdf2 ceph block.db, for /dev/sdc1
>> > /dev/sdg :
>> >
>> >
>> > resizing the partition through fdisk didn't work. What is the correct
>> > procedure, please?
>> >
>> > Kind Regards
>> > Rudi Ahlers
>> > Website: http://www.rudiahlers.co.za
>>
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> For Bluestore OSDs you need to set bluestore_block_size to get a bigger
>> partition for the DB and bluestore_block_wal_size for the WAL.
>>
>> ceph-disk prepare --bluestore \
>> --block.db /dev/sde --block.wal /dev/sde /dev/sdX
>>
>> This gives you in total four partitions on two different disks.
>>
>> I think it will be less hassle to remove the OSD and prepare it again.
>>
>> --
>> Cheers,
>> Alwin
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
>
>
> --
> Kind Regards
> Rudi Ahlers
> Website: http://www.rudiahlers.co.za
>

Re: [ceph-users] how to upgrade CEPH journal?

2017-11-09 Thread Caspar Smit
2017-11-09 17:02 GMT+01:00 Alwin Antreich :

> Hi Rudi,
> On Thu, Nov 09, 2017 at 04:09:04PM +0200, Rudi Ahlers wrote:
> > Hi,
> >
> > Can someone please tell me what the correct procedure is to upgrade a
> CEPH
> > journal?
> >
> > I'm running ceph: 12.2.1 on Proxmox 5.1, which runs on Debian 9.1
> >
> > For a journal I have a 400GB Intel SSD drive and it seems CEPH created a
> > 1GB journal:
> >
> > Disk /dev/sdf: 372.6 GiB, 400088457216 bytes, 781422768 sectors
> > /dev/sdf1 2048 2099199 2097152   1G unknown
> > /dev/sdf2  2099200 4196351 2097152   1G unknown
> >
> > root@virt2:~# fdisk -l | grep sde
> > Disk /dev/sde: 372.6 GiB, 400088457216 bytes, 781422768 sectors
> > /dev/sde1   2048 2099199 2097152   1G unknown
> >
> >
> > /dev/sda :
> >  /dev/sda1 ceph data, active, cluster ceph, osd.3, block /dev/sda2,
> > block.db /dev/sde1
> >  /dev/sda2 ceph block, for /dev/sda1
> > /dev/sdb :
> >  /dev/sdb1 ceph data, active, cluster ceph, osd.4, block /dev/sdb2,
> > block.db /dev/sdf1
> >  /dev/sdb2 ceph block, for /dev/sdb1
> > /dev/sdc :
> >  /dev/sdc1 ceph data, active, cluster ceph, osd.5, block /dev/sdc2,
> > block.db /dev/sdf2
> >  /dev/sdc2 ceph block, for /dev/sdc1
> > /dev/sdd :
> >  /dev/sdd1 other, xfs, mounted on /data/brick1
> >  /dev/sdd2 other, xfs, mounted on /data/brick2
> > /dev/sde :
> >  /dev/sde1 ceph block.db, for /dev/sda1
> > /dev/sdf :
> >  /dev/sdf1 ceph block.db, for /dev/sdb1
> >  /dev/sdf2 ceph block.db, for /dev/sdc1
> > /dev/sdg :
> >
> >
> > resizing the partition through fdisk didn't work. What is the correct
> > procedure, please?
> >
> > Kind Regards
> > Rudi Ahlers
> > Website: http://www.rudiahlers.co.za
>
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> For Bluestore OSDs you need to set bluestore_block_size to get a bigger
> partition for the DB and bluestore_block_wal_size for the WAL.
>
>
I think you mean the bluestore_block_db_size instead of the
bluestore_block_size parameter.



> ceph-disk prepare --bluestore \
> --block.db /dev/sde --block.wal /dev/sde /dev/sdX
>
>
Furthermore, using the same drive for db and wal is not necessary since the
wal will always use the fastest storage available. In this case only
specify a block.db device and the wal will go there too.
If you have an even faster device than the Intel SSD (like an NVMe device)
you can specify that as a wal.

So after you set bluestore_block_db_size in ceph.conf issue:

ceph-disk prepare --bluestore --block.db /dev/sde /dev/sdX
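
Putting the two steps together, a minimal sketch (device names and the size
are only examples, and the size has to be in ceph.conf before ceph-disk runs):

# /etc/ceph/ceph.conf
[osd]
bluestore_block_db_size = 16106127360

# then prepare the OSD; ceph-disk carves the block.db partition on /dev/sde
# and the wal lives inside that db partition:
ceph-disk prepare --bluestore --block.db /dev/sde /dev/sdb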

Kind regards,
Caspar

This gives you in total four partitions on two different disks.
>
> I think it will be less hassle to remove the OSD and prepare it again.
>
> --
> Cheers,
> Alwin
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] how to upgrade CEPH journal?

2017-11-09 Thread Richard Hesketh
Please bear in mind that unless you've got a very good reason for separating 
the WAL/DB into two partitions (i.e. you are testing/debugging and want to 
observe their behaviour separately or they're actually going to go on different 
devices which have different speeds) you should probably stick to using one 
large partition and specifying block.db only; the WAL will automatically be 
included with the DB.

Personally, I found specifying these options in the config overly fiddly; if 
you manually partition your DB device with gdisk or whatever, and then specify 
the partitions as arguments, ceph-disk will just use that entire partition 
regardless of what other size settings are configured.

ceph-disk prepare --bluestore /dev/sda --block.db /dev/disk/by-partuuid/[UUID 
STRING]

(you should refer to the partition by UUID, rather than device letter, because 
ceph-disk will just symlink to the provided argument, and you cannot guarantee 
that device letters will be consistent between reboots, but you can be pretty 
dang sure the UUID will not change or collide)
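
A minimal sketch of that workflow, assuming /dev/sde is the SSD and /dev/sda
the data disk (names, the 60G size and the partition number are only examples):

# carve a partition for the DB on the SSD and name it
sgdisk --new=1:0:+60G --change-name=1:'ceph block.db' /dev/sde
partprobe /dev/sde

# look up its partition UUID
blkid -o value -s PARTUUID /dev/sde1

# prepare the OSD against that partition
ceph-disk prepare --bluestore /dev/sda \
--block.db /dev/disk/by-partuuid/[UUID STRING]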

Rich

On 09/11/17 16:26, Caspar Smit wrote:
> Rudi,
> 
> You can set the size of block.db and block.wal partitions in the ceph.conf 
> configuration file using:
> 
> bluestore_block_db_size = 16106127360 (which is 15GB, just calculate the 
> correct number for your needs)
> bluestore_block_wal_size = 16106127360
> 
> Kind regards,
> Caspar
> 
> 
> 2017-11-09 17:19 GMT+01:00 Rudi Ahlers  >:
> 
> Hi Alwin, 
> 
> Thanx for the help. 
> 
> I see now that I used the wrong wording in my email. I want to resize the 
> journal, not upgrade.
> 
> So, following your commands, I still sit with a 1GB journal:
> 
> 
> 
> root@virt1:~# ceph-disk prepare --bluestore \
> > --block.db /dev/sde --block.wal /dev/sde1 /dev/sda
> Setting name!
> partNum is 0
> REALLY setting name!
> The operation has completed successfully.
> prepare_device: OSD will not be hot-swappable if block.db is not the same 
> device as the osd data
> Setting name!
> partNum is 1
> REALLY setting name!
> The operation has completed successfully.
> The operation has completed successfully.
> prepare_device: OSD will not be hot-swappable if block.wal is not the 
> same device as the osd data
> prepare_device: Block.wal /dev/sde1 was not prepared with ceph-disk. 
> Symlinking directly.
> Setting name!
> partNum is 1
> REALLY setting name!
> The operation has completed successfully.
> The operation has completed successfully.
> meta-data=/dev/sda1              isize=2048   agcount=4, agsize=6400 blks
>          =                       sectsz=4096  attr=2, projid32bit=1
>          =                       crc=1        finobt=1, sparse=0, 
> rmapbt=0, reflink=0
> data     =                       bsize=4096   blocks=25600, imaxpct=25
>          =                       sunit=0      swidth=0 blks
> naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
> log      =internal log           bsize=4096   blocks=1608, version=2
>          =                       sectsz=4096  sunit=1 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
> Warning: The kernel is still using the old partition table.
> The new table will be used at the next reboot or after you
> run partprobe(8) or kpartx(8)
> The operation has completed successfully.
> 
> root@virt1:~# partprobe
> 
> 
> root@virt1:~# fdisk -l | grep sde
> Disk /dev/sde: 372.6 GiB, 400088457216 bytes, 781422768 sectors
> /dev/sde1       2048 195311615 195309568 93.1G Linux filesystem
> /dev/sde2  195311616 197408767   2097152    1G unknown
> 
> 
> 
> On Thu, Nov 9, 2017 at 6:02 PM, Alwin Antreich  > wrote:
> 
> Hi Rudi,
> On Thu, Nov 09, 2017 at 04:09:04PM +0200, Rudi Ahlers wrote:
> > Hi,
> >
> > Can someone please tell me what the correct procedure is to upgrade 
> a CEPH
> > journal?
> >
> > I'm running ceph: 12.2.1 on Proxmox 5.1, which runs on Debian 9.1
> >
> > For a journal I have a 400GB Intel SSD drive and it seems CEPH 
> created a
> > 1GB journal:
> >
> > Disk /dev/sdf: 372.6 GiB, 400088457216 bytes, 781422768 sectors
> > /dev/sdf1     2048 2099199 2097152   1G unknown
> > /dev/sdf2  2099200 4196351 2097152   1G unknown
> >
> > root@virt2:~# fdisk -l | grep sde
> > Disk /dev/sde: 372.6 GiB, 400088457216 bytes, 781422768 sectors
> > /dev/sde1   2048 2099199 2097152   1G unknown
> >
> >
> > /dev/sda :
> >  /dev/sda1 ceph data, active, cluster ceph, osd.3, block /dev/sda2,
> > block.db /dev/sde1
> >  /dev/sda2 ceph block, for /dev/sda1
> > /dev/sdb :
> >  /dev/sdb1 ceph data, act

Re: [ceph-users] how to upgrade CEPH journal?

2017-11-09 Thread Rudi Ahlers
Hi Caspar,

Is this in the [global] or [osd] section of ceph.conf?

I am new to ceph so this is all still very vague to me.
What is the difference betwen the WAL and the DB?


And, lastly, if I want to setup the OSD in Proxmox beforehand and add the
journal to it, can I make these changes afterward?

And, how do I partition the SSD drive then?

On Thu, Nov 9, 2017 at 6:26 PM, Caspar Smit  wrote:

> Rudi,
>
> You can set the size of block.db and block.wal partitions in the ceph.conf
> configuration file using:
>
> bluestore_block_db_size = 16106127360 (which is 15GB, just calculate the
> correct number for your needs)
> bluestore_block_wal_size = 16106127360
>
> Kind regards,
> Caspar
>
>
> 2017-11-09 17:19 GMT+01:00 Rudi Ahlers :
>
>> Hi Alwin,
>>
>> Thanx for the help.
>>
>> I see now that I used the wrong wording in my email. I want to resize the
>> journal, not upgrade.
>>
>> So, following your commands, I still sit with a 1GB journal:
>>
>>
>>
>> root@virt1:~# ceph-disk prepare --bluestore \
>> > --block.db /dev/sde --block.wal /dev/sde1 /dev/sda
>> Setting name!
>> partNum is 0
>> REALLY setting name!
>> The operation has completed successfully.
>> prepare_device: OSD will not be hot-swappable if block.db is not the same
>> device as the osd data
>> Setting name!
>> partNum is 1
>> REALLY setting name!
>> The operation has completed successfully.
>> The operation has completed successfully.
>> prepare_device: OSD will not be hot-swappable if block.wal is not the
>> same device as the osd data
>> prepare_device: Block.wal /dev/sde1 was not prepared with ceph-disk.
>> Symlinking directly.
>> Setting name!
>> partNum is 1
>> REALLY setting name!
>> The operation has completed successfully.
>> The operation has completed successfully.
>> meta-data=/dev/sda1  isize=2048   agcount=4, agsize=6400 blks
>>  =   sectsz=4096  attr=2, projid32bit=1
>>  =   crc=1finobt=1, sparse=0,
>> rmapbt=0, reflink=0
>> data =   bsize=4096   blocks=25600, imaxpct=25
>>  =   sunit=0  swidth=0 blks
>> naming   =version 2  bsize=4096   ascii-ci=0 ftype=1
>> log  =internal log   bsize=4096   blocks=1608, version=2
>>  =   sectsz=4096  sunit=1 blks, lazy-count=1
>> realtime =none   extsz=4096   blocks=0, rtextents=0
>> Warning: The kernel is still using the old partition table.
>> The new table will be used at the next reboot or after you
>> run partprobe(8) or kpartx(8)
>> The operation has completed successfully.
>>
>> root@virt1:~# partprobe
>>
>>
>> root@virt1:~# fdisk -l | grep sde
>> Disk /dev/sde: 372.6 GiB, 400088457216 bytes, 781422768 sectors
>> /dev/sde1   2048 195311615 195309568 93.1G Linux filesystem
>> /dev/sde2  195311616 197408767   2097152    1G unknown
>>
>>
>>
>> On Thu, Nov 9, 2017 at 6:02 PM, Alwin Antreich 
>> wrote:
>>
>>> Hi Rudi,
>>> On Thu, Nov 09, 2017 at 04:09:04PM +0200, Rudi Ahlers wrote:
>>> > Hi,
>>> >
>>> > Can someone please tell me what the correct procedure is to upgrade a
>>> CEPH
>>> > journal?
>>> >
>>> > I'm running ceph: 12.2.1 on Proxmox 5.1, which runs on Debian 9.1
>>> >
>>> > For a journal I have a 400GB Intel SSD drive and it seems CEPH created
>>> a
>>> > 1GB journal:
>>> >
>>> > Disk /dev/sdf: 372.6 GiB, 400088457216 bytes, 781422768 sectors
>>> > /dev/sdf1 2048 2099199 2097152   1G unknown
>>> > /dev/sdf2  2099200 4196351 2097152   1G unknown
>>> >
>>> > root@virt2:~# fdisk -l | grep sde
>>> > Disk /dev/sde: 372.6 GiB, 400088457216 bytes, 781422768 sectors
>>> > /dev/sde1   2048 2099199 2097152   1G unknown
>>> >
>>> >
>>> > /dev/sda :
>>> >  /dev/sda1 ceph data, active, cluster ceph, osd.3, block /dev/sda2,
>>> > block.db /dev/sde1
>>> >  /dev/sda2 ceph block, for /dev/sda1
>>> > /dev/sdb :
>>> >  /dev/sdb1 ceph data, active, cluster ceph, osd.4, block /dev/sdb2,
>>> > block.db /dev/sdf1
>>> >  /dev/sdb2 ceph block, for /dev/sdb1
>>> > /dev/sdc :
>>> >  /dev/sdc1 ceph data, active, cluster ceph, osd.5, block /dev/sdc2,
>>> > block.db /dev/sdf2
>>> >  /dev/sdc2 ceph block, for /dev/sdc1
>>> > /dev/sdd :
>>> >  /dev/sdd1 other, xfs, mounted on /data/brick1
>>> >  /dev/sdd2 other, xfs, mounted on /data/brick2
>>> > /dev/sde :
>>> >  /dev/sde1 ceph block.db, for /dev/sda1
>>> > /dev/sdf :
>>> >  /dev/sdf1 ceph block.db, for /dev/sdb1
>>> >  /dev/sdf2 ceph block.db, for /dev/sdc1
>>> > /dev/sdg :
>>> >
>>> >
>>> > resizing the partition through fdisk didn't work. What is the correct
>>> > procedure, please?
>>> >
>>> > Kind Regards
>>> > Rudi Ahlers
>>> > Website: http://www.rudiahlers.co.za
>>>
>>> > ___
>>> > ceph-users mailing list
>>> > ceph-users@lists.ceph.com
>>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>> For Bluestore OSDs you need to set bluestore_block_size to get a bigger
>>> partition for the DB and bluestore_

Re: [ceph-users] Fwd: What's the fastest way to try out object classes?

2017-11-09 Thread Zheyuan Chen
I installed rados-objclass-dev and objclass.h was installed successfully.
However, I failed to run the objclass following the steps as below:

1. copy https://github.com/ceph/ceph/blob/master/src/cls/sdk/cls_sdk.cc
to my machine (as cls_test.cpp).
2. make some changes to cls_test.cpp: 1) rename all "sdk" to "test"; 2)
wrap the whole code in "namespace ceph {..}".
3. compile it with g++: g++ -std=c++11 -fPIC cls_test.cpp --shared -o
libcls_test.so
4. copy libcls_test.so to all osds:/usr/lib/rados-classes
5. add two lines in ceph.conf: "osd class load list = *" and "osd class
default list = *" and copy to all nodes.
6. restart all nodes in the cluster
7. call the objclass from python code
~~~
ioctx.execute('oid', 'test', 'test_coverage_write', "test")
~~~
I got this error:
~~~
...
File "rados.pyx", line 498, in rados.requires.wrapper.validate_func
(/build/ceph-12.2.1/obj-x86_64-linux-gnu/src/pybind/rados/pyrex/rados.c:4922)
File "rados.pyx", line 2751, in rados.Ioctx.execute
(/build/ceph-12.2.1/obj-x86_64-linux-gnu/src/pybind/rados/pyrex/rados.c:35467)
rados.OSError: [errno 95] Ioctx.read(test): failed to read oid
~~~
8. calling sdk gave me no error
~~~
ioctx.execute('oid', 'sdk', 'test_coverage_write', "test")
~~~

Did I do anything wrong here? I hope anyone can help me with this.

Thank you very much,
Zheyuan

On Mon, Oct 30, 2017 at 4:20 PM, Neha Ojha  wrote:

> Should be rados-objclass-dev or rados-objclass-devel. Try and let me
> know how it goes. Honestly, I've always done it from source :)
>
> On Mon, Oct 30, 2017 at 4:12 PM, Zheyuan Chen  wrote:
> > Do you know which package should I install?
> >
> > On Mon, Oct 30, 2017 at 3:54 PM, Neha Ojha  wrote:
> >>
> >> I am not sure about a docker image, but you should be able to install
> >> it through packages.
> >>
> >> On Mon, Oct 30, 2017 at 3:20 PM, Zheyuan Chen 
> wrote:
> >> > Hi Neha,
> >> >
> >> > Thanks for answering.
> >> > Building from source just takes too much time. So I was wondering if
> >> > there's
> >> > any docker image or prebuilt package already containing objclass.h
> >> > If that's the only way, I have to go ahead with it.
> >> >
> >> > On Mon, Oct 30, 2017 at 3:05 PM, Neha Ojha  wrote:
> >> >>
> >> >> Hi Zheyuan,
> >> >>
> >> >> You can build Ceph from source and run make install. This should
> place
> >> >> objclass.h in /include/rados/ .
> >> >>
> >> >> Thanks,
> >> >> Neha
> >> >>
> >> >> On Mon, Oct 30, 2017 at 2:18 PM, Zheyuan Chen 
> >> >> wrote:
> >> >> >
> >> >> > -- Forwarded message --
> >> >> > From: Zheyuan Chen 
> >> >> > Date: Mon, Oct 30, 2017 at 2:16 PM
> >> >> > Subject: What's the fastest way to try out object classes?
> >> >> > To: ceph-users@lists.ceph.com
> >> >> >
> >> >> >
> >> >> > Hi All,
> >> >> >
> >> >> > I'd like to try out object classes.
> >> >> > http://docs.ceph.com/docs/master/rados/api/objclass-sdk/
> >> >> > I used this docker image: https://hub.docker.com/r/ceph/demo/, but
> >> >> > found
> >> >> > the
> >> >> > object class sdk is not included (couldn't find
> >> >> > /usr/local/include/rados/objectclass.h) even after I installed
> >> >> > librados-devel manually.
> >> >> >
> >> >> > Do I have to build from the source code if I want to have
> >> >> > objectclass.h?
> >> >> > What is the fastest way to set up the environment if I want to try
> >> >> > out
> >> >> > object classes?
> >> >> >
> >> >> > Thank you very much!
> >> >> > Zheyuan
> >> >> >
> >> >> >
> >> >> > ___
> >> >> > ceph-users mailing list
> >> >> > ceph-users@lists.ceph.com
> >> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >> >> >
> >> >
> >> >
> >
> >
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fwd: What's the fastest way to try out object classes?

2017-11-09 Thread Yehuda Sadeh-Weinraub
On Thu, Nov 9, 2017 at 10:05 AM, Zheyuan Chen  wrote:
> I installed rados-objclass-dev and objclass.h was installed successfully.
> However, I failed to run the objclass following the steps as below:
>
> 1. copy https://github.com/ceph/ceph/blob/master/src/cls/sdk/cls_sdk.cc into
> my machine. (cls_test.cpp)
> 2. make some changes on cls_test.cpp: 1) rename all "sdk" into "test". 2)
> add "namespace ceph {..}" wrapping the whole code.
> 3. compile it using the g++: g++ -std=c++11 -fPIC cls_test.cpp --shared -o
> libcls_test.so
> 4. copy libcls_test.so to all osds:/usr/lib/rados-classes
> 5. add two lines in ceph.conf: "osd class load list = *" and "osd class
> default list = *" and copy to all nodes.
> 6. restart all nodes in the cluster
> 7. call the objclass from python code
> ~~~
> ioctx.execute('oid', 'test', 'test_coverage_write', "test")
> ~~~
> I got this error:
> ~~~
> ...
> File "rados.pyx", line 498, in rados.requires.wrapper.validate_func
> (/build/ceph-12.2.1/obj-x86_64-linux-gnu/src/pybind/rados/pyrex/rados.c:4922)
> File "rados.pyx", line 2751, in rados.Ioctx.execute
> (/build/ceph-12.2.1/obj-x86_64-linux-gnu/src/pybind/rados/pyrex/rados.c:35467)
> rados.OSError: [errno 95] Ioctx.read(test): failed to read oid


errno 95: Not supported. Either the test object class failed to load,
or it couldn't find the test_coverage_write method. Try looking at
your osd logs (set 'debug objclass = 20' in your ceph.conf).
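
For example (log path assumes the defaults; injectargs just avoids a restart):

# in ceph.conf on the OSD nodes:
[osd]
debug objclass = 20

# or at runtime:
ceph tell osd.* injectargs '--debug-objclass 20/20'

# then check whether the class was picked up, e.g.:
grep 'loading cls' /var/log/ceph/ceph-osd.*.log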

Yehuda

> ~~~
> 8. calling sdk gave me no error
> ~~~
> ioctx.execute('oid', 'sdk', 'test_coverage_write', "test")
> ~~~
>
> Did I do anything wrong here? I hope anyone can help me with this.
>
> Thank you very much,
> Zheyuan
>
> On Mon, Oct 30, 2017 at 4:20 PM, Neha Ojha  wrote:
>>
>> Should be rados-objclass-dev or rados-objclass-devel. Try and let me
>> know how it goes. Honestly, I've always done it from source :)
>>
>> On Mon, Oct 30, 2017 at 4:12 PM, Zheyuan Chen  wrote:
>> > Do you know which package should I install?
>> >
>> > On Mon, Oct 30, 2017 at 3:54 PM, Neha Ojha  wrote:
>> >>
>> >> I am not sure about a docker image, but you should be able to install
>> >> it through packages.
>> >>
>> >> On Mon, Oct 30, 2017 at 3:20 PM, Zheyuan Chen 
>> >> wrote:
>> >> > Hi Neha,
>> >> >
>> >> > Thanks for answering.
>> >> > Building from source just takes too much time. So I was wondering if
>> >> > there's
>> >> > any docker image or prebuilt package already containing objclass.h
>> >> > If that's the only way, I have to go ahead with it.
>>
>> >> >
>> >> > On Mon, Oct 30, 2017 at 3:05 PM, Neha Ojha  wrote:
>> >> >>
>> >> >> Hi Zheyuan,
>> >> >>
>> >> >> You can build Ceph from source and run make install. This should
>> >> >> place
>> >> >> objclass.h in /include/rados/ .
>> >> >>
>> >> >> Thanks,
>> >> >> Neha
>> >> >>
>> >> >> On Mon, Oct 30, 2017 at 2:18 PM, Zheyuan Chen 
>> >> >> wrote:
>> >> >> >
>> >> >> > -- Forwarded message --
>> >> >> > From: Zheyuan Chen 
>> >> >> > Date: Mon, Oct 30, 2017 at 2:16 PM
>> >> >> > Subject: What's the fastest way to try out object classes?
>> >> >> > To: ceph-users@lists.ceph.com
>> >> >> >
>> >> >> >
>> >> >> > Hi All,
>> >> >> >
>> >> >> > I'd like to try out object classes.
>> >> >> > http://docs.ceph.com/docs/master/rados/api/objclass-sdk/
>> >> >> > I used this docker image: https://hub.docker.com/r/ceph/demo/, but
>> >> >> > found
>> >> >> > the
>> >> >> > object class sdk is not included (couldn't find
>> >> >> > /usr/local/include/rados/objectclass.h) even after I installed
>> >> >> > librados-devel manually.
>> >> >> >
>> >> >> > Do I have to build from the source code if I want to have
>> >> >> > objectclass.h?
>> >> >> > What is the fastest way to set up the environment if I want to try
>> >> >> > out
>> >> >> > object classes?
>> >> >> >
>> >> >> > Thank you very much!
>> >> >> > Zheyuan
>> >> >> >
>> >> >> >
>> >> >> > ___
>> >> >> > ceph-users mailing list
>> >> >> > ceph-users@lists.ceph.com
>> >> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >> >> >
>> >> >
>> >> >
>> >
>> >
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fwd: What's the fastest way to try out object classes?

2017-11-09 Thread Zheyuan Chen
I changed this line to CLS_LOG(0, "loading cls_test");
https://github.com/ceph/ceph/blob/master/src/cls/sdk/cls_sdk.cc#L120

I don't think the test object class is loaded correctly, since I don't see
the loading message in the log.

However I can see "loading cls_sdk" in the osd log.

On Thu, Nov 9, 2017 at 10:19 AM, Yehuda Sadeh-Weinraub 
wrote:

> On Thu, Nov 9, 2017 at 10:05 AM, Zheyuan Chen  wrote:
> > I installed rados-objclass-dev and objclass.h was installed successfully.
> > However, I failed to run the objclass following the steps as below:
> >
> > 1. copy https://github.com/ceph/ceph/blob/master/src/cls/sdk/cls_sdk.cc
> into
> > my machine. (cls_test.cpp)
> > 2. make some changes on cls_test.cpp: 1) rename all "sdk" into "test". 2)
> > add "namespace ceph {..}" wrapping the whole code.
> > 3. compile it using the g++: g++ -std=c++11 -fPIC cls_test.cpp --shared
> -o
> > libcls_test.so
> > 4. copy libcls_test.so to all osds:/usr/lib/rados-classes
> > 5. add two lines in ceph.conf: "osd class load list = *" and "osd class
> > default list = *" and copy to all nodes.
> > 6. restart all nodes in the cluster
> > 7. call the objclass from python code
> > ~~~
> > ioctx.execute('oid', 'test', 'test_coverage_write', "test")
> > ~~~
> > I got this error:
> > ~~~
> > ...
> > File "rados.pyx", line 498, in rados.requires.wrapper.validate_func
> > (/build/ceph-12.2.1/obj-x86_64-linux-gnu/src/pybind/rados/
> pyrex/rados.c:4922)
> > File "rados.pyx", line 2751, in rados.Ioctx.execute
> > (/build/ceph-12.2.1/obj-x86_64-linux-gnu/src/pybind/rados/
> pyrex/rados.c:35467)
> > rados.OSError: [errno 95] Ioctx.read(test): failed to read oid
>
>
> errno 95: Not supported. Either the test object class failed to load,
> or it couldn't find the test_coverage_write method. Try looking at
> your osd logs (set 'debug objclass = 20' in your ceph.conf).
>
> Yehuda
>
> > ~~~
> > 8. calling sdk gave me no error
> > ~~~
> > ioctx.execute('oid', 'sdk', 'test_coverage_write', "test")
> > ~~~
> >
> > Did I do anything wrong here? I hope anyone can help me with this.
> >
> > Thank you very much,
> > Zheyuan
> >
> > On Mon, Oct 30, 2017 at 4:20 PM, Neha Ojha  wrote:
> >>
> >> Should be rados-objclass-dev or rados-objclass-devel. Try and let me
> >> know how it goes. Honestly, I've always done it from source :)
> >>
> >> On Mon, Oct 30, 2017 at 4:12 PM, Zheyuan Chen 
> wrote:
> >> > Do you know which package should I install?
> >> >
> >> > On Mon, Oct 30, 2017 at 3:54 PM, Neha Ojha  wrote:
> >> >>
> >> >> I am not sure about a docker image, but you should be able to install
> >> >> it through packages.
> >> >>
> >> >> On Mon, Oct 30, 2017 at 3:20 PM, Zheyuan Chen 
> >> >> wrote:
> >> >> > Hi Neha,
> >> >> >
> >> >> > Thanks for answering.
> >> >> > Building from source just takes too much time. So I was wondering
> if
> >> >> > there's
> >> >> > any docker image or prebuilt package already containing objclass.h
> >> >> > If that's the only way, I have to go ahead with it.
> >>
> >> >> >
> >> >> > On Mon, Oct 30, 2017 at 3:05 PM, Neha Ojha 
> wrote:
> >> >> >>
> >> >> >> Hi Zheyuan,
> >> >> >>
> >> >> >> You can build Ceph from source and run make install. This should
> >> >> >> place
> >> >> >> objclass.h in /include/rados/ .
> >> >> >>
> >> >> >> Thanks,
> >> >> >> Neha
> >> >> >>
> >> >> >> On Mon, Oct 30, 2017 at 2:18 PM, Zheyuan Chen 
> >> >> >> wrote:
> >> >> >> >
> >> >> >> > -- Forwarded message --
> >> >> >> > From: Zheyuan Chen 
> >> >> >> > Date: Mon, Oct 30, 2017 at 2:16 PM
> >> >> >> > Subject: What's the fastest way to try out object classes?
> >> >> >> > To: ceph-users@lists.ceph.com
> >> >> >> >
> >> >> >> >
> >> >> >> > Hi All,
> >> >> >> >
> >> >> >> > I'd like to try out object classes.
> >> >> >> > http://docs.ceph.com/docs/master/rados/api/objclass-sdk/
> >> >> >> > I used this docker image: https://hub.docker.com/r/ceph/demo/,
> but
> >> >> >> > found
> >> >> >> > the
> >> >> >> > object class sdk is not included (couldn't find
> >> >> >> > /usr/local/include/rados/objectclass.h) even after I installed
> >> >> >> > librados-devel manually.
> >> >> >> >
> >> >> >> > Do I have to build from the source code if I want to have
> >> >> >> > objectclass.h?
> >> >> >> > What is the fastest way to set up the environment if I want to
> try
> >> >> >> > out
> >> >> >> > object classes?
> >> >> >> >
> >> >> >> > Thank you very much!
> >> >> >> > Zheyuan
> >> >> >> >
> >> >> >> >
> >> >> >> > ___
> >> >> >> > ceph-users mailing list
> >> >> >> > ceph-users@lists.ceph.com
> >> >> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >> >> >> >
> >> >> >
> >> >> >
> >> >
> >> >
> >
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
___
ceph-users mailing list
ceph-users@lists.ceph.com

[ceph-users] Pool shard/stripe settings for file too large files?

2017-11-09 Thread Marc Roos
 
I would like store objects with

rados -p ec32 put test2G.img test2G.img

error putting ec32/test2G.img: (27) File too large

Changing the pool application from custom to rgw did not help









___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Pool shard/stripe settings for file too large files?

2017-11-09 Thread Kevin Hrpcek

Marc,

If you're running luminous you may need to increase osd_max_object_size. 
This snippet is from the Luminous change log.


"The default maximum size for a single RADOS object has been reduced 
from 100GB to 128MB. The 100GB limit was completely impractical in 
practice while the 128MB limit is a bit high but not unreasonable. If 
you have an application written directly to librados that is using 
objects larger than 128MB you may need to adjust osd_max_object_size"
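
A sketch of checking and raising it (osd.0, the 4 GiB value and the runtime
injection are only examples; persist the setting in ceph.conf as well):

# on the node hosting osd.0, read the current limit:
ceph daemon osd.0 config get osd_max_object_size

# raise it at runtime, e.g. to 4 GiB:
ceph tell osd.* injectargs '--osd_max_object_size 4294967296'

# and keep it across restarts, e.g. in the [osd] section of ceph.conf:
# osd max object size = 4294967296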


Kevin

On 11/09/2017 02:01 PM, Marc Roos wrote:
  
I would like store objects with


rados -p ec32 put test2G.img test2G.img

error putting ec32/test2G.img: (27) File too large

Changing the pool application from custom to rgw did not help









___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Pool shard/stripe settings for file too large files?

2017-11-09 Thread Christian Wuerdig
It should be noted that the general advice is not to use such large
objects, since cluster performance will suffer; see also this thread:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-September/021051.html

libradosstriper might be an option which will automatically break the
object into smaller chunks
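
For what it's worth, the rados CLI exposes libradosstriper through its
--striper flag, so something like the following should split the upload into
smaller striped objects (pool and file names as in the original example;
whether every striper operation is happy on an EC pool is worth verifying):

rados --striper -p ec32 put test2G.img test2G.img
rados --striper -p ec32 stat test2G.img
# reading it back also needs --striper:
rados --striper -p ec32 get test2G.img /tmp/test2G.img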

On Fri, Nov 10, 2017 at 9:08 AM, Kevin Hrpcek
 wrote:
> Marc,
>
> If you're running luminous you may need to increase osd_max_object_size.
> This snippet is from the Luminous change log.
>
> "The default maximum size for a single RADOS object has been reduced from
> 100GB to 128MB. The 100GB limit was completely impractical in practice while
> the 128MB limit is a bit high but not unreasonable. If you have an
> application written directly to librados that is using objects larger than
> 128MB you may need to adjust osd_max_object_size"
>
> Kevin
>
> On 11/09/2017 02:01 PM, Marc Roos wrote:
>
>
> I would like store objects with
>
> rados -p ec32 put test2G.img test2G.img
>
> error putting ec32/test2G.img: (27) File too large
>
> Changing the pool application from custom to rgw did not help
>
>
>
>
>
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Performance, and how much wiggle room there is with tunables

2017-11-09 Thread Robert Stanford
 In my cluster, rados bench shows about 1GB/s bandwidth.  I've done some
tuning:

[osd]
osd op threads = 8
osd disk threads = 4
osd recovery max active = 7


I was hoping to get much better bandwidth.  My network can handle it, and
my disks are pretty fast as well.  Are there any major tunables I can play
with to increase what will be reported by "rados bench"?  Am I pretty much
stuck around the bandwidth it reported?
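
For reference, a typical invocation (pool name, thread count and object size
are just examples; more parallelism usually pushes the reported number up):

rados bench -p bench_test 60 write -t 32 -b 4194304 --no-cleanup
rados bench -p bench_test 60 seq -t 32
rados -p bench_test cleanup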

 Thank you
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Needed help to setup a 3-way replication between 2 datacenters

2017-11-09 Thread Sébastien VIGNERON
Hi everyone,

As a beginner with Ceph, I'm looking for a way to do 3-way replication between 2 
datacenters, as mentioned in the ceph docs (but not described).

My goal is to keep access to the data (at least read-only access) even when the 
link between the 2 datacenters is cut, and to make sure at least one copy of the 
data exists in each datacenter.

I’m not sure how to implement such 3-way replication. With a rule?
Based on the Ceph docs, I am thinking of a rule like this:
rule 3-way-replication_with_2_DC {
ruleset 1
type replicated
min_size 2
max_size 3
step take DC-1
step choose firstn 1 type host
step chooseleaf firstn 1 type osd
step emit
step take DC-2
step choose firstn 1 type host
step chooseleaf firstn 1 type osd
step emit
step take default
step choose firstn 1 type host
step chooseleaf firstn 1 type osd
step emit
}
but what should happen if the link between the 2 datacenters is cut? If someone 
has a better solution, I am interested in any resources about it (examples, …).

The default rule (see below) keeps the pool working when we mark each node of 
DC-2 as down (typically for maintenance), but if we shut down the link between the 2 
datacenters, the pool/rbd hangs (a dd write freezes, for example).
Does anyone have some insight on how to set up 3-way replication between 2 
datacenters?
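
One variant I have seen suggested (untested here; the rule below is only a
sketch and the ruleset number is arbitrary) is to let CRUSH pick both
datacenters and two hosts in each, so that with size=3 the first three OSDs
give two copies in one DC and one in the other:

rule replicated_2dc {
ruleset 2
type replicated
min_size 2
max_size 4
step take default
step choose firstn 2 type datacenter
step chooseleaf firstn 2 type host
step emit
}

Which DC ends up holding two copies varies per PG, and this alone does not
answer the link-cut question, so I am still interested in feedback.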

Thanks in advance for any advice on the topic.

Current situation:

Mons : host-1, host-2, host-4

Quick network topology:

USERS NETWORK
 |
   2x10G
 |
  DC-1-SWITCH <——— 40G ——> DC-2-SWITCH
| | |   | | |
host-1 _| | |   host-4 _| | |
host-2 ___| |   host-5 ___| |
host-3 _|   host-6 _|



crushmap :
# ceph osd tree
ID  CLASS WEIGHTTYPE NAME   STATUS REWEIGHT PRI-AFF
 -1   147.33325 root default
-2073.3 datacenter DC-1
-1573.3 rack DC-1-RACK-1
 -924.4 host host-1
 27   hdd   2.72839 osd.27  up  1.0 1.0
 28   hdd   2.72839 osd.28  up  1.0 1.0
 29   hdd   2.72839 osd.29  up  1.0 1.0
 30   hdd   2.72839 osd.30  up  1.0 1.0
 31   hdd   2.72839 osd.31  up  1.0 1.0
 32   hdd   2.72839 osd.32  up  1.0 1.0
 33   hdd   2.72839 osd.33  up  1.0 1.0
 34   hdd   2.72839 osd.34  up  1.0 1.0
 36   hdd   2.72839 osd.36  up  1.0 1.0
-1124.4 host host-2
 35   hdd   2.72839 osd.35  up  1.0 1.0
 37   hdd   2.72839 osd.37  up  1.0 1.0
 38   hdd   2.72839 osd.38  up  1.0 1.0
 39   hdd   2.72839 osd.39  up  1.0 1.0
 40   hdd   2.72839 osd.40  up  1.0 1.0
 41   hdd   2.72839 osd.41  up  1.0 1.0
 42   hdd   2.72839 osd.42  up  1.0 1.0
 43   hdd   2.72839 osd.43  up  1.0 1.0
 46   hdd   2.72839 osd.46  up  1.0 1.0
-1324.4 host host-3
 44   hdd   2.72839 osd.44  up  1.0 1.0
 45   hdd   2.72839 osd.45  up  1.0 1.0
 47   hdd   2.72839 osd.47  up  1.0 1.0
 48   hdd   2.72839 osd.48  up  1.0 1.0
 49   hdd   2.72839 osd.49  up  1.0 1.0
 50   hdd   2.72839 osd.50  up  1.0 1.0
 51   hdd   2.72839 osd.51  up  1.0 1.0
 52   hdd   2.72839 osd.52  up  1.0 1.0
 53   hdd   2.72839 osd.53  up  1.0 1.0
-1973.3 datacenter DC-2
-1673.3 rack DC-2-RACK-1
 -324.4 host host-4
  0   hdd   2.72839 osd.0   up  1.0 1.0
  1   hdd   2.72839 osd.1   up  1.0 1.0
  2   hdd   2.72839 osd.2   up  1.0 1.0
  3   hdd   2.72839 osd.3   up  1.0 1.0
  4   hdd   2.72839 osd.4   up  1.0 1.0
  5   hdd   2.72839 osd.5   up  1.0 1.0
  6   hdd   2.72839 osd.6   up  1.0 1.0
  7   hdd   2.72839 osd.7   up  1.0 1.0
  8   hdd   2.72839 osd.8   up  1.0 1.0
 -524.4 host host-5
  9   hdd   2.72839 osd.9   up  1.0 1.0
 10   hdd   2.72839 osd.1

[ceph-users] Undersized fix for small cluster, other than adding a 4th node?

2017-11-09 Thread Marc Roos
 
I added an erasure-coded k=3,m=2 pool on a 3-node test cluster and am 
getting these errors. 

   pg 48.0 is stuck undersized for 23867.00, current state 
active+undersized+degraded, last acting [9,13,2147483647,7,2147483647]
pg 48.1 is stuck undersized for 27479.944212, current state 
active+undersized+degraded, last acting [12,1,2147483647,8,2147483647]
pg 48.2 is stuck undersized for 27479.944514, current state 
active+undersized+degraded, last acting [12,1,2147483647,3,2147483647]
pg 48.3 is stuck undersized for 27479.943845, current state 
active+undersized+degraded, last acting [11,0,2147483647,2147483647,5]
pg 48.4 is stuck undersized for 27479.947473, current state 
active+undersized+degraded, last acting [8,4,2147483647,2147483647,5]
pg 48.5 is stuck undersized for 27479.940289, current state 
active+undersized+degraded, last acting [6,5,11,2147483647,2147483647]
pg 48.6 is stuck undersized for 27479.947125, current state 
active+undersized+degraded, last acting [5,8,2147483647,1,2147483647]
pg 48.7 is stuck undersized for 23866.977708, current state 
active+undersized+degraded, last acting [13,11,2147483647,0,2147483647]

It is mentioned here 
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-May/009572.html 
that the problem was resolved by adding an extra node; I already 
changed the min_size to 3. Or should I change to k=2,m=2, and do I still 
get a good saving on storage then? How do you calculate the storage 
saving of an erasure-coded pool?
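
For the storage part, the raw-space arithmetic is simply k/(k+m) usable:

k=3, m=2 : 3/5 usable = 60%  -> 1.67x raw per byte stored
k=2, m=2 : 2/4 usable = 50%  -> 2.0x raw
3x replication : 1/3 usable = 33%  -> 3.0x raw

Note that k+m is also the number of chunks CRUSH has to place, so with the 
default host failure domain a k=3,m=2 profile wants at least 5 hosts, which 
is presumably why two slots per PG show 2147483647 (no OSD found) on a 
3-node cluster.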




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Pool shard/stripe settings for file too large files?

2017-11-09 Thread Marc Roos
 
Yes, I actually changed it back to the default after reading a bit 
about it (https://github.com/ceph/ceph/pull/15520). I wanted to store 
5GB and 12GB files, but that makes recovery not so nice. I thought there was 
a setting to split them up automatically, like with rbd pools. 



-Original Message-
From: Kevin Hrpcek [mailto:kevin.hrp...@ssec.wisc.edu] 
Sent: donderdag 9 november 2017 21:09
To: Marc Roos
Cc: ceph-users
Subject: Re: [ceph-users] Pool shard/stripe settings for file too large 
files?

Marc,

If you're running luminous you may need to increase osd_max_object_size. 
This snippet is from the Luminous change log.

"The default maximum size for a single RADOS object has been reduced 
from 100GB to 128MB. The 100GB limit was completely impractical in 
practice while the 128MB limit is a bit high but not unreasonable. If 
you have an application written directly to librados that is using 
objects larger than 128MB you may need to adjust osd_max_object_size"

Kevin


On 11/09/2017 02:01 PM, Marc Roos wrote:


 
I would like store objects with

rados -p ec32 put test2G.img test2G.img

error putting ec32/test2G.img: (27) File too large

Changing the pool application from custom to rgw did not help









___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Pool shard/stripe settings for file too large files?

2017-11-09 Thread Marc Roos
 
Do you know of a rados client that uses this? Maybe a simple 'mount' so 
I can cp the files onto it?






-Original Message-
From: Christian Wuerdig [mailto:christian.wuer...@gmail.com] 
Sent: donderdag 9 november 2017 22:01
To: Kevin Hrpcek
Cc: Marc Roos; ceph-users
Subject: Re: [ceph-users] Pool shard/stripe settings for file too large 
files?

It should be noted that the general advice is not to use such large 
objects, since cluster performance will suffer; see also this thread:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-September/021051.html

libradosstriper might be an option which will automatically break the 
object into smaller chunks

On Fri, Nov 10, 2017 at 9:08 AM, Kevin Hrpcek 
 wrote:
> Marc,
>
> If you're running luminous you may need to increase 
osd_max_object_size.
> This snippet is from the Luminous change log.
>
> "The default maximum size for a single RADOS object has been reduced 
> from 100GB to 128MB. The 100GB limit was completely impractical in 
> practice while the 128MB limit is a bit high but not unreasonable. If 
> you have an application written directly to librados that is using 
> objects larger than 128MB you may need to adjust osd_max_object_size"
>
> Kevin
>
> On 11/09/2017 02:01 PM, Marc Roos wrote:
>
>
> I would like store objects with
>
> rados -p ec32 put test2G.img test2G.img
>
> error putting ec32/test2G.img: (27) File too large
>
> Changing the pool application from custom to rgw did not help
>
>
>
>
>
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com