Re: [ceph-users] Adding new disk/OSD to ceph cluster

2016-04-11 Thread Eneko Lacunza

Hi Mad,

El 09/04/16 a las 14:39, Mad Th escribió:

We have a 3-node proxmox/ceph cluster ... each with 4 x 4 TB disks


Are you using 3-way replication? I guess you are. :)
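(You can confirm with "ceph osd pool get <poolname> size"; a size of 3 means
3-way replication.)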
1) If we want to add more disks, what are the things that we need to
be careful about?



Will the following steps automatically add it to ceph.conf?
ceph-disk zap /dev/sd[X]
pveceph createosd /dev/sd[X] -journal_dev /dev/sd[Y]
where X is new disk and Y is the journal disk.

Yes, this is the same as adding it from web GUI.


2) Is it safe to run different numbers of OSDs in the cluster, say one
server with 5 OSDs and the other two servers with 4 OSDs? Though we
plan to add one OSD to each server.


It is safe as long as none of your nodes' OSDs are near-full. If you're
asking this because you're adding a new OSD to each node, step by step:
yes, it is safe.
Be prepared for data moving around when you add new disks (performance
will suffer unless you have tuned some parameters in ceph.conf).
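For illustration, the kind of throttling usually meant here looks something
like this in ceph.conf (values are only an example, tune them for your
hardware):

[osd]
    osd max backfills = 1
    osd recovery max active = 1
    osd recovery op priority = 1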


3) How do we safely add the new OSD to an existing storage pool?
A new OSD will be used automatically by existing Ceph pools unless you
have changed the CRUSH map.
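You can verify with "ceph osd tree" (the new OSD should appear under its host
with a non-zero weight) and watch the rebalancing progress with "ceph -s".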


Cheers
Eneko

--
Zuzendari Teknikoa / Director Técnico
Binovo IT Human Project, S.L.
Telf. 943493611
  943324914
Astigarraga bidea 2, planta 6 dcha., ofi. 3-2; 20180 Oiartzun (Gipuzkoa)
www.binovo.es

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How can I monitor current ceph operation at cluster

2016-04-11 Thread nick
Hi,
> We're parsing the output of 'ceph daemon osd.N perf dump' for the admin
> sockets in /var/run/ceph/ceph-osd.*.asok on each node in our cluster.
> We then push that data into carbon-cache/graphite and using grafana for
> visualization.
which of those values are you using for monitoring? I can see a lot of numbers
when doing a 'ceph daemon osd.N perf dump'. Do you know if there is any
documentation on what each value means? I could only find
http://docs.ceph.com/docs/hammer/dev/perf_counters/, which describes the
schema.

Best Regards
Nick

> Our numbers are much more consistent than yours appear.
> 
> Bob
> 
> On Thu, Apr 7, 2016 at 2:34 AM, David Riedl  wrote:
> > Hi.
> > 
> > I use this for my zabbix environment:
> > 
> > https://github.com/thelan/ceph-zabbix/
> > 
> > It works really well for me.
> > 
> > 
> > Regards
> > 
> > David
> > 
> > On 07.04.2016 11:20, Nick Fisk wrote:
> >   Hi.
> > 
> > I have a small question about monitoring performance of our ceph cluster.
> > 
> > We have a cluster with 5 nodes and 8 drives on each node, and 5 monitors,
> > one on every node. For monitoring the cluster we use zabbix. It asks every
> > node every 30 seconds about the current ceph operations and gets a
> > different result from every node:
> > first node:  350 op/s
> > second node: 900 op/s
> > third node:  200 op/s
> > fourth node: 700 op/s
> > fifth node:  1200 op/s
> > 
> > I don't understand how I can get the total performance value for the ceph
> > cluster?
> > 
> > Easy Answer
> > Capture and parse the output from "ceph -s", not 100% accurate, but
> > probably good enough for a graph
> > 
> > Complex Answer
> > Use something like Graphite to capture all the counters for every OSD and
> > then use something like sumSeries to add all the op/s counters together.
> > 
> > 
> > 
> > 
> > ___
> > ceph-users mailing
> > listceph-us...@lists.ceph.comhttp://lists.ceph.com/listinfo.cgi/ceph-user
> > s-ceph.com
> > 
> > ___
> > ceph-users mailing
> > listceph-us...@lists.ceph.comhttp://lists.ceph.com/listinfo.cgi/ceph-user
> > s-ceph.com
> > 
> > 
> > --
> > Mit freundlichen Grüßen
> > 
> > David Riedl
> > 
> > 
> > 
> > WINGcon GmbH Wireless New Generation - Consulting & Solutions
> > 
> > Phone: +49 (0) 7543 9661 - 26
> > E-Mail: david.ri...@wingcon.com
> > Web: http://www.wingcon.com
> > 
> > Sitz der Gesellschaft: Langenargen
> > Registergericht: ULM, HRB 632019
> > USt-Id.: DE232931635, WEEE-Id.: DE74015979
> > Geschäftsführer: Thomas Ehrle, Fritz R. Paul
> > 
> > 
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
-- 
Sebastian Nickel
Nine Internet Solutions AG, Albisriederstr. 243a, CH-8047 Zuerich
Tel +41 44 637 40 00 | Support +41 44 637 40 40 | www.nine.ch

signature.asc
Description: This is a digitally signed message part.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ubuntu xenial and ceph jewel systemd

2016-04-11 Thread James Page
Hi

On Sun, 10 Apr 2016 at 16:39 hp cre  wrote:

> Hello all,
>
> I was just installing jewel 10.1.0 on ubuntu xenial beta 2.
> I got an error when trying to create a mon about failure to find command
> 'initctl' which is in upstart.
> Tried to install upstart, then got an error 'com.ubuntu...' not found.
>
> Anyway,  i thought that with jewel release it would support systemd on
> ubuntu xenial like it said in the release notes?!
>
I've been working with the systemd support in the Jewel release and it's
working OK for me - how are you trying to create your mon?  I'm wondering
whether it's your deployment process rather than the ceph packages that has
a problem here.

Cheers

James
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] [ceph-mds] mds service can not start after shutdown in 10.1.0

2016-04-11 Thread 施柏安
Hi cephers,

I was testing CephFS's HA, so I shut down the active mds server.
Then one of the standby MDSes became active. Everything seemed to work
properly.
But when I booted the mds server that was shut down in the test, it couldn't
join the cluster automatically.
I used the command 'sudo service ceph-mds start id=0'. It can't start and
just shows 'ceph-mds stop/waiting'.

Is that a bug, or did I do something wrong?

-- 

Best regards,

施柏安 Desmond Shih
技術研發部 Technical Development
 
迎棧科技股份有限公司
│ 886-975-857-982
│ desmond.s@inwinstack 
│ 886-2-7738-2858 #7441
│ 新北市220板橋區遠東路3號5樓C室
Rm.C, 5F., No.3, Yuandong Rd.,
Banqiao Dist., New Taipei City 220, Taiwan (R.O.C)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How can I monitor current ceph operation at cluster

2016-04-11 Thread Nick Fisk
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> nick
> Sent: 11 April 2016 08:26
> To: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] How can I monitor current ceph operation at
> cluster
> 
> Hi,
> > We're parsing the output of 'ceph daemon osd.N perf dump' for the
> > admin sockets in /var/run/ceph/ceph-osd.*.asok on each node in our
> cluster.
> > We then push that data into carbon-cache/graphite and using grafana
> > for visualization.
> which of those values are you using for monitoring? I can see a lot of
> numbers when doing a 'ceph daemon osd.N perf dump'. Do you know if
> there is some documentation what each value means? I could only find:
> http://docs.ceph.com/docs/hammer/dev/perf_counters/ which describes
> the schema.

I'm currently going through them and trying to write a short doc explaining
what each one measures. Are you just interested in the total number of read
and write ops over the whole cluster?
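If so, a minimal sketch (assuming jq is installed; counter names as seen on
Hammer) that pulls the raw op counters from every OSD admin socket on a node:

for sock in /var/run/ceph/ceph-osd.*.asok; do
    # cumulative total, read and write ops serviced by this OSD
    sudo ceph daemon $sock perf dump | jq '.osd.op, .osd.op_r, .osd.op_w'
done

They are cumulative counters, so whatever collects them has to take
per-interval deltas to turn them into op/s.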


> 
> Best Regards
> Nick
> 
> > Our numbers are much more consistent than yours appear.
> >
> > Bob
> >
> > On Thu, Apr 7, 2016 at 2:34 AM, David Riedl 
> wrote:
> > > Hi.
> > >
> > > I use this for my zabbix environment:
> > >
> > > https://github.com/thelan/ceph-zabbix/
> > >
> > > It works really well for me.
> > >
> > >
> > > Regards
> > >
> > > David
> > >
> > > On 07.04.2016 11:20, Nick Fisk wrote:
> > >   Hi.
> > >
> > > I have a small question about monitoring performance of our ceph cluster.
> > >
> > > We have a cluster with 5 nodes and 8 drives on each node, and 5
> > > monitors, one on every node. For monitoring the cluster we use zabbix.
> > > It asks every node every 30 seconds about the current ceph operations
> > > and gets a different result from every node:
> > > first node:  350 op/s
> > > second node: 900 op/s
> > > third node:  200 op/s
> > > fourth node: 700 op/s
> > > fifth node:  1200 op/s
> > >
> > > I don't understand how I can get the total performance value for the
> > > ceph cluster?
> > >
> > > Easy Answer
> > > Capture and parse the output from "ceph -s", not 100% accurate, but
> > > probably good enough for a graph
> > >
> > > Complex Answer
> > > Use something like Graphite to capture all the counters for every
> > > OSD and then use something like sumSeries to add all the op/s counters
> together.
> > >
> > >
> > >
> > >
> > > ___
> > > ceph-users mailing
> > > listceph-us...@lists.ceph.comhttp://lists.ceph.com/listinfo.cgi/ceph
> > > -user
> > > s-ceph.com
> > >
> > > ___
> > > ceph-users mailing
> > > listceph-us...@lists.ceph.comhttp://lists.ceph.com/listinfo.cgi/ceph
> > > -user
> > > s-ceph.com
> > >
> > >
> > > --
> > > Mit freundlichen Grüßen
> > >
> > > David Riedl
> > >
> > >
> > >
> > > WINGcon GmbH Wireless New Generation - Consulting & Solutions
> > >
> > > Phone: +49 (0) 7543 9661 - 26
> > > E-Mail: david.ri...@wingcon.com
> > > Web: http://www.wingcon.com
> > >
> > > Sitz der Gesellschaft: Langenargen
> > > Registergericht: ULM, HRB 632019
> > > USt-Id.: DE232931635, WEEE-Id.: DE74015979
> > > Geschäftsführer: Thomas Ehrle, Fritz R. Paul
> > >
> > >
> > > ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> --
> Sebastian Nickel
> Nine Internet Solutions AG, Albisriederstr. 243a, CH-8047 Zuerich Tel +41
44
> 637 40 00 | Support +41 44 637 40 40 | www.nine.ch

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ubuntu xenial and ceph jewel systemd

2016-04-11 Thread hp cre
Hello James,

It's a default install of the xenial server beta 2 release. Created a user,
then followed the ceph installation quick start exactly as it is.

Ceph-deploy version 1.5.31 was used as follows:

1- ceph-deploy new node1
2- ceph-deploy install --release jewel node1
3- ceph-deploy mon create-initial

Step 3 gave an error in the Python scripts, meaning it could not find the
initctl command. I searched for this command and found out it belongs to
upstart.
On 11 Apr 2016 09:32, "James Page"  wrote:

> Hi
>
> On Sun, 10 Apr 2016 at 16:39 hp cre  wrote:
>
>> Hello all,
>>
>> I was just installing jewel 10.1.0 on ubuntu xenial beta 2.
>> I got an error when trying to create a mon about failure to find command
>> 'initctl' which is in upstart.
>> Tried to install upstart, then got an error 'com.ubuntu...' not found.
>>
>> Anyway,  i thought that with jewel release it would support systemd on
>> ubuntu xenial like it said in the release notes?!
>>
> I've been working with the systemd support in the Jewel release and its
> working OK for me - how are you trying to create your mon?  I'm wondering
> whether its your deployment process rather than the ceph packages that have
> a problem here.
>
> Cheers
>
> James
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ubuntu xenial and ceph jewel systemd

2016-04-11 Thread James Page
On Mon, 11 Apr 2016 at 10:02 hp cre  wrote:

> Hello James,
>
> It's a default install of xenial server beta 2 release. Created a user
> then followed the ceph installation quick start exactly as it is.
>
> Ceph-deploy version 1.5.31 was used as follows
>
> 1- ceph-deploy new node1
> 2- ceph-deploy install --release jewel  node1
> 3- ceph-deploy mon create-initial
>
> Step 3 gave error in Python scripts. Meaning it could not find initctl
> command. Searched for this command and found out our belongs to upstart.
>
I suspect that ceph-deploy is not playing nicely with systemd-based Ubuntu
releases - I'll take a look now...
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ubuntu xenial and ceph jewel systemd

2016-04-11 Thread James Page
It would be handy to get visibility of your deployment log data; I'm not
currently able to reproduce your issue deploying ceph using ceph-deploy on
a small three-node install running xenial - it's correctly detecting systemd
and using systemctl instead of initctl.

On Mon, 11 Apr 2016 at 10:18 James Page  wrote:

> On Mon, 11 Apr 2016 at 10:02 hp cre  wrote:
>
>> Hello James,
>>
>> It's a default install of xenial server beta 2 release. Created a user
>> then followed the ceph installation quick start exactly as it is.
>>
>> Ceph-deploy version 1.5.31 was used as follows
>>
>> 1- ceph-deploy new node1
>> 2- ceph-deploy install --release jewel  node1
>> 3- ceph-deploy mon create-initial
>>
>> Step 3 gave error in Python scripts. Meaning it could not find initctl
>> command. Searched for this command and found out our belongs to upstart.
>>
> I suspect that ceph-deploy is not playing nicely with systemd based Ubuntu
> releases - I'll take a look now..
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [ceph-mds] mds service can not start after shutdown in 10.1.0

2016-04-11 Thread John Spray
Is the ID of the MDS service really "0"?  Usually people set the ID to the
hostname.  Check it in /var/lib/ceph/mds
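For example (hostname purely illustrative): "ls /var/lib/ceph/mds/" would
typically show a directory like ceph-myhost, in which case the service is
started with "sudo service ceph-mds start id=myhost".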

John

On Mon, Apr 11, 2016 at 9:44 AM, 施柏安  wrote:

> Hi cephers,
>
> I was testing CephFS's HA. So I shutdown the active mds server.
> Then the one of standby mds turn to be active. Everything seems work
> properly.
> But I boot the mds server which was shutdown in test. It can't join
> cluster automatically.
> And I use command 'sudo service ceph-mds start id=0'. It can't start and
> just show 'ceph-mds stop/waiting'
>
> Is that the bug or I do wrong operation?
>
> --
>
> Best regards,
>
> 施柏安 Desmond Shih
> 技術研發部 Technical Development
>  
> 迎棧科技股份有限公司
> │ 886-975-857-982
> │ desmond.s@inwinstack 
> │ 886-2-7738-2858 #7441
> │ 新北市220板橋區遠東路3號5樓C室
> Rm.C, 5F., No.3, Yuandong Rd.,
> Banqiao Dist., New Taipei City 220, Taiwan (R.O.C)
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ubuntu xenial and ceph jewel systemd

2016-04-11 Thread hp cre
In the process of reproducing it now. I'll attach a full command log
On 11 Apr 2016 11:42, "James Page"  wrote:

It would be handy to get visibility of your deployment log data; I'm not
currently able to reproduce your issue deploying ceph using ceph-deploy on
a small three node install running xenial - its correctly detecting systemd
and using systemctl instead of initctl.

On Mon, 11 Apr 2016 at 10:18 James Page  wrote:

> On Mon, 11 Apr 2016 at 10:02 hp cre  wrote:
>
>> Hello James,
>>
>> It's a default install of xenial server beta 2 release. Created a user
>> then followed the ceph installation quick start exactly as it is.
>>
>> Ceph-deploy version 1.5.31 was used as follows
>>
>> 1- ceph-deploy new node1
>> 2- ceph-deploy install --release jewel  node1
>> 3- ceph-deploy mon create-initial
>>
>> Step 3 gave error in Python scripts. Meaning it could not find initctl
>> command. Searched for this command and found out our belongs to upstart.
>>
> I suspect that ceph-deploy is not playing nicely with systemd based Ubuntu
> releases - I'll take a look now..
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] kernel cephfs - slow requests

2016-04-11 Thread Dzianis Kahanovich
Dzianis Kahanovich wrote:
> Christian Balzer wrote:
> 
>>> New problem (unsure, but probably not observed in Hammer, but sure in
>>> Infernalis): copying large (tens g) files into kernel cephfs (from
>>> outside of cluster, iron - non-VM, preempt kernel) - make slow requests
>>> on some of OSDs (repeated range) - mostly 3 Gbps channels (slow).
>>>
>>> All OSDs default threads numbers. Scheduler=noop. size=3 min_size=2
>>>
>>> No same problem with fuse.
>>>
>>> Looks like broken or unbalanced congestion mechanism or I don't know how
>>> to moderate it. write_congestion_kb trying low (=1) - nothing
>>> interesting.
>>>
>> I think cause and effect are not quite what you think they are.
>>
>> Firstly let me state that I have no experience with CephFS at all, but
>> what you're seeing isn't likely related to it all.
>>
>> Next lets establish some parameters.
>> You're testing kernel and fuse from the same machine, right?
>> What is the write speed (throughput) when doing this with fuse compared to
>> the speed when doing this via the kernel module?
> 
> Now I add 2 and out 2 OSDs to 1 of 3 node (2T->4T), cluster under hardwork, so
> no benchmarks now. But I good understand this point. And after message I got
> slow request on fuse too.
> 
>> What is the top speed of your cluster when doing a 
>> "rados -p  bench 60 write -t 32" from your test machine?
>> Does this result in slow requests as well?
> 
> Hmm... may be later. Now I have no rados pools, only RBD, DATA & METADATA.
> 
>> What I think is happening is that you're simply at the limits of your
>> current cluster and that fuse is slower, thus not exposing this.
>> The kernel module is likely fast AND also will use pagecache, thus creating
>> very large writes (how much memory does your test machine have) when it
>> gets flushed.
> 
> I bound all read/write values in kernel client more then fuse.
> 
> Mostly I understand - problem are fast write & slow HDDs. But IMHO some
> mechanisms must prevent it (congestion-like). And early I don't observe this
> problem on similar configs.
> 
> Later, if I will have more info, I say more. May be PREEMPT kernel is "wrong"
> there...
> 

After a series of experiments (and multiple "slow requests" during OSD
add/remove and backfills) I found a solution (and also created unrecoverable
"inconsistent" PGs in the data pool, outside real files - that data pool has
been re-created now, all OK). So, the solution: the
caps_wanted_delay_max=5 option on the kernel mount.
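For reference, it is passed like any other kernel client mount option, e.g.
(monitor address and secret file are placeholders):

mount -t ceph 192.168.0.1:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret,caps_wanted_delay_max=5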

-- 
WBR, Dzianis Kahanovich AKA Denis Kaganovich, http://mahatma.bspu.unibel.by/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph striping

2016-04-11 Thread Jason Dillaman
In general, RBD "fancy" striping can help under certain workloads where small 
IO would normally be hitting the same object (e.g. small sequential IO). 
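For reference, striping is set at image creation time; a sketch matching the
layout asked about below (pool/image names are placeholders, stripe unit in
bytes, and image format 2 is needed for custom striping) would be something
like:

rbd create --size 102400 --image-format 2 --stripe-unit 524288 --stripe-count 8 mypool/vmdisk1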

-- 

Jason Dillaman 


- Original Message -
> From: "Alwin Antreich" 
> To: ceph-users@lists.ceph.com
> Sent: Thursday, April 7, 2016 2:48:45 PM
> Subject: [ceph-users] ceph striping
> 
> Hi All,
> 
> first I wanted to say hello, as I am new to the list.
> 
> Secondly, we want to use ceph for VM disks and cephfs for our source
> code, image data, login directories, etc.
> 
> I would like to know, if striping would improve performance if we would
> set something like the following and move away from the defaults?
> 
> size = 4 MB
> stripe_unit = 512KB
> stripe_count = 8
> 
> http://docs.ceph.com/docs/master/man/8/rbd/#striping
> 
> Thanks in advance.
> 
> Best regards,
> Alwin
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] cephfs Kernel panic

2016-04-11 Thread Simon Ferber
Hi,

I am trying to set up a ceph cluster on Debian 8.4. Mainly I followed the
tutorial at
http://adminforge.de/raid/ceph/ceph-cluster-unter-debian-wheezy-installieren/

As far as I can see, the first steps are working fine. I have two
nodes with four OSDs on each node.
This is the output of ceph -s:

cluster 2a028d5e-5708-4fc4-9c0d-3495c1a3ef3d
 health HEALTH_OK
 monmap e2: 2 mons at
{ollie2=129.217.207.207:6789/0,stan2=129.217.207.206:6789/0}
election epoch 12, quorum 0,1 stan2,ollie2
 mdsmap e10: 1/1/1 up {0=ollie2=up:active}, 1 up:standby
 osdmap e72: 8 osds: 8 up, 8 in
flags sortbitwise
  pgmap v137: 428 pgs, 4 pools, 2396 bytes data, 20 objects
281 MB used, 14856 GB / 14856 GB avail
 428 active+clean

Then I tried to add cephfs following the manual at
http://docs.ceph.com/docs/hammer/cephfs/createfs/, which seems to do its
magic:
root@stan2:~# ceph fs ls
name: cephfs, metadata pool: cephfs_metadata, data pools: [cephfs_data ]

However, as soon as I try to mount the cephfs with mount.ceph
129.217.207.206:6789:/ /mnt/ -v -o
name=cephfs,secretfile=/etc/ceph/client.cephfs, the server which tries to
mount crashes and has to be cold started again. To be able to use
mount.ceph I had to install ceph-fs-common - in case that matters...

Here is the kernel.log. Can you give me any hints? I have been pretty stuck
on this for the last few days.

Apr 11 16:25:02 stan2 kernel: [  171.086381] Key type ceph registered
Apr 11 16:25:02 stan2 kernel: [  171.086649] libceph: loaded (mon/osd
proto 15/24)
Apr 11 16:25:02 stan2 kernel: [  171.090582] FS-Cache: Netfs 'ceph'
registered for caching
Apr 11 16:25:02 stan2 kernel: [  171.090596] ceph: loaded (mds proto 32)
Apr 11 16:25:02 stan2 kernel: [  171.096727] libceph: client34164 fsid
2a028d5e-5708-4fc4-9c0d-3495c1a3ef3d
Apr 11 16:25:02 stan2 kernel: [  171.133832] libceph: mon0
129.217.207.206:6789 session established
Apr 11 16:25:02 stan2 kernel: [  171.161199] [ cut here
]
Apr 11 16:25:02 stan2 kernel: [  171.161239] kernel BUG at
/build/linux-lqALYs/linux-3.16.7-ckt25/fs/ceph/mds_client.c:1846!
Apr 11 16:25:02 stan2 kernel: [  171.161294] invalid opcode:  [#1] SMP
Apr 11 16:25:02 stan2 kernel: [  171.161328] Modules linked in: cbc ceph
libceph xfs libcrc32c crc32c_generic binfmt_misc mptctl mptbase nfsd
auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc nls_utf8
nls_cp437 vfat fat x86_pkg_temp_thermal intel_powerclamp intel_rapl
coretemp kvm_intel kvm crc32_pclmul cryptd iTCO_wdt iTCO_vendor_support
efi_pstore efivars pcspkr joydev evdev ast i2c_i801 ttm drm_kms_helper
drm lpc_ich mfd_core mei_me mei shpchp ioatdma tpm_tis wmi tpm ipmi_si
ipmi_msghandler processor thermal_sys acpi_power_meter button acpi_pad
fuse autofs4 ext4 crc16 mbcache jbd2 dm_mod raid1 md_mod hid_generic sg
usbhid hid sd_mod crc_t10dif crct10dif_generic crct10dif_pclmul
crct10dif_common crc32c_intel ahci libahci ehci_pci mpt3sas igb
raid_class i2c_algo_bit xhci_hcd libata ehci_hcd scsi_transport_sas
i2c_core dca usbcore ptp usb_common scsi_mod pps_core
Apr 11 16:25:02 stan2 kernel: [  171.162046] CPU: 0 PID: 3513 Comm:
kworker/0:9 Not tainted 3.16.0-4-amd64 #1 Debian 3.16.7-ckt25-2
Apr 11 16:25:02 stan2 kernel: [  171.162104] Hardware name: Supermicro
SYS-6028R-WTR/X10DRW-i, BIOS 1.0c 01/07/2015
Apr 11 16:25:02 stan2 kernel: [  171.162158] Workqueue: ceph-msgr
con_work [libceph]
Apr 11 16:25:02 stan2 kernel: [  171.162194] task: 88103f2e8ae0 ti:
88103bfbc000 task.ti: 88103bfbc000
Apr 11 16:25:02 stan2 kernel: [  171.162243] RIP:
0010:[]  []
__prepare_send_request+0x801/0x810 [ceph]
Apr 11 16:25:02 stan2 kernel: [  171.162312] RSP: 0018:88103bfbfba8 
EFLAGS: 00010283
Apr 11 16:25:02 stan2 kernel: [  171.162347] RAX: 88103f88ad42 RBX:
88103f7f7400 RCX: 
Apr 11 16:25:02 stan2 kernel: [  171.162394] RDX: 164c5ec6 RSI:
 RDI: 88103f88ad32
Apr 11 16:25:02 stan2 kernel: [  171.162440] RBP: 88103f7f95e0 R08:
 R09: 
Apr 11 16:25:02 stan2 kernel: [  171.162485] R10:  R11:
002c R12: 88103f7f7c00
Apr 11 16:25:02 stan2 kernel: [  171.162531] R13: 88103f88acc0 R14:
 R15: 88103f88ad3a
Apr 11 16:25:02 stan2 kernel: [  171.162578] FS:  ()
GS:88107fc0() knlGS:
Apr 11 16:25:02 stan2 kernel: [  171.162629] CS:  0010 DS:  ES: 
CR0: 80050033
Apr 11 16:25:02 stan2 kernel: [  171.162668] CR2: 7fa73ca0a000 CR3:
01a13000 CR4: 001407f0
Apr 11 16:25:02 stan2 kernel: [  171.162713] Stack:
Apr 11 16:25:02 stan2 kernel: [  171.162730]  88103bfbfbd4
88103ef39540 0001 
Apr 11 16:25:02 stan2 kernel: [  171.162787]  
 88103ef39540 
Apr 11 16:25:02 stan2 kernel: [  171.162845]  0001
 fff

Re: [ceph-users] cephfs Kernel panic

2016-04-11 Thread Ilya Dryomov
On Mon, Apr 11, 2016 at 4:37 PM, Simon Ferber
 wrote:
> Hi,
>
> I try to setup an ceph cluster on Debian 8.4. Mainly I followed a
> tutorial at
> http://adminforge.de/raid/ceph/ceph-cluster-unter-debian-wheezy-installieren/
>
> As far as I can see, the first steps are just working fine. I have two
> nodes with four OSD on both nodes.
> This is the output of ceph -s
>
> cluster 2a028d5e-5708-4fc4-9c0d-3495c1a3ef3d
>  health HEALTH_OK
>  monmap e2: 2 mons at
> {ollie2=129.217.207.207:6789/0,stan2=129.217.207.206:6789/0}
> election epoch 12, quorum 0,1 stan2,ollie2
>  mdsmap e10: 1/1/1 up {0=ollie2=up:active}, 1 up:standby
>  osdmap e72: 8 osds: 8 up, 8 in
> flags sortbitwise
>   pgmap v137: 428 pgs, 4 pools, 2396 bytes data, 20 objects
> 281 MB used, 14856 GB / 14856 GB avail
>  428 active+clean
>
> Then I tried to add cephfs following the manual at
> http://docs.ceph.com/docs/hammer/cephfs/createfs/ which seem to do it's
> magic:
> root@stan2:~# ceph fs ls
> name: cephfs, metadata pool: cephfs_metadata, data pools: [cephfs_data ]
>
> However, as soon as I try to mount the cephfs with mount.ceph
> 129.217.207.206:6789:/ /mnt/ -v -o
> name=cephfs,secretfile=/etc/ceph/client.cephfs the server which tries to
> mount crashes and has to be cold started again. To be able to use
> mount.ceph I had to install ceph-fs-common - if that does matter...
>
> Here is the kernel.log. Can you give me hints? I am pretty stuck on this
> for the last few days.
>
> Apr 11 16:25:02 stan2 kernel: [  171.086381] Key type ceph registered
> Apr 11 16:25:02 stan2 kernel: [  171.086649] libceph: loaded (mon/osd
> proto 15/24)
> Apr 11 16:25:02 stan2 kernel: [  171.090582] FS-Cache: Netfs 'ceph'
> registered for caching
> Apr 11 16:25:02 stan2 kernel: [  171.090596] ceph: loaded (mds proto 32)
> Apr 11 16:25:02 stan2 kernel: [  171.096727] libceph: client34164 fsid
> 2a028d5e-5708-4fc4-9c0d-3495c1a3ef3d
> Apr 11 16:25:02 stan2 kernel: [  171.133832] libceph: mon0
> 129.217.207.206:6789 session established
> Apr 11 16:25:02 stan2 kernel: [  171.161199] [ cut here
> ]
> Apr 11 16:25:02 stan2 kernel: [  171.161239] kernel BUG at
> /build/linux-lqALYs/linux-3.16.7-ckt25/fs/ceph/mds_client.c:1846!
> Apr 11 16:25:02 stan2 kernel: [  171.161294] invalid opcode:  [#1] SMP
> Apr 11 16:25:02 stan2 kernel: [  171.161328] Modules linked in: cbc ceph
> libceph xfs libcrc32c crc32c_generic binfmt_misc mptctl mptbase nfsd
> auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc nls_utf8
> nls_cp437 vfat fat x86_pkg_temp_thermal intel_powerclamp intel_rapl
> coretemp kvm_intel kvm crc32_pclmul cryptd iTCO_wdt iTCO_vendor_support
> efi_pstore efivars pcspkr joydev evdev ast i2c_i801 ttm drm_kms_helper
> drm lpc_ich mfd_core mei_me mei shpchp ioatdma tpm_tis wmi tpm ipmi_si
> ipmi_msghandler processor thermal_sys acpi_power_meter button acpi_pad
> fuse autofs4 ext4 crc16 mbcache jbd2 dm_mod raid1 md_mod hid_generic sg
> usbhid hid sd_mod crc_t10dif crct10dif_generic crct10dif_pclmul
> crct10dif_common crc32c_intel ahci libahci ehci_pci mpt3sas igb
> raid_class i2c_algo_bit xhci_hcd libata ehci_hcd scsi_transport_sas
> i2c_core dca usbcore ptp usb_common scsi_mod pps_core
> Apr 11 16:25:02 stan2 kernel: [  171.162046] CPU: 0 PID: 3513 Comm:
> kworker/0:9 Not tainted 3.16.0-4-amd64 #1 Debian 3.16.7-ckt25-2
> Apr 11 16:25:02 stan2 kernel: [  171.162104] Hardware name: Supermicro
> SYS-6028R-WTR/X10DRW-i, BIOS 1.0c 01/07/2015
> Apr 11 16:25:02 stan2 kernel: [  171.162158] Workqueue: ceph-msgr
> con_work [libceph]
> Apr 11 16:25:02 stan2 kernel: [  171.162194] task: 88103f2e8ae0 ti:
> 88103bfbc000 task.ti: 88103bfbc000
> Apr 11 16:25:02 stan2 kernel: [  171.162243] RIP:
> 0010:[]  []
> __prepare_send_request+0x801/0x810 [ceph]
> Apr 11 16:25:02 stan2 kernel: [  171.162312] RSP: 0018:88103bfbfba8
> EFLAGS: 00010283
> Apr 11 16:25:02 stan2 kernel: [  171.162347] RAX: 88103f88ad42 RBX:
> 88103f7f7400 RCX: 
> Apr 11 16:25:02 stan2 kernel: [  171.162394] RDX: 164c5ec6 RSI:
>  RDI: 88103f88ad32
> Apr 11 16:25:02 stan2 kernel: [  171.162440] RBP: 88103f7f95e0 R08:
>  R09: 
> Apr 11 16:25:02 stan2 kernel: [  171.162485] R10:  R11:
> 002c R12: 88103f7f7c00
> Apr 11 16:25:02 stan2 kernel: [  171.162531] R13: 88103f88acc0 R14:
>  R15: 88103f88ad3a
> Apr 11 16:25:02 stan2 kernel: [  171.162578] FS:  ()
> GS:88107fc0() knlGS:
> Apr 11 16:25:02 stan2 kernel: [  171.162629] CS:  0010 DS:  ES: 
> CR0: 80050033
> Apr 11 16:25:02 stan2 kernel: [  171.162668] CR2: 7fa73ca0a000 CR3:
> 01a13000 CR4: 001407f0
> Apr 11 16:25:02 stan2 kernel: [  171.162713] Stack:
> Apr 11 16:25:02 stan2 kernel: [  171.162730]  88103bfbfbd4
> 8810

Re: [ceph-users] Powercpu and ceph

2016-04-11 Thread Gregory Farnum
Upstream doesn't test Ceph on Power. We built it semi-regularly several
years ago but that has fallen by the wayside as well. I think some distros
still package it though; and we are fairly careful about endianness and
things so it's supposed to work.
-Greg

On Sunday, April 10, 2016, louis  wrote:

>
> Hi, I see many cases using x86 arch CPUs, but I also have several servers
> with power arch, and want to use them in ceph. Can anybody tell me whether
> ceph running on power arch will be stable? Thanks
> Sent from NetEase Mail Master
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Re: Powercpu and ceph

2016-04-11 Thread louis


 
 

Yes, I installed ceph on a power server and can run good path IO, but how can
I prove it is stable on the power arch? Use the ceph test suite? Thanks
Sent from NetEase Mail Master

On 2016-04-11 23:44, Gregory Farnum wrote:
> Upstream doesn't test Ceph on Power. We built it semi-regularly several
> years ago but that has fallen by the wayside as well. I think some distros
> still package it though; and we are fairly careful about endianness and
> things so it's supposed to work.
> -Greg
>
> On Sunday, April 10, 2016, louis wrote:
>
>> Hi, I see many cases using x86 arch CPUs, but I also have several servers
>> with power arch, and want to use them in ceph. Can anybody tell me whether
>> ceph running on power arch will be stable? Thanks
>> Sent from NetEase Mail Master
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Powercpu and ceph

2016-04-11 Thread Gregory Farnum
If you've got the time to run teuthology/ceph-qa-suite on it, that would be
awesome!

But really if you've got it running now, you're probably good. You can
exercise basically all the riskiest bits by killing some OSDs and
then turning them back on once the cluster has finished peering after it
marks them down.
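For illustration only (init commands differ between Upstart and systemd
hosts; the OSD id is arbitrary):

sudo stop ceph-osd id=3       # or: sudo systemctl stop ceph-osd@3
ceph -w                       # wait until the OSD is marked down and peering settles
sudo start ceph-osd id=3      # bring it back and let recovery finish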
-Greg

On Monday, April 11, 2016, louis  wrote:

>
> Yes, I installed ceph on a power server and can run good path IO, but how
> can I prove it is stable on the power arch? Use the ceph test suite? Thanks
> Sent from NetEase Mail Master
> On 2016-04-11 23:44, Gregory Farnum
> wrote:
>
> Upstream doesn't test Ceph on Power. We built it semi-regularly several
> years ago but that has fallen by the wayside as well. I think some distros
> still package it though; and we are fairly careful about endianness and
> things so it's supposed to work.
> -Greg
>
> On Sunday, April 10, 2016, louis  > wrote:
>
>>
>> Hi, I see many cases using x86 arch CPUs, but I also have several servers
>> with power arch, and want to use them in ceph. Can anybody tell me whether
>> ceph running on power arch will be stable? Thanks
>> Sent from NetEase Mail Master
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Fwd: Re: Ubuntu xenial and ceph jewel systemd

2016-04-11 Thread hp cre
-- Forwarded message --
From: "hp cre" 
Date: 11 Apr 2016 15:50
Subject: Re: [ceph-users] Ubuntu xenial and ceph jewel systemd
To: "James Page" 
Cc:

Here is exactly what has been done (just started from scratch today):

1- install default xenial beta 2

2- run apt-get update && apt-get dist-upgrade (this step was not done on
first trial)
after update, got warning as follows:
"W: plymouth: The plugin label.so is missing, the selected theme might not
work as expected.
W: plymouth: You might want to install the plymouth-themes and
plymouth-label package to fix this.
W: mdadm: /etc/mdadm/mdadm.conf defines no arrays.
"
so I ran apt-get -y install plymouth-themes

3- wget
http://download.ceph.com/debian-jewel/pool/main/c/ceph-deploy/ceph-deploy_1.5.31_all.deb

4- dpkg -i ceph-deploy_1.5.31_all.deb
This gave errors about unmet dependencies, so I ran apt-get -f install,
which installed all the missing packages.

5- followed ceph docs preflight checklist (sudo file, ssh config file,
ssh-copy-id, install ntp)

Followed the storage cluster quick start guide

6- ceph-deploy new xen1 (first node) --> all ok

7-  edit ceph.conf --> osd pool default size = 2

8- ceph-deploy install --release=jewel xen1 --> all ok (this time it
installed jewel 10.1.1, yesterday it was 10.1.0)

9- ceph-deploy mon create-initial --> same error:

wes@xen1:~/cl$ ceph-deploy mon create-initial
[ceph_deploy.conf][DEBUG ] found configuration file at:
/home/wes/.cephdeploy.conf
[ceph_deploy.cli][INFO  ] Invoked (1.5.31): /usr/bin/ceph-deploy mon
create-initial
[ceph_deploy.cli][INFO  ] ceph-deploy options:
[ceph_deploy.cli][INFO  ]  username  : None
[ceph_deploy.cli][INFO  ]  verbose   : False
[ceph_deploy.cli][INFO  ]  overwrite_conf: False
[ceph_deploy.cli][INFO  ]  subcommand: create-initial
[ceph_deploy.cli][INFO  ]  quiet : False
[ceph_deploy.cli][INFO  ]  cd_conf   :

[ceph_deploy.cli][INFO  ]  cluster   : ceph
[ceph_deploy.cli][INFO  ]  func  : 
[ceph_deploy.cli][INFO  ]  ceph_conf : None
[ceph_deploy.cli][INFO  ]  keyrings  : None
[ceph_deploy.cli][INFO  ]  default_release   : False
[ceph_deploy.mon][DEBUG ] Deploying mon, cluster ceph hosts xen1
[ceph_deploy.mon][DEBUG ] detecting platform for host xen1 ...
[xen1][DEBUG ] connection detected need for sudo
[xen1][DEBUG ] connected to host: xen1
[xen1][DEBUG ] detect platform information from remote host
[xen1][DEBUG ] detect machine type
[xen1][DEBUG ] find the location of an executable
[ceph_deploy.mon][INFO  ] distro info: Ubuntu 16.04 xenial
[xen1][DEBUG ] determining if provided host has same hostname in remote
[xen1][DEBUG ] get remote short hostname
[xen1][DEBUG ] deploying mon to xen1
[xen1][DEBUG ] get remote short hostname
[xen1][DEBUG ] remote hostname: xen1
[xen1][DEBUG ] write cluster configuration to /etc/ceph/{cluster}.conf
[xen1][DEBUG ] create the mon path if it does not exist
[xen1][DEBUG ] checking for done path: /var/lib/ceph/mon/ceph-xen1/done
[xen1][DEBUG ] done path does not exist: /var/lib/ceph/mon/ceph-xen1/done
[xen1][INFO  ] creating keyring file:
/var/lib/ceph/tmp/ceph-xen1.mon.keyring
[xen1][DEBUG ] create the monitor keyring file
[xen1][INFO  ] Running command: sudo ceph-mon --cluster ceph --mkfs -i xen1
--keyring /var/lib/ceph/tmp/ceph-xen1.mon.keyring --setuser 64045
--setgroup 64045
[xen1][DEBUG ] ceph-mon: mon.noname-a 192.168.56.10:6789/0 is local,
renaming to mon.xen1
[xen1][DEBUG ] ceph-mon: set fsid to d56c2ad9-dc66-4b6a-b269-e32eecc05571
[xen1][DEBUG ] ceph-mon: created monfs at /var/lib/ceph/mon/ceph-xen1 for
mon.xen1
[xen1][INFO  ] unlinking keyring file
/var/lib/ceph/tmp/ceph-xen1.mon.keyring
[xen1][DEBUG ] create a done file to avoid re-doing the mon deployment
[xen1][DEBUG ] create the init path if it does not exist
[xen1][INFO  ] Running command: sudo initctl emit ceph-mon cluster=ceph
id=xen1
[xen1][ERROR ] Traceback (most recent call last):
[xen1][ERROR ]   File
"/usr/lib/python2.7/dist-packages/ceph_deploy/lib/vendor/remoto/process.py",
line 119, in run
[xen1][ERROR ] reporting(conn, result, timeout)
[xen1][ERROR ]   File
"/usr/lib/python2.7/dist-packages/ceph_deploy/lib/vendor/remoto/log.py",
line 13, in reporting
[xen1][ERROR ] received = result.receive(timeout)
[xen1][ERROR ]   File
"/usr/lib/python2.7/dist-packages/ceph_deploy/lib/vendor/remoto/lib/vendor/execnet/gateway_base.py",
line 704, in receive
[xen1][ERROR ] raise self._getremoteerror() or EOFError()
[xen1][ERROR ] RemoteError: Traceback (most recent call last):
[xen1][ERROR ]   File
"/usr/lib/python2.7/dist-packages/ceph_deploy/lib/vendor/remoto/lib/vendor/execnet/gateway_base.py",
line 1036, in executetask
[xen1][ERROR ] function(channel, **kwargs)
[xen1][ERROR ]   File "", line 12, in _remote_run
[xen1][ERROR ]   File "/usr/lib/python2.7/subproce

Re: [ceph-users] OSD activate Error

2016-04-11 Thread Bob R
I'd guess you previously removed an osd.0 but forgot to perform 'ceph auth
del osd.0'

'ceph auth list' might show some other stray certs.

Bob

On Mon, Apr 4, 2016 at 9:52 PM,  wrote:

> Hi,
>
>
>
> I keep getting this error while try to activate:
>
>
>
> [root@mon01 ceph]# ceph-deploy osd prepare osd01:sdc:/dev/sde1
>
> [ceph_deploy.conf][DEBUG ] found configuration file at:
> /root/.cephdeploy.conf
>
> [ceph_deploy.cli][INFO  ] Invoked (1.5.31): /usr/bin/ceph-deploy osd
> prepare osd01:sdc:/dev/sde1
>
> [ceph_deploy.cli][INFO  ] ceph-deploy options:
>
> [ceph_deploy.cli][INFO  ]  username  : None
>
> [ceph_deploy.cli][INFO  ]  disk  : [('osd01',
> '/dev/sdc', '/dev/sde1')]
>
> [ceph_deploy.cli][INFO  ]  dmcrypt   : False
>
> [ceph_deploy.cli][INFO  ]  verbose   : False
>
> [ceph_deploy.cli][INFO  ]  overwrite_conf: False
>
> [ceph_deploy.cli][INFO  ]  subcommand: prepare
>
> [ceph_deploy.cli][INFO  ]  dmcrypt_key_dir   :
> /etc/ceph/dmcrypt-keys
>
> [ceph_deploy.cli][INFO  ]  quiet : False
>
> [ceph_deploy.cli][INFO  ]  cd_conf   :
> 
>
> [ceph_deploy.cli][INFO  ]  cluster   : ceph
>
> [ceph_deploy.cli][INFO  ]  fs_type   : xfs
>
> [ceph_deploy.cli][INFO  ]  func  :  at 0xb67320>
>
> [ceph_deploy.cli][INFO  ]  ceph_conf : None
>
> [ceph_deploy.cli][INFO  ]  default_release   : False
>
> [ceph_deploy.cli][INFO  ]  zap_disk  : False
>
> [ceph_deploy.osd][DEBUG ] Preparing cluster ceph disks
> osd01:/dev/sdc:/dev/sde1
>
> [osd01][DEBUG ] connected to host: osd01
>
> [osd01][DEBUG ] detect platform information from remote host
>
> [osd01][DEBUG ] detect machine type
>
> [osd01][DEBUG ] find the location of an executable
>
> [ceph_deploy.osd][INFO  ] Distro info: CentOS Linux 7.1.1503 Core
>
> [ceph_deploy.osd][DEBUG ] Deploying osd to osd01
>
> [osd01][DEBUG ] write cluster configuration to /etc/ceph/{cluster}.conf
>
> [ceph_deploy.osd][DEBUG ] Preparing host osd01 disk /dev/sdc journal
> /dev/sde1 activate False
>
> [osd01][INFO  ] Running command: ceph-disk -v prepare --cluster ceph
> --fs-type xfs -- /dev/sdc /dev/sde1
>
> [osd01][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-osd
> --check-allows-journal -i 0 --cluster ceph
>
> [osd01][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-osd
> --check-wants-journal -i 0 --cluster ceph
>
> [osd01][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-osd
> --check-needs-journal -i 0 --cluster ceph
>
> [osd01][WARNIN] DEBUG:ceph-disk:get_dm_uuid /dev/sdc uuid path is
> /sys/dev/block/8:32/dm/uuid
>
> [osd01][WARNIN] DEBUG:ceph-disk:get_dm_uuid /dev/sdc uuid path is
> /sys/dev/block/8:32/dm/uuid
>
> [osd01][WARNIN] DEBUG:ceph-disk:get_dm_uuid /dev/sdc uuid path is
> /sys/dev/block/8:32/dm/uuid
>
> [osd01][WARNIN] DEBUG:ceph-disk:get_dm_uuid /dev/sde1 uuid path is
> /sys/dev/block/8:65/dm/uuid
>
> [osd01][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-osd
> --cluster=ceph --show-config-value=fsid
>
> [osd01][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf
> --cluster=ceph --name=osd. --lookup osd_mkfs_options_xfs
>
> [osd01][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf
> --cluster=ceph --name=osd. --lookup osd_fs_mkfs_options_xfs
>
> [osd01][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf
> --cluster=ceph --name=osd. --lookup osd_mount_options_xfs
>
> [osd01][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf
> --cluster=ceph --name=osd. --lookup osd_fs_mount_options_xfs
>
> [osd01][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-osd
> --cluster=ceph --show-config-value=osd_journal_size
>
> [osd01][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf
> --cluster=ceph --name=osd. --lookup osd_cryptsetup_parameters
>
> [osd01][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf
> --cluster=ceph --name=osd. --lookup osd_dmcrypt_key_size
>
> [osd01][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf
> --cluster=ceph --name=osd. --lookup osd_dmcrypt_type
>
> [osd01][WARNIN] DEBUG:ceph-disk:get_dm_uuid /dev/sdc uuid path is
> /sys/dev/block/8:32/dm/uuid
>
> [osd01][WARNIN] DEBUG:ceph-disk:get_dm_uuid /dev/sde1 uuid path is
> /sys/dev/block/8:65/dm/uuid
>
> [osd01][WARNIN] DEBUG:ceph-disk:Journal /dev/sde1 is a partition
>
> [osd01][WARNIN] WARNING:ceph-disk:OSD will not be hot-swappable if journal
> is not the same device as the osd data
>
> [osd01][WARNIN] INFO:ceph-disk:Running command: /usr/sbin/blkid -o udev -p
> /dev/sde1
>
> [osd01][WARNIN] WARNING:ceph-disk:Journal /dev/sde1 was not prepared with
> ceph-disk. Symlinking directly.
>
> [osd01][WARNIN] DEBUG:ceph-disk:get_dm_uuid /dev/sdc uuid path is
> /sys/dev/block/8:32/dm/uuid
>
> [osd01][WARNIN] DEBUG:ceph-disk:get_dm_uuid /dev/sdc u

Re: [ceph-users] upgraded to Ubuntu 16.04, getting assert failure

2016-04-11 Thread John Spray
On Sun, Apr 10, 2016 at 4:12 AM, Don Waterloo  wrote:
> I have a 6 osd system (w/ 3 mon, and 3 mds).
> it is running cephfs as part of its task.
>
> i have upgraded the 3 mon nodes to Ubuntu 16.04 and the bundled ceph
> 10.1.0-0ubuntu1.
>
> (upgraded from Ubuntu 15.10 with ceph 0.94.6-0ubuntu0.15.10.1).
>
> 2 of the mon nodes are happy and up. But the 3rd is giving an asset failure
> on start.
> specifically the assert is:
> mds/FSMap.cc: 555: FAILED assert(i.second.state == MDSMap::STATE_STANDBY)
>
> The 'ceph status' is showing 3 mds (1 up active, 2 up standby);
>
> # ceph status
> 2016-04-10 03:08:24.522804 7f2be870c700  0 -- :/1760247070 >>
> 10.100.10.62:6789/0 pipe(0x7f2be405a2f0 sd=3 :0 s=1 pgs=0 cs=0 l=1
> c=0x7f2be405bf90).fault
> cluster b23abffc-71c4-4464-9449-3f2c9fbe1ded
>  health HEALTH_WARN
> crush map has legacy tunables (require bobtail, min is firefly)
> 1 mons down, quorum 0,1 nubo-1,nubo-2
>  monmap e1: 3 mons at
> {nubo-1=10.100.10.60:6789/0,nubo-2=10.100.10.61:6789/0,nubo-3=10.100.10.62:6789/0}
> election epoch 2778, quorum 0,1 nubo-1,nubo-2
>  mdsmap e1279: 1/1/1 up {0:0=nubo-2=up:active}, 2 up:standby
>  osdmap e5666: 6 osds: 6 up, 6 in
>   pgmap v1476810: 712 pgs, 5 pools, 41976 MB data, 109 kobjects
> 86310 MB used, 5538 GB / 5622 GB avail
>  712 active+clean
>
> I'm not sure what to do @ this stage. I've rebooted all of them, i've tried
> taking the 2 standby MDS down. I don't see why this mon fails when the
> others succeed.
>
> Does anyone have any suggestions?
>
> The stack trace from the assert gives:
>  1: (()+0x51fb9d) [0x5572d9e42b9d]
>  2: (()+0x113e0) [0x7fa285f8b3e0]
>  3: (gsignal()+0x38) [0x7fa28416b518]
>  4: (abort()+0x16a) [0x7fa28416d0ea]
>  5: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x26b) [0x5572d9f7082b]
>  6: (FSMap::sanity() const+0x9ae) [0x5572d9e84f4e]
>  7: (MDSMonitor::update_from_paxos(bool*)+0x313) [0x5572d9c7e8f3]
>  8: (PaxosService::refresh(bool*)+0x3dd) [0x5572d9c012dd]
>  9: (Monitor::refresh_from_paxos(bool*)+0x193) [0x5572d9b99693]
>  10: (Monitor::init_paxos()+0x115) [0x5572d9b99ad5]
>  11: (Monitor::preinit()+0x902) [0x5572d9bca252]
>  12: (main()+0x255b) [0x5572d9b3ec9b]
>  13: (__libc_start_main()+0xf1) [0x7fa284156841]
>  14: (_start()+0x29) [0x5572d9b8b869]

Please provide the full log from the mon starting up to it crashing,
with "debug mon = 10" set.

If the mons are really all running the same code but only one is
failing, presumably that one has somehow during the upgrade process
ended up storing something invalid in its local stores while the
others have somehow proceeded past that version already.

v10.1.1 (i.e. Jewel, when it is released) has a configuration option
(mon_mds_skip_sanity) that may allow you to get past this, assuming
what's in the leader's store is indeed valid (guessing it is since
your other two mons are apparently happy).

I don't know exactly how the Ubuntu release process works, but you
should be aware that the Ceph version you're running is pre-release
code from the jewel branch.

If your CephFS data pool happens to have ID 0, you will also hit a
severe bug in that code, and you should stop using it now (see the
note here: http://blog.gmane.org/gmane.comp.file-systems.ceph.announce)

John
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Re: upgraded to Ubuntu 16.04, getting assert failure

2016-04-11 Thread Chad William Seys
Hi Don,
I had a similar problem starting a mon.  In my case a computer failed and
I removed and recreated the 3rd mon on a new computer.  It would start but
never get added to the other mons' lists.
Restarting the other two mons caused them to add the third to their
monmap.
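(Whether the third mon has actually joined can be checked afterwards with
"ceph mon stat" or "ceph quorum_status".)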

Good luck!
Chad.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fwd: Re: Ubuntu xenial and ceph jewel systemd

2016-04-11 Thread Peter Sabaini
On 2016-04-11 18:15, hp cre wrote:
> -- Forwarded message --
> From: "hp cre" mailto:hpc...@gmail.com>>
> Date: 11 Apr 2016 15:50
> Subject: Re: [ceph-users] Ubuntu xenial and ceph jewel systemd
> To: "James Page" mailto:james.p...@ubuntu.com>>
> Cc:
> 
> Here is exactly what has been done (just started from scratch today):
> 
> 1- install default xenial beta 2
> 
> 2- run apt-get update && apt-get dist-upgrade (this step was not done on
> first trial)
> after update, got warning as follows:
> "W: plymouth: The plugin label.so is missing, the selected theme might
> not work as expected.
> W: plymouth: You might want to install the plymouth-themes and
> plymouth-label package to fix this.
> W: mdadm: /etc/mdadm/mdadm.conf defines no arrays.
> "
> so i ran apt-get -y install plymouth-themes
> 
> 3- wget
> http://download.ceph.com/debian-jewel/pool/main/c/ceph-deploy/ceph-deploy_1.5.31_all.deb

Did you try with the xenial ceph-deploy package? No need to wget,
they're right in the repository

# apt-get install ceph-deploy


> 4- dpkg -i ceph-deploy_1.5.31_all.deb
> got errors of unmet dependencies, so i ran apt-get -f install. this
> installed all missing packages.
> 
> 5- followed ceph docs preflight checklist (sudo file, ssh config file,
> ssh-copy-id, install ntp)
> 
> Followed the storage cluster quick start guide
> 
> 6- ceph-deploy new xen1 (first node) --> all ok
> 
> 7-  edit ceph.conf --> osd pool default size = 2
> 
> 8- ceph-deploy install --release=jewel xen1 --> all ok (this time it
> installed jewel 10.1.1, yesterday it was 10.1.0)
> 
> 9- ceph-deploy mon create-initial --> same error:
> 
> wes@xen1:~/cl$ ceph-deploy mon create-initial
> [ceph_deploy.conf][DEBUG ] found configuration file at:
> /home/wes/.cephdeploy.conf
> [ceph_deploy.cli][INFO  ] Invoked (1.5.31): /usr/bin/ceph-deploy mon
> create-initial
> [ceph_deploy.cli][INFO  ] ceph-deploy options:
> [ceph_deploy.cli][INFO  ]  username  : None
> [ceph_deploy.cli][INFO  ]  verbose   : False
> [ceph_deploy.cli][INFO  ]  overwrite_conf: False
> [ceph_deploy.cli][INFO  ]  subcommand: create-initial
> [ceph_deploy.cli][INFO  ]  quiet : False
> [ceph_deploy.cli][INFO  ]  cd_conf   :
> 
> [ceph_deploy.cli][INFO  ]  cluster   : ceph
> [ceph_deploy.cli][INFO  ]  func  :  at 0x7ffb88bdcf50>
> [ceph_deploy.cli][INFO  ]  ceph_conf : None
> [ceph_deploy.cli][INFO  ]  keyrings  : None
> [ceph_deploy.cli][INFO  ]  default_release   : False
> [ceph_deploy.mon][DEBUG ] Deploying mon, cluster ceph hosts xen1
> [ceph_deploy.mon][DEBUG ] detecting platform for host xen1 ...
> [xen1][DEBUG ] connection detected need for sudo
> [xen1][DEBUG ] connected to host: xen1
> [xen1][DEBUG ] detect platform information from remote host
> [xen1][DEBUG ] detect machine type
> [xen1][DEBUG ] find the location of an executable
> [ceph_deploy.mon][INFO  ] distro info: Ubuntu 16.04 xenial
> [xen1][DEBUG ] determining if provided host has same hostname in remote
> [xen1][DEBUG ] get remote short hostname
> [xen1][DEBUG ] deploying mon to xen1
> [xen1][DEBUG ] get remote short hostname
> [xen1][DEBUG ] remote hostname: xen1
> [xen1][DEBUG ] write cluster configuration to /etc/ceph/{cluster}.conf
> [xen1][DEBUG ] create the mon path if it does not exist
> [xen1][DEBUG ] checking for done path: /var/lib/ceph/mon/ceph-xen1/done
> [xen1][DEBUG ] done path does not exist: /var/lib/ceph/mon/ceph-xen1/done
> [xen1][INFO  ] creating keyring file:
> /var/lib/ceph/tmp/ceph-xen1.mon.keyring
> [xen1][DEBUG ] create the monitor keyring file
> [xen1][INFO  ] Running command: sudo ceph-mon --cluster ceph --mkfs -i
> xen1 --keyring /var/lib/ceph/tmp/ceph-xen1.mon.keyring --setuser 64045
> --setgroup 64045
> [xen1][DEBUG ] ceph-mon: mon.noname-a 192.168.56.10:6789/0
>  is local, renaming to mon.xen1
> [xen1][DEBUG ] ceph-mon: set fsid to d56c2ad9-dc66-4b6a-b269-e32eecc05571
> [xen1][DEBUG ] ceph-mon: created monfs at /var/lib/ceph/mon/ceph-xen1
> for mon.xen1
> [xen1][INFO  ] unlinking keyring file
> /var/lib/ceph/tmp/ceph-xen1.mon.keyring
> [xen1][DEBUG ] create a done file to avoid re-doing the mon deployment
> [xen1][DEBUG ] create the init path if it does not exist
> [xen1][INFO  ] Running command: sudo initctl emit ceph-mon cluster=ceph
> id=xen1
> [xen1][ERROR ] Traceback (most recent call last):
> [xen1][ERROR ]   File
> "/usr/lib/python2.7/dist-packages/ceph_deploy/lib/vendor/remoto/process.py",
> line 119, in run
> [xen1][ERROR ] reporting(conn, result, timeout)
> [xen1][ERROR ]   File
> "/usr/lib/python2.7/dist-packages/ceph_deploy/lib/vendor/remoto/log.py",
> line 13, in reporting
> [xen1][ERROR ] received = result.receive(timeout)
> [xen1][ERROR ]   File
> "/usr/lib/python2.7/dist-packages/ceph_deploy/lib/vendor/remoto

Re: [ceph-users] Fwd: Re: Ubuntu xenial and ceph jewel systemd

2016-04-11 Thread hp cre
I wanted to try the latest ceph-deploy. That's why I downloaded this version
(31). The latest ubuntu version is (20).

I tried today, at the end of the failed attempt, to uninstall this version
and install the one that came with xenial, but whatever I did, it always
defaulted to version 31. Maybe someone already upgraded the xenial repository
to use version 31 instead of 20.
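(For what it's worth, "apt-cache policy ceph-deploy" shows which version the
configured repositories currently consider the install candidate.)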
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fwd: Re: Ubuntu xenial and ceph jewel systemd

2016-04-11 Thread James Page
On Mon, 11 Apr 2016 at 21:35 hp cre  wrote:

> I wanted to try the latest ceph-deploy. Thats why i downloaded this
> version (31). Latest ubuntu version is (20).
>
> I tried today at the end of the failed attempt to uninstall this version
> and install the one that came with xenial,  but whatever i did, it always
> defaulted to version 31. Maybe someone already upgraded xenial repository
> to use version 31 instead of 20.
>
Yup that was me..
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] adding cache tier in productive hammer environment

2016-04-11 Thread Oliver Dzombic
Hi,

currently in use:

oldest:

SSDs: Intel S3510 80GB
HDD: HGST 6TB H3IKNAS600012872SE NAS

latest:

SSDs: Kingston 120 GB SV300
HDDs: HGST 3TB H3IKNAS30003272SE NAS

in future will be in use:

SSDs: Samsung SM863 240 GB
HDDs: HGST 3TB H3IKNAS30003272SE NAS and/or
Seagate ST2000NM0023 2 TB


-

It's hard to say whether, and how often, the newer nodes fail with the
OSDs getting marked down/out, compared to the old ones.

We did a lot to avoid that.

Without having real numbers, my feeling is/was that the newer ones fail
much less often. But what is responsible for that is unknown.

In the end, the old nodes, with 2x 2.3 GHz Intel Celeron (2 cores
without HT) for 3x 6 TB HDD, have much less CPU power per HDD compared
to the 4x 3.3 GHz Intel E3-1225v5 CPU (4 cores) with 10x 3 TB HDD.

So it's just too different: CPU, HDD, RAM, even the HDD controller.

I will have to make sure that the new cluster has enough hardware so
that I don't need to worry about possible problems there.

--

atop: sda/sdb == SSD journal

--

That was my first experience too. At the very first, deep scrubs and even
normal scrubs were driving the %WA and busyness of the HDDs to a flat 100%.

--

I rechecked it with munin.

The journal SSDs go from ~40% up to 80-90% during deep scrub.
The HDDs go from ~20% up to a more or less flat 90-100% during deep scrub.

At the same time, the load average goes to 16-20 (4 cores),
while the CPU sees up to 318% idle waiting time (iowait), out of a max of 400%.

--

The OSDs receive a peer timeout, which is understandable if the
system sees 300% iowait for long enough.


--

And yes, as it seems, clusters which are very busy, especially with
low hardware resources, need much more than the standard config
can/will deliver. As soon as the LTS is out I will have to start busting
my head over the available config parameters.

-- 
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive

mailto:i...@ip-interactive.de

Anschrift:

IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402 beim Amtsgericht Hanau
Geschäftsführung: Oliver Dzombic

Steuer Nr.: 35 236 3622 1
UST ID: DE274086107


Am 11.04.2016 um 05:06 schrieb Christian Balzer:
> 
> Hello,
> 
> On Sat, 9 Apr 2016 02:14:45 +0200 Oliver Dzombic wrote:
> 
>> Hi Christian,
>>
>> yeah i saw the problems with cache tier in the current hammer.
>>
>> But as far as i saw it, i would not get in touch with that szenarios. I
>> dont plan to change settings like that, to let it go rubbish.
>>
> Shouldn't, but I'd avoid it anyway.
>  
>> But i already decided to wait for jewel and create a whole new cluster
>> and copy all data.
>>
> Sounds like a safer alternative.
> 
>> -
>>
>> I am running KVM instances. And will also run openVZ instances. Maybe
>> LXC too, lets see. They run all kind of different, independent
>> applications.
>>
>> -
>>
>> Well i have to admit, the beginnings of the cluster were quiet
>> experimentel. Was using 4x ( 2x 2,3 GHz Intel Celeron CPU's for 3x 6 TB
>> HDD + 80 GB SSD with 16 GB RAM ). And extending it by 2 Additional of
>> that kind. And currently also an E3-1225v5 with 32 GB RAM and 10x 3 TB
>> HDD and 2x 120 GB SSD.
>>
> Would you mind sharing what exact models of HDDs and SSDs you're using?
> Also, is the newer node showing the same ratio of unresponsive OSDs as the
> older ones?
> 
> In the atop output you posted, which ones are the SSDs (if they're in
> there at all)?
> 
>> But all my munin tells me its HDD related, if you want i can show it to
>> you. I guess that the hardcore random access on the drives are just
>> killing it.
>>
> Yup, I've seen that with the "bad" cluster here, the first thing to
> indicate things were getting to the edge of IOPS capacity was that
> deep-scrubs killed performance and then even regular scrubs.
> 
>> I deactivated (deep) scrub also because of this problem and just let it
>> run in the night, like now, and having 90% utilization @ journals and
>> 97% utilization @ HDD's.
>>
> This confuses me as well, during deep-scrubs all data gets read, your
> journals shouldn't get busier than they were before and last time you
> mentioned them being around 60% or so?
> 
>> And yes, its simply fixed by restarting the OSD's.
>>
>> They receive a heartbeat timeout and just go out/down.
>>
> Which timeout is it, the peer one or the monitor one?
> Have you tried upping the various parameters to prevent this?
> 
>> I tried to set the flag, that there will be no out/down.
>> That worked. It did not got marked out/down, but it anyway happend and
>> the cluster got instable ( misplaced object / recovery ).
>>
> That's a band-aid indeed, but I wouldn't expect misplaced objects from it.
> 
>> Well as i see the situation in a case a VM has a file open and using it
>> "right now", which is located on a PG on that OSD which is going "right
>> now" down/out, then Filesystem of the VM will get in t

Re: [ceph-users] Fwd: Re: Ubuntu xenial and ceph jewel systemd

2016-04-11 Thread hp cre
Hey James,
Did you check my steps? What did you do differently and worked for your?
Thanks for sharing..
On 11 Apr 2016 22:39, "James Page"  wrote:

> On Mon, 11 Apr 2016 at 21:35 hp cre  wrote:
>
>> I wanted to try the latest ceph-deploy. That's why I downloaded this
>> version (31). The latest Ubuntu version is (20).
>>
>> I tried today, at the end of the failed attempt, to uninstall this version
>> and install the one that came with xenial, but whatever I did, it always
>> defaulted to version 31. Maybe someone already upgraded the xenial repository
>> to use version 31 instead of 20.
>>
> Yup that was me..
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Deprecating ext4 support

2016-04-11 Thread Allen Samuels
RIP ext4.


Allen Samuels
Software Architect, Fellow, Systems and Software Solutions 

2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samu...@sandisk.com


> -Original Message-
> From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-
> ow...@vger.kernel.org] On Behalf Of Sage Weil
> Sent: Monday, April 11, 2016 2:40 PM
> To: ceph-de...@vger.kernel.org; ceph-us...@ceph.com; ceph-
> maintain...@ceph.com; ceph-annou...@ceph.com
> Subject: Deprecating ext4 support
> 
> Hi,
> 
> ext4 has never been recommended, but we did test it.  After Jewel is out,
> we would like explicitly recommend *against* ext4 and stop testing it.
> 
> Why:
> 
> Recently we discovered an issue with the long object name handling that is
> not fixable without rewriting a significant chunk of FileStores filename
> handling.  (There is a limit in the amount of xattr data ext4 can store in the
> inode, which causes problems in LFNIndex.)
> 
> We *could* invest a ton of time rewriting this to fix, but it only affects 
> ext4,
> which we never recommended, and we plan to deprecate FileStore once
> BlueStore is stable anyway, so it seems like a waste of time that would be
> better spent elsewhere.
> 
> Also, by dropping ext4 test coverage in ceph-qa-suite, we can significantly
> improve time/coverage for FileStore on XFS and on BlueStore.
> 
> The long file name handling is problematic anytime someone is storing rados
> objects with long names.  The primary user that does this is RGW, which
> means any RGW cluster using ext4 should recreate their OSDs to use XFS.
> Other librados users could be affected too, though, like users with very long
> rbd image names (e.g., > 100 characters), or custom librados users.
> 
> How:
> 
> To make this change as visible as possible, the plan is to make ceph-osd
> refuse to start if the backend is unable to support the configured max
> object name (osd_max_object_name_len).  The OSD will complain that ext4
> cannot store such an object and refuse to start.  A user who is only using
> RBD might decide they don't need long file names to work and can adjust
> the osd_max_object_name_len setting to something small (say, 64) and run
> successfully.  They would be taking a risk, though, because we would like
> to stop testing on ext4.
> 
> Is this reasonable?  If there significant ext4 users that are unwilling to
> recreate their OSDs, now would be the time to speak up.
> 
> Thanks!
> sage
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Deprecating ext4 support

2016-04-11 Thread Jan Schermer
RIP Ceph.


> On 11 Apr 2016, at 23:42, Allen Samuels  wrote:
> 
> RIP ext4.
> 
> 
> Allen Samuels
> Software Architect, Fellow, Systems and Software Solutions 
> 
> 2880 Junction Avenue, San Jose, CA 95134
> T: +1 408 801 7030| M: +1 408 780 6416
> allen.samu...@sandisk.com
> 
> 
>> -Original Message-
>> From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-
>> ow...@vger.kernel.org] On Behalf Of Sage Weil
>> Sent: Monday, April 11, 2016 2:40 PM
>> To: ceph-de...@vger.kernel.org; ceph-us...@ceph.com; ceph-
>> maintain...@ceph.com; ceph-annou...@ceph.com
>> Subject: Deprecating ext4 support
>> 
>> Hi,
>> 
>> ext4 has never been recommended, but we did test it.  After Jewel is out,
>> we would like explicitly recommend *against* ext4 and stop testing it.
>> 
>> Why:
>> 
>> Recently we discovered an issue with the long object name handling that is
>> not fixable without rewriting a significant chunk of FileStores filename
>> handling.  (There is a limit in the amount of xattr data ext4 can store in 
>> the
>> inode, which causes problems in LFNIndex.)
>> 
>> We *could* invest a ton of time rewriting this to fix, but it only affects 
>> ext4,
>> which we never recommended, and we plan to deprecate FileStore once
>> BlueStore is stable anyway, so it seems like a waste of time that would be
>> better spent elsewhere.
>> 
>> Also, by dropping ext4 test coverage in ceph-qa-suite, we can significantly
>> improve time/coverage for FileStore on XFS and on BlueStore.
>> 
>> The long file name handling is problematic anytime someone is storing rados
>> objects with long names.  The primary user that does this is RGW, which
>> means any RGW cluster using ext4 should recreate their OSDs to use XFS.
>> Other librados users could be affected too, though, like users with very long
>> rbd image names (e.g., > 100 characters), or custom librados users.
>> 
>> How:
>> 
>> To make this change as visible as possible, the plan is to make ceph-osd
>> refuse to start if the backend is unable to support the configured max
>> object name (osd_max_object_name_len).  The OSD will complain that ext4
>> cannot store such an object and refuse to start.  A user who is only using
>> RBD might decide they don't need long file names to work and can adjust
>> the osd_max_object_name_len setting to something small (say, 64) and run
>> successfully.  They would be taking a risk, though, because we would like
>> to stop testing on ext4.
>> 
>> Is this reasonable?  If there significant ext4 users that are unwilling to
>> recreate their OSDs, now would be the time to speak up.
>> 
>> Thanks!
>> sage
>> 
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Deprecating ext4 support

2016-04-11 Thread Mark Nelson

On 04/11/2016 04:44 PM, Sage Weil wrote:

On Mon, 11 Apr 2016, Sage Weil wrote:

Hi,

ext4 has never been recommended, but we did test it.  After Jewel is out,
we would like explicitly recommend *against* ext4 and stop testing it.


I should clarify that this is a proposal and solicitation of feedback--we
haven't made any decisions yet.  Now is the time to weigh in.


To add to this on the performance side, we stopped doing regular 
performance testing on ext4 (and btrfs) sometime back around when ICE 
was released to focus specifically on filestore behavior on xfs.  There 
were some cases at the time where ext4 was faster than xfs, but not 
consistently so.  btrfs is often quite fast on fresh fs, but degrades 
quickly due to fragmentation induced by cow with 
small-writes-to-large-object workloads (IE RBD small writes).  If btrfs 
auto-defrag is now safe to use in production it might be worth looking 
at again, but probably not ext4.


Set sail for bluestore!

Mark



sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Deprecating ext4 support

2016-04-11 Thread Sage Weil
On Mon, 11 Apr 2016, Sage Weil wrote:
> Hi,
> 
> ext4 has never been recommended, but we did test it.  After Jewel is out, 
> we would like explicitly recommend *against* ext4 and stop testing it.

I should clarify that this is a proposal and solicitation of feedback--we 
haven't made any decisions yet.  Now is the time to weigh in.

sage
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Deprecating ext4 support

2016-04-11 Thread Michael Hanscho
Hi!

How about these findings?

https://events.linuxfoundation.org/sites/events/files/slides/AFL%20filesystem%20fuzzing%2C%20Vault%202016.pdf

Ext4 seems to be the one file system tested best... (although xfs
survived also quite long...)

Gruesse
Michael

On 2016-04-11 23:44, Sage Weil wrote:
> On Mon, 11 Apr 2016, Sage Weil wrote:
>> Hi,
>>
>> ext4 has never been recommended, but we did test it.  After Jewel is out, 
>> we would like explicitly recommend *against* ext4 and stop testing it.
> 
> I should clarify that this is a proposal and solicitation of feedback--we 
> haven't made any decisions yet.  Now is the time to weigh in.
> 
> sage
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Deprecating ext4 support

2016-04-11 Thread Sage Weil
Hi,

ext4 has never been recommended, but we did test it.  After Jewel is out, 
we would like to explicitly recommend *against* ext4 and stop testing it.

Why:

Recently we discovered an issue with the long object name handling that is 
not fixable without rewriting a significant chunk of FileStore's filename 
handling.  (There is a limit in the amount of xattr data ext4 can store in 
the inode, which causes problems in LFNIndex.)

We *could* invest a ton of time rewriting this to fix, but it only affects 
ext4, which we never recommended, and we plan to deprecate FileStore once 
BlueStore is stable anyway, so it seems like a waste of time that would be 
better spent elsewhere.

Also, by dropping ext4 test coverage in ceph-qa-suite, we can 
significantly improve time/coverage for FileStore on XFS and on BlueStore.

The long file name handling is problematic anytime someone is storing 
rados objects with long names.  The primary user that does this is RGW, 
which means any RGW cluster using ext4 should recreate their OSDs to use 
XFS.  Other librados users could be affected too, though, like users 
with very long rbd image names (e.g., > 100 characters), or custom 
librados users.

How:

To make this change as visible as possible, the plan is to make ceph-osd 
refuse to start if the backend is unable to support the configured max 
object name (osd_max_object_name_len).  The OSD will complain that ext4 
cannot store such an object and refuse to start.  A user who is only using 
RBD might decide they don't need long file names to work and can adjust 
the osd_max_object_name_len setting to something small (say, 64) and run 
successfully.  They would be taking a risk, though, because we would like 
to stop testing on ext4.
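Concretely, that override would be a ceph.conf entry along these lines (64 
being the example value above, not a recommendation):

  [osd]
  osd max object name len = 64
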

Is this reasonable?  If there are significant ext4 users who are unwilling to 
recreate their OSDs, now would be the time to speak up.

Thanks!
sage

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] mons die with mon/OSDMonitor.cc: 125: FAILED assert(version >= osdmap.epoch)...

2016-04-11 Thread Eric Hall
A power failure in the data center has left 3 mons unable to start with 
mon/OSDMonitor.cc: 125: FAILED assert(version >= osdmap.epoch)


I have found a similar problem discussed at 
http://irclogs.ceph.widodh.nl/index.php?date=2015-05-29, but am unsure 
how to proceed.


If I read
ceph-kvstore-tool /var/lib/ceph/mon/ceph-cephsecurestore1/store.db list
correctly, they believe osdmap is 1, but they also have 
osdmap:full_38456 and osdmap:38630 in the store.


Working from the http://irclogs info, something like
ceph-kvstore-tool /var/lib/ceph/mon/ceph-foo/store.db set osdmap N in /tmp/osdmap
might help, but I am unsure of the value for N.  Seems like too delicate 
an operation for experimentation.
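
For what it's worth, a read-only look at the store along these lines (assuming 
the list/get sub-commands take arguments the same way as the set above; the 
epoch numbers are the ones from my listing) seems like a safer first step than 
any set:

  ceph-kvstore-tool /var/lib/ceph/mon/ceph-cephsecurestore1/store.db list osdmap
  ceph-kvstore-tool /var/lib/ceph/mon/ceph-cephsecurestore1/store.db get osdmap full_38456 out /tmp/osdmap
  osdmaptool --print /tmp/osdmap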



OS: Ubuntu 14.04.4
kernel: 3.13.0-83-generic
ceph: Firefly 0.80.11-1trusty

Any assistance appreciated,
--
Eric Hall
Institute for Software Integrated Systems
Vanderbilt University



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Deprecating ext4 support

2016-04-11 Thread Shinobu Kinjo
Just to clarify to prevent any confusion.

Honestly I've never used ext4 as an underlying filesystem for the Ceph cluster, 
but according to the wiki [1], ext4 is recommended -;

[1] https://en.wikipedia.org/wiki/Ceph_%28software%29

Shinobu

- Original Message -
From: "Mark Nelson" 
To: "Sage Weil" , ceph-de...@vger.kernel.org, 
ceph-us...@ceph.com, ceph-maintain...@ceph.com, ceph-annou...@ceph.com
Sent: Tuesday, April 12, 2016 6:57:16 AM
Subject: Re: [ceph-users] Deprecating ext4 support

On 04/11/2016 04:44 PM, Sage Weil wrote:
> On Mon, 11 Apr 2016, Sage Weil wrote:
>> Hi,
>>
>> ext4 has never been recommended, but we did test it.  After Jewel is out,
>> we would like explicitly recommend *against* ext4 and stop testing it.
>
> I should clarify that this is a proposal and solicitation of feedback--we
> haven't made any decisions yet.  Now is the time to weigh in.

To add to this on the performance side, we stopped doing regular 
performance testing on ext4 (and btrfs) sometime back around when ICE 
was released to focus specifically on filestore behavior on xfs.  There 
were some cases at the time where ext4 was faster than xfs, but not 
consistently so.  btrfs is often quite fast on fresh fs, but degrades 
quickly due to fragmentation induced by cow with 
small-writes-to-large-object workloads (IE RBD small writes).  If btrfs 
auto-defrag is now safe to use in production it might be worth looking 
at again, but probably not ext4.

Set sail for bluestore!

Mark

>
> sage
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Deprecating ext4 support

2016-04-11 Thread Lionel Bouton
Hi,

Le 11/04/2016 23:57, Mark Nelson a écrit :
> [...]
> To add to this on the performance side, we stopped doing regular
> performance testing on ext4 (and btrfs) sometime back around when ICE
> was released to focus specifically on filestore behavior on xfs. 
> There were some cases at the time where ext4 was faster than xfs, but
> not consistently so.  btrfs is often quite fast on fresh fs, but
> degrades quickly due to fragmentation induced by cow with
> small-writes-to-large-object workloads (IE RBD small writes).  If
> btrfs auto-defrag is now safe to use in production it might be worth
> looking at again, but probably not ext4.

For BTRFS, autodefrag is probably not performance-safe (yet), at least
with RBD access patterns. At least it wasn't in 4.1.9 when we tested it
last time (the performance degraded slowly but surely over several weeks
from an initially good performing filesystem to the point where we
measured a 100% increase in average latencies and large spikes and
stopped the experiment). I didn't see any patches on linux-btrfs since
then (it might have benefited from other modifications, but the
autodefrag algorithm wasn't reworked itself AFAIK).
That's not an inherent problem of BTRFS but of the autodefrag
implementation though. Deactivating autodefrag and reimplementing a
basic, cautious defragmentation scheduler gave us noticeably better
latencies with BTRFS vs XFS (~30% better) on the same hardware and
workload long term (as in almost a year and countless full-disk rewrites
on the same filesystems due to both normal writes and rebalancing with 3
to 4 months of XFS and BTRFS OSDs coexisting for comparison purposes).
I'll certainly remount a subset of our OSDs autodefrag as I did with
4.1.9 when we will deploy 4.4.x or a later LTS kernel. So I might have
more up to date information in the coming months. I don't plan to
compare BTRFS to XFS anymore though : XFS only saves us from running our
defragmentation scheduler, BTRFS is far more suited to our workload and
we've seen constant improvements in behavior in the (arguably bumpy
until late 3.19 versions) 3.16.x to 4.1.x road.

Other things:

* If the journal is not on a separate partition (SSD), it should
definitely be re-created NoCoW to avoid unnecessary fragmentation. From
memory: stop OSD, touch journal.new, chattr +C journal.new, dd
if=journal of=journal.new (your dd options here for best perf/least
amount of cache eviction), rm journal, mv journal.new journal, start OSD
again (see the command sketch after this list).
* filestore btrfs snap = false
  is mandatory if you want consistent performance (at least on HDDs). It
may not be felt with almost empty OSDs but performance hiccups appear if
any non trivial amount of data is added to the filesystems.
  IIRC, after debugging surprisingly the snapshot creation didn't seem
to be the actual cause of the performance problems but the snapshot
deletion... It's so bad that the default should probably be false and
not true.
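
A command-level sketch of the journal re-creation above, assuming OSD id 12, a
journal file in the default location and sysvinit-style scripts (adapt the id,
paths and dd options to your setup):

  service ceph stop osd.12
  cd /var/lib/ceph/osd/ceph-12
  touch journal.new
  chattr +C journal.new        # NoCoW must be set while the file is still empty
  dd if=journal of=journal.new bs=1M
  rm journal
  mv journal.new journal
  service ceph start osd.12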

Lionel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Deprecating ext4 support

2016-04-11 Thread Christian Balzer

Hello,

What a lovely missive to start off my working day...

On Mon, 11 Apr 2016 17:39:37 -0400 (EDT) Sage Weil wrote:

> Hi,
> 
> ext4 has never been recommended, but we did test it.  
Patently wrong, as Shinobu just pointed out.

Ext4 never was (especially recently) flogged as much as XFS, but it always
was a recommended, supported filestore filesystem, unlike the
experimental BTRFS or ZFS.
And for various reasons people, including me, deployed it instead of XFS.

> After Jewel is
> out, we would like explicitly recommend *against* ext4 and stop testing
> it.
> 
Changing your recommendations is fine, stopping testing/supporting it
isn't. 
People deployed Ext4 in good faith and can be expected to use it at least
until their HW is up for replacement (4-5 years).

> Why:
> 
> Recently we discovered an issue with the long object name handling that
> is not fixable without rewriting a significant chunk of FileStores
> filename handling.  (There is a limit in the amount of xattr data ext4
> can store in the inode, which causes problems in LFNIndex.)
> 
Is that also true if the Ext4 inode size is larger than default?

> We *could* invest a ton of time rewriting this to fix, but it only
> affects ext4, which we never recommended, and we plan to deprecate
> FileStore once BlueStore is stable anyway, so it seems like a waste of
> time that would be better spent elsewhere.
> 
If you (that is, RH) are going to declare BlueStore stable this year, I
would be very surprised.
Either way, dropping support before the successor is truly ready doesn't
sit well with me.

Which brings me to the reasons why people would want to migrate (NOT
talking about starting freshly) to bluestore.

1. Will it be faster (IOPS) than filestore with SSD journals? 
Don't think so, but feel free to prove me wrong.

2. Will it be bit-rot proof? Note the deafening silence from the devs in
this thread: 
http://www.spinics.net/lists/ceph-users/msg26510.html

> Also, by dropping ext4 test coverage in ceph-qa-suite, we can 
> significantly improve time/coverage for FileStore on XFS and on
> BlueStore.
> 
Really, isn't that fully automated?

> The long file name handling is problematic anytime someone is storing 
> rados objects with long names.  The primary user that does this is RGW, 
> which means any RGW cluster using ext4 should recreate their OSDs to use 
> XFS.  Other librados users could be affected too, though, like users 
> with very long rbd image names (e.g., > 100 characters), or custom 
> librados users.
> 
> How:
> 
> To make this change as visible as possible, the plan is to make ceph-osd 
> refuse to start if the backend is unable to support the configured max 
> object name (osd_max_object_name_len).  The OSD will complain that ext4 
> cannot store such an object and refuse to start.  A user who is only
> using RBD might decide they don't need long file names to work and can
> adjust the osd_max_object_name_len setting to something small (say, 64)
> and run successfully.  They would be taking a risk, though, because we
> would like to stop testing on ext4.
> 
> Is this reasonable?  
About as reasonable as dropping format 1 support, which is to say, not at all.
https://www.mail-archive.com/ceph-users@lists.ceph.com/msg28070.html

I'm officially only allowed to do (preventative) maintenance during weekend
nights on our main production cluster. 
That would mean 13 ruined weekends at the realistic rate of 1 OSD per
night, so you can see where my lack of enthusiasm for OSD recreation comes
from.

> If there significant ext4 users that are unwilling
> to recreate their OSDs, now would be the time to speak up.
> 
Consider that done.

Christian
-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Deprecating ext4 support

2016-04-11 Thread Lindsay Mathieson

On 12/04/2016 9:09 AM, Lionel Bouton wrote:

* If the journal is not on a separate partition (SSD), it should
definitely be re-created NoCoW to avoid unnecessary fragmentation. From
memory : stop OSD, touch journal.new, chattr +C journal.new, dd
if=journal of=journal.new (your dd options here for best perf/least
amount of cache eviction), rm journal, mv journal.new journal, start OSD
again.


Flush the journal after stopping the OSD !
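
A minimal sketch of that, assuming OSD id 12 and sysvinit-style scripts:

  service ceph stop osd.12
  ceph-osd -i 12 --flush-journal   # writes out anything still sitting in the journal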

--
Lindsay Mathieson

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Deprecating ext4 support

2016-04-11 Thread Lionel Bouton
Le 12/04/2016 01:40, Lindsay Mathieson a écrit :
> On 12/04/2016 9:09 AM, Lionel Bouton wrote:
>> * If the journal is not on a separate partition (SSD), it should
>> definitely be re-created NoCoW to avoid unnecessary fragmentation. From
>> memory : stop OSD, touch journal.new, chattr +C journal.new, dd
>> if=journal of=journal.new (your dd options here for best perf/least
>> amount of cache eviction), rm journal, mv journal.new journal, start OSD
>> again.
>
> Flush the journal after stopping the OSD !
>

No need to: dd makes an exact duplicate.

Lionel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph striping

2016-04-11 Thread Christian Balzer
On Mon, 11 Apr 2016 09:25:35 -0400 (EDT) Jason Dillaman wrote:

> In general, RBD "fancy" striping can help under certain workloads where
> small IO would normally be hitting the same object (e.g. small
> sequential IO). 
> 

While the above is very true (especially for single/few clients), I never
bothered to deploy fancy striping because you have to plan it very
carefully, as you can't change it later on.

For example, if you start with 8 OSDs and set your striping accordingly (as
Alwin's example suggested) but later add more OSDs, you won't be taking
full advantage of the IOPS available.
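
As a concrete illustration (hypothetical image name and sizes; fancy striping
requires image format 2, and the stripe unit/count cannot be changed after
creation):

  rbd create rbd/vm-disk-1 --size 102400 --image-format 2 \
      --stripe-unit 65536 --stripe-count 8
  rbd info rbd/vm-disk-1    # shows the stripe unit/count baked into the image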

Christian
-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Deprecating ext4 support

2016-04-11 Thread Robin H. Johnson
On Mon, Apr 11, 2016 at 06:49:09PM -0400,  Shinobu Kinjo wrote:
> Just to clarify to prevent any confusion.
> 
> Honestly I've never used ext4 as underlying filesystem for the Ceph cluster, 
> but according to wiki [1], ext4 is recommended -;
> 
> [1] https://en.wikipedia.org/wiki/Ceph_%28software%29
Clearly somebody made a copy&paste error from the actual documentation.

Here's the docs on master and the recent LTS releases.
http://docs.ceph.com/docs/firefly/rados/configuration/filesystem-recommendations/
http://docs.ceph.com/docs/hammer/rados/configuration/filesystem-recommendations/
http://docs.ceph.com/docs/master2/rados/configuration/filesystem-recommendations/

The documentation has NEVER recommended ext4.
Here's a slice of all history for that file:
http://dev.gentoo.org/~robbat2/ceph-history-of-filesystem-recommendations.patch

Generated with 
$ git log -C -C -M -p ceph/master -- \
doc/rados/configuration/filesystem-recommendations.rst \
doc/config-cluster/file-system-recommendations.rst \
doc/config-cluster/file_system_recommendations.rst

The very first version, back in 2012, said:
> ``ext4`` is a poor file system choice if you intend to deploy the
> RADOS Gateway or use snapshots on versions earlier than 0.45. 


-- 
Robin Hugh Johnson
Gentoo Linux: Developer, Infrastructure Lead, Foundation Trustee
E-Mail : robb...@gentoo.org
GnuPG FP   : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Thoughts on proposed hardware configuration.

2016-04-11 Thread Brad Smith
We're looking at implementing a 200+TB, 3 OSD-node Ceph cluster to be 
accessed as a filesystem from research compute clusters and "data 
transfer nodes" (from the Science DMZ network model). 
The goal is a first step to exploring what we can expect from Ceph in 
this kind of role...


Comments on the following configuration would be greatly appreciated!

Brad
b...@soe.ucsc.edu

##

> 1x Blade server - 4 server nodes in a 2U form factor:
> - 1x Ceph admin/Ceph monitor node
> - 2x Ceph monitor/Ceph metadata server node

1 2U Four Node Server
6028TP-HTR

Mercury RM212Q 2U Quad-Node Server:
1x Ceph Admin/Ceph Monitor Node:
2x Intel Xeon E5-2620v3 Six-Core CPUs
32GB DDR4 ECC/REG memory
2x 512GB SSD drives; Samsung 850 Pro
2x 10GbE DA/SFP+ ports

2x Ceph Monitor/Ceph Metadata Nodes:
2x Intel Xeon E5-2630v3 Eight-Core CPUs
64GB DDR4 ECC/REG memory
2x 512GB SSD drives; Samsung 850 Pro
1x 64GB SATA DOM
2x 10GbE DA/SFP+ ports

Four Hot-Pluggable Systems (Nodes) in a 2U Form Factor. Each Node 
Supports the Following:
Dual Socket R (LGA 2011); Supports Intel Xeon Processor E5-2600v3 Family;
QPI up to 9.6GT/s; Up to 1TB ECC LRDIMM, 512GB ECC RDIMM, up to 2133MHz;
Sixteen DIMM Sockets

One PCI-E 3.0 x16 Low-Profile Slot;
One "0Slot" (x16)
Intel i350-AM2 Dual Port GbE LAN
Integrated IPMI 2.0 with KVM and Dedicated LAN
Three 3.5 Inch Hot-Swap SATA HDD Bays
2000W Redundant Power Supplies Platinum Level (94%)

> 3 Ceph OSD servers (70+TB each):

Quanta 1U 12-drive storage server
D51PH-1ULH

Mercury RM112 1U Rackmount Server:
2x Intel Xeon E5-2630v3 processors
64GB DDR4 ECC/REG memory
1x 64GB SATA DOM
2x 200GB Intel DC S3710 SSDs
12x 6TB NL SAS drives
1x dual-port 10GbE DA/SFP+ OCP network card

General System Specifications:
Dual Intel Xeon Processor E5-2600v3 Product Family
Intel C610 Chipset
Sixteen 2133MHz DDR4 RDIMM Memory
Twelve 3.5 Inch/2.5 Inch Hot-Plug 12Gb/s SAS or 6Gb/s SATA HDD
Four 2.5 Inch Hot-Plug 7mm 6Gb/s SATA Solid State Drive
Quanta LSI 3008 12Gb/s SAS Mezzanine, RAID 0, 1, 10
Intel I350 1GbE Dual-Ports
One Dedicated 1GbE Management Port
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] adding cache tier in productive hammer environment

2016-04-11 Thread Christian Balzer

Hello,

On Mon, 11 Apr 2016 22:45:00 +0200 Oliver Dzombic wrote:

> Hi,
> 
> currently in use:
> 
> oldest:
> 
> SSDs: Intel S3510 80GB
Ouch.
As in, not a speed wonder at 110MB/s writes (or 2 HDDs worth), but at
least suitable as a journal when it comes to sync writes.
But at 45TBW dangerously low in the endurance department, I'd check
their wear-out constantly! See the recent thread:
https://www.mail-archive.com/ceph-users@lists.ceph.com/msg28083.html
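
A quick way to keep an eye on that from the OS is smartctl; the attribute
names vary by vendor (Intel DC drives expose Media_Wearout_Indicator and
Host_Writes), so adjust the pattern to whatever your drives report:

  smartctl -A /dev/sdX | egrep -i 'wearout|wear_level|host_writes'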

> HDD: HGST 6TB H3IKNAS600012872SE NAS
HGST should be fine.

> 
> latest:
> 
> SSDs: Kingston 120 GB SV300
Don't know them, so no idea if they are suitable when it comes to sync
writes, but at 64TBW also in danger of expiring rather quickly.

> HDDs: HGST 3TB H3IKNAS30003272SE NAS
> 
> in future will be in use:
> 
> SSDs: Samsung SM863 240 GB
Those should be both suitable in the sync write and endurance department,
alas haven't tested them myself.

> HDDs: HGST 3TB H3IKNAS30003272SE NAS and/or
> Seagate ST2000NM0023 2 TB
> 
> 
> -
> 
> It's hard to say if, and at what rate, the newer nodes will fail with the
> OSDs getting down/out, compared to the old ones.
> 
> We did a lot to avoid that.
> 
> Without having real numbers, my feeling is/was that the newer nodes will
> fail far less often. But what is responsible for that is
> unknown.
> 
> In the end, the old nodes, with 2x 2.3 GHz Intel Celeron (2x cores
> without HT) and 3x 6 TB HDD, have much less CPU power per HDD compared
> to the 4x 3.3 GHz Intel E3-1225v5 CPU (4 cores) with 10x 3 TB HDD.
> 
Yes, I'd suspect CPU exhaustion mostly here, aside from the IO overload.

On my massively underpowered test cluster I've been able to create OSD/MON
failures from exhausting CPU or RAM, on my production clusters never.

> So it's just too different: CPU, HDD, RAM, even the HDD controller.
> 
> I will have to make sure that the new cluster has enough hardware
> so that I don't need to consider possible problems there.
> 
> --
> 
> atop: sda/sdb == SSD journal
>
Since there are 12 disk, I presume those are Kingston ones.
Frankly I wouldn't expect 10+ms waits from SSDs, but then again they are
90%ish busy when doing only 500IOPS and writing 1.5MB/s.
This indicates to me that they are NOT handling sync writes gracefully and
are not suitable as Ceph journals.
 
> --
> 
> That was my first experience too. At the very start, deep scrubs and even
> normal scrubs were driving the %WA and busyness of the HDDs to 100%, flat.
> 
> --
> 
> I rechecked it with munin.
> 
> The journal SSDs go from ~40% up to 80-90% during deep scrub.
I have no explanation for this, as deep-scrubbing introduces no writes.

> The HDDs go from ~20% up to 90-100%, flat more or less, during
> deep scrub.
> 
That's to be expected, again the sleep factor can reduce the impact for
client I/O immensely.
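
For reference, the knobs involved look something like this in ceph.conf
(values purely illustrative; osd scrub sleep trades scrub duration for client
latency, and the begin/end hours confine scrub starts to a quiet window,
assuming your Hammer build already has them):

  [osd]
  osd scrub sleep = 0.1
  osd scrub begin hour = 1
  osd scrub end hour = 6
  osd deep scrub interval = 1209600   # two weeks instead of the weekly default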

Christian
> At the same time, the load average goes to 16-20 (4 cores)
> while the CPU sees up to 318% idle waiting time (out of a max. of 400%).
> 
> --
> 
> The OSDs receive a peer timeout, which is just understandable if the
> system sees 300% idle waiting time for long enough.
> 
> 
> --
> 
> And yes, as it seems, clusters which are very busy, especially with
> low hardware resources, need much more than the standard config
> can/will deliver. As soon as the LTS is out I will have to start busting
> my head with the available config parameters.
> 


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Mon placement over wide area

2016-04-11 Thread Adrian Saul

We are close to being given approval to deploy a 3.5PB Ceph cluster that will 
be distributed over every major capital in Australia.  The config will be 
dual sites in each city that will be coupled as HA pairs - 12 sites in total.   
The vast majority of CRUSH rules will place data either locally to the 
individual site, or replicated to the other HA site in that city.   However 
there are future use cases where I think we could use EC to distribute data 
wider or have some replication that puts small data sets across multiple 
cities.   All of this will be tied together with a dedicated private IP network.
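
For illustration, a local-site rule of that kind would presumably look roughly
like this in a decompiled CRUSH map (bucket and rule names are hypothetical):

  rule sydney_dc1_local {
          ruleset 10
          type replicated
          min_size 1
          max_size 10
          step take sydney-dc1
          step chooseleaf firstn 0 type host
          step emit
  }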

The concern I have is around the placement of mons.  In the current design 
there would be two monitors in each site, running separate to the OSDs as part 
of some hosts acting as RBD to iSCSI/NFS gateways.   There will also be a 
"tiebreaker" mon placed on a separate host which will house some management 
infrastructure for the whole platform.

Obviously a concern is latency - the east coast to west coast latency is around 
50ms, and on the east coast it is 12ms between Sydney and the other two sites, 
and 24ms Melbourne to Brisbane.  Most of the data traffic will remain local but 
if we create a single national cluster then how much of an impact will it be 
having all the mons needing to keep in sync, as well as monitor and communicate 
with all OSDs (in the end goal design there will be some 2300+ OSDs).

The other options I  am considering:
- split into east and west coast clusters, most of the cross city need is in 
the east coast, any data moves between clusters can be done with snap 
replication
- city based clusters (tightest latency) but loose the multi-DC EC option, do 
cross city replication using snapshots

Just want to get a feel for what I need to consider when we start building at 
this scale.

Cheers,
 Adrian






Confidentiality: This email and any attachments are confidential and may be 
subject to copyright, legal or some other professional privilege. They are 
intended solely for the attention and use of the named addressee(s). They may 
only be copied, distributed or disclosed with the consent of the copyright 
owner. If you have received this email by mistake or by breach of the 
confidentiality clause, please notify the sender immediately by return email 
and delete or destroy all copies of the email. Any confidentiality, privilege 
or copyright is not waived or lost because this email has been sent to you by 
mistake.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Thoughts on proposed hardware configuration.

2016-04-11 Thread Christian Balzer

Hello,

On Mon, 11 Apr 2016 16:57:40 -0700 Brad Smith wrote:

> We're looking at implementing a 200+TB, 3 OSD-node Ceph cluster to be 
That's 72TB in your setup below, and 3 nodes are of course the bare
minimum; they're going to perform WORSE than an identical, single,
non-replicated node (latencies).
Once you grow the node number beyond your replication size (default 3),
things will speed up.

> accessed as a filesystem from research compute clusters and "data 
> transfer nodes" (from the Science DMZ network model... link 
> ).
>  
> The goal is a first step to exploring what we can expect from Ceph in 
> this kind of roll...
> 
> Comments on the following configuration would be greatly appreciated!
> 
> Brad
> b...@soe.ucsc.edu
> 
> ##
> 
>  > 1x Blade server - 4 server nodes in a 2U form factor:
>  > - 1x Ceph admin/Ceph monitor node
>  > - 2x Ceph monitor/Ceph metadata server node
> 
> 1 2U Four Node Server
> 6028TP-HTR
> 
> Mercury RM212Q 2U Quad-NodeServer:
> 1x Ceph Admin/Ceph Monitor Node:
> 2x Intel Xeon E5-2620v3 Six-CoreCPU's
Good enough, might be even better with less but faster cores.
Remember to give this node the lowest IP address to become the MON leader.

> 32GB's DDR4 ECC/REG memory
Depends on what kind of monitoring you're going to do there, but my
primary MON also runs graphite/apache and isn't using even 25% of the 16GB
RAM it has.
So definitely good enough.

> 2x 512GB SSD drives; Samsung 850 Pro
If you can afford it, use Samsung or Intel DC drives, simply so you'll
never have to worry about either performance or endurance.
That said, they should be good enough.

> 2x 10GbE DA/SFP+ ports
> 
> 2x Ceph Monitor/Ceph MetaData Nodes
> 2x Intel Xeon E5-2630v3 Eight-Core CPU's
> 64GB's DDR4 ECC/REG memory
Probably better with even more memory, given what people said in the very
recent "800TB - Ceph Physical Architecture Proposal" thread, read it.

> 2x 512GB SSD drives; Samsung 850 Pro
> 1x 64GB SATAdom
> 2x 10GbE DA/SFP+ ports
> 

[snip]
> 
>  > 3 Ceph OSD servers (70+TB each):
> 
> Quanta 1U 12-drive storage server
I'd stay with one vendor (Supermicro preferably), but that's me.

> D51PH-1ULH
> 
> Mercury RM112 1U Rackmount Server:
> 2x Intel Xeon E5-2630v3 procesors
> 64GB's DDR4 ECC/REG memory
Enough, but more RAM can be very beneficial when it comes to reads, both to
keep hot objects in the pagecache and inodes/etc in the SLAB space.

> 1x64GB SATAdom
That's for your OS one presumes, I'd hate having to shut down the server
to replace it and/or to then re-install things. 

> 2x 200GB Intel DC S3710 SSD's
If those were the sadly discontinued S3700s at 365MB/s write speed you'd
be only slightly below your estimated HDD speed of 840MB/s combined and
your network speed of 1GB/s.
I'd look into the 400GB model OR if you're happy with 3DWPD the 3610
model(s).

> 12x 6TB NL SAS drives
> 1x dual port 10 Gb EDA/SFP+ OCP network card
> 

Christian
-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Deprecating ext4 support

2016-04-11 Thread Sage Weil
On Tue, 12 Apr 2016, Christian Balzer wrote:
> 
> Hello,
> 
> What a lovely missive to start off my working day...
> 
> On Mon, 11 Apr 2016 17:39:37 -0400 (EDT) Sage Weil wrote:
> 
> > Hi,
> > 
> > ext4 has never been recommended, but we did test it.  
> Patently wrong, as Shinobu just pointed.
> 
> Ext4 never was (especially recently) flogged as much as XFS, but it always
> was a recommended, supported filestorage filesystem, unlike the
> experimental BTRFS of ZFS. 
> And for various reasons people, including me, deployed it instead of XFS.

Greg definitely wins the prize for raising this as a major issue, then 
(and for naming you as one of the major ext4 users).

I was not aware that we were recommending ext4 anywhere.  FWIW, here's 
what the docs currently say:

 Ceph OSD Daemons rely heavily upon the stability and performance of the 
 underlying filesystem.

 Note: We currently recommend XFS for production deployments. We recommend 
 btrfs for testing, development, and any non-critical deployments. We 
 believe that btrfs has the correct feature set and roadmap to serve Ceph 
 in the long-term, but XFS and ext4 provide the necessary stability for 
 today’s deployments. btrfs development is proceeding rapidly: users should 
 be comfortable installing the latest released upstream kernels and be able 
 to track development activity for critical bug fixes.

 Ceph OSD Daemons depend on the Extended Attributes (XATTRs) of the 
 underlying file system for various forms of internal object state and 
 metadata. The underlying filesystem must provide sufficient capacity for 
 XATTRs. btrfs does not bound the total xattr metadata stored with a file. 
 XFS has a relatively large limit (64 KB) that most deployments won’t 
 encounter, but the ext4 is too small to be usable.

(http://docs.ceph.com/docs/master/rados/configuration/filesystem-recommendations/?highlight=ext4)

Unfortunately that second paragraph, second sentence indirectly says ext4 
is stable.  :( :(  I'll prepare a PR tomorrow to revise this whole section 
based on the new information.

If anyone knows of other docs that recommend ext4, please let me know!  
They need to be updated.

> > After Jewel is out, we would like explicitly recommend *against* ext4 
> > and stop testing it.
> > 
> Changing your recommendations is fine, stopping testing/supporting it
> isn't. 
> People deployed Ext4 in good faith and can be expected to use it at least
> until their HW is up for replacement (4-5 years).

I agree, which is why I asked.

And part of it depends on what it's being used for.  If there are major 
users using ext4 for RGW then their deployments are at risk and they 
should swap it out for data safety reasons alone.  (Or, we need to figure 
out how to fix long object name support on ext4.)  On the other hand, if 
the only ext4 users are using RBD only, then they can safely continue with 
lower max object names, and upstream testing is important to let those 
OSDs age out naturally.

Does your cluster support RBD, RGW, or something else?

> > Why:
> > 
> > Recently we discovered an issue with the long object name handling that
> > is not fixable without rewriting a significant chunk of FileStores
> > filename handling.  (There is a limit in the amount of xattr data ext4
> > can store in the inode, which causes problems in LFNIndex.)
> > 
> Is that also true if the Ext4 inode size is larger than default?

I'm not sure... Sam, do you know?  (It's somewhat academic, though, since 
we can't change the inode size on existing file systems.)
 
> > We *could* invest a ton of time rewriting this to fix, but it only
> > affects ext4, which we never recommended, and we plan to deprecate
> > FileStore once BlueStore is stable anyway, so it seems like a waste of
> > time that would be better spent elsewhere.
> > 
> If you (that is RH) is going to declare bluestore stable this year, I
> would be very surprised.

My hope is that it can be the *default* for L (next spring).  But we'll 
see.

> Either way, dropping support before the successor is truly ready doesn't
> sit well with me.

Yeah, I misspoke.  Once BlueStore is supported and the default, support 
for FileStore won't be dropped immediately.  But we'll want to communicate 
that eventually it will lose support.  How strongly that is messaged 
probably depends on how confident we are in BlueStore at that point.  And 
I confess I haven't thought much about how long "long enough" is yet.

> Which brings me to the reasons why people would want to migrate (NOT
> talking about starting freshly) to bluestore.
> 
> 1. Will it be faster (IOPS) than filestore with SSD journals? 
> Don't think so, but feel free to prove me wrong.

It will absolutely be faster on the same hardware.  Whether BlueStore on HDD 
only is faster than FileStore HDD + SSD journal will depend on the 
workload.

> 2. Will it be bit-rot proof? Note the deafening silence from the devs in
> this thread: 
> http://www.spinics.net/lists/ceph-users/msg26510.h

Re: [ceph-users] Mon placement over wide area

2016-04-11 Thread Christian Balzer

Hello (again),

On Tue, 12 Apr 2016 00:46:29 + Adrian Saul wrote:

> 
> We are close to being given approval to deploy a 3.5PB Ceph cluster that
> will be distributed over every major capital in Australia.The config
> will be dual sites in each city that will be coupled as HA pairs - 12
> sites in total.   The vast majority of CRUSH rules will place data
> either locally to the individual site, or replicated to the other HA
> site in that city.   However there are future use cases where I think we
> could use EC to distribute data wider or have some replication that puts
> small data sets across multiple cities.   
This will very, very, VERY much depend on the data (use case) in question.

>All of this will be tied
> together with a dedicated private IP network.
> 
> The concern I have is around the placement of mons.  In the current
> design there would be two monitors in each site, running separate to the
> OSDs as part of some hosts acting as RBD to iSCSI/NFS gateways.   There
> will also be a "tiebreaker" mon placed on a separate host which will
> house some management infrastructure for the whole platform.
> 
Yes, that's the preferable way; you might want to up this to 5 mons so you can
lose one while doing maintenance on another one.
But if that were a coupled, national cluster you're looking both at
significant MON traffic, interesting "split-brain" scenarios and latencies
as well (MONs get chosen randomly by clients AFAIK).

> Obviously a concern is latency - the east coast to west coast latency is
> around 50ms, and on the east coast it is 12ms between Sydney and the
> other two sites, and 24ms Melbourne to Brisbane.  
In any situation other than "write speed doesn't matter at all" combined
with "large writes, not small ones" and "read-mostly" you're going to be in
severe pain.

> Most of the data
> traffic will remain local but if we create a single national cluster
> then how much of an impact will it be having all the mons needing to
> keep in sync, as well as monitor and communicate with all OSDs (in the
> end goal design there will be some 2300+ OSDs).
> 
Significant. 
I wouldn't suggest it, but even if you deploy differently I'd suggest a
test run/setup and sharing the experience with us. ^.^

> The other options I  am considering:
> - split into east and west coast clusters, most of the cross city need
> is in the east coast, any data moves between clusters can be done with
> snap replication
> - city based clusters (tightest latency) but loose the multi-DC EC
> option, do cross city replication using snapshots
> 
The latter; I seem to remember that there was work in progress to do this
(snapshot replication) in an automated fashion.

> Just want to get a feel for what I need to consider when we start
> building at this scale.
> 
I know you're set on iSCSI/NFS (have you worked out the iSCSI kinks?), but
the only well known/supported way to do geo-replication with Ceph is via
RGW.

Christian
-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [ceph-mds] mds service can not start after shutdown in 10.1.0

2016-04-11 Thread 施柏安
Hi John,

You are right, the id is not '0'.
I checked the status of the MDS with the command 'ceph mds dump'. It is not
showing much info for the MDS servers.
Is there any command that can check the MDS list or health easily?

Thanks for your help.


...
vagrant@mds-1:~$ sudo ls /var/lib/ceph/mds
ceph-mds-1
vagrant@mds-1:~$ sudo service ceph-mds start id=ceph-mds-1
ceph-mds stop/waiting
vagrant@mds-1:~$ sudo service ceph-mds start id=mds-1


ceph-mds (ceph/mds-1) start/running, process 6809
vagrant@mds-1:~$
vagrant@mds-1:~$ ls
vagrant@mds-1:~$ ps aux | grep mds
ceph  6809  0.0  1.7 338752  8792 ?Ssl  01:11   0:00
/usr/bin/ceph-mds --cluster=ceph -i mds-1 -f --setuser ceph --setgroup ceph
vagrant   6830  0.0  0.1  10432   628 pts/0S+   01:11   0:00 grep
--color=auto mds
...

2016-04-11 19:12 GMT+08:00 John Spray :

> Is the ID of the MDS service really "0"?  Usually people set the ID to the
> hostname.  Check it in /var/lib/ceph/mds
>
> John
>
> On Mon, Apr 11, 2016 at 9:44 AM, 施柏安  wrote:
>
>> Hi cephers,
>>
>> I was testing CephFS's HA, so I shut down the active MDS server.
>> Then one of the standby MDSes turned active. Everything seemed to work
>> properly.
>> But when I booted the MDS server which was shut down in the test, it couldn't
>> join the cluster automatically.
>> And I used the command 'sudo service ceph-mds start id=0'. It can't start and
>> just shows 'ceph-mds stop/waiting'.
>>
>> Is that a bug or did I do a wrong operation?
>>
>> --
>>
>> Best regards,
>>
>> 施柏安 Desmond Shih
>> 技術研發部 Technical Development
>>  
>> 迎棧科技股份有限公司
>> │ 886-975-857-982
>> │ desmond.s@inwinstack 
>> │ 886-2-7738-2858 #7441
>> │ 新北市220板橋區遠東路3號5樓C室
>> Rm.C, 5F., No.3, Yuandong Rd.,
>> Banqiao Dist., New Taipei City 220, Taiwan (R.O.C)
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>


-- 

Best regards,

施柏安 Desmond Shih
技術研發部 Technical Development
 
迎棧科技股份有限公司
│ 886-975-857-982
│ desmond.s@inwinstack 
│ 886-2-7738-2858 #7441
│ 新北市220板橋區遠東路3號5樓C室
Rm.C, 5F., No.3, Yuandong Rd.,
Banqiao Dist., New Taipei City 220, Taiwan (R.O.C)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mon placement over wide area

2016-04-11 Thread Adrian Saul
Hello again Christian :)


> > We are close to being given approval to deploy a 3.5PB Ceph cluster that
> > will be distributed over every major capital in Australia.The config
> > will be dual sites in each city that will be coupled as HA pairs - 12
> > sites in total.   The vast majority of CRUSH rules will place data
> > either locally to the individual site, or replicated to the other HA
> > site in that city.   However there are future use cases where I think we
> > could use EC to distribute data wider or have some replication that puts
> > small data sets across multiple cities.
> This will very, very, VERY much depend on the data (use case) in question.

The EC use case would be using RGW to act as an archival backup store.

> > The concern I have is around the placement of mons.  In the current
> > design there would be two monitors in each site, running separate to the
> > OSDs as part of some hosts acting as RBD to iSCSI/NFS gateways.   There
> > will also be a "tiebreaker" mon placed on a separate host which will
> > house some management infrastructure for the whole platform.
> >
> Yes, that's the preferable way, might want to up this to 5 mons so you can
> loose one while doing maintenance on another one.
> But if that would be a coupled, national cluster you're looking both at
> significant MON traffic, interesting "split-brain" scenarios and latencies as
> well (MONs get chosen randomly by clients AFAIK).

In the case I am setting up it would be 2 per site plus the extra, so 25 - but I 
am fearing that would make the mon syncing become too heavy.  Once we build up 
to multiple sites though we can maybe reduce to one per site to reduce the 
workload on keeping the mons in sync.

> > Obviously a concern is latency - the east coast to west coast latency
> > is around 50ms, and on the east coast it is 12ms between Sydney and
> > the other two sites, and 24ms Melbourne to Brisbane.
> In any situation other than "write speed doesn't matter at all" combined with
> "large writes, not small ones" and "read-mostly" you're going to be in severe
> pain.

For data, yes, but the main case for that would be backup data, where it would be 
large writes, read rarely, and as long as streaming performance keeps up, latency 
won't matter.   My concern with the latency would be how it impacts the 
monitors having to keep in sync and how that would affect client operations, 
especially with the rate of change that would occur with the predominant RBD 
use in most sites.

> > Most of the data
> > traffic will remain local but if we create a single national cluster
> > then how much of an impact will it be having all the mons needing to
> > keep in sync, as well as monitor and communicate with all OSDs (in the
> > end goal design there will be some 2300+ OSDs).
> >
> Significant.
> I wouldn't suggest it, but even if you deploy differently I'd suggest a test
> run/setup and sharing the experience with us. ^.^

Someone has to be the canary right :)

> > The other options I  am considering:
> > - split into east and west coast clusters, most of the cross city need
> > is in the east coast, any data moves between clusters can be done with
> > snap replication
> > - city based clusters (tightest latency) but loose the multi-DC EC
> > option, do cross city replication using snapshots
> >
> The later, I seem to remember that there was work in progress to do this
> (snapshot replication) in an automated fashion.
>
> > Just want to get a feel for what I need to consider when we start
> > building at this scale.
> >
> I know you're set on iSCSI/NFS (have you worked out the iSCSI kinks?), but
> the only well known/supported way to do geo-replication with Ceph is via
> RGW.

iSCSI is working fairly well.  We have decided to not use Ceph for the latency 
sensitive workloads, so while we are still working to keep that low, we won't be 
putting the heavier IOP or latency sensitive workloads onto it until we get a 
better feel for how it behaves at scale and can be sure of the performance.

As above - for the most part we are going to be having local 
site pools (replicated at the application level), a few metro-replicated pools and a 
couple of very small multi-metro replicated pools, with the geo-redundant EC 
stuff a future plan.  It would just be a shame to lock the design into a setup 
that won't let us do some of these wider options down the track.

Thanks.

Adrian


Re: [ceph-users] Deprecating ext4 support

2016-04-11 Thread Shinobu Kinjo
Hi Sage,

It may be better to mention that we only update the master documentation; 
otherwise someone gets confused again [1].

[1] https://en.wikipedia.org/wiki/Ceph_%28software%29

Cheers,
Shinobu

- Original Message -
From: "Sage Weil" 
To: "Christian Balzer" 
Cc: ceph-de...@vger.kernel.org, ceph-us...@ceph.com, ceph-maintain...@ceph.com
Sent: Tuesday, April 12, 2016 10:12:14 AM
Subject: Re: [ceph-users] Deprecating ext4 support

On Tue, 12 Apr 2016, Christian Balzer wrote:
> 
> Hello,
> 
> What a lovely missive to start off my working day...
> 
> On Mon, 11 Apr 2016 17:39:37 -0400 (EDT) Sage Weil wrote:
> 
> > Hi,
> > 
> > ext4 has never been recommended, but we did test it.  
> Patently wrong, as Shinobu just pointed.
> 
> Ext4 never was (especially recently) flogged as much as XFS, but it always
> was a recommended, supported filestorage filesystem, unlike the
> experimental BTRFS of ZFS. 
> And for various reasons people, including me, deployed it instead of XFS.

Greg definitely wins the prize for raising this as a major issue, then 
(and for naming you as one of the major ext4 users).

I was not aware that we were recommending ext4 anywhere.  FWIW, here's 
what the docs currently say:

 Ceph OSD Daemons rely heavily upon the stability and performance of the 
 underlying filesystem.

 Note: We currently recommend XFS for production deployments. We recommend 
 btrfs for testing, development, and any non-critical deployments. We 
 believe that btrfs has the correct feature set and roadmap to serve Ceph 
 in the long-term, but XFS and ext4 provide the necessary stability for 
 today’s deployments. btrfs development is proceeding rapidly: users should 
 be comfortable installing the latest released upstream kernels and be able 
 to track development activity for critical bug fixes.

 Ceph OSD Daemons depend on the Extended Attributes (XATTRs) of the 
 underlying file system for various forms of internal object state and 
 metadata. The underlying filesystem must provide sufficient capacity for 
 XATTRs. btrfs does not bound the total xattr metadata stored with a file. 
 XFS has a relatively large limit (64 KB) that most deployments won’t 
 encounter, but the ext4 is too small to be usable.

(http://docs.ceph.com/docs/master/rados/configuration/filesystem-recommendations/?highlight=ext4)

Unfortunately that second paragraph, second sentence indirectly says ext4 
is stable.  :( :(  I'll prepare a PR tomorrow to revise this whole section 
based on the new information.

If anyone knows of other docs that recommend ext4, please let me know!  
They need to be updated.

> > After Jewel is out, we would like explicitly recommend *against* ext4 
> > and stop testing it.
> > 
> Changing your recommendations is fine, stopping testing/supporting it
> isn't. 
> People deployed Ext4 in good faith and can be expected to use it at least
> until their HW is up for replacement (4-5 years).

I agree, which is why I asked.

And part of it depends on what it's being used for.  If there are major 
users using ext4 for RGW then their deployments are at risk and they 
should swap it out for data safety reasons alone.  (Or, we need to figure 
out how to fix long object name support on ext4.)  On the other hand, if 
the only ext4 users are using RBD only, then they can safely continue with 
lower max object names, and upstream testing is important to let those 
OSDs age out naturally.

Does your cluster support RBD, RGW, or something else?

> > Why:
> > 
> > Recently we discovered an issue with the long object name handling that
> > is not fixable without rewriting a significant chunk of FileStores
> > filename handling.  (There is a limit in the amount of xattr data ext4
> > can store in the inode, which causes problems in LFNIndex.)
> > 
> Is that also true if the Ext4 inode size is larger than default?

I'm not sure... Sam, do you know?  (It's somewhat academic, though, since 
we can't change the inode size on existing file systems.)
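
(For anyone wondering what their existing OSDs were formatted with, this
shows it; /dev/sdb1 is only a stand-in for your OSD's data partition:

  # report the inode size of an existing ext4 filesystem
  tune2fs -l /dev/sdb1 | grep -i 'inode size'

A bigger inode would only come from reformatting, e.g. mkfs.ext4 -I 2048,
i.e. only for newly created OSDs.)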
 
> > We *could* invest a ton of time rewriting this to fix, but it only
> > affects ext4, which we never recommended, and we plan to deprecate
> > FileStore once BlueStore is stable anyway, so it seems like a waste of
> > time that would be better spent elsewhere.
> > 
> If you (that is, RH) are going to declare BlueStore stable this year, I
> would be very surprised.

My hope is that it can be the *default* for L (next spring).  But we'll 
see.

> Either way, dropping support before the successor is truly ready doesn't
> sit well with me.

Yeah, I misspoke.  Once BlueStore is supported and the default, support 
for FileStore won't be dropped immediately.  But we'll want to communicate 
that eventually it will lose support.  How strongly that is messaged 
probably depends on how confident we are in BlueStore at that point.  And 
I confess I haven't thought much about how long "long enough" is yet.

> Which brings me to the reasons why people would want to migrate (NOT
> talking about star

Re: [ceph-users] [ceph-mds] mds service can not start after shutdown in 10.1.0

2016-04-11 Thread John Spray
On Tue, Apr 12, 2016 at 2:14 AM, 施柏安  wrote:

> Hi John,
>
> You are right. The id is not '0'.
> I checked the MDS status with the command 'ceph mds dump', but it does not
> show much info for the MDS servers.
> Is there any command that can check the MDS list or health easily?
>

The info you get in "mds dump" includes the service names, but it is for
running daemons only.  Currently to know the ID of a non-running daemon you
either have to remember it (easy if it's the hostname) or look in /var.
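
Roughly, assuming the default cluster name "ceph" (so the directory name
carries a "ceph-" prefix that is not part of the service id):

  ls /var/lib/ceph/mds/                  # shows e.g. a directory "ceph-<id>"
  sudo service ceph-mds start id=<id>    # pass only the part after "ceph-"
  ceph mds stat                          # confirm the daemon comes back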

John




> Thank for your help.
>
>
> ...
> vagrant@mds-1:~$ sudo ls /var/lib/ceph/mds
> ceph-mds-1
> vagrant@mds-1:~$ sudo service ceph-mds start id=ceph-mds-1
> ceph-mds stop/waiting
> vagrant@mds-1:~$ sudo service ceph-mds start id=mds-1
>
>
> ceph-mds (ceph/mds-1) start/running, process 6809
> vagrant@mds-1:~$
> vagrant@mds-1:~$ ls
> vagrant@mds-1:~$ ps aux | grep mds
> ceph  6809  0.0  1.7 338752  8792 ?Ssl  01:11   0:00
> /usr/bin/ceph-mds --cluster=ceph -i mds-1 -f --setuser ceph --setgroup ceph
> vagrant   6830  0.0  0.1  10432   628 pts/0S+   01:11   0:00 grep
> --color=auto mds
> ...
>
> 2016-04-11 19:12 GMT+08:00 John Spray :
>
>> Is the ID of the MDS service really "0"?  Usually people set the ID to
>> the hostname.  Check it in /var/lib/ceph/mds
>>
>> John
>>
>> On Mon, Apr 11, 2016 at 9:44 AM, 施柏安  wrote:
>>
>>> Hi cephers,
>>>
>>> I was testing CephFS's HA, so I shut down the active MDS server.
>>> Then one of the standby MDSs became active. Everything seemed to work
>>> properly.
>>> But when I booted the MDS server that had been shut down in the test, it
>>> couldn't join the cluster automatically.
>>> And when I used the command 'sudo service ceph-mds start id=0', it
>>> couldn't start and just showed 'ceph-mds stop/waiting'.
>>>
>>> Is that a bug, or did I do something wrong?
>>>
>>> --
>>>
>>> Best regards,
>>>
>>> 施柏安 Desmond Shih
>>> 技術研發部 Technical Development
>>>  
>>> 迎棧科技股份有限公司
>>> │ 886-975-857-982
>>> │ desmond.s@inwinstack 
>>> │ 886-2-7738-2858 #7441
>>> │ 新北市220板橋區遠東路3號5樓C室
>>> Rm.C, 5F., No.3, Yuandong Rd.,
>>> Banqiao Dist., New Taipei City 220, Taiwan (R.O.C)
>>>
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>>
>>
>
>
> --
>
> Best regards,
>
> 施柏安 Desmond Shih
> 技術研發部 Technical Development
>  
> 迎棧科技股份有限公司
> │ 886-975-857-982
> │ desmond.s@inwinstack 
> │ 886-2-7738-2858 #7441
> │ 新北市220板橋區遠東路3號5樓C室
> Rm.C, 5F., No.3, Yuandong Rd.,
> Banqiao Dist., New Taipei City 220, Taiwan (R.O.C)
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Ceph-maintainers] Deprecating ext4 support

2016-04-11 Thread hp cre
As far as I remember, the documentation did say that either filesystem
(ext4 or XFS) is OK, except for xattrs, which were better supported on XFS.

I would think the best move would be to make XFS the default OSD creation
method and put in a warning about ext4 being deprecated in future
releases, but leave support for it until all users are weaned off it in
favour of XFS and, later, btrfs.
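
For what it's worth, the filesystem is picked at OSD creation time anyway,
via --fs-type or the osd mkfs type default, so the deprecation largely comes
down to which of these people run. A sketch, with /dev/sdb as a stand-in
device:

  ceph-disk prepare --fs-type xfs  /dev/sdb
  ceph-disk prepare --fs-type ext4 /dev/sdb   # the case being deprecated
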
On 12 Apr 2016 03:12, "Sage Weil"  wrote:

> On Tue, 12 Apr 2016, Christian Balzer wrote:
> >
> > Hello,
> >
> > What a lovely missive to start off my working day...
> >
> > On Mon, 11 Apr 2016 17:39:37 -0400 (EDT) Sage Weil wrote:
> >
> > > Hi,
> > >
> > > ext4 has never been recommended, but we did test it.
> > Patently wrong, as Shinobu just pointed out.
> >
> > Ext4 never was (especially recently) flogged as much as XFS, but it
> > always was a recommended, supported FileStore filesystem, unlike the
> > experimental BTRFS or ZFS.
> > And for various reasons people, including me, deployed it instead of XFS.
>
> Greg definitely wins the prize for raising this as a major issue, then
> (and for naming you as one of the major ext4 users).
>
> I was not aware that we were recommending ext4 anywhere.  FWIW, here's
> what the docs currently say:
>
>  Ceph OSD Daemons rely heavily upon the stability and performance of the
>  underlying filesystem.
>
>  Note: We currently recommend XFS for production deployments. We recommend
>  btrfs for testing, development, and any non-critical deployments. We
>  believe that btrfs has the correct feature set and roadmap to serve Ceph
>  in the long-term, but XFS and ext4 provide the necessary stability for
>  today’s deployments. btrfs development is proceeding rapidly: users should
>  be comfortable installing the latest released upstream kernels and be able
>  to track development activity for critical bug fixes.
>
>  Ceph OSD Daemons depend on the Extended Attributes (XATTRs) of the
>  underlying file system for various forms of internal object state and
>  metadata. The underlying filesystem must provide sufficient capacity for
>  XATTRs. btrfs does not bound the total xattr metadata stored with a file.
>  XFS has a relatively large limit (64 KB) that most deployments won’t
>  encounter, but the ext4 limit is too small to be usable.
>
> (http://docs.ceph.com/docs/master/rados/configuration/filesystem-recommendations/?highlight=ext4)
>
> Unfortunately that second paragraph, second sentence indirectly says ext4
> is stable.  :( :(  I'll prepare a PR tomorrow to revise this whole section
> based on the new information.
>
> If anyone knows of other docs that recommend ext4, please let me know!
> They need to be updated.
>
> > > After Jewel is out, we would like explicitly recommend *against* ext4
> > > and stop testing it.
> > >
> > Changing your recommendations is fine, stopping testing/supporting it
> > isn't.
> > People deployed Ext4 in good faith and can be expected to use it at least
> > until their HW is up for replacement (4-5 years).
>
> I agree, which is why I asked.
>
> And part of it depends on what it's being used for.  If there are major
> users using ext4 for RGW then their deployments are at risk and they
> should swap it out for data safety reasons alone.  (Or, we need to figure
> out how to fix long object name support on ext4.)  On the other hand, if
> the only ext4 users are using RBD only, then they can safely continue with
> lower max object names, and upstream testing is important to let those
> OSDs age out naturally.
>
> Does your cluster support RBD, RGW, or something else?
>
> > > Why:
> > >
> > > Recently we discovered an issue with the long object name handling that
> > > is not fixable without rewriting a significant chunk of FileStore's
> > > filename handling.  (There is a limit in the amount of xattr data ext4
> > > can store in the inode, which causes problems in LFNIndex.)
> > >
> > Is that also true if the Ext4 inode size is larger than default?
>
> I'm not sure... Sam, do you know?  (It's somewhat academic, though, since
> we can't change the inode size on existing file systems.)
>
> > > We *could* invest a ton of time rewriting this to fix, but it only
> > > affects ext4, which we never recommended, and we plan to deprecate
> > > FileStore once BlueStore is stable anyway, so it seems like a waste of
> > > time that would be better spent elsewhere.
> > >
> > If you (that is, RH) are going to declare BlueStore stable this year, I
> > would be very surprised.
>
> My hope is that it can be the *default* for L (next spring).  But we'll
> see.
>
> > Either way, dropping support before the successor is truly ready doesn't
> > sit well with me.
>
> Yeah, I misspoke.  Once BlueStore is supported and the default, support
> for FileStore won't be dropped immediately.  But we'll want to communicate
> that eventually it will lose support.  How strongly that is messaged
> probably depends on how confident we are in BlueStore at that point.  And
> I confess I haven

Re: [ceph-users] Deprecating ext4 support

2016-04-11 Thread Christian Balzer

Hello,

On Mon, 11 Apr 2016 21:12:14 -0400 (EDT) Sage Weil wrote:

> On Tue, 12 Apr 2016, Christian Balzer wrote:
> > 
> > Hello,
> > 
> > What a lovely missive to start off my working day...
> > 
> > On Mon, 11 Apr 2016 17:39:37 -0400 (EDT) Sage Weil wrote:
> > 
> > > Hi,
> > > 
> > > ext4 has never been recommended, but we did test it.  
> > Patently wrong, as Shinobu just pointed out.
> > 
> > Ext4 never was (especially recently) flogged as much as XFS, but it
> > always was a recommended, supported FileStore filesystem, unlike the
> > experimental BTRFS or ZFS.
> > And for various reasons people, including me, deployed it instead of
> > XFS.
> 
> Greg definitely wins the prize for raising this as a major issue, then 
> (and for naming you as one of the major ext4 users).
> 
I'm sure there are others; it's often surprising how people will pipe
up on this ML for the first time with really massive deployments they've
been running for years without ever being on anybody's radar.

> I was not aware that we were recommending ext4 anywhere.  FWIW, here's 
> what the docs currently say:
> 
>  Ceph OSD Daemons rely heavily upon the stability and performance of the 
>  underlying filesystem.
> 
>  Note: We currently recommend XFS for production deployments. We
> recommend btrfs for testing, development, and any non-critical
> deployments. We believe that btrfs has the correct feature set and
> roadmap to serve Ceph in the long-term, but XFS and ext4 provide the
> necessary stability for today’s deployments. btrfs development is
> proceeding rapidly: users should be comfortable installing the latest
> released upstream kernels and be able to track development activity for
> critical bug fixes.
> 
>  Ceph OSD Daemons depend on the Extended Attributes (XATTRs) of the 
>  underlying file system for various forms of internal object state and 
>  metadata. The underlying filesystem must provide sufficient capacity
> for XATTRs. btrfs does not bound the total xattr metadata stored with a
> file. XFS has a relatively large limit (64 KB) that most deployments
> won’t encounter, but the ext4 limit is too small to be usable.
> 
> (http://docs.ceph.com/docs/master/rados/configuration/filesystem-recommendations/?highlight=ext4)
> 
> Unfortunately that second paragraph, second sentence indirectly says
> ext4 is stable.  :( :(  I'll prepare a PR tomorrow to revise this whole
> section based on the new information.
> 
Not only that, the "filestore xattr use omap" section afterwards
reinforces that by clearly suggesting that this is the official
work-around for the XATTR issue.
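
If I remember those docs right, the workaround that section gave ext4 users
was simply this in ceph.conf (a sketch from memory):

  [osd]
  filestore xattr use omap = true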

> If anyone knows of other docs that recommend ext4, please let me know!  
> They need to be updated.
> 
Not going to try to find any cached versions, but when I did my first
deployment with Dumpling I don't think the "Note" section was there or as
prominent. 
Not that it would have stopped me from using Ext4, mind.

> > > After Jewel is out, we would like explicitly recommend *against*
> > > ext4 and stop testing it.
> > > 
> > Changing your recommendations is fine, stopping testing/supporting it
> > isn't. 
> > People deployed Ext4 in good faith and can be expected to use it at
> > least until their HW is up for replacement (4-5 years).
> 
> I agree, which is why I asked.
> 
> And part of it depends on what it's being used for.  If there are major 
> users using ext4 for RGW then their deployments are at risk and they 
> should swap it out for data safety reasons alone.  (Or, we need to
> figure out how to fix long object name support on ext4.)  On the other
> hand, if the only ext4 users are using RBD only, then they can safely
> continue with lower max object names, and upstream testing is important
> to let those OSDs age out naturally.
> 
> Does your cluster support RBD, RGW, or something else?
> 
Only RBD on all clusters so far and definitely no plans to change that for
the main, mission critical production cluster.
I might want to add CephFS to the other production cluster at some time,
though.

No RGW, but if/when RGW supports "listing objects quickly" (which is what I
vaguely remember from my conversation with Timo Sirainen, the Dovecot
author) we would be very interested in that particular piece of Ceph as
well. On a completely new cluster though, so no issue.

> > > Why:
> > > 
> > > Recently we discovered an issue with the long object name handling
> > > that is not fixable without rewriting a significant chunk of
> > > FileStore's filename handling.  (There is a limit in the amount of
> > > xattr data ext4 can store in the inode, which causes problems in
> > > LFNIndex.)
> > > 
> > Is that also true if the Ext4 inode size is larger than default?
> 
> I'm not sure... Sam, do you know?  (It's somewhat academic, though,
> since we can't change the inode size on existing file systems.)
>  
Yes and no.
Some people (and I think not just me) were perfectly capable of reading
between the lines and formatting their Ext4 FS accordingly:
"mkfs.ext4 -J size=1024 -I 2048 -

Re: [ceph-users] mons die with mon/OSDMonitor.cc: 125: FAILED assert(version >= osdmap.epoch)...

2016-04-11 Thread Gregory Farnum
On Mon, Apr 11, 2016 at 3:45 PM, Eric Hall  wrote:
> Power failure in data center has left 3 mons unable to start with
> mon/OSDMonitor.cc: 125: FAILED assert(version >= osdmap.epoch)
>
> Have found simliar problem discussed at
> http://irclogs.ceph.widodh.nl/index.php?date=2015-05-29, but am unsure how
> to proceed.
>
> If I read
> ceph-kvstore-tool /var/lib/ceph/mon/ceph-cephsecurestore1/store.db list
> correctly, they believe osdmap is 1, but they also have osdmap:full_38456
> and osdmap:38630 in the store.
>
> Working from http://irclogs info, something like
> ceph-kvstore-tool /var/lib/ceph/mon/ceph-foo/store.db set osdmap N in
> /tmp/osdmap
> might help, but I am unsure of the value for N.  Seems like too delicate an
> operation for experimentation.

Exactly which keys are you reading that are giving you those values?
The "real" OSDMap epoch is going to be at least 38630...if you're very
lucky it will be exactly 38630. But since it reset itself to 1 in the
monitor's store, I doubt you'll be lucky.

So in order to get your cluster back up, you need to find the largest
osdmap version in your cluster. You can do that, very tediously, by
looking at the OSDMap stores. Or you may have debug logs indicating it
more easily on the monitors.
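
One way to do the tedious part is to scan the OSDs' meta collections, where
full maps are stored as objects named after their epoch. A sketch only; it
assumes the default FileStore paths and is run on each OSD host:

  find /var/lib/ceph/osd/ceph-*/current/meta -name 'osdmap*' 2>/dev/null \
    | grep -o 'osdmap\.[0-9]*' | sort -t. -k2 -n | tail -1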

But your most important task is to find out why your monitors went
back in time; if the software and hardware underneath Ceph are
behaving, that should be impossible. The usual scenario is that you
have caches enabled which aren't power-safe (e.g., inside the drives)
or have disabled barriers or something.
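
Two quick things worth checking on each node (device name is an example,
adjust for your hardware):

  # is the drive's volatile write cache enabled?
  hdparm -W /dev/sda
  # were any of the OSD filesystems mounted with barriers disabled?
  grep -E 'nobarrier|barrier=0' /proc/mounts
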
-Greg

>
>
> OS: Ubuntu 14.04.4
> kernel: 3.13.0-83-generic
> ceph: Firefly 0.80.11-1trusty
>
> Any assistance appreciated,
> --
> Eric Hall
> Institute for Software Integrated Systems
> Vanderbilt University
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph breizh meetup

2016-04-11 Thread eric mourgaya
hi,

The next Ceph Breizh meetup will be organized in Nantes on April 19th
in the Suravenir building:
at 2 Impasse Vasco de Gama, 44800 Saint-Herblain

Here is the doodle:

http://doodle.com/poll/3mxqqgfkn4ttpfib

See you soon in Nantes

-- 
Eric Mourgaya,


Let's respect the planet!
Let's fight mediocrity!
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Ceph-maintainers] Deprecating ext4 support

2016-04-11 Thread Loic Dachary
Hi Sage,

I suspect most people nowadays run tests and develop on ext4. Not supporting 
ext4 in the future means we'll need to find a convenient way for developers to 
run tests against the supported file systems.
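
A throwaway file-backed XFS mount is probably the cheapest way to keep that
convenient; a sketch (paths are arbitrary):

  truncate -s 20G /tmp/xfs-scratch.img
  mkfs.xfs -f /tmp/xfs-scratch.img
  mkdir -p /mnt/ceph-test
  mount -o loop /tmp/xfs-scratch.img /mnt/ceph-test   # point test dirs here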

My 2cts :-)

On 11/04/2016 23:39, Sage Weil wrote:
> Hi,
> 
> ext4 has never been recommended, but we did test it.  After Jewel is out, 
> we would like explicitly recommend *against* ext4 and stop testing it.
> 
> Why:
> 
> Recently we discovered an issue with the long object name handling that is 
> not fixable without rewriting a significant chunk of FileStore's filename 
> handling.  (There is a limit in the amount of xattr data ext4 can store in 
> the inode, which causes problems in LFNIndex.)
> 
> We *could* invest a ton of time rewriting this to fix, but it only affects 
> ext4, which we never recommended, and we plan to deprecate FileStore once 
> BlueStore is stable anyway, so it seems like a waste of time that would be 
> better spent elsewhere.
> 
> Also, by dropping ext4 test coverage in ceph-qa-suite, we can 
> significantly improve time/coverage for FileStore on XFS and on BlueStore.
> 
> The long file name handling is problematic anytime someone is storing 
> rados objects with long names.  The primary user that does this is RGW, 
> which means any RGW cluster using ext4 should recreate their OSDs to use 
> XFS.  Other librados users could be affected too, though, like users 
> with very long rbd image names (e.g., > 100 characters), or custom 
> librados users.
> 
> How:
> 
> To make this change as visible as possible, the plan is to make ceph-osd 
> refuse to start if the backend is unable to support the configured max 
> object name (osd_max_object_name_len).  The OSD will complain that ext4 
> cannot store such an object and refuse to start.  A user who is only using 
> RBD might decide they don't need long file names to work and can adjust 
> the osd_max_object_name_len setting to something small (say, 64) and run 
> successfully.  They would be taking a risk, though, because we would like 
> to stop testing on ext4.
> 
> Is this reasonable?  If there significant ext4 users that are unwilling to 
> recreate their OSDs, now would be the time to speak up.
> 
> Thanks!
> sage
> 
> ___
> Ceph-maintainers mailing list
> ceph-maintain...@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-maintainers-ceph.com
> 

-- 
Loïc Dachary, Artisan Logiciel Libre
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com