Re: [ceph-users] ceph-fuse segfaults ( jewel 10.2.2)

2016-07-11 Thread Goncalo Borges

Hi John...

Thank you for replying.

Here are the results of the tests you asked for, but I do not see anything 
abnormal. Actually, your suggestions made me see that:


1) ceph-fuse 9.2.0 presents the same behaviour but with lower memory 
consumption, probably low enough that it doesn't break ceph-fuse 
on our machines with less memory.


2) I see a tremendous number of  ceph-fuse threads launched (around 160).

   # ps -T -p 3230 -o command,ppid,pid,spid,vsize,rss,%mem,%cpu | wc -l
   157

   # ps -T -p 3230 -o command,ppid,pid,spid,vsize,rss,%mem,%cpu | head
   -n 10
   COMMAND  PPID   PID  SPIDVSZ   RSS %MEM %CPU
   ceph-fuse --id mount_user - 1  3230  3230 9935240 339780 0.6 0.0
   ceph-fuse --id mount_user - 1  3230  3231 9935240 339780 0.6 0.1
   ceph-fuse --id mount_user - 1  3230  3232 9935240 339780 0.6 0.0
   ceph-fuse --id mount_user - 1  3230  3233 9935240 339780 0.6 0.0
   ceph-fuse --id mount_user - 1  3230  3234 9935240 339780 0.6 0.0
   ceph-fuse --id mount_user - 1  3230  3235 9935240 339780 0.6 0.0
   ceph-fuse --id mount_user - 1  3230  3236 9935240 339780 0.6 0.0
   ceph-fuse --id mount_user - 1  3230  3237 9935240 339780 0.6 0.0
   ceph-fuse --id mount_user - 1  3230  3238 9935240 339780 0.6 0.0


I do not see a way to limit the number of ceph-fuse threads 
launched, or to cap the maximum virtual memory size each thread should take.


Do you know how to limit those?

Cheers

Goncalo




1.> Try running ceph-fuse with valgrind --tool=memcheck to see if it's 
leaking


I have launched ceph-fuse under valgrind on the cluster where there is 
sufficient memory available, and therefore no object cacher 
segfault occurs.


$ valgrind --log-file=/tmp/valgrind-ceph-fuse-10.2.2.txt 
--tool=memcheck ceph-fuse --id mount_user -k 
/etc/ceph/ceph.client.mount_user.keyring -m X.X.X.8:6789 -r /cephfs 
/coepp/cephfs


This is the output I get once I unmount the file system after the user 
application has finished:


   # cat valgrind-ceph-fuse-10.2.2.txt
   ==12123== Memcheck, a memory error detector
   ==12123== Copyright (C) 2002-2012, and GNU GPL'd, by Julian Seward
   et al.
   ==12123== Using Valgrind-3.8.1 and LibVEX; rerun with -h for
   copyright info
   ==12123== Command: ceph-fuse --id mount_user -k
   /etc/ceph/ceph.client.mount_user.keyring -m 192.231.127.8:6789 -r
   /cephfs /coepp/cephfs
   ==12123== Parent PID: 11992
   ==12123==
   ==12123==
   ==12123== HEAP SUMMARY:
   ==12123== in use at exit: 29,129 bytes in 397 blocks
   ==12123==   total heap usage: 14,824 allocs, 14,427 frees, 648,030
   bytes allocated
   ==12123==
   ==12123== LEAK SUMMARY:
   ==12123==definitely lost: 16 bytes in 1 blocks
   ==12123==indirectly lost: 0 bytes in 0 blocks
   ==12123==  possibly lost: 11,705 bytes in 273 blocks
   ==12123==still reachable: 17,408 bytes in 123 blocks
   ==12123== suppressed: 0 bytes in 0 blocks
   ==12123== Rerun with --leak-check=full to see details of leaked memory
   ==12123==
   ==12123== For counts of detected and suppressed errors, rerun with: -v
   ==12123== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 8 from 6)
   ==12126==
   ==12126== HEAP SUMMARY:
   ==12126== in use at exit: 9,641 bytes in 73 blocks
   ==12126==   total heap usage: 31,363,579 allocs, 31,363,506 frees,
   41,389,143,617 bytes allocated
   ==12126==
   ==12126== LEAK SUMMARY:
   ==12126==definitely lost: 28 bytes in 1 blocks
   ==12126==indirectly lost: 0 bytes in 0 blocks
   ==12126==  possibly lost: 0 bytes in 0 blocks
   ==12126==still reachable: 9,613 bytes in 72 blocks
   ==12126== suppressed: 0 bytes in 0 blocks
   ==12126== Rerun with --leak-check=full to see details of leaked memory
   ==12126==
   ==12126== For counts of detected and suppressed errors, rerun with: -v
   ==12126== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 17
   from 9)
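
As the valgrind output itself suggests, a variant of the same invocation with
full leak details (a sketch only; the log file name here is just an example,
the mount arguments are the same as above) would be:

   $ valgrind --tool=memcheck --leak-check=full \
 --log-file=/tmp/valgrind-ceph-fuse-10.2.2-full.txt \
 ceph-fuse --id mount_user -k /etc/ceph/ceph.client.mount_user.keyring \
 -m X.X.X.8:6789 -r /cephfs /coepp/cephfs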

--- * ---

2.> Inspect inode count (ceph daemon <client admin socket> status) to see if 
it's obeying its limit


This is the output I get once ceph-fuse is mounted but no user 
application is running


# ceph daemon /var/run/ceph/ceph-client.mount_user.asok status
{
"metadata": {
"ceph_sha1": "45107e21c568dd033c2f0a3107dec8f0b0e58374",
"ceph_version": "ceph version 10.2.2 
(45107e21c568dd033c2f0a3107dec8f0b0e58374)",

"entity_id": "mount_user",
"hostname": "",
"mount_point": "\/coepp\/cephfs",
"root": "\/cephfs"
},
"dentry_count": 0,
"dentry_pinned_count": 0,
"inode_count": 2,
"mds_epoch": 817,
"osd_epoch": 1005,
"osd_epoch_barrier": 0
}


This is after ceph-fuse has already reached 10 GB of virtual memory, while 
user applications are hammering the filesystem:


# ceph daemon /var/run/ceph/ceph-client.mount_user.asok status
{
"metadata": {
"ceph_sha1": "45107e21c568dd033c2f0a3107dec8f0b0e58374",
  

Re: [ceph-users] ceph-fuse segfaults ( jewel 10.2.2)

2016-07-11 Thread Goncalo Borges



On 07/11/2016 05:04 PM, Goncalo Borges wrote:


Hi John...

Thank you for replying.

Here are the results of the tests you asked for, but I do not see anything 
abnormal. Actually, your suggestions made me see that:


1) ceph-fuse 9.2.0 presents the same behaviour but with lower memory 
consumption, probably low enough that it doesn't break 
ceph-fuse on our machines with less memory.


2) I see a tremendous number of  ceph-fuse threads launched (around 160).

# ps -T -p 3230 -o command,ppid,pid,spid,vsize,rss,%mem,%cpu | wc -l
157

# ps -T -p 3230 -o command,ppid,pid,spid,vsize,rss,%mem,%cpu |
head -n 10
COMMAND  PPID   PID  SPIDVSZ   RSS %MEM %CPU
ceph-fuse --id mount_user - 1  3230  3230 9935240 339780 0.6 0.0
ceph-fuse --id mount_user - 1  3230  3231 9935240 339780 0.6 0.1
ceph-fuse --id mount_user - 1  3230  3232 9935240 339780 0.6 0.0
ceph-fuse --id mount_user - 1  3230  3233 9935240 339780 0.6 0.0
ceph-fuse --id mount_user - 1  3230  3234 9935240 339780 0.6 0.0
ceph-fuse --id mount_user - 1  3230  3235 9935240 339780 0.6 0.0
ceph-fuse --id mount_user - 1  3230  3236 9935240 339780 0.6 0.0
ceph-fuse --id mount_user - 1  3230  3237 9935240 339780 0.6 0.0
ceph-fuse --id mount_user - 1  3230  3238 9935240 339780 0.6 0.0


I do not see a way to limit the number of ceph-fuse threads 
launched, or to cap the maximum virtual memory size each thread should take.




By the way, mounting ceph-fuse with fuse multithreading disabled (the -s 
option) doesn't seem to work for me:


   ceph-fuse --id mount_user -k 
/etc/ceph/ceph.client.mount_user.keyring -m XXX:6789 -s -r /cephfs 
/coepp/cephfs &


Once the user applications fill the machine, I just see the number of 
threads increasing up to ~160:






# ps -T -p 21426 -o command,ppid,pid,spid,vsize,rss,%mem,%cpu | grep 
ceph-fuse | wc -l

24

# ps -T -p 21426 -o command,ppid,pid,spid,vsize,rss,%mem,%cpu | grep 
ceph-fuse | wc -l

28


# ps -T -p 21426 -o command,ppid,pid,spid,vsize,rss,%mem,%cpu | grep 
ceph-fuse | wc -l

30


# ps -T -p 21426 -o command,ppid,pid,spid,vsize,rss,%mem,%cpu | grep 
ceph-fuse | wc -l

50

(...)

Cheers
G.




Do you know how to limit those?

Cheers

Goncalo




1.> Try running ceph-fuse with valgrind --tool=memcheck to see if it's 
leaking


I have launched ceph-fuse under valgrind on the cluster where there is 
sufficient memory available, and therefore no object cacher 
segfault occurs.


$ valgrind --log-file=/tmp/valgrind-ceph-fuse-10.2.2.txt 
--tool=memcheck ceph-fuse --id mount_user -k 
/etc/ceph/ceph.client.mount_user.keyring -m X.X.X.8:6789 -r /cephfs 
/coepp/cephfs


This is the output I get once I unmount the file system after 
the user application has finished:


# cat valgrind-ceph-fuse-10.2.2.txt
==12123== Memcheck, a memory error detector
==12123== Copyright (C) 2002-2012, and GNU GPL'd, by Julian Seward
et al.
==12123== Using Valgrind-3.8.1 and LibVEX; rerun with -h for
copyright info
==12123== Command: ceph-fuse --id mount_user -k
/etc/ceph/ceph.client.mount_user.keyring -m 192.231.127.8:6789 -r
/cephfs /coepp/cephfs
==12123== Parent PID: 11992
==12123==
==12123==
==12123== HEAP SUMMARY:
==12123== in use at exit: 29,129 bytes in 397 blocks
==12123==   total heap usage: 14,824 allocs, 14,427 frees, 648,030
bytes allocated
==12123==
==12123== LEAK SUMMARY:
==12123==definitely lost: 16 bytes in 1 blocks
==12123==indirectly lost: 0 bytes in 0 blocks
==12123==  possibly lost: 11,705 bytes in 273 blocks
==12123==still reachable: 17,408 bytes in 123 blocks
==12123== suppressed: 0 bytes in 0 blocks
==12123== Rerun with --leak-check=full to see details of leaked memory
==12123==
==12123== For counts of detected and suppressed errors, rerun with: -v
==12123== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 8
from 6)
==12126==
==12126== HEAP SUMMARY:
==12126== in use at exit: 9,641 bytes in 73 blocks
==12126==   total heap usage: 31,363,579 allocs, 31,363,506 frees,
41,389,143,617 bytes allocated
==12126==
==12126== LEAK SUMMARY:
==12126==definitely lost: 28 bytes in 1 blocks
==12126==indirectly lost: 0 bytes in 0 blocks
==12126==  possibly lost: 0 bytes in 0 blocks
==12126==still reachable: 9,613 bytes in 72 blocks
==12126== suppressed: 0 bytes in 0 blocks
==12126== Rerun with --leak-check=full to see details of leaked memory
==12126==
==12126== For counts of detected and suppressed errors, rerun with: -v
==12126== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 17
from 9)

--- * ---

2.> Inspect inode count (ceph daemon <client admin socket> status) to see if 
it's obeying its limit


This is the output I get once ceph-fuse is mounted but no user 
application is running


# ceph daemon /

Re: [ceph-users] Slow performance into windows VM

2016-07-11 Thread Christian Balzer

Hello,

On Mon, 11 Jul 2016 09:54:59 +0300 K K wrote:

> 
> > I hope the fastest of these MONs (CPU and storage) has the lowest IP
> > number and thus is the leader.
> no, the lowest IP has slowest CPU. But zabbix didn't show any load at all 
> mons.

In your use case and configuration no surprise, but again, the lowest IP
will be leader by default and thus the busiest. 

> > Also what Ceph, OS, kernel version?
> 
> ubuntu 16.04 kernel 4.4.0-22
> 
Check the ML archives, I remember people having performance issues with the
4.4 kernels.

Still don't know your Ceph version, is it the latest Jewel?

> > Two GbE ports, given the "frontend" up there with the MON description I
> > assume that's 1 port per client (front) and cluster (back) network?
> yes, one GbE for ceph client, one GbE for back network.
OK, so (from a single GbE client) 100MB/s at most.

> > Is there any other client on than that Windows VM on your Ceph cluster?
> Yes, another one instance but without load.
OK.

> > Is Ceph understanding this now?
> > Other than that, the queue options aren't likely to do much good with pure
> >HDD OSDs.
> 
> I can't find those parameter in running config:
> ceph --admin-daemon /var/run/ceph/ceph-mon.block01.asok config show|grep 
> "filestore_queue"

These are OSD parameters, you need to query an OSD daemon. 

> "filestore_queue_max_ops": "3000",
> "filestore_queue_max_bytes": "1048576000",
> "filestore_queue_max_delay_multiple": "0",
> "filestore_queue_high_delay_multiple": "0",
> "filestore_queue_low_threshhold": "0.3",
> "filestore_queue_high_threshhold": "0.9",
> > That should be 512, 1024 really with one RBD pool.
> 
> Yes, I know. Today for test I added scbench pool with 128 pg
> There are output status and osd tree:
> ceph status
> cluster 830beb43-9898-4fa9-98c1-ee04c1cdf69c
> health HEALTH_OK
> monmap e6: 3 mons at 
> {block01=10.30.9.21:6789/0,object01=10.30.9.129:6789/0,object02=10.30.9.130:6789/0}
> election epoch 238, quorum 0,1,2 block01,object01,object02
> osdmap e6887: 18 osds: 18 up, 18 in
> pgmap v9738812: 1280 pgs, 3 pools, 17440 GB data, 4346 kobjects
> 35049 GB used, 15218 GB / 50267 GB avail
> 1275 active+clean
> 3 active+clean+scrubbing+deep
> 2 active+clean+scrubbing
>
Check the ML archives and restrict scrubs to off-peak hours as well as
tune things to keep their impact low.

Scrubbing is a major performance killer, especially on non-SSD journal
OSDs and with older Ceph versions and/or non-tuned parameters:
---
osd_scrub_end_hour = 6
osd_scrub_load_threshold = 2.5
osd_scrub_sleep = 0.1
---
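
(For reference, a minimal sketch of where these would typically go, assuming
they are set via ceph.conf on the OSD nodes rather than injected at runtime:

   [osd]
   osd_scrub_end_hour = 6
   osd_scrub_load_threshold = 2.5
   osd_scrub_sleep = 0.1

followed by a restart of the OSDs so the values take effect.)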

> client io 5030 kB/s rd, 1699 B/s wr, 19 op/s rd, 0 op/s wr
> 
> ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY 
> -1 54.0 root default 
> -2 27.0 host cn802 
> 0 3.0 osd.0 up 1.0 1.0 
> 2 3.0 osd.2 up 1.0 1.0 
> 4 3.0 osd.4 up 1.0 1.0 
> 6 3.0 osd.6 up 0.89995 1.0 
> 8 3.0 osd.8 up 1.0 1.0 
> 10 3.0 osd.10 up 1.0 1.0 
> 12 3.0 osd.12 up 0.8 1.0 
> 16 3.0 osd.16 up 1.0 1.0 
> 18 3.0 osd.18 up 0.90002 1.0 
> -3 27.0 host cn803 
> 1 3.0 osd.1 up 1.0 1.0 
> 3 3.0 osd.3 up 0.95316 1.0 
> 5 3.0 osd.5 up 1.0 1.0 
> 7 3.0 osd.7 up 1.0 1.0 
> 9 3.0 osd.9 up 1.0 1.0 
> 11 3.0 osd.11 up 0.95001 1.0 
> 13 3.0 osd.13 up 1.0 1.0 
> 17 3.0 osd.17 up 0.84999 1.0 
> 19 3.0 osd.19 up 1.0 1.0
> > Wrong way to test this, test it from a monitor node, another client node
> > (like your openstack nodes).
> > In your 2 node cluster half of the reads or writes will be local, very
> > much skewing your results.
> I have been tested from copmute node also and have same result. 80-100Mb/sec
> 
That's about as good as it gets (not 148MB/s, though!).
But rados bench is not the same as real client I/O.

> > Very high max latency, telling us that your cluster ran out of steam at
> some point.
> 
> I copying data from my windows instance right now.

Re-do any testing when you've stopped all scrubbing.

> > I'd de-frag anyway, just to rule that out.
> 
> 
> >When doing your tests or normal (busy) operations from the client VM, run
> > atop on your storage nodes and observe your OSD HDDs. 
> > Do they get busy, around 100%?
> 
> Yes, high IO load (600-800 io).  But this is very strange on SATA HDD. All 
> HDD have own OSD daemon and presented in OS as hardware RAID0(each block node 
> have hardware RAID). Example:

Your RAID controller and its HW cache are likely to help with that speed,
also all of these are reads, most likely the scrubs above, not a single
write to be seen.

> avg-cpu: %user %nice %system %iowait %steal %idle
> 1.44 0.00 3.56 17.56 0.00 77.44
> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await 
> w_await svctm %util
> sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
> sdc 0.00 0.00 649.00 0.00 82912.00 0.00 255.51 8.30 12.74 12.74 0.00 1.26 
> 81.60
> sdd 0.0

[ceph-users] CephFS and WORM

2016-07-11 Thread Xusangdi
Hi Cephers,

I’m planning to set up samba/nfs based on a CephFS kernel mount. The WORM (write 
once, read many) feature is required, but I’m not
sure if CephFS officially supports it; any suggestions? Thanks in advance.

Regards,
---Sandy

-
本邮件及其附件含有杭州华三通信技术有限公司的保密信息,仅限于发送给上面地址中列出
的个人或群组。禁止任何其他人以任何形式使用(包括但不限于全部或部分地泄露、复制、
或散发)本邮件中的信息。如果您错收了本邮件,请您立即电话或邮件通知发件人并删除本
邮件!
This e-mail and its attachments contain confidential information from H3C, 
which is
intended only for the person or entity whose address is listed above. Any use 
of the
information contained herein in any way (including, but not limited to, total 
or partial
disclosure, reproduction, or dissemination) by persons other than the intended
recipient(s) is prohibited. If you receive this e-mail in error, please notify 
the sender
by phone or email immediately and delete it!
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] (no subject)

2016-07-11 Thread Kees Meijs
Hi,

I think there's still something misconfigured:
> Invalid: 400 Bad Request: Unknown scheme 'file' found in URI (HTTP 400)

It seems the RBD backend is not used as expected.

Have you configured both Cinder _and_ Glance to use Ceph?
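
(For comparison, a minimal sketch of the Glance side, along the lines of the
usual Ceph/OpenStack integration docs; exact option names can vary by release,
and the pool/user names below are only examples:

   # glance-api.conf
   [glance_store]
   stores = rbd
   default_store = rbd
   rbd_store_pool = images
   rbd_store_user = glance
   rbd_store_ceph_conf = /etc/ceph/ceph.conf

Cinder needs the equivalent rbd settings in cinder.conf, including the
rbd_secret_uuid you regenerated.)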

Regards,
Kees

On 08-07-16 17:33, Gaurav Goyal wrote:
>
> I regenerated the UUID as per your suggestion. 
> Now i have same UUID in host1 and host2.
> I could create volumes and attach them to existing VMs.
>
> I could create new glance images. 
>
> But still finding the same error while instance launch via GUI.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Slow performance into windows VM

2016-07-11 Thread K K

>  Still don't know your Ceph version, is it the latest Jewel?
10.1.2-0ubuntu1
> Check the ML archives, I remember people having performance issues with the
4.4 kernels.
Yes, I will try to find something today.

> These are OSD parameters, you need to query an OSD daemon. 
Here they are:
ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show|grep 
"filestore_queue"
"filestore_queue_max_ops": "3000",
"filestore_queue_max_bytes": "1048576000",
"filestore_queue_max_delay_multiple": "0",
"filestore_queue_high_delay_multiple": "0",
"filestore_queue_low_threshhold": "0.3",
"filestore_queue_high_threshhold": "0.9",
> Scrubbing is a major performance killer, especially on non-SSD journal
> OSDs and with older Ceph versions and/or non-tuned parameters:
I can't change those parameters on the fly:
ceph tell osd.* injectargs '--osd_scrub_end_hour=6'
osd.0: osd_scrub_end_hour = '6' (unchangeable) 
osd.1: osd_scrub_end_hour = '6' (unchangeable) 
osd.2: osd_scrub_end_hour = '6' (unchangeable)
...
I will try to change them a bit later today and restart the OSDs.


>Monday, 11 July 2016, 12:38 +05:00 from Christian Balzer :
>
>
>Hello,
>
>On Mon, 11 Jul 2016 09:54:59 +0300 K K wrote:
>
>> 
>> > I hope the fastest of these MONs (CPU and storage) has the lowest IP
>> > number and thus is the leader.
>> no, the lowest IP has slowest CPU. But zabbix didn't show any load at all 
>> mons.
>
>In your use case and configuration no surprise, but again, the lowest IP
>will be leader by default and thus the busiest. 
>
>> > Also what Ceph, OS, kernel version?
>> 
>> ubuntu 16.04 kernel 4.4.0-22
>> 
>Check the ML archives, I remember people having performance issues with the
>4.4 kernels.
>
>Still don't know your Ceph version, is it the latest Jewel?
>
>> > Two GbE ports, given the "frontend" up there with the MON description I
>> > assume that's 1 port per client (front) and cluster (back) network?
>> yes, one GbE for ceph client, one GbE for back network.
>OK, so (from a single GbE client) 100MB/s at most.
>
>> > Is there any other client on than that Windows VM on your Ceph cluster?
>> Yes, another one instance but without load.
>OK.
>
>> > Is Ceph understanding this now?
>> > Other than that, the queue options aren't likely to do much good with pure
>> >HDD OSDs.
>> 
>> I can't find those parameter in running config:
>> ceph --admin-daemon /var/run/ceph/ceph-mon.block01.asok config show|grep 
>> "filestore_queue"
>
>These are OSD parameters, you need to query an OSD daemon. 
>
>> "filestore_queue_max_ops": "3000",
>> "filestore_queue_max_bytes": "1048576000",
>> "filestore_queue_max_delay_multiple": "0",
>> "filestore_queue_high_delay_multiple": "0",
>> "filestore_queue_low_threshhold": "0.3",
>> "filestore_queue_high_threshhold": "0.9",
>> > That should be 512, 1024 really with one RBD pool.
>> 
>> Yes, I know. Today for test I added scbench pool with 128 pg
>> There are output status and osd tree:
>> ceph status
>> cluster 830beb43-9898-4fa9-98c1-ee04c1cdf69c
>> health HEALTH_OK
>> monmap e6: 3 mons at 
>> {block01=10.30.9.21:6789/0,object01=10.30.9.129:6789/0,object02=10.30.9.130:6789/0}
>> election epoch 238, quorum 0,1,2 block01,object01,object02
>> osdmap e6887: 18 osds: 18 up, 18 in
>> pgmap v9738812: 1280 pgs, 3 pools, 17440 GB data, 4346 kobjects
>> 35049 GB used, 15218 GB / 50267 GB avail
>> 1275 active+clean
>> 3 active+clean+scrubbing+deep
>> 2 active+clean+scrubbing
>>
>Check the ML archives and restrict scrubs to off-peak hours as well as
>tune things to keep their impact low.
>
>Scrubbing is a major performance killer, especially on non-SSD journal
>OSDs and with older Ceph versions and/or non-tuned parameters:
>---
>osd_scrub_end_hour = 6
>osd_scrub_load_threshold = 2.5
>osd_scrub_sleep = 0.1
>---
>
>> client io 5030 kB/s rd, 1699 B/s wr, 19 op/s rd, 0 op/s wr
>> 
>> ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY 
>> -1 54.0 root default 
>> -2 27.0 host cn802 
>> 0 3.0 osd.0 up 1.0 1.0 
>> 2 3.0 osd.2 up 1.0 1.0 
>> 4 3.0 osd.4 up 1.0 1.0 
>> 6 3.0 osd.6 up 0.89995 1.0 
>> 8 3.0 osd.8 up 1.0 1.0 
>> 10 3.0 osd.10 up 1.0 1.0 
>> 12 3.0 osd.12 up 0.8 1.0 
>> 16 3.0 osd.16 up 1.0 1.0 
>> 18 3.0 osd.18 up 0.90002 1.0 
>> -3 27.0 host cn803 
>> 1 3.0 osd.1 up 1.0 1.0 
>> 3 3.0 osd.3 up 0.95316 1.0 
>> 5 3.0 osd.5 up 1.0 1.0 
>> 7 3.0 osd.7 up 1.0 1.0 
>> 9 3.0 osd.9 up 1.0 1.0 
>> 11 3.0 osd.11 up 0.95001 1.0 
>> 13 3.0 osd.13 up 1.0 1.0 
>> 17 3.0 osd.17 up 0.84999 1.0 
>> 19 3.0 osd.19 up 1.0 1.0
>> > Wrong way to test this, test it from a monitor node, another client node
>> > (like your openstack nodes).
>> > In your 2 node cluster half of the reads or writes will be local, very
>> > much skewing your results.
>> I have been tested from copmute node also and have same result. 80-100Mb/sec
>> 
>That's about as good as i

Re: [ceph-users] Slow performance into windows VM

2016-07-11 Thread K K

Additionally, here are the OSD params applying to scrub:
"osd_scrub_invalid_stats": "true",
"osd_scrub_begin_hour": "0",
"osd_scrub_end_hour": "24",
"osd_scrub_load_threshold": "0.5",
"osd_scrub_min_interval": "86400",
"osd_scrub_max_interval": "604800",
"osd_scrub_interval_randomize_ratio": "0.5",
"osd_scrub_chunk_min": "5",
"osd_scrub_chunk_max": "25",
"osd_scrub_sleep": "0",
"osd_scrub_auto_repair": "false",
"osd_scrub_auto_repair_num_errors": "5",
"osd_scrub_priority": "5",
"osd_scrub_cost": "52428800",
Christian, can you suggest optimal params for my environment?

>Monday, 11 July 2016, 12:38 +05:00 from Christian Balzer :
>
>
>Hello,
>
>On Mon, 11 Jul 2016 09:54:59 +0300 K K wrote:
>
>> 
>> > I hope the fastest of these MONs (CPU and storage) has the lowest IP
>> > number and thus is the leader.
>> no, the lowest IP has slowest CPU. But zabbix didn't show any load at all 
>> mons.
>
>In your use case and configuration no surprise, but again, the lowest IP
>will be leader by default and thus the busiest. 
>
>> > Also what Ceph, OS, kernel version?
>> 
>> ubuntu 16.04 kernel 4.4.0-22
>> 
>Check the ML archives, I remember people having performance issues with the
>4.4 kernels.
>
>Still don't know your Ceph version, is it the latest Jewel?
>
>> > Two GbE ports, given the "frontend" up there with the MON description I
>> > assume that's 1 port per client (front) and cluster (back) network?
>> yes, one GbE for ceph client, one GbE for back network.
>OK, so (from a single GbE client) 100MB/s at most.
>
>> > Is there any other client on than that Windows VM on your Ceph cluster?
>> Yes, another one instance but without load.
>OK.
>
>> > Is Ceph understanding this now?
>> > Other than that, the queue options aren't likely to do much good with pure
>> >HDD OSDs.
>> 
>> I can't find those parameter in running config:
>> ceph --admin-daemon /var/run/ceph/ceph-mon.block01.asok config show|grep 
>> "filestore_queue"
>
>These are OSD parameters, you need to query an OSD daemon. 
>
>> "filestore_queue_max_ops": "3000",
>> "filestore_queue_max_bytes": "1048576000",
>> "filestore_queue_max_delay_multiple": "0",
>> "filestore_queue_high_delay_multiple": "0",
>> "filestore_queue_low_threshhold": "0.3",
>> "filestore_queue_high_threshhold": "0.9",
>> > That should be 512, 1024 really with one RBD pool.
>> 
>> Yes, I know. Today for test I added scbench pool with 128 pg
>> There are output status and osd tree:
>> ceph status
>> cluster 830beb43-9898-4fa9-98c1-ee04c1cdf69c
>> health HEALTH_OK
>> monmap e6: 3 mons at 
>> {block01=10.30.9.21:6789/0,object01=10.30.9.129:6789/0,object02=10.30.9.130:6789/0}
>> election epoch 238, quorum 0,1,2 block01,object01,object02
>> osdmap e6887: 18 osds: 18 up, 18 in
>> pgmap v9738812: 1280 pgs, 3 pools, 17440 GB data, 4346 kobjects
>> 35049 GB used, 15218 GB / 50267 GB avail
>> 1275 active+clean
>> 3 active+clean+scrubbing+deep
>> 2 active+clean+scrubbing
>>
>Check the ML archives and restrict scrubs to off-peak hours as well as
>tune things to keep their impact low.
>
>Scrubbing is a major performance killer, especially on non-SSD journal
>OSDs and with older Ceph versions and/or non-tuned parameters:
>---
>osd_scrub_end_hour = 6
>osd_scrub_load_threshold = 2.5
>osd_scrub_sleep = 0.1
>---
>
>> client io 5030 kB/s rd, 1699 B/s wr, 19 op/s rd, 0 op/s wr
>> 
>> ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY 
>> -1 54.0 root default 
>> -2 27.0 host cn802 
>> 0 3.0 osd.0 up 1.0 1.0 
>> 2 3.0 osd.2 up 1.0 1.0 
>> 4 3.0 osd.4 up 1.0 1.0 
>> 6 3.0 osd.6 up 0.89995 1.0 
>> 8 3.0 osd.8 up 1.0 1.0 
>> 10 3.0 osd.10 up 1.0 1.0 
>> 12 3.0 osd.12 up 0.8 1.0 
>> 16 3.0 osd.16 up 1.0 1.0 
>> 18 3.0 osd.18 up 0.90002 1.0 
>> -3 27.0 host cn803 
>> 1 3.0 osd.1 up 1.0 1.0 
>> 3 3.0 osd.3 up 0.95316 1.0 
>> 5 3.0 osd.5 up 1.0 1.0 
>> 7 3.0 osd.7 up 1.0 1.0 
>> 9 3.0 osd.9 up 1.0 1.0 
>> 11 3.0 osd.11 up 0.95001 1.0 
>> 13 3.0 osd.13 up 1.0 1.0 
>> 17 3.0 osd.17 up 0.84999 1.0 
>> 19 3.0 osd.19 up 1.0 1.0
>> > Wrong way to test this, test it from a monitor node, another client node
>> > (like your openstack nodes).
>> > In your 2 node cluster half of the reads or writes will be local, very
>> > much skewing your results.
>> I have been tested from copmute node also and have same result. 80-100Mb/sec
>> 
>That's about as good as it gets (not 148MB/s, though!).
>But rados bench is not the same as real client I/O.
>
>> > Very high max latency, telling us that your cluster ran out of steam at
>> some point.
>> 
>> I copying data from my windows instance right now.
>
>Re-do any testing when you've stopped all scrubbing.
>
>> > I'd de-frag anyway, just to rule that out.
>> 
>> 
>> >When doing your tests or normal (busy) operations from the client VM, run
>> > atop on your storage nodes and observe your OSD HDDs. 
>

Re: [ceph-users] Question about how to start ceph OSDs with systemd

2016-07-11 Thread Ernst Pijper
Hi Manuel,

This is a well-known issue; you are definitely not the first one to hit this 
problem. Before Jewel, I (and others as well) added the line

ceph-disk activate all

to /etc/rc.local to get the OSDs running at boot. In Jewel, however, this 
doesn't work anymore. Now I add these lines to /etc/rc.local:

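# trigger ceph-disk activation for each device that ceph-disk lists as having a ceph journal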
for dev in $(ceph-disk list | grep "ceph journal" | awk '{print $1}')
do
   ceph-disk trigger $dev
done

The OSDs are now automatically mounted and started. After this, the systemctl 
commands also work.

Ernst


> On 8 jul. 2016, at 17:59, Manuel Lausch  wrote:
> 
> hi,
> 
> In the last days I do play around with ceph jewel on debian Jessie and CentOS 
> 7. Now I have a question about systemd on this Systems.
> 
> I installed ceph jewel (ceph version 10.2.2 
> (45107e21c568dd033c2f0a3107dec8f0b0e58374)) on debian Jessie and prepared 
> some OSDs. While playing around I decided to reinstall my operating system 
> (of course without deleting the OSD devices ). After the reinstallation of 
> ceph and put in the old ceph.conf I thought the previously prepared OSDs do 
> easily start and all will be fine after that.
> 
> With debian Wheezy and ceph firefly this worked well, but with the new 
> versions and systemd this doesn't work at all. Now what have I to do to get 
> the OSDs running again?
> 
> The following command didn't work and I didn't get any output from it.
>  systemctl start ceph-osd.target
> 
> And this is the output from systemctl status ceph-osd.target
> ● ceph-osd.target - ceph target allowing to start/stop all ceph-osd@.service 
> instances at once
>   Loaded: loaded (/usr/lib/systemd/system/ceph-osd.target; enabled; vendor 
> preset: enabled)
>   Active: active since Fri 2016-07-08 17:19:29 CEST; 36min ago
> 
> Jul 08 17:19:29 cs-dellbrick01.server.lan systemd[1]: Reached target ceph 
> target allowing to start/stop all ceph-osd@.service instances at once.
> Jul 08 17:19:29 cs-dellbrick01.server.lan systemd[1]: Starting ceph target 
> allowing to start/stop all ceph-osd@.service instances at once.
> Jul 08 17:31:15 cs-dellbrick01.server.lan systemd[1]: Reached target ceph 
> target allowing to start/stop all ceph-osd@.service instances at once.
> 
> 
> 
> thanks,
> Manuel
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fwd: Ceph OSD suicide himself

2016-07-11 Thread Lionel Bouton
On 11/07/2016 04:48, 한승진 wrote:
> Hi cephers.
>
> I need your help for some issues.
>
> The ceph cluster version is Jewel(10.2.1), and the filesytem is btrfs.
>
> I run 1 Mon and 48 OSD in 4 Nodes(each node has 12 OSDs).
>
> I've experienced one of OSDs was killed himself.
>
> Always it issued suicide timeout message.

This is probably a fragmentation problem : typical rbd access patterns
cause heavy BTRFS fragmentation.

If you already use the autodefrag mount option, you can try this which
performs much better for us :
https://github.com/jtek/ceph-utils/blob/master/btrfs-defrag-scheduler.rb

Note that it can take some time to fully defragment the filesystems but
it shouldn't put more stress than autodefrag while doing so.

If you don't already use it, set:
filestore btrfs snap = false
in ceph.conf and restart your OSDs.

Finally, if you use journals on the filesystem and not on dedicated
partitions, you'll have to recreate them with the NoCow attribute
(otherwise there's no way to defragment journals that doesn't kill
performance).
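
A rough sketch of what that could look like for a single OSD (assuming osd.12,
the default file-based journal path, and that the OSD is stopped first; note
that chattr +C only takes effect on new or empty files):

   ceph-osd -i 12 --flush-journal
   rm /var/lib/ceph/osd/ceph-12/journal
   touch /var/lib/ceph/osd/ceph-12/journal
   chattr +C /var/lib/ceph/osd/ceph-12/journal
   ceph-osd -i 12 --mkjournal
   # then start the OSD again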

Best regards,

Lionel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Slow performance into windows VM

2016-07-11 Thread K K

I have changed those params on all OSDs and restarted them:

osd_scrub_end_hour = 6
osd_scrub_load_threshold = 2.5
osd_scrub_sleep = 0.1

but ceph status still shows deep scrubbing:
ceph status
cluster 830beb43-9898-4fa9-98c1-ee04c1cdf69c
health HEALTH_OK
monmap e6: 3 mons at 
{block01=10.30.9.21:6789/0,object01=10.30.9.129:6789/0,object02=10.30.9.130:6789/0}
election epoch 238, quorum 0,1,2 block01,object01,object02
osdmap e7046: 18 osds: 18 up, 18 in
pgmap v9746276: 1280 pgs, 3 pools, 17440 GB data, 4346 kobjects
35049 GB used, 15218 GB / 50267 GB avail
1278 active+clean
2 active+clean+scrubbing+deep
client io 3909 kB/s rd, 30277 B/s wr, 23 op/s rd, 9 op/s wr

Now I have temporarily disabled deep-scrub via "ceph osd set nodeep-scrub",
but performance is still poor in the VM.

Monday, 11 July 2016, 12:38 +05:00 from Christian Balzer < ch...@gol.com >:
>
>
>Hello,
>
>On Mon, 11 Jul 2016 09:54:59 +0300 K K wrote:
>
>> 
>> > I hope the fastest of these MONs (CPU and storage) has the lowest IP
>> > number and thus is the leader.
>> no, the lowest IP has slowest CPU. But zabbix didn't show any load at all 
>> mons.
>
>In your use case and configuration no surprise, but again, the lowest IP
>will be leader by default and thus the busiest. 
>
>> > Also what Ceph, OS, kernel version?
>> 
>> ubuntu 16.04 kernel 4.4.0-22
>> 
>Check the ML archives, I remember people having performance issues with the
>4.4 kernels.
>
>Still don't know your Ceph version, is it the latest Jewel?
>
>> > Two GbE ports, given the "frontend" up there with the MON description I
>> > assume that's 1 port per client (front) and cluster (back) network?
>> yes, one GbE for ceph client, one GbE for back network.
>OK, so (from a single GbE client) 100MB/s at most.
>
>> > Is there any other client on than that Windows VM on your Ceph cluster?
>> Yes, another one instance but without load.
>OK.
>
>> > Is Ceph understanding this now?
>> > Other than that, the queue options aren't likely to do much good with pure
>> >HDD OSDs.
>> 
>> I can't find those parameter in running config:
>> ceph --admin-daemon /var/run/ceph/ceph-mon.block01.asok config show|grep 
>> "filestore_queue"
>
>These are OSD parameters, you need to query an OSD daemon. 
>
>> "filestore_queue_max_ops": "3000",
>> "filestore_queue_max_bytes": "1048576000",
>> "filestore_queue_max_delay_multiple": "0",
>> "filestore_queue_high_delay_multiple": "0",
>> "filestore_queue_low_threshhold": "0.3",
>> "filestore_queue_high_threshhold": "0.9",
>> > That should be 512, 1024 really with one RBD pool.
>> 
>> Yes, I know. Today for test I added scbench pool with 128 pg
>> There are output status and osd tree:
>> ceph status
>> cluster 830beb43-9898-4fa9-98c1-ee04c1cdf69c
>> health HEALTH_OK
>> monmap e6: 3 mons at 
>> {block01=10.30.9.21:6789/0,object01=10.30.9.129:6789/0,object02=10.30.9.130:6789/0}
>> election epoch 238, quorum 0,1,2 block01,object01,object02
>> osdmap e6887: 18 osds: 18 up, 18 in
>> pgmap v9738812: 1280 pgs, 3 pools, 17440 GB data, 4346 kobjects
>> 35049 GB used, 15218 GB / 50267 GB avail
>> 1275 active+clean
>> 3 active+clean+scrubbing+deep
>> 2 active+clean+scrubbing
>>
>Check the ML archives and restrict scrubs to off-peak hours as well as
>tune things to keep their impact low.
>
>Scrubbing is a major performance killer, especially on non-SSD journal
>OSDs and with older Ceph versions and/or non-tuned parameters:
>---
>osd_scrub_end_hour = 6
>osd_scrub_load_threshold = 2.5
>osd_scrub_sleep = 0.1
>---
>
>> client io 5030 kB/s rd, 1699 B/s wr, 19 op/s rd, 0 op/s wr
>> 
>> ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY 
>> -1 54.0 root default 
>> -2 27.0 host cn802 
>> 0 3.0 osd.0 up 1.0 1.0 
>> 2 3.0 osd.2 up 1.0 1.0 
>> 4 3.0 osd.4 up 1.0 1.0 
>> 6 3.0 osd.6 up 0.89995 1.0 
>> 8 3.0 osd.8 up 1.0 1.0 
>> 10 3.0 osd.10 up 1.0 1.0 
>> 12 3.0 osd.12 up 0.8 1.0 
>> 16 3.0 osd.16 up 1.0 1.0 
>> 18 3.0 osd.18 up 0.90002 1.0 
>> -3 27.0 host cn803 
>> 1 3.0 osd.1 up 1.0 1.0 
>> 3 3.0 osd.3 up 0.95316 1.0 
>> 5 3.0 osd.5 up 1.0 1.0 
>> 7 3.0 osd.7 up 1.0 1.0 
>> 9 3.0 osd.9 up 1.0 1.0 
>> 11 3.0 osd.11 up 0.95001 1.0 
>> 13 3.0 osd.13 up 1.0 1.0 
>> 17 3.0 osd.17 up 0.84999 1.0 
>> 19 3.0 osd.19 up 1.0 1.0
>> > Wrong way to test this, test it from a monitor node, another client node
>> > (like your openstack nodes).
>> > In your 2 node cluster half of the reads or writes will be local, very
>> > much skewing your results.
>> I have been tested from copmute node also and have same result. 80-100Mb/sec
>> 
>That's about as good as it gets (not 148MB/s, though!).
>But rados bench is not the same as real client I/O.
>
>> > Very high max latency, telling us that your cluster ran out of steam at
>> some point.
>> 
>> I copying data from my windows instance right now.
>
>Re-do any testing when you

Re: [ceph-users] Fwd: Ceph OSD suicide himself

2016-07-11 Thread Brad Hubbard
On Mon, Jul 11, 2016 at 7:18 PM, Lionel Bouton
 wrote:
> On 11/07/2016 04:48, 한승진 wrote:
>> Hi cephers.
>>
>> I need your help for some issues.
>>
>> The ceph cluster version is Jewel(10.2.1), and the filesytem is btrfs.
>>
>> I run 1 Mon and 48 OSD in 4 Nodes(each node has 12 OSDs).
>>
>> I've experienced one of OSDs was killed himself.
>>
>> Always it issued suicide timeout message.
>
> This is probably a fragmentation problem : typical rbd access patterns
> cause heavy BTRFS fragmentation.

To the extent that operations take over 120 seconds to complete? Really?

I have no experience with BTRFS, but I had heard that performance can "fall
off a cliff"; still, I didn't know it was that bad.

-- 
Cheers,
Brad

>
> If you already use the autodefrag mount option, you can try this which
> performs much better for us :
> https://github.com/jtek/ceph-utils/blob/master/btrfs-defrag-scheduler.rb
>
> Note that it can take some time to fully defragment the filesystems but
> it shouldn't put more stress than autodefrag while doing so.
>
> If you don't already use it, set :
> filestore btrfs snap = false
> in ceph.conf an restart your OSDs.
>
> Finally if you use journals on the filesystem and not on dedicated
> partitions, you'll have to recreate them with the NoCow attribute
> (there's no way to defragment journals in any way that doesn't kill
> performance otherwise).
>
> Best regards,
>
> Lionel
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Error EPERM when running ceph tell command

2016-07-11 Thread Andrei Mikhailovsky
Hello again 

Any thoughts on this issue? 

Cheers 

Andrei 

> From: "Andrei Mikhailovsky" 
> To: "ceph-users" 
> Sent: Wednesday, 22 June, 2016 18:02:28
> Subject: [ceph-users] Error EPERM when running ceph tell command

> Hi

> I am trying to run an osd level benchmark but get the following error:

> # ceph tell osd.3 bench
> Error EPERM: problem getting command descriptions from osd.3

> I am running Jewel 10.2.2 on Ubuntu 16.04 servers. Has the syntax change or 
> do I
> have an issue?

> Cheers
> Andrei

> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Misdirected clients due to kernel bug?

2016-07-11 Thread Simon Engelsman
Hello everyone,

Last week, while deploying new disks in our cluster, we bumped into what
we believe is a kernel bug. Everything is working fine now, but we
wanted to share our experience and see if other people have experienced
similar behaviour.

Steps we followed were:

1) First we removed DNE osds (that had been previously removed
from the cluster) to reuse their ids.

ceph osd crush remove osd.6
ceph auth del osd.6
ceph osd rm 6

2) Then we deployed new disks with ceph-deploy

ceph-deploy --overwrite-conf osd create ds1-ceph01:sda

We have two different pools on the cluster, hence we used the option

osd crush update on start = false

So we could later manually add OSDs to the desired pool with

ceph osd crush add osd.6 0.9 host=ds1-ceph01


We added two disks. The first one looked fine; however, after adding the
second disk, ceph -s started to show odd info such as some PGs in
backfill_toofull. The odd thing was that the OSD supposedly full
was only 81% full, and the ratios are full_ratio 0.95, nearfull_ratio 0.88.

Also, monitor logs were getting flooded with messages like:

misdirected client.708156.1:1609543462 pg 2.1eff89a7 to osd.83 not
[1,83,93] in e154784/154784

On the clients we got write errors:

[20882274.721623] rbd: rbd28: result -6 xferred 2000
[20882274.773296] rbd: rbd28: write 2000 at aef404000 (4000)
[20882274.773304] rbd: rbd28: result -6 xferred 2000
[20882274.826057] rbd: rbd28: write 2000 at aef404000 (4000)
[20882274.826064] rbd: rbd28: result -6 xferred 2000

On OSDs, most of them were running:
ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)
and few of them (including the new ones) with:
ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403)

On the clients, we were running kernel 4.1.1.

Once we rebooted the clients with kernel 4.1.13, the errors disappeared.

The misdirect messages made us think that there were incorrect/outdated
copies of the cluster map.

Any insights would be very welcome.

Regards,
Simon Engelsman

Greenhost - sustainable hosting & digital security
https://greenhost.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS and WORM

2016-07-11 Thread John Spray
On Mon, Jul 11, 2016 at 9:28 AM, Xusangdi  wrote:
> Hi Cephers,
>
>
>
> I’m planning to set up samba/nfs based on CephFS kernel mount. The
> WORM(write once read many) feature is required but I’m not
>
> sure if CephFS officially supports it, any suggestions? Thanks in advance.

There's nothing in CephFS to support WORM, but apparently there is a
module in Samba (https://wiki.samba.org/index.php/VFS/vfs_worm), maybe
someone here has tried it?
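
(Going by that wiki page, a minimal smb.conf sketch would look something like
the following; the share name and path are hypothetical, and worm:grace_period
is the number of seconds during which a newly written file stays writable:

   [archive]
   path = /mnt/cephfs/archive
   vfs objects = worm
   worm:grace_period = 300

Untested here, so treat it as a starting point only.)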

John

>
>
> Regards,
>
> ---Sandy
>
>
>
> -
> 本邮件及其附件含有杭州华三通信技术有限公司的保密信息,仅限于发送给上面地址中列出
> 的个人或群组。禁止任何其他人以任何形式使用(包括但不限于全部或部分地泄露、复制、
> 或散发)本邮件中的信息。如果您错收了本邮件,请您立即电话或邮件通知发件人并删除本
> 邮件!
> This e-mail and its attachments contain confidential information from H3C,
> which is
> intended only for the person or entity whose address is listed above. Any
> use of the
> information contained herein in any way (including, but not limited to,
> total or partial
> disclosure, reproduction, or dissemination) by persons other than the
> intended
> recipient(s) is prohibited. If you receive this e-mail in error, please
> notify the sender
> by phone or email immediately and delete it!
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-fuse segfaults ( jewel 10.2.2)

2016-07-11 Thread John Spray
On Mon, Jul 11, 2016 at 8:04 AM, Goncalo Borges
 wrote:
> Hi John...
>
> Thank you for replying.
>
> Here are the results of the tests you asked for, but I do not see anything abnormal.

Thanks for running through that.  Yes, nothing in the output struck me
as unreasonable either :-/

> Actually, your suggestions made me see that:
>
> 1) ceph-fuse 9.2.0 presents the same behaviour but with lower memory
> consumption, probably low enough that it doesn't break ceph-fuse on
> our machines with less memory.
>
> 2) I see a tremendous number of  ceph-fuse threads launched (around 160).

Unless you're using the async messenger, Ceph creates threads for each
OSD connection, so it's normal to have a significant number of threads
(e.g. if you had about 80 OSDs that would explain your thread count).
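
(For what it's worth, a minimal ceph.conf sketch for switching a client to the
async messenger, assuming the Jewel-era option name, would be:

   [client]
   ms type = async

That only changes the messenger threading model; it is not a hard cap on
thread count or memory.)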

John

> # ps -T -p 3230 -o command,ppid,pid,spid,vsize,rss,%mem,%cpu | wc -l
> 157
>
> # ps -T -p 3230 -o command,ppid,pid,spid,vsize,rss,%mem,%cpu | head -n 10
> COMMAND  PPID   PID  SPIDVSZ   RSS %MEM %CPU
> ceph-fuse --id mount_user - 1  3230  3230 9935240 339780  0.6 0.0
> ceph-fuse --id mount_user - 1  3230  3231 9935240 339780  0.6 0.1
> ceph-fuse --id mount_user - 1  3230  3232 9935240 339780  0.6 0.0
> ceph-fuse --id mount_user - 1  3230  3233 9935240 339780  0.6 0.0
> ceph-fuse --id mount_user - 1  3230  3234 9935240 339780  0.6 0.0
> ceph-fuse --id mount_user - 1  3230  3235 9935240 339780  0.6 0.0
> ceph-fuse --id mount_user - 1  3230  3236 9935240 339780  0.6 0.0
> ceph-fuse --id mount_user - 1  3230  3237 9935240 339780  0.6 0.0
> ceph-fuse --id mount_user - 1  3230  3238 9935240 339780  0.6 0.0
>
>
> I do not see a way to actually limit the number of ceph-fuse threads
> launched  or to limit the max vm size each thread should take.
>
> Do you know how to limit those options.
>
> Cheers
>
> Goncalo
>
>
>
>
> 1.> Try running ceph-fuse with valgrind --tool=memcheck to see if it's
> leaking
>
> I have launched ceph-fuse with valgrind in the cluster where there is
> sufficient memory available, and therefore, there is no object cacher
> segfault.
>
> $ valgrind --log-file=/tmp/valgrind-ceph-fuse-10.2.2.txt --tool=memcheck
> ceph-fuse --id mount_user -k /etc/ceph/ceph.client.mount_user.keyring -m
> X.X.X.8:6789 -r /cephfs /coepp/cephfs
>
> This is the output which I get once I unmount the file system after user
> application execution
>
> # cat valgrind-ceph-fuse-10.2.2.txt
> ==12123== Memcheck, a memory error detector
> ==12123== Copyright (C) 2002-2012, and GNU GPL'd, by Julian Seward et al.
> ==12123== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info
> ==12123== Command: ceph-fuse --id mount_user -k
> /etc/ceph/ceph.client.mount_user.keyring -m 192.231.127.8:6789 -r /cephfs
> /coepp/cephfs
> ==12123== Parent PID: 11992
> ==12123==
> ==12123==
> ==12123== HEAP SUMMARY:
> ==12123== in use at exit: 29,129 bytes in 397 blocks
> ==12123==   total heap usage: 14,824 allocs, 14,427 frees, 648,030 bytes
> allocated
> ==12123==
> ==12123== LEAK SUMMARY:
> ==12123==definitely lost: 16 bytes in 1 blocks
> ==12123==indirectly lost: 0 bytes in 0 blocks
> ==12123==  possibly lost: 11,705 bytes in 273 blocks
> ==12123==still reachable: 17,408 bytes in 123 blocks
> ==12123== suppressed: 0 bytes in 0 blocks
> ==12123== Rerun with --leak-check=full to see details of leaked memory
> ==12123==
> ==12123== For counts of detected and suppressed errors, rerun with: -v
> ==12123== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 8 from 6)
> ==12126==
> ==12126== HEAP SUMMARY:
> ==12126== in use at exit: 9,641 bytes in 73 blocks
> ==12126==   total heap usage: 31,363,579 allocs, 31,363,506 frees,
> 41,389,143,617 bytes allocated
> ==12126==
> ==12126== LEAK SUMMARY:
> ==12126==definitely lost: 28 bytes in 1 blocks
> ==12126==indirectly lost: 0 bytes in 0 blocks
> ==12126==  possibly lost: 0 bytes in 0 blocks
> ==12126==still reachable: 9,613 bytes in 72 blocks
> ==12126== suppressed: 0 bytes in 0 blocks
> ==12126== Rerun with --leak-check=full to see details of leaked memory
> ==12126==
> ==12126== For counts of detected and suppressed errors, rerun with: -v
> ==12126== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 17 from 9)
>
> --- * ---
>
> 2.>  Inspect inode count (ceph daemon  status) to see if it's
> obeying its limit
>
> This is the output I get once ceph-fuse is mounted but no user application
> is running
>
> # ceph daemon /var/run/ceph/ceph-client.mount_user.asok status
> {
> "metadata": {
> "ceph_sha1": "45107e21c568dd033c2f0a3107dec8f0b0e58374",
> "ceph_version": "ceph version 10.2.2
> (45107e21c568dd033c2f0a3107dec8f0b0e58374)",
> "entity_id": "mount_user",
> "hostname": "",
> "mount_point": "\/coepp\/cephfs",
> "root": "\/cephfs"
> },
> "dentry_count": 0,
> "dentry_pinned_count": 0,
>   

Re: [ceph-users] Filestore merge and split

2016-07-11 Thread Nick Fisk
I believe splitting will happen on writes, merging I think only happens on 
deletions.

 

From: Paul Renner [mailto:renner...@gmail.com] 
Sent: 10 July 2016 19:40
To: n...@fisk.me.uk
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Filestore merge and split

 

Thanks...

Do you know when splitting or merging will happen? Is it enough that a 
directory is read, eg. through scrub? If possible I would like to initiate the 
process

Regards

Paul 

 

On Sun, Jul 10, 2016 at 10:47 AM, Nick Fisk mailto:n...@fisk.me.uk> > wrote:

You need to set the option in the ceph.conf and restart the OSD I think. But it 
will only take effect when splitting or merging in the future, it won't adjust 
the current folder layout.


> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com 
>  ] On Behalf Of Paul Renner
> Sent: 09 July 2016 22:18
> To: ceph-users@lists.ceph.com  
> Subject: [ceph-users] Filestore merge and split
>
> Hello cephers
> we have many (millions,  small objects in our RadosGW system and are getting 
> not very good write performance, 100-200 PUTs /sec.
>
> I have read on the mailinglist that one possible tuning option would be to 
> increase the max. number of files per directory on OSDs with
> eg.
>
> filestore merge threshold = 40
> filestore split multiple = 8
> Now my question is, do we need to rebuild the OSDs to make this effective? Or 
> is it a runtime setting?
> I'm asking because when setting this with injectargs I get the message 
> "unchangeable" back.
> Thanks for any insight.



 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS and WORM

2016-07-11 Thread Xusangdi
Thank you for the confirmation, John!

As we have both CIFS and NFS users, I was hoping the feature would be implemented
at the CephFS layer :<

Regards,
---Sandy

> -Original Message-
> From: John Spray [mailto:jsp...@redhat.com]
> Sent: Monday, July 11, 2016 7:28 PM
> To: xusangdi 11976 (RD)
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] CephFS and WORM
> 
> On Mon, Jul 11, 2016 at 9:28 AM, Xusangdi  wrote:
> > Hi Cephers,
> >
> >
> >
> > I’m planning to set up samba/nfs based on CephFS kernel mount. The
> > WORM(write once read many) feature is required but I’m not
> >
> > sure if CephFS officially supports it, any suggestions? Thanks in advance.
> 
> There's nothing in CephFS to support WORM, but apparently there is a module 
> in Samba
> (https://wiki.samba.org/index.php/VFS/vfs_worm), maybe someone here has tried 
> it?
> 
> John
> 
> >
> >
> > Regards,
> >
> > ---Sandy
> >
> >
> >
> > --
> > ---
> > 本邮件及其附件含有杭州华三通信技术有限公司的保密信息,仅限于发送给上面地址中列出
> > 的个人或群组。禁止任何其他人以任何形式使用(包括但不限于全部或部分地泄露、复制、
> > 或散发)本邮件中的信息。如果您错收了本邮件,请您立即电话或邮件通知发件人并删除本
> > 邮件!
> > This e-mail and its attachments contain confidential information from
> > H3C, which is intended only for the person or entity whose address is
> > listed above. Any use of the information contained herein in any way
> > (including, but not limited to, total or partial disclosure,
> > reproduction, or dissemination) by persons other than the intended
> > recipient(s) is prohibited. If you receive this e-mail in error,
> > please notify the sender by phone or email immediately and delete it!
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Filestore merge and split

2016-07-11 Thread Anand Bhat
Merges happen either due to movement of objects caused by CRUSH recalculation
(when the cluster grows or shrinks for various reasons) or due to deletion of
objects.

Splits happen when portions of objects/volumes that were previously sparse
get populated. Each RADOS object is by default a 4MB chunk, and volumes are
composed of these objects. No RADOS object is created when there has been no
write to that region. When a write spans sparse portions of the volume, new
RADOS objects are created under the directory that maps to the PG to which the
object belongs.
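
If I remember the filestore directory-hashing logic correctly, the split point
for the settings quoted earlier in this thread would work out roughly as:

   filestore_split_multiple * abs(filestore_merge_threshold) * 16
   = 8 * 40 * 16 = 5120 files per subdirectory before a split

so treat that formula as a rule of thumb rather than a guarantee.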

Regards,
Anand

On Mon, Jul 11, 2016 at 5:38 PM, Nick Fisk  wrote:

> I believe splitting will happen on writes, merging I think only happens on
> deletions.
>
>
>
> *From:* Paul Renner [mailto:renner...@gmail.com]
> *Sent:* 10 July 2016 19:40
> *To:* n...@fisk.me.uk
> *Cc:* ceph-users@lists.ceph.com
> *Subject:* Re: [ceph-users] Filestore merge and split
>
>
>
> Thanks...
>
> Do you know when splitting or merging will happen? Is it enough that a
> directory is read, eg. through scrub? If possible I would like to initiate
> the process
>
> Regards
>
> Paul
>
>
>
> On Sun, Jul 10, 2016 at 10:47 AM, Nick Fisk  wrote:
>
> You need to set the option in the ceph.conf and restart the OSD I think.
> But it will only take effect when splitting or merging in the future, it
> won't adjust the current folder layout.
>
>
> > -Original Message-
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> Of Paul Renner
> > Sent: 09 July 2016 22:18
> > To: ceph-users@lists.ceph.com
> > Subject: [ceph-users] Filestore merge and split
> >
> > Hello cephers
> > we have many (millions,  small objects in our RadosGW system and are
> getting not very good write performance, 100-200 PUTs /sec.
> >
> > I have read on the mailinglist that one possible tuning option would be
> to increase the max. number of files per directory on OSDs with
> > eg.
> >
> > filestore merge threshold = 40
> > filestore split multiple = 8
> > Now my question is, do we need to rebuild the OSDs to make this
> effective? Or is it a runtime setting?
> > I'm asking because when setting this with injectargs I get the message
> "unchangeable" back.
> > Thanks for any insight.
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


-- 

Never say never.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSPF to the host

2016-07-11 Thread Saverio Proto
> I'm looking at the Dell S-ON switches which we can get in a Cumulus
> version. Any pro's and con's of using Cumulus vs old school switch OS's you
> may have come across?

Nothing to declare here. Once configured properly the hardware works
as expected. I never used Dell, I used switches from Quanta.

Saverio
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] New to Ceph - osd autostart problem

2016-07-11 Thread Dirk Laurenz

Hello,


I'm new to Ceph and am trying some first steps with it to understand the 
concepts.


My setup is, for now, completely in VMs.


I deployed (with ceph-deploy) three monitors and three OSD hosts (3+3 VMs).

My first test was to find out whether everything comes back online after a 
system restart. This works fine for the monitors, but fails for the 
OSDs; I have to start them manually.



The OS is Debian Jessie; Ceph is the current release.


Where can I find out what's going wrong?


Dirk

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] New to Ceph - osd autostart problem

2016-07-11 Thread Oliver Dzombic
Hi Dirk,

Without any information, it's impossible to tell you anything.

Please provide us with some detailed information about what is going wrong,
including error messages and so on.

As an admin you should be familiar enough with your system to give us
more information than just "it's not working". As you know, that
alone does not help.
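
For example, output from something like the following (using whatever OSD ids
exist on the host; osd.0 here is just a placeholder) would be a good start:

   systemctl status ceph-osd@0
   journalctl -b -u ceph-osd@0
   ceph-disk list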

-- 
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive

mailto:i...@ip-interactive.de

Anschrift:

IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402 beim Amtsgericht Hanau
Geschäftsführung: Oliver Dzombic

Steuer Nr.: 35 236 3622 1
UST ID: DE274086107


Am 11.07.2016 um 14:32 schrieb Dirk Laurenz:
> Hello,
> 
> 
> i'm new to ceph an try to do some first steps with ceph to understand
> concepts.
> 
> my setup is at first completly in vm
> 
> 
> i deployed (with ceph-deploy) three monitors and three osd hosts. (3+3 vms)
> 
> my frist test was to find out, if everything comes back online after a
> system restart. this works fine for the monitors, but fails for the
> osds. i have to start them manually.
> 
> 
> OS is debian jessie, ceph is the current release
> 
> 
> Where can find out, what's going wrong
> 
> 
> Dirk
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Slow performance into windows VM

2016-07-11 Thread K K

I have tested the Windows instance with CrystalDiskMark. The results are:

Sequential Read : 43.049 MB/s
Sequential Write : 45.181 MB/s
Random Read 512KB : 78.660 MB/s
Random Write 512KB : 39.292 MB/s
Random Read 4KB (QD=1) : 3.511 MB/s [ 857.3 IOPS]
Random Write 4KB (QD=1) : 1.380 MB/s [ 337.0 IOPS]
Random Read 4KB (QD=32) : 32.220 MB/s [ 7866.1 IOPS]
Random Write 4KB (QD=32) : 12.564 MB/s [ 3067.4 IOPS]
Test : 4000 MB [D: 97.5% (15699.7/16103.1 GB)] (x3)

>Monday, 11 July 2016, 12:38 +05:00 from Christian Balzer :
>
>
>Hello,
>
>On Mon, 11 Jul 2016 09:54:59 +0300 K K wrote:
>
>> 
>> > I hope the fastest of these MONs (CPU and storage) has the lowest IP
>> > number and thus is the leader.
>> no, the lowest IP has slowest CPU. But zabbix didn't show any load at all 
>> mons.
>
>In your use case and configuration no surprise, but again, the lowest IP
>will be leader by default and thus the busiest. 
>
>> > Also what Ceph, OS, kernel version?
>> 
>> ubuntu 16.04 kernel 4.4.0-22
>> 
>Check the ML archives, I remember people having performance issues with the
>4.4 kernels.
>
>Still don't know your Ceph version, is it the latest Jewel?
>
>> > Two GbE ports, given the "frontend" up there with the MON description I
>> > assume that's 1 port per client (front) and cluster (back) network?
>> yes, one GbE for ceph client, one GbE for back network.
>OK, so (from a single GbE client) 100MB/s at most.
>
>> > Is there any other client on than that Windows VM on your Ceph cluster?
>> Yes, another one instance but without load.
>OK.
>
>> > Is Ceph understanding this now?
>> > Other than that, the queue options aren't likely to do much good with pure
>> >HDD OSDs.
>> 
>> I can't find those parameter in running config:
>> ceph --admin-daemon /var/run/ceph/ceph-mon.block01.asok config show|grep 
>> "filestore_queue"
>
>These are OSD parameters, you need to query an OSD daemon. 
>
>> "filestore_queue_max_ops": "3000",
>> "filestore_queue_max_bytes": "1048576000",
>> "filestore_queue_max_delay_multiple": "0",
>> "filestore_queue_high_delay_multiple": "0",
>> "filestore_queue_low_threshhold": "0.3",
>> "filestore_queue_high_threshhold": "0.9",
>> > That should be 512, 1024 really with one RBD pool.
>> 
>> Yes, I know. Today, for testing, I added an scbench pool with 128 PGs.
>> Here are the status output and the osd tree:
>> ceph status
>> cluster 830beb43-9898-4fa9-98c1-ee04c1cdf69c
>> health HEALTH_OK
>> monmap e6: 3 mons at 
>> {block01=10.30.9.21:6789/0,object01=10.30.9.129:6789/0,object02=10.30.9.130:6789/0}
>> election epoch 238, quorum 0,1,2 block01,object01,object02
>> osdmap e6887: 18 osds: 18 up, 18 in
>> pgmap v9738812: 1280 pgs, 3 pools, 17440 GB data, 4346 kobjects
>> 35049 GB used, 15218 GB / 50267 GB avail
>> 1275 active+clean
>> 3 active+clean+scrubbing+deep
>> 2 active+clean+scrubbing
>>
>Check the ML archives and restrict scrubs to off-peak hours as well as
>tune things to keep their impact low.
>
>Scrubbing is a major performance killer, especially on non-SSD journal
>OSDs and with older Ceph versions and/or non-tuned parameters:
>---
>osd_scrub_end_hour = 6
>osd_scrub_load_threshold = 2.5
>osd_scrub_sleep = 0.1
>---
>
>> client io 5030 kB/s rd, 1699 B/s wr, 19 op/s rd, 0 op/s wr
>> 
>> ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY 
>> -1 54.0 root default 
>> -2 27.0 host cn802 
>> 0 3.0 osd.0 up 1.0 1.0 
>> 2 3.0 osd.2 up 1.0 1.0 
>> 4 3.0 osd.4 up 1.0 1.0 
>> 6 3.0 osd.6 up 0.89995 1.0 
>> 8 3.0 osd.8 up 1.0 1.0 
>> 10 3.0 osd.10 up 1.0 1.0 
>> 12 3.0 osd.12 up 0.8 1.0 
>> 16 3.0 osd.16 up 1.0 1.0 
>> 18 3.0 osd.18 up 0.90002 1.0 
>> -3 27.0 host cn803 
>> 1 3.0 osd.1 up 1.0 1.0 
>> 3 3.0 osd.3 up 0.95316 1.0 
>> 5 3.0 osd.5 up 1.0 1.0 
>> 7 3.0 osd.7 up 1.0 1.0 
>> 9 3.0 osd.9 up 1.0 1.0 
>> 11 3.0 osd.11 up 0.95001 1.0 
>> 13 3.0 osd.13 up 1.0 1.0 
>> 17 3.0 osd.17 up 0.84999 1.0 
>> 19 3.0 osd.19 up 1.0 1.0
>> > Wrong way to test this, test it from a monitor node, another client node
>> > (like your openstack nodes).
>> > In your 2 node cluster half of the reads or writes will be local, very
>> > much skewing your results.
>> I have tested from the compute node also and have the same result: 80-100 MB/sec.
>> 
>That's about as good as it gets (not 148MB/s, though!).
>But rados bench is not the same as real client I/O.
>
>> > Very high max latency, telling us that your cluster ran out of steam at
>> some point.
>> 
>> I copying data from my windows instance right now.
>
>Re-do any testing when you've stopped all scrubbing.
>
>> > I'd de-frag anyway, just to rule that out.
>> 
>> 
>> >When doing your tests or normal (busy) operations from the client VM, run
>> > atop on your storage nodes and observe your OSD HDDs. 
>> > Do they get busy, around 100%?
>> 
>> Yes, high IO load (600-800 io).  But this is very strange on SATA 

Re: [ceph-users] OSPF to the host

2016-07-11 Thread Daniel Gryniewicz

On 07/11/2016 08:23 AM, Saverio Proto wrote:

I'm looking at the Dell S-ON switches which we can get in a Cumulus
version. Any pros and cons of using Cumulus vs old-school switch OSes you
may have come across?


Nothing to declare here. Once configured properly the hardware works
as expected. I never used Dell, I used switches from Quanta.

Saverio


I've had good experiences with Dell switches in the past, including routing.

Daniel

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] New to Ceph - osd autostart problem

2016-07-11 Thread Dirk Laurenz

Hi,

what i do to reproduce the failure:

root@cephadmin:~# ceph osd tree
ID WEIGHT  TYPE NAME  UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 0.26340 root default
-2 0.08780 host cephosd01
 0 0.04390 osd.0   up  1.0  1.0
 1 0.04390 osd.1   up  1.0  1.0
-3 0.08780 host cephosd02
 2 0.04390 osd.2   up  1.0  1.0
 3 0.04390 osd.3   up  1.0  1.0
-4 0.08780 host cephosd03
 4 0.04390 osd.4   up  1.0  1.0
 5 0.04390 osd.5   up  1.0  1.0
root@cephadmin:~# ssh cephosd01 shutdown -r
Shutdown scheduled for Mon 2016-07-11 14:44:17 CEST, use 'shutdown -c' 
to cancel.

root@cephadmin:~# ssh cephosd01 uptime
 14:44:45 up 0 min,  0 users,  load average: 0.44, 0.10, 0.03
root@cephadmin:~# ceph osd tree
ID WEIGHT  TYPE NAME  UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 0.26340 root default
-2 0.08780 host cephosd01
 0 0.04390 osd.0 down  1.0  1.0
 1 0.04390 osd.1 down  1.0  1.0
-3 0.08780 host cephosd02
 2 0.04390 osd.2   up  1.0  1.0
 3 0.04390 osd.3   up  1.0  1.0
-4 0.08780 host cephosd03
 4 0.04390 osd.4   up  1.0  1.0
 5 0.04390 osd.5   up  1.0  1.0

here are some logs of osd.0

root@cephosd01:~# tail /var/log/ceph/ceph-osd.0.log
2016-07-11 14:44:24.588509 7f8228d72800  1 -- :/2152 shutdown complete.
2016-07-11 14:44:39.243944 7efe11c0e800  0 set uid:gid to 64045:64045 
(ceph:ceph)
2016-07-11 14:44:39.258622 7efe11c0e800  0 ceph version 10.2.2 
(45107e21c568dd033c2f0a3107dec8f0b0e58374), process ceph-osd, pid 1018
2016-07-11 14:44:39.268743 7efe11c0e800 -1  ** ERROR: unable to open OSD 
superblock on /var/lib/ceph/osd/ceph-0: (2) No such file or directory
2016-07-11 14:44:40.578479 7f05a6b42800  0 set uid:gid to 64045:64045 
(ceph:ceph)
2016-07-11 14:44:40.578591 7f05a6b42800  0 ceph version 10.2.2 
(45107e21c568dd033c2f0a3107dec8f0b0e58374), process ceph-osd, pid 1177
2016-07-11 14:44:40.578771 7f05a6b42800 -1  ** ERROR: unable to open OSD 
superblock on /var/lib/ceph/osd/ceph-0: (2) No such file or directory
2016-07-11 14:44:41.794321 7fc756fc8800  0 set uid:gid to 64045:64045 
(ceph:ceph)
2016-07-11 14:44:41.794423 7fc756fc8800  0 ceph version 10.2.2 
(45107e21c568dd033c2f0a3107dec8f0b0e58374), process ceph-osd, pid 1378
2016-07-11 14:44:41.794601 7fc756fc8800 -1  ** ERROR: unable to open OSD 
superblock on /var/lib/ceph/osd/ceph-0: (2) No such file or directory


but i can start it manually...

root@cephosd01:~# mount /dev/sdb1 /var/lib/ceph/osd/ceph-0
root@cephosd01:~# ceph-osd -i 0
starting osd.0 at :/0 osd_data /var/lib/ceph/osd/ceph-0 
/var/lib/ceph/osd/ceph-0/journal


root@cephadmin:~# ceph osd tree
ID WEIGHT  TYPE NAME  UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 0.26340 root default
-2 0.08780 host cephosd01
 0 0.04390 osd.0   up  1.0  1.0
 1 0.04390 osd.1 down  1.0  1.0
-3 0.08780 host cephosd02
 2 0.04390 osd.2   up  1.0  1.0
 3 0.04390 osd.3   up  1.0  1.0
-4 0.08780 host cephosd03
 4 0.04390 osd.4   up  1.0  1.0
 5 0.04390 osd.5   up  1.0  1.0

The OSDs have been created with:

ceph-deploy osd prepare cephosd01:sdb cephosd01:sdc

I'm not sure where to search...


Dirk





On 11.07.2016 at 14:35, Oliver Dzombic wrote:

Hi Dirk,

Without any information, it's impossible to tell you anything.

Please provide us some detailed information about what is going wrong,
including error messages and so on.

As an admin you should be familiar enough with your system to give us
more information than just "it's not working". As you know, this
information does not help.



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] New to Ceph - osd autostart problem

2016-07-11 Thread George Shuklin

Short story how OSDs are started in systemd environments:

Ceph OSD partitions have a specific typecode (partition type
4FBD7E29-9D25-41B8-AFD0-062C0CEFF05D). It is handled by udev rules shipped
by the ceph package:

/lib/udev/rules.d/95-ceph-osd.rules

They set up the proper owner/group for this disk ('ceph' instead of 'root')
and call /usr/sbin/ceph-disk trigger.


ceph-disk triggers the creation of an instance of the ceph-disk@ systemd unit
(to mount the disk to /var/lib/ceph/osd/...), and of ceph-osd@ (I'm not sure
about the whole sequence of events).


Basically, to make OSDs autostart, they NEED to have the proper typecode on
their partition. If you are using something different (like a 'directory-based
OSD') you should enable OSD autostart explicitly:


systemctl enable ceph-osd@42
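
For reference, checking and (if needed) setting that typecode with sgdisk might
look like this; /dev/sdb and partition number 1 are placeholders for the actual
OSD data disk:

# show the current partition type GUID
sgdisk --info=1 /dev/sdb
# set the Ceph OSD data typecode so the udev rule picks the disk up at boot
sgdisk --typecode=1:4fbd7e29-9d25-41b8-afd0-062c0ceff05d /dev/sdb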


On 07/11/2016 03:32 PM, Dirk Laurenz wrote:

Hello,


I'm new to Ceph and trying to take some first steps with Ceph to understand
the concepts.


My setup is, at first, completely in VMs.


I deployed (with ceph-deploy) three monitors and three OSD hosts (3+3
VMs).


My first test was to find out if everything comes back online after a
system restart. This works fine for the monitors, but fails for the
OSDs; I have to start them manually.



OS is Debian Jessie, Ceph is the current release.


Where can I find out what's going wrong?


Dirk

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] drop i386 support

2016-07-11 Thread George Shuklin

On 07/11/2016 09:57 AM, kefu chai wrote:

Hi Cephers,

I am proposing to drop support for i386, as we don't compile Ceph with
any i386 gitbuilder now[1] and hence don't test the i386 builds on
sepia on a regular basis. Also, based on the assumption that people
don't use i386 in production, I think we can drop it from the minimum
hardware document[2]?

And we won't explicitly disable the i386 build in code if we decide to
drop i386 support, as we always try to be portable where possible.
We would just no longer claim i386 as an officially supported arch.

What do you think?

---
[1] http://ceph.com/gitbuilder.cgi
[2] 
http://docs.ceph.com/docs/master/start/hardware-recommendations/#minimum-hardware-recommendations


I think no one cares about the 32-bit server part, but client pieces should
keep compatibility for a few more versions, at least. You never know what
kind of grue is lurking on the client side.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSPF to the host

2016-07-11 Thread Nick Fisk
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
> Daniel Gryniewicz
> Sent: 11 July 2016 13:38
> To: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] OSPF to the host
> 
> On 07/11/2016 08:23 AM, Saverio Proto wrote:
> >> I'm looking at the Dell S-ON switches which we can get in a
> >> Cumulus version. Any pros and cons of using Cumulus vs old-school
> >> switch OSes you may have come across?
> >
> > Nothing to declare here. Once configured properly the hardware works
> > as expected. I never used Dell, I used switches from Quanta.
> >
> > Saverio
> 
> I've had good experiences with Dell switches in the past, including routing.

I've just hit a bit of a problem with my N4000's, as I've found out they don't 
seem to support VRRP on IPv6.  :-(

> 
> Daniel
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] New to Ceph - osd autostart problem

2016-07-11 Thread Dirk Laurenz

hmm, helps partially ... running


/usr/sbin/ceph-disk trigger /dev/sdc1 or sdb1 works and brings osd up..


systemctl enable does not help


On 11.07.2016 at 14:49, George Shuklin wrote:

Short story how OSDs are started in systemd environments:

Ceph OSD partitions have a specific typecode (partition type
4FBD7E29-9D25-41B8-AFD0-062C0CEFF05D). It is handled by udev rules
shipped by the ceph package:

/lib/udev/rules.d/95-ceph-osd.rules

They set up the proper owner/group for this disk ('ceph' instead of 'root')
and call /usr/sbin/ceph-disk trigger.


ceph-disk triggers the creation of an instance of the ceph-disk@ systemd unit
(to mount the disk to /var/lib/ceph/osd/...), and of ceph-osd@ (I'm not sure
about the whole sequence of events).


Basically, to make OSDs autostart, they NEED to have the proper typecode on
their partition. If you are using something different (like a 'directory-based
OSD') you should enable OSD autostart explicitly:


systemctl enable ceph-osd@42


On 07/11/2016 03:32 PM, Dirk Laurenz wrote:

Hello,


I'm new to Ceph and trying to take some first steps with Ceph to understand
the concepts.


My setup is, at first, completely in VMs.


I deployed (with ceph-deploy) three monitors and three OSD hosts
(3+3 VMs).


My first test was to find out if everything comes back online after
a system restart. This works fine for the monitors, but fails for the
OSDs; I have to start them manually.



OS is Debian Jessie, Ceph is the current release.


Where can I find out what's going wrong?


Dirk

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph OSD stuck in booting state

2016-07-11 Thread William Josefsson
Hi Everyone,

I have a problem with OSD stuck in booting state.

sudo ceph daemon osd.7 status
{
"cluster_fsid": "724e501f-f4a3-4731-a832-c73685aabd21",
"osd_fsid": "058cac6e-6c66-4eeb-865b-3d22f0e91a99",
"whoami": 7,
"state": "booting",
"oldest_map": 1255,
"newest_map": 2498,
"num_pgs": 0
}

This is what I get in the log file /var/log/ceph/ceph.osd.7.log

2016-07-11 20:38:19.166607 7fb258077880  0 ceph version 0.94.7
(d56bdf93ced6b80b07397d57e3fa68fe68304432), process ceph-osd, pid 74882
2016-07-11 20:38:19.194663 7fb258077880  0
filestore(/var/lib/ceph/osd/ceph-7) backend xfs (magic 0x58465342)
2016-07-11 20:38:19.196561 7fb258077880  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-7) detect_features: FIEMAP
ioctl is supported and appears to work
2016-07-11 20:38:19.196567 7fb258077880  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-7) detect_features: FIEMAP
ioctl is disabled via 'filestore fiemap' config option
2016-07-11 20:38:19.197649 7fb258077880  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-7) detect_features:
syncfs(2) syscall fully supported (by glibc and kernel)
2016-07-11 20:38:19.197680 7fb258077880  0
xfsfilestorebackend(/var/lib/ceph/osd/ceph-7) detect_feature: extsize is
disabled by conf
2016-07-11 20:38:19.199632 7fb258077880  0
filestore(/var/lib/ceph/osd/ceph-7) mount: enabling WRITEAHEAD journal
mode: checkpoint is not enabled
2016-07-11 20:38:19.202743 7fb258077880  1 journal _open
/var/lib/ceph/osd/ceph-7/journal fd 20: 40010514432 bytes, block size 4096
bytes, directio = 1, aio = 1
2016-07-11 20:38:19.209391 7fb258077880  1 journal _open
/var/lib/ceph/osd/ceph-7/journal fd 20: 40010514432 bytes, block size 4096
bytes, directio = 1, aio = 1
2016-07-11 20:38:19.210134 7fb258077880  0 
cls/hello/cls_hello.cc:271: loading cls_hello
2016-07-11 20:38:19.218632 7fb258077880  0 osd.7 2498 crush map has
features 1107558400, adjusting msgr requires for clients
2016-07-11 20:38:19.218644 7fb258077880  0 osd.7 2498 crush map has
features 1107558400 was 8705, adjusting msgr requires for mons
2016-07-11 20:38:19.218652 7fb258077880  0 osd.7 2498 crush map has
features 1107558400, adjusting msgr requires for osds
2016-07-11 20:38:19.218671 7fb258077880  0 osd.7 2498 load_pgs
2016-07-11 20:38:19.218706 7fb258077880  0 osd.7 2498 load_pgs opened 0 pgs
2016-07-11 20:38:19.219596 7fb258077880 -1 osd.7 2498 log_to_monitors
{default=true}
2016-07-11 20:38:19.223879 7fb2466ea700  0 osd.7 2498 ignoring osdmap until
we have initialized
2016-07-11 20:38:19.223959 7fb2466ea700  0 osd.7 2498 ignoring osdmap until
we have initialized
2016-07-11 20:38:19.224145 7fb258077880  0 osd.7 2498 done with init,
starting boot process

The OSD never gets connected to the monitor.

I'm running Hammer/Centos 7.2.

Any hint for me?

Thanks.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Cache Tier configuration

2016-07-11 Thread Mateusz Skała
Hello Cephers.

Can someone help me with my cache tier configuration? I have 4 identical SSD
drives of 176 GB (184196208 KB) in the SSD pool; how should I determine
target_max_bytes? I assume it should be (4 drives * 188616916992 bytes) /
3 replicas = 251489222656 bytes * 85% (because of the full-disk warning).

That comes to 213765839257 bytes, ~200 GB. I made it a little bit lower (160 GB),
and after some time the whole cluster stopped on a full-disk error. One of the
SSD drives was full. I see that the space usage on the OSDs is not equal:

32 0.17099  1.0   175G   127G 49514M 72.47 1.77  95

42 0.17099  1.0   175G   120G 56154M 68.78 1.68  90

37 0.17099  1.0   175G   136G 39670M 77.95 1.90 102

47 0.17099  1.0   175G   130G 46599M 74.09 1.80  97
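
For reference, the value is then applied on the cache pool itself; a sketch,
with "ssd-cache" as a placeholder for the cache pool name and using the
~200 GB figure from above:

ceph osd pool set ssd-cache target_max_bytes 213765839257
# optionally make flushing/eviction start earlier than the defaults
ceph osd pool set ssd-cache cache_target_dirty_ratio 0.4
ceph osd pool set ssd-cache cache_target_full_ratio 0.8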

 

My setup:

ceph --admin-daemon /var/run/ceph/ceph-osd.32.asok config show | grep cache

  

  "debug_objectcacher": "0\/5",

"mon_osd_cache_size": "10",

"mon_cache_target_full_warn_ratio": "0.66",

"mon_warn_on_cache_pools_without_hit_sets": "true",

"client_cache_size": "16384",

"client_cache_mid": "0.75",

"mds_cache_size": "10",

"mds_cache_mid": "0.7",

"mds_dump_cache_on_map": "false",

"mds_dump_cache_after_rejoin": "false",

"osd_pool_default_cache_target_dirty_ratio": "0.4",

"osd_pool_default_cache_target_dirty_high_ratio": "0.6",

"osd_pool_default_cache_target_full_ratio": "0.8",

"osd_pool_default_cache_min_flush_age": "0",

"osd_pool_default_cache_min_evict_age": "0",

"osd_tier_default_cache_mode": "writeback",

"osd_tier_default_cache_hit_set_count": "4",

"osd_tier_default_cache_hit_set_period": "1200",

"osd_tier_default_cache_hit_set_type": "bloom",

"osd_tier_default_cache_min_read_recency_for_promote": "3",

"osd_tier_default_cache_min_write_recency_for_promote": "3",

"osd_map_cache_size": "200",

"osd_pg_object_context_cache_count": "64",

"leveldb_cache_size": "134217728",

"filestore_omap_header_cache_size": "1024",

"filestore_fd_cache_size": "128",

"filestore_fd_cache_shards": "16",

"keyvaluestore_header_cache_size": "4096",

"rbd_cache": "true",

"rbd_cache_writethrough_until_flush": "true",

"rbd_cache_size": "33554432",

"rbd_cache_max_dirty": "25165824",

"rbd_cache_target_dirty": "16777216",

"rbd_cache_max_dirty_age": "1",

"rbd_cache_max_dirty_object": "0",

"rbd_cache_block_writes_upfront": "false",

"rgw_cache_enabled": "true",

"rgw_cache_lru_size": "1",

"rgw_keystone_token_cache_size": "1",

"rgw_bucket_quota_cache_size": "1",

 

 

Rule for SSD:

rule ssd {
        ruleset 1
        type replicated
        min_size 1
        max_size 10
        step take ssd
        step choose firstn 2 type rack
        step chooseleaf firstn 2 type host
        step emit
        step take ssd
        step chooseleaf firstn -2 type osd
        step emit
}

 

OSD tree with SSD:

-8  0.68597 root ssd

-9  0.34299 rack skwer-ssd

-16  0.17099 host ceph40-ssd

32  0.17099 osd.32up  1.0  1.0

-19  0.17099 host ceph50-ssd

42  0.17099 osd.42up  1.0  1.0

-11  0.34299 rack nzoz-ssd

-17  0.17099 host ceph45-ssd

37  0.17099 osd.37up  1.0  1.0

-22  0.17099 host ceph55-ssd

47  0.17099 osd.47up  1.0  1.0

 

Can someone help? Any ideas? Is it normal that the whole cluster stops on a
disk-full error on the cache tier? I was thinking that only one of the pools
would stop and the others without a cache tier should still work.

Best regards,

-- 

Mateusz Skała

mateusz.sk...@budikom.net

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph OSD stuck in booting state

2016-07-11 Thread William Josefsson
Hi All,

Initially, I used /dev/disk/by-partuuid/xxx-- when making the
journal for the OSD, i.e.
sudo ceph-osd -i 30 --mkjournal
--osd-journal=/dev/disk/by-partuuid/2fe31ba2-1ac6-4729-9fdc-63432f50357

Then I tried to use the /dev/sdx5 format and it works, i.e.
sudo ceph-osd -i 30 --mkjournal --osd-journal=/dev/sdc5

Does anyone know why it doesn't work if I map the journal using
/dev/disk/by-partuuid/xxx-xxx?

Thanks.

On Mon, Jul 11, 2016 at 9:09 PM, William Josefsson <
william.josef...@gmail.com> wrote:

> Hi Everyone,
>
> I have a problem with OSD stuck in booting state.
>
> sudo ceph daemon osd.7 status
> {
> "cluster_fsid": "724e501f-f4a3-4731-a832-c73685aabd21",
> "osd_fsid": "058cac6e-6c66-4eeb-865b-3d22f0e91a99",
> "whoami": 7,
> "state": "booting",
> "oldest_map": 1255,
> "newest_map": 2498,
> "num_pgs": 0
> }
>
> This is what I get in the log file /var/log/ceph/ceph.osd.7.log
>
> 2016-07-11 20:38:19.166607 7fb258077880  0 ceph version 0.94.7
> (d56bdf93ced6b80b07397d57e3fa68fe68304432), process ceph-osd, pid 74882
> 2016-07-11 20:38:19.194663 7fb258077880  0
> filestore(/var/lib/ceph/osd/ceph-7) backend xfs (magic 0x58465342)
> 2016-07-11 20:38:19.196561 7fb258077880  0
> genericfilestorebackend(/var/lib/ceph/osd/ceph-7) detect_features: FIEMAP
> ioctl is supported and appears to work
> 2016-07-11 20:38:19.196567 7fb258077880  0
> genericfilestorebackend(/var/lib/ceph/osd/ceph-7) detect_features: FIEMAP
> ioctl is disabled via 'filestore fiemap' config option
> 2016-07-11 20:38:19.197649 7fb258077880  0
> genericfilestorebackend(/var/lib/ceph/osd/ceph-7) detect_features:
> syncfs(2) syscall fully supported (by glibc and kernel)
> 2016-07-11 20:38:19.197680 7fb258077880  0
> xfsfilestorebackend(/var/lib/ceph/osd/ceph-7) detect_feature: extsize is
> disabled by conf
> 2016-07-11 20:38:19.199632 7fb258077880  0
> filestore(/var/lib/ceph/osd/ceph-7) mount: enabling WRITEAHEAD journal
> mode: checkpoint is not enabled
> 2016-07-11 20:38:19.202743 7fb258077880  1 journal _open
> /var/lib/ceph/osd/ceph-7/journal fd 20: 40010514432 bytes, block size 4096
> bytes, directio = 1, aio = 1
> 2016-07-11 20:38:19.209391 7fb258077880  1 journal _open
> /var/lib/ceph/osd/ceph-7/journal fd 20: 40010514432 bytes, block size 4096
> bytes, directio = 1, aio = 1
> 2016-07-11 20:38:19.210134 7fb258077880  0 
> cls/hello/cls_hello.cc:271: loading cls_hello
> 2016-07-11 20:38:19.218632 7fb258077880  0 osd.7 2498 crush map has
> features 1107558400, adjusting msgr requires for clients
> 2016-07-11 20:38:19.218644 7fb258077880  0 osd.7 2498 crush map has
> features 1107558400 was 8705, adjusting msgr requires for mons
> 2016-07-11 20:38:19.218652 7fb258077880  0 osd.7 2498 crush map has
> features 1107558400, adjusting msgr requires for osds
> 2016-07-11 20:38:19.218671 7fb258077880  0 osd.7 2498 load_pgs
> 2016-07-11 20:38:19.218706 7fb258077880  0 osd.7 2498 load_pgs opened 0 pgs
> 2016-07-11 20:38:19.219596 7fb258077880 -1 osd.7 2498 log_to_monitors
> {default=true}
> 2016-07-11 20:38:19.223879 7fb2466ea700  0 osd.7 2498 ignoring osdmap
> until we have initialized
> 2016-07-11 20:38:19.223959 7fb2466ea700  0 osd.7 2498 ignoring osdmap
> until we have initialized
> 2016-07-11 20:38:19.224145 7fb258077880  0 osd.7 2498 done with init,
> starting boot process
>
> OSD never get connected to the monitor.
>
> I'm running Hammer/Centos 7.2.
>
> Any hint for me?
>
> Thanks.
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fwd: Ceph OSD suicide himself

2016-07-11 Thread Lionel Bouton
On 11/07/2016 at 11:56, Brad Hubbard wrote:
> On Mon, Jul 11, 2016 at 7:18 PM, Lionel Bouton
>  wrote:
>> On 11/07/2016 at 04:48, 한승진 wrote:
>>> Hi cephers.
>>>
>>> I need your help for some issues.
>>>
>>> The Ceph cluster version is Jewel (10.2.1), and the filesystem is btrfs.
>>>
>>> I run 1 mon and 48 OSDs on 4 nodes (each node has 12 OSDs).
>>>
>>> I've experienced one of the OSDs killing itself.
>>>
>>> It always issued a suicide timeout message.
>> This is probably a fragmentation problem : typical rbd access patterns
>> cause heavy BTRFS fragmentation.
> To the extent that operations take over 120 seconds to complete? Really?

Yes, really. I had these too. By default Ceph/RBD uses BTRFS in a very
aggressive way, rewriting data all over the place and creating/deleting
snapshots every filestore sync interval (5 seconds max by default IIRC).

As I said there are 3 main causes of performance degradation :
- the snapshots,
- the journal in a standard copy-on-write file (move it out of the FS or
use NoCow),
- the weak auto defragmentation of BTRFS (autodefrag mount option).

Each one of them is enough to impact or even destroy performance in the
long run. The 3 combined make BTRFS unusable by default. This is why
BTRFS is not recommended: if you want to use it you have to be prepared
for some (heavy) tuning. The first 2 points are easy to address; for the
last (which begins to be noticeable when you accumulate rewrites on your
data) I'm not aware of any other tool than the one we developed and
published on github (link provided in previous mail).
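
To make that concrete, a minimal sketch of that kind of tuning (filestore-era
option name; device and paths are only examples) would be:

[osd]
# stop the snapshot create/delete cycle on every filestore sync
filestore btrfs snap = false

# mount the OSD filesystems with autodefrag, e.g. in /etc/fstab:
# /dev/sdb1  /var/lib/ceph/osd/ceph-0  btrfs  noatime,autodefrag  0 0
# and mark a freshly created journal file NoCow before first use:
# chattr +C /var/lib/ceph/osd/ceph-0/journal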

Another thing: you had better have a recent 4.1.x or 4.4.x kernel on your
OSDs if you use BTRFS. We've used it since 3.19.x, but I wouldn't advise
that now and would recommend 4.4.x if it's possible for you, and 4.1.x
otherwise.

Best regards,

Lionel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Design for Ceph Storage integration with openstack

2016-07-11 Thread Gaurav Goyal
Situation is --> I have installed openstack setup (Liberty) for my lab.
Dear Ceph users,

I need your suggestion for my ceph design.

I have

Host 1 --> Controller + Compute1
Host 2  --> Compute 2

DELL SAN storage is attached to both hosts as

[root@OSKVM1 ~]# iscsiadm -m node

10.35.0.3:3260,1
iqn.2001-05.com.equallogic:0-1cb196-07a83c107-4770018575af-vol1

10.35.0.8:3260,1
iqn.2001-05.com.equallogic:0-1cb196-07a83c107-4770018575af-vol1

10.35.0.*:3260,-1
iqn.2001-05.com.equallogic:0-1cb196-20d83c107-729002157606-vol2

10.35.0.8:3260,1
iqn.2001-05.com.equallogic:0-1cb196-20d83c107-729002157606-vol2

10.35.0.*:3260,-1
iqn.2001-05.com.equallogic:0-1cb196-f0783c107-70a00245761a-vol3

10.35.0.8:3260,1
iqn.2001-05.com.equallogic:0-1cb196-f0783c107-70a00245761a-vol3

10.35.0.*:3260,-1
iqn.2001-05.com.equallogic:0-1cb196-fda83c107-92700275761a-vol4
10.35.0.8:3260,1
iqn.2001-05.com.equallogic:0-1cb196-fda83c107-92700275761a-vol4

with fdisk -l, it is mentioned as
sdc, sdd, sde and sdf on host1
sdb,sdc,sdd,and sde on host 2

I need to configure this SAN storage as Ceph.

I am thinking of:
osd0 with sdc on host1
osd1 with sdd on host1

osd2 with sdd on host2
osd3 with sde on host2
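
For what it's worth, expressed as ceph-deploy commands (using the device names
listed above), that plan would look roughly like:

ceph-deploy osd prepare host1:sdc host1:sdd
ceph-deploy osd prepare host2:sdd host2:sde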

So as to have

[root@host1 ~]# ceph osd tree

ID WEIGHT  TYPE NAME   UP/DOWN REWEIGHT PRIMARY-AFFINITY

-1 7.95996 root default

-2 3.97998 host host1

 0 1.98999 osd.0up  1.0  1.0

 1 1.98999 osd.1up  1.0  1.0

-3 3.97998 host host2

 2 1.98999 osd.2up  1.0  1.0

 3 1.98999 osd.3up  1.0  1.0
Is it OK, or must I change my Ceph design?


Regards
Gaurav Goyal
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] (no subject)

2016-07-11 Thread Gaurav Goyal
Hello, it worked for me after removing the following parameter from the
/etc/nova/nova.conf file:

[root@OSKVM1 ~]# cat /etc/nova/nova.conf|grep hw_disk_discard

#hw_disk_discard=unmap


Though as per the Ceph documentation, for the Kilo version we must set this
parameter. I am using Liberty, but I am not sure if this parameter was
removed in Liberty. If that is the case, please update the documentation.


KILO

Enable discard support for virtual machine ephemeral root disk:

[libvirt]

...

hw_disk_discard = unmap # enable discard support (be careful of performance)


Regards

Gaurav Goyal

On Mon, Jul 11, 2016 at 4:38 AM, Kees Meijs  wrote:

> Hi,
>
> I think there's still something misconfigured:
>
> Invalid: 400 Bad Request: Unknown scheme 'file' found in URI (HTTP 400)
>
>
> It seems the RBD backend is not used as expected.
>
> Have you configured both Cinder *and* Glance to use Ceph?
>
> Regards,
> Kees
>
> On 08-07-16 17:33, Gaurav Goyal wrote:
>
>
> I regenerated the UUID as per your suggestion.
> Now i have same UUID in host1 and host2.
> I could create volumes and attach them to existing VMs.
>
> I could create new glance images.
>
> But still finding the same error while instance launch via GUI.
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-fuse segfaults ( jewel 10.2.2)

2016-07-11 Thread Patrick Donnelly
Hi Goncalo,

On Fri, Jul 8, 2016 at 3:01 AM, Goncalo Borges
 wrote:
> 5./ I have noticed that ceph-fuse (in 10.2.2) consumes about 1.5 GB of
> virtual memory when there is no applications using the filesystem.
>
>  7152 root  20   0 1108m  12m 5496 S  0.0  0.0   0:00.04 ceph-fuse
>
> When I only have one instance of the user application running, ceph-fuse (in
> 10.2.2) slowly rises with time up to 10 GB of memory usage.
>
> if I submit a large number of user applications simultaneously, ceph-fuse
> goes very fast to ~10GB.
>
>   PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEMTIME+  COMMAND
> 18563 root  20   0 10.0g 328m 5724 S  4.0  0.7   1:38.00 ceph-fuse
>  4343 root  20   0 3131m 237m  12m S  0.0  0.5  28:24.56 dsm_om_connsvcd
>  5536 goncalo   20   0 1599m  99m  32m R 99.9  0.2  31:35.46 python
> 31427 goncalo   20   0 1597m  89m  20m R 99.9  0.2  31:35.88 python
> 20504 goncalo   20   0 1599m  89m  20m R 100.2  0.2  31:34.29 python
> 20508 goncalo   20   0 1599m  89m  20m R 99.9  0.2  31:34.20 python
>  4973 goncalo   20   0 1599m  89m  20m R 99.9  0.2  31:35.70 python
>  1331 goncalo   20   0 1597m  88m  20m R 99.9  0.2  31:35.72 python
> 20505 goncalo   20   0 1597m  88m  20m R 99.9  0.2  31:34.46 python
> 20507 goncalo   20   0 1599m  87m  20m R 99.9  0.2  31:34.37 python
> 28375 goncalo   20   0 1597m  86m  20m R 99.9  0.2  31:35.52 python
> 20503 goncalo   20   0 1597m  85m  20m R 100.2  0.2  31:34.09 python
> 20506 goncalo   20   0 1597m  84m  20m R 99.5  0.2  31:34.42 python
> 20502 goncalo   20   0 1597m  83m  20m R 99.9  0.2  31:34.32 python

I've seen this type of thing before. It could be glibc's malloc arenas
for threads. See:

https://www.ibm.com/developerworks/community/blogs/kevgrig/entry/linux_glibc_2_10_rhel_6_malloc_may_show_excessive_virtual_memory_usage?lang=en

I would guess there are 20 cores on this machine*?

* 20 = 10GB/(8*64MB)

If the cause here is glibc arenas, I don't think we need to do
anything special. The virtual memory is not actually being used due to
Linux overcommit.
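
If you want to test the arena theory, one quick check (assuming ceph-fuse is
using glibc malloc rather than tcmalloc) is to cap the arena count before
mounting and see whether the virtual size stays down:

MALLOC_ARENA_MAX=4 ceph-fuse --id mount_user -k \
    /etc/ceph/ceph.client.mount_user.keyring -m X.X.X.8:6789 -r /cephfs /coepp/cephfs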

> 6./ On the machines where the user had the segfault, we have 16 GB of RAM
> and 1GB of SWAP
>
> Mem:  16334244k total,  3590100k used, 12744144k free,   221364k buffers
> Swap:  1572860k total,10512k used,  1562348k free,  2937276k cached

But do we know that ceph-fuse is using 10G VM on those machines (the
core count may be different)?

> 7./ I think what is happening is that once the user submits his sets of
> jobs, the memory usage goes to the very limit on this type machine, and the
> raise is actually to fast that ceph-fuse segfaults before OOM Killer can
> kill it.

It's possible, but we have no evidence yet that ceph-fuse is using up
all the memory on those machines, right?

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph + vmware

2016-07-11 Thread Mike Christie
On 07/08/2016 02:22 PM, Oliver Dzombic wrote:
> Hi,
> 
> does anyone have experience how to connect vmware with ceph smart ?
> 
> iSCSI multipath does not really worked well.

Are you trying to export rbd images from multiple iscsi targets at the
same time or just one target?

For the HA/multiple target setup, I am working on this for Red Hat. We
plan to release it in RHEL 7.3/RHCS 2.1. SUSE ships something already as
someone mentioned.

We just got a large chunk of code in the upstream kernel (it is in the
block layer maintainer's tree for the next kernel) so it should be
simple to add COMPARE_AND_WRITE support now. We should be posting krbd
exclusive lock support in the next couple weeks.


> NFS could be, but i think thats just too much layers in between to have
> some useable performance.
> 
> Systems like ScaleIO have developed a vmware addon to talk with it.
> 
> Is there something similar out there for ceph ?
> 
> What are you using ?
> 
> Thank you !
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] (no subject)

2016-07-11 Thread Kees Meijs
Glad to hear it works now! Good luck with your setup.

Regards,
Kees

On 11-07-16 17:29, Gaurav Goyal wrote:
> Hello it worked for me after removing the following parameter from
> /etc/nova/nova.conf file

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] New to Ceph - osd autostart problem

2016-07-11 Thread George Shuklin

Check the partition type of the Ceph data partition:

fdisk -l /dev/sdc

On 07/11/2016 04:03 PM, Dirk Laurenz wrote:


hmm, helps partially ... running


/usr/sbin/ceph-disk trigger /dev/sdc1 or sdb1 works and brings osd up..


systemctl enable does not help


On 11.07.2016 at 14:49, George Shuklin wrote:

Short story how OSDs are started in systemd environments:

Ceph OSD partitions have a specific typecode (partition type
4FBD7E29-9D25-41B8-AFD0-062C0CEFF05D). It is handled by udev rules
shipped by the ceph package:

/lib/udev/rules.d/95-ceph-osd.rules

They set up the proper owner/group for this disk ('ceph' instead of 'root')
and call /usr/sbin/ceph-disk trigger.


ceph-disk triggers the creation of an instance of the ceph-disk@ systemd unit
(to mount the disk to /var/lib/ceph/osd/...), and of ceph-osd@ (I'm not sure
about the whole sequence of events).


Basically, to make OSDs autostart, they NEED to have the proper typecode on
their partition. If you are using something different (like a 'directory-based
OSD') you should enable OSD autostart explicitly:


systemctl enable ceph-osd@42


On 07/11/2016 03:32 PM, Dirk Laurenz wrote:

Hello,


I'm new to Ceph and trying to take some first steps with Ceph to
understand the concepts.


My setup is, at first, completely in VMs.


I deployed (with ceph-deploy) three monitors and three OSD hosts
(3+3 VMs).


My first test was to find out if everything comes back online after
a system restart. This works fine for the monitors, but fails for
the OSDs; I have to start them manually.



OS is Debian Jessie, Ceph is the current release.


Where can I find out what's going wrong?


Dirk

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] (no subject)

2016-07-11 Thread Gaurav Goyal
Thanks!

I need to create a VM whose qcow2 image file is 6.7 GB but whose raw image is
600 GB, which is too big.
Is there a way that I don't need to convert the qcow2 file to raw and it still
works well with RBD?


Regards
Gaurav Goyal

On Mon, Jul 11, 2016 at 11:46 AM, Kees Meijs  wrote:

> Glad to hear it works now! Good luck with your setup.
>
> Regards,
> Kees
>
> On 11-07-16 17:29, Gaurav Goyal wrote:
> > Hello it worked for me after removing the following parameter from
> > /etc/nova/nova.conf file
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Using two roots for the same pool

2016-07-11 Thread George Shuklin

Hello.

I want to try a CRUSH rule with the following idea:
take one OSD from the root with SSD drives (and use it as primary).
take two OSDs from the root with HDD drives.

I've created this rule:

rule rule_mix {
ruleset 2
type replicated
min_size 2
max_size 10
step take ssd
step chooseleaf firstn 1 type osd
step take hdd
step chooseleaf firstn -1 type osd
step emit
}

But I think I did something wrong - all PGs are undersized+degraded (I 
use 'size 3' and have 2 SSD OSDs and 5 HDD OSDs).


My noobie questions:

1) Can I use multiple 'take' steps in the single rule?
2) How many emit I should/may use per rule?
3) Is this a proper way to describe such logic, or should it be done 
differently? (How?)


Thanks.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph OSD stuck in booting state

2016-07-11 Thread George Shuklin
Mount the OSD data partition on a temporary directory (e.g. /mnt) and check
where the journal is pointing (ls -la /mnt/journal). It can point to a
different location from the "--osd-journal=" argument you gave to --mkjournal.


On 07/11/2016 05:46 PM, William Josefsson wrote:

Hi All,

Initially, I used /dev/disk/by-partuuid/xxx-- when make 
the journal for the OSD, i.e.
sudo ceph-osd -i 30 --mkjournal 
--osd-journal=/dev/disk/by-partuuid/2fe31ba2-1ac6-4729-9fdc-63432f50357


Then, I try to use /dev/sdx5 format and it works, i.e.
sudo ceph-osd -i 30 --mkjournal --osd-journal=/dev/sdc5

Does anyone knows why it doesn't work if I map the journal using 
/dev/disk/by-partuuid/xxx-xxx ?


Thanks.

On Mon, Jul 11, 2016 at 9:09 PM, William Josefsson 
mailto:william.josef...@gmail.com>> wrote:


Hi Everyone,

I have a problem with OSD stuck in booting state.

sudo ceph daemon osd.7 status
{
"cluster_fsid": "724e501f-f4a3-4731-a832-c73685aabd21",
"osd_fsid": "058cac6e-6c66-4eeb-865b-3d22f0e91a99",
"whoami": 7,
"state": "booting",
"oldest_map": 1255,
"newest_map": 2498,
"num_pgs": 0
}

This is what I get in the log file /var/log/ceph/ceph.osd.7.log

2016-07-11 20:38:19.166607 7fb258077880  0 ceph version 0.94.7
(d56bdf93ced6b80b07397d57e3fa68fe68304432), process ceph-osd, pid
74882
2016-07-11 20:38:19.194663 7fb258077880  0
filestore(/var/lib/ceph/osd/ceph-7) backend xfs (magic 0x58465342)
2016-07-11 20:38:19.196561 7fb258077880  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-7) detect_features:
FIEMAP ioctl is supported and appears to work
2016-07-11 20:38:19.196567 7fb258077880  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-7) detect_features:
FIEMAP ioctl is disabled via 'filestore fiemap' config option
2016-07-11 20:38:19.197649 7fb258077880  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-7) detect_features:
syncfs(2) syscall fully supported (by glibc and kernel)
2016-07-11 20:38:19.197680 7fb258077880  0
xfsfilestorebackend(/var/lib/ceph/osd/ceph-7) detect_feature:
extsize is disabled by conf
2016-07-11 20:38:19.199632 7fb258077880  0
filestore(/var/lib/ceph/osd/ceph-7) mount: enabling WRITEAHEAD
journal mode: checkpoint is not enabled
2016-07-11 20:38:19.202743 7fb258077880  1 journal _open
/var/lib/ceph/osd/ceph-7/journal fd 20: 40010514432 bytes, block
size 4096 bytes, directio = 1, aio = 1
2016-07-11 20:38:19.209391 7fb258077880  1 journal _open
/var/lib/ceph/osd/ceph-7/journal fd 20: 40010514432 bytes, block
size 4096 bytes, directio = 1, aio = 1
2016-07-11 20:38:19.210134 7fb258077880  0 
cls/hello/cls_hello.cc:271: loading cls_hello
2016-07-11 20:38:19.218632 7fb258077880  0 osd.7 2498 crush map
has features 1107558400, adjusting msgr requires for clients
2016-07-11 20:38:19.218644 7fb258077880  0 osd.7 2498 crush map
has features 1107558400 was 8705, adjusting msgr requires for mons
2016-07-11 20:38:19.218652 7fb258077880  0 osd.7 2498 crush map
has features 1107558400, adjusting msgr requires for osds
2016-07-11 20:38:19.218671 7fb258077880  0 osd.7 2498 load_pgs
2016-07-11 20:38:19.218706 7fb258077880  0 osd.7 2498 load_pgs
opened 0 pgs
2016-07-11 20:38:19.219596 7fb258077880 -1 osd.7 2498
log_to_monitors {default=true}
2016-07-11 20:38:19.223879 7fb2466ea700  0 osd.7 2498 ignoring
osdmap until we have initialized
2016-07-11 20:38:19.223959 7fb2466ea700  0 osd.7 2498 ignoring
osdmap until we have initialized
2016-07-11 20:38:19.224145 7fb258077880  0 osd.7 2498 done with
init, starting boot process

OSD never get connected to the monitor.

I'm running Hammer/Centos 7.2.

Any hint for me?

Thanks.




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph + vmware

2016-07-11 Thread Oliver Dzombic
Hi Mike,

i was trying:

https://ceph.com/dev-notes/adding-support-for-rbd-to-stgt/

ONE target, from different OSD servers directly, to multiple vmware esxi
servers.

A config looked like:

#cat iqn.ceph-cluster_netzlaboranten-storage.conf


driver iscsi
bs-type rbd
backing-store rbd/vmware-storage
initiator-address 10.0.0.9
initiator-address 10.0.0.10
incominguser vmwaren-storage RPb18P0xAqkAw4M1



We had 4 OSD servers. Everyone had this config running.
We had 2 vmware servers ( esxi ).

So we had 4 paths to this vmware-storage RBD object.

VMware, in the very end, had 8 paths (4 paths directly connected to
the specific VMware server, plus 4 paths that this specific VMware server
saw via the other VMware server).

There were very big problems with performance; I am talking about < 10
MB/s. The customer was not able to use it, so good old NFS is serving instead.

At that time we used Ceph Hammer, and I think the customer was using ESXi 5.5,
or maybe ESXi 6; the testing was somewhere last year.



We will make a new attempt now with ceph jewel and esxi 6 and this time
we will manage the vmware servers.

As soon as this issue,

"ceph mon Segmentation fault after set crush_ruleset ceph 10.2.2",

which I already mailed here to the list, is solved, we can start the testing.


-- 
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive

mailto:i...@ip-interactive.de

Anschrift:

IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402 beim Amtsgericht Hanau
Geschäftsführung: Oliver Dzombic

Steuer Nr.: 35 236 3622 1
UST ID: DE274086107


On 11.07.2016 at 17:45, Mike Christie wrote:
> On 07/08/2016 02:22 PM, Oliver Dzombic wrote:
>> Hi,
>>
>> does anyone have experience how to connect vmware with ceph smart ?
>>
>> iSCSI multipath does not really worked well.
> 
> Are you trying to export rbd images from multiple iscsi targets at the
> same time or just one target?
> 
> For the HA/multiple target setup, I am working on this for Red Hat. We
> plan to release it in RHEL 7.3/RHCS 2.1. SUSE ships something already as
> someone mentioned.
> 
> We just got a large chunk of code in the upstream kernel (it is in the
> block layer maintainer's tree for the next kernel) so it should be
> simple to add COMPARE_AND_WRITE support now. We should be posting krbd
> exclusive lock support in the next couple weeks.
> 
> 
>> NFS could be, but i think thats just too much layers in between to have
>> some useable performance.
>>
>> Systems like ScaleIO have developed a vmware addon to talk with it.
>>
>> Is there something similar out there for ceph ?
>>
>> What are you using ?
>>
>> Thank you !
>>
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Using two roots for the same pool

2016-07-11 Thread Gregory Farnum
I'm not looking at the docs, but I think you need an "emit" statement after
every choose.
-Greg

On Monday, July 11, 2016, George Shuklin  wrote:

> Hello.
>
> I want to try CRUSH rule with following idea:
> take one OSD from root with SSD drives (and use it as primary).
> take two OSD from root with HDD drives.
>
> I've created this rule:
>
> rule rule_mix {
> ruleset 2
> type replicated
> min_size 2
> max_size 10
> step take ssd
> step chooseleaf firstn 1 type osd
> step take hdd
> step chooseleaf firstn -1 type osd
> step emit
> }
>
> But I think I done something wrong - all PG are undersized+degraded (I use
> 'size 3', have 2 SSD OSD and 5 HDD OSD).
>
> My noobie questions:
>
> 1) Can I use multiple 'take' steps in the single rule?
> 2) How many emit I should/may use per rule?
> 3) Is this a proper way to describe such logic? Or it should be done
> differently? (How?)
>
> Thanks.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-fuse segfaults ( jewel 10.2.2)

2016-07-11 Thread Gregory Farnum
Oh, is this one of your custom-built packages? Are they using
tcmalloc? That difference between VSZ and RSS looks like a glibc
malloc problem.
-Greg

On Mon, Jul 11, 2016 at 12:04 AM, Goncalo Borges
 wrote:
> Hi John...
>
> Thank you for replying.
>
> Here is the result of the tests you asked but I do not see nothing abnormal.
> Actually, your suggestions made me see that:
>
> 1) ceph-fuse 9.2.0 is presenting the same behaviour but with less memory
> consumption, probably, less enought so that it doesn't brake ceph-fuse in
> our machines with less memory.
>
> 2) I see a tremendous number of  ceph-fuse threads launched (around 160).
>
> # ps -T -p 3230 -o command,ppid,pid,spid,vsize,rss,%mem,%cpu | wc -l
> 157
>
> # ps -T -p 3230 -o command,ppid,pid,spid,vsize,rss,%mem,%cpu | head -n 10
> COMMAND  PPID   PID  SPIDVSZ   RSS %MEM %CPU
> ceph-fuse --id mount_user - 1  3230  3230 9935240 339780  0.6 0.0
> ceph-fuse --id mount_user - 1  3230  3231 9935240 339780  0.6 0.1
> ceph-fuse --id mount_user - 1  3230  3232 9935240 339780  0.6 0.0
> ceph-fuse --id mount_user - 1  3230  3233 9935240 339780  0.6 0.0
> ceph-fuse --id mount_user - 1  3230  3234 9935240 339780  0.6 0.0
> ceph-fuse --id mount_user - 1  3230  3235 9935240 339780  0.6 0.0
> ceph-fuse --id mount_user - 1  3230  3236 9935240 339780  0.6 0.0
> ceph-fuse --id mount_user - 1  3230  3237 9935240 339780  0.6 0.0
> ceph-fuse --id mount_user - 1  3230  3238 9935240 339780  0.6 0.0
>
>
> I do not see a way to actually limit the number of ceph-fuse threads
> launched  or to limit the max vm size each thread should take.
>
> Do you know how to limit those options.
>
> Cheers
>
> Goncalo
>
>
>
>
> 1.> Try running ceph-fuse with valgrind --tool=memcheck to see if it's
> leaking
>
> I have launched ceph-fuse with valgrind in the cluster where there is
> sufficient memory available, and therefore, there is no object cacher
> segfault.
>
> $ valgrind --log-file=/tmp/valgrind-ceph-fuse-10.2.2.txt --tool=memcheck
> ceph-fuse --id mount_user -k /etc/ceph/ceph.client.mount_user.keyring -m
> X.X.X.8:6789 -r /cephfs /coepp/cephfs
>
> This is the output which I get once I unmount the file system after user
> application execution
>
> # cat valgrind-ceph-fuse-10.2.2.txt
> ==12123== Memcheck, a memory error detector
> ==12123== Copyright (C) 2002-2012, and GNU GPL'd, by Julian Seward et al.
> ==12123== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info
> ==12123== Command: ceph-fuse --id mount_user -k
> /etc/ceph/ceph.client.mount_user.keyring -m 192.231.127.8:6789 -r /cephfs
> /coepp/cephfs
> ==12123== Parent PID: 11992
> ==12123==
> ==12123==
> ==12123== HEAP SUMMARY:
> ==12123== in use at exit: 29,129 bytes in 397 blocks
> ==12123==   total heap usage: 14,824 allocs, 14,427 frees, 648,030 bytes
> allocated
> ==12123==
> ==12123== LEAK SUMMARY:
> ==12123==definitely lost: 16 bytes in 1 blocks
> ==12123==indirectly lost: 0 bytes in 0 blocks
> ==12123==  possibly lost: 11,705 bytes in 273 blocks
> ==12123==still reachable: 17,408 bytes in 123 blocks
> ==12123== suppressed: 0 bytes in 0 blocks
> ==12123== Rerun with --leak-check=full to see details of leaked memory
> ==12123==
> ==12123== For counts of detected and suppressed errors, rerun with: -v
> ==12123== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 8 from 6)
> ==12126==
> ==12126== HEAP SUMMARY:
> ==12126== in use at exit: 9,641 bytes in 73 blocks
> ==12126==   total heap usage: 31,363,579 allocs, 31,363,506 frees,
> 41,389,143,617 bytes allocated
> ==12126==
> ==12126== LEAK SUMMARY:
> ==12126==definitely lost: 28 bytes in 1 blocks
> ==12126==indirectly lost: 0 bytes in 0 blocks
> ==12126==  possibly lost: 0 bytes in 0 blocks
> ==12126==still reachable: 9,613 bytes in 72 blocks
> ==12126== suppressed: 0 bytes in 0 blocks
> ==12126== Rerun with --leak-check=full to see details of leaked memory
> ==12126==
> ==12126== For counts of detected and suppressed errors, rerun with: -v
> ==12126== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 17 from 9)
>
> --- * ---
>
> 2.>  Inspect inode count (ceph daemon  status) to see if it's
> obeying its limit
>
> This is the output I get once ceph-fuse is mounted but no user application
> is running
>
> # ceph daemon /var/run/ceph/ceph-client.mount_user.asok status
> {
> "metadata": {
> "ceph_sha1": "45107e21c568dd033c2f0a3107dec8f0b0e58374",
> "ceph_version": "ceph version 10.2.2
> (45107e21c568dd033c2f0a3107dec8f0b0e58374)",
> "entity_id": "mount_user",
> "hostname": "",
> "mount_point": "\/coepp\/cephfs",
> "root": "\/cephfs"
> },
> "dentry_count": 0,
> "dentry_pinned_count": 0,
> "inode_count": 2,
> "mds_epoch": 817,
> "osd_epoch": 1005,
> "osd_epoch_barrier": 0
> }
>
>
> This is already when ceph-fuse reached 

Re: [ceph-users] Using two roots for the same pool

2016-07-11 Thread Bob R
George,

Check the instructions here which should allow you to test your crush rules
without applying them to your cluster.
http://dachary.org/?p=3189
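
A minimal offline round-trip with crushtool along those lines (file names are
placeholders) could look like:

ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt     # decompile, then edit the rule
crushtool -c crushmap.txt -o crushmap-new.bin # recompile
crushtool -i crushmap-new.bin --test --rule 2 --num-rep 3 --show-statistics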

also, fwiw, we are not using an 'emit' after each choose (note these rules
are not implementing what you're trying to)-
# rules
rule replicated_ruleset {
ruleset 0
type replicated
min_size 1
max_size 10
step take default
step choose firstn 2 type room
step chooseleaf firstn 2 type host
step emit
}
rule ssd {
ruleset 1
type replicated
min_size 1
max_size 4
step take ssd
step choose firstn 2 type room
step chooseleaf firstn 2 type host
step emit
}

Bob

On Mon, Jul 11, 2016 at 9:19 AM, Gregory Farnum  wrote:

> I'm not looking at the docs, but I think you need an "emit" statement
> after every choose.
> -Greg
>
>
> On Monday, July 11, 2016, George Shuklin  wrote:
>
>> Hello.
>>
>> I want to try CRUSH rule with following idea:
>> take one OSD from root with SSD drives (and use it as primary).
>> take two OSD from root with HDD drives.
>>
>> I've created this rule:
>>
>> rule rule_mix {
>> ruleset 2
>> type replicated
>> min_size 2
>> max_size 10
>> step take ssd
>> step chooseleaf firstn 1 type osd
>> step take hdd
>> step chooseleaf firstn -1 type osd
>> step emit
>> }
>>
>> But I think I done something wrong - all PG are undersized+degraded (I
>> use 'size 3', have 2 SSD OSD and 5 HDD OSD).
>>
>> My noobie questions:
>>
>> 1) Can I use multiple 'take' steps in the single rule?
>> 2) How many emit I should/may use per rule?
>> 3) Is this a proper way to describe such logic? Or it should be done
>> differently? (How?)
>>
>> Thanks.
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Using two roots for the same pool

2016-07-11 Thread Gregory Farnum
On Mon, Jul 11, 2016 at 11:15 AM, Bob R  wrote:
> George,
>
> Check the instructions here which should allow you to test your crush rules
> without applying them to your cluster.
> http://dachary.org/?p=3189
>
> also, fwiw, we are not using an 'emit' after each choose (note these rules
> are not implementing what you're trying to)-

I should have been more precise. You need an emit after each *final*
choose. In the rules below they're doing a choose to select some
internal buckets, and then a chooseleaf to get to the OSD level and
emitting those.

If you've got rules trying to select OSDs from multiple roots, you
need an emit each time you reach the OSD level. (ie, one for each
"take" statement.) See
http://docs.ceph.com/docs/master/rados/operations/crush-map/?highlight=emit#placing-different-pools-on-different-osds
and the "ssd-primary" rule.
-Greg

> # rules
> rule replicated_ruleset {
> ruleset 0
> type replicated
> min_size 1
> max_size 10
> step take default
> step choose firstn 2 type room
> step chooseleaf firstn 2 type host
> step emit
> }
> rule ssd {
> ruleset 1
> type replicated
> min_size 1
> max_size 4
> step take ssd
> step choose firstn 2 type room
> step chooseleaf firstn 2 type host
> step emit
> }
>
> Bob
>
> On Mon, Jul 11, 2016 at 9:19 AM, Gregory Farnum  wrote:
>>
>> I'm not looking at the docs, but I think you need an "emit" statement
>> after every choose.
>> -Greg
>>
>>
>> On Monday, July 11, 2016, George Shuklin  wrote:
>>>
>>> Hello.
>>>
>>> I want to try CRUSH rule with following idea:
>>> take one OSD from root with SSD drives (and use it as primary).
>>> take two OSD from root with HDD drives.
>>>
>>> I've created this rule:
>>>
>>> rule rule_mix {
>>> ruleset 2
>>> type replicated
>>> min_size 2
>>> max_size 10
>>> step take ssd
>>> step chooseleaf firstn 1 type osd
>>> step take hdd
>>> step chooseleaf firstn -1 type osd
>>> step emit
>>> }
>>>
>>> But I think I done something wrong - all PG are undersized+degraded (I
>>> use 'size 3', have 2 SSD OSD and 5 HDD OSD).
>>>
>>> My noobie questions:
>>>
>>> 1) Can I use multiple 'take' steps in the single rule?
>>> 2) How many emit I should/may use per rule?
>>> 3) Is this a proper way to describe such logic? Or it should be done
>>> differently? (How?)
>>>
>>> Thanks.
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph + vmware

2016-07-11 Thread Jake Young
I'm using this setup with ESXi 5.1 and I get very good performance.  I
suspect you have other issues.  Reliability is another story (see Nick's
posts on tgt and HA to get an idea of the awful problems you can have), but
for my test labs the risk is acceptable.


One change I found helpful is to run tgtd with 128 threads.  I'm running
Ubuntu 14.04, so I edited my /etc/init/tgt.conf file and changed the line
that read:

exec tgtd

to

exec tgtd --nr_iothreads=128


If you're not concerned with reliability, you can enhance throughput even
more by enabling rbd client write-back cache in your tgt VM's ceph.conf
file (you'll need to restart tgtd for this to take effect):

[client]
rbd_cache = true
rbd_cache_size = 67108864 # (64MB)
rbd_cache_max_dirty = 50331648 # (48MB)
rbd_cache_target_dirty = 33554432 # (32MB)
rbd_cache_max_dirty_age = 2
rbd_cache_writethrough_until_flush = false




Here's a sample targets.conf:

  
  initiator-address ALL
  scsi_sn Charter
  #vendor_id CEPH
  #controller_tid 1
  write-cache on
  read-cache on
  driver iscsi
  bs-type rbd
  
  lun 5
  scsi_id cfe1000c4a71e700506357
  
  
  lun 6
  scsi_id cfe1000c4a71e700507157
  
  
  lun 7
  scsi_id cfe1000c4a71e70050da7a
  
  
  lun 8
  scsi_id cfe1000c4a71e70050bac0
  
  



I don't have FIO numbers handy, but I have some oracle calibrate io output.


We're running Oracle RAC database servers in linux VMs on ESXi 5.1, which
use iSCSI to connect to the tgt service.  I only have a single connection
setup in ESXi for each LUN.  I tested using multipathing and two tgt VMs
presenting identical LUNs/RBD disks, but found that there wasn't a
significant performance gain by doing this, even with round-robin path
selecting in VMware.


These tests were run from two RAC VMs, each on a different host, with both
hosts connected to the same tgt instance.  The way we have oracle
configured, it would have been using two of the LUNs heavily during this
calibrate IO test.


This output is with 128 threads in tgtd and rbd client cache enabled:

START_TIME   END_TIME   MAX_IOPS   MAX_MBPS
MAX_PMBPS   LATENCY   DISKS
  -- --
-- -- --
28-JUN-016 15:10:50  28-JUN-016 15:20:04   14153658
412   14  75


This output is with the same configuration, but with rbd client cache
disabled:

START_TIME END_TIMEMAX_IOPS   MAX_MBPS  MAX_PMBPS
  LATENCY   DISKS
  -- --
-- -- --
28-JUN-016 22:44:29  28-JUN-016 22:49:057449161219
  20  75

This output is from a directly connected EMC VNX5100 FC SAN with 25 disks
using dual 8Gb FC links on a different lab system:

START_TIME END_TIMEMAX_IOPS   MAX_MBPS  MAX_PMBPS
  LATENCY   DISKS
  -- --
-- -- --
28-JUN-016 22:11:25  28-JUN-016 22:18:486487299224
  19  75


One of our goals for our Ceph cluster is to replace the EMC SANs.  We've
accomplished this performance wise, the next step is to get a plausible
iSCSI HA solution working.  I'm very interested in what Mike Christie is
putting together.  I'm in the process of vetting the SUSE solution now.

BTW - The tests were run when we had 75 OSDs, which are all 7200RPM 2TB
HDs, across 9 OSD hosts.  We have no SSD journals, instead we have all the
disks setup as single disk RAID1 disk groups with WB cache with BBU.  All
OSD hosts have 40Gb networking and the ESXi hosts have 10G.

Jake


On Mon, Jul 11, 2016 at 12:06 PM, Oliver Dzombic 
wrote:

> Hi Mike,
>
> i was trying:
>
> https://ceph.com/dev-notes/adding-support-for-rbd-to-stgt/
>
> ONE target, from different OSD servers directly, to multiple vmware esxi
> servers.
>
> A config looked like:
>
> #cat iqn.ceph-cluster_netzlaboranten-storage.conf
>
> <target iqn.ceph-cluster_netzlaboranten-storage>
> driver iscsi
> bs-type rbd
> backing-store rbd/vmware-storage
> initiator-address 10.0.0.9
> initiator-address 10.0.0.10
> incominguser vmwaren-storage RPb18P0xAqkAw4M1
> </target>
>
>
> We had 4 OSD servers. Everyone had this config running.
> We had 2 vmware servers ( esxi ).
>
> So we had 4 paths to this vmware-storage RBD object.
>
> VMware, in the very end, had 8 paths (4 paths directly connected to
> the specific VMware server, plus 4 paths that this specific VMware server saw
> via the other VMware server).
>
> There were very big problems with performance. I am talking about < 10
> MB/s. So the customer was not able to use it, so good old nfs is serving.
>
> At that time we used Ceph Hammer, and I think the customer was using ESXi 5.5,
> or maybe ESXi 6; the testing was somewhere around last year.
>
> 
>
> We will make a new attempt now with ceph jewel and esxi 6 and this time
> we will manage the vmware servers.
>
> As soon as we fixed this
>
> "ceph mon Segmentation fault after set

[ceph-users] Advice on increasing pgs

2016-07-11 Thread Robin Percy
Hello,

I'm looking for some advice on how to most safely increase the pgs in our
primary ceph pool.

A bit of background: We're running ceph 0.80.9 and have a cluster of 126
OSDs with only 64 pgs allocated to the pool. As a result, 2 OSDs are now
88% full, while the pool is only showing as 6% used.

Based on my understanding, this is clearly a placement problem, so the plan
is to increase to 2048 pgs. In order to avoid significant performance
degradation, we'll be incrementing pg_num and pgp_num one power of two at a
time and waiting for the cluster to rebalance before making the next
increment.

My question is: are there any other steps we can take to minimize potential
performance impact? And/or is there a way to model or predict the level of
impact, based on cluster configuration, data placement, etc?

Thanks in advance for any answers,
Robin
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] exclusive-lock

2016-07-11 Thread Jason Dillaman
Unfortunately that is correct -- the exclusive lock automatically
transitions upon request in order to handle QEMU live migration. There
is some on-going work to deeply integrate locking support into QEMU
which would solve this live migration case and librbd could internally
disable automatic lock transitions. In the meantime, before starting
your second copy of QEMU, you should issue a "ceph osd blacklist"
command against the current lock owner.  That will ensure you won't
have two QEMU processes fighting for the exclusive lock.
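
For reference, a rough sketch of that (the pool/image name and the client address
are placeholders; take them from your own "rbd lock list" output):

   # Find the current lock owner of the image:
   rbd lock list rbd/myimage
   # The "Address" column shows the lock holder, e.g. 192.168.0.10:0/123456789
   # Blacklist that client before starting the second QEMU:
   ceph osd blacklist add 192.168.0.10:0/123456789
   # Remove the entry again once things are resolved:
   ceph osd blacklist rm 192.168.0.10:0/123456789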

On Sat, Jul 9, 2016 at 12:37 PM, Bob Tucker  wrote:
> Hello all,
>
> I have been attempting to use the exclusive-lock rbd volume feature to try
> to protect against having two QEMUs writing to a volume at the same time.
> Specifically if one VM appears to fail due to a net-split, and a second copy
> is started somewhere else.
>
> Looking at various mailing list posts and some code patches it looks like
> this is not possible currently because if a client doesn't have the lock it
> will request it from the lock holder and the lock holder will always give it
> up. Therefore the lock will flip back and forth between the clients - which
> in the case of a regular filesystem (such as xfs) will lead to corruption.
>
> Could someone confirm this is the behavior and whether it is possible to
> protect the volume in this scenario?
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Advice on increasing pgs

2016-07-11 Thread David Turner
When you increase your PGs you're already going to be moving around all of your 
data.  Doing a full doubling of your PGs from 64 -> 128 -> 256 -> ... -> 2048 
over and over and letting it backfill to healthy every time is a lot of extra 
data movement that isn't needed.

I would recommend setting osd_max_backfills to something that won't cripple 
your cluster (5 works decently for us), set the norecover, nobackfill, nodown, 
and noout flags, and then increase your pg_num and pgp_num slowly until you 
reach your target.  How much extra RAM you have in each of your 
storage nodes determines how much you can increase pg_num by at a time.  We 
don't do more than ~200 at a time.  When you reach your target and there is no 
more peering happening, then unset norecover, nobackfill, and nodown.  After 
you finish all of the backfilling, then unset noout.

You are likely to see slow/blocked requests in your cluster throughout this 
process, but the best thing is to get to the other side of increasing your pgs. 
 The official recommendation for increasing pgs is to plan ahead for the size 
of your cluster and start with that many pgs because this process is painful 
and will slow down your cluster until it's done.

Note, if you're increasing pgs from 2048 to 4096, then doing it in smaller 
chunks of 512 at a time could make sense because of how ceph treats pools with 
a non-power-of-2 number of pgs.  If you have 8 pgs that are 4GB each and increase the 
number to 10 (a non-power of 2) then you will have 6 pgs that are 4GB and 4 pgs 
that are 2GB.  It splits them in half to fill up the number of pgs that aren't 
a power of 2.  If you went to 14 pgs, then you would have 2 pgs that are 4GB 
and 12 pgs that are 2GB.  Finally when you set it to 16 pgs you would have 16 
pgs that are all 2GB.

So if you increase your PGs by less than a power of 2, then it will only work 
on  that number of pgs and leave the rest of them alone.  However in your 
scenario of going from 64 pgs to 2048, you are going to be affecting all of the 
PGs every time you split and buy yourself nothing by doing it in smaller 
chunks.  The reason not to simply jump pg_num straight to 2048 is that when ceph 
creates each PG it has to peer, and you can peer your osds into oblivion and 
lose access to all of your data for a while.  That's why the recommendation is to 
add them bit by bit with nodown, noout, nobackfill, and norecover set, so that 
you get to the number you want and only then tell your cluster to start moving 
data.
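
For reference, a rough command sketch of that procedure (the pool name "rbd", the 
backfill value and the pg_num steps are placeholders; adjust them to your hardware):

   # Cap backfill concurrency to something the cluster can tolerate:
   ceph tell osd.* injectargs '--osd_max_backfills 5'
   # Freeze recovery while the new PGs are created and peer:
   ceph osd set noout
   ceph osd set nodown
   ceph osd set nobackfill
   ceph osd set norecover
   # Raise pg_num/pgp_num in chunks until the target is reached:
   ceph osd pool set rbd pg_num 256
   ceph osd pool set rbd pgp_num 256
   # ... repeat with larger values up to 2048 ...
   # Once peering has settled, let the data start moving:
   ceph osd unset norecover
   ceph osd unset nobackfill
   ceph osd unset nodown
   # After backfilling has finished:
   ceph osd unset noout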

From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of Robin Percy 
[rpe...@gmail.com]
Sent: Monday, July 11, 2016 2:53 PM
To: ceph-us...@ceph.com
Subject: [ceph-users] Advice on increasing pgs

Hello,

I'm looking for some advice on how to most safely increase the pgs in our 
primary ceph pool.

A bit of background: We're running ceph 0.80.9 and have a cluster of 126 OSDs 
with only 64 pgs allocated to the pool. As a result, 2 OSDs are now 88% full, 
while the pool is only showing as 6% used.

Based on my understanding, this is clearly a placement problem, so the plan is 
to increase to 2048 pgs. In order to avoid significant performance degradation, 
we'll be incrementing pg_num and pgp_num one power of two at a time and waiting 
for the cluster to rebalance before making the next increment.

My question is: are there any other steps we can take to minimize potential 
performance impact? And/or is there a way to model or predict the level of 
impact, based on cluster configuration, data placement, etc?

Thanks in advance for any answers,
Robin
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] New to Ceph - osd autostart problem

2016-07-11 Thread Dirk Laurenz

root@cephosd01:~# fdisk -l /dev/sdb

Disk /dev/sdb: 50 GiB, 53687091200 bytes, 104857600 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 87B152E0-EB5D-4EB0-8FFB-C27096CBB1ED

DeviceStart   End  Sectors Size Type
/dev/sdb1  10487808 104857566 94369759  45G unknown
/dev/sdb2  2048  10487807 10485760   5G unknown

Partition table entries are not in disk order.
root@cephosd01:~# fdisk -l /dev/sdc

Disk /dev/sdc: 50 GiB, 53687091200 bytes, 104857600 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 31B81FCA-9163-4723-B195-97AEC9568AD0

DeviceStart   End  Sectors Size Type
/dev/sdc1  10487808 104857566 94369759  45G unknown
/dev/sdc2  2048  10487807 10485760   5G unknown

Partition table entries are not in disk order.


On 11.07.2016 at 18:01, George Shuklin wrote:

Check out partition type for data partition for ceph.

fdisk -l /dev/sdc

On 07/11/2016 04:03 PM, Dirk Laurenz wrote:


hmm, helps partially ... running


/usr/sbin/ceph-disk trigger /dev/sdc1 or sdb1 works and brings osd up..


systemctl enable does not help


On 11.07.2016 at 14:49, George Shuklin wrote:

Short story how OSDs are started in systemd environments:

Ceph OSD partitions have a specific typecode (partition type 
4FBD7E29-9D25-41B8-AFD0-062C0CEFF05D). It is handled by udev rules 
shipped by the ceph package:

/lib/udev/rules.d/95-ceph-osd.rules

It sets up the proper owner/group for this disk ('ceph' instead of 'root') 
and calls /usr/sbin/ceph-disk trigger.


ceph-disk triggers the creation of an instance of the ceph-disk@ systemd unit 
(to mount the disk to /var/lib/ceph/osd/...), and of ceph-osd@ (I'm not 
sure about the whole sequence of events).


Basically, to make OSDs autostart they NEED to have the proper typecode 
on their partition. If you are using something different (like a 
'directory-based OSD') you should enable OSD autostart explicitly:


systemctl enable ceph-osd@42

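Since fdisk only reports the partition type as "unknown" above, a rough sketch of 
how one might check and, if needed, fix the typecode (device and partition number 
are examples; be careful, this edits partition metadata):

   # Show the partition type GUID of the OSD data partition:
   sgdisk --info=1 /dev/sdb
   # If it is not the Ceph OSD data type, set it:
   sgdisk --typecode=1:4fbd7e29-9d25-41b8-afd0-062c0ceff05d /dev/sdb
   partprobe /dev/sdb
   # The udev rule should now pick it up; it can also be triggered by hand:
   /usr/sbin/ceph-disk trigger /dev/sdb1
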

On 07/11/2016 03:32 PM, Dirk Laurenz wrote:

Hello,


I'm new to Ceph and am trying to take some first steps with it to 
understand the concepts.


My setup is, at first, completely in VMs.


I deployed (with ceph-deploy) three monitors and three OSD hosts 
(3+3 VMs).


My first test was to find out whether everything comes back online 
after a system restart. This works fine for the monitors, but fails 
for the OSDs; I have to start them manually.



The OS is Debian Jessie; Ceph is the current release.


Where can I find out what's going wrong?


Dirk

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Object creation in librbd

2016-07-11 Thread Mansour Shafaei Moghaddam
Can anyone explain or at least refer to the lines of the code in librbd by
which objects are created? I need to know the relation between objects and
fio's iodepth...

Thanks in advance
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Slow performance into windows VM

2016-07-11 Thread Christian Balzer

Hello,

scrub settings will only apply to new scrubs, not running ones, as you
found out.

On Mon, 11 Jul 2016 15:37:49 +0300 K K wrote:

> 
> I have tested windows instance Crystal Disk Mark. Result is:
>
Again, when running a test like this, check with atop/iostat how your
OSDs/HDDs are doing
 
> Sequential Read : 43.049 MB/s
> Sequential Write : 45.181 MB/s
> Random Read 512KB : 78.660 MB/s
> Random Write 512KB : 39.292 MB/s
> Random Read 4KB (QD=1) : 3.511 MB/s [ 857.3 IOPS]
> Random Write 4KB (QD=1) : 1.380 MB/s [ 337.0 IOPS]
> Random Read 4KB (QD=32) : 32.220 MB/s [ 7866.1 IOPS]
> Random Write 4KB (QD=32) : 12.564 MB/s [ 3067.4 IOPS]
> Test : 4000 MB [D: 97.5% (15699.7/16103.1 GB)] (x3)
> 

These numbers aren't all that bad, with your network and w/o SSD journals
the 4KB ones are pretty much on par.

You may get better read performance by permanently enabling read-ahead, as
per:
http://docs.ceph.com/docs/hammer/rbd/rbd-config-ref/
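
For reference, the read-ahead knobs from that page end up in the client's ceph.conf 
roughly like this (the values here are examples, not recommendations):

   [client]
   rbd readahead trigger requests = 10
   rbd readahead max bytes = 4194304
   rbd readahead disable after bytes = 0

Setting "rbd readahead disable after bytes = 0" keeps read-ahead permanently 
enabled instead of letting it switch off after the first 50MB read (the default).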

Windows may have native settings to do that, but I know zilch about that.

Christian

> >Monday, July 11, 2016, 12:38 +05:00 from Christian Balzer :
> >
> >
> >Hello,
> >
> >On Mon, 11 Jul 2016 09:54:59 +0300 K K wrote:
> >
> >> 
> >> > I hope the fastest of these MONs (CPU and storage) has the lowest IP
> >> > number and thus is the leader.
> >> no, the lowest IP has slowest CPU. But zabbix didn't show any load at all 
> >> mons.
> >
> >In your use case and configuration no surprise, but again, the lowest IP
> >will be leader by default and thus the busiest. 
> >
> >> > Also what Ceph, OS, kernel version?
> >> 
> >> ubuntu 16.04 kernel 4.4.0-22
> >> 
> >Check the ML archives, I remember people having performance issues with the
> >4.4 kernels.
> >
> >Still don't know your Ceph version, is it the latest Jewel?
> >
> >> > Two GbE ports, given the "frontend" up there with the MON description I
> >> > assume that's 1 port per client (front) and cluster (back) network?
> >> yes, one GbE for ceph client, one GbE for back network.
> >OK, so (from a single GbE client) 100MB/s at most.
> >
> >> > Is there any other client on than that Windows VM on your Ceph cluster?
> >> Yes, another one instance but without load.
> >OK.
> >
> >> > Is Ceph understanding this now?
> >> > Other than that, the queue options aren't likely to do much good with 
> >> > pure
> >> >HDD OSDs.
> >> 
> >> I can't find those parameter in running config:
> >> ceph --admin-daemon /var/run/ceph/ceph-mon.block01.asok config show|grep 
> >> "filestore_queue"
> >
> >These are OSD parameters, you need to query an OSD daemon. 
> >
> >> "filestore_queue_max_ops": "3000",
> >> "filestore_queue_max_bytes": "1048576000",
> >> "filestore_queue_max_delay_multiple": "0",
> >> "filestore_queue_high_delay_multiple": "0",
> >> "filestore_queue_low_threshhold": "0.3",
> >> "filestore_queue_high_threshhold": "0.9",
> >> > That should be 512, 1024 really with one RBD pool.
> >> 
> >> Yes, I know. Today for test I added scbench pool with 128 pg
> >> There are output status and osd tree:
> >> ceph status
> >> cluster 830beb43-9898-4fa9-98c1-ee04c1cdf69c
> >> health HEALTH_OK
> >> monmap e6: 3 mons at 
> >> {block01=10.30.9.21:6789/0,object01=10.30.9.129:6789/0,object02=10.30.9.130:6789/0}
> >> election epoch 238, quorum 0,1,2 block01,object01,object02
> >> osdmap e6887: 18 osds: 18 up, 18 in
> >> pgmap v9738812: 1280 pgs, 3 pools, 17440 GB data, 4346 kobjects
> >> 35049 GB used, 15218 GB / 50267 GB avail
> >> 1275 active+clean
> >> 3 active+clean+scrubbing+deep
> >> 2 active+clean+scrubbing
> >>
> >Check the ML archives and restrict scrubs to off-peak hours as well as
> >tune things to keep their impact low.
> >
> >Scrubbing is a major performance killer, especially on non-SSD journal
> >OSDs and with older Ceph versions and/or non-tuned parameters:
> >---
> >osd_scrub_end_hour = 6
> >osd_scrub_load_threshold = 2.5
> >osd_scrub_sleep = 0.1
> >---
> >
> >> client io 5030 kB/s rd, 1699 B/s wr, 19 op/s rd, 0 op/s wr
> >> 
> >> ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY 
> >> -1 54.0 root default 
> >> -2 27.0 host cn802 
> >> 0 3.0 osd.0 up 1.0 1.0 
> >> 2 3.0 osd.2 up 1.0 1.0 
> >> 4 3.0 osd.4 up 1.0 1.0 
> >> 6 3.0 osd.6 up 0.89995 1.0 
> >> 8 3.0 osd.8 up 1.0 1.0 
> >> 10 3.0 osd.10 up 1.0 1.0 
> >> 12 3.0 osd.12 up 0.8 1.0 
> >> 16 3.0 osd.16 up 1.0 1.0 
> >> 18 3.0 osd.18 up 0.90002 1.0 
> >> -3 27.0 host cn803 
> >> 1 3.0 osd.1 up 1.0 1.0 
> >> 3 3.0 osd.3 up 0.95316 1.0 
> >> 5 3.0 osd.5 up 1.0 1.0 
> >> 7 3.0 osd.7 up 1.0 1.0 
> >> 9 3.0 osd.9 up 1.0 1.0 
> >> 11 3.0 osd.11 up 0.95001 1.0 
> >> 13 3.0 osd.13 up 1.0 1.0 
> >> 17 3.0 osd.17 up 0.84999 1.0 
> >> 19 3.0 osd.19 up 1.0 1.0
> >> > Wrong way to test this, test it from a monitor node, another client node
> >> > (like your openstack nodes).
> >> > In 

Re: [ceph-users] Fwd: Ceph OSD suicide himself

2016-07-11 Thread Brad Hubbard
On Mon, Jul 11, 2016 at 04:53:36PM +0200, Lionel Bouton wrote:
> On 11/07/2016 11:56, Brad Hubbard wrote:
> > On Mon, Jul 11, 2016 at 7:18 PM, Lionel Bouton
> >  wrote:
> >> On 11/07/2016 04:48, 한승진 wrote:
> >>> Hi cephers.
> >>>
> >>> I need your help for some issues.
> >>>
> >>> The ceph cluster version is Jewel(10.2.1), and the filesytem is btrfs.
> >>>
> >>> I run 1 Mon and 48 OSD in 4 Nodes(each node has 12 OSDs).
> >>>
> >>> I've experienced one of OSDs was killed himself.
> >>>
> >>> Always it issued suicide timeout message.
> >> This is probably a fragmentation problem : typical rbd access patterns
> >> cause heavy BTRFS fragmentation.
> > To the extent that operations take over 120 seconds to complete? Really?
> 
> Yes, really. I had these too. By default Ceph/RBD uses BTRFS in a very
> aggressive way, rewriting data all over the place and creating/deleting
> snapshots every filestore sync interval (5 seconds max by default IIRC).
> 
> As I said there are 3 main causes of performance degradation :
> - the snapshots,
> - the journal in a standard copy-on-write file (move it out of the FS or
> use NoCow),
> - the weak auto defragmentation of BTRFS (autodefrag mount option).
> 
> Each one of them is enough to impact or even destroy performance in the
> long run. The 3 combined make BTRFS unusable by default. This is why
> BTRFS is not recommended : if you want to use it you have to be prepared
> for some (heavy) tuning. The first 2 points are easy to address, for the
> last (which begins to be noticeable when you accumulate rewrites on your
> data) I'm not aware of any other tool than the one we developed and
> published on github (link provided in previous mail).
> 
> Another thing : you better have a recent 4.1.x or 4.4.x kernel on your
> OSDs if you use BTRFS. We've used it since 3.19.x but I wouldn't advise
> it now and would recommend 4.4.x if it's possible for you and 4.1.x
> otherwise.

Thanks for the information. I wasn't aware things were that bad with BTRFS as
I haven't had much to do with it up to this point.

Cheers,
Brad

> 
> Best regards,
> 
> Lionel

-- 
Cheers,
Brad
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] exclusive-lock

2016-07-11 Thread Christian Balzer

Hello,

In this context my first question would also be, how does one wind up with
such a lock contention in the first place?
And how to safely resolve this?

Both of which are not Ceph problems, but those of the client stack being
used or of knowledgeable, 24/7 monitoring and management.

Net-split, split-brain scenarios need to either be resolved by:

1. A human being making the correct decision and thus avoiding two clients
accessing the same image (neither OpenStack, Ganeti nor OpenNebula
offers a safe, automatic split-brain resolver out of the box).

or

2. A system like Pacemaker that has all the tools and means to both
identify a split brain scenario correctly and do the right thing by
itself.
Which incidentally could also include doing that blacklist thing as a mild
form of STONITH. ^o^


Regards,

Christian

On Mon, 11 Jul 2016 17:30:16 -0400 Jason Dillaman wrote:

> Unfortunately that is correct -- the exclusive lock automatically
> transitions upon request in order to handle QEMU live migration. There
> is some on-going work to deeply integrate locking support into QEMU
> which would solve this live migration case and librbd could internally
> disable automatic lock transitions. In the meantime, before starting
> your second copy of QEMU, you should issue a "ceph osd blacklist"
> command against the current lock owner.  That will ensure you won't
> have two QEMU processes fighting for the exclusive lock.
> 
> On Sat, Jul 9, 2016 at 12:37 PM, Bob Tucker  wrote:
> > Hello all,
> >
> > I have been attempting to use the exclusive-lock rbd volume feature to try
> > to protect against having two QEMUs writing to a volume at the same time.
> > Specifically if one VM appears to fail due to a net-split, and a second copy
> > is started somewhere else.
> >
> > Looking at various mailing list posts and some code patches it looks like
> > this is not possible currently because if a client doesn't have the lock it
> > will request it from the lock holder and the lock holder will always give it
> > up. Therefore the lock will flip back and forth between the clients - which
> > in the case of a regular filesystem (such as xfs) will lead to corruption.
> >
> > Could someone confirm this is the behavior and whether it is possible to
> > protect the volume in this scenario?
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> 
> 
> 


-- 
Christian BalzerNetwork/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cache Tier configuration

2016-07-11 Thread Christian Balzer

Hello,

On Mon, 11 Jul 2016 16:19:58 +0200 Mateusz Skała wrote:

> Hello Cephers.
> 
> Can someone help me with my cache tier configuration? I have 4 identical SSD drives
> of 176GB (184196208K) in the SSD pool; how do I determine target_max_bytes? 

What exact SSD models are these?
What version of Ceph?

> I assume
> that should be (4 drives * 188616916992 bytes) / 3 replicas = 251489222656
> bytes * 85% (because of the full-disk warning).

In theory correct, but you might want to consider (like with all pools)
the impact of losing a single SSD. 
In short, backfilling and then the remaining 3 getting full anyway.

> That will be 213765839257 bytes, ~200GB. I made this a little bit lower (160GB),
> and after some time the whole cluster stopped on a full-disk error. One of the SSD
> drives was full. I see that space usage on the OSDs is not equal:
> 
> 32 0.17099  1.0   175G   127G 49514M 72.47 1.77  95
> 
> 42 0.17099  1.0   175G   120G 56154M 68.78 1.68  90
> 
> 37 0.17099  1.0   175G   136G 39670M 77.95 1.90 102
> 
> 47 0.17099  1.0   175G   130G 46599M 74.09 1.80  97
> 

What's the exact error message?

None of these are over 85 or 95%, how are they full?

If the above is a snapshot of when Ceph thinks something is "full", it may
be an indication that you've reached target_max_bytes and Ceph simply has
no clean (flushed) objects ready to evict.
Which means a configuration problem (all ratios, not the defaults, for
this pool please) or your cache filling up faster than it can flush.
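
For reference, these are set per pool, roughly like this (the cache pool name 
"ssd-cache" and the values are placeholders, not recommendations):

   ceph osd pool ls detail
   # Cap the cache pool and give flushing some headroom:
   ceph osd pool set ssd-cache target_max_bytes 171798691840   # ~160GB
   ceph osd pool set ssd-cache cache_target_dirty_ratio 0.4
   ceph osd pool set ssd-cache cache_target_full_ratio 0.8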

Space is never equal with Ceph, you need a high enough number of PGs for
starters and then some fine-tuning.

After fiddling with the weights my cache-tier SSD OSDs are all very close
to each other:
---
ID WEIGHT  REWEIGHT SIZE  USEAVAIL  %USE  VAR  
18 0.64999  1.0  679G   543G   136G 79.96 4.35 
19 0.67000  1.0  679G   540G   138G 79.61 4.33 
20 0.64999  1.0  679G   534G   144G 78.70 4.28 
21 0.64999  1.0  679G   536G   142G 79.03 4.30 
26 0.62999  1.0  679G   540G   138G 79.57 4.33 
27 0.62000  1.0  679G   538G   140G 79.30 4.32 
28 0.67000  1.0  679G   539G   140G 79.35 4.32 
29 0.69499  1.0  679G   536G   142G 78.96 4.30 
---

>  
> 
> My setup:
> 
> ceph --admin-daemon /var/run/ceph/ceph-osd.32.asok config show | grep cache
> 
>   
Nearly all of these are irrelevant, output of "ceph osd pool ls detail"
please, at least for the cache pool.

Have you read the documentation and my thread in this ML labeled 
"Cache tier operation clarifications"?

> 
> Can someone help? Any ideas? Is it normal that the whole cluster stops on a
> disk-full error on the cache tier? I was thinking that only one of the pools would
> stop and the others without a cache tier should still work.
>
Once you activate a cache tier it becomes, for all intents and purposes,
the pool it's caching for.
So any problem with it will be fatal.

Christian
-- 
Christian BalzerNetwork/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph v10.2.2 compile issue

2016-07-11 Thread 徐元慧
Hi All

I use the Ceph stable version v10.2.2. When I compile the source
code, I use make && make install. The make command always
builds successfully, but make install always fails with the
same issue: installing
/usr/local/lib/python2.7/dist-packages/ceph_detect_init-1.0.1-py2.7.egg
fails.

The detail log is below:

Processing dependencies for ceph-detect-init==1.0.1
Traceback (most recent call last):
  File "setup.py", line 75, in 
'ceph-detect-init = ceph_detect_init.main:run',
  File "/usr/lib/python2.7/distutils/core.py", line 151, in setup
dist.run_commands()
  File "/usr/lib/python2.7/distutils/dist.py", line 953, in run_commands
self.run_command(cmd)
  File "/usr/lib/python2.7/distutils/dist.py", line 972, in run_command
cmd_obj.run()
  File "/usr/lib/python2.7/dist-packages/setuptools/command/install.py",
line 73, in run
self.do_egg_install()
  File "/usr/lib/python2.7/dist-packages/setuptools/command/install.py",
line 96, in do_egg_install
cmd.run()
  File "/usr/lib/python2.7/dist-packages/setuptools/command/easy_install.py",
line 381, in run
self.easy_install(spec, not self.no_deps)
  File "/usr/lib/python2.7/dist-packages/setuptools/command/easy_install.py",
line 597, in easy_install
return self.install_item(None, spec, tmpdir, deps, True)
  File "/usr/lib/python2.7/dist-packages/setuptools/command/easy_install.py",
line 648, in install_item
self.process_distribution(spec, dist, deps)
  File "/usr/lib/python2.7/dist-packages/setuptools/command/easy_install.py",
line 694, in process_distribution
[requirement], self.local_index, self.easy_install
  File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 633, in resolve
requirements.extend(dist.requires(req.extras)[::-1])
  File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 2291,
in requires
dm = self._dep_map
  File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 2277,
in _dep_map
for extra,reqs in split_sections(self._get_metadata(name)):
  File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 2715,
in split_sections
for line in yield_lines(s):
  File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 1989,
in yield_lines
for ss in strs:
  File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 2305,
in _get_metadata
for line in self.get_metadata_lines(name):
  File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 1369,
in get_metadata_lines
return yield_lines(self.get_metadata(name))
  File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 1361,
in get_metadata
return self._get(self._fn(self.egg_info,name))
  File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 1425, in _get
return self.loader.get_data(path)
zipimport.ZipImportError: bad local file header in
/usr/local/lib/python2.7/dist-packages/ceph_detect_init-1.0.1-py2.7.egg

When I remove the package called ceph_detect_init-1.0.1-py2.7.egg,
the build succeeds. Did anyone meet the same issue? I
don't think I should have to remove the Python package every
time.


Thank you.
Yuanhui
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Mail Test

2016-07-11 Thread xiongnuwang
I have joined.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Advice on increasing pgs

2016-07-11 Thread Robin Percy
First off, thanks for the great response David.

If I understand correctly, you're saying there are two distinct costs to
consider: peering, and backfilling. The backfilling cost is a function of
the amount of data in our pool, and therefore won't benefit from
incremental steps. But the peering cost is a function of pg_num, and should
be incremented in steps of at most ~200 (depending on hardware) until we
reach a power of 2.

Assuming I've got that right, one follow up question is: should we expect
blocked/delayed requests during both the peering and backfilling processes,
or is it more common in one than the other? I couldn't quite get a
definitive answer from the docs on peering.

At this point we're planning to hedge our bets by increasing pg_num to 256
before backfilling so we can at least buy some headroom on our full OSDs
and evaluate the impact before deciding whether we can safely make the
jumps to 2048 without an outage. If that doesn't make sense, I may be
overestimating the cost of peering.

Thanks again for your help,
Robin


On Mon, Jul 11, 2016 at 2:40 PM David Turner 
wrote:

> When you increase your PGs you're already going to be moving around all of
> your data.  Doing a full doubling of your PGs from 64 -> 128 -> 256 -> ...
> -> 2048 over and over and letting it backfill to healthy every time is a
> lot of extra data movement that isn't needed.
>
> I would recommend setting osd_max_backfills to something that won't
> cripple your cluster (5 works decently for us), set the norecover,
> nobackfill, nodown, and noout flags, and then increase your pg_num and
> pgp_num slowly until you reach your target.  Depending on how much extra
> RAM you have in each of your storage nodes depends on how much you want to
> increase pg_num by at a time.  We don't do more than ~200 at a time.  When
> you reach your target and there is no more peering happening, then unset
> norecover, nobackfill, and nodown.  After you finish all of the
> backfilling, then unset noout.
>
> You are likely to see slow/blocked requests in your cluster throughout
> this process, but the best thing is to get to the other side of increasing
> your pgs.  The official recommendation for increasing pgs is to plan ahead
> for the size of your cluster and start with that many pgs because this
> process is painful and will slow down your cluster until it's done.
>
> Note, if you're increasing pgs from 2048 to 4096, then doing it in smaller
> chunks of 512 at a time could make sense because of how ceph treats pools
> with a non-base 2 number of pgs.  if you have 8 pgs that are 4GB and
> increase the number to 10 (a non-power of 2) then you will have 6 pgs that
> are 4GB and 4 pgs that are 2GB.  It splits them in half to fill up the
> number of pgs that aren't a power of 2.  If you went to 14 pgs, then you
> would have 2 pgs that are 4GB and 12 pgs that are 2GB.  Finally when you
> set it to 16 pgs you would have 16 pgs that are all 2GB.
>
> So if you increase your PGs by less than a power of 2, then it will only
> work on  that number of pgs and leave the rest of them alone.  However in
> your scenario of going from 64 pgs to 2048, you are going to be affecting
> all of the PGs every time you split and buy yourself nothing by doing it in
> smaller chunks.  The reason to not just increase pg_num to 2048 is that
> when ceph creates each PG it has to peer and you can peer your osds into
> oblivion and lose access to all of your data for a while, that's why the
> recommendation to add them bit by bit with nodown, noout, nobackfill, and
> norecover set so that you get to the number you want and then can tell your
> cluster to start moving data.
> --
> *From:* ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of Robin
> Percy [rpe...@gmail.com]
> *Sent:* Monday, July 11, 2016 2:53 PM
> *To:* ceph-us...@ceph.com
> *Subject:* [ceph-users] Advice on increasing pgs
>
> Hello,
>
> I'm looking for some advice on how to most safely increase the pgs in our
> primary ceph pool.
>
> A bit of background: We're running ceph 0.80.9 and have a cluster of 126
> OSDs with only 64 pgs allocated to the pool. As a result, 2 OSDs are now
> 88% full, while the pool is only showing as 6% used.
>
> Based on my understanding, this is clearly a placement problem, so the
> plan is to increase to 2048 pgs. In order to avoid significant performance
> degradation, we'll be incrementing pg_num and pgp_num one power of two at a
> time and waiting for the cluster to rebalance before making the next
> increment.
>
> My question is: are there any other steps we can take to minimize
> potential performance impact? And/or is there a way to model or predict the
> level of impact, based on cluster configuration, data placement, etc?
>
> Thanks in advance for any answers,
> Robin
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-fuse segfaults ( jewel 10.2.2)

2016-07-11 Thread Yan, Zheng
On Tue, Jul 12, 2016 at 1:07 AM, Gregory Farnum  wrote:
> Oh, is this one of your custom-built packages? Are they using
> tcmalloc? That difference between VSZ and RSS looks like a glibc
> malloc problem.
> -Greg
>

ceph-fuse at http://download.ceph.com/rpm-jewel/el7/x86_64/ is not
linked to libtcmalloc either. open issue
http://tracker.ceph.com/issues/16655

Yan, Zheng
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mail Test

2016-07-11 Thread Ken Peng
You are welcome. But please don't send test messages to a public list. :)

2016-07-12 11:07 GMT+08:00 xiongnuwang :

> I have joined。
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph + vmware

2016-07-11 Thread Alex Gorbachev
Hi Oliver,

On Friday, July 8, 2016, Oliver Dzombic  wrote:

> Hi,
>
> does anyone have experience how to connect vmware with ceph smart ?
>
> iSCSI multipath does not really worked well.
> NFS could be, but i think thats just too much layers in between to have
> some useable performance.
>
> Systems like ScaleIO have developed a vmware addon to talk with it.
>
> Is there something similar out there for ceph ?
>
> What are you using ?


We use RBD with SCST, Pacemaker and EnhanceIO (for read only SSD caching).
The HA agents are open source, there are several options for those.
Currently running 3 VMware clusters with 15 hosts total, and things are
quite decent.

Regards,
Alex Gorbachev
Storcium


>
> Thank you !
>
> --
> Mit freundlichen Gruessen / Best regards
>
> Oliver Dzombic
> IP-Interactive
>
> mailto:i...@ip-interactive.de 
>
> Anschrift:
>
> IP Interactive UG ( haftungsbeschraenkt )
> Zum Sonnenberg 1-3
> 63571 Gelnhausen
>
> HRB 93402 beim Amtsgericht Hanau
> Geschäftsführung: Oliver Dzombic
>
> Steuer Nr.: 35 236 3622 1
> UST ID: DE274086107
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


-- 
--
Alex Gorbachev
Storcium
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Repairing a broken leveldb

2016-07-11 Thread Michael Metz-Martini | SpeedPartner GmbH
Hi,

While rebalancing, a drive experienced read errors, so I think the leveldb was
corrupted. Unfortunately there's currently no second copy which is
up to date that would let me simply forget this pg. Only one pg is affected (I moved
all the other pgs away as they had active copies on another osd).

In "daily business" this osd is still running, but crashes when starting
backfilling [1]. This pg holds metadata for our CephFS, so losing data
would be painful.

Any ideas how to recover/repair leveldb or at least skip the broken
part? Thanks in advance.

  "up": [
34,
105],
  "acting": [
9],
  "backfill_targets": [
"34",
"105"],
  "actingbackfill": [
"9",
"34",
"105"],

[1] http://www.michael-metz.de/ceph-osd.9.log.gz

-- 
Kind regards
 Michael Metz-Martini

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] New to Ceph - osd autostart problem

2016-07-11 Thread Dirk Laurenz

And this, after starting the OSD manually:


root@cephosd01:~# df
Filesystem 1K-blocksUsed Available Use% Mounted on
/dev/dm-0   15616412 1583180  13216900  11% /
udev   10240   0 10240   0% /dev
tmpfs  496564636 45020  10% /run
tmpfs 124132   0124132   0% /dev/shm
tmpfs   5120   0  5120   0% /run/lock
tmpfs 124132   0124132   0% /sys/fs/cgroup
/dev/sda1 240972   33309195222  15% /boot
/dev/sdb1   47161840   35260  47126580   1% /var/lib/ceph/osd/ceph-0
/dev/sdc1   47161840   34952  47126888   1% /var/lib/ceph/osd/ceph-1


What I did not understand: I would expect ceph-deploy to work 
properly. I just set up all six nodes in a fresh install, and then used 
ceph-deploy to install them:


All done from a adminvm:

ceph-deploy new cephmon01 cephmon02 cephmon03
ceph-deploy install cephmon01 cephmon02 cephmon03 cephosd01 cephosd02 
cephosd03

ceph-deploy mon create cephmon01
ceph-deploy mon create cephmon02
ceph-deploy mon create cephmon03
ceph-deploy osd prepare  cephosd01:sdb cephosd01:sdc
ceph-deploy osd prepare  cephosd02:sdb cephosd02:sdc
ceph-deploy osd prepare  cephosd03:sdb cephosd03:sdc
ceph osd tree

and directly afterwards (after seeing 6 OSDs up)

ssh cephosd01 shutdown -r

root@cephadmin:~# cat /etc/debian_version
8.5


On 12.07.2016 at 00:05, Dirk Laurenz wrote:


root@cephosd01:~# fdisk -l /dev/sdb

Disk /dev/sdb: 50 GiB, 53687091200 bytes, 104857600 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 87B152E0-EB5D-4EB0-8FFB-C27096CBB1ED

DeviceStart   End  Sectors Size Type
/dev/sdb1  10487808 104857566 94369759  45G unknown
/dev/sdb2  2048  10487807 10485760   5G unknown

Partition table entries are not in disk order.
root@cephosd01:~# fdisk -l /dev/sdc

Disk /dev/sdc: 50 GiB, 53687091200 bytes, 104857600 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 31B81FCA-9163-4723-B195-97AEC9568AD0

DeviceStart   End  Sectors Size Type
/dev/sdc1  10487808 104857566 94369759  45G unknown
/dev/sdc2  2048  10487807 10485760   5G unknown

Partition table entries are not in disk order.


On 11.07.2016 at 18:01, George Shuklin wrote:

Check out partition type for data partition for ceph.

fdisk -l /dev/sdc

On 07/11/2016 04:03 PM, Dirk Laurenz wrote:


hmm, helps partially ... running


/usr/sbin/ceph-disk trigger /dev/sdc1 or sdb1 works and brings osd up..


systemctl enable does not help


On 11.07.2016 at 14:49, George Shuklin wrote:

Short story how OSDs are started in systemd environments:

Ceph OSD parittions has specific typecode (partition type 
4FBD7E29-9D25-41B8-AFD0-062C0CEFF05D). It handled by udev rules 
shipped by ceph package:

/lib/udev/rules.d/95-ceph-osd.rules

It set up proper owner/group for this disk ('ceph' instead 'root') 
and calls /usr/sbin/ceph-disk trigger.


ceph-disk triggers creation of instance of ceph-disk@ systemd unit 
(to mount disk to /var/lib/ceph/osd/...), and ceph-osd@ (i'm not 
sure about all sequence of events).


Basically, to make OSD autostart they NEED to have proper typecode 
in their partition. If you using something different (like 
'directory based OSD') you should enable OSD autostart:


systemctl enable ceph-osd@42


On 07/11/2016 03:32 PM, Dirk Laurenz wrote:

Hello,


i'm new to ceph an try to do some first steps with ceph to 
understand concepts.


my setup is at first completly in vm


i deployed (with ceph-deploy) three monitors and three osd hosts. 
(3+3 vms)


my frist test was to find out, if everything comes back online 
after a system restart. this works fine for the monitors, but 
fails for the osds. i have to start them manually.



OS is debian jessie, ceph is the current release


Where can find out, what's going wrong


Dirk

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
h

Re: [ceph-users] Advice on increasing pgs

2016-07-11 Thread Christian Balzer

Hello,

On Tue, 12 Jul 2016 03:43:41 + Robin Percy wrote:

> First off, thanks for the great response David.
>
Yes, that was a very good writeup.
 
> If I understand correctly, you're saying there are two distinct costs to
> consider: peering, and backfilling. The backfilling cost is a function of
> the amount of data in our pool, and therefore won't benefit from
> incremental steps. But the peering cost is a function of pg_num, and should
> be incremented in steps of at most ~200 (depending on hardware) until we
> reach a power of 2.
>
Peering is all about RAM (more links, states, permanently so), CPU and
network (when setting up the links).
And this happens instantaneously, with no parameters in Ceph to slow this
down.

So yes, you want to increase the pg_num and pgp_num somewhat slowly, at
least at first until you have a feel for what your HW can handle.

> Assuming I've got that right, one follow up question is: should we expect
> blocked/delayed requests during both the peering and backfilling processes,
> or is it more common in one than the other? I couldn't quite get a
> definitive answer from the docs on peering.
> 
Peering is a sharp shock; it should be quick to resolve (again, depending
on HW, etc.) and not lead to noticeable interruptions.
But YMMV, thus again the initial baby steps.

Backfilling is that inevitable avalanche, but if you start with
osd_max_backfills=1 and then creep it up as you get a feel of what you
cluster can handle you should be able to both keep slow requests at bay
AND hopefully finish within a reasonable sized maintenance window.

Since you're still on Firefly, you won't be getting the queue benefits of
Jewel, which should help with backfilling stomping on client traffic toes
as well.

OTOH, you're currently only using a fraction of your cluster's capabilities
(64 PGs with 126 OSDs), so there should be quite some capacity for this
reshuffle available. 

> At this point we're planning to hedge our bets by increasing pg_num to 256
> before backfilling so we can at least buy some headroom on our full OSDs
> and evaluate the impact before deciding whether we can safely make the
> jumps to 2048 without an outage. If that doesn't make sense, I may be
> overestimating the cost of peering.
> 
As David said, freeze your cluster (norecover, nobackfill, nodown and
noout), slowly up your PGs and PGPs then let the good times roll and
unleash the dogs of backfill.


The thing that worries me the most in your scenario are the already
near-full OSDs.

As many people found out the hard way, Ceph may initially go and put MORE
data on OSDs before later distributing things more evenly.
See for example this mail from me and the image URL in it:
http://www.spinics.net/lists/ceph-users/msg27794.html

Normally my advice would be to re-weight the full (or near-empty) OSDs so
that things get a bit more evenly distributed and below near-full levels
before starting the PG increase.
But in your case with so few PGs to begin with, it's going to be tricky to
get it right and not make things worse.
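
For reference, that kind of re-weighting is a temporary override between 0 and 1 
on top of the CRUSH weight, e.g. (the OSD id and value are placeholders):

   ceph osd reweight 12 0.9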

Hopefully the plentiful PG/OSD choices Ceph has after the PG increase in
your case will make it do the right thing from the get-go.

Christian


> Thanks again for your help,
> Robin
> 
> 
> On Mon, Jul 11, 2016 at 2:40 PM David Turner 
> wrote:
> 
> > When you increase your PGs you're already going to be moving around all of
> > your data.  Doing a full doubling of your PGs from 64 -> 128 -> 256 -> ...
> > -> 2048 over and over and letting it backfill to healthy every time is a
> > lot of extra data movement that isn't needed.
> >
> > I would recommend setting osd_max_backfills to something that won't
> > cripple your cluster (5 works decently for us), set the norecover,
> > nobackfill, nodown, and noout flags, and then increase your pg_num and
> > pgp_num slowly until you reach your target.  Depending on how much extra
> > RAM you have in each of your storage nodes depends on how much you want to
> > increase pg_num by at a time.  We don't do more than ~200 at a time.  When
> > you reach your target and there is no more peering happening, then unset
> > norecover, nobackfill, and nodown.  After you finish all of the
> > backfilling, then unset noout.
> >
> > You are likely to see slow/blocked requests in your cluster throughout
> > this process, but the best thing is to get to the other side of increasing
> > your pgs.  The official recommendation for increasing pgs is to plan ahead
> > for the size of your cluster and start with that many pgs because this
> > process is painful and will slow down your cluster until it's done.
> >
> > Note, if you're increasing pgs from 2048 to 4096, then doing it in smaller
> > chunks of 512 at a time could make sense because of how ceph treats pools
> > with a non-base 2 number of pgs.  if you have 8 pgs that are 4GB and
> > increase the number to 10 (a non-power of 2) then you will have 6 pgs that
> > are 4GB and 4 pgs that are 2GB.  It splits them in half to fill up t

[ceph-users] Flood of 'failed to encode map X with expected crc' on 1800 OSD cluster after upgrade

2016-07-11 Thread Wido den Hollander
Hi,

I am upgrading a 1800 OSD cluster from Hammer 0.94.5 to 0.94.7 prior to going 
to Jewel and while doing so I see the monitors being flooded with these 
messages:

2016-07-12 08:28:12.919748 osd.1200 [WRN] failed to encode map e130549 with 
expected crc
2016-07-12 08:28:12.921943 osd.1338 [WRN] failed to encode map e130549 with 
expected crc
2016-07-12 08:28:12.923814 osd.353 [WRN] failed to encode map e130549 with 
expected crc
2016-07-12 08:28:12.939370 osd.1200 [WRN] failed to encode map e130549 with 
expected crc
2016-07-12 08:28:12.941482 osd.1338 [WRN] failed to encode map e130549 with 
expected crc
2016-07-12 08:28:12.960100 osd.1338 [WRN] failed to encode map e130549 with 
expected crc
2016-07-12 08:28:12.979404 osd.1338 [WRN] failed to encode map e130549 with 
expected crc
2016-07-12 08:28:13.012463 osd.353 [WRN] failed to encode map e130549 with 
expected crc
2016-07-12 08:28:13.039417 osd.353 [WRN] failed to encode map e130549 with 
expected crc
2016-07-12 08:28:13.079893 osd.353 [WRN] failed to encode map e130549 with 
expected crc
2016-07-12 08:28:13.76 osd.575 [WRN] failed to encode map e130549 with 
expected crc
2016-07-12 08:28:13.135279 osd.353 [WRN] failed to encode map e130549 with 
expected crc
2016-07-12 08:28:13.144697 osd.575 [WRN] failed to encode map e130549 with 
expected crc

This just goes on and on. The flood of messages causes the monitors to start 
consuming a bit of CPU which makes the cluster operate slower.

I am restarting the OSDs slowly and when I stop doing so the messages disappear 
and the cluster operates just fine.

I know that the messages pop up due to a version mismatch, but is there any way 
to suppress them?

Wido
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Flood of 'failed to encode map X with expected crc' on 1800 OSD cluster after upgrade

2016-07-11 Thread Christian Balzer

Hello,

On Tue, 12 Jul 2016 08:39:16 +0200 (CEST) Wido den Hollander wrote:

> Hi,
> 
> I am upgrading a 1800 OSD cluster from Hammer 0.94.5 to 0.94.7 prior to going 
> to Jewel and while doing so I see the monitors being flooded with these 
> messages:
>
Google is your friend (and so is the NSA):
---
http://www.spinics.net/lists/ceph-devel/msg30450.html
---

That's also one of the reasons that despite only having a fraction of your
or Dan's OSDs I'm not upgrading to 0.94.7...

Christian

> 2016-07-12 08:28:12.919748 osd.1200 [WRN] failed to encode map e130549 with 
> expected crc
> 2016-07-12 08:28:12.921943 osd.1338 [WRN] failed to encode map e130549 with 
> expected crc
> 2016-07-12 08:28:12.923814 osd.353 [WRN] failed to encode map e130549 with 
> expected crc
> 2016-07-12 08:28:12.939370 osd.1200 [WRN] failed to encode map e130549 with 
> expected crc
> 2016-07-12 08:28:12.941482 osd.1338 [WRN] failed to encode map e130549 with 
> expected crc
> 2016-07-12 08:28:12.960100 osd.1338 [WRN] failed to encode map e130549 with 
> expected crc
> 2016-07-12 08:28:12.979404 osd.1338 [WRN] failed to encode map e130549 with 
> expected crc
> 2016-07-12 08:28:13.012463 osd.353 [WRN] failed to encode map e130549 with 
> expected crc
> 2016-07-12 08:28:13.039417 osd.353 [WRN] failed to encode map e130549 with 
> expected crc
> 2016-07-12 08:28:13.079893 osd.353 [WRN] failed to encode map e130549 with 
> expected crc
> 2016-07-12 08:28:13.76 osd.575 [WRN] failed to encode map e130549 with 
> expected crc
> 2016-07-12 08:28:13.135279 osd.353 [WRN] failed to encode map e130549 with 
> expected crc
> 2016-07-12 08:28:13.144697 osd.575 [WRN] failed to encode map e130549 with 
> expected crc
> 
> This just goes on and on. The flood of messages cause the monitors to start 
> consuming a bit of CPU which makes the cluster operate slower.
> 
> I am restarting the OSDs slowly and when I stop doing so the messages 
> disappear and the cluster operates just fine.
> 
> I know that the messages pop up due to a version mismatch, but is there any 
> way to suppress them?
> 
> Wido
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 


-- 
Christian BalzerNetwork/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Flood of 'failed to encode map X with expected crc' on 1800 OSD cluster after upgrade

2016-07-11 Thread Wido den Hollander

> Op 12 juli 2016 om 8:47 schreef Christian Balzer :
> 
> 
> 
> Hello,
> 
> On Tue, 12 Jul 2016 08:39:16 +0200 (CEST) Wido den Hollander wrote:
> 
> > Hi,
> > 
> > I am upgrading a 1800 OSD cluster from Hammer 0.94.5 to 0.94.7 prior to 
> > going to Jewel and while doing so I see the monitors being flooded with 
> > these messages:
> >
> Google is your friend (and so is the NSA):
> ---
> http://www.spinics.net/lists/ceph-devel/msg30450.html
> ---
> 

Thanks! I was searching, but never found that thread. Well, not that post in 
that thread.

The messages in my 'ceph -w' are still 20 minutes behind currently. Logging 
about 08:39 while it's 08:56 here right now.

Wido

> That's also one of the reasons that despite only having a fraction of your
> or Dan's OSDs I'm not upgrading to 0.94.7...
> 
> Christian
> 
> > 2016-07-12 08:28:12.919748 osd.1200 [WRN] failed to encode map e130549 with 
> > expected crc
> > 2016-07-12 08:28:12.921943 osd.1338 [WRN] failed to encode map e130549 with 
> > expected crc
> > 2016-07-12 08:28:12.923814 osd.353 [WRN] failed to encode map e130549 with 
> > expected crc
> > 2016-07-12 08:28:12.939370 osd.1200 [WRN] failed to encode map e130549 with 
> > expected crc
> > 2016-07-12 08:28:12.941482 osd.1338 [WRN] failed to encode map e130549 with 
> > expected crc
> > 2016-07-12 08:28:12.960100 osd.1338 [WRN] failed to encode map e130549 with 
> > expected crc
> > 2016-07-12 08:28:12.979404 osd.1338 [WRN] failed to encode map e130549 with 
> > expected crc
> > 2016-07-12 08:28:13.012463 osd.353 [WRN] failed to encode map e130549 with 
> > expected crc
> > 2016-07-12 08:28:13.039417 osd.353 [WRN] failed to encode map e130549 with 
> > expected crc
> > 2016-07-12 08:28:13.079893 osd.353 [WRN] failed to encode map e130549 with 
> > expected crc
> > 2016-07-12 08:28:13.76 osd.575 [WRN] failed to encode map e130549 with 
> > expected crc
> > 2016-07-12 08:28:13.135279 osd.353 [WRN] failed to encode map e130549 with 
> > expected crc
> > 2016-07-12 08:28:13.144697 osd.575 [WRN] failed to encode map e130549 with 
> > expected crc
> > 
> > This just goes on and on. The flood of messages cause the monitors to start 
> > consuming a bit of CPU which makes the cluster operate slower.
> > 
> > I am restarting the OSDs slowly and when I stop doing so the messages 
> > disappear and the cluster operates just fine.
> > 
> > I know that the messages pop up due to a version mismatch, but is there any 
> > way to suppress them?
> > 
> > Wido
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > 
> 
> 
> -- 
> Christian BalzerNetwork/Systems Engineer
> ch...@gol.com Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com